This repository contains various scripts in Perl and Python that can be used as tools for Universal Dependencies.



==============================
validate.py
==============================

Reads a CoNLL-U file and verifies that it complies with the UD specification. The script must be given the language
code, and corresponding lists of treebank-specific features and dependency relations must exist so that these can be
checked as well.

The script runs under Python 2.7 and needs both the third-party module "regex" and the "file_util" module found
in this repository. If you do not have the "regex" module, install it using "pip install regex".

  cat la_proiel-ud-train.conllu | python validate.py --lang la

You can run "python validate.py --help" for a list of available options.

==============================
check_sentence_ids.pl
==============================

Reads CoNLL-U files from STDIN and verifies that every sentence has a unique id in the sent_id comment. All files of
one treebank (repository) must be supplied at once in order to test treebank-wide id uniqueness.

  cat *.conllu | perl check_sentence_ids.pl



==============================
conllu-stats.py
conllu-stats.pl
==============================

Reads a CoNLL-U file, collects various statistics and prints them. These two scripts, one in Python and the other in
Perl, are independent of each other. The statistics they collect overlap but are not the same. The Perl script
(conllu-stats.pl) was used to generate the stats.xml files in each data repository.
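
A typical invocation might look like this (a sketch only; the Perl script is assumed to read CoNLL-U from STDIN like
the other tools in this repository; consult the scripts themselves for the exact options they support):

  cat *.conllu | perl conllu-stats.pl > stats.xml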



==============================
mwtoken-stats.pl
==============================

Reads a CoNLL-U file, collects statistics of multi-word tokens and prints them.

  cat *.conllu | perl mwtoken-stats.pl > mwtoken-stats.txt



==============================
overlap.py
==============================

Compares two CoNLL-U files and searches for sentences that occur in both (verbatim duplicates of token sequences). Some
treebanks, especially those whose original text was acquired from the web, contained duplicate documents that were
found at different addresses and downloaded twice. This tool helps to find out whether one copy of a duplicate ended up
in the training data and the other in the development or test data. The output has to be verified manually, as some
“duplicates” are repetitions that occur naturally in the language (in particular short sentences such as “Thank you.”).

The script can also help to figure out whether the training-dev-test split changed between two releases, so that a
sentence that was previously in the training data is now in the test data or vice versa; that is something we want to
avoid.
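
For example, to look for sentences that occur in both the training and the test data (a sketch; "xx" is a placeholder
for the treebank's file name prefix, and the script may accept further options):

  python overlap.py xx-ud-train.conllu xx-ud-test.conllu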



==============================
find_duplicate_sentences.pl
remove_duplicate_sentences.pl
==============================

Similar to overlap.py, but these scripts work with the sentence-level “text” attribute. They remember all sentences
read from STDIN or from input files whose names are given as arguments. The find script prints the duplicate sentences
(ordered by length and number of occurrences) to STDOUT. The remove script works as a filter: it prints the CoNLL-U
data from the input, omitting the second and any subsequent occurrence of each duplicate sentence.
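
Typical invocations might look like this (a sketch; the file names are placeholders):

  perl find_duplicate_sentences.pl *.conllu > duplicates.txt
  cat xx-ud-train.conllu | perl remove_duplicate_sentences.pl > train-nodup.conllu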



==============================
conllu_to_conllx.pl
==============================

Converts a file in the CoNLL-U format to the old CoNLL-X format. Useful with old tools (e.g. parsers) that require
CoNLL-X as their input. Usage:

  perl conllu_to_conllx.pl < file.conllu > file.conll



==============================
restore_conllu_lines.pl
==============================

Merges a CoNLL-X file and a CoNLL-U file, taking only the CoNLL-U-specific lines from the CoNLL-U file. Can be used to
combine the output of an old parser that only understands CoNLL-X with the original CoNLL-U annotation that the parser
could not read.

  restore_conllu_lines.pl file-parsed.conll file.conllu



==============================
conllu_to_text.pl
==============================

Converts a file in the CoNLL-U format to plain text, word-wrapped to lines of 80 characters (but the output line will
be longer if there is a word that is longer than the limit). The script can use either the sentence-level text
attribute, or the word forms plus the SpaceAfter=No MISC attribute to output detokenized text. It also observes the
sentence-level newdoc and newpar attributes, and the NewPar=Yes MISC attribute, if they are present, and prints an
empty line between paragraphs or documents.

Optionally, the script takes the language code as a parameter. Codes 'zh' and 'ja' will trigger a different
word-wrapping algorithm that is more suitable for Chinese and Japanese.

Usage:

  perl conllu_to_text.pl --lang zh < file.conllu > file.txt



==============================
conll_convert_tags_to_uposf.pl
==============================

This script takes the CoNLL columns CPOS, POS and FEAT and converts their combined values to the universal POS tag and
features.

You need Perl. On Linux, you probably already have it; on Windows, you may have to download and install Strawberry Perl.
You also need the Interset libraries. Once you have Perl, it is easy to get them via the following (call "cpan" instead
of "cpanm" if you do not have cpanm).

  cpanm Lingua::Interset

Then use the script like this:

  perl conll_convert_tags_to_uposf.pl -f source_tagset < input.conll > output.conll

The source tagset is the identifier of the tagset used in your data and known to Interset. Typically it is the language
code followed by two colons and "conll", e.g. "sl::conll" for the Slovenian data of CoNLL 2006. See the tagset conversion
tables at http://universaldependencies.github.io/docs/tagset-conversion/index.html for more tagset codes.

IMPORTANT:
The script assumes the CoNLL-X (2006 and 2007) file format. If your data is in another format (most notably CoNLL-U, but
also e.g. CoNLL 2008/2009, which is not identical to 2006/2007), you have to modify the data or the script. Furthermore,
you have to know something about the tagset driver (-f source_tagset above) you are going to use. Some drivers do not
expect to receive three values joined by TAB characters. Some expect two values and many expect just a single tag,
perhaps the one you have in your POS column. These factors may also require you to adapt the script to your needs. You
may want to consult the documentation at https://metacpan.org/pod/Lingua::Interset. Go to Browse / Interset / Tagset,
look up your language code and tagset name, then locate the list() function in the source code. That will give you an
idea of what the input tags should look like (usually the driver is able to decode even some tags that are not on the
list but have the same structure and feature values).
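
If you have Interset installed, you can also query a driver's list of known tags from the command line instead of
reading the source code. For example, for the English Penn Treebank driver (a sketch using the documented list()
function exported by Lingua::Interset):

  perl -MLingua::Interset=list -e 'print join("\n", @{list("en::penn")}), "\n";'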



==============================
check_files.pl
==============================

This script must be run in a folder where all the data repositories (UD_*) are stored as subfolders. It checks the
contents of the data repositories for various issues that we want to solve before a new release of UD is published.
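
For example (assuming all UD_* repositories and this tools repository are checked out side by side under one folder;
the path is a placeholder):

  cd /path/to/ud
  perl tools/check_files.pl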