Various utilities for processing the data.


This repository contains various scripts in Perl and Python that can be used as tools for Universal Dependencies.


Reads a CoNLL-U file and verifies that it complies with the UD specification. It must be run with the language/treebank
code, and corresponding lists of treebank-specific features and dependency relations must exist so that their validity
can be checked as well.

  cat la_proiel-ud-train.conllu | --lang la_proiel
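Conceptually, the validator runs many structural checks over each sentence. A minimal sketch of one such check in Python (the function name and error-message format here are illustrative, not the actual validator's):

```python
# Sketch of one structural check a CoNLL-U validator performs:
# within each sentence, token IDs must run 1, 2, 3, ...
# (multiword-token ranges like "1-2" and empty nodes like "1.1" are skipped).
def check_token_ids(conllu_text):
    """Return a list of error messages for out-of-sequence token IDs."""
    errors = []
    expected = 1
    for lineno, line in enumerate(conllu_text.splitlines(), 1):
        line = line.strip()
        if not line:                      # blank line = sentence boundary
            expected = 1
            continue
        if line.startswith("#"):          # comment line
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            errors.append("line %d: expected 10 columns, got %d"
                          % (lineno, len(cols)))
            continue
        tid = cols[0]
        if "-" in tid or "." in tid:      # multiword token / empty node
            continue
        if int(tid) != expected:
            errors.append("line %d: token ID %s, expected %d"
                          % (lineno, tid, expected))
        expected = int(tid) + 1
    return errors
```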


Reads a CoNLL-U file, collects various statistics and prints them. These two scripts, one in Python and the other in
Perl, are independent of each other; the statistics they collect overlap but are not identical. The Perl script was
used to generate the stats.xml files in each data repository.
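The kind of counting involved can be sketched in a few lines of Python (a toy illustration; the real scripts collect far more, such as features, relations and example words):

```python
from collections import Counter

def conllu_stats(text):
    """Count sentences, syntactic words and UPOS tags in CoNLL-U text.
    A sketch only; multiword-token and empty-node lines are skipped."""
    n_sentences = 0
    n_tokens = 0
    upos_counts = Counter()
    in_sentence = False
    for line in text.splitlines():
        if not line.strip():              # blank line ends a sentence
            if in_sentence:
                n_sentences += 1
            in_sentence = False
            continue
        in_sentence = True
        if line.startswith("#"):          # comment line
            continue
        cols = line.split("\t")
        if "-" in cols[0] or "." in cols[0]:
            continue                      # multiword token / empty node
        n_tokens += 1
        upos_counts[cols[3]] += 1         # column 4 = UPOS
    if in_sentence:                       # file need not end with a blank line
        n_sentences += 1
    return n_sentences, n_tokens, upos_counts
```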


Compares two CoNLL-U files and searches for sentences that occur in both (verbatim duplicates of token sequences). Some
treebanks, especially those whose original text was acquired from the web, contained duplicate documents that were
found at different addresses and downloaded twice. This tool helps to find out whether one of the duplicates ended up
in the training data and the other in the development or test data. The output has to be verified manually, as some
“duplicates” are repetitions that occur naturally in the language (in particular short sentences such as “Thank you.”).

The script can also help to figure out whether the training-dev-test data split has changed between two releases, so
that a previously training sentence is now in test or vice versa. That is something we want to avoid.
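The underlying idea can be sketched in Python (the function names are illustrative; the real script offers more options and reporting):

```python
def sentence_forms(conllu_text):
    """Yield each sentence as a tuple of its word forms."""
    forms = []
    for line in conllu_text.splitlines():
        line = line.strip()
        if not line:                          # sentence boundary
            if forms:
                yield tuple(forms)
            forms = []
        elif not line.startswith("#"):
            cols = line.split("\t")
            if "-" not in cols[0] and "." not in cols[0]:
                forms.append(cols[1])         # column 2 = FORM
    if forms:
        yield tuple(forms)

def shared_sentences(text_a, text_b):
    """Token sequences occurring in both files (candidate duplicates)."""
    return set(sentence_forms(text_a)) & set(sentence_forms(text_b))
```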


Converts a file in the CoNLL-U format to the old CoNLL-X format. Useful with old tools (e.g. parsers) that require
CoNLL-X as their input. Usage:

  perl < file.conllu > file.conll
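The core of the column mapping can be sketched in Python (a simplified illustration; the real Perl script handles comments, multiword tokens and other special lines more carefully):

```python
def conllu_to_conllx(text):
    """Convert CoNLL-U token lines to CoNLL-X. A sketch only:
    comments, multiword-token ranges and empty nodes are simply dropped."""
    out = []
    for line in text.splitlines():
        if not line.strip():
            out.append("")                 # keep sentence boundaries
        elif line.startswith("#"):
            continue                       # CoNLL-X has no comment lines
        else:
            cols = line.split("\t")
            if "-" in cols[0] or "." in cols[0]:
                continue                   # no multiword tokens / empty nodes
            # CoNLL-U: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
            # CoNLL-X: ID FORM LEMMA CPOSTAG POSTAG FEATS HEAD DEPREL PHEAD PDEPREL
            out.append("\t".join(cols[:8] + ["_", "_"]))
    return "\n".join(out)
```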


This script takes the CoNLL columns CPOS, POS and FEAT and converts their combined values to the universal POS tag and
features.
You need Perl. On Linux you probably already have it; on Windows you may have to download and install Strawberry Perl.
You also need the Interset libraries. Once you have Perl, it is easy to get them via the following command (call "cpan"
instead of "cpanm" if you do not have cpanm):

  cpanm Lingua::Interset

Then use the script like this:

  perl -f source_tagset < input.conll > output.conll

The source tagset is the identifier of the tagset used in your data and known to Interset. Typically it is the language
code followed by two colons and "conll", e.g. "sl::conll" for the Slovenian data of CoNLL 2006. See the Interset tagset
conversion tables for more tagset codes.

The script assumes the CoNLL-X (2006 and 2007) file format. If your data is in another format (most notably CoNLL-U, but
also e.g. CoNLL 2008/2009, which is not identical to 2006/2007), you have to modify the data or the script. Furthermore,
you have to know something about the tagset driver (-f source_tagset above) you are going to use. Some drivers do not
expect to receive three values joined by TAB characters. Some expect two values and many expect just a single tag,
perhaps the one you have in your POS column. These factors may also require you to adapt the script to your needs. You
may want to consult the Interset documentation: go to Browse / Interset / Tagset, look up your language code and tagset
name, then locate the list() function in the source code. That will give you an
idea of what the input tags should look like (usually the driver is able to decode even some tags that are not on the
list but have the same structure and feature values).


This script must be run in a folder where all the data repositories (UD_*) are stored as subfolders. It checks the
contents of the data repositories for various issues that we want to solve before a new release of UD is published.
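The idea can be sketched in Python (the specific checks shown here, a README and at least one .conllu file, are illustrative; the real script tests many more conditions):

```python
import os

def check_repositories(parent):
    """Scan UD_* subfolders of `parent` and report missing files.
    A sketch: only a README and at least one .conllu file are required here."""
    problems = []
    for name in sorted(os.listdir(parent)):
        path = os.path.join(parent, name)
        if not (name.startswith("UD_") and os.path.isdir(path)):
            continue                      # ignore anything that is not a UD repo
        files = os.listdir(path)
        if not any(f.startswith("README") for f in files):
            problems.append("%s: missing README" % name)
        if not any(f.endswith(".conllu") for f in files):
            problems.append("%s: no .conllu data files" % name)
    return problems
```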