Getting Started


pyconll is a low-level wrapper around the CoNLL-U format. This document explains how to get started loading and manipulating CoNLL-U files with pyconll, and walks through a typical end-to-end scenario.

To install the library, run pip install pyconll from your Python environment.

Loading CoNLL-U

To start, a CoNLL-U resource must be loaded. pyconll can load from files, URLs, and strings. Specific API information can be found in the load module documentation. Below is a typical example of loading a file on the local computer.

import pyconll

my_conll_file_location = './ud/train.conllu'
train = pyconll.load_from_file(my_conll_file_location)

Loading methods usually return a Conll object, but some methods return an iterator over Sentences and do not load the entire Conll object into memory at once. The Conll object satisfies the MutableSequence contract in Python, which means it behaves almost the same as a list.
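
To make the MutableSequence contract concrete, here is a small sketch using only the standard library. SentenceStore is a hypothetical stand-in, not pyconll's Conll class; it only shows which list-like operations any MutableSequence provides once the five required methods are defined.

```python
from collections.abc import MutableSequence


class SentenceStore(MutableSequence):
    """Hypothetical stand-in for a list-like container (not pyconll's Conll)."""

    def __init__(self, items=()):
        self._items = list(items)

    # The five methods below are all that MutableSequence requires.
    def __getitem__(self, index):
        return self._items[index]

    def __setitem__(self, index, value):
        self._items[index] = value

    def __delitem__(self, index):
        del self._items[index]

    def __len__(self):
        return len(self._items)

    def insert(self, index, value):
        self._items.insert(index, value)


store = SentenceStore(['first sentence', 'second sentence'])

# append, iteration, membership tests, and pop are supplied by the base class.
store.append('third sentence')
sentences_in_order = list(store)
has_second = 'second sentence' in store
removed = store.pop(0)
del store[0]
remaining = list(store)
```

A container that honors this contract can be indexed, sliced where supported, extended, and pruned with the same calls used on a plain list.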

Traversing CoNLL-U

After loading a CoNLL-U file, we can traverse the Conll structure. Conll objects wrap Sentences and Sentences wrap Tokens. Here is what traversal normally looks like.

for sentence in train:
    for token in sentence:
        # Do work within loops
        pass

Statistics such as the lemmas for a certain closed-class POS or the number of non-projective punctuation dependencies can be computed through these loops. As an abstract example, suppose we have defined a predicate, sentence_pred, and a transformation of noun tokens, noun_token_transformation. To transform all nouns in sentences that match the predicate, we can write the following.

for sentence in train:
    if sentence_pred(sentence):
        for token in sentence:
            if token.upos == 'NOUN':
                noun_token_transformation(token)

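The two helpers used here are left abstract. As a rough sketch of what they might look like, the block below defines both and runs them on stand-in objects built with types.SimpleNamespace rather than on real pyconll objects, so it runs without a treebank file. The rule chosen (lowercasing noun lemmas in sentences that contain a verb) is purely illustrative, and the stand-ins mimic only the upos and lemma attributes.

```python
from types import SimpleNamespace


def sentence_pred(sentence):
    """Illustrative predicate: keep sentences that contain at least one verb."""
    return any(token.upos == 'VERB' for token in sentence)


def noun_token_transformation(token):
    """Illustrative transformation: lowercase the lemma in place."""
    if token.lemma is not None:
        token.lemma = token.lemma.lower()


# Stand-in data: each "sentence" is a plain list of token-like objects.
sentences = [
    [SimpleNamespace(lemma='Dog', upos='NOUN'),
     SimpleNamespace(lemma='bark', upos='VERB')],
    [SimpleNamespace(lemma='Cat', upos='NOUN')],  # no verb, so left untouched
]

for sentence in sentences:
    if sentence_pred(sentence):
        for token in sentence:
            if token.upos == 'NOUN':
                noun_token_transformation(token)

lemmas = [[token.lemma for token in sentence] for sentence in sentences]
```
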
Note that most objects in pyconll are mutable, except for a select few fields, so changes made to a Token object remain with its Sentence and can be output back into the CoNLL-U format when processing is complete.
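
Not every pass needs to modify tokens. The statistics mentioned earlier in this section, such as collecting lemmas for a closed-class POS tag, only read from them. The sketch below counts determiner lemmas with collections.Counter. It again uses hypothetical stand-in objects that expose only upos and lemma, so the sample data and the resulting counts are illustrative, not drawn from a real treebank.

```python
from collections import Counter
from types import SimpleNamespace

# Stand-in data: each "sentence" is a plain list of token-like objects.
sentences = [
    [SimpleNamespace(lemma='the', upos='DET'),
     SimpleNamespace(lemma='dog', upos='NOUN')],
    [SimpleNamespace(lemma='a', upos='DET'),
     SimpleNamespace(lemma='the', upos='DET')],
]

closed_class_tag = 'DET'  # determiners are a closed class in UD
lemma_counts = Counter()

for sentence in sentences:
    for token in sentence:
        if token.upos == closed_class_tag:
            lemma_counts[token.lemma] += 1

most_common = lemma_counts.most_common(1)
```

The same nested loop should carry over to a loaded Conll object, with the stand-in list replaced by the loaded data.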

Outputting CoNLL-U

Once you are done working with a Conll object, you may need to output your results. The object can be serialized back into the CoNLL-U format through the conll method. Conll, Sentence, and Token objects are all Conllable, which means each has a conll method that serializes the object into the appropriate string representation.

A more efficient way of outputting an entire Conll file is to use the write method, which avoids building the entire file contents as a single string in memory. When creating the file to write to, remember that CoNLL-U is UTF-8 encoded.
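
The difference between the two approaches can be sketched with the standard library alone. FakeSentence and both helper functions below are hypothetical stand-ins, not pyconll's implementation; they only assume that each item has a conll() method returning its string form and that sentences are separated by a blank line.

```python
import io


class FakeSentence:
    """Hypothetical stand-in exposing a conll() method like a Conllable."""

    def __init__(self, text):
        self._text = text

    def conll(self):
        return self._text


def serialize_all(sentences):
    # Builds the whole output in memory before anything is written.
    return ''.join(sentence.conll() + '\n\n' for sentence in sentences)


def write_incrementally(sentences, writable):
    # Hands one sentence at a time to the file-like object, so only a
    # single sentence string needs to exist at any moment.
    for sentence in sentences:
        writable.write(sentence.conll())
        writable.write('\n\n')


sentences = [FakeSentence('# text = Hello'), FakeSentence('# text = Bye')]

buffer = io.StringIO()
write_incrementally(sentences, buffer)
streamed = buffer.getvalue()
joined = serialize_all(sentences)
```

Both paths produce the same text; the incremental version simply never holds all of it at once. With a real file in place of the StringIO buffer, open it with encoding='utf-8'.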

Complete example

Putting together all the above elements, a complete example from loading, to transformation, to output looks as follows.

import pyconll

# Load file
my_conll_file_location = './ud/train.conllu'
train = pyconll.load_from_file(my_conll_file_location)

# Process and transform
for sentence in train:
    if sentence_pred(sentence):
        for token in sentence:
            if token.upos == 'NOUN':
                noun_token_transformation(token)

# Output changes. This writes directly to the file; an alternative is to use
# train.conll(), which returns the entire output string at once.
with open('output.conllu', 'w', encoding='utf-8') as f:
    train.write(f)


pyconll allows for easy CoNLL-U loading, traversal, and serialization. Developers define their own transformation or analysis of the loaded CoNLL-U data, and pyconll handles all the parsing and serialization logic. Some parts of the library are not covered here, such as the Tree data structure, loading data from resources other than files, and error handling, but the information on this page covers the most important use cases.