### Luigi Talamo  

Language Science and Technology

Saarland University (Germany)

luigi.talamo@uni-saarland.de

Code available from https://github.com/rahonalab/miniciepplus_workshop

See also these scripts, which are used in actual papers: https://github.com/rahonalab/doing_stuff_with_ciepplus

# Experimenting with miniciep+

Welcome to this interactive Jupyter notebook! I am Luigi, your teacher for today! 

Now that you know the story of miniciep+, let's start writing some code to interact with this wonderful parallel corpus.

## Python libraries for working with the CoNLL-U format
Since we are working with a collection of CoNLL-U files, we will use two Python libraries:

* conllu: https://pypi.org/project/conllu/
* pyconll: https://pyconll.readthedocs.io/en/stable/

Please install them using your package manager. For instance:

`pip3 install conllu pyconll`

or

`conda install conllu pyconll`

Both libraries do more or less the same thing: they handle these intimidating CoNLL-U files for you, providing methods and functions to explore them easily so that you can focus on your research. As we will see in this tutorial, each of the two libraries is sometimes better suited to a specific task.

## First steps
Let's start with pyconll.
First, import the pyconll library:

In [1]:
import pyconll

Then you can load a CoNLL-U file with the ``pyconll.load_from_file`` function. Let's load the English version of Il nome della rosa:

In [2]:
en_nomerosa = pyconll.load_from_file("conllu/en_nomerosa.conllu")


## Walk through sentences and tokens of a CoNLL-U file
Each pyconll object like our ``en_nomerosa`` is a list of sentences. In turn, each sentence is a list of tokens.
There are basically two ways of exploring and collecting data from a sentence:
* a linear walk: we walk through the sentence following the linear order of the tokens (i.e., a flat list of tokens);
* a tree walk: we traverse the sentence following the tree structure described by the dependency relations (i.e., a nested tree of tokens).
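
The difference between the two walks can be illustrated without any library. The sketch below uses a hand-written toy sentence (not taken from miniciep+) in which each token is an `(id, form, head)` triple and a head of 0 marks the root; the names `toy_sentence`, `linear_forms` and `children_of` are made up for this illustration.

```python
from collections import defaultdict

# Toy sentence, hand-annotated for illustration: (id, form, head); head 0 = root.
toy_sentence = [
    (1, "The", 2),
    (2, "monk", 3),
    (3, "read", 0),
    (4, "the", 5),
    (5, "book", 3),
]

# Linear walk: follow the order of the tokens in the sentence.
linear_forms = [form for _, form, _ in toy_sentence]

# Tree walk: group tokens under their head, then start from the root.
children_of = defaultdict(list)
for token_id, form, head in toy_sentence:
    children_of[head].append((token_id, form))

root_id, root_form = children_of[0][0]
root_dependents = [form for _, form in children_of[root_id]]

print(linear_forms)
print(root_form, "->", root_dependents)
```

The linear walk never looks at the `head` column, while the tree walk is driven entirely by it.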

The ``pyconll`` library supports the linear walk; for the tree walk we will use the `conllu` library.

### Linear walk
Let's start with the linear walk, printing all tokens from the file:

In [18]:
for sentence in en_nomerosa:
    for token in sentence:
        # Print tokens
        print(token)

Not very useful, is it? That's because we have only printed the object.
The `pyconll` library stores each of the UD annotation fields as an attribute of the object, e.g. `token.form` contains the graphemic form of the token. 
These fields are: ``id``, ``form``, ``lemma``, ``upos``, ``xpos``, ``feats``, ``head``, ``deprel``, ``deps``, ``misc``. 
Remember the fields we saw in the slides? We can loop over the CoNLL-U file and print them. For instance, let's print the four fields we have seen before (UPOS, FEATS, DEPREL and HEAD), plus the ID (position number) and the graphemic form of the token:

In [20]:
for sentence in en_nomerosa:
    for token in sentence:
        # Print selected annotation fields of each token
        print(f"Token no. {token.id} has form {token.form} UPOS: {token.upos}, FEATS: {token.feats}, DEPREL:{token.deprel} and its head is token {token.head}")

Token no. 1 has form Umberto UPOS: PROPN, FEATS: {}, DEPREL:obl and its head is token 19
Token no. 2 has form Eco UPOS: PROPN, FEATS: {}, DEPREL:flat:name and its head is token 1
Token no. 3 has form - UPOS: PUNCT, FEATS: {}, DEPREL:punct and its head is token 1
Token no. 4 has form Il UPOS: DET, FEATS: {'Definite': {'Def'}, 'Gender': {'Masc'}, 'Number': {'Sing'}, 'PronType': {'Art'}}, DEPREL:det and its head is token 5
Token no. 5 has form nome UPOS: NOUN, FEATS: {'Gender': {'Masc'}, 'Number': {'Sing'}}, DEPREL:nsubj:pass and its head is token 19
Token no. 6-7 has form della UPOS: None, FEATS: {}, DEPREL:None and its head is token None
Token no. 6 has form di UPOS: ADP, FEATS: {}, DEPREL:case and its head is token 8
Token no. 7 has form la UPOS: DET, FEATS: {'Definite': {'Def'}, 'Gender': {'Fem'}, 'Number': {'Sing'}, 'PronType': {'Art'}}, DEPREL:det and its head is token 8
Token no. 8 has form rosa UPOS: NOUN, FEATS: {'Gender': {'Fem'}, 'Number': {'Sing'}}, DEPREL:nmod and its head is
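
Note the line for token `6-7` in the output: it is a multiword token range (`della` = `di` + `la`), which carries a form but no annotation, so its other fields come back as `None`. Below is a minimal, library-free sketch of how the ten tab-separated columns line up and how such ranges can be recognised. The two rows are re-typed from the output above, with `_` as a placeholder for columns that the output does not show; `FIELDS`, `parse_line` and `is_range` are hypothetical names introduced only for this example.

```python
# The ten CoNLL-U columns, in file order.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]

def parse_line(line):
    """Split one tab-separated token line into a field -> value dict."""
    values = line.rstrip("\n").split("\t")
    # "_" marks an empty field in CoNLL-U.
    # (Simplification: a token whose form is a literal underscore also becomes None.)
    return {name: (None if value == "_" else value)
            for name, value in zip(FIELDS, values)}

def is_range(token):
    """True for multiword token ranges such as 6-7."""
    return "-" in token["id"]

rows = [
    "6-7\tdella\t_\t_\t_\t_\t_\t_\t_\t_",
    "6\tdi\t_\tADP\t_\t_\t8\tcase\t_\t_",  # lemma and xpos are placeholders
]
tokens = [parse_line(row) for row in rows]

# Keep only the syntactic words, skipping the range line.
syntactic_words = [t["form"] for t in tokens if not is_range(t)]
print(syntactic_words)
```

When counting tokens, it is worth deciding explicitly whether such range lines should be included, since they are not part of the dependency tree.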

We can easily traverse the CoNLL-U file and collect simple statistics like 'How many nouns are there in the corpus?' or more complex ones like 'How many adjectives behave as direct objects?'

In [4]:
noun = 0
adj_obj = 0
for sentence in en_nomerosa:
    for token in sentence:
        # Count all nouns
        if token.upos == "NOUN":
            noun += 1
        # Count adjectives whose dependency relation is direct object
        if token.upos == "ADJ" and token.deprel == "obj":
            adj_obj += 1
print("There are "+str(noun)+" nouns."+" There are also "+str(adj_obj)+" adjectives behaving as direct objects.")

There are 63 nouns. There are also 1 adjectives behaving as direct objects.
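
When more than a couple of categories are of interest, keeping one counter variable per question becomes unwieldy. A possible alternative is `collections.Counter` from the standard library. The sketch below runs on hand-made `(upos, deprel)` pairs rather than on the corpus; with `pyconll`, the pairs would presumably be built from `token.upos` and `token.deprel` inside the same double loop.

```python
from collections import Counter

# Hand-made (upos, deprel) pairs standing in for corpus tokens.
pairs = [
    ("NOUN", "nsubj"), ("VERB", "root"), ("NOUN", "obj"),
    ("ADJ", "amod"), ("ADJ", "obj"), ("NOUN", "obl"),
]

# One tally per part of speech, one per (part of speech, relation) pair.
upos_counts = Counter(upos for upos, _ in pairs)
pair_counts = Counter(pairs)

print(upos_counts["NOUN"])
print(pair_counts[("ADJ", "obj")])
```

A `Counter` returns 0 for keys it has never seen, so no category needs to be initialised in advance.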


### Tree walk
Unlike `pyconll`, the `conllu` library works on text rather than on file paths, so you have to open the file yourself with the basic `open()` function:

In [14]:
# The built-in open() is enough in Python 3; no import is needed
data = open("conllu/en_nomerosa.conllu", "r", encoding="utf-8")


The `conllu` library has a function for the linear walk too, which is called `parse()`, but here we are interested in the other function, `parse_tree()`:

In [15]:
import conllu
sentences = conllu.parse_tree(data.read())
for sentence in sentences:
    print(sentence)

TokenTree<token={id=17, form=HANDED}, children=[...]>
TokenTree<token={id=13, form=claimed}, children=[...]>
TokenTree<token={id=15, form=entertained}, children=[...]>
TokenTree<token={id=6, form=invaded}, children=[...]>
TokenTree<token={id=2, form=managed}, children=[...]>
TokenTree<token={id=9, form=read}, children=[...]>
TokenTree<token={id=8, form=reached}, children=[...]>
TokenTree<token={id=13, form=found}, children=[...]>
TokenTree<token={id=5, form=left}, children=[...]>
TokenTree<token={id=10, form=decided}, children=[...]>


We have printed the 10 sentences contained in our file. The form shown for each sentence is its root element, which has dependents; these have dependents of their own, and so on. To list the direct dependents (children) of an element, just use the `.children` attribute:

In [19]:
import conllu
# Reopen the file: the previous handle has already been read to the end
data = open("conllu/en_nomerosa.conllu", "r", encoding="utf-8")
sentences = conllu.parse_tree(data.read())
for sentence in sentences:
    print("Analyzing sentence "+sentence.metadata["sent_id"])
    for token in sentence.children:
        print(token)

Analyzing sentence 0
TokenTree<token={id=2, form=NAME}, children=[...]>
TokenTree<token={id=15, form=I}, children=None>
TokenTree<token={id=16, form=WAS}, children=None>
TokenTree<token={id=19, form=BOOK}, children=[...]>
TokenTree<token={id=57, form=.}, children=None>
Analyzing sentence 1
TokenTree<token={id=1, form=Supplemented}, children=[...]>
TokenTree<token={id=12, form=book}, children=[...]>
TokenTree<token={id=15, form=reproduce}, children=[...]>
TokenTree<token={id=60, form=.}, children=None>
Analyzing sentence 2
TokenTree<token={id=3, form=discovery}, children=[...]>
TokenTree<token={id=16, form=me}, children=None>
TokenTree<token={id=21, form=Prague}, children=[...]>
TokenTree<token={id=28, form=.}, children=None>
Analyzing sentence 3
TokenTree<token={id=3, form=later}, children=[...]>
TokenTree<token={id=5, form=troops}, children=[...]>
TokenTree<token={id=9, form=city}, children=[...]>
TokenTree<token={id=10, form=.}, children=None>
Analyzing sentence 4
TokenTree<token={id
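
The loop above stops at the direct dependents of each root. To reach every token, the walk has to recurse into `children`. Here is a minimal sketch of such a depth-first walk. To keep it runnable without the corpus, it uses a stand-in `Node` class (hypothetical, not part of `conllu`) that only mimics the `token` and `children` attributes seen in the output, and a toy tree built by hand.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """Stand-in for a tree node: a token dict plus a list of child nodes."""
    token: dict
    children: list = field(default_factory=list)

def walk(node, depth=0):
    """Yield (depth, form) pairs, visiting each node before its dependents."""
    yield depth, node.token["form"]
    # "or []" also covers a node whose children attribute is None.
    for child in node.children or []:
        yield from walk(child, depth + 1)

# Toy tree built by hand: "read" governs "monk" and "book".
tree = Node({"id": 3, "form": "read"}, [
    Node({"id": 2, "form": "monk"}, [Node({"id": 1, "form": "The"})]),
    Node({"id": 5, "form": "book"}, [Node({"id": 4, "form": "the"})]),
])

visited = list(walk(tree))
for depth, form in visited:
    # Indent each form by its depth in the tree.
    print("  " * depth + form)
```

Since `walk` only touches `.token` and `.children`, it should in principle also accept the objects returned by `parse_tree()`, although that is not tested here.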

## Build a CSV file out of the results

Rather than printing results on the screen, it is perhaps more useful to store them in a comma-separated values (CSV) file and process it further with other software such as R. We would also like to work on all five CoNLL-U files of our dataset, in order to extract data from different languages. First, let's create a CSV file to store our data and give a name to each column:

In [5]:
import csv
# Mode 'w' starts a fresh file, so re-running the cell does not append a second header row
mycsv = open("report.csv", 'w', newline='', encoding='utf-8')
countwriter = csv.writer(mycsv, delimiter=',', quotechar='"', quoting=csv.QUOTE_ALL)
countwriter.writerow(['book','token','sentence_id','upos','features'])

48

Let's loop over our five files, load each CoNLL-U file and extract all tokens behaving as direct object (`obj`); use `nsubj` instead to get subjects, or any other dependency relation. We will write each instance to the CSV file together with its part of speech and morphological features:

In [6]:
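import glob
for conllufile in sorted(glob.glob('conllu/*.conllu')):
    # Use a name other than "conllu" so that the conllu module imported earlier is not shadowed
    corpus = pyconll.load_from_file(conllufile)
    for sentence in corpus:
        for token in sentence:
            # Keep only tokens annotated as direct objects
            if token.deprel == "obj":
                countwriter.writerow([conllufile,token.form,sentence.id,token.upos,token.feats])
# Make sure the rows are actually written to disk
mycsv.flush()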
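
One thing to be aware of: `token.feats` is a dictionary of sets (as in the output printed earlier), so the `features` column will contain its Python representation, e.g. `{'Gender': {'Masc'}}`, which is awkward to parse in R. A possible workaround is to turn the dictionary back into the `Key=Value|Key=Value` notation used in CoNLL-U files before writing the row. The helper name `feats_to_string` is made up for this sketch, the row values are illustrative, and the output goes to an in-memory buffer instead of `report.csv`.

```python
import csv
import io

def feats_to_string(feats):
    """Serialize a {feature: {values}} dict as Key=Val|Key=Val ("_" if empty)."""
    if not feats:
        return "_"
    return "|".join(
        key + "=" + ",".join(sorted(values))
        for key, values in sorted(feats.items())
    )

# Feature dictionary copied from the output shown earlier for "nome".
feats = {"Gender": {"Masc"}, "Number": {"Sing"}}

buffer = io.StringIO()  # stands in for the real CSV file
writer = csv.writer(buffer, delimiter=",", quotechar='"', quoting=csv.QUOTE_ALL)
writer.writerow(["example.conllu", "nome", "0", "NOUN", feats_to_string(feats)])

print(buffer.getvalue())
```

In the loop over the files, this would amount to passing `feats_to_string(token.feats)` instead of `token.feats` to `writerow`.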