### Luigi Talamo  

Language Science and Technology

Saarland University (Germany)

luigi.talamo@uni-saarland.de

Code available from https://github.com/rahonalab/dhw-1/

See also these scripts, which are used in actual papers: https://github.com/rahonalab/doing_stuff_with_ciepplus

# Experimenting with miniciep+

Welcome to this interactive Jupyter notebook!

Now that you know the story of miniciep+, let's start writing some code to interact with this wonderful parallel corpus.

## Python libraries for working with the CoNLL-U format
There are at least three libraries for working with the CoNLL-U format:

* conllu: https://pypi.org/project/conllu/
* pyconll: https://pyconll.readthedocs.io/en/stable/
* udapi: https://udapi.readthedocs.io/

I used to work with the first two libraries, until Andy told me about udapi.
Udapi is a very powerful library, but unfortunately it is poorly documented. With the present tutorial, I hope to contribute a bit to its documentation. Let's install it with:

`pip3 install udapi`

or

`conda install udapi`


## First steps
First, let's import the udapi library in our Python script:

In [1]:
import udapi

Then you can load a CoNLL-U file by passing its path to ``udapi.Document``. In the conllu/ folder, you will find five CoNLL-U files containing the first sentences of Eco's Il nome della rosa, in the original Italian version and in four Romance translations. Let's load the original Italian version:

In [79]:
nomerosa = udapi.Document("conllu/it_nomerosa.conllu")

## Walk through sentences and tokens of a CoNLL-U file
Each udapi document, like our ``nomerosa``, contains a list of sentences, which are accessible through its ``.bundles`` attribute.
In turn, each sentence contains a list of tokens, which are accessible through its ``.nodes`` attribute.
There are basically two ways of exploring and collecting data from a sentence:
* a linear walk: we walk through the sentence following the linear order of the tokens (i.e., a flat list of tokens);
* a tree walk: we traverse the sentence using the tree structure described by the dependency relations (i.e., a nested tree structure of tokens).

In contrast to other libraries, the same udapi object allows for both ways of exploring.

### Linear walk
Let's start with the linear walk, printing all tokens from the file:

In [15]:
for sentence in nomerosa.bundles:
    for token in sentence.nodes:
        # Print tokens
        print(token)

<0#1, Umberto>
<0#2, Eco>
<0#3, Il>
<0#4, nome>
<0#5, della>
<0#6, rosa>
<0#7, UN>
<0#8, MANUSCRIT>
<0#9, ,>
<0#10, NATURELLEMENT>
<0#11, .>
<1#1, Le>
<1#2, 16>
<1#3, août>
<1#4, 1968>
<1#5, ,>
<1#6, on>
<1#7, me>
<1#8, mit>
<1#9, dans>
<1#10, les>
<1#11, mains>
<1#12, un>
<1#13, livre>
<1#14, dû>
<1#15, à>
<1#16, la>
<1#17, plume>
<1#18, d'>
<1#19, un>
<1#20, certain>
<1#21, abbé>
<1#22, Vallet>
<1#23, ,>
<1#24, Le>
<1#25, Manuscrit>
<1#26, de>
<1#27, Dom>
<1#28, Adso>
<1#29, de>
<1#30, Melk>
<1#31, ,>
<1#32, traduit>
<1#33, en>
<1#34, français>
<1#35, d'>
<1#36, après>
<1#37, l'>
<1#38, édition>
<1#39, de>
<1#40, Dom>
<1#41, J.Mabillon>
<1#42, (>
<1#43, à>
<1#44, les>
<1#45, Presses>
<1#46, de>
<1#47, l'>
<1#48, Abbaye>
<1#49, de>
<1#50, la>
<1#51, Source>
<1#52, ,>
<1#53, Paris>
<1#54, ,>
<1#55, 1842>
<1#56, )>
<1#57, .>
<2#1, Le>
<2#2, livre>
<2#3, ,>
<2#4, accompagné>
<2#5, d'>
<2#6, indications>
<2#7, historiques>
<2#8, en>
<2#9, vérité>
<2#10, fort>
<2#11, mince>
<2#12, ,>
<2#13

Udapi stores each of the UD annotation fields as an attribute of the token object, e.g. `token.form` contains the graphemic form of the token.  
These fields are: ``ord``, ``form``, ``lemma``, ``upos``, ``xpos``, ``feats``, ``parent``, ``deprel``, ``deps``, ``misc``.  
Remember the fields we saw in the slides? We can loop over the CoNLL-U file and print them. For instance, let's print the four fields we discussed in the slides, plus the ID (position number: ``ord``) and the graphemic form of the token:

In [90]:
for sentence in nomerosa.bundles:
    for token in sentence.nodes:
        # Print selected annotation fields of each token
        print(f"Token no. {token.ord} has form {token.form} UPOS: {token.upos}, FEATS: {token.feats}, DEPREL:{token.deprel} and its head is token {token.parent}")

('Token no. 1 has form Numele UPOS: NOUN, FEATS: '
 'Case=Acc,Nom|Definite=Def|Gender=Masc|Number=Sing, DEPREL:nsubj and its head '
 'is token <0#19, căzut>')
('Token no. 2 has form trandafirului UPOS: NOUN, FEATS: '
 'Case=Dat,Gen|Definite=Def|Gender=Masc|Number=Sing, DEPREL:nmod and its head '
 'is token <0#1, Numele>')
('Token no. 3 has form - UPOS: PUNCT, FEATS: _, DEPREL:punct and its head is '
 'token <0#4, Umberto>')
('Token no. 4 has form Umberto UPOS: PROPN, FEATS: _, DEPREL:appos and its '
 'head is token <0#1, Numele>')
('Token no. 5 has form Eco UPOS: PROPN, FEATS: _, DEPREL:flat and its head is '
 'token <0#4, Umberto>')
('Token no. 6 has form Fireşte UPOS: PROPN, FEATS: _, DEPREL:flat and its head '
 'is token <0#4, Umberto>')
('Token no. 7 has form , UPOS: PUNCT, FEATS: _, DEPREL:punct and its head is '
 'token <0#9, manuscris>')
('Token no. 8 has form un UPOS: DET, FEATS: '
 'Case=Acc,Nom|Gender=Masc|Number=Sing|PronType=Ind, DEPREL:det and its head '
 'is token <0#9, m
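
For orientation, the attribute names listed above line up one-to-one with the ten tab-separated columns of a CoNLL-U token line. The sketch below does not use udapi: it splits a single hand-written line (an invented example, not taken from miniciep+) with the standard library only, and the helper name ``parse_token_line`` is hypothetical.

```python
# CoNLL-U column names, in file order, paired with the udapi attribute
# names used in this tutorial (the HEAD column corresponds to `parent`).
CONLLU_COLUMNS = ["ID", "FORM", "LEMMA", "UPOS", "XPOS",
                  "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]
UDAPI_ATTRIBUTES = ["ord", "form", "lemma", "upos", "xpos",
                    "feats", "parent", "deprel", "deps", "misc"]


def parse_token_line(line):
    """Split one CoNLL-U token line into a dict keyed by column name."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(CONLLU_COLUMNS):
        raise ValueError(f"expected 10 columns, got {len(values)}")
    return dict(zip(CONLLU_COLUMNS, values))


# Hand-written example line (hypothetical annotation, for illustration only).
example_line = "2\tnome\tnome\tNOUN\t_\tGender=Masc|Number=Sing\t0\troot\t_\t_"
token = parse_token_line(example_line)

for column, attribute in zip(CONLLU_COLUMNS, UDAPI_ATTRIBUTES):
    print(f"{column:<7} -> token.{attribute:<7} = {token[column]}")
```

Note one difference visible in the output above: the HEAD column holds an integer index, whereas udapi's ``token.parent`` gives back the head as a token object.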

We can easily traverse the CoNLL-U file and collect simple statistics like 'How many verbs are there in the corpus?' or more complex ones like 'How many personal pronouns function as direct objects?'

In [85]:
verb = 0
pron_obj = 0
prons = []
for sentence in nomerosa.bundles:
    for token in sentence.nodes:
        # Count all verbs
        if token.upos == 'VERB':
            verb += 1
        # Count personal pronouns (PronType=Prs) used as direct objects
        if token.upos == "PRON" and token.deprel == "obj" and "Prs" in token.feats["PronType"]:
            pron_obj += 1
            prons.append(token.form)
print(f"There are {verb} verbs. There are also {pron_obj} personal pronouns behaving as direct objects. They are {prons}")

There are 37 verbs. There are also 3 personal pronouns behaving as direct objects. They are ['me', "m'", "l'"]
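
The ``PronType`` test in the cell above is a substring check on the feature value. The FEATS column packs features as ``Name=Value`` pairs separated by ``|``, and a feature may list several comma-separated values (as in ``Case=Acc,Nom`` in the output printed earlier). As a library-free sketch of that layout, and not udapi's own implementation, the hypothetical helper ``parse_feats`` below turns a FEATS string into a dictionary of value lists.

```python
def parse_feats(feats):
    """Parse a CoNLL-U FEATS string into {feature name: [values]}."""
    if feats in ("", "_"):  # "_" marks an empty FEATS column
        return {}
    parsed = {}
    for pair in feats.split("|"):
        name, _, value = pair.partition("=")
        parsed[name] = value.split(",")
    return parsed


# FEATS string copied from the printed output earlier in this notebook.
feats = parse_feats("Case=Acc,Nom|Gender=Masc|Number=Sing|PronType=Ind")
print(feats["Case"])      # a multi-valued feature
print(feats["PronType"])

# Exact membership test, instead of a substring test on the raw value.
is_personal = "Prs" in feats.get("PronType", [])
print(is_personal)
```

Splitting the values first makes the test exact: it matches ``Prs`` only as a whole value, never as part of a longer one.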


### Tree walk
To traverse the CoNLL-U file in a hierarchical way, we just have to use the ``children`` attribute of each token:

In [29]:
for sentence in nomerosa.bundles:
    for token in sentence.nodes:
        print(str(token.form)+" has the following dependencies "+str(token.children))


Umberto has the following dependencies [Node<0#2, Eco>, Node<0#3, Il>, Node<0#4, nome>, Node<0#5, della>, Node<0#6, rosa>, Node<0#7, UN>, Node<0#8, MANUSCRIT>, Node<0#10, NATURELLEMENT>, Node<0#11, .>]
Eco has the following dependencies []
Il has the following dependencies []
nome has the following dependencies []
della has the following dependencies []
rosa has the following dependencies []
UN has the following dependencies []
MANUSCRIT has the following dependencies []
, has the following dependencies []
NATURELLEMENT has the following dependencies [Node<0#9, ,>]
. has the following dependencies []
Le has the following dependencies []
16 has the following dependencies [Node<1#1, Le>, Node<1#3, août>, Node<1#5, ,>]
août has the following dependencies [Node<1#4, 1968>]
1968 has the following dependencies []
, has the following dependencies []
on has the following dependencies []
me has the following dependencies []
mit has the following dependencies [Node<1#2, 16>, Node<1#6, on>, Node<
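
The ``children`` lists printed above go only one level down; a full tree walk follows them recursively. The sketch below illustrates the idea without udapi, on a hand-written five-token fragment whose head indices are supplied by hand for illustration; ``build_children`` and ``walk`` are hypothetical helper names.

```python
# (ord, form, head, deprel) for a hand-written fragment; head 0 is the root.
TOKENS = [
    (1, "Il", 2, "det"),
    (2, "nome", 0, "root"),
    (3, "di", 5, "case"),
    (4, "la", 5, "det"),
    (5, "rosa", 2, "nmod"),
]


def build_children(tokens):
    """Map each head index to the list of its dependents, in linear order."""
    children = {}
    for token in tokens:
        children.setdefault(token[2], []).append(token)
    return children


def walk(head, children, depth=0, visited=None):
    """Depth-first walk; collects (depth, form, deprel) for every node below head."""
    if visited is None:
        visited = []
    for ord_, form, _, deprel in children.get(head, []):
        visited.append((depth, form, deprel))
        walk(ord_, children, depth + 1, visited)
    return visited


children = build_children(TOKENS)
tree = walk(0, children)
for depth, form, deprel in tree:
    # Indent each token by its depth in the dependency tree
    print("    " * depth + f"{form} ({deprel})")
```

With udapi, the same recursion can be written directly over ``token.children``, without building the head-to-dependents mapping by hand.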

Let's now write a simple script to compute the word order patterns of adjectives, relative clauses and determiners across our five languages. First, we loop over the folder where the CoNLL-U files are stored and load each of them as a ``udapi`` object:

In [None]:
import glob

for conllufile in sorted(glob.glob("conllu/*.conllu")):
    ud_doc = udapi.Document(conllufile)

Then, we traverse the CoNLL-U files in a linear way, and we apply a very simple algorithm:
* check whether the token is a NOUN (we use UPOS);
* if so, determine whether it is modified by an adjective, a relative clause or a determiner (we use deprel);
* then, determine the word order: if the ID of the head is higher than the ID of the modifier, we have a mod-head pattern (``m-h``), otherwise we have a head-mod pattern (``h-m``).

We use a Python dictionary, ``results``, to keep track of the counts.

In [89]:
import glob
import pprint  # better printing
results = {}
for conllufile in sorted(glob.glob("conllu/*.conllu")):
    # One pair of counters per modifier type: head-modifier and modifier-head
    results[conllufile] = {
        "adjectives": {"h-m": 0, "m-h": 0},
        "relative clauses": {"h-m": 0, "m-h": 0},
        "determiners": {"h-m": 0, "m-h": 0},
    }
    ud_doc = udapi.Document(conllufile)
    for sentence in ud_doc.bundles:
        for token in sentence.nodes:
            if token.upos == 'NOUN':
                for mod in token.children:  # search through modifiers
                    if mod.deprel == "amod":  # adjectival modifier
                        if token.ord < mod.ord:
                            results[conllufile]["adjectives"]["h-m"] += 1
                        else:
                            results[conllufile]["adjectives"]["m-h"] += 1
                    # relative clauses; note that plain "acl" also covers other adnominal clauses
                    if mod.deprel in ["acl:relcl", "acl"]:
                        if token.ord < mod.ord:
                            results[conllufile]["relative clauses"]["h-m"] += 1
                        else:
                            results[conllufile]["relative clauses"]["m-h"] += 1
                    # determiners (a very broad category: articles, demonstratives, quantifiers, ...)
                    if mod.deprel == "det":
                        if token.ord < mod.ord:
                            results[conllufile]["determiners"]["h-m"] += 1
                        else:
                            results[conllufile]["determiners"]["m-h"] += 1
pprint.pprint(results)
    

{'conllu/es_nomerosa.conllu': {'adjectives': {'h-m': 16, 'm-h': 8},
                               'determiners': {'h-m': 1, 'm-h': 48},
                               'relative clauses': {'h-m': 5, 'm-h': 0}},
 'conllu/fr_nomerosa.conllu': {'adjectives': {'h-m': 12, 'm-h': 11},
                               'determiners': {'h-m': 0, 'm-h': 57},
                               'relative clauses': {'h-m': 10, 'm-h': 0}},
 'conllu/it_nomerosa.conllu': {'adjectives': {'h-m': 12, 'm-h': 9},
                               'determiners': {'h-m': 0, 'm-h': 48},
                               'relative clauses': {'h-m': 6, 'm-h': 0}},
 'conllu/pt_nomerosa.conllu': {'adjectives': {'h-m': 11, 'm-h': 8},
                               'determiners': {'h-m': 0, 'm-h': 54},
                               'relative clauses': {'h-m': 7, 'm-h': 0}},
 'conllu/ro_nomerosa.conllu': {'adjectives': {'h-m': 10, 'm-h': 10},
                               'determiners': {'h-m': 4, 'm-h': 14},
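
Raw counts are hard to compare across files of different sizes, so one possible next step is to convert each pair of counts into the share of head-modifier orders. The sketch below hard-codes the Italian counts from the output above; the helper name ``head_first_share`` is hypothetical and not part of the original scripts.

```python
# Italian counts copied from the pprint output above.
it_counts = {
    "adjectives": {"h-m": 12, "m-h": 9},
    "determiners": {"h-m": 0, "m-h": 48},
    "relative clauses": {"h-m": 6, "m-h": 0},
}


def head_first_share(counts):
    """Return the proportion of head-modifier orders, or None if nothing was counted."""
    total = counts["h-m"] + counts["m-h"]
    return counts["h-m"] / total if total else None


shares = {category: head_first_share(counts) for category, counts in it_counts.items()}
for category, share in shares.items():
    # Guard against categories with no occurrences at all
    label = "n/a" if share is None else f"{share:.2f}"
    print(f"{category}: {label} head-modifier")
```

In a real run, the same function could be applied to every entry of ``results`` instead of a hard-coded dictionary.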
                