# Assignment #5: Extraction of subject–verb–object triples
Author: Pierre Nugues

## Objectives

In this assignment, you will extract relations from a parsed sentence involving two words or entities. You will start with pairs of words, namely a subject and its verb, and then extend your programs to triples: subject, verb, and object. In the triples, the subject and the object are the entities, and the verb represents the relation. 

$$
\text{Subject} \xrightarrow[\text{}]{\text{Verb}} \text{Object}
$$

The overall work is inspired by the _Prismatic_ knowledge base used in the IBM Watson system, where the subject, verb, and object triples are a way to extract knowledge from text.  See <a href="http://www.aclweb.org/anthology/W/W10/W10-0915.pdf">this paper</a> for details. 

You will apply the extraction to multilingual texts: 
1. First you will use a parsed corpus of Swedish; and then
2. You will apply it to other languages.
            
The objectives of this assignment are to:
* Extract the subject–verb pairs from a parsed corpus
* Extend the extraction to subject–verb–object triples
* Understand how dependency parsing can help create a knowledge base
* Write a short report of 1 to 2 pages on the assignment

## Corpus

As corpora, you will use the Universal Dependencies: https://universaldependencies.org/.
1. In the first part of the assignment, you will focus on Swedish as it is easier to understand for most students, and then 
2. Move on to all the other languages. 

You will only consider the training sets of each corpus.

### Choosing a parsed corpus

1. Download the latest version of Universal Dependencies (2.6) and uncompress it. A local copy is available in the `/usr/local/cs/EDAN20/` folder on LTH's machines;
2. Go to the Swedish _Talbanken_ corpus;
3. Read the CoNLL-U annotation here: https://universaldependencies.org/format.html

### Examining the annotation

You will carry out the following steps and describe them in your report:

1. Draw graphical representations of the first two Swedish sentences of the training set. You will include these drawings in your report;
2. Visualize these sentences with this tool: http://spyysalo.github.io/conllu.js/ and check that you obtain the same results;
3. Apply the dependency parser for Swedish of the <a href="http://vilde.cs.lth.se:9000/">Langforia pipelines</a> to these sentences (only the text of each sentence). You will have to select Swedish and activate both `Token` and `DependencyRelation`. Link to the Langforia pipelines: <a href="http://vilde.cs.lth.se:9000/">http://vilde.cs.lth.se:9000/</a>. You will describe the differences, if any, between the parser output and the corpus annotation.

## Programming

### Swedish

You will extract all the subject–verb pairs and the subject–verb–object triples from the Swedish _Talbanken_ training corpus. To start your program, you can use the CoNLL-U reader available in the cells below.
This reader also works for the other corpora. You can also write a reader yourself, either from scratch or starting from the one you used to read the CoNLL 2000 format in the fourth lab.

#### Imports

In [1]:
import os
import regex as re

#### Corpus location

Here are the corpus locations you will use. You may have to adjust `ud_path`.

In [2]:
ud_path = '../../corpus/ud-treebanks-v2.6/'

In [3]:
path_sv = ud_path + 'UD_Swedish-Talbanken/sv_talbanken-ud-train.conllu'
path_fr = ud_path + 'UD_French-GSD/fr_gsd-ud-train.conllu'
path_ru = ud_path + 'UD_Russian-SynTagRus/ru_syntagrus-ud-train.conllu'
path_en = ud_path + 'UD_English-EWT/en_ewt-ud-train.conllu'

The column names of the CoNLL-U corpora

In [4]:
column_names_u = ['ID', 'FORM', 'LEMMA', 'UPOS', 'XPOS', 'FEATS', 'HEAD', 'DEPREL', 'DEPS', 'MISC']
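
To make the role of these column names concrete, here is a minimal sketch that maps one token line to a dictionary. The line is the first token of the Talbanken sentence shown later in this notebook; the variable names `line` and `token` are only illustrative.

```python
column_names_u = ['ID', 'FORM', 'LEMMA', 'UPOS', 'XPOS', 'FEATS',
                  'HEAD', 'DEPREL', 'DEPS', 'MISC']

# One CoNLL-U token line: ten columns separated by tabs
line = '1\tGenom\tgenom\tADP\tPP\t_\t2\tcase\t2:case\t_'

# Pair each column name with the corresponding value
token = dict(zip(column_names_u, line.split('\t')))

# All the values are strings, including ID and HEAD
print(token['FORM'], token['HEAD'], token['DEPREL'])
```

Note that `HEAD` holds the `ID` of the governing word as a string, which is what the dictionary representation used below relies on.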

#### Functions to read the CoNLL-U files

In [5]:
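def read_sentences(file):
    """
    Creates a list of sentences from the corpus
    Each sentence is a string
    :param file: path to a CoNLL-U file
    :return: the list of sentences
    """
    # CoNLL-U files are UTF-8 encoded; the with block closes the file
    with open(file, encoding='utf-8') as f:
        text = f.read().strip()
    sentences = text.split('\n\n')
    return sentences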

In [6]:
def split_rows(sentences, column_names):
    """
    Creates a list of sentences where each sentence is a list of lines
    Each line is a dictionary of columns
    :param sentences:
    :param column_names:
    :return:
    """
    new_sentences = []
    root_values = ['0', 'ROOT', 'ROOT', 'ROOT', 'ROOT', 'ROOT', '0', 'ROOT', '0', 'ROOT']
    start = [dict(zip(column_names, root_values))]
    for sentence in sentences:
        rows = sentence.split('\n')
        sentence = [dict(zip(column_names, row.split('\t'))) for row in rows if row[0] != '#']
        sentence = start + sentence
        new_sentences.append(sentence)
    return new_sentences

#### Reading the corpus

We load the Swedish Talbanken corpus.

In [7]:
sentences = read_sentences(path_sv)
formatted_corpus = split_rows(sentences, column_names_u)

In [8]:
len(formatted_corpus)

4303

The parsed sentence: _Genom skattereformen införs individuell beskattning (särbeskattning) av arbetsinkomster._

In [9]:
formatted_corpus[1]

[{'ID': '0',
  'FORM': 'ROOT',
  'LEMMA': 'ROOT',
  'UPOS': 'ROOT',
  'XPOS': 'ROOT',
  'FEATS': 'ROOT',
  'HEAD': '0',
  'DEPREL': 'ROOT',
  'DEPS': '0',
  'MISC': 'ROOT'},
 {'ID': '1',
  'FORM': 'Genom',
  'LEMMA': 'genom',
  'UPOS': 'ADP',
  'XPOS': 'PP',
  'FEATS': '_',
  'HEAD': '2',
  'DEPREL': 'case',
  'DEPS': '2:case',
  'MISC': '_'},
 {'ID': '2',
  'FORM': 'skattereformen',
  'LEMMA': 'skattereform',
  'UPOS': 'NOUN',
  'XPOS': 'NN|UTR|SIN|DEF|NOM',
  'FEATS': 'Case=Nom|Definite=Def|Gender=Com|Number=Sing',
  'HEAD': '3',
  'DEPREL': 'obl',
  'DEPS': '3:obl:genom',
  'MISC': '_'},
 {'ID': '3',
  'FORM': 'införs',
  'LEMMA': 'införa',
  'UPOS': 'VERB',
  'XPOS': 'VB|PRS|SFO',
  'FEATS': 'Mood=Ind|Tense=Pres|VerbForm=Fin|Voice=Pass',
  'HEAD': '0',
  'DEPREL': 'root',
  'DEPS': '0:root',
  'MISC': '_'},
 {'ID': '4',
  'FORM': 'individuell',
  'LEMMA': 'individuell',
  'UPOS': 'ADJ',
  'XPOS': 'JJ|POS|UTR|SIN|IND|NOM',
  'FEATS': 'Case=Nom|Definite=Ind|Degree=Pos|Gender=Com|Numb

#### Converting the lists to dictionaries

To ease the processing of some corpora, you will use a dictionary representation of the sentences. The keys will be the `ID` values. We do this because `ID` is not necessarily a number.

In [10]:
def convert_to_dict(formatted_corpus):
    """
    Converts each sentence from a list of words to a dictionary where the keys are id
    :param formatted_corpus:
    :return:
    """
    formatted_corpus_dict = []
    for sentence in formatted_corpus:
        sentence_dict = {}
        for word in sentence:
            sentence_dict[word['ID']] = word
        formatted_corpus_dict.append(sentence_dict)
    return formatted_corpus_dict
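
The following sketch illustrates why the `ID` strings make safer keys than list positions. It uses a reduced version of the French fragment _Les iris du mâles_ discussed later in this notebook, keeping only three columns; the variable names are illustrative.

```python
# Toy sentence reduced to the ID, FORM, and HEAD columns
sentence = [
    {'ID': '0', 'FORM': 'ROOT', 'HEAD': '0'},
    {'ID': '1', 'FORM': 'Les', 'HEAD': '2'},
    {'ID': '2', 'FORM': 'iris', 'HEAD': '7'},
    {'ID': '3-4', 'FORM': 'du', 'HEAD': '_'},   # range line, not a word
    {'ID': '3', 'FORM': 'de', 'HEAD': '5'},
    {'ID': '4', 'FORM': 'le', 'HEAD': '5'},
    {'ID': '5', 'FORM': 'mâles', 'HEAD': '2'},
]

# Same conversion as convert_to_dict(), applied to one sentence
sentence_dict = {word['ID']: word for word in sentence}

word = sentence_dict['4']                                # the determiner 'le'
head_by_key = sentence_dict[word['HEAD']]['FORM']        # lookup by ID
head_by_position = sentence[int(word['HEAD'])]['FORM']   # lookup by position

print(head_by_key)       # the actual head of 'le'
print(head_by_position)  # shifted by the '3-4' line
```

With the dictionary, the head of _le_ is _mâles_, while the positional lookup returns _le_ itself because the `3-4` line shifts all the following items by one.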

In [11]:
formatted_corpus_dict = convert_to_dict(formatted_corpus)
formatted_corpus_dict[1]

{'0': {'ID': '0',
  'FORM': 'ROOT',
  'LEMMA': 'ROOT',
  'UPOS': 'ROOT',
  'XPOS': 'ROOT',
  'FEATS': 'ROOT',
  'HEAD': '0',
  'DEPREL': 'ROOT',
  'DEPS': '0',
  'MISC': 'ROOT'},
 '1': {'ID': '1',
  'FORM': 'Genom',
  'LEMMA': 'genom',
  'UPOS': 'ADP',
  'XPOS': 'PP',
  'FEATS': '_',
  'HEAD': '2',
  'DEPREL': 'case',
  'DEPS': '2:case',
  'MISC': '_'},
 '2': {'ID': '2',
  'FORM': 'skattereformen',
  'LEMMA': 'skattereform',
  'UPOS': 'NOUN',
  'XPOS': 'NN|UTR|SIN|DEF|NOM',
  'FEATS': 'Case=Nom|Definite=Def|Gender=Com|Number=Sing',
  'HEAD': '3',
  'DEPREL': 'obl',
  'DEPS': '3:obl:genom',
  'MISC': '_'},
 '3': {'ID': '3',
  'FORM': 'införs',
  'LEMMA': 'införa',
  'UPOS': 'VERB',
  'XPOS': 'VB|PRS|SFO',
  'FEATS': 'Mood=Ind|Tense=Pres|VerbForm=Fin|Voice=Pass',
  'HEAD': '0',
  'DEPREL': 'root',
  'DEPS': '0:root',
  'MISC': '_'},
 '4': {'ID': '4',
  'FORM': 'individuell',
  'LEMMA': 'individuell',
  'UPOS': 'ADJ',
  'XPOS': 'JJ|POS|UTR|SIN|IND|NOM',
  'FEATS': 'Case=Nom|Definite=Ind|D

### Extracting the subject-verb pairs

Now you will extract the subject-verb pairs, where you will set the words in lowercase. In the second sentence of the corpus, this corresponds to `(beskattning, införs)`. You will call the function `extract_pairs(formatted_corpus_dict)` and you will store the results in a `pairs_sv` variable. All the corpora in the Universal Dependencies format use the same function names: `nsubj` and `obj` for the subject and direct object.

You can use the algorithm you want. However, here are some hints on the results:
* You will extract all the subject-verb pairs in the corpus. In the extraction, just check the function between two words. Do not check if the part of speech is a verb or a noun in the pair. You will also ignore the possible function suffixes as in `nsubj:pass`, where `pass` means passive.
* You will return the results as Python's dictionaries, where the key will be the pair and the value, the count, as for instance `{(beskattning, införs): 1}`. Be sure you understand the Python dictionaries and note that you can use tuples as keys.
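
As an illustration of these hints, here is a minimal sketch run on a toy sentence rather than on the corpus. The name `extract_pairs_sketch` and the toy data are hypothetical, and removing the suffix with `split(':')` is only one reading of the hint on `nsubj:pass`; the sketch has not been checked against the expected counts.

```python
def extract_pairs_sketch(corpus_dict):
    """Counts the (subject, head) pairs; a sketch, not a reference solution."""
    pairs = {}
    for sentence in corpus_dict:
        for word in sentence.values():
            # Ignore a possible suffix: 'nsubj:pass' -> 'nsubj'
            if word['DEPREL'].split(':')[0] == 'nsubj':
                head = sentence[word['HEAD']]   # HEAD is the ID of the head
                pair = (word['FORM'].lower(), head['FORM'].lower())
                pairs[pair] = pairs.get(pair, 0) + 1
    return pairs


# Toy corpus: one sentence with only the columns the sketch needs
toy_corpus = [{
    '0': {'ID': '0', 'FORM': 'ROOT', 'HEAD': '0', 'DEPREL': 'ROOT'},
    '1': {'ID': '1', 'FORM': 'Införs', 'HEAD': '0', 'DEPREL': 'root'},
    '2': {'ID': '2', 'FORM': 'beskattning', 'HEAD': '1',
          'DEPREL': 'nsubj:pass'},
}]

toy_pairs = extract_pairs_sketch(toy_corpus)
print(toy_pairs)
```

The tuple `(subject, verb)` serves directly as the dictionary key, and `dict.get()` with a default of 0 avoids a separate test for unseen pairs.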

In [12]:
# Write your code here

In [13]:
pairs_sv = extract_pairs(formatted_corpus_dict)

You will compute the total number of subject-verb pairs. You should find 6,083 pairs.

In [14]:
sum([pairs_sv[pair] for pair in pairs_sv])

6083

#### Finding the most frequent pairs

You will sort your pairs by decreasing frequency and then by lexical order of the pairs, and store the three most frequent pairs in the `freq_pairs_sv` variable as in:
```
freq_pairs_sv = [(('som', 'har'), 45),

 ...]
```

Here are the frequencies you should find:
```
45
19
19
```

In all the experiments, we will keep the `nbest` most frequent items. We set `nbest` to 3 in the first experiments and to 5 in the last one.

In [15]:
nbest = 3

In [16]:
sorted_pairs = sorted(pairs_sv, key=lambda x: (-pairs_sv[x], x))
freq_pairs_sv = [(pair, pairs_sv[pair]) for pair in sorted_pairs][:nbest]
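
The sort key used in this cell can be illustrated on a small dictionary. The counts below are invented for the example and do not come from the corpus.

```python
# Invented counts, only to show the effect of the sort key
counts = {('som', 'har'): 3,
          ('det', 'är'): 5,
          ('det', 'finns'): 3,
          ('man', 'kan'): 5}

# Negating the count sorts by decreasing frequency;
# the pair itself breaks the ties in lexical order
ranked = sorted(counts, key=lambda pair: (-counts[pair], pair))
top = [(pair, counts[pair]) for pair in ranked][:3]
print(top)
```

The two pairs with a count of 5 come first, ordered by their first word, followed by `('det', 'finns')`, which precedes `('som', 'har')` among the pairs with a count of 3. Python compares strings by Unicode code point, so letters such as _ä_ sort after _z_.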

In [None]:
freq_pairs_sv

### Extracting the subject-verb-object triples

You will now extract all the subject–verb–object triples of the corpus. The object function uses the `obj` code.
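
One possible approach, sketched on a toy sentence, is to group the subjects and the objects by the `ID` of their head and to combine those that share a head. The name `extract_triples_sketch` and the toy data are hypothetical; ignoring the suffixes for `obj` as well as for `nsubj` is an assumption, and the sketch has not been checked against the expected counts.

```python
def extract_triples_sketch(corpus_dict):
    """Counts (subject, head, object) triples; a sketch under assumptions."""
    triples = {}
    for sentence in corpus_dict:
        subjects, objects = {}, {}
        for word in sentence.values():
            function = word['DEPREL'].split(':')[0]   # drop a possible suffix
            if function == 'nsubj':
                subjects.setdefault(word['HEAD'], []).append(word)
            elif function == 'obj':
                objects.setdefault(word['HEAD'], []).append(word)
        # A triple needs a subject and an object attached to the same head
        for head_id, subject_list in subjects.items():
            for subj in subject_list:
                for obj in objects.get(head_id, []):
                    triple = (subj['FORM'].lower(),
                              sentence[head_id]['FORM'].lower(),
                              obj['FORM'].lower())
                    triples[triple] = triples.get(triple, 0) + 1
    return triples


toy_corpus = [{
    '0': {'ID': '0', 'FORM': 'ROOT', 'HEAD': '0', 'DEPREL': 'ROOT'},
    '1': {'ID': '1', 'FORM': 'Man', 'HEAD': '2', 'DEPREL': 'nsubj'},
    '2': {'ID': '2', 'FORM': 'vänder', 'HEAD': '0', 'DEPREL': 'root'},
    '3': {'ID': '3', 'FORM': 'sig', 'HEAD': '2', 'DEPREL': 'obj'},
}]

toy_triples = extract_triples_sketch(toy_corpus)
print(toy_triples)
```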

In [18]:
# Write your code here

In [19]:
triples_sv = extract_triples(formatted_corpus_dict)

Compute the total number of triples. You should find 2054 triples.

In [20]:
sum([triples_sv[triple] for triple in triples_sv])

2054

#### Finding the most frequent triples

You will sort your triples by decreasing frequency and then by lexical order of the triples, and store the three most frequent triples in the `freq_triples_sv` variable as in:
```
freq_triples_sv = [(('man', 'vänder', 'sig'), 14),

 ...]
```

Here are the frequencies you should find:
```
14
5
3
```

In [21]:
sorted_triples = sorted(triples_sv, key=lambda x: (-triples_sv[x], x))
freq_triples_sv = [(triple, triples_sv[triple]) for triple in sorted_triples][:nbest]

In [None]:
freq_triples_sv

### Multilingual Corpora

Once your program works on Swedish, you will apply it to all the other languages in Universal Dependencies. The code below returns all the files in a folder, and recursively in its subfolders, whose names end with a given suffix. Here we consider the training files only.

In [23]:
def get_files(dir, suffix):
    """
    Returns all the files in a folder ending with suffix
    Recursive version
    :param dir:
    :param suffix:
    :return: the list of file names
    """
    files = []
    for file in os.listdir(dir):
        path = dir + '/' + file
        if os.path.isdir(path):
            files += get_files(path, suffix)
        elif os.path.isfile(path) and file.endswith(suffix):
            files.append(path)
    return files
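
As an alternative to the explicit recursion, the same traversal can be sketched with `os.walk()` from the standard library. The function name `get_files_walk` and the temporary toy folder are illustrative; the result is sorted here to make the order deterministic, which the function above does not do.

```python
import os
import tempfile


def get_files_walk(root, suffix):
    """Returns the paths under root whose file name ends with suffix."""
    files = []
    for folder, _, names in os.walk(root):
        for name in names:
            if name.endswith(suffix):
                files.append(os.path.join(folder, name))
    return sorted(files)


# Small check on a temporary folder mimicking the treebank layout
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, 'UD_Toy-A'))
    for name in ['toy_a-ud-train.conllu', 'toy_a-ud-dev.conllu']:
        open(os.path.join(root, 'UD_Toy-A', name), 'w').close()
    found = [os.path.basename(path)
             for path in get_files_walk(root, 'train.conllu')]

print(found)
```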

In [24]:
files = get_files(ud_path, 'train.conllu')

#### Dealing with the indices

Some corpora expand some tokens into multiwords. This is the case in French, Spanish, and German.
        The table below shows examples of such expansions.
        <table style="width:100%">
            <tr>
                <th>French</th>
                <th>Spanish</th>
                <th>German</th>
            </tr>
            <tr>
                <td><i>du</i>: de le
                </td>
                <td><i>del</i>: de el
                </td>
                <td><i>zur</i>: zu der
                </td>
            </tr>
            <tr>
                <td><i>des</i>: de les
                </td>
                <td><i>vámonos</i>: vamos nos
                </td>
                <td><i>im</i>: in dem
                </td>
            </tr>
        </table>
        In the corpora, you have the original tokens as well as the multiwords as with <i>vámonos al mar</i>.
        <pre>
1-2 vámonos _
1 vamos ir
2 nos nosotros
3-4 al _
3 a a
4 el el
5 mar mar
</pre>Read the format description for the details: [<a
                href="http://universaldependencies.org/format.html">CoNLL-U format</a>].

If you represent the sentences as lists, the item indices are not reliable: According to the format description, the token with `ID` 1 is <i>vamos</i>, whereas the list item at position 1 is the range line <i>vámonos</i>.
You have two ways to cope with this:
1. Either remove all the lines that include a range in the `ID` field, or
2. Encode the sentences as dictionaries (I felt this was preferable), where the keys are the `ID` values. This is what `convert_to_dict()` does. Here are the results for a sentence from the French CoNLL-U corpus:
_Les iris du mâles sont jaunes toute l'année._ Note the `3-4` index and its expansion into `3` and `4`:
```
{'0': {'ID': '0',  'FORM': 'ROOT',  'LEMMA': 'ROOT',  'UPOS': 'ROOT',  'XPOS': 'ROOT',  'FEATS': 'ROOT',  'HEAD': '0',  'DEPREL': 'ROOT',  'DEPS': '0',  'MISC': 'ROOT'}, 
'1': {'ID': '1',  'FORM': 'Les',  'LEMMA': 'le',  'UPOS': 'DET',  'XPOS': '_',  'FEATS': 'Definite=Def|Gender=Masc|Number=Plur|PronType=Art',  'HEAD': '2',  'DEPREL': 'det',  'DEPS': '_',  'MISC': 'wordform=les'}, 
'2': {'ID': '2',  'FORM': 'iris',  'LEMMA': 'iris',  'UPOS': 'NOUN',  'XPOS': '_',  'FEATS': 'Gender=Masc|Number=Plur',  'HEAD': '7',  'DEPREL': 'nsubj',  'DEPS': '_',  'MISC': '_'}, 
'3-4': {'ID': '3-4',  'FORM': 'du',  'LEMMA': '_',  'UPOS': '_',  'XPOS': '_',  'FEATS': '_',  'HEAD': '_',  'DEPREL': '_',  'DEPS': '_',  'MISC': '_'}, 
'3': {'ID': '3',  'FORM': 'de',  'LEMMA': 'de',  'UPOS': 'ADP',  'XPOS': '_',  'FEATS': '_',  'HEAD': '5',  'DEPREL': 'case',  'DEPS': '_',  'MISC': '_'}, 
'4': {'ID': '4',  'FORM': 'le',  'LEMMA': 'le',  'UPOS': 'DET',  'XPOS': '_',  'FEATS': 'Definite=Def|Gender=Masc|Number=Sing|PronType=Art',  'HEAD': '5',  'DEPREL': 'det',  'DEPS': '_',  'MISC': '_'}, 
'5': {'ID': '5',  'FORM': 'mâles',  'LEMMA': 'mâle',  'UPOS': 'NOUN',  'XPOS': '_',  'FEATS': 'Gender=Masc|Number=Plur',  'HEAD': '2',  'DEPREL': 'nmod',  'DEPS': '_',  'MISC': '_'}, 
'6': {'ID': '6',  'FORM': 'sont',  'LEMMA': 'être',  'UPOS': 'AUX',  'XPOS': '_',  'FEATS': 'Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin',  'HEAD': '7',  'DEPREL': 'cop',  'DEPS': '_',  'MISC': '_'}, 
'7': {'ID': '7',  'FORM': 'jaunes',  'LEMMA': 'jaune',  'UPOS': 'ADJ',  'XPOS': '_',  'FEATS': 'Gender=Masc|Number=Plur',  'HEAD': '0',  'DEPREL': 'root',  'DEPS': '_',  'MISC': '_'}, 
'8': {'ID': '8',  'FORM': 'toute',  'LEMMA': 'tout',  'UPOS': 'ADJ',  'XPOS': '_',  'FEATS': 'Gender=Fem|Number=Sing',  'HEAD': '10',  'DEPREL': 'amod',  'DEPS': '_',  'MISC': '_'}, 
'9': {'ID': '9',  'FORM': "l'",  'LEMMA': 'le',  'UPOS': 'DET',  'XPOS': '_',  'FEATS': 'Definite=Def|Gender=Fem|Number=Sing|PronType=Art',  'HEAD': '10',  'DEPREL': 'det',  'DEPS': '_',  'MISC': 'SpaceAfter=No'}, 
'10': {'ID': '10',  'FORM': 'année',  'LEMMA': 'année',  'UPOS': 'NOUN',  'XPOS': '_',  'FEATS': 'Gender=Fem|Number=Sing',  'HEAD': '7',  'DEPREL': 'obl',  'DEPS': '_',  'MISC': 'SpaceAfter=No'}, 
'11': {'ID': '11',  'FORM': '.',  'LEMMA': '.',  'UPOS': 'PUNCT',  'XPOS': '_',  'FEATS': '_',  'HEAD': '7',  'DEPREL': 'punct',  'DEPS': '_',  'MISC': '_'}}
```
3. Some corpora also contain comment lines, such as sentence numbers. You handle them by discarding the lines starting with a `#`. This is already done in the CoNLL-U reader.
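
The first option, removing the range lines, can be sketched as follows on the _vámonos al mar_ example. The function name `remove_ranges` is hypothetical and the rows are reduced to two columns; a `ROOT` row is added at position 0, as the reader in this notebook does.

```python
def remove_ranges(sentence):
    """Keeps only the rows whose ID is a plain integer string."""
    return [row for row in sentence if row['ID'].isdigit()]


sentence = [
    {'ID': '0', 'FORM': 'ROOT'},
    {'ID': '1-2', 'FORM': 'vámonos'},   # range line
    {'ID': '1', 'FORM': 'vamos'},
    {'ID': '2', 'FORM': 'nos'},
    {'ID': '3-4', 'FORM': 'al'},        # range line
    {'ID': '3', 'FORM': 'a'},
    {'ID': '4', 'FORM': 'el'},
    {'ID': '5', 'FORM': 'mar'},
]

kept = remove_ranges(sentence)
forms = [row['FORM'] for row in kept]
print(forms)
print(kept[1]['FORM'])   # position 1 now matches ID 1
```

After the filtering, the list position of each row equals its `ID` again. The `isdigit()` test would also discard rows with a decimal `ID` such as `8.1`, which the CoNLL-U format uses for empty nodes.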

#### Extracting the pairs and triples

Write a function `extract_pairs_and_triples(formatted_corpus_dict, nbest)` that extracts the `nbest` most frequent pairs and triples of a given corpus and returns two sorted lists of tuples: `frequent_pairs` and `frequent_triples`. You will sort them by frequency and then by alphabetical order of the pair or triple.

In [25]:
# Write your code here

Run your extractor on all the corpora. Note that some corpora have replaced the words with underscores, as is the case for one corpus in French. You would then need to contact the provider to obtain the words.

In [None]:
# Write your code here

In your report, you will include the `nbest` most frequent pairs and triples you obtained in **three languages**. You may choose the ones you want.

For the checking script, you will extract `nbest` triples in French, Russian, and English. You will rank these triples by frequency, and then by alphabetical order of the triple using `sorted()`. You will use the French GSD corpus, the Russian SynTagRus corpus, and the English EWT corpus. You will store these triples in the following variables:
`freq_triples_fr`, `freq_triples_ru`, `freq_triples_en`. Each variable will contain a list of tuples: `((subject, verb, object), freq)`

Here is what you should find:

French
```
freq_triples_fr = [(('il', 'fait', 'partie'), 16),

 ...]
```

And the frequencies:
```
16
7
7
```

Russian:
```
freq_triples_ru = [(('мы', 'имеем', 'дело'), 6),

 ...]
```

And the frequencies:
```
6
4
4
```

English:
```
freq_triples_en = [(('you', 'have', 'questions'), 22),

 ...]
```

And the frequencies:
```
22
12
7
```

In [27]:
files = [path_fr, path_ru, path_en]

In [28]:
# Write your code here

In [None]:
freq_pairs_fr, freq_triples_fr = extract_pairs_and_triples(formatted_corpus_dict, nbest)
freq_triples_fr

In [30]:
# Write your code here

In [None]:
freq_pairs_ru, freq_triples_ru = extract_pairs_and_triples(formatted_corpus_dict, nbest)
freq_triples_ru

In [32]:
# Write your code here

In [None]:
freq_pairs_en, freq_triples_en = extract_pairs_and_triples(formatted_corpus_dict, nbest)
freq_triples_en

## Resolving the entities

You will now extract the relations involving named entities, that is where both the subject and the object are proper nouns. 

Write an `extract_entity_triples(formatted_corpus_dict)` that will process the corpus and return a list of `(subject, verb, object)` triples. You will leave the case as it is in the form, for instance _United States_ and not _united states_.  
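
A minimal sketch of this idea on a toy sentence: it keeps the subjects and objects whose `UPOS` is `PROPN`, the Universal Dependencies tag for proper nouns, and pairs those that share a head. The name `extract_entity_triples_sketch` and the toy data are hypothetical, and the sketch has not been checked against the expected results.

```python
def extract_entity_triples_sketch(corpus_dict):
    """Returns (subject, head, object) triples of proper nouns, case kept."""
    triples = []
    for sentence in corpus_dict:
        propns = [w for w in sentence.values() if w['UPOS'] == 'PROPN']
        subjects = [w for w in propns
                    if w['DEPREL'].split(':')[0] == 'nsubj']
        objects = [w for w in propns
                   if w['DEPREL'].split(':')[0] == 'obj']
        for subj in subjects:
            for obj in objects:
                if subj['HEAD'] == obj['HEAD']:      # same governing word
                    verb = sentence[subj['HEAD']]
                    triples.append((subj['FORM'], verb['FORM'], obj['FORM']))
    return triples


toy_corpus = [{
    '0': {'ID': '0', 'FORM': 'ROOT', 'UPOS': 'ROOT',
          'HEAD': '0', 'DEPREL': 'ROOT'},
    '1': {'ID': '1', 'FORM': 'Beschta', 'UPOS': 'PROPN',
          'HEAD': '2', 'DEPREL': 'nsubj'},
    '2': {'ID': '2', 'FORM': 'told', 'UPOS': 'VERB',
          'HEAD': '0', 'DEPREL': 'root'},
    '3': {'ID': '3', 'FORM': 'Planet', 'UPOS': 'PROPN',
          'HEAD': '2', 'DEPREL': 'obj'},
}]

toy_entity_triples = extract_entity_triples_sketch(toy_corpus)
print(toy_entity_triples)
```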

In [34]:
# Write your code here

You will run the `extract_entity_triples()` function on the English EWT corpus. You will store the list in the `entity_relation_en` variable and you will sort it with `sorted()`. You will keep the first **five** triples.

In [35]:
nbest = 5

The first two triples are:
```
[('Baba', 'remember', 'George'),
 ('Beschta', 'told', 'Planet'),
...]
 ```
Note that this time, we keep the original case and the triples are in alphabetical order.

In [36]:
# Write your code here

In [None]:
entity_relation_en = sorted(entity_relation_en)[:nbest]
entity_relation_en

### Optional exercise: Extracting the chunks

Extracting only the headword of the subject and object is often incomplete and uninformative. You can extract the whole chunk instead. As an optional exercise, you can try a baseline technique and extract adjacent proper nouns. You may also want to apply the chunker of the 4th assignment to the corpus to do this.

### Optional exercise: Mapping the entities

As in the chunker assignment, you may also want to complement your assignment with an entity solver that will link the entities to Wikidata.

## Reading

Read the article: _PRISMATIC: Inducing Knowledge from a Large Scale Lexicalized Relation Resource_ by Fan et al. (2010) [<a href="http://www.aclweb.org/anthology/W/W10/W10-0915.pdf">pdf</a>] and write in a few sentences how it relates to your work in this assignment.

## Submission

When you have written all the code and run all the cells, fill in your ID as well as the name of the notebook.

In [38]:
STIL_ID = ["da20exampl-s", "ad02exampl-z"] # Write your stil ids as a list
CURRENT_NOTEBOOK_PATH = os.path.join(os.getcwd(), 
                                     "5-triples.ipynb") # Write the name of your notebook

The submission code will send your answer. It consists of the most frequent triples in four languages, as well as the triples with named entities.

In [None]:
import json
ANSWER = json.dumps({'freq_triples_sv': freq_triples_sv,
                     'freq_triples_fr': freq_triples_fr,
                     'freq_triples_ru': freq_triples_ru,
                     'freq_triples_en': freq_triples_en,
                     'entity_relation_en': entity_relation_en
                    })
ANSWER

Now the moment of truth:
1. Save your notebook and
2. Run the cells below

In [40]:
SUBMISSION_NOTEBOOK_PATH = CURRENT_NOTEBOOK_PATH + ".submission.bz2"

In [41]:
import bz2
ASSIGNMENT = 5
API_KEY = "f581ba347babfea0b8f2c74a3a6776a7"

# Copy and compress current notebook
with bz2.open(SUBMISSION_NOTEBOOK_PATH, mode="wb") as fout:
    with open(CURRENT_NOTEBOOK_PATH, "rb") as fin:
        fout.write(fin.read())

In [42]:
import requests
res = requests.post("https://vilde.cs.lth.se/edan20checker/submit", 
                    files={"notebook_file": open(SUBMISSION_NOTEBOOK_PATH, "rb")}, 
                    data={
                        "stil_id": STIL_ID,
                        "assignment": ASSIGNMENT,
                        "answer": ANSWER,
                        "api_key": API_KEY,
                    },
               verify=True)

# from IPython.display import display, JSON
res.json()

{'msg': None,
 'status': 'correct',
 'signature': '53f5214c1ebd4148ad0c6591031e4a75838f828f107e8b6b7afe9b8a107238bf1d934ad2b83aad3b20367f4aa4e14bc7fa1f0b38b0ad62a4a8a80f27b02f7be7',
 'submission_id': '4e6f5a78-0605-412b-8eab-78c012eb9b31'}