## Assignment 2

Assigment is in the intersection of Named Entity Recognition and Dependency Parsing.

0. Evaluate spaCy NER on CoNLL 2003 data (provided)
    - report token-level performance (per class and total)
        - accuracy of correctly recognizing all tokens that belong to named entities (i.e. tag-level accuracy) 
    - report CoNLL chunk-level performance (per class and total);
        - precision, recall, f-measure of correctly recognizing all the named entities in a chunk per class and total  

1. Grouping of Entities.
Write a function to group recognized named entities using `noun_chunks` method of [spaCy](https://spacy.io/usage/linguistic-features#noun-chunks). Analyze the groups in terms of most frequent combinations (i.e. NER types that go together). 

2. One of the possible post-processing steps is to fix segmentation errors.
Write a function that extends the entity span to cover the full noun-compounds. Make use of `compound` dependency relation.

## Settings

Import all the python libraries and read the dataset file. Here there are the functions to read the conll txt file. Also, I've removed all the _-DOCSTART-_ from the dataset.
The two main functions are:
- __create_corpus(data)__: this function reads the file and returns a list of list. Each list contain the string of the sentence just read from the dataset

ex:
```
[["SOCCER - JAPAN GET LUCKY WIN , CHINA IN SURPRISE DEFEAT ."], ...]
```
- __create_corpus_ner(data)__: this function reads the file and returns a list of list. Each list contain the tuple _(Token, name_entity)_

ex:
```
[[('SOCCER', 'O'), ('-', 'O'), ('JAPAN', 'B-LOC'), ('GET', 'O'), ('LUCKY', 'O'), ('WIN', 'O'), (',', 'O'), ('CHINA', 'B-PER'), ('IN', 'O'), ('SURPRISE', 'O'), ('DEFEAT', 'O'), ('.', 'O')], ...]
```

---

__NOTE:__
It would be more efficient to process all the sentence and save the spacy doc in an external list, but for clarity, each function is re-processed by SpaCy nlp.

---


In [1]:
from numpy import exp
import conll
import spacy
from spacy.tokens import Span
import pandas as pd
from sklearn.metrics import classification_report

nlp = spacy.load('en_core_web_sm')

corpus_data = conll.read_corpus_conll("data/conll2003/train.txt")
test_data = conll.read_corpus_conll("data/conll2003/test.txt")

def read_data_fields(corpus_file):
    return [[data[0].split() for data in sent] for sent in corpus_file]

def get_data_text(data):
    return [sent[0] for sent in data]

def get_data_ner(data):
    return [(sent[0], sent[3]) for sent in data]

def create_corpus(read_corpus):
    text = []
    for sent in read_corpus:
        tmp = get_data_text(sent)
        if tmp != ["-DOCSTART-"]:
            text.append(" ".join(tmp))
    return text

def create_corpus_ner(read_corpus):
    ner_corpus = []
    for sent in read_corpus:
        tmp = get_data_ner(sent)
        if tmp[0][0] != "-DOCSTART-":
            ner_corpus.append(tmp)
    return ner_corpus

#################################################

# Read the datest from conll2003/test.txt

data = read_data_fields(test_data)
corpus_text = create_corpus(data)
corpus_ner = create_corpus_ner(data)

# Output example of create_corpus and corpus_corpus_ner

print(corpus_text[0])
print()
print(corpus_ner[0])

SOCCER - JAPAN GET LUCKY WIN , CHINA IN SURPRISE DEFEAT .

[('SOCCER', 'O'), ('-', 'O'), ('JAPAN', 'B-LOC'), ('GET', 'O'), ('LUCKY', 'O'), ('WIN', 'O'), (',', 'O'), ('CHINA', 'B-PER'), ('IN', 'O'), ('SURPRISE', 'O'), ('DEFEAT', 'O'), ('.', 'O')]


## Point 0.1
- report token-level performance (per class and total)
    - accuracy of correctly recognizing all tokens that belong to named entities (i.e. tag-level accuracy) 

---

To report the token-level performance, the first thing to do is to convert the spacy token in a conll format. The entity types of spacy that do not belong to the conll format were linked to the _MISC_ type. Each sentence is processed by SpaCy nlp, and then post-processed to match the conll.

The function __spacy_token(doc)__ take as input a spacy doc and returns a new list with the same characteristic as the output of __create_corpus_ner__. It makes use of the _token.whitespace__ to detect when a token is read.

To convert the spacy entity type the function __convert_ent_type(ent_type)__ use a dictionay to return the new type as a string.

In [2]:
def convert_ent_type(ent_type):
    ent = {
    'PERSON': 'PER',
    'NORP': 'MISC',
    'FAC': 'LOC',
    'ORG': 'ORG',
    'GPE': 'LOC',
    'LOC': 'LOC',
    'PRODUCT': 'MISC',
    'EVENT': 'MISC',
    'WORK_OF_ART': 'MISC',
    'LAW': 'MISC',
    'LANGUAGE': 'MISC',
    'DATE': 'MISC',
    'TIME': 'MISC',
    'PERCENT': 'MISC',
    'MONEY': 'MISC',
    'QUANTITY': 'MISC',
    'ORDINAL': 'MISC',
    'CARDINAL': 'MISC',
    '': ''}
    return ent[ent_type]

def spacy_token(doc):
    token_list = []
    string = ""
    iob = ""
    first = True
    for ind, token in enumerate(doc):
        string += token.text
        # save only the fist iob of a ner
        if first:
            if token.ent_iob_ == "O": # O type
                iob = token.ent_iob_
            else:
                iob = token.ent_iob_ + "-" + convert_ent_type(token.ent_type_) # B-TYPE or I-TYPE
            first = False

        # when a space is found the ner is complete
        if token.whitespace_ == " ":
            token_list.append((string, iob))
            string = ""
            first = True

        # add last token
        if ind == len(doc)-1:
            token_list.append((string, iob))

    return token_list

#################################################

# Output example of spacy_token

print(spacy_token(nlp(corpus_text[2])))

[('AL-AIN', 'B-ORG'), (',', 'O'), ('United', 'B-ORG'), ('Arab', 'I-ORG'), ('Emirates', 'I-ORG'), ('1996-12-06', 'B-MISC')]


To report the accuracy-level, I go through the entire sentences of the dataset, process it with _nlp_ and then save all the spacy labels and conll labels in two lists. The report is then computed by the function of scikit-learn _classification_report_ in __accuracy_token_level(data)__, the output is the report given by scikit.

In [3]:
def accuracy_token_level(data):
    corpus_ner = create_corpus_ner(data)
    corpus_text = create_corpus(data)
    list_spacy = []
    list_total = []

    for index, sent in enumerate(corpus_ner):
        spacy_doc = nlp(corpus_text[index])
        tokens = spacy_token(spacy_doc)

        for i in range(len(tokens)):
            list_total.append(sent[i][1])
            list_spacy.append(tokens[i][1])

    return classification_report(list_total, list_spacy)

print(accuracy_token_level(data))

              precision    recall  f1-score   support

       B-LOC       0.76      0.68      0.72      1668
      B-MISC       0.10      0.57      0.17       702
       B-ORG       0.52      0.31      0.38      1661
       B-PER       0.80      0.63      0.70      1617
       I-LOC       0.54      0.56      0.55       257
      I-MISC       0.05      0.38      0.09       216
       I-ORG       0.42      0.51      0.46       835
       I-PER       0.84      0.79      0.81      1156
           O       0.94      0.86      0.90     38323

    accuracy                           0.81     46435
   macro avg       0.55      0.59      0.53     46435
weighted avg       0.89      0.81      0.84     46435



## Point 0.2

- report CoNLL chunk-level performance (per class and total);
    - precision, recall, f-measure of correctly recognizing all the named entities in a chunk per class and total
    
---

The chunk-level accuracy is computed by the __conll.evaluate__ function. Here it is a little bit simpler: whit __accuracy_chunk_level(data)__ I append in a list (_refs_) all the converted spacy tokens, and compare them with the dataset tokens inside the _corpus_ner_ list. The results are then stored and showed in a pandas dataframe.

In [4]:
def accuracy_chunk_level(data):
    corpus_ner = create_corpus_ner(data)
    corpus_text = create_corpus(data)
    hyps = corpus_ner
    refs = []

    for index in range(len(corpus_text)):
        spacy_doc = nlp(corpus_text[index])
        tokens = spacy_token(spacy_doc)
        refs.append(tokens)

    results = conll.evaluate(refs, hyps)
    pd_tbl = pd.DataFrame().from_dict(results, orient='index')
    return pd_tbl.round(decimals=3)

accuracy_chunk_level(data)

Unnamed: 0,p,r,f,s
LOC,0.672,0.748,0.708,1499
MISC,0.553,0.1,0.169,3879
ORG,0.276,0.464,0.346,989
PER,0.609,0.774,0.681,1271
total,0.523,0.386,0.444,7638


## Point 1. Grouping of Entities
Write a function to group recognized named entities using `noun_chunks` method of [spaCy](https://spacy.io/usage/linguistic-features#noun-chunks). Analyze the groups in terms of most frequent combinations (i.e. NER types that go together).

---

The main function is __gropu_ner(data)__. This function takes as input a string of a sentence and returns a list of lists containing all the entities. To include all the single tokens which do not are in the chunk, a second for loop iterate all over the entities of the doc, excluding the one that is already in the noun_chunks.
To then count the frequency of each group recognized, the function __fr_comb(data)__ takes as input a list of sentences, process the sentences with __group_ner__ and then save each group in a dictionary, where the keys of this dictionary are the groups converted in string by the function __key_string(chunk)__.

I decided to keep the order of the entities distinct to preserve the "syntactic" meaning. The output is the ordered dictionary by the number of frequency.


In [5]:
def group_ner(data):
    spacy_doc = nlp(data)
    group = []
    exclude = []

    for chunk in spacy_doc.noun_chunks:
        ent = []
        for e in chunk.ents:
            ent.append(e.label_)
            exclude.append(e)
        if len(ent) > 0:
            group.append(ent)
    
    # add the remaining name entities if they are not in noun_chunks
    for e in spacy_doc.ents:
        if e not in exclude:
            if len([e.label_]) > 0:
                group.append([e.label_])

    return group

def key_string(chunk):
    keys = "["
    for i, el in enumerate(chunk):
        if i == 0:
            keys += el
        else:
            keys = ", ".join([keys, el])
    keys += "]"
    return keys

def fr_comb(data):
    dic_fr = {}
    for sent in data:
        ner_gr = group_ner(sent)
        for tok in ner_gr:
            key = key_string(tok)
            if key not in dic_fr.keys():
                dic_fr[key] = 0
            dic_fr[key] += 1

    return dict(sorted(dic_fr.items(), key=lambda item: item[1], reverse=True))

for key, value in fr_comb(corpus_text).items():
    print("{0} : {1}".format(key, value))

[CARDINAL] : 1624
[GPE] : 1255
[PERSON] : 1074
[DATE] : 997
[ORG] : 873
[NORP] : 293
[MONEY] : 147
[ORDINAL] : 111
[TIME] : 83
[PERCENT] : 81
[EVENT] : 58
[LOC] : 54
[CARDINAL, PERSON] : 51
[QUANTITY] : 51
[NORP, PERSON] : 43
[GPE, PERSON] : 34
[GPE, GPE] : 26
[FAC] : 22
[PRODUCT] : 22
[ORG, PERSON] : 21
[CARDINAL, ORG] : 19
[CARDINAL, NORP] : 15
[CARDINAL, GPE] : 13
[GPE, ORG] : 13
[LAW] : 11
[WORK_OF_ART] : 10
[GPE, PRODUCT] : 9
[DATE, EVENT] : 8
[DATE, ORG] : 8
[ORG, ORG] : 8
[PERSON, PERSON] : 8
[NORP, ORG] : 8
[DATE, TIME] : 7
[ORG, DATE] : 6
[LANGUAGE] : 6
[GPE, DATE] : 5
[CARDINAL, CARDINAL] : 5
[NORP, ORDINAL] : 5
[ORG, GPE] : 5
[DATE, NORP] : 5
[GPE, ORDINAL] : 4
[ORDINAL, PERSON] : 4
[GPE, CARDINAL] : 4
[ORG, NORP] : 4
[PERSON, GPE] : 4
[CARDINAL, DATE] : 3
[ORG, CARDINAL] : 3
[CARDINAL, PERSON, CARDINAL] : 3
[NORP, NORP] : 3
[PERSON, PERSON, PERSON] : 2
[CARDINAL, ORDINAL] : 2
[CARDINAL, CARDINAL, PERSON] : 2
[ORG, ORDINAL] : 2
[LANGUAGE, ORDINAL] : 2
[GPE, DATE, ORG] : 2
[O

## Point 2. 
One of the possible post-processing steps is to fix segmentation errors.
Write a function that extends the entity span to cover the full noun-compounds. Make use of `compound` dependency relation.

---

By looking at the Spacy [documentation](https://spacy.io/usage/linguistic-features#setting-entities), there is a handy function called __set_ents__. This function reset the entity if Spacy does not detect it and put it back in the doc. To expand the entities, I iterate over all the entities detected by spacy. Then I iterate over all the token of each entity to look up at the children at the immediate left and right.
If one of these tokens has the "compound" dependency relation and is outside the entity (ent._iob_ == 'O'), I expand it to include them in the new entity. The just extended entity may, in turn, have another dependency as "compound". All this process is done again, inside a while loop. If there are no more expanded entities, the loop ends and return the new doc with the refreshed entities.

In [6]:
def extend_entity(data):
    spacy_doc = nlp(data)
    new_doc = spacy_doc
    expand = True

    while expand:
        expand = False
        for ent in spacy_doc.ents:
            for tok in ent:
                for child in tok.children:
                    # if the compound is near the entity

                    if child.dep_ == "compound" and child.ent_iob_ == "O" and ent.start-1 >= 0 and child == spacy_doc[ent.start-1]:

                        fb_ent = Span(new_doc, ent.start-1, ent.end, label=ent.label_)
                        new_doc.set_ents([fb_ent], default="unmodified")
                        expand = True

                    if child.dep_ == "compound" and child.ent_iob_ == "O" and ent.end+1 < len(spacy_doc) and child == spacy_doc[ent.end+1]:

                        fb_ent = Span(new_doc, ent.start, ent.end+1, label=ent.label_)
                        new_doc.set_ents([fb_ent], default="unmodified")
                        expand = True

        spacy_doc = new_doc

    return new_doc

To calculate the accuray of the new expanded entities, I use a similar function that use the __conll.evaluate__ method with the new doc. 

From the results, it can be seen that the smaller are with this new method does not improve spacy accuracy.

In [7]:
def accuracy_expansion(data):
    corpus_ner = create_corpus_ner(data)
    corpus_text = create_corpus(data)
    hyps = corpus_ner
    refs = []

    for index in range(len(corpus_text)):
        spacy_doc = extend_entity(corpus_text[index])
        tokens = spacy_token(spacy_doc)
        refs.append(tokens)

    results = conll.evaluate(refs, hyps)
    pd_tbl = pd.DataFrame().from_dict(results, orient='index')
    return pd_tbl.round(decimals=3)

accuracy_expansion(data)

Unnamed: 0,p,r,f,s
LOC,0.653,0.726,0.688,1499
MISC,0.551,0.1,0.169,3879
ORG,0.272,0.456,0.34,989
PER,0.515,0.655,0.577,1271
total,0.489,0.361,0.415,7638


## Extras

This function returns as output a list containg the string of all the entities readed by SpaCy and its type

In [8]:
def add_whitespace(token, string):
    if token.whitespace_ == "":
        string += token.text
    else:
        string += token.text + " "
    return string

def spacy_to_conll(doc):
    conll = []
    string = ""
    ent_type = ""
    beginI = False
    for ind, token in enumerate(doc):
        if token.ent_iob_ == 'B':
            if beginI:
                conll.append((string.rstrip(), ent_type))
                string = ""
                beginI = False

            string = add_whitespace(token, string)
            ent_type = convert_ent_type(token.ent_type_)

            if ind == len(doc)-1:
                conll.append((string.rstrip(), ent_type))

            beginI = True

        elif token.ent_iob_ == 'I':
            string = add_whitespace(token, string)
            
            if ind == len(doc)-1:
                conll.append((string.rstrip(), ent_type))

        else: # 'O'
            if beginI:
                conll.append((string.rstrip(), ent_type))
                string = ""
                beginI = False

            string = add_whitespace(token, string)

            conll.append((string.rstrip(), ""))
            string = ""

    return conll

print(spacy_to_conll(nlp(corpus_text[2])))

[('AL-AIN', 'ORG'), (',', ''), ('United Arab Emirates', 'ORG'), ('1996-12-06', 'MISC')]
