<b>

<p>
<center>
<font size="5">
Natural Language Processing
</font>
</center>
</p>

<p>
<center>
<font size="4">
Entity extraction with hand annotation and SpaCy
</font>
</center>
</p>

<p>
<center>
<font size="3">
Author: Sandra Valdés Salas
</font>
</center>
</p>


<p>
<center>
<font size="3">
April 2020
</font>
</center>
</p>

</b>

## _Table of Contents_
* [Part 1: Compare hand annotations](#first)
    * [1) Build method](#first1.1)
    * [2) Compare files](#first1.2)
    * [3) Calculate Cohen's Kappa](#first1.3)
    * [4) Results](#first1.4)

* [Part 2: Compare entities extracted by 2 models](#second)
    * [1) Build method to compare entities from dataset of news](#second1.1)
    * [2) Extract entities with Model 1 as reference and Model 2 as test](#second1.2)
    * [3) Recall and Precision](#second1.3)
    * [4) Results](#second1.4)

## _Introduction_

An important task in Natural Language Processing is the identification of entities (persons, locations, organizations, geopolitical entities, etc.) in a document. Entities can be hand-annotated with tools such as [Dataturks.com](https://dataturks.com/). However, this can be problematic due to discrepancies between the hand annotators.

The first part of this exercise compares the annotations of two annotators who tagged entities in 5 documents. Agreement between the two annotators is measured with Cohen's kappa coefficient.

The second part of this exercise compares the entities extracted from a corpus of news articles. Entities were extracted with two SpaCy models. Treating model 1 ("en_core_web_sm") as the reference model and model 2 ("en_core_web_md") as the test model, each entity falls into one of the following categories:

- Identified: entities that are in reference and test models 
- Unidentified: entities that are in the reference model but not in test model 
- Spurious: entities that are in test model but not in reference model

Finally, recall and precision are calculated. 

## _Import libraries_

In [1]:
import json
from pprint import pprint


# method to read annotation file
def annotation_processor(annotation_file):
    annotation_array = []

    # the file holds one JSON object per line, so each line is parsed separately
    # (the with-block closes the file once reading is done)
    with open(annotation_file) as read_annotation:
        for line in read_annotation:
            data = json.loads(line)
            annotation_array.append(data)

    # here we return an array of the individual annotations
    return annotation_array
    
# calling the annotation processor function
#annotation_processor('./annotated_data/annotated.json')

In [2]:
# here we will create two objects to store the reference annotations and your own annotations
annotations_array_person_A = annotation_processor('./annotated_data/annotated.json')

# here we load my own annotations (person B); identical files would give a perfect match
annotations_array_person_B = annotation_processor('./annotated_data/my_annotations.json')


Each element of an array is a dictionary with the keys annotation, content, extras and metadata.

In [3]:
#SAMPLE FROM REFERENCE ANNOTATION
one_annotation_from_array = annotations_array_person_A[0]
one_annotation_from_array

{'annotation': [{'label': ['Person'],
   'points': [{'end': 386, 'start': 379, 'text': 'Blackmon'}]},
  {'label': ['Person'],
   'points': [{'end': 296, 'start': 293, 'text': 'Irma'}]},
  {'label': ['Person'],
   'points': [{'end': 272, 'start': 267, 'text': 'Harvey'}]},
  {'label': ['Person'],
   'points': [{'end': 79, 'start': 74, 'text': 'Hardin'}]},
  {'label': ['Person'],
   'points': [{'end': 71, 'start': 64, 'text': 'Blackmon'}]},
  {'label': ['Person'],
   'points': [{'end': 71,
     'start': 42,
     'text': 'Chief Deputy Jonathan Blackmon'}]}],
 'content': 'According to Polk County Sheriff’s Office Chief Deputy Jonathan Blackmon, Hardin is a Rome native who worked with RomeCares and Floyd Sheriff’s officials to get the supplies to a disaster relief zone.\n\n“We originally took in the supplies for the victims of Hurricane Harvey in Texas, but when Irma hit closer to home we felt it would be more beneficial to send them to Florida,” Blackmon said.\n\nThe team that went to Flori

# Part 1: Compare hand annotations <a class="anchor" id="first"></a>

## 1) Build method <a class="anchor" id="first1.1"></a>

### 1.1. Function to extract points from annotations

In [4]:
#define a function to extract relevant information from the annotations

def points_extractor(one_annotation_from_array):
    """
    This function evaluates the annotations of one document.
    Returns: a list of dictionaries with end points, start points and text of the entity.
    """
    
    # Extract values from annotation
    annotation_information = {key:value for (key,value) in one_annotation_from_array.items() if key=='annotation'}

    points_list = []
    
    #extract the points if the document has a "person" tagged
    try:
        # extract information that matches the key = "annotation"
        for key, value in annotation_information.items():
            # iterate items in values of the dictionary
            for items in value:
                # Get points and append to points_list
                for item in items.get('points'):
                    points_list.append(item)
                            
    # Return an empty list in case the document has no "person" tagged
    # (e.g. the annotation value is missing and cannot be iterated)
    except (TypeError, AttributeError):
        return []
    
    return points_list

In [5]:
# Sample from annotations reference
one_annotation_person_A = annotations_array_person_A[1]
annotations_points_person_A = points_extractor(one_annotation_person_A)
annotations_points_person_A

[{'end': 460, 'start': 453, 'text': 'Morrison'},
 {'end': 325, 'start': 315, 'text': 'the leaders'},
 {'end': 91, 'start': 80, 'text': 'eter O’Neill'},
 {'end': 39, 'start': 32, 'text': 'Morrison'},
 {'end': 39, 'start': 26, 'text': 'Scott Morrison'}]

In [6]:
# Sample from my annotations
one_annotation_person_B = annotations_array_person_B[1]
annotations_points_person_B = points_extractor(one_annotation_person_B)
annotations_points_person_B

[{'end': 460, 'start': 453, 'text': 'Morrison'},
 {'end': 325, 'start': 315, 'text': 'the leaders'},
 {'end': 91, 'start': 79, 'text': 'Peter O’Neill'},
 {'end': 39, 'start': 32, 'text': 'Morrison'},
 {'end': 39, 'start': 26, 'text': 'Scott Morrison'}]

### 1.2. Method to count matches between my annotations and reference annotations

The **points_extractor function** returns a list of dictionaries with the start, end and text information for each entity tagged. By converting this list into a set of tuples, we can compare "my annotations" with "reference annotations" by taking a look at their intersection and symmetric difference. 

In [7]:
# Compare "start" & "end" of the reference annotations and my annotations
# Get a set of tuples for each annotation
set_A = set(tuple(d.items()) for d in annotations_points_person_A)
set_B = set(tuple(d.items()) for d in annotations_points_person_B)

#Look at common tuples
hits = set_A.intersection(set_B)
print('Total hits: {}'.format(len(hits)))
print('----')
print('Hits: {}'.format(hits))
print('\n')

#Look at uncommon tuples
missed = set_A.symmetric_difference(set_B)
print('Total missed: {}'.format(len(missed)))
print('----')
print('Missed: {}'.format(missed))


Total hits: 4
----
Hits: {(('start', 315), ('end', 325), ('text', 'the leaders')), (('start', 32), ('end', 39), ('text', 'Morrison')), (('start', 26), ('end', 39), ('text', 'Scott Morrison')), (('start', 453), ('end', 460), ('text', 'Morrison'))}


Total missed: 2
----
Missed: {(('start', 79), ('end', 91), ('text', 'Peter O’Neill')), (('start', 80), ('end', 91), ('text', 'eter O’Neill'))}
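
The comparison above relies on `tuple(d.items())`, which keeps each dictionary's key order, so two points with the same content but a different key order would produce different tuples. A minimal sketch of an order-independent variant, assuming every point has `start`, `end` and `text` keys (the helper name `to_key` is hypothetical; the sample points are taken from the outputs above):

```python
def to_key(point):
    """Convert one annotation point into a hashable (start, end, text) tuple."""
    return (point["start"], point["end"], point["text"])


# Same information as in the samples above, with deliberately mixed key order
points_a = [{"end": 91, "start": 80, "text": "eter O’Neill"},
            {"end": 39, "start": 32, "text": "Morrison"}]
points_b = [{"start": 79, "end": 91, "text": "Peter O’Neill"},
            {"text": "Morrison", "start": 32, "end": 39}]

keys_a = {to_key(p) for p in points_a}
keys_b = {to_key(p) for p in points_b}

shared = keys_a & keys_b      # points tagged identically by both annotators
differing = keys_a ^ keys_b   # points tagged by only one annotator

print(shared)
print(len(differing))
```

Because the key is built from named fields, the result does not depend on how the JSON export happens to order the keys.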


### 1.3. Putting everything together... 

In [8]:
# Define function to compare tuples and get number of hits and misses

def compare_annotations(annotations_A, annotations_B):
    """
    This function compares two different annotations for the same document. 
    Returns: categories matches, non_matches, partial_matches. 
    Each category contains a list of dictionaries.
    """
    
    #Extract 'points' using points_extractor function
    #The result is a list of dictionaries with the keys ->'end', 'start', 'text' 
    annotations_points_person_A = points_extractor(annotations_A)
    annotations_points_person_B = points_extractor(annotations_B)
    
    #Convert the above results into a set of tuples
    set_A = set(tuple(d.items()) for d in annotations_points_person_A)
    set_B = set(tuple(d.items()) for d in annotations_points_person_B)
    
    # calculate hits -> intersection 
    intersection = set_A.intersection(set_B)
    # calculate missed -> symmetric difference 
    symm_difference = set_A.symmetric_difference(set_B)
    
    # convert the sets of tuples back to lists of dictionaries
    # (the loop variable is not called `tuple`, to avoid shadowing the built-in)
    final_hits = [dict(items) for items in intersection]
    missed = [dict(items) for items in symm_difference]
    
    #--------BONUS---------
    # Calculate partial matches
    partial = []
    index = 0
    while index < len(missed)-1:
        # Check if "start" matches
        if missed[index]['start']==missed[index+1]['start']:
            partial.append(missed[index])
        # Check if "end" matches
        elif missed[index]['end']==missed[index+1]['end']:
            partial.append(missed[index])
        index+=1
    #-----END OF BONUS------
    
    # Delete missed info that appears in partial info
    final_missed = []
    final_partial = []
    for dict_ in missed:
        if dict_ in partial:
            final_partial.append(dict_)
        else:
            final_missed.append(dict_)
    
    return final_hits, final_missed, final_partial
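
The partial-match step above compares neighbouring items of a list built from a set, so which pairs get compared depends on an arbitrary ordering. A sketch of one possible alternative, assuming a hypothetical helper `find_partial_matches` that pairs every point found only in annotation A with every point found only in annotation B and keeps the pairs that share a start or an end offset:

```python
def find_partial_matches(points_a, points_b):
    """Return (a, b) pairs that are not identical but share a start or end offset."""
    def key(point):
        return (point["start"], point["end"], point["text"])

    keys_a = {key(p) for p in points_a}
    keys_b = {key(p) for p in points_b}

    # Points that only one of the two annotators tagged
    only_a = [p for p in points_a if key(p) not in keys_b]
    only_b = [p for p in points_b if key(p) not in keys_a]

    return [(a, b) for a in only_a for b in only_b
            if a["start"] == b["start"] or a["end"] == b["end"]]


# Sample points from document 2 of the outputs above
sample_a = [{"end": 91, "start": 80, "text": "eter O’Neill"},
            {"end": 39, "start": 32, "text": "Morrison"}]
sample_b = [{"end": 91, "start": 79, "text": "Peter O’Neill"},
            {"end": 39, "start": 32, "text": "Morrison"}]

pairs = find_partial_matches(sample_a, sample_b)
print(pairs)
```

This is quadratic in the number of unmatched points, which should be acceptable for a handful of annotations per document.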

## 2) Compare files   <a class="anchor" id="first1.2"></a>

In [9]:
def compare_annotation_files(array_A, array_B):
    """
    This function compares two different annotations across many documents. 
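    Prints: the number of matches, non-matches and partial matches
    between the two arrays.
    Returns: the matched, non-matched and partially matched entities,
    each as a list with one sub-list per compared document.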
    """
    
    #Keep record of entities by category
    all_matches = []
    all_non_matches = []
    all_partial_matches = []
    
    #Keep record of number of entities by category 
    num_matches = 0
    num_non_matches = 0
    num_partial_matches = 0

    # for each annotation in the reference_annotations (person A)
    for annotation in array_A:
        # for each annotation in my_annotations (person B)
        for other_annotation in array_B:
            # Match content from both arrays
            if (annotation["content"] == other_annotation["content"]):

                # Compare annotations 
                matches, non_matches, partial_matches = compare_annotations(annotation, other_annotation)
                
                # Append results to lists of entities by category
                all_matches.append(matches)
                all_non_matches.append(non_matches)
                all_partial_matches.append(partial_matches) 
                
                # Calculate numbers
                num_matches += len(matches)
                num_non_matches += len(non_matches)
                num_partial_matches += len(partial_matches)

    print('Success!!')
    print('*************')
    print('Hits: ', num_matches)
    print('Missed: ', num_non_matches)
    print('Partial hits: ', num_partial_matches)
    
    return all_matches, all_non_matches, all_partial_matches

In [10]:
hits, missed, partial = compare_annotation_files(annotations_array_person_A, annotations_array_person_B)

Success!!
*************
Hits:  21
Missed:  11
Partial hits:  2


## 3) Calculate Cohen's Kappa  <a class="anchor" id="first1.3"></a>

**Calculate Cohen's kappa score with a chance agreement of 0.3.**

In [11]:
observed_agreement = (len(hits) / (len(hits)+len(missed)))
print('Observed agreement: ', observed_agreement)

Observed agreement:  0.5


In [12]:
chance = 0.3
kappa = (observed_agreement-chance)/(1-chance)
print('Kappa score with {} chance of agreement is: {}'.format(chance, round(kappa,4)))

Kappa score with 0.3 chance of agreement is: 0.2857
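
Note that `compare_annotation_files` returns one sub-list per document, so `len(hits)` and `len(missed)` count documents rather than entities. A sketch of the same calculation at the entity level, assuming hypothetical helpers `count_entities` and `kappa_score` and toy nested lists shaped like `hits` and `missed`; whether partial matches should count towards agreement is left open here:

```python
def count_entities(per_document_lists):
    """Total number of entities across a list of per-document lists."""
    return sum(len(doc_entities) for doc_entities in per_document_lists)


def kappa_score(observed, chance):
    """Cohen's kappa for a given observed agreement and chance agreement."""
    return (observed - chance) / (1 - chance)


# Toy stand-ins with the same shape as `hits` and `missed` (hypothetical values)
toy_hits = [["a", "b"], ["c"], []]
toy_missed = [["d"], [], []]

n_hits = count_entities(toy_hits)
n_missed = count_entities(toy_missed)
observed = n_hits / (n_hits + n_missed)

print(observed, round(kappa_score(observed, chance=0.3), 4))
```

With the totals printed by `compare_annotation_files` (21 hits and 11 missed), the same calculation would give an observed agreement of 21/32 ≈ 0.656 and, with a chance agreement of 0.3, a kappa of about 0.509.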


## 4) Results   <a class="anchor" id="first1.4"></a>


Identifying a "Person" is difficult. By comparing both annotations, we can identify the following types of mistakes:
- **1) Tagging the start or end of the entity incorrectly.** For instance, in the reference annotations we can find a comma at the end of "Piper Merritt," or a missing "P" at the beginning of "eter O'Neill". In this case, it is useful to identify partial matches.
- **2) Tagging (or not tagging) a combination of words associated with the entity.** For instance, I did not include _Chief Deputy_ when tagging "Jonathan Blackmon". Clear rules for hand annotation help to avoid this type of discrepancy.
- **3) Tagging ambiguous words**. For instance, whether _'A group of students and teachers'_ counts as a Person depends on the annotator.
- **4) Tagging names that refer to other entities (events, locations, organizations).** For instance, in the reference annotations, hurricanes "Harvey" and "Irma", as well as "George Washington" (which in the document referred to a location), were incorrectly tagged as Person.

The kappa score between the reference annotations and my annotations was 0.2857. If we consider that the chance of agreement is 0.3, then we can conclude that this score is high (even though perfect agreement is 1). The chance of agreement, however, is relative. Thus the Cohen's kappa score may not be particularly useful here.
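
The chance agreement of 0.3 used above is a fixed value. In the usual formulation of Cohen's kappa, chance agreement is instead estimated from how often each annotator uses each label. A minimal sketch, assuming the two annotators' decisions are available as equal-length label sequences (the function name `cohen_kappa` and the toy labels are hypothetical and not taken from the annotated files):

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa with chance agreement estimated from label frequencies."""
    n = len(labels_a)
    # Share of items on which both annotators chose the same label
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Probability of agreeing by chance, given each annotator's label frequencies
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)


# Toy token-level decisions: "PER" = tagged as Person, "O" = not tagged
toy_labels_a = ["PER", "O", "O", "PER", "O", "O"]
toy_labels_b = ["PER", "O", "PER", "PER", "O", "O"]

toy_kappa = cohen_kappa(toy_labels_a, toy_labels_b)
print(round(toy_kappa, 4))
```

The sketch divides by zero when both annotators use a single identical label throughout, so a real implementation would need to handle that case.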

In [13]:
# Partially matched entities in documents 2 and 4
partial

[[],
 [{'end': 91, 'start': 79, 'text': 'Peter O’Neill'}],
 [],
 [{'end': 231, 'start': 218, 'text': 'Piper Merritt,'}],
 []]

In [14]:
# Successfully matched entities in documents 1 to 4
hits

[[{'end': 386, 'start': 379, 'text': 'Blackmon'},
  {'end': 71, 'start': 64, 'text': 'Blackmon'}],
 [{'end': 325, 'start': 315, 'text': 'the leaders'},
  {'end': 39, 'start': 32, 'text': 'Morrison'},
  {'end': 39, 'start': 26, 'text': 'Scott Morrison'},
  {'end': 460, 'start': 453, 'text': 'Morrison'}],
 [{'end': 10, 'start': 0, 'text': 'Joe Montana'},
  {'end': 275, 'start': 269, 'text': 'Montana'},
  {'end': 10, 'start': 4, 'text': 'Montana'},
  {'end': 44, 'start': 33, 'text': 'Dwight Clark'},
  {'end': 259, 'start': 245, 'text': 'the quarterback'}],
 [{'end': 537, 'start': 532, 'text': 'Walker'},
  {'end': 710, 'start': 701, 'text': 'John Deere'},
  {'end': 825, 'start': 816, 'text': 'John Deere'},
  {'end': 34, 'start': 25, 'text': 'John Deere'},
  {'end': 510, 'start': 501, 'text': 'John Deere'},
  {'end': 594, 'start': 588, 'text': 'Merritt'},
  {'end': 230, 'start': 224, 'text': 'Merritt'},
  {'end': 581, 'start': 576, 'text': 'Walker'},
  {'end': 537, 'start': 527, 'text': 'Do

In [15]:
# Mistakenly identified entities in all documents
missed

[[{'end': 79, 'start': 74, 'text': 'Hardin'},
  {'end': 272, 'start': 267, 'text': 'Harvey'},
  {'end': 71, 'start': 42, 'text': 'Chief Deputy Jonathan Blackmon'},
  {'end': 296, 'start': 293, 'text': 'Irma'},
  {'end': 71, 'start': 55, 'text': 'Jonathan Blackmon'}],
 [{'end': 91, 'start': 80, 'text': 'eter O’Neill'}],
 [{'end': 420, 'start': 416, 'text': 'Clark'},
  {'end': 44, 'start': 40, 'text': 'Clark'}],
 [{'end': 230, 'start': 218, 'text': 'Piper Merritt'}],
 [{'end': 116, 'start': 85, 'text': 'A group of students and teachers'},
  {'end': 60, 'start': 44, 'text': 'George Washington'}]]

# Part 2: Compare entities extracted by 2 models <a class="anchor" id="second"></a>

## 1) Build method to compare entities from dataset of news <a class="anchor" id="second1.1"></a>

### 1.1. Import corpus

In [16]:
import spacy
model_1 = spacy.load("en_core_web_sm")
model_2 = spacy.load('en_core_web_md')

In [17]:
from os import listdir
from os.path import isfile, join

In [19]:
dir_base = "./news_data/"

def read_file(filename):
    # the with-block closes the file once its content has been read
    with open(filename, encoding='utf-8') as input_file:
        input_file_text = input_file.read()
    return input_file_text

def read_directory_files(directory):
    file_texts = []
    files = [f for f in listdir(directory) if isfile(join(directory, f))]
    for f in files:
        file_text = read_file(join(directory, f))
        print(file_text)
        file_texts.append({"file":f, "content": file_text })
    return file_texts
    
# UNCOMMENT THIS LINE TO GENERATE THE RESULTS OF PART 2:

#text_corpus = read_directory_files(dir_base)

### 1.2. Method to extract entities 

In [19]:
# extract entities
def get_entities(document_text, model):
    '''
    This function returns entities from a document tagged by a specific spacy model
    '''
    analyzed_doc = model(document_text)
    return [entity for entity in analyzed_doc.ents if entity.label_ in ["PERSON", "ORG", "LOC", "GPE"]]

### 1.3. Method to compare entities extracted by 2 models

In [20]:
def compare_entities_from_document(reference_entities, test_entities):
    '''
    This function compares the entities identified by a reference model and a test model. 
    The function returns the following categories:
    - Identified: entities that are in reference and test models 
    - Unidentified: entities that are in the reference model but not in test model 
    - Spurious: entities that are in test model but not in reference model
    '''
    
    correct_identified_entities = []
    correct_unidentified_entities = []
    spurious_identified_entites = []
    
    # items in the test set that are also in the reference set
    for ent_test in test_entities:
        for ent_ref in reference_entities:
            # if the text and label are equal, append to identified list
            if (ent_test.text==ent_ref.text) and (ent_test.label_==ent_ref.label_):
                correct_identified_entities.append(ent_test)
            # if the text matches but the label differs, append to unidentified list
            elif (ent_test.text==ent_ref.text) and (ent_test.label_!=ent_ref.label_):
                correct_unidentified_entities.append(ent_test)
    
    # items in the reference set and that are NOT IN THE TEST SET 
    for ent_ref in reference_entities:
        if ent_ref not in test_entities:
            correct_unidentified_entities.append(ent_ref)
    
    # items in the test set that are NOT IN THE REFERENCE SET
    for ent_test in test_entities:
        if ent_test not in reference_entities and ent_test not in correct_identified_entities:
            spurious_identified_entites.append(ent_test)
    
    return correct_identified_entities, correct_unidentified_entities, spurious_identified_entites
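
The membership tests above (`ent_ref not in test_entities`) rely on how spaCy defines equality between `Span` objects that come from different `Doc` objects, which may differ between spaCy versions. A sketch of an alternative that compares plain `(start_char, end_char, label_)` tuples instead; the helper name `compare_by_offsets` is hypothetical, and the `Ent` namedtuple only stands in for spaCy spans so that the example runs without a model:

```python
from collections import namedtuple

# Stand-in exposing the same attribute names as a spaCy Span (assumption for the demo)
Ent = namedtuple("Ent", ["start_char", "end_char", "label_", "text"])


def compare_by_offsets(reference_entities, test_entities):
    """Split entities into identified, unidentified and spurious by offsets and label."""
    def key(ent):
        return (ent.start_char, ent.end_char, ent.label_)

    ref_keys = {key(e) for e in reference_entities}
    test_keys = {key(e) for e in test_entities}

    identified = [e for e in test_entities if key(e) in ref_keys]
    unidentified = [e for e in reference_entities if key(e) not in test_keys]
    spurious = [e for e in test_entities if key(e) not in ref_keys]
    return identified, unidentified, spurious


# Toy entities (hypothetical offsets and labels)
toy_reference = [Ent(0, 5, "PERSON", "Blair"), Ent(10, 16, "ORG", "Labour")]
toy_test = [Ent(0, 5, "PERSON", "Blair"), Ent(10, 16, "GPE", "Labour"),
            Ent(20, 24, "ORG", "Tory")]

toy_identified, toy_unidentified, toy_spurious = compare_by_offsets(toy_reference, toy_test)
print(len(toy_identified), len(toy_unidentified), len(toy_spurious))
```

Unlike the function above, this variant matches entities by position rather than by text, so a name that appears several times in a document is counted once per occurrence; the resulting counts would therefore not necessarily match the figures reported in this notebook.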
    

## 2) Extract entities with Model 1 as reference and Model 2 as test <a class="anchor" id="second1.2"></a>

In [21]:
# Categories list
overall_identified_entities = []
overall_unidentified_entities = []
overall_spurious_entities = []

for document in text_corpus:
    
    # *******Set model reference and model test*******
    entities_1 = get_entities(document["content"], model_1)
    entities_2 = get_entities(document["content"], model_2)
    # ***********************************************
    
    # Apply function to identify entities by category
    identified, unidentified, spurious = compare_entities_from_document(entities_1, entities_2)
    
    # Append list of entities extracted in every document to main lists
    overall_identified_entities.append(identified)
    overall_unidentified_entities.append(unidentified)
    overall_spurious_entities.append(spurious)
    

The output is a list of lists for each category. Each inner list contains the entities extracted from one document.
For example, below is the first list of "correctly identified entities" from document 1.  

In [22]:
# We can define a function to flatten the list of lists and to manipulate the data more easily.
def flatten_list(list_):
    return [ent for entity in list_ for ent in entity]

In [23]:
# Apply function to model 1 as reference and model 2 as test
identified_entities = flatten_list(overall_identified_entities)
unidentified_entities = flatten_list(overall_unidentified_entities)
spurious_entites = flatten_list(overall_spurious_entities)

## 3) Recall and Precision <a class="anchor" id="second1.3"></a>

In [24]:
num_identified = len(identified_entities)
num_unidentified = len(unidentified_entities)
num_spurious = len(spurious_entites)

In [25]:
#How useful or relevant?
precision = num_identified / (num_identified + num_spurious)
print('Precision: ',precision)

Precision:  0.9576291942199363


In [26]:
#How complete?
recall = num_identified / (num_identified + num_unidentified)
print('Recall: ', recall)

Recall:  0.8771733034212003
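
Precision and recall can also be combined into a single F1 score, their harmonic mean. A sketch using the category counts reported in this notebook (7820 identified, 1095 unidentified, 346 spurious); the helper name `precision_recall_f1` is hypothetical:

```python
def precision_recall_f1(num_identified, num_unidentified, num_spurious):
    """Precision, recall and F1 from the three category counts."""
    precision = num_identified / (num_identified + num_spurious)
    recall = num_identified / (num_identified + num_unidentified)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


p, r, f1 = precision_recall_f1(7820, 1095, 346)
print(round(p, 4), round(r, 4), round(f1, 4))
```

F1 lies between the two values and is pulled towards the lower one, so it penalises a model that trades recall for precision.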


## 4) Results <a class="anchor" id="second1.4"></a>

- When testing model 2 (with model 1 as the reference), we can see that the **_precision is higher than recall_**.
- **Precision** measures the total hits (correctly identified entities) divided by all retrieved entities (relevant and not relevant). In other words, it is the percentage of the results that are relevant and, since it is high, we can speculate that this is a better model.
- **Recall**, on the other hand, measures the total hits (correctly identified entities) divided by all relevant entities (retrieved and not retrieved); in other words, it is the percentage of all relevant results that the test model classified correctly.
- A large number of the entities unidentified by model 2 were of type "PERSON" (506 in total), followed by "ORG" (436 in total). This suggests that, if we are more interested in extracting "PERSON" and "ORG" accurately, model 1 would be a better model.

**Total number of entities in each category with Model 1 as reference and Model 2 as test:**

In [27]:
# Print results
print('Correct identified entities: ', num_identified)
print('Correct unidentified entities: ', num_unidentified)
print('Spurious entities identified:  ', num_spurious)

Correct identified entities:  7820
Correct unidentified entities:  1095
Spurious entities identified:   346


**Total number and type of unidentified entities by Model 2:**

In [28]:
# function to count entities
from collections import Counter

def get_frequencies(list_):
    # most_common() sorts the counts in descending order
    return dict(Counter(list_).most_common())

In [29]:
# Take a look at unidentified entities in model 2
unidentified_entities_text = [ent.text for ent in unidentified_entities]
print(get_frequencies(unidentified_entities_text))



In [30]:
unidentified_entities_label = [ent.label_ for ent in unidentified_entities]
print(get_frequencies(unidentified_entities_label))

{'PERSON': 506, 'ORG': 436, 'GPE': 134, 'LOC': 19}


**Total number and type of spurious entities by Model 2:**

In [31]:
# Take a look at spurious
spurious_entities_text = [ent.text for ent in spurious_entites]
print(get_frequencies(spurious_entities_text))

{'Labour': 17, 'Tory': 10, 'TPS': 6, 'Kilroy-Silk': 6, 'Baume': 6, 'Blair': 4, 'Parker Bowles': 4, 'Jeeves': 4, 'TGWU': 4, 'Umbro': 4, 'Dr Gibbons': 4, 'the Lib Dems': 3, 'MSPs': 3, 'Home': 3, 'Hockney': 3, 'MPs': 2, 'MoD': 2, 'Eurozone': 2, 'MP': 2, 'Parliamentary': 2, 'IHRC': 2, 'Royal': 2, 'Archbishop': 2, 'Canterbury': 2, 'NAO': 2, 'Parmalat': 2, 'Straw': 2, 'Malik': 2, 'Chidambaram': 2, 'Midlands': 2, 'Maran': 2, 'FDI': 2, 'Mittal': 2, 'Dr Fox': 2, 'MTFG': 2, 'Westminster': 2, 'Nath': 2, 'NHS': 2, 'Lord Falconer': 2, "Dr Gibbons'": 2, 'Alfa': 2, 'Asbos': 2, 'AIG': 1, 'Bewlay': 1, 'Atinc Ozkan': 1, 'Mark Ballard': 1, 'Guantanamo': 1, "BBC One's Breakfast": 1, 'Wastealot': 1, 'Wellington': 1, "Plaid Cymru's": 1, 'German': 1, 'University College London': 1, 'VW': 1, 'the Mercedes Benz C180 SE': 1, 'The Lib Dems': 1, 'May.': 1, 'Bill': 1, 'Children': 1, 'Margaret Hodge': 1, 'Boothroyd': 1, 'Commons Speaker': 1, 'BBC1': 1, 'Easter': 1, 'Docklands': 1, 'the White City': 1, 'SIFF': 1, 'M

In [32]:
spurious_entities_label = [ent.label_ for ent in spurious_entites]
print(get_frequencies(spurious_entities_label))

{'ORG': 177, 'PERSON': 130, 'GPE': 33, 'LOC': 6}
