# Analysis: Named People
## Post Annotation and Aggregation

A comparison of automated Named Entity Recognition and manual annotation

***

**Table of Contents**

  [I. Loading](#load)

  [II. Named Entity Recognition with spaCy](#ner)
  
  [III. Manual Annotation of People's Names](#annot)
  
  [IV. Comparison](#comp)
  
***

<a id="load"></a>
## I. Loading

In [4]:
# To use custom functions
import utils

# To work with CSV data
import pandas as pd

# To work with TXT data
import nltk
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
# nltk.download('punkt')
from nltk.corpus import PlaintextCorpusReader
# nltk.download('averaged_perceptron_tagger')
from nltk.corpus import stopwords
# nltk.download('stopwords')
from nltk.tag import pos_tag

# For named entity recognition (NER)
import spacy
from spacy import displacy
from collections import Counter
try:
    import en_core_web_sm
except ImportError:
    print("Downloading en_core_web_sm model")
    import sys
    !{sys.executable} -m spacy download en_core_web_sm
import en_core_web_sm
nlp = en_core_web_sm.load()

# For fuzzy string matching
# https://github.com/seatgeek/thefuzz
from thefuzz import fuzz, process

# For statistical calculations
import numpy as np

# To export JSON data
import json

Load the plaintext files of archival catalog metadata descriptions:

In [11]:
datadir = "data/"
descs = PlaintextCorpusReader(datadir+"descriptions", r".+\.txt")  # raw string avoids an invalid "\." escape

In [12]:
tokens = descs.words()
print(tokens[0:20])

['Professor', 'James', 'Aitken', 'White', 'was', 'a', 'leading', 'Scottish', 'Theologian', 'and', 'Moderator', 'of', 'the', 'General', 'Assembly', 'of', 'the', 'Church', 'of', 'Scotland']


In [13]:
sentences = descs.sents()
print(sentences[0:5])

[['Professor', 'James', 'Aitken', 'White', 'was', 'a', 'leading', 'Scottish', 'Theologian', 'and', 'Moderator', 'of', 'the', 'General', 'Assembly', 'of', 'the', 'Church', 'of', 'Scotland', '.'], ['He', 'was', 'educated', 'at', 'Daniel', 'Stewart', "'", 's', 'College', 'and', 'the', 'University', 'of', 'Edinburgh', 'where', 'he', 'studied', 'philosophy', 'and', 'divinity', '.'], ['After', 'his', 'ordination', 'he', 'spent', 'three', 'years', 'as', 'an', 'army', 'Chaplain', 'and', 'then', 'in', '1948', 'was', 'inducted', 'to', 'Dunollie', 'Road', 'Church', 'in', 'Oban', '.'], ['James', 'Whyte', 'moved', 'to', 'Mayfield', 'North', 'Church', 'in', 'Edinburgh', 'in', '1954', 'and', 'in', '1958', 'was', 'appointed', 'to', 'the', 'chair', 'of', 'practical', 'theology', 'and', 'Christian', 'ethics', 'at', 'the', 'University', 'of', 'St', 'Andrew', "'", 's', 'where', 'he', 'remained', 'until', '1987', '.'], ['His', 'primary', 'interests', 'were', 'in', 'liturgy', 'and', 'ecclesiastical', 'archi

<a id="ner"></a>
## II. Named Entity Recognition with spaCy
Run named entity recognition (NER) to estimate how many people's names appear in the dataset and to gauge the value of manually labeling names during the annotation process.

In [14]:
fileids = descs.fileids()

In [15]:
sentences = []
for fileid in fileids:
    file = descs.raw(fileid)
    sentences += nltk.sent_tokenize(file)

In [16]:
person_list = []
for s in sentences:
    s_ne = nlp(s)
    for entity in s_ne.ents:
        if entity.label_ == 'PERSON':
            person_list += [entity.text] 
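
The collection step above can also be written as a small helper that tallies `PERSON` mentions in a `Counter` (already imported in the loading cell), so that the total and unique counts come from one structure. This is a minimal sketch, not the notebook's method: `count_person_mentions` and the stand-in `fake_docs` are hypothetical names, and the helper only assumes doc-like objects with an `.ents` attribute. With spaCy, such docs could come from `nlp.pipe(sentences)`.

```python
from collections import Counter
from types import SimpleNamespace


def count_person_mentions(docs):
    """Tally the text of every entity labeled PERSON across doc-like objects."""
    counts = Counter()
    for doc in docs:
        for entity in doc.ents:
            if entity.label_ == "PERSON":
                counts[entity.text] += 1
    return counts


def _ent(text, label):
    """Build a stand-in for a spaCy entity (hypothetical test data only)."""
    return SimpleNamespace(text=text, label_=label)


# Stand-in docs that mimic the two spaCy attributes used above
fake_docs = [
    SimpleNamespace(ents=[_ent("James Whyte", "PERSON"), _ent("Edinburgh", "GPE")]),
    SimpleNamespace(ents=[_ent("James Whyte", "PERSON"), _ent("John Baillie", "PERSON")]),
]

person_counts = count_person_mentions(fake_docs)
total_mentions = sum(person_counts.values())  # analogous to len(person_list)
unique_names = len(person_counts)             # analogous to len(unique_persons)
print(total_mentions, unique_names)
```

A `Counter` also makes it easy to inspect the most frequent names with `person_counts.most_common(10)`.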

In [17]:
unique_persons = list(set(person_list))
print(len(unique_persons))

7867


In [18]:
print(unique_persons[100:150])

['William Ord', 'Willie Johnston', 'W. J. Sedgefield', 'Herbert Mather Spoor', 'Leonora Vigoleno', 'Robert J. Sternberg', 'Johann Wolfgang von', 'Bencher', 'John Jones', 'xiv+448', 'Gomirato', 'Sheila', 'Copia Inventarii', 'Michael Wynne', 'John D. Sutherland', "Godfrey H Thomson's", 'Sphagnum', 'Walter Scottmanuscript', 'S. Blott', 'Robert Baillie', 'Dick Crossman', 'Smertenko', 'Syme', 'Fenwick', 'David Sainsbury', 'Ambler', 'Don', 'James D. Macgregor', 'Joseph W. Hills', 'OcynSontag', 'Peter Sharp', 'Anita', 'Andrew Broomhall', 'Basil Spence', 'Harold Braley]', 'Rev John Baillie', 'Ian Gilmour', 'Lily Jackson', 'George Stephen', 'K. M.', 'John Rhodes', 'Lucy', 'Stan Trevor', 'E. R. WardKammerer', 'Bernadotte', 'James Maidment', 'Micheal', 'Carl', 'M. Ritchie', 'James S. K. Elmslie']


The results are not perfect: some non-person strings are labeled as `PERSON`, such as `JerusalemPalestineSelzer` and `Arithmetique`, as well as `xiv+448` and `Sphagnum` in the sample above.
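
Some of these false positives share surface patterns (digits, stray brackets, or two words fused together) that a simple filter could flag before any comparison. The sketch below is a hypothetical heuristic, not part of the notebook's pipeline: `looks_like_person_name` and both regular expressions are assumptions made for illustration, and the `Mc`/`Mac` exception is a guess at what a Scottish archive would need.

```python
import re

# Digits, brackets, and arithmetic symbols rarely occur in personal names
_SYMBOLS = re.compile(r"[0-9\[\]()+=]")
# A lowercase letter directly followed by an uppercase one suggests fused words,
# unless the lowercase letter ends a "Mc" or "Mac" prefix (e.g. "MacLeod")
_FUSED = re.compile(r"[a-z](?<!\bMc)(?<!\bMac)[A-Z]")


def looks_like_person_name(text):
    """Return False when the string shows an obvious non-name pattern."""
    if _SYMBOLS.search(text):
        return False
    if _FUSED.search(text):
        return False
    return True


samples = ["John Jones", "xiv+448", "Harold Braley]", "OcynSontag", "Norman MacLeod", "Sphagnum"]
for s in samples:
    print(s, looks_like_person_name(s))
```

A surface-level filter like this cannot catch ordinary words such as `Sphagnum`, so it would reduce noise rather than remove it.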

In [19]:
print(len(person_list))

24438


<a id="annot"></a>
## III. Manual Annotation of People's Names

In [5]:
df = pd.read_csv(datadir+"aggregated_final.csv", index_col=0)
df.head()

| Unnamed: 0 | file | offsets | text | label | category |
|---|---|---|---|---|---|
| 0 | Coll-1434_11900.ann | (1954, 1957) | his | Generalization | Linguistic |
| 1 | Coll-1397_00100.ann | (2633, 2638) | Lords | Generalization | Linguistic |
| 2 | Coll-1310_00800.ann | (3703, 3706) | Man | Generalization | Linguistic |
| 3 | Coll-1434_14500.ann | (5782, 5788) | cowboy | Generalization | Linguistic |
| 4 | BAI_02300.ann | (1586, 1596) | shipmaster | Generalization | Linguistic |


In [6]:
df_ppl = df.loc[df.category == "Person-Name"]
df_ppl.head()

| Unnamed: 0 | file | offsets | text | label | category |
|---|---|---|---|---|---|
| 31 | Coll-1036_00500.ann | (36375, 36393) | Mrs Norman Macleod | Feminine | Person-Name |
| 53 | Coll-1010_00100.ann | (40, 60) | Dr. Nelly Renee Deme | Unknown | Person-Name |
| 54 | Coll-1036_00300.ann | (14570, 14592) | Marjory Kennedy-Fraser | Unknown | Person-Name |
| 55 | Coll-1036_00300.ann | (14698, 14720) | Marjory Kennedy Fraser | Unknown | Person-Name |
| 56 | Coll-1036_00300.ann | (14924, 14946) | Marjory Kennedy-Fraser | Unknown | Person-Name |


In [9]:
total_ppl = df_ppl.shape[0]
df_mas = df_ppl.loc[df_ppl.label == "Masculine"]
df_fem = df_ppl.loc[df_ppl.label == "Feminine"]
df_unk = df_ppl.loc[df_ppl.label == "Unknown"]
total_mas = df_mas.shape[0]
total_fem = df_fem.shape[0]
total_unk = df_unk.shape[0]
print("Total people:", total_ppl)
print("Total Masculine:", total_mas)
print("Total Feminine:", total_fem)
print("Total Unknown:", total_unk)

Total people: 31502
Total Masculine: 6087
Total Feminine: 1836
Total Unknown: 23234
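
A quick arithmetic check on the totals printed above shows that the three label counts do not add up to the overall total, which suggests that some `Person-Name` rows carry a label other than these three. The sketch below only reuses the printed figures; in the notebook itself, `df_ppl.label.value_counts()` would presumably show which labels account for the difference.

```python
# Figures copied from the printed output above
total_ppl = 31502
label_totals = {"Masculine": 6087, "Feminine": 1836, "Unknown": 23234}

labeled_sum = sum(label_totals.values())
remainder = total_ppl - labeled_sum
print("Sum of the three labels:", labeled_sum)    # 31157
print("Rows not covered by them:", remainder)     # 345

# Share of all annotated person names held by each label
for label, n in label_totals.items():
    print(f"{label}: {n / total_ppl:.1%}")
```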


In [10]:
unique_ppl = set(list(df_ppl.text))
unique_mas = set(list(df_mas.text))
unique_fem = set(list(df_fem.text))
unique_unk = set(list(df_unk.text))
print("Unique people names:", len(unique_ppl))
print("Unique masculine-labeled names:", len(unique_mas))
print("Unique feminine-labeled names:", len(unique_fem))
print("Unique unknown-labeled names:", len(unique_unk))

Unique people names: 10294
Unique masculine-labeled names: 2121
Unique feminine-labeled names: 655
Unique unknown-labeled names: 8316


<a id="comp"></a>
## IV. Comparison

Compare the total and unique person names that spaCy found with those the annotators labeled, using exact and fuzzy string matching.

In [15]:
print("Total people names in spaCy: ", len(person_list))
print("Total people names annotated:", total_ppl)
print("\nUnique people names in spaCy:  ", len(unique_persons))
print("Unique people names annotated:", len(unique_ppl))

Total people names in spaCy:  24438
Total people names annotated: 31502

Unique people names in spaCy:   7867
Unique people names annotated: 10294


Annotators labeled more person names than spaCy did, in both total and unique counts, but...

#### How many of the manually-annotated names are included in the spaCy labels?

In [16]:
exact_match = [person_name for person_name in unique_ppl if person_name in unique_persons]
print(len(exact_match))

3210
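
Because exact matching is a set intersection, the same count can be computed with set operations, which also avoids repeated list lookups. With the figures above, 3210 shared names amount to about 31% of the 10294 unique annotated names and about 41% of the 7867 unique spaCy names. The helper below is a minimal sketch on hypothetical toy data; `exact_overlap` is not a function from the notebook, and it assumes both inputs are non-empty.

```python
def exact_overlap(annotated, detected):
    """Return shared names plus their share of each (non-empty) collection."""
    annotated, detected = set(annotated), set(detected)
    shared = annotated & detected
    return shared, len(shared) / len(annotated), len(shared) / len(detected)


# Hypothetical toy data for illustration
toy_annotated = ["Basil Spence", "John Baillie", "Marjory Kennedy-Fraser", "Lily Jackson"]
toy_detected = ["Basil Spence", "John Baillie", "Sphagnum"]

shared, share_of_annotated, share_of_detected = exact_overlap(toy_annotated, toy_detected)
print(sorted(shared))
print(f"{share_of_annotated:.0%} of annotated names, {share_of_detected:.0%} of detected names")
```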


In [21]:
fem_match = [n for n in unique_fem if n in unique_persons]
mas_match = [n for n in unique_mas if n in unique_persons]
unk_match = [n for n in unique_unk if n in unique_persons]
print("Feminine-labeled names found by spaCy:", len(fem_match))
print("Masculine-labeled names found by spaCy:", len(mas_match))
print("Unknown-labeled names found by spaCy:", len(unk_match))

Feminine-labeled names found by spaCy: 253
Masculine-labeled names found by spaCy: 747
Unknown-labeled names found by spaCy: 2733
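
The three per-label counts sum to 3733, more than the 3210 exact matches found overall, which is what would be expected if some names were annotated under more than one label. The sketch below illustrates the effect on hypothetical toy data; the label assignments are invented for the example and are not taken from the dataset.

```python
from collections import Counter

# Hypothetical label sets; a name may appear under several labels
label_sets = {
    "Feminine": {"Marjory Kennedy-Fraser", "Mrs Norman Macleod"},
    "Masculine": {"John Baillie"},
    "Unknown": {"Marjory Kennedy-Fraser", "John Baillie", "Basil Spence"},
}

# Count how many label sets each name belongs to
membership = Counter(name for names in label_sets.values() for name in names)
multi_label = {name for name, n in membership.items() if n > 1}

per_label_sum = sum(len(names) for names in label_sets.values())
print("Sum of per-label counts:", per_label_sum)
print("Distinct names:", len(membership))
print("Names under more than one label:", sorted(multi_label))
```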


In [35]:
score_method = fuzz.ratio
min_score = 90
no_fuzzy_match, all_fuzzy_matches = utils.getAnnotFuzzyMatches(score_method, min_score)
print("Count of annotated person names without spaCy fuzzy matching ratios of at least {s}: {m}".format(s=min_score,m=no_fuzzy_match))

Count of annotated person names without spaCy fuzzy matching ratios of at least 38: 4175


In [34]:
score_method = fuzz.ratio
min_score = 75
no_fuzzy_match, all_fuzzy_matches = utils.getAnnotFuzzyMatches(score_method, min_score)
print("Count of annotated person names without spaCy fuzzy matching ratios of at least {s}: {m}".format(s=min_score,m=no_fuzzy_match))

Count of annotated person names without spaCy fuzzy matching ratios of at least 38: 4175
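
The body of `utils.getAnnotFuzzyMatches` is not shown here, so the following is only a sketch of what a threshold-based fuzzy comparison of this kind could look like. It swaps in the standard library's `difflib.SequenceMatcher` for `fuzz.ratio`, which may score some pairs differently; `similarity` and `names_without_match` are hypothetical names.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0-100 similarity score (a stand-in for fuzz.ratio)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def names_without_match(queries, candidates, min_score):
    """Find each query's best-scoring candidate and list queries below min_score."""
    unmatched, best = [], {}
    for query in queries:
        scored = ((cand, similarity(query, cand)) for cand in candidates)
        best_name, best_score = max(scored, key=lambda pair: pair[1])
        best[query] = (best_name, best_score)
        if best_score < min_score:
            unmatched.append(query)
    return unmatched, best


# Hypothetical toy data for illustration
toy_annotated = ["Marjory Kennedy-Fraser", "Basil Spence"]
toy_spacy = ["Marjory Kennedy Fraser", "John Jones"]

unmatched, best = names_without_match(toy_annotated, toy_spacy, min_score=90)
print(unmatched)
print(best)
```

Comparing every query against every candidate is quadratic in the number of names, which is acceptable for a few thousand names but worth keeping in mind for larger sets.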


Let's calculate the minimum, maximum, mean, and median fuzzy matching ratios of manually annotated person names to spaCy person names:

In [29]:
score_method = fuzz.ratio
scores = []
for n in unique_ppl:
    fuzzy_matches = process.extractOne(n, unique_persons, scorer=score_method)  # use default score_cutoff, which is 0
    scores += [fuzzy_matches[1]]  # first position in tuple is match string, second position in tuple is score

In [30]:
min_score = np.min(scores)
max_score = np.max(scores)
mean_score = np.mean(scores)
median_score = np.median(scores)

In [32]:
# Get the counts (occurrences) of each score
unique_scores, score_counts = np.unique(scores, return_counts=True)
score_counts = dict(zip(unique_scores, score_counts))
print("Mean score: "+str(mean_score))
print("Matches with minimum score of "+str(min_score)+":", score_counts[min_score])
print("Matches with maximum score of "+str(max_score)+":", score_counts[max_score])
print("Matches with median score of "+str(median_score)+":", score_counts[median_score])

Mean score: 82.00699436564989
Matches with minimum score of 38: 1
Matches with maximum score of 100: 3619
Matches with median score of 80.0: 249
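
One fragile point in a summary like this is the median lookup: when the number of scores is even, the median can fall between two observed scores (for example 80.5), and indexing a plain dict with it raises a `KeyError`. The sketch below shows one way to guard against that using only the standard library; `summarize_scores` and the toy scores are hypothetical.

```python
from collections import Counter
from statistics import mean, median


def summarize_scores(scores):
    """Return min, max, and median with their occurrence counts, plus the mean."""
    counts = Counter(scores)
    lo, hi, mid = min(scores), max(scores), median(scores)
    return {
        "min": (lo, counts[lo]),
        "max": (hi, counts[hi]),
        "mean": mean(scores),
        # The median may not be an observed score, so default its count to 0
        "median": (mid, counts.get(mid, 0)),
    }


toy_scores = [38, 80, 81, 100]  # hypothetical scores
summary = summarize_scores(toy_scores)
print(summary)
```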


Now let's do the reverse... 

#### How many of the spaCy person names appear in the manually annotated person names?

In [43]:
exact_match = [person_name for person_name in unique_persons if person_name in unique_ppl]
print(len(exact_match))

3210


In [37]:
score_method = fuzz.ratio
min_score = 90
no_fuzzy_match, all_fuzzy_matches = utils.getSpacyFuzzyMatches(score_method, min_score)
print("Count of spaCy names without annotated person name fuzzy matching ratios of at least {s}: {m}".format(s=min_score,m=no_fuzzy_match))

Count of spaCy names without annotated person name fuzzy matching ratios of at least 90: 4252


In [39]:
score_method = fuzz.ratio
min_score = 75
no_fuzzy_match, all_fuzzy_matches = utils.getSpacyFuzzyMatches(score_method, min_score)
print("Count of spaCy names without annotated person name fuzzy matching ratios of at least {s}: {m}".format(s=min_score,m=no_fuzzy_match))

Count of spaCy names without annotated person name fuzzy matching ratios of at least 75: 2570
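
These counts are easier to read as shares of the 7867 unique spaCy names. The sketch below only reuses the figures printed above and derives the complementary matched counts from them.

```python
# Figures copied from the printed output above
unique_spacy = 7867
unmatched_by_threshold = {90: 4252, 75: 2570}

matched_by_threshold = {}
for threshold, unmatched in unmatched_by_threshold.items():
    matched = unique_spacy - unmatched
    matched_by_threshold[threshold] = matched
    print(
        f"ratio >= {threshold}: {matched} matched ({matched / unique_spacy:.1%}), "
        f"{unmatched} unmatched ({unmatched / unique_spacy:.1%})"
    )
```

Lowering the threshold from 90 to 75 moves 1682 spaCy names from unmatched to matched.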


Let's calculate the minimum, maximum, mean, and median fuzzy matching ratios of spaCy person names to manually annotated person names:

In [40]:
score_method = fuzz.ratio
scores = []
for n in unique_persons:
    fuzzy_matches = process.extractOne(n, unique_ppl, scorer=score_method)  # use default score_cutoff, which is 0
    scores += [fuzzy_matches[1]]  # first position in tuple is match string, second position in tuple is score

In [41]:
min_score = np.min(scores)
max_score = np.max(scores)
mean_score = np.mean(scores)
median_score = np.median(scores)

In [47]:
# Get the counts (occurrences) of each score
unique_scores, score_counts = np.unique(scores, return_counts=True)
score_counts = dict(zip(unique_scores, score_counts))
print("Mean score: "+str(mean_score))
print("Matches with minimum score of "+str(min_score)+":", score_counts[min_score])
print("Matches with maximum score of "+str(max_score)+":", score_counts[max_score])
print("Matches with median score of "+str(median_score)+":", score_counts[median_score])

Mean score: 84.14859539850006
Matches with minimum score of 14: 1
Matches with maximum score of 100: 3286
Matches with median score of 86.0: 129


In [49]:
# Convert NumPy ints to Python ints for JSON file writing.
# score_counts is a dict keyed by score, so take its values to get the counts
# (iterating over the dict itself would yield the scores again).
unique_scores = [int(s) for s in unique_scores]
score_counts = [int(c) for c in score_counts.values()]

# Pair each unique score with its count
d_array = [
    {"unique_score": s, "count": c}
    for s, c in zip(unique_scores, score_counts)
]

print(d_array)

In [50]:
score_counts_json = json.dumps(d_array)
# The context manager closes the file even if the write fails
with open("analysis_data/spacy_to_annot_ppl_fuzzy_ratios.json", "w") as f:
    f.write(score_counts_json)
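
The export step pairs each unique score with its count and writes the records as JSON. A minimal, self-contained sketch of that step with a read-back check follows; `write_score_counts` is a hypothetical helper, the output path is a temporary file rather than the notebook's `analysis_data/` path, and the three sample counts are the ones printed in the summary above.

```python
import json
import os
import tempfile


def write_score_counts(score_counts, path):
    """Write a {score: count} mapping as a list of JSON records, sorted by score."""
    records = [
        {"unique_score": int(score), "count": int(count)}
        for score, count in sorted(score_counts.items())
    ]
    with open(path, "w") as f:
        json.dump(records, f)
    return records


# Three score counts taken from the printed summary above
sample_counts = {14: 1, 86: 129, 100: 3286}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "fuzzy_ratios.json")
    written = write_score_counts(sample_counts, path)
    with open(path) as f:
        loaded = json.load(f)

print(loaded)
```

Casting with `int()` matters because NumPy integer types are not JSON serializable by the standard `json` module.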