# Analysis: Named People
## Post Annotation and Aggregation

A comparison of automated Named Entity Recognition and manual annotation

***

**Table of Contents**

  [I. Loading](#load)

  [II. Named Entity Recognition with spaCy](#ner)
  
  [III. Manual Annotation of People's Names](#annot)
  
  [IV. Comparison](#comp)
  
***

<a id="load"></a>
### I. Loading

In [1]:
# To use custom functions
import utils

# To work with CSV data
import pandas as pd

# To work with TXT data
import nltk
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
# nltk.download('punkt')
from nltk.corpus import PlaintextCorpusReader
# nltk.download('averaged_perceptron_tagger')
from nltk.corpus import stopwords
# nltk.download('stopwords')
from nltk.tag import pos_tag

# For named entity recognition (NER)
import spacy
from spacy import displacy
from collections import Counter
try:
    import en_core_web_sm
except ImportError:
    print("Downloading en_core_web_sm model")
    import sys
    !{sys.executable} -m spacy download en_core_web_sm
import en_core_web_sm
nlp = en_core_web_sm.load()

# For fuzzy string matching
# https://github.com/seatgeek/thefuzz
from thefuzz import fuzz, process

# For statistical calculations
import numpy as np

# To export JSON data
import json

Write the archival catalog metadata descriptions used for classification to a plain-text file, then load that file as a corpus:

In [5]:
df = pd.read_csv("../data/crc_metadata/annot_descs.csv", index_col=0)
subdf = df[["description_id", "description"]]
df_agg = pd.read_csv("../data/aggregated_data/aggregated_final.csv")
df_joined = df_agg.set_index("description_id").join(subdf, on="description_id", how="left")
df_joined.shape
assert len(set(list(df_joined.description_id))) == len(set(list(df_agg.description_id)))

In [6]:
df_desc = df_joined[["description_id", "description"]]
df_desc = df_desc.drop_duplicates()
descs = list(df_desc.description)
assert len(descs) == len(set(list(df_agg.description_id)))

In [7]:
# Create a new plain-text file with one description per line
with open("../data/classified_descriptions.txt", "w") as f:
    counter = 0
    for d in descs:
        f.write(d + "\n")
        counter += 1
print(counter)

14779


In [8]:
datadir = "../data/"
descs = PlaintextCorpusReader(datadir, "classified_descriptions.txt")

In [9]:
tokens = descs.words()
print(tokens[0:20])

['Biographical', '/', 'Historical', ':', 'Sir', 'John', 'Scott', 'of', 'Scotstarvit', 'was', 'born', 'in', '1585', '.', 'He', 'was', 'the', 'brother', '-', 'in']


In [10]:
sentences = descs.sents()
print(sentences[0:5])

[['Biographical', '/', 'Historical', ':', 'Sir', 'John', 'Scott', 'of', 'Scotstarvit', 'was', 'born', 'in', '1585', '.'], ['He', 'was', 'the', 'brother', '-', 'in', '-', 'law', 'of', 'William', 'Drummond', 'of', 'Hawthornden', ',', 'the', 'poet', '.'], ['Scott', 'was', 'educated', 'at', 'St', '.', 'Leonard', "'", 's', 'College', ',', 'St', '.', 'Andrews', ',', 'entering', 'it', 'around', '1600', '.'], ['He', 'then', 'studied', 'abroad', 'before', 'returning', 'to', 'Scotland', 'having', 'been', 'called', 'to', 'the', 'Bar', '.'], ['In', '1611', 'he', 'acquired', 'Tarvet', 'and', 'other', 'land', 'in', 'Fife', 'to', 'which', 'he', 'gave', 'the', 'name', 'Scotstarvet', '.']]


In [11]:
print(len(tokens), len(sentences))

552520 16450


In [20]:
df = df.drop(columns=["file"])
df_joined2 = df_agg.set_index("description_id").join(df, on="description_id", how="left")
df_joined2 = df_joined2[["description_id", "word_count", "sent_count"]].drop_duplicates()
print(sum(list(df_joined2.sent_count)))
print(sum(list(df_joined2.word_count)))

26574
391403


<a id="ner"></a>
## II. Named Entity Recognition with spaCy
Run named entity recognition (NER) with spaCy to estimate how many person names the dataset contains and to gauge the value of manually labeling names during annotation.

In [21]:
fileids = descs.fileids()

In [22]:
sentences = []
for fileid in fileids:
    file = descs.raw(fileid)
    sentences += nltk.sent_tokenize(file)
print(len(sentences))

16155


In [27]:
spacy_person_list = []
for s in sentences:
    s = s.strip()  # remove leading and trailing whitespace (strip() returns a new string)
    s_ne = nlp(s)
    for entity in s_ne.ents:
        if entity.label_ == 'PERSON':
            spacy_person_list.append(entity.text)

In [28]:
spacy_unique_persons = list(set(spacy_person_list))
print(len(spacy_person_list), len(spacy_unique_persons))

24026 7815


In [29]:
print(spacy_unique_persons[100:150])

['John Reid', 'M S Bartlett', 'Aunt Gertrude', "Précis de l'art", 'Grandin', 'Steele', 'Broglie', 'Elena L. Grigorenko', 'P. De Sousa', 'Clorinda', 'Cramond', 'Alexander Darroch', 'George Goldie', 'Dr Blyth', 'Kesting', 'David Somervell', 'James Denholm', 'L. Oppenheim', 'Behold', 'Johanna', 'Harvie', 'Thomas Tod', 'Jean-de-Dieu Soult', 'Charlotte Banks', 'Roger H L\n', 'Keith Vickerman', 'Stanley Booth-Clibborn\nScope', 'Muir', 'Scots Suites', 'Inscribed E. Gabritschevsky', 'Variieren', 'Dean', 'Inscribed E. [Forbes', 'Randi', 'Henry Gilbert', 'H. Monteath', 'Lyneham Lad"', 'John Riddell', 'Anne McLaren', 'Arthur Dean', 'Leo Harrison\n', 'Karnovsky', 'Abraham Weinshall', 'John Berry', 'Hydie', 'B. Ed', 'Buenos Aires', 'Gordon Chiled', 'Jean Giles', 'Dean Willis']


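The results are not perfect: spaCy labels some non-person entities as `PERSON`, such as `Buenos Aires` and `Behold`, and some entity strings carry line breaks or trailing text, such as `Stanley Booth-Clibborn\nScope`.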
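
One way to reduce part of this noise is to normalize the entity strings before counting them. The sketch below is an assumption, not part of the original pipeline: `clean_entity_text` is a hypothetical helper that keeps only the text before the first line break and trims surrounding whitespace and quotation marks. It would not remove genuine false positives such as `Buenos Aires`.

```python
def clean_entity_text(text):
    """Hypothetical cleanup for a spaCy entity string (illustration only)."""
    # Keep only the text before the first line break, if any
    first_line = text.split("\n", 1)[0]
    # Trim surrounding whitespace and stray quotation marks
    return first_line.strip().strip("\"'").strip()


# Entity strings taken from the spaCy output shown above
examples = ["Stanley Booth-Clibborn\nScope", "Roger H L\n", 'Lyneham Lad"']
cleaned = [clean_entity_text(e) for e in examples]
print(cleaned)
```

Applying a step like this before building the set of unique names would merge variants that differ only in trailing debris, so the unique-name count could drop slightly.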

<a id="annot"></a>
## III. Manual Annotation of People's Names

In [30]:
df = pd.read_csv("../data/aggregated_data/aggregated_final.csv")
df.head()

Unnamed: 0,agg_ann_id,file,text,ann_offsets,label,category,associated_genders,description_id
0,0,Coll-1157_00100.ann,knighted,"(1407, 1415)",Gendered-Role,Linguistic,Unclear,2364
1,1,Coll-1310_02300.ann,knighthood,"(9625, 9635)",Gendered-Role,Linguistic,Unclear,4542
2,2,Coll-1281_00100.ann,Prince Regent,"(2426, 2439)",Gendered-Role,Linguistic,Unclear,3660
3,3,Coll-1310_02700.ann,knighthood,"(9993, 10003)",Gendered-Role,Linguistic,Unclear,4678
4,4,Coll-1310_02900.ann,Sir,"(7192, 7195)",Gendered-Role,Linguistic,Unclear,4732


In [32]:
df_ppl = df.loc[df.category == "Person-Name"]
df_ppl = df_ppl.drop_duplicates()
df_ppl.head()

Unnamed: 0,agg_ann_id,file,text,ann_offsets,label,category,associated_genders,description_id
7,7,Coll-1036_00500.ann,Mrs Norman Macleod,"(36375, 36393)",Feminine,Person-Name,Unclear,1082
14,14,Coll-1010_00100.ann,Dr. Nelly Renee Deme,"(40, 60)",Unknown,Person-Name,Unclear,855
15,15,Coll-1036_00300.ann,Marjory Kennedy-Fraser,"(14570, 14592)",Unknown,Person-Name,Unclear,1038
16,16,Coll-1036_00300.ann,Marjory Kennedy Fraser,"(14698, 14720)",Unknown,Person-Name,Unclear,1038
17,17,Coll-1036_00300.ann,Marjory Kennedy-Fraser,"(14924, 14946)",Unknown,Person-Name,Unclear,1038


In [47]:
# Some instances of people's names may be annotated more than once with different labels, so 
# remove these duplicate instances when calculating total people
dedup_df_ppl = df_ppl[["file", "text", "category", "ann_offsets", "description_id"]].drop_duplicates()
total_ppl = dedup_df_ppl.shape[0]  # without dedup: 31158
df_mas = df_ppl.loc[df_ppl.label == "Masculine"]
df_fem = df_ppl.loc[df_ppl.label == "Feminine"]
df_unk = df_ppl.loc[df_ppl.label == "Unknown"]
total_mas = df_mas.shape[0]
total_fem = df_fem.shape[0]
total_unk = df_unk.shape[0]
print("Total people:", total_ppl)
print("Total Masculine:", total_mas)
print("Total Feminine:", total_fem)
print("Total Unknown:", total_unk)

Total people: 29384
Total Masculine: 6087
Total Feminine: 1836
Total Unknown: 23234
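
The three label totals sum to 31,157, which is more than the 29,384 deduplicated people, because a name annotated with two different labels counts once as a person but once per label. A minimal sketch of that distinction, using invented records and plain Python instead of pandas:

```python
from collections import Counter

# Toy annotations (invented values): the first span carries two labels
annotations = [
    {"file": "a.ann", "ann_offsets": (40, 48), "text": "Jane Doe", "label": "Unknown"},
    {"file": "a.ann", "ann_offsets": (40, 48), "text": "Jane Doe", "label": "Feminine"},
    {"file": "b.ann", "ann_offsets": (10, 18), "text": "John Doe", "label": "Masculine"},
]

# One entry per annotated span, ignoring the label
unique_spans = {(a["file"], a["ann_offsets"], a["text"]) for a in annotations}

# One count per (span, label) pair
label_counts = Counter(a["label"] for a in annotations)

print("Total people:", len(unique_spans))
print("Sum of label totals:", sum(label_counts.values()))
```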


In [56]:
unique_ppl = set(list(df_ppl.text))
unique_mas = set(list(df_mas.text))
unique_fem = set(list(df_fem.text))
unique_unk = set(list(df_unk.text))
print("Unique people names:", len(unique_ppl))
print("Unique masculine-labeled names:", len(unique_mas))
print("Unique feminine-labeled names:", len(unique_fem))
print("Unique unknown-labeled names:", len(unique_unk))

Unique people names: 10288
Unique masculine-labeled names: 2121
Unique feminine-labeled names: 655
Unique unknown-labeled names: 8316


## IV. Automated Annotation of People's Names
Compare the Person Name annotations of the highest-performing Person Name and Occupation Classifier (PNOC, with Linguistic labels as features) to the manual and spaCy annotations of Person Names.

First, join the original text data, from the aggregated dataset, to the classifier's prediction data:

In [48]:
f = "../data/aggregated_data/aggregated_final.csv"
df = pd.read_csv(f)
df = df.loc[df.category == "Person-Name"]
df.head()

Unnamed: 0,agg_ann_id,file,text,ann_offsets,label,category,associated_genders,description_id
7,7,Coll-1036_00500.ann,Mrs Norman Macleod,"(36375, 36393)",Feminine,Person-Name,Unclear,1082
14,14,Coll-1010_00100.ann,Dr. Nelly Renee Deme,"(40, 60)",Unknown,Person-Name,Unclear,855
15,15,Coll-1036_00300.ann,Marjory Kennedy-Fraser,"(14570, 14592)",Unknown,Person-Name,Unclear,1038
16,16,Coll-1036_00300.ann,Marjory Kennedy Fraser,"(14698, 14720)",Unknown,Person-Name,Unclear,1038
17,17,Coll-1036_00300.ann,Marjory Kennedy-Fraser,"(14924, 14946)",Unknown,Person-Name,Unclear,1038


In [49]:
# The baseline PNOC performed best for classifying with the Person Name labels (Experiment 3's first model)
pnoc_pred = "../data/token_clf_data/experiment3/5fold/output/crf-arow_pers_o_baseline_fastText100_annot_evaluation.csv"
df_pnoc = pd.read_csv(pnoc_pred, index_col=0, low_memory=False)
df_pnoc = df_pnoc.drop_duplicates()
df_pnoc.head()

Unnamed: 0,description_id,sentence_id,ann_id,expected_label,predicted_label,_merge
6848,1082,2590,7.0,Feminine,Feminine,true positive
65634,855,1097,14.0,O,Feminine,false positive
2709,855,1097,14.0,Unknown,O,false negative
66267,1038,1485,15.0,O,Feminine,false positive
3709,1038,1485,15.0,Unknown,O,false negative


In [50]:
df_pnoc.shape

(115201, 6)

In [51]:
df = df.rename(columns={"agg_ann_id":"ann_id"})
subdf = df[["ann_id", "text"]]
df_pnoc = subdf.join(df_pnoc.set_index("ann_id"), how="outer", on="ann_id")
df_pnoc.head()

Unnamed: 0,ann_id,text,description_id,sentence_id,expected_label,predicted_label,_merge
7.0,7,Mrs Norman Macleod,1082.0,2590.0,Feminine,Feminine,true positive
14.0,14,Dr. Nelly Renee Deme,855.0,1097.0,O,Feminine,false positive
14.0,14,Dr. Nelly Renee Deme,855.0,1097.0,Unknown,O,false negative
15.0,15,Marjory Kennedy-Fraser,1038.0,1485.0,O,Feminine,false positive
15.0,15,Marjory Kennedy-Fraser,1038.0,1485.0,Unknown,O,false negative


Count the total predicted person names, as well as total predicted feminine, masculine, and unknown names:

In [110]:
clf_df_ppl = df_pnoc.loc[~df_pnoc.predicted_label.isna()]
clf_df_ppl = clf_df_ppl.loc[clf_df_ppl.predicted_label != "O"]
clf_df_fem = clf_df_ppl.loc[clf_df_ppl.predicted_label == "Feminine"]
clf_df_mas = clf_df_ppl.loc[clf_df_ppl.predicted_label == "Masculine"]
clf_df_unk = clf_df_ppl.loc[clf_df_ppl.predicted_label == "Unknown"]

# Note: Some instances of people's names may be annotated more than once with different labels
clf_total_ppl = clf_df_ppl.shape[0]
clf_total_fem = clf_df_fem.shape[0]
clf_total_mas = clf_df_mas.shape[0]
clf_total_unk = clf_df_unk.shape[0]

print("Total people:", clf_total_ppl)
print("Total Masculine:", clf_total_mas)
print("Total Feminine:", clf_total_fem)
print("Total Unknown:", clf_total_unk)

Total people: 20304
Total Masculine: 3835
Total Feminine: 1724
Total Unknown: 12042
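
Note that the three label totals sum to 17,601, which is less than the 20,304 rows counted as people. One possible explanation is that the filter on `predicted_label != "O"` also keeps rows with other predicted labels. Tallying every predicted label is a quick way to check. The sketch below uses invented values, and the `"Occupation"` label name is an assumption for illustration only:

```python
from collections import Counter

# Toy predictions (invented values); None stands in for a missing prediction
predicted = ["Feminine", "O", "Unknown", "Masculine", "Unknown", None, "Occupation"]

# Drop missing values and the "O" (outside) label, then tally what remains
kept = [p for p in predicted if p is not None and p != "O"]
label_counts = Counter(kept)

# Count only the three Person Name labels
person_labels = {"Feminine", "Masculine", "Unknown"}
person_total = sum(c for label, c in label_counts.items() if label in person_labels)

print("Rows kept:", len(kept))
print("Rows with a Person Name label:", person_total)
print(label_counts)
```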


In [111]:
clf_unique_ppl = list(set(clf_df_ppl.text))
# Keep only string values (this drops NaN)
clf_unique_people = [person for person in clf_unique_ppl if isinstance(person, str)]

clf_unique_mas = set(list(clf_df_mas.text))
clf_unique_fem = set(list(clf_df_fem.text))
clf_unique_unk = set(list(clf_df_unk.text))
print("Unique people names:", len(clf_unique_people))
print("Unique masculine-labeled names:", len(clf_unique_mas))
print("Unique feminine-labeled names:", len(clf_unique_fem))
print("Unique unknown-labeled names:", len(clf_unique_unk))

Unique people names: 4305
Unique masculine-labeled names: 1259
Unique feminine-labeled names: 414
Unique unknown-labeled names: 3167


Compare with the expected (manually-annotated) Person Name labels:

In [129]:
df_ppl = df_pnoc.loc[~df_pnoc.expected_label.isna()]
df_ppl = df_ppl.loc[df_ppl.expected_label != "O"]
df_fem = df_ppl.loc[df_ppl.expected_label == "Feminine"]
df_mas = df_ppl.loc[df_ppl.expected_label == "Masculine"]
df_unk = df_ppl.loc[df_ppl.expected_label == "Unknown"]

unique_ppl = set(list(df_ppl.text))
unique_mas = set(list(df_mas.text))
unique_fem = set(list(df_fem.text))
unique_unk = set(list(df_unk.text))
print("Unique people names:", len(unique_ppl))
print("Unique masculine-labeled names:", len(unique_mas))
print("Unique feminine-labeled names:", len(unique_fem))
print("Unique unknown-labeled names:", len(unique_unk))

Unique people names: 6460
Unique masculine-labeled names: 1847
Unique feminine-labeled names: 537
Unique unknown-labeled names: 4808


<a id="comp"></a>
## IV. Comparison

Using exact and fuzzy string matching, compare the unique person names that spaCy found with those that the annotators and the classifier found.

In [130]:
print("Unique people names in spaCy:  ", len(spacy_unique_persons))
print("Unique people names annotated:", len(unique_ppl))
print("Unique people names classified:", len(clf_unique_people))

Unique people names in spaCy:   7815
Unique people names annotated: 6460
Unique people names classified: 4305


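In the full annotated dataset, annotators labeled more unique names (10,288) than spaCy found (7,815). After joining to the classifier's prediction data, however, 6,460 unique annotated names remain, fewer than spaCy's 7,815. The counts alone do not show how much the sets overlap, so check for exact matches: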

In [139]:
exact_match_ann = [person_name for person_name in unique_ppl if person_name in spacy_unique_persons]
exact_match_clf = [person_name for person_name in clf_unique_people if person_name in spacy_unique_persons]
exact_match_ann_clf = [person_name for person_name in clf_unique_people if person_name in unique_ppl]
print("Manually-annotated names in spaCy names:", len(exact_match_ann))
print("Classifier-annotated names in spaCy", len(exact_match_clf))
print("Classifier-annotated names in manual-annotated names", len(exact_match_ann_clf))

Manually-annotated names in spaCy names: 2621
Classifier-annotated names in spaCy 1732
Classifier-annotated names in manual-annotated names 4305
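
Exact matching is symmetric: the overlap between two collections of unique names is the same whichever one is iterated over. A minimal sketch with toy lists (names borrowed from the outputs above, but the lists themselves are invented), using set intersection, which also avoids scanning a whole list for every name:

```python
# Toy name lists (invented for illustration)
annotated = ["John Reid", "Anne McLaren", "Marjory Kennedy-Fraser"]
spacy_names = ["John Reid", "Anne McLaren", "Buenos Aires"]

# Set intersection gives the exact matches directly
exact = set(annotated) & set(spacy_names)

# The list-comprehension form finds the same names in either direction
forward = [n for n in annotated if n in spacy_names]
reverse = [n for n in spacy_names if n in annotated]

print(sorted(exact))
print(len(forward), len(reverse))
```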


In [132]:
fem_match = [n for n in unique_fem if n in spacy_unique_persons]
mas_match = [n for n in unique_mas if n in spacy_unique_persons]
unk_match = [n for n in unique_unk if n in spacy_unique_persons]
print("Feminine-labeled names found by spaCy:", len(fem_match))
print("Masculine-labeled names found by spaCy:", len(mas_match))
print("Unknown-labeled names found by spaCy:", len(unk_match))

Feminine-labeled names found by spaCy: 227
Masculine-labeled names found by spaCy: 676
Unknown-labeled names found by spaCy: 2195


In [133]:
fem_match = [n for n in unique_fem if n in clf_unique_fem]  #clf_unique_people] #433 (count where grammatical gender matched)
mas_match = [n for n in unique_mas if n in clf_unique_mas]  #clf_unique_people] 1421 (count where grammatical gender matched)
unk_match = [n for n in unique_unk if n in clf_unique_unk]  #clf_unique_people] 3118 (count where grammatical gender matched)
print("Feminine-labeled names found by own classifier:", len(fem_match))
print("Masculine-labeled names found by own classifier:", len(mas_match))
print("Unknown-labeled names found by own classifier:", len(unk_match))

Feminine-labeled names found by own classifier: 308
Masculine-labeled names found by own classifier: 893
Unknown-labeled names found by own classifier: 2695


#### Fuzzy String Matching
Evaluate overlaps more loosely using fuzzy string matching.
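
For intuition about what a fuzzy matching ratio measures, the sketch below scores pairs of strings from 0 to 100. It uses `difflib.SequenceMatcher` from the standard library as a stand-in, so that it runs without extra packages; `similarity` is a hypothetical helper, and its scores may differ slightly from those of `fuzz.ratio` in thefuzz.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0-100 similarity score (stdlib stand-in, not thefuzz itself)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


# Two spellings of the same name from the annotations score high
print(similarity("Marjory Kennedy-Fraser", "Marjory Kennedy Fraser"))

# Unrelated strings score low
print(similarity("Marjory Kennedy-Fraser", "Buenos Aires"))
```

A cutoff such as 80 or 90 then decides which near-matches count as the same name; a higher cutoff tolerates fewer spelling differences.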

In [168]:
# Remove any non-string values
spacy_unique_persons = [n for n in spacy_unique_persons if type(n) == str]
unique_ppl = [n for n in unique_ppl if type(n) == str]
clf_unique_people = [n for n in clf_unique_people if type(n) == str]
print(len(spacy_unique_persons), len(unique_ppl), len(clf_unique_people))

7815 6459 4305


In [137]:
# Shared helper: compare each name in `names` to all names in `choices`.
# Returns the count of names with no match scoring at least min_score and the
# list of (match, score) tuples for the remaining names. Note that
# process.extractBests returns at most 5 matches per name by default (limit=5).
def getFuzzyMatches(names, choices, score_method, min_score):
    all_fuzzy_matches = []
    no_fuzzy_match = 0
    for n in names:
        fuzzy_matches = process.extractBests(n, choices, scorer=score_method, score_cutoff=min_score)
        if len(fuzzy_matches) == 0:
            no_fuzzy_match += 1
        else:
            all_fuzzy_matches.extend(fuzzy_matches)
    return no_fuzzy_match, all_fuzzy_matches

# Compare each manually annotated person name to all spaCy-labeled person names
def getAnnotFuzzyMatches(score_method, min_score):
    return getFuzzyMatches(unique_ppl, spacy_unique_persons, score_method, min_score)

# Compare each manually annotated person name to all classifier-labeled person names
def getAnnotFuzzyMatcheswithClf(score_method, min_score):
    return getFuzzyMatches(unique_ppl, clf_unique_people, score_method, min_score)

# Compare each classified person name to all spaCy-labeled person names
def getClfFuzzyMatches(score_method, min_score):
    return getFuzzyMatches(clf_unique_people, spacy_unique_persons, score_method, min_score)

# Compare each spaCy-labeled person name to all manually annotated person names
def getSpacyFuzzyMatches(score_method, min_score):
    return getFuzzyMatches(spacy_unique_persons, unique_ppl, score_method, min_score)

# Compare each spaCy-labeled person name to all classified person names
def getSpacyFuzzyMatchesWithClf(score_method, min_score):
    return getFuzzyMatches(spacy_unique_persons, clf_unique_people, score_method, min_score)

In [140]:
score_method = fuzz.ratio
min_score = 90
no_fuzzy_match, all_fuzzy_matches = getAnnotFuzzyMatches(score_method, min_score)
print("Count of annotated person names without spaCy fuzzy matching ratios of at least {s}: {m}".format(s=min_score,m=no_fuzzy_match))

Count of annotated person names without spaCy fuzzy matching ratios of at least 90: 3024


Let's calculate the minimum, maximum, and average fuzzy matching ratios of manually annotated person names to spaCy person names: 

In [141]:
score_method = fuzz.ratio
scores = []
for n in unique_ppl:
    fuzzy_matches = process.extractOne(n, spacy_unique_persons, scorer=score_method)  # use default score_cutoff, which is 0
    scores += [fuzzy_matches[1]]  # first position in tuple is match string, second position in tuple is score

In [142]:
min_score = np.min(scores)
max_score = np.max(scores)
mean_score = np.mean(scores)
median_score = np.median(scores)
at_least_mean_score = [score for score in scores if score >= mean_score]

In [143]:
# Get the counts (occurrences) of each score
unique_scores, score_counts = np.unique(scores, return_counts=True)
score_counts = dict(zip(unique_scores, score_counts))
print("Mean score: "+str(mean_score))
print("Matches with minimum score of "+str(min_score)+":", score_counts[min_score])
print("Matches with maximum score of "+str(max_score)+":", score_counts[max_score])
print("Matches with mean score ("+str(mean_score)+") or higher:", len(at_least_mean_score))

Mean score: 87.26428239665583
Matches with minimum score of 41: 2
Matches with maximum score of 100: 3048
Matches with mean score (87.26428239665583) or higher: 3671
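
The summary above reduces each name to its single best score and then tallies those scores. A minimal sketch of the same steps on toy data, assuming `difflib.SequenceMatcher` as a stdlib stand-in for `fuzz.ratio` (`best_score` is a hypothetical helper, and the name lists are invented):

```python
from collections import Counter
from difflib import SequenceMatcher
from statistics import mean


def best_score(name, choices):
    """Highest 0-100 similarity between name and any choice (stdlib stand-in)."""
    return max(round(100 * SequenceMatcher(None, name, c).ratio()) for c in choices)


# Toy lists (invented for illustration)
annotated = ["John Reid", "Dr. John Reid", "Anne McLaren"]
spacy_names = ["John Reid", "Anne McLaren", "Buenos Aires"]

# One best score per annotated name
scores = [best_score(n, spacy_names) for n in annotated]
score_counts = Counter(scores)
mean_score = mean(scores)

print("Mean score:", mean_score)
print("Matches with maximum score:", score_counts[max(scores)])
print("At or above the mean:", sum(s >= mean_score for s in scores))
```

The 3,048 names with a score of 100 outnumber the 2,621 exact matches found earlier. A likely reason is that the `process` functions in thefuzz preprocess strings by default (lowercasing and replacing non-alphanumeric characters), so names that differ only in case or punctuation can still score 100.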


Now calculate the minimum, maximum, and average fuzzy string matching ratios of manually annotated person names to classified person names:

In [144]:
score_method = fuzz.ratio
# no_fuzzy_match, all_fuzzy_matches = getSpacyFuzzyMatchesWithClf(score_method, mean_score)
# print("Fuzzy matches with score at least "+str(mean_score)+":", all_fuzzy_matches)
# print("Names without a fuzzy match of at least "+str(mean_score)+":", no_fuzzy_match)
scores = []
for n in unique_ppl:
    fuzzy_matches = process.extractOne(n, clf_unique_people, scorer=score_method)  # use default score_cutoff, which is 0
    scores += [fuzzy_matches[1]]  # first position in tuple is match string, second position in tuple is score

In [145]:
min_score = np.min(scores)
max_score = np.max(scores)
mean_score = np.mean(scores)
median_score = np.median(scores)
at_least_mean_score = [score for score in scores if score >= mean_score]

In [146]:
# Get the counts (occurrences) of each score
unique_scores, score_counts = np.unique(scores, return_counts=True)
score_counts = dict(zip(unique_scores, score_counts))
print("Mean score: "+str(mean_score))
print("Matches with minimum score of "+str(min_score)+":", score_counts[min_score])
print("Matches with maximum score of "+str(max_score)+":", score_counts[max_score])
print("Matches with mean score ("+str(mean_score)+") or higher:", len(at_least_mean_score))

Mean score: 90.00959900913455
Matches with minimum score of 36: 1
Matches with maximum score of 100: 4329
Matches with mean score (90.00959900913455) or higher: 4416


Now let's do the reverse... 

#### How many of the spaCy person names appear in the manually annotated and classified person names?

In [147]:
ann_exact_match = [person_name for person_name in spacy_unique_persons if person_name in unique_ppl]
clf_exact_match = [person_name for person_name in spacy_unique_persons if person_name in clf_unique_people]
print(len(ann_exact_match))
print(len(clf_exact_match))

2621
1732


In [148]:
# score_method = fuzz.ratio
# min_score = 90
# no_fuzzy_match, all_fuzzy_matches = getSpacyFuzzyMatches(score_method, min_score)
# print("Count of spaCy names without annotated person name fuzzy matching ratios of at least {s}: {m}".format(s=min_score,m=no_fuzzy_match))

In [149]:
# score_method = fuzz.ratio
# min_score = 75
# no_fuzzy_match, all_fuzzy_matches = getSpacyFuzzyMatches(score_method, min_score)
# print("Count of spaCy names without annotated person name fuzzy matching ratios of at least {s}: {m}".format(s=min_score,m=no_fuzzy_match))

Let's calculate the minimum, maximum, and average fuzzy matching ratios of spaCy person names to manually annotated person names: 

In [150]:
score_method = fuzz.ratio
scores = []
for n in spacy_unique_persons:
    fuzzy_matches = process.extractOne(n, unique_ppl, scorer=score_method)  # use default score_cutoff, which is 0
    scores += [fuzzy_matches[1]]  # first position in tuple is match string, second position in tuple is score

In [151]:
min_score = np.min(scores)
max_score = np.max(scores)
mean_score = np.mean(scores)
median_score = np.median(scores)
at_least_mean_score = [score for score in scores if score >= mean_score]

In [152]:
# Get the counts (occurrences) of each score
unique_scores, score_counts = np.unique(scores, return_counts=True)
score_counts = dict(zip(unique_scores, score_counts))
print("Mean score: "+str(mean_score))
print("Matches with minimum score of "+str(min_score)+":", score_counts[min_score])
print("Matches with maximum score of "+str(max_score)+":", score_counts[max_score])
print("Matches with mean score ("+str(mean_score)+") or higher:", len(at_least_mean_score))

Mean score: 81.23659628918746
Matches with minimum score of 14: 1
Matches with maximum score of 100: 2818
Matches with mean score (81.23659628918746) or higher: 3749


Now calculate the minimum, maximum, and average fuzzy string matching ratios of spaCy person names to classified person names:

In [153]:
score_method = fuzz.ratio
scores = []
for n in spacy_unique_persons:
    fuzzy_matches = process.extractOne(n, clf_unique_people, scorer=score_method)  # use default score_cutoff, which is 0
    scores += [fuzzy_matches[1]]  # first position in tuple is match string, second position in tuple is score

In [154]:
min_score = np.min(scores)
max_score = np.max(scores)
mean_score = np.mean(scores)
median_score = np.median(scores)
at_least_mean_score = [score for score in scores if score >= mean_score]

In [155]:
# Get the counts (occurrences) of each score
unique_scores, score_counts = np.unique(scores, return_counts=True)
score_counts = dict(zip(unique_scores, score_counts))
print("Mean score: "+str(mean_score))
print("Matches with minimum score of "+str(min_score)+":", score_counts[min_score])
print("Matches with maximum score of "+str(max_score)+":", score_counts[max_score])
print("Matches with mean score ("+str(mean_score)+") or higher:", len(at_least_mean_score))

Mean score: 75.82072936660269
Matches with minimum score of 14: 1
Matches with maximum score of 100: 1889
Matches with mean score (75.82072936660269) or higher: 3296


Calculate fuzzy matches with a score cutoff of 80 between:
* spaCy and manual annotations
* classifier and manual annotations
* spaCy and classifier annotations
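
Unlike exact matching, counting names without a fuzzy match depends on the direction of the comparison, because each list can contain names with no counterpart in the other. A minimal sketch on toy lists, again assuming `difflib.SequenceMatcher` as a stdlib stand-in for `fuzz.ratio` (`count_unmatched` is a hypothetical helper):

```python
from difflib import SequenceMatcher


def count_unmatched(queries, choices, min_score):
    """Count queries whose best 0-100 similarity to any choice is below min_score."""
    unmatched = 0
    for q in queries:
        best = max(round(100 * SequenceMatcher(None, q, c).ratio()) for c in choices)
        if best < min_score:
            unmatched += 1
    return unmatched


# Toy lists (invented for illustration)
annotated = ["John Reid", "Anne McLaren"]
spacy_names = ["John Reid", "Anne McLaren", "Buenos Aires", "Behold"]

annot_unmatched = count_unmatched(annotated, spacy_names, 80)  # annotated -> spaCy
spacy_unmatched = count_unmatched(spacy_names, annotated, 80)  # spaCy -> annotated
print(annot_unmatched, spacy_unmatched)
```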

In [156]:
min_score = 80

In [166]:
# Compare each manually annotated person name to all spaCy-labeled person names
manual_to_spacy_no_match, manual_to_spacy_all_match = getAnnotFuzzyMatches(score_method, min_score)
print(manual_to_spacy_no_match, len(set(manual_to_spacy_all_match)))

2025 6449


In [165]:
# Compare each manually annotated person name to classifier-labeled person names
manual_to_clf_no_match, manual_to_clf_all_match = getAnnotFuzzyMatcheswithClf(score_method, min_score)
print(manual_to_clf_no_match, len(set(manual_to_clf_all_match)))

1770 7823


In [164]:
# Compare each spaCy-labeled person name to classifier-labeled person names
spacy_to_clf_no_match, spacy_to_clf_all_match = getSpacyFuzzyMatchesWithClf(score_method, min_score)
print(spacy_to_clf_no_match, len(set(spacy_to_clf_all_match)))

4895 4916


In [None]:
# # Convert NumPy ints to Python ints for JSON file writing
# unique_scores = [int(s) for s in unique_scores]
# score_counts = [int(c) for c in score_counts]

# d_array = []
# i, maxI = 0, len(unique_scores)
# while i < maxI:
#     d = dict()
#     d["unique_score"] = unique_scores[i]
#     d["count"] = score_counts[i]
#     d_array = d_array + [d]
#     i += 1

# print(d_array)

In [None]:
# score_counts_json = json.dumps(d_array)
# f = open("analysis_data/spacy_to_annot_ppl_fuzzy_ratios.json", "w")
# f.write(score_counts_json)
# f.close()
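
The commented-out cells above outline an export of score counts to JSON. A minimal sketch of that step, assuming a plain list of scores and using only the standard library; it builds the JSON string without writing a file, and the scores are invented:

```python
import json
from collections import Counter

scores = [100, 100, 82, 41]  # toy scores (invented values)

# One record per unique score, sorted by score; int() also converts NumPy
# integers to Python ints, which json.dumps requires
records = [
    {"unique_score": int(score), "count": int(count)}
    for score, count in sorted(Counter(scores).items())
]

score_counts_json = json.dumps(records)
print(score_counts_json)
```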