This notebook qualitatively and quantitatively evaluates the NER labeling done by the fine-tuned BERT model. The evaluation compares against the baseline metrics (see `baselineBERT.ipynb`) and against the labels produced by Stanford NER, an NER model that has been given high praise in the field.

I'm comparing how closely my fine-tuned model performs against Stanford NER and whether both perform better than the baseline. At the end of the notebook, I note a major caveat of my evaluation.
- Since I do not have gold-standard NER labels (my data is new and has not undergone human annotation), I'm using Stanford NER's labeling as a frame of reference, as Stanford NER is a renowned NER tool. I assume that Stanford NER's labels are correct (a "silver standard" rather than a "gold standard") and measure how my model performs relative to Stanford NER on the same data.

I have noted each comparison and evaluation that I run, and I give a final conclusion at the end.

Code attribution:
All code in this Notebook is mine.

In [None]:
# Load Stanford NER:

import json
from nltk.tag import StanfordNERTagger

def stanfordNERTag(jsonContentPath, stanfordTagger):
    with open(jsonContentPath, "r") as jsonFile:
        jsonContentUntagged = json.load(jsonFile)
    
    taggingResultDict = {}

    progressCounter = 0
    totalNeedToProcess = len(jsonContentUntagged)
    for newspaperName, newspaperContent in jsonContentUntagged.items():
        taggingResultDict[newspaperName] = stanfordTagger.tag(newspaperContent.split())
        progressCounter += 1
        print(f"Processed {progressCounter}/{totalNeedToProcess} files")

    with open("/Users/Jerry/Desktop/CS372/FinalProject/data/stanfordTaggingMerged(UseInEvaluation).json", "w") as jsonTagResultFile:
        json.dump(taggingResultDict, jsonTagResultFile, indent=4)

    # Return the results so that the caller's assignment is not None
    return taggingResultDict

jsonContentPath = "/Users/Jerry/Desktop/CS372/FinalProject/data/newspaperCleanedContent.json"
stanfordTagger = StanfordNERTagger('/Users/Jerry/Desktop/CS372/stanford-ner-2020-11-17/classifiers/english.muc.7class.distsim.crf.ser.gz', '/Users/Jerry/Desktop/CS372/stanford-ner-2020-11-17/stanford-ner-4.2.0.jar')

taggingResults = stanfordNERTag(jsonContentPath, stanfordTagger)
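
Stanford NER tags every token separately (non-entities receive the label `O`), whereas my model's output contains multi-word entities such as `['Standard Oil Company', 'ORGANIZATION']`. The output file name suggests the token tags were merged before evaluation, but that step is not shown in this notebook. The following is only a minimal sketch of one way to do such a merge, assuming consecutive tokens with the same label belong to one entity; the function name `merge_adjacent_tokens` and the toy input are hypothetical.

```python
from itertools import groupby

def merge_adjacent_tokens(tagged_tokens, keep_labels=("PERSON", "LOCATION", "ORGANIZATION")):
    """Collapse runs of consecutive tokens that share a label into one entity.

    tagged_tokens: list of (token, label) pairs from a token-level tagger.
    Returns a list of [entity_text, label] pairs for the labels in keep_labels.
    """
    merged = []
    # groupby yields one group per run of consecutive pairs with the same label
    for label, run in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if label in keep_labels:
            merged.append([" ".join(token for token, _ in run), label])
    return merged

# Toy input (hypothetical), shaped like the tagger's (token, label) output
example = [("Standard", "ORGANIZATION"), ("Oil", "ORGANIZATION"),
           ("Company", "ORGANIZATION"), ("of", "O"), ("Arizona", "LOCATION")]
print(merge_adjacent_tokens(example))
```

One limitation of this approach: two different entities of the same type that sit directly next to each other would be merged into a single entity, because the labels carry no begin/inside markers.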

In [5]:
# We will first compare the share of words labeled by my model and by Stanford NER against the baseline (7%)

# Reusable function that counts the words in each document and in the whole corpus
def findNumberOfWords(dataStorageJSON):
    with open(dataStorageJSON, "r") as contentJSON:
        rawContent = json.load(contentJSON)
    
    wordCount = {}
    for newspaperTitle, newspaperContent in rawContent.items():
        wordCount[newspaperTitle] = len(newspaperContent.split())
    
    totalNumberOfWords = sum(wordCount.values())

    return wordCount, totalNumberOfWords

# Reusable function that counts the labels produced by a model, per document and in total
def countNumberOfLabels(NERStorageJSON):
    with open(NERStorageJSON, "r") as storageJSON:
        NERContent = json.load(storageJSON)
    
    labelCounter = {}
    for newspaperTitle, newspaperContent in NERContent.items():
        labelCounter[newspaperTitle] = len(newspaperContent)
    
    totalNumberOfLabels = sum(labelCounter.values())

    return labelCounter, totalNumberOfLabels

In [6]:
# Define the file paths used by the two functions above

# Path to the raw data
rawDataJSON = "/Users/Jerry/Desktop/CS372/FinalProject/data/newspaperCleanedContent.json"

# Path to the labels produced by Stanford NER (to compare against)
stanfordNERJSON = "/Users/Jerry/Desktop/CS372/FinalProject/data/stanfordTaggingMerged(UseInEvaluation).json"

# Path to the labels produced by my model
projectNERJSON = "/Users/Jerry/Desktop/CS372/FinalProject/notebooks/entitiesOutput.json"

In [7]:
wordCountByDoc, totalWordCount = findNumberOfWords(rawDataJSON)

print(f"Total number of words in corpus: {totalWordCount}")

stanfordLabelCountDoc, stanfordTotalLabelCount = countNumberOfLabels(stanfordNERJSON)
projectLabelCountDoc, projectTotalLabelCount = countNumberOfLabels(projectNERJSON)

print(f"Stanford NER found: {stanfordTotalLabelCount} labels in total")
print(f"My model found: {projectTotalLabelCount} labels in total")

Total number of words in corpus: 1546660
Stanford NER found: 51951 labels in total
My model found: 8347 labels in total


### Evaluation 1: number of labels

This shows whether my model has found as many labels as the Stanford NER.

In [8]:
# Now determine whether I have achieved more than 7%

stanfordPercentage = (stanfordTotalLabelCount / totalWordCount) * 100
projectPercentage = (projectTotalLabelCount / totalWordCount) * 100

print(f"Stanford NER has labeled {stanfordPercentage}% of words as Named Entities")
print(f"My model has labeled {projectPercentage}% of words as Named Entities")

Stanford NER has labeled 3.358915340152328% of words as Named Entities
My model has labeled 0.5396790503407343% of words as Named Entities


Both my model and Stanford NER have labeled a smaller share of words than the 7% baseline. This does not necessarily mean that Stanford NER is bad, because we are only considering three types of Named Entities (Person, Location, and Organization) while the baseline considers all types of Named Entities.
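
To see which of the three entity types accounts for most of the labels, the totals could be broken down per label. This is a small sketch, assuming the labeled JSON has the shape `{title: [[text, label], ...]}` as in the printout further down; `count_labels_by_type` and the toy data are hypothetical.

```python
from collections import Counter

def count_labels_by_type(ner_content):
    """Count how often each label occurs across all documents.

    ner_content: {title: [[text, label], ...]}
    """
    counts = Counter()
    for label_list in ner_content.values():
        counts.update(label for _text, label in label_list)
    return counts

# Toy data (hypothetical) in the same shape as the labeled JSON files
toy = {
    "paperA": [["Phoenix", "LOCATION"], ["Miller", "PERSON"]],
    "paperB": [["Arizona", "LOCATION"]],
}
print(count_labels_by_type(toy))
```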

However, the fact that my model has labeled far fewer Named Entities than Stanford NER (about 0.54% of words versus 3.36%) shows that my model is not performing as well as Stanford NER.

In other words, my model has identified fewer Named Entities, which points to a low `recall` score relative to Stanford NER's labels.
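
The label counts alone only hint at recall. Since matches cannot exceed the number of labels my model produced, the counts above (8347 versus 51951) would cap recall at roughly 16%, assuming both files count entities in the same unit. A more direct measurement could look like the sketch below, which treats Stanford NER's labels as the silver standard and counts only exact `(text, label)` matches within the same document. The function name and toy data are hypothetical, and exact matching is a simplifying assumption (partial matches such as `rizona` versus `Arizona` count as misses).

```python
from collections import Counter

def silver_precision_recall(predicted, silver):
    """Exact-match precision and recall of predicted labels against silver labels.

    Both arguments have the shape {title: [[text, label], ...]}.
    """
    matched = n_pred = n_silver = 0
    for title in set(predicted) | set(silver):
        pred_counts = Counter((text, label) for text, label in predicted.get(title, []))
        silver_counts = Counter((text, label) for text, label in silver.get(title, []))
        # Counter intersection keeps the minimum count of each shared pair
        matched += sum((pred_counts & silver_counts).values())
        n_pred += sum(pred_counts.values())
        n_silver += sum(silver_counts.values())
    precision = matched / n_pred if n_pred else 0.0
    recall = matched / n_silver if n_silver else 0.0
    return precision, recall

# Toy data (hypothetical): one of two predictions matches one of four silver labels
predicted = {"paperA": [["Phoenix", "LOCATION"], ["E", "PERSON"]]}
silver = {"paperA": [["Phoenix", "LOCATION"], ["Arizona", "LOCATION"],
                     ["Miller", "PERSON"], ["Tucson", "LOCATION"]]}
print(silver_precision_recall(predicted, silver))
```

Because the reference is a silver standard, a low score here can also mean that Stanford NER mislabeled something, not only that my model missed it.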

### Evaluation 2: consistency of labels

This shows whether my model labels a word the same way regardless of how many times the word appears. This evaluation is built on the premise that Named Entities have specific meanings that do not change much regardless of usage.

In [9]:
# Looking at consistency in labeling for Stanford NER and for my model:

# This function turns a labeled JSON into a dictionary structured: {"NERLabel": ["NE1", "NE2"]...}
def turnIntoLabelsContentDict(jsonFile):
    with open(jsonFile, "r") as NERFile:
        nerContent = json.load(NERFile)
    
    labelDict = {}
    for newspaperTitle, newspaperContent in nerContent.items():
        for text, label in newspaperContent:
            if label in ["ORGANIZATION", "LOCATION", "PERSON"]:
                if label not in labelDict:
                    labelDict[label] = []
                labelDict[label].append(text)
    return labelDict

stanfordLabelDict = turnIntoLabelsContentDict(stanfordNERJSON)
projectLabelDict = turnIntoLabelsContentDict(projectNERJSON)

In [10]:
# Consistency metrics for Stanford NER:
print(f"Labels used: {stanfordLabelDict.keys()}")
print(f"Total number of words in dataset: {totalWordCount}")
allThreeOverlapStanford = set(stanfordLabelDict["LOCATION"]) & set(stanfordLabelDict["ORGANIZATION"]) & set(stanfordLabelDict["PERSON"])

print(f"Stanford all three overlap: {len(allThreeOverlapStanford)}")
print(f"Stanford Location and organization overlap: {len(set(stanfordLabelDict['LOCATION']) & set(stanfordLabelDict['ORGANIZATION']))}")
print(f"Stanford Location and Person overlap: {len(set(stanfordLabelDict['LOCATION']) & set(stanfordLabelDict['PERSON']))}")
print(f"Stanford person and organization overlap: {len(set(stanfordLabelDict['PERSON']) & set(stanfordLabelDict['ORGANIZATION']))}")

Labels used: dict_keys(['LOCATION', 'ORGANIZATION', 'PERSON'])
Total number of words in dataset: 1546660
Stanford all three overlap: 22
Stanford Location and organization overlap: 129
Stanford Location and Person overlap: 126
Stanford person and organization overlap: 149


In [12]:
# Consistency metrics for my model:
print(f"Labels used: {projectLabelDict.keys()}")
print(f"Total number of words in dataset: {totalWordCount}")
allThreeOverlapMyProject = set(projectLabelDict["LOCATION"]) & set(projectLabelDict["ORGANIZATION"]) & set(projectLabelDict["PERSON"])

print(f"My project's all three overlap: {len(allThreeOverlapMyProject)}")
print(f"My project's Location and organization overlap: {len(set(projectLabelDict['LOCATION']) & set(projectLabelDict['ORGANIZATION']))}")
print(f"My project's Location and Person overlap: {len(set(projectLabelDict['LOCATION']) & set(projectLabelDict['PERSON']))}")
print(f"My project's person and organization overlap: {len(set(projectLabelDict['PERSON']) & set(projectLabelDict['ORGANIZATION']))}")

Labels used: dict_keys(['LOCATION', 'PERSON', 'ORGANIZATION'])
Total number of words in dataset: 1546660
My project's all three overlap: 34
My project's Location and organization overlap: 82
My project's Location and Person overlap: 138
My project's person and organization overlap: 42


Stanford NER has 22 words that appear under all three NER labels, while my model has 34. This is a rather small difference compared to the large data size of 1546660 words.

If a word appears in more than one NER label set, we can generally say that there is an inconsistency in the NER labeling, because a Named Entity -- a word or phrase that has a specific meaning -- is not normally used differently in different contexts. Although this is not absolute (there will be cases where a Named Entity is used in different ways), it is still a reasonable way of measuring consistency.

When it comes to overlaps between Location and Person specifically, my model is slightly less consistent than Stanford NER: my model placed 138 words under both labels while Stanford NER had only 126.

However, for the Location-Organization and Person-Organization overlaps, my model had fewer overlapping words in both cases (82 and 42 to Stanford's 129 and 149, respectively).

This means that, in terms of raw overlap counts, my model is comparable with Stanford NER in consistency of labeling.
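
The overlap counts above are raw counts, and my model produced far fewer labels overall than Stanford NER (8347 versus 51951), so the same number of overlaps may represent a different share of each model's output. One way to account for that is to divide by the number of distinct labeled words. The sketch below is only an illustration under that assumption; `multi_label_rate` and the toy dictionary are hypothetical, and the input has the same `{"NERLabel": ["NE1", "NE2"]...}` shape that `turnIntoLabelsContentDict` returns.

```python
def multi_label_rate(label_dict):
    """Share of distinct entity texts that appear under two or more labels.

    label_dict: {label: [entity_text, ...]}
    """
    labels_per_text = {}
    for label, texts in label_dict.items():
        for text in set(texts):
            labels_per_text.setdefault(text, set()).add(label)
    if not labels_per_text:
        return 0.0
    ambiguous = sum(1 for labels in labels_per_text.values() if len(labels) > 1)
    return ambiguous / len(labels_per_text)

# Toy data (hypothetical): only "Robinson" appears under more than one label
toy_labels = {
    "LOCATION": ["Arizona", "Robinson", "Arizona"],
    "PERSON": ["Robinson", "Miller"],
    "ORGANIZATION": ["Standard Oil Company"],
}
print(multi_label_rate(toy_labels))
```

A lower rate would indicate more consistent labeling, independent of how many labels each model produced.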

### Evaluation 3: Qualitative analysis
I will now read a sample of the labels to judge whether my model's NER labels are accurate. To do so, I randomly select 100 labeled newspapers and print 3 random labels from each.

In [28]:
# Grab 3 random labels from each of 100 randomly chosen newspapers:
import random

def loadJSONContent(jsonFilePath):
    with open(jsonFilePath, "r") as dataFile:
        return json.load(dataFile)
    
projectDataDict = loadJSONContent(projectNERJSON)

randomNewspaperNERLabels = random.sample(sorted(projectDataDict.items()), 100)


for newsTitle, labelList in randomNewspaperNERLabels:
    if len(labelList) > 3:
        threeRandomSelection = random.sample(labelList, 3)
        print(threeRandomSelection)
    else:
        print(labelList)


[['Montezuma Avenue', 'LOCATION'], ['11th St', 'LOCATION'], ['Ph', 'ORGANIZATION']]
[['Miller', 'PERSON'], ['Pacific', 'LOCATION'], ['White mountains', 'LOCATION']]
[['Mi', 'LOCATION'], ['rizona', 'LOCATION'], ['X', 'LOCATION']]
[['H', 'PERSON'], ['Robinson', 'LOCATION'], ['Lawrence S. Hough', 'PERSON']]
[['I', 'LOCATION'], ['I', 'PERSON'], ['E', 'LOCATION']]
[['Standard Oil Company', 'ORGANIZATION'], ['Arizona', 'LOCATION'], ['Ariz', 'LOCATION']]
[['Fleming', 'LOCATION'], ['t River Valley', 'LOCATION'], ['D', 'LOCATION']]
[['.', 'PERSON'], ['G', 'LOCATION'], ['0', 'PERSON']]
[['I', 'LOCATION'], ['W', 'LOCATION'], ['East Adams', 'LOCATION']]
[['A', 'ORGANIZATION'], ['Phoenix', 'LOCATION'], ['rizona Daily Citizen', 'ORGANIZATION']]
[['Arizona', 'LOCATION'], ['Moore', 'LOCATION'], ['Davidson', 'LOCATION']]
[['Arizona', 'LOCATION'], ['Gil', 'LOCATION'], ['Arizona', 'LOCATION']]
[['IZONA R', 'LOCATION'], ['AR', 'LOCATION'], ['Arizona', 'LOCATION']]
[['Calif', 'LOCATION'], ['Black', 'LOCATI

Looking at the printout, I can see many accurate labels, such as `['Eddie Rickenbacker', 'PERSON']` and `['Phoenix', 'LOCATION']`. However, every so often there are inaccurate labels where a single letter is assigned a label, such as `['E', 'PERSON']` and `['C', 'PERSON']`.

The model is also able to accurately identify some rather advanced abbreviations, labeling N.M. (as in New Mexico) as `['N. M.', 'LOCATION']` and N.Y. as `['N. Y', 'LOCATION']`. The model can also identify truncated words, such as "Calif" for California, a location: `['Calif', 'LOCATION']`.

This suggests that the model, although not perfect, performs at an acceptable accuracy and has a high `precision` score. It also suggests that, although the model was trained on more contemporary language, it can still identify Named Entity usage in newspaper language from over 100 years ago.
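
The single-letter labels noted above could be quantified as well, to put a number on the qualitative impression. The following is a rough sketch, assuming that a label whose text is one character long or contains no letters at all is suspicious; that threshold, the function name `short_label_share`, and the toy data are all hypothetical.

```python
def short_label_share(ner_content, max_chars=1):
    """Share of labels whose text is very short or contains no letters.

    ner_content: {title: [[text, label], ...]}
    """
    total = suspicious = 0
    for label_list in ner_content.values():
        for text, _label in label_list:
            total += 1
            stripped = text.strip()
            if len(stripped) <= max_chars or not any(ch.isalpha() for ch in stripped):
                suspicious += 1
    return suspicious / total if total else 0.0

# Toy data (hypothetical): "E" and "." are flagged, the other two are not
toy_sample = {"paperA": [["Phoenix", "LOCATION"], ["E", "PERSON"],
                         [".", "PERSON"], ["N. M.", "LOCATION"]]}
print(short_label_share(toy_sample))
```

Abbreviations such as `N. M.` are not flagged by this heuristic, since they are longer than one character and contain letters.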