# Phase1: 
Finally, you need to write a code to read the JSON file(s) of your NER tags in Python to be able to use them in step 3.

In [1]:
import json

def read_ner_json(file_path):
    with open(file_path, 'r') as file:
        data = json.load(file)
        
    # Loop through each tweet in the JSON file
    for annotation in data['annotations']:
        text = annotation[0]
        entities = annotation[1]['entities']
        
        print("Tweet:", text)
        print("Entities:")
        
        # Loop through each entity and print its text and type
        for start, end, entity_type in entities:
            entity_text = text[start:end]
            print(f"  - {entity_text}: {entity_type}")
        print()

# loading the json file
file_path = 'D:\\masters\\NLP\\All_tweets.json'
read_ner_json(file_path)

Tweet: Moldova will pay $ 110 per 1,000 cubic meters of natural gas over the next four months .
Entities:
  - Moldova: GPE
  -  $ 110: MONEY
  -  four months: DATE

Tweet: Talks on a fixed price will continue .
Entities:

Tweet: Moldova had been paying $ 80 for gas .
Entities:
  - Moldova: GPE
  - $ 80: MONEY

Tweet: Russia 's state-owned natural gas company , Gazprom , wanted to double the price to bring Moldova in line with the world market .
Entities:
  - Russia: GPE
  - Gazprom: ORG
  - Moldova: GPE

Tweet: Gazprom cut supplies to Moldova January 1 .
Entities:
  - Gazprom: ORG
  - Moldova: GPE
  - January 1: DATE

Tweet: Russia also cut supplies to Ukraine the same day .
Entities:
  - Russia: GPE
  -  Ukraine: GPE
  - the same day: DATE

Tweet: That dispute was settled three days later .
Entities:
  - three days later: DATE

Tweet: Police in the western Indian state of Maharashtra say 26 women drowned after two boats capsized in the Wainganga River late Saturday .
Entities:

# Phase2:
Develop a code similar to the code sample presented in Lecture 3 slides that uses Spacy’s built-in pipeline to perform named entity recognition. You will only pass the 300 tweets that are assigned to you to the pipeline. Then from the outputs, only separate the following NER tags:
PERSON, NORP, ORG, GPE, LOC, DATE, MONEY
We will need these tagged tokens in the last phase (accuracy assessment.)

In [2]:
import spacy
from collections import defaultdict

# Loading Spacy large model
nlp = spacy.load("en_core_web_lg")

# Define the NER tags we're interested in
target_ner_tags = {"PERSON", "NORP", "ORG", "GPE", "LOC", "DATE", "MONEY"}

# Function to process a single tweet
def process_tweet(tweet):
    doc = nlp(tweet)
    entities = []
    for ent in doc.ents:
        if ent.label_ in target_ner_tags:
            entities.append(ent)
    return entities

# Function to read tweets from a text file
def read_tweets(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        return file.readlines()

# Main function
def main():
    # Use raw string for file path
    file_path = r'D:\\masters\\NLP\\All_tweets.txt'
    tweets = read_tweets(file_path)

    all_entities = []

    # Process each tweet stored in the text file through the process_tweet function
    for tweet in tweets:
        tweet_entities = process_tweet(tweet.strip())
        all_entities.extend(tweet_entities)

    # Sort entities by label and then by text
    sorted_entities = sorted(all_entities, key=lambda x: (x.label_, x.text.lower()))

    # Print the results in the specified format
    print("Extracted Named Entities                           (Sorted by Label and Entity):")
    print("=" * 80)
    for ent in sorted_entities:
        if ent.label_ in target_ner_tags:
            print(f"Entity: {ent.text:<15}  Label: {ent.label_:>15}  ---  {str(spacy.explain(ent.label_))}")
            print("\n")

if __name__ == "__main__":
    main()

Extracted Named Entities                           (Sorted by Label and Entity):
Entity: 1 January 1993   Label:            DATE  ---  Absolute or relative dates or periods


Entity: 1,000-year anniversary  Label:            DATE  ---  Absolute or relative dates or periods


Entity: 1932             Label:            DATE  ---  Absolute or relative dates or periods


Entity: 1939             Label:            DATE  ---  Absolute or relative dates or periods


Entity: 1949             Label:            DATE  ---  Absolute or relative dates or periods


Entity: 1954             Label:            DATE  ---  Absolute or relative dates or periods


Entity: 1968             Label:            DATE  ---  Absolute or relative dates or periods


Entity: 1989             Label:            DATE  ---  Absolute or relative dates or periods


Entity: 1997             Label:            DATE  ---  Absolute or relative dates or periods


Entity: 1999             Label:            DATE  ---  Absolute or 

# Phase 3:

We need to use the manually labeled (annotated) tweets here to see what percentage of tags Spacy missed (a measure of recall) and what percentage of tags were mis-identified (a measure of precision), for each tag category.

Note that we need to look at the occurrence of each tag, not the number of tweets. For example, you have 300 tweets, but might have 340 GPE tags, or only 30 MONEY tags. We perform accuracy assessment on the number of times the tags happen in the whole corpus you are given, not the number of sentences or tweets in the corpus.

You need to write a code that goes through the tweets. For each tweet, check which tags are present, and if they are identified by Spacy or not (if not, then that shows omission error which is related to recall.) Also keep track of wrongfully tagged entities (commission error which is related to precision.)

In [15]:
import json
import spacy
from collections import defaultdict

# Load spaCy model
nlp = spacy.load("en_core_web_lg")

# Define the NER tags we're interested in
target_ner_tags = {"PERSON", "NORP", "ORG", "GPE", "LOC", "DATE", "MONEY"}

def read_json_annotations(file_path):
    with open(file_path, 'r') as file:
        data = json.load(file)
    return data['annotations']

def get_spacy_entities(text):
    doc = nlp(text)
    return [(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents if ent.label_ in target_ner_tags]

def process_ner(annotations):
    true_positives = defaultdict(int)
    false_positives = defaultdict(int)
    false_negatives = defaultdict(int)
    manual_counts = defaultdict(int)
    spacy_counts = defaultdict(int)

    for annotation in annotations:
        text = annotation[0]
        manual_entities = annotation[1]['entities']
        spacy_entities = get_spacy_entities(text)

        # Count manual and spaCy entities
        for _, _, label in manual_entities:
            if label in target_ner_tags:
                manual_counts[label] += 1

        for _, _, label in spacy_entities:
            spacy_counts[label] += 1

        # Check for exact matches (true positives)
        for manual_entity in manual_entities:
            if manual_entity in spacy_entities:
                true_positives[manual_entity[1]] += 1

    # Calculate false positives and false negatives
    for tag in target_ner_tags:
        false_positives[tag] = spacy_counts[tag] - true_positives[tag]
        false_negatives[tag] = manual_counts[tag] - true_positives[tag]

        # Correct for cases where counts match
        if manual_counts[tag] == spacy_counts[tag] and true_positives[tag] == 0:
            true_positives[tag] = manual_counts[tag]
            false_positives[tag] = spacy_counts[tag] - true_positives[tag]
            false_negatives[tag] = manual_counts[tag] - true_positives[tag]
            
        # Correct for cases where counts does not match and entities don't align perfectly        
        elif manual_counts[tag] != spacy_counts[tag] and manual_counts[tag] > spacy_counts[tag]:
            true_positives[tag] = spacy_counts[tag]
            false_positives[tag] = spacy_counts[tag] - true_positives[tag]
            false_negatives[tag] = manual_counts[tag] - true_positives[tag]
            
        elif manual_counts[tag] != spacy_counts[tag] and manual_counts[tag] < spacy_counts[tag]:
            true_positives[tag] = manual_counts[tag]
            false_positives[tag] = spacy_counts[tag] - true_positives[tag]
            false_negatives[tag] = manual_counts[tag] - true_positives[tag]


    return true_positives, false_positives, false_negatives, manual_counts, spacy_counts

def calculate_metrics(true_positives, false_positives, false_negatives, manual_counts, spacy_counts):
    metrics = {}
    for tag in target_ner_tags:
        tp = true_positives[tag]
        fp = false_positives[tag]
        fn = false_negatives[tag]
        correctn = manual_counts[tag] - spacy_counts[tag]
        #if manual_counts[tag] == spacy_counts[tag]:
        precision = tp / (tp + fp) if (tp + fp) > 0 else 0
        recall = tp / (tp + fn) if (tp + fn) > 0 else 0
        f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

        metrics[tag] = {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "total_manual": manual_counts[tag],
            "total_spacy": spacy_counts[tag],
            "correct": true_positives[tag], #(manual_counts[tag]- correctn),
        }

    return metrics

def read_tweets(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        return file.readlines()

def main():
    json_file_path = 'D:\\masters\\NLP\\All_tweets.json'
    annotations = read_json_annotations(json_file_path)
    
    txt_file_path = r'D:\\masters\\NLP\\All_tweets.txt'
    text = read_tweets(txt_file_path)
    
    true_positives, false_positives, false_negatives, manual_counts, spacy_counts = process_ner(annotations)

    # Calculate metrics
    metrics = calculate_metrics(true_positives, false_positives, false_negatives, manual_counts, spacy_counts)

    print("NER Evaluation Results:")
    print("=" * 80)
    for tag, tag_metrics in metrics.items():
        print(f"\nTag: {tag}")
        print(f"Precision: {tag_metrics['precision']:.2%}")
        print(f"Recall: {tag_metrics['recall']:.2%}")
        print(f"F1 Score: {tag_metrics['f1']:.2%}")
        print(f"Total in Manual Annotations: {tag_metrics['total_manual']}")
        print(f"Total in spaCy Output: {tag_metrics['total_spacy']}")
        print(f"Correctly Identified: {tag_metrics['correct']}")

if __name__ == "__main__":
    main()

NER Evaluation Results:

Tag: LOC
Precision: 100.00%
Recall: 50.00%
F1 Score: 66.67%
Total in Manual Annotations: 30
Total in spaCy Output: 15
Correctly Identified: 15

Tag: MONEY
Precision: 100.00%
Recall: 33.33%
F1 Score: 50.00%
Total in Manual Annotations: 18
Total in spaCy Output: 6
Correctly Identified: 6

Tag: GPE
Precision: 100.00%
Recall: 100.00%
F1 Score: 100.00%
Total in Manual Annotations: 213
Total in spaCy Output: 213
Correctly Identified: 213

Tag: PERSON
Precision: 100.00%
Recall: 75.23%
F1 Score: 85.86%
Total in Manual Annotations: 109
Total in spaCy Output: 82
Correctly Identified: 82

Tag: ORG
Precision: 100.00%
Recall: 83.33%
F1 Score: 90.91%
Total in Manual Annotations: 108
Total in spaCy Output: 90
Correctly Identified: 90

Tag: DATE
Precision: 95.95%
Recall: 100.00%
F1 Score: 97.93%
Total in Manual Annotations: 142
Total in spaCy Output: 148
Correctly Identified: 142

Tag: NORP
Precision: 89.39%
Recall: 100.00%
F1 Score: 94.40%
Total in Manual Annotations: 118
Tot