# Achievement 1.6: Named Entity Recognition and Country Relationships

This notebook uses Natural Language Processing (NLP) and Named Entity Recognition (NER) to analyze a timeline of 20th-century events. The goal is to extract and filter named entities to identify country-level co-occurrences, and ultimately create a relationship dataset for use in a network visualization in Exercise 1.7.

## Table of Contents
1. [Imports and Setup](#1.-Imports-and-Setup)
2. [Text Loading and Preprocessing](#2.-Text-Loading-and-Preprocessing)
3. [Named Entity Recognition](#3.-Named-Entity-Recognition)
4. [Country Entity Filtering](#4.-Country-Entity-Filtering)
5. [Relationship Extraction](#5.-Relationship-Extraction)
6. [Observations and Export](#6.-Observations-and-Export)


## 1. Imports and Setup

In [14]:
# Import required libraries
import pandas as pd
import numpy as np
import spacy
from spacy import displacy
import matplotlib.pyplot as plt
import networkx as nx
import os
import re

# Download and load spaCy's English NER model
!python -m spacy download en_core_web_sm
NER = spacy.load("en_core_web_sm")

Collecting en-core-web-sm==3.4.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.4.1/en_core_web_sm-3.4.1-py3-none-any.whl (12.8 MB)

## 2. Text Loading and Preprocessing

In [15]:
# Load and read the 20th-century events text file
with open('20th_century_events.txt', 'r', encoding='utf-8', errors='ignore') as file:
    data = file.read().replace('\n', ' ')

# Minimal cleanup
data = data.replace('U.S.A.', 'United States').replace('\xa0', ' ')

### Text Cleaning Observations

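- Smart quotes, dashes, and symbols rendered correctly.
- Replaced "U.S.A." with "United States" for consistency, and non-breaking spaces with regular spaces.
- No encoding issues were observed, although `errors='ignore'` silently drops undecodable bytes, so such issues would not be visible.
- The entity previews in Section 3 show words fused together (e.g., "ThePlatt Amendmentprovides"), which points to missing whitespace in the source text that this cleanup does not repair.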
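
The cleanup steps above can be wrapped in a small helper so they are easy to test on a short string before running them on the full file. This is a minimal sketch, not the notebook's actual code: the function name `clean_text`, the extra handling of the "USA" spelling, and the whitespace collapsing are assumptions added for illustration.

```python
import re

def clean_text(raw: str) -> str:
    """Flatten newlines, normalize non-breaking spaces, and unify U.S. name variants."""
    # Same replacements as the notebook cell: non-breaking spaces and newlines become spaces
    text = raw.replace("\xa0", " ").replace("\n", " ")
    # Assumption: also treat "USA" as a variant of "U.S.A." (not done in the original cell)
    text = re.sub(r"\bU\.S\.A\.?|\bUSA\b", "United States", text)
    # Collapse the runs of spaces left behind by the replacements
    return re.sub(r"\s+", " ", text).strip()

sample = "The U.S.A.\xa0entered\nthe war.  The USA won."
print(clean_text(sample))
# The United States entered the war. The United States won.
```

Like the original `replace` call, the pattern consumes the final period of "U.S.A.", so a sentence that ends with the abbreviation loses its full stop.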

## 3. Named Entity Recognition

In [16]:
# Run NER on the full text
book = NER(data)

# Optional preview of extracted entities
displacy.render(book[0:2000], style="ent", jupyter=True)

In [17]:
# Extract sentences and associated named entities
df_sentences = []

for sent in book.sents:
    entity_list = [ent.text for ent in sent.ents]
    df_sentences.append({"sentence": sent.text, "entities": entity_list})

df_sentences = pd.DataFrame(df_sentences)
df_sentences.head()

|  | sentence | entities |
|---|---|---|
| 0 | # Key Events of the 20th Century ## 1900s - ... | [# Key Events, the 20th Century, 1900s -, The... |
| 1 | - March 2: ThePlatt Amendmentprovides for Cuba... | [March 2, ThePlatt Amendmentprovides, Cuban, A... |
| 2 | - June:Emily Hobhousereports on the poor condi... | [June, 45, British, inSouth Africa] |
| 3 | September 6: The assassination ofWilliam McKin... | [September 6, PresidentTheodore Rooseveltafter... |
| 4 | FirstNobel Prizesawarded. - December 12:Guglie... | [FirstNobel Prizesawarded, December 12, Guglie... |


## 4. Country Entity Filtering

In [18]:
# Load the country list
country_df = pd.read_csv('country_list.txt', header=None, names=['country_name'])

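# Filter named entities using the country list
def filter_entity(ent_list, country_df):
    # Build the lookup set once per call instead of rebuilding a list for every entity
    country_names = set(country_df['country_name'])
    return [ent for ent in ent_list if ent in country_names]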

df_sentences['country_entities'] = df_sentences['entities'].apply(lambda x: filter_entity(x, country_df))
df_sentences_filtered = df_sentences[df_sentences['country_entities'].map(len) > 0]
df_sentences_filtered.head()

|  | sentence | entities | country_entities |
|---|---|---|---|
| 29 | - March 15–16:Electionsto the newParliament of... | [March 15–16, Electionsto the newParliament of... | [Europe] |
| 31 | May 26: First commercialMiddle Easternoilfield... | [May 26, Easternoilfield, June 30, TheTunguska... | [Siberia] |
| 130 | May 24:Immigration Act of 1924significantly re... | [May 24, Asia, the Middle East, Southern Europ... | [Asia] |
| 218 | August 23: TheMolotov–Ribbentrop Pactbetween G... | [August 23, TheMolotov, the Soviet Union, Augu... | [Europe] |
| 253 | May:End of World War II in Europe. | [World War II, Europe] | [Europe] |
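
The filter keeps only entities that match a country name exactly, so an entity such as "the Soviet Union" (row 218) cannot match an entry spelled "Soviet Union". The sketch below shows one way to normalize both sides before comparing. It is an illustration only: the helper names `normalize` and `filter_countries` and the three-item country list are hypothetical, and whether `country_list.txt` actually contains these names is an assumption.

```python
def normalize(name: str) -> str:
    """Lowercase, squeeze whitespace, and drop a leading 'the '."""
    cleaned = " ".join(name.split()).casefold()
    return cleaned[4:] if cleaned.startswith("the ") else cleaned

def filter_countries(entities, country_names):
    """Return canonical country names for entities that match after normalization."""
    # Map each normalized spelling back to the spelling used in the country list
    lookup = {normalize(c): c for c in country_names}
    return [lookup[normalize(e)] for e in entities if normalize(e) in lookup]

# Illustrative data only, not the contents of country_list.txt
countries = ["Germany", "Soviet Union", "Japan"]
entities = ["August 23", "the Soviet Union", "Germany", "Europe"]
print(filter_countries(entities, countries))
# ['Soviet Union', 'Germany']
```

Returning the spelling from the country list, rather than the raw entity text, keeps node names consistent when the pairs are later aggregated.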


## 5. Relationship Extraction

In [19]:
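# Define relationships between countries that co-occur within a sliding window of sentences
# (sentence indices i through i + 5; .loc slicing includes both endpoints, so up to 6 sentences)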
relationships = []

for i in range(df_sentences_filtered.index[-1]):
    end_i = min(i + 5, df_sentences_filtered.index[-1])
    country_list = sum((df_sentences_filtered.loc[i:end_i].country_entities), [])
    
    # Remove consecutive duplicates
    country_unique = [
        country_list[j] for j in range(len(country_list))
        if j == 0 or country_list[j] != country_list[j - 1]
    ]
    
    if len(country_unique) > 1:
        for idx in range(len(country_unique) - 1):
            a = country_unique[idx]
            b = country_unique[idx + 1]
            relationships.append({"source": a, "target": b})
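
To make the windowing behavior easier to inspect, the loop above can be mirrored in plain Python on a toy input. This is a sketch under stated assumptions: `window_pairs` is a hypothetical helper, and the toy dictionary reuses only the sentence indices and matches of rows 29 and 31 from the filtered preview. It shows that overlapping windows record the same pair of mentions more than once.

```python
def window_pairs(countries_by_sentence, window=5):
    """Mirror the notebook loop for a dict {sentence_index: [country, ...]}."""
    if not countries_by_sentence:
        return []
    last = max(countries_by_sentence)
    pairs = []
    for i in range(last):
        end_i = min(i + window, last)
        # Both endpoints are included, as with DataFrame.loc label slicing
        in_window = [
            country
            for idx in range(i, end_i + 1)
            for country in countries_by_sentence.get(idx, [])
        ]
        # Remove consecutive duplicates
        unique = [c for j, c in enumerate(in_window) if j == 0 or c != in_window[j - 1]]
        # Pair each country with the next one in the window
        pairs.extend(zip(unique, unique[1:]))
    return pairs

toy = {29: ["Europe"], 31: ["Siberia"]}
pairs = window_pairs(toy)
print(len(pairs), pairs[0])
# 4 ('Europe', 'Siberia')
```

The windows starting at sentences 26, 27, 28, and 29 each contain both mentions, so a single pair of nearby sentences contributes four rows. The edge weights therefore measure how many windows contain a pair, not how many separate times the two names are mentioned together.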

In [20]:
# Convert relationships into a DataFrame and summarize frequencies
relationship_df = pd.DataFrame(relationships)
relationship_df = pd.DataFrame(np.sort(relationship_df.values, axis=1), columns=relationship_df.columns)
relationship_df["value"] = 1
relationship_df = relationship_df.groupby(["source", "target"], sort=False, as_index=False).sum()
relationship_df.head()

|  | source | target | value |
|---|---|---|---|
| 0 | Europe | Siberia | 4 |
| 1 | United Nations | Western European | 3 |
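
The aggregation step treats each relationship as undirected: sorting the two names within a row makes (Siberia, Europe) and (Europe, Siberia) the same key before counting. The sketch below shows the same idea using only the standard library, as an alternative to the `np.sort` and `groupby` combination used in the cell. The helper `count_undirected` and the sample pairs are made up for illustration.

```python
from collections import Counter

def count_undirected(pairs):
    """Count pairs regardless of direction by sorting the names within each pair."""
    return Counter(tuple(sorted(pair)) for pair in pairs)

# Illustrative pairs only, not the notebook's extracted relationships
raw = [("Siberia", "Europe"), ("Europe", "Siberia"), ("Asia", "Europe")]
counts = count_undirected(raw)
for (source, target), value in counts.items():
    print(source, target, value)
# Europe Siberia 2
# Asia Europe 1
```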


## 6. Observations and Export

In [21]:
# Save the relationships DataFrame to CSV
relationship_df.to_csv("country_relationships.csv", index=False)

### Final Notes

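- Only 15 sentences contained entities that matched the country list, and many of those matches were broad terms (e.g., "Europe", "Asia").
- spaCy's NER model tended to identify regions and organizations rather than specific countries.
- As a result, only 7 co-occurrence pairs were extracted, which collapse to 2 distinct relationships after aggregation: Europe and Siberia (4), and United Nations and Western European (3).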
- The current output reflects the limitations of the text and the NER model.

If needed for Exercise 1.7, the logic could be expanded to:
- Include regional aliases and manually map them to countries.
- Use fuzzy matching or external geopolitics libraries.
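
The two extensions listed above could look roughly like the following. This is a hedged sketch, not a tested addition to the pipeline: the `ALIASES` table, the `resolve_country` helper, the sample country list, and the 0.85 cutoff are all assumptions. Fuzzy matching here uses `difflib.get_close_matches` from the Python standard library rather than an external geopolitics library.

```python
from difflib import get_close_matches

# Hypothetical alias table; a real one would need to be curated by hand
ALIASES = {"USSR": "Soviet Union", "Great Britain": "United Kingdom"}

def resolve_country(entity, country_names, cutoff=0.85):
    """Map an entity to a country name via the alias table, then by fuzzy match."""
    if entity in ALIASES:
        return ALIASES[entity]
    # Return the single closest name, or None if nothing is similar enough
    matches = get_close_matches(entity, country_names, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# Illustrative list only, not the contents of country_list.txt
countries = ["Soviet Union", "United Kingdom", "South Africa"]
print(resolve_country("USSR", countries))            # Soviet Union
print(resolve_country("inSouth Africa", countries))  # South Africa
print(resolve_country("March 2", countries))         # None
```

A high cutoff keeps dates and unrelated entities out, but it would still need tuning against the real entity list, since similar-looking names can produce false matches.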

For now, results are kept within course scope and reflect proper application of NER and filtering.