## Step 1: Import Required Libraries

In this step, I imported the Python libraries needed for text mining and Named Entity Recognition (NER):

- `pandas` and `numpy` – for data manipulation and array handling  
- `spacy` – for tokenization, part-of-speech tagging, and named entity recognition  
- `networkx` – for potential network graph analysis  
- `matplotlib.pyplot` – for future visualizations  
- `scipy` – for scientific computations  
- `os` and `re` – for file handling and regular expressions


In [2]:
# Importing all required libraries
import pandas as pd 
import numpy as np
import spacy
from spacy import displacy
import networkx as nx
import os
import matplotlib.pyplot as plt
import scipy
import re

# Load spaCy English model
NER = spacy.load("en_core_web_sm")


## Step 2: Load the Twentieth-Century Text File

In this step, I loaded the cleaned 20th-century historical text file that was prepared in Exercise 1.4. This text contains major events throughout the 20th century and will be used for Named Entity Recognition (NER) and further analysis.

The file is read using Python’s built-in `open()` function, and the contents are stored in a string variable for processing.


In [5]:
# Load the 20th century text file
with open('/Users/muhammaddildar/Desktop/20th_century_scraping/20th_century_scrape.txt', 'r', encoding='utf-8') as file:
    text = file.read()

# Quick preview
print(text[:1000])


The 20th century changed the world in unprecedented ways.
The World Wars sparked tension between countries and led to the creation of atomic bombs, the Cold War led to the Space Race and the creation of space-based rockets, and the World Wide Web was created.
These advancements have played a significant role in citizens' lives and shaped the 21st century into what it is today.
The new beginning of the 20th century marked significant changes.
The 1900s saw the decade herald a series of inventions, including the automobile, airplane and radio broadcasting.
1914 saw the completion of the Panama Canal.
The Scramble for Africa continued in the 1900s and resulted in wars and genocide across the continent.
The atrocities in the Congo Free State shocked the civilized world.
From 1914 to 1918, the First World War, and its aftermath, caused major changes in the power balance of the world, destroying or transforming some of the most powerful empires.
The First World War (or simply WWI), termed "T

## Step 3: Evaluate and Wrangle the Text

In this step, I evaluated the raw 20th-century text to determine whether any cleaning or wrangling was needed before applying Named Entity Recognition (NER). Specifically, I looked for:

- Special or non-ASCII characters  
- Inconsistent spacing, dashes, or symbols  
- Formatting issues that could affect entity extraction  
- Mismatches between country names in the text and those in the provided `countries_list.txt`

### Observations:
- The text had already been cleaned in Exercise 1.4 and did not contain any unusual symbols or unreadable characters.
- Country names in the text appeared in full form (e.g., “United States” rather than “USA”), which matched the format in `countries_list.txt`.
- No further cleaning was necessary, so I proceeded directly to creating the NER object using spaCy.

Since no additional wrangling was needed, the original file was used as-is without generating a new `.txt` file.
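
The checks listed in this step were done by reading the text, but they could also be scripted. The following is a minimal sketch, not the method used in this notebook: the helper name `audit_text` and the sample string are assumptions made for illustration, and the three indicators are only one reasonable choice.

```python
import re
from collections import Counter


def audit_text(text: str) -> dict:
    """Collect simple cleanliness indicators for a raw text string."""
    non_ascii = Counter(ch for ch in text if ord(ch) > 127)
    return {
        # Characters outside the ASCII range, with their counts
        "non_ascii": dict(non_ascii),
        # Runs of two or more spaces or tabs
        "repeated_spaces": len(re.findall(r"[ \t]{2,}", text)),
        # En dashes and em dashes
        "unicode_dashes": len(re.findall(r"[\u2013\u2014]", text)),
    }


# Hypothetical sample; in the notebook this would be the `text` variable
sample = "The First  World War \u2013 1914 to 1918 \u2013 changed Europe."
report = audit_text(sample)
print(report)
```

An empty `non_ascii` dictionary and zero counts would support the conclusion that no further cleaning is needed.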


## Step 4: Run Named Entity Recognition (NER) on the Text

With the text loaded and checked, I applied spaCy’s `en_core_web_sm` model to perform Named Entity Recognition (NER).

This identifies important entities such as:
- Countries and cities (GPE)
- People, organizations, and events
- Dates and numeric values

The `doc` object stores the tokens, sentence boundaries, and extracted entities for the whole text.


In [6]:
# Load spaCy model
import spacy

nlp = spacy.load("en_core_web_sm")

# Apply the NLP pipeline to the text loaded in Step 2
doc = nlp(text)

# Preview a few extracted entities
for ent in doc.ents[:10]:
    print(ent.text, "|", ent.label_)

The 20th century | DATE
The World Wars | WORK_OF_ART
the Cold War | EVENT
the Space Race | ORG
the World Wide Web | EVENT
the 21st century | DATE
today | DATE
the 20th century | DATE
The 1900s | DATE
the decade | DATE
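
Some labels in this preview look questionable (for example, "The World Wars" tagged as `WORK_OF_ART`), so a quick tally of labels can help judge how much to trust the small model. The sketch below counts the ten previewed entities; the helper name `count_labels` is an assumption, and on the real data the same function could be fed `(ent.text, ent.label_)` pairs from `doc.ents`.

```python
from collections import Counter


def count_labels(entities):
    """Count how often each entity label occurs in (text, label) pairs."""
    return Counter(label for _, label in entities)


# The ten entities printed in the preview above
preview = [
    ("The 20th century", "DATE"),
    ("The World Wars", "WORK_OF_ART"),
    ("the Cold War", "EVENT"),
    ("the Space Race", "ORG"),
    ("the World Wide Web", "EVENT"),
    ("the 21st century", "DATE"),
    ("today", "DATE"),
    ("the 20th century", "DATE"),
    ("The 1900s", "DATE"),
    ("the decade", "DATE"),
]

label_counts = count_labels(preview)
print(label_counts.most_common())
```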


## Step 5: Extract Sentences and GPE Entities

In this step, I created a list of sentence-entity pairs using spaCy's `doc.sents`. I then filtered those sentences to include only the ones that contain GPE (geopolitical entity) labels, i.e., countries, cities, or states.


In [8]:
# Extract sentences with GPE entities
sentences = list(doc.sents)
sentence_entities = []

for sent in sentences:
    gpe_entities = [ent.text for ent in sent.ents if ent.label_ == "GPE"]
    if gpe_entities:
        sentence_entities.append((sent.text.strip(), gpe_entities))

# Preview the first 5 results
for item in sentence_entities[:5]:
    print(item)


("The war was precipitated by the Assassination in Sarajevo of the Austro-Hungarian Empire's heir to the throne, Erzherzog Franz Ferdinand, by Gavrilo Princip, a member of the Young Bosnia liberation movement.", ['Sarajevo'])
('After a period of diplomatic and military escalation known as the July Crisis, by the end of July 1914 two coalitions were at war: the Allies, comprised initially of the British Empire, France, and the Russian Empire; and the Central Powers, comprised initially of the German Empire and Austria-Hungary.', ['the British Empire', 'France', 'the Russian Empire', 'the German Empire', 'Austria'])
('In 1917, Russia ended hostile actions against the Central Powers after the fall of the Tsar.', ['Russia', 'Tsar'])
('The Bolsheviks negotiated the Treaty of Brest-Litovsk with Germany, although it was a huge cost to Russia.', ['Germany', 'Russia'])
('In the treaty, Bolshevik Russia ceded the Baltic states to Germany, and its province of Kars Oblast in the South Caucasus to 

## Step 6: Filter GPE Entities Using Country List

Next, I filtered the extracted GPEs to only include valid country names using a predefined list of countries from `countries_list.txt`.

This ensures only country-related relationships are included in the final analysis and avoids unrelated GPEs like cities or regions.


In [9]:
# Load the list of valid country names
with open('/Users/muhammaddildar/Desktop/20th_century_scraping/countries_list.txt', 'r', encoding='utf-8') as f:
    country_list = [line.strip() for line in f]

# Convert to set for faster lookup
country_set = set(country_list)

# Filter sentence_entities to keep only GPEs that match country list
filtered_entities = []

for sentence, gpes in sentence_entities:
    matching_countries = [gpe for gpe in gpes if gpe in country_set]
    if matching_countries:
        filtered_entities.append((sentence, matching_countries))

# Preview the first 5 filtered sentence-country pairs
for item in filtered_entities[:5]:
    print(item)


('After a period of diplomatic and military escalation known as the July Crisis, by the end of July 1914 two coalitions were at war: the Allies, comprised initially of the British Empire, France, and the Russian Empire; and the Central Powers, comprised initially of the German Empire and Austria-Hungary.', ['France', 'Austria'])
('In 1917, Russia ended hostile actions against the Central Powers after the fall of the Tsar.', ['Russia'])
('The Bolsheviks negotiated the Treaty of Brest-Litovsk with Germany, although it was a huge cost to Russia.', ['Germany', 'Russia'])
('In the treaty, Bolshevik Russia ceded the Baltic states to Germany, and its province of Kars Oblast in the South Caucasus to the Ottoman Empire.', ['Germany'])
('It also recognized the independence of Ukraine.', ['Ukraine'])
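
The membership test `gpe in country_set` only keeps strings that are identical to a line in `countries_list.txt`, so differences in case or a leading "the" would cause a miss. Whether that matters depends on the contents of the country list, which are not shown here. As a hedged sketch, a normalized lookup could look like the following; `normalize_place`, `match_countries`, and the sample lists are hypothetical and not part of the notebook.

```python
def normalize_place(name: str) -> str:
    """Lowercase a place name and drop a leading 'the '."""
    cleaned = name.strip().lower()
    if cleaned.startswith("the "):
        cleaned = cleaned[4:]
    return cleaned


def match_countries(gpes, countries):
    """Return the canonical country names that the GPE strings match."""
    lookup = {normalize_place(c): c for c in countries}
    matches = []
    for gpe in gpes:
        canonical = lookup.get(normalize_place(gpe))
        # Keep each country once, in order of first appearance
        if canonical is not None and canonical not in matches:
            matches.append(canonical)
    return matches


# Hypothetical country list and GPE strings, for illustration only
countries = ["France", "Germany", "United States", "Netherlands"]
gpes = ["the United States", "france", "Sarajevo", "the Netherlands", "France"]
matched = match_countries(gpes, countries)
print(matched)
```

Unlike the cell above, this version also removes duplicate countries within a sentence, which is a design choice rather than a requirement.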


## Step 7: Create Relationships DataFrame

Using the filtered sentence-country pairs, I created a pandas DataFrame to structure the relationships.

Each row contains:
- The full sentence
- The list of countries mentioned in that sentence


In [11]:
import pandas as pd

# Create DataFrame
relationships_df = pd.DataFrame(filtered_entities, columns=['Sentence', 'Countries'])

# Display the DataFrame (Jupyter shows the first and last rows of long frames)
relationships_df


| | Sentence | Countries |
|---|---|---|
| 0 | After a period of diplomatic and military esca... | [France, Austria] |
| 1 | In 1917, Russia ended hostile actions against ... | [Russia] |
| 2 | The Bolsheviks negotiated the Treaty of Brest-... | [Germany, Russia] |
| 3 | In the treaty, Bolshevik Russia ceded the Balt... | [Germany] |
| 4 | It also recognized the independence of Ukraine. | [Ukraine] |
| ... | ... | ... |
| 109 | China, an ancient nation comprising a fifth of... | [China] |
| 110 | The influence of China and India was also risi... | [China, India] |
| 111 | Meanwhile in South Africa, the apartheid came ... | [South Africa] |
| 112 | In Rwanda, an estimated one million people wer... | [Rwanda] |
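
Each row holds a list of countries rather than explicit pairs. If the data is later used for network analysis, one possible next step (an assumption, not something done in this notebook) is to expand each list into country pairs. The sketch below uses the first five `Countries` values shown above; `country_pairs` is a hypothetical helper.

```python
from itertools import combinations


def country_pairs(rows):
    """Yield (source, target) for every pair of distinct countries in a row."""
    for countries in rows:
        # Sort so that each pair always appears in the same order
        unique = sorted(set(countries))
        for source, target in combinations(unique, 2):
            yield source, target


# The first five `Countries` values from the DataFrame preview
rows = [
    ["France", "Austria"],
    ["Russia"],
    ["Germany", "Russia"],
    ["Germany"],
    ["Ukraine"],
]

pairs = list(country_pairs(rows))
print(pairs)
```

Rows with a single country produce no pairs, so only sentences that mention at least two countries would contribute edges.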


## Step 8: Export the Relationships DataFrame

After creating the final relationships DataFrame, I exported it as a CSV file for backup and future analysis.


In [12]:
# Export to CSV
relationships_df.to_csv('/Users/muhammaddildar/Desktop/20th_century_scraping/relationships_dataframe.csv', index=False)

print("✅ File exported successfully.")


✅ File exported successfully.
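
One caveat with this export: CSV has no list type, so each `Countries` value is written as its string form, such as `['Germany', 'Russia']`, and is read back as a plain string. A minimal sketch of restoring the lists follows; `parse_countries` is a hypothetical helper, and the file name in the commented lines is a placeholder.

```python
import ast


def parse_countries(cell: str) -> list:
    """Turn the text "['Germany', 'Russia']" back into a Python list."""
    value = ast.literal_eval(cell)
    if not isinstance(value, list):
        raise ValueError(f"expected a list, got {type(value).__name__}")
    return value


# What a list cell looks like once it has been written to CSV
cell = str(["Germany", "Russia"])
restored = parse_countries(cell)
print(restored)

# With pandas, the helper could be passed to read_csv (placeholder path):
# df = pd.read_csv("relationships_dataframe.csv",
#                  converters={"Countries": parse_countries})
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for this purpose.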


## Step 9: Save and Submit Notebook

To complete the task, I:
- Saved this Jupyter notebook (`.ipynb`)
- Downloaded it along with the CSV file
- Pushed both files to my GitHub repository
- Shared the link with my mentor for review
