Table of Contents:  
1. Install spaCy model + import libraries.
2. Text wrangling.
3. Network analysis.

1. Install spaCy model + import libraries

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import spacy
from spacy import displacy
import networkx as nx
import matplotlib.pyplot as plt
import re
import os


In [4]:
# Import spaCy model
import spacy

spacy.cli.download("en_core_web_sm")

Collecting en-core-web-sm==3.4.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.4.1/en_core_web_sm-3.4.1-py3-none-any.whl (12.8 MB)
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.4.1
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [5]:
# Load the model
NER = spacy.load("en_core_web_sm")

In [6]:
# Load the twentieth-century events text
with open("key_events_20th_century.txt", "r", encoding="utf-8", errors="ignore") as f:
    data = f.read()

print(data[:1000])

The 20th century changed the world in unprecedented ways. The World Wars sparked tension between countries and led to the creation of atomic bombs, the Cold War led to the Space Race and the creation of space-based rockets, and the World Wide Web was created. These advancements have played a significant role in citizens' lives and shaped the 21st century into what it is today.
Historic events in the 20th century[edit]
World at the beginning of the century[edit]
Main article: Edwardian era
Map of colonial and land-based empires throughout the world in 1914
The new beginning of the 20th century marked significant changes. The 1900s saw the decade herald a series of inventions, including the automobile, airplane and radio broadcasting. 1914 saw the completion of the Panama Canal.
The Scramble for Africa continued in the 1900s and resulted in wars and genocide across the continent. The atrocities in the Congo Free State shocked the civilized world.
From 1914 to 1918, the First World War, a

2. Text wrangling. 

In [7]:
# Wrangle the text
import re

data_clean = (
    data.replace("\n", " ")
        .replace("\xa0", " ")   # non-breaking spaces
        .replace("’", "'")
        .replace("“", '"')
        .replace("”", '"')
)

# collapse multiple spaces
data_clean = re.sub(r"\s+", " ", data_clean).strip()

print(data_clean[:1000])

The 20th century changed the world in unprecedented ways. The World Wars sparked tension between countries and led to the creation of atomic bombs, the Cold War led to the Space Race and the creation of space-based rockets, and the World Wide Web was created. These advancements have played a significant role in citizens' lives and shaped the 21st century into what it is today. Historic events in the 20th century[edit] World at the beginning of the century[edit] Main article: Edwardian era Map of colonial and land-based empires throughout the world in 1914 The new beginning of the 20th century marked significant changes. The 1900s saw the decade herald a series of inventions, including the automobile, airplane and radio broadcasting. 1914 saw the completion of the Panama Canal. The Scramble for Africa continued in the 1900s and resulted in wars and genocide across the continent. The atrocities in the Congo Free State shocked the civilized world. From 1914 to 1918, the First World War, a

In [8]:
# Remove [edit] tags in the text
data_clean = re.sub(r"\[edit\]", "", data_clean)
print(data_clean[:1000])

The 20th century changed the world in unprecedented ways. The World Wars sparked tension between countries and led to the creation of atomic bombs, the Cold War led to the Space Race and the creation of space-based rockets, and the World Wide Web was created. These advancements have played a significant role in citizens' lives and shaped the 21st century into what it is today. Historic events in the 20th century World at the beginning of the century Main article: Edwardian era Map of colonial and land-based empires throughout the world in 1914 The new beginning of the 20th century marked significant changes. The 1900s saw the decade herald a series of inventions, including the automobile, airplane and radio broadcasting. 1914 saw the completion of the Panama Canal. The Scramble for Africa continued in the 1900s and resulted in wars and genocide across the continent. The atrocities in the Congo Free State shocked the civilized world. From 1914 to 1918, the First World War, and its after

In [12]:
# Delete patterns
patterns_to_remove = [
    "Historic events in the 20th century",
    "World at the beginning of the century",
    "Edwardian era",
    "Map of colonial and land-based empires throughout the world in 1914",
    "Map of",          # general cleanup for image captions
    "Main article:",
]

for pat in patterns_to_remove:
    data_clean = data_clean.replace(pat, "")

In [13]:
# Collapse spaces again
data_clean = re.sub(r"\s+", " ", data_clean).strip()

In [14]:
# Check
print(data_clean[:1000])

The 20th century changed the world in unprecedented ways. The World Wars sparked tension between countries and led to the creation of atomic bombs, the Cold War led to the Space Race and the creation of space-based rockets, and the World Wide Web was created. These advancements have played a significant role in citizens' lives and shaped the 21st century into what it is today. The new beginning of the 20th century marked significant changes. The 1900s saw the decade herald a series of inventions, including the automobile, airplane and radio broadcasting. 1914 saw the completion of the Panama Canal. The Scramble for Africa continued in the 1900s and resulted in wars and genocide across the continent. The atrocities in the Congo Free State shocked the civilized world. From 1914 to 1918, the First World War, and its aftermath, caused major changes in the power balance of the world, destroying or transforming some of the most powerful empires. "The war to end all wars": World War I (1914–1

In [15]:
# Save the cleaned text as a .txt file
clean_path = "twentieth_century_events_clean.txt"

with open(clean_path, "w", encoding="utf-8") as f:
    f.write(data_clean)

clean_path

'twentieth_century_events_clean.txt'

Text wrangling summary:

I examined the raw twentieth-century events text file for special characters, formatting issues, and naming inconsistencies.

Observations:

- The text contained newline characters (\n) which were replaced with spaces.

- Non-breaking spaces (\xa0) and smart quotes (’, “, ”) were present; they were replaced with regular spaces and straight quotes.

- The scraped text included Wikipedia-style page artifacts: "[edit]" markers, section headings such as "Historic events in the 20th century" and "World at the beginning of the century", the "Main article: Edwardian era" link, and image captions like "Map of colonial and land-based empires...". These were removed because they are not part of the narrative and disrupt NLP processing.

I inspected the country names in my countries list and compared them to the text.
No major mismatches were found, and names appeared consistent.
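
One way to make this comparison repeatable is a whole-word search for each country name in the cleaned text. The sketch below is an illustration only: `countries_in_text` is a hypothetical helper, and the sample sentence and names are invented for the example rather than taken from the real files.

```python
import re

def countries_in_text(text, country_names):
    """Return the set of names that occur in text as whole words."""
    found = set()
    for raw_name in country_names:
        name = raw_name.strip()  # tolerate names padded with spaces
        if not name:
            continue
        # Lookarounds keep "Niger" from matching inside "Nigeria".
        pattern = r"(?<!\w)" + re.escape(name) + r"(?!\w)"
        if re.search(pattern, text):
            found.add(name)
    return found

# Invented sample data, for illustration only
sample_text = "Germany invaded Poland. Nigeria is mentioned here."
sample_names = ["  Germany ", "Poland", "Niger", "France"]
print(sorted(countries_in_text(sample_text, sample_names)))
```

Names in the list that never appear in the result are candidates for a spelling or naming mismatch and can be checked by hand.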

Corrections applied:

- Replaced problematic characters with standard ASCII equivalents.

- Removed leftover heading/caption fragments from the scraped source.

- Collapsed multiple spaces to single spaces.

- Saved the cleaned version as a .txt file:
twentieth_century_events_clean.txt
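
The corrections listed above could also be gathered into one reusable function. This is a minimal sketch, not the notebook's actual code: `clean_text` and `REPLACEMENTS` are hypothetical names, and the sample string is invented to exercise each step.

```python
import re

# Character substitutions, mirroring the replacements in the wrangling cell
REPLACEMENTS = {"\n": " ", "\xa0": " ", "’": "'", "“": '"', "”": '"'}

def clean_text(raw, artifacts=()):
    """Apply the cleaning steps in order and return the cleaned string."""
    for old, new in REPLACEMENTS.items():
        raw = raw.replace(old, new)
    raw = re.sub(r"\[edit\]", "", raw)         # drop [edit] markers
    for fragment in artifacts:                 # drop heading/caption fragments
        raw = raw.replace(fragment, "")
    return re.sub(r"\s+", " ", raw).strip()    # collapse whitespace last

# Invented sample, for illustration only
sample = "Main article: Edwardian era\nThe new\xa0century[edit] began. It’s “new”."
print(clean_text(sample, ["Main article:", "Edwardian era"]))
```

Collapsing whitespace as the final step means the gaps left behind by removed fragments are cleaned up in the same pass.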

3. Network analysis.

In [16]:
# Create NER object
with open("twentieth_century_events_clean.txt", "r", encoding="utf-8") as f:
    text = f.read()

book = NER(text)

In [17]:
# Split text into sentences and extract entities
df_sentences = []

for sent in book.sents:
    entity_list = [ent.text for ent in sent.ents]
    df_sentences.append({
        "sentence": sent.text,
        "entities": entity_list
    })

df_sentences = pd.DataFrame(df_sentences)
df_sentences.head(10)

Unnamed: 0,sentence,entities
0,The 20th century changed the world in unpreced...,[The 20th century]
1,The World Wars sparked tension between countri...,"[the Cold War, the Space Race]"
2,These advancements have played a significant r...,"[the 21st century, today]"
3,The new beginning of the 20th century marked s...,[the 20th century]
4,The 1900s saw the decade herald a series of in...,"[The 1900s, the decade]"
5,1914 saw the completion of the Panama Canal.,"[1914, the Panama Canal]"
6,The Scramble for Africa continued in the 1900s...,"[Scramble, Africa, the 1900s]"
7,The atrocities in the Congo Free State shocked...,[the Congo Free State]
8,"From 1914 to 1918, the First World War, and it...","[1914 to 1918, the First World War]"
9,"""The war to end all wars"": World War I (1914–1...","[World War I, World War I Arrest, Sarajevo, th..."


In [21]:
# Load the country list
countries_df = pd.read_csv("countries_list_20th_century.csv")
countries_df.head()
print(countries_df.columns)

Index(['Unnamed: 0', 'country_name'], dtype='object')


In [22]:
# Clean the country list dataframe
countries_df = countries_df.drop(columns=["Unnamed: 0"])
countries_df.head()

Unnamed: 0,country_name
0,Afghanistan
1,Albania
2,Algeria
3,Andorra
4,Angola


In [23]:
# Create the country lookup set
country_lookup = set(countries_df["country_name"].unique())
len(country_lookup), list(country_lookup)[:10]

(209,
 ['   Slovenia ',
  '  Oman',
  '  Armenia ',
  '   Latvia ',
  '   Montenegro ',
  '  Antigua and Barbuda ',
  '  Qatar',
  '   Egypt ',
  '   Saint Vincent and the Grenadines ',
  '  Romania '])

In [24]:
# Strip whitespace from all country names
countries_df["country_name"] = countries_df["country_name"].str.strip()
country_lookup = set(countries_df["country_name"].unique())

len(country_lookup), list(country_lookup)[:10]

(208,
 ['Chile',
  'Micronesia, Federated States of',
  'Syria',
  'Algeria',
  'Latvia',
  'Sahrawi Arab Democratic Republic',
  'Croatia',
  'Sweden',
  'Brunei',
  'Gabon'])

In [25]:
# Keep only entities that are country names
def filter_entity(ent_list, country_lookup):
    return [ent for ent in ent_list if ent in country_lookup]

df_sentences["country_entities"] = df_sentences["entities"].apply(
    lambda x: filter_entity(x, country_lookup)
)

df_sentences.head(10)

Unnamed: 0,sentence,entities,country_entities
0,The 20th century changed the world in unpreced...,[The 20th century],[]
1,The World Wars sparked tension between countri...,"[the Cold War, the Space Race]",[]
2,These advancements have played a significant r...,"[the 21st century, today]",[]
3,The new beginning of the 20th century marked s...,[the 20th century],[]
4,The 1900s saw the decade herald a series of in...,"[The 1900s, the decade]",[]
5,1914 saw the completion of the Panama Canal.,"[1914, the Panama Canal]",[]
6,The Scramble for Africa continued in the 1900s...,"[Scramble, Africa, the 1900s]",[]
7,The atrocities in the Congo Free State shocked...,[the Congo Free State],[]
8,"From 1914 to 1918, the First World War, and it...","[1914 to 1918, the First World War]",[]
9,"""The war to end all wars"": World War I (1914–1...","[World War I, World War I Arrest, Sarajevo, th...",[]


In [26]:
# Keep only sentences that mention at least one country
df_sentences_filtered = df_sentences[
    df_sentences["country_entities"].map(len) > 0
].reset_index(drop=True)

df_sentences_filtered.head(10)
df_sentences_filtered.tail(10)

Unnamed: 0,sentence,entities,country_entities
115,"""The forgotten violence that helped India brea...",[India],[India]
116,"""Indian Independence Day: everything you need ...","[Indian Independence Day, India, Pakistan, 70 ...","[India, Pakistan]"
117,"""The Philippines, 1898–1946 | US House of Repr...","[Philippines, 1898–1946, US House of Represent...",[Philippines]
118,"""Colonial Cartographies, Postcolonial Borders,...","[Colonial Cartographies, Enduring Failures of ...",[Afghanistan]
119,"The Moldovans: Romania, Russia, and the Politi...","[Moldovans, Romania, Russia, the Politics of C...","[Romania, Russia]"
120,"""Selling 'Operation Passage to Freedom': Dr. T...","[Thomas Dooley, the Religious Overtones of Ear...",[Vietnam]
121,"""Stuck in Endless Preliminaries: Vietnam and t...","[Vietnam, the Battle of the Paris Peace Table,...",[Vietnam]
122,"""Anti-American Behavior in the Middle East: Ev...","[Anti-American, the Middle East, a Field Exper...",[Lebanon]
123,The Rise of China and India: A New Asian Drama.,"[The Rise of China, India]",[India]
124,Singapore: World Scientific.,[Singapore],[Singapore]
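
Row 123 above illustrates a consequence of the exact-match rule in `filter_entity`: the entity "The Rise of China" is not itself a country name, so only India is kept for that sentence. The sketch below reproduces that behavior on a toy lookup set. `toy_lookup` is an assumption standing in for the real country list, which may or may not contain these exact names.

```python
def filter_entity(ent_list, country_lookup):
    # Same exact-match rule as in the filtering cell: keep an entity
    # only if its full text equals a country name.
    return [ent for ent in ent_list if ent in country_lookup]

# Assumption: a small stand-in for the real country lookup set
toy_lookup = {"China", "India", "Singapore"}

print(filter_entity(["The Rise of China", "India"], toy_lookup))
print(filter_entity(["Singapore"], toy_lookup))
```

Any country that spaCy only ever tags as part of a longer entity span will therefore be missing from the relationship counts.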


In [27]:
# Build the list of country–country relationships
relationships = []
window_size = 5  # look-ahead in sentences; .loc[i:end_i] is inclusive, so each window spans up to window_size + 1 sentences

for i in range(len(df_sentences_filtered)):
    end_i = min(i + window_size, len(df_sentences_filtered) - 1)
    
    # flatten all country_entities from sentences i..end_i
    country_list = sum(
        df_sentences_filtered.loc[i:end_i, "country_entities"].tolist(),
        []
    )
    
    # remove immediate duplicates
    country_unique = [
        country_list[j] for j in range(len(country_list))
        if j == 0 or country_list[j] != country_list[j - 1]
    ]
    
    # link each country to the next one in the window
    if len(country_unique) > 1:
        for idx, a in enumerate(country_unique[:-1]):
            b = country_unique[idx + 1]
            relationships.append({"source": a, "target": b})

relationship_df = pd.DataFrame(relationships)
relationship_df.head(10)

Unnamed: 0,source,target
0,France,Russia
1,Russia,Germany
2,Germany,Russia
3,Russia,Germany
4,Russia,Germany
5,Germany,Russia
6,Russia,Germany
7,Germany,Italy
8,Germany,Russia
9,Russia,Germany
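
The windowing logic is easier to check on a toy input. The sketch below restates it with plain lists instead of a DataFrame. `window_pairs` and `toy_rows` are hypothetical names used only for illustration. Note that pandas `.loc[i:end_i]` includes both endpoints, which is why the list slice runs to `end_i + 1`.

```python
def window_pairs(rows, window_size):
    """rows: per-sentence lists of country names, in document order."""
    pairs = []
    last = len(rows) - 1
    for i in range(len(rows)):
        end_i = min(i + window_size, last)
        # rows[i:end_i + 1] mirrors the inclusive .loc[i:end_i] slice
        countries = [c for row in rows[i:end_i + 1] for c in row]
        # drop consecutive repeats of the same country
        unique = [c for j, c in enumerate(countries)
                  if j == 0 or c != countries[j - 1]]
        # link each country to the next one in the window
        pairs.extend(zip(unique, unique[1:]))
    return pairs

# Invented toy input, for illustration only
toy_rows = [["France"], ["Russia", "Russia"], ["Germany"]]
print(window_pairs(toy_rows, window_size=1))
```

Because consecutive windows overlap, the same neighboring pair is emitted once per window that contains it, which is why pairs such as Russia and Germany repeat in the output above.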


In [28]:
# Summarize by pair and count how often each appears

# sort each row so A–B and B–A collapse together
relationships_df = pd.DataFrame(
    np.sort(relationship_df.values, axis=1),
    columns=relationship_df.columns
)

# helper weight column
relationships_df["value"] = 1

# group and sum
relationships_df = relationships_df.groupby(
    ["source", "target"],
    sort=False,
    as_index=False
).sum()

relationships_df.head(10)

Unnamed: 0,source,target,value
0,France,Russia,7
1,Germany,Russia,13
2,Germany,Italy,32
3,Austria,Germany,11
4,Germany,Spain,5
5,France,Spain,5
6,France,Poland,11
7,France,Germany,28
8,Germany,Poland,33
9,Estonia,Germany,5
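
To see why each row is sorted before grouping, here is a plain-Python equivalent that uses `collections.Counter` instead of the pandas code in the cell above. `count_undirected` is a hypothetical helper. The four input pairs are the first four rows of the relationship list shown earlier.

```python
from collections import Counter

def count_undirected(pairs):
    """Count pairs so that (A, B) and (B, A) fall under the same key."""
    return Counter(tuple(sorted(pair)) for pair in pairs)

# First four source/target rows from the relationship list
directed = [
    ("France", "Russia"),
    ("Russia", "Germany"),
    ("Germany", "Russia"),
    ("Russia", "Germany"),
]

counts = count_undirected(directed)
for (a, b), n in counts.most_common():
    print(a, b, n)
```

Sorting each pair alphabetically gives every undirected edge one canonical form, so the count reflects how often two countries appear together regardless of order.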


In [29]:
# Save the relationship counts to a CSV file
relationships_df.to_csv("country_relationships_20th_century.csv", index=False)

In [30]:
# Quick check
relationships_df.sort_values("value", ascending=False).head(10)

Unnamed: 0,source,target,value
32,Germany,Japan,33
8,Germany,Poland,33
2,Germany,Italy,32
7,France,Germany,28
33,Japan,Russia,22
46,India,Pakistan,18
40,Japan,Solomon Islands,17
90,Italy,Japan,16
42,Japan,Philippines,16
29,Egypt,Libya,16


**Interpretation of relationship results**

The most frequent country–country co-occurrences extracted from the text line up with major geopolitical relationships of the 20th century, which suggests the pipeline is picking up historically meaningful pairings rather than noise.

For example:

- Germany–Japan and Germany–Italy appear frequently due to their alliance as Axis Powers in World War II.

- Germany–Poland is prominent because Germany's invasion of Poland in 1939 marked the beginning of World War II and is widely referenced in historical narratives.

- France–Germany emerges clearly as a central relationship due to recurring conflict in both World Wars and reconciliation afterward.