# 2B: Use Stanza to extract all place names from (part of) the corpus

##### This Colab notebook is part of Mini Project No. 2 in our Digital Humanities course, in which we explore how to visualize the places mentioned in news articles over time. Using computational tools, we extract toponyms (place names) from each article, map them, and observe how the geographic focus of the news shifts over time. Here we use Named Entity Recognition (NER), a Natural Language Processing (NLP) technique, to identify place names in a dataset of 4341 Al Jazeera English articles about the Gaza war, compiled by Inacio Vieira. The dataset contains articles from various dates, but, as instructed, we keep only those published in January 2024 and extract place names from that month alone.

In [1]:
# Installing Stanza: a Natural Language Processing (NLP) library developed by Stanford for tasks like Named Entity Recognition (NER)
!pip install stanza

Collecting stanza
  Downloading stanza-1.10.1-py3-none-any.whl.metadata (13 kB)
Collecting emoji (from stanza)
  Downloading emoji-2.14.1-py3-none-any.whl.metadata (5.7 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch>=1.3.0->stanza)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch>=1.3.0->stanza)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch>=1.3.0->stanza)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch>=1.3.0->stanza)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch>=1.3.0->stanza)
  Downloading nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_64.whl.metadata 

In [2]:
# Importing Stanza to be able to perform NER in this notebook
import stanza

In [3]:
# Download the English language model because our articles are in English
stanza.download("en")

# Create the pipeline for English, loading only the processors we need: tokenization, multi-word token (MWT) expansion, and Named Entity Recognition (our main task)
nlp = stanza.Pipeline(lang="en", processors='tokenize,mwt,ner')

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.10.0.json:   0%|  …

INFO:stanza:Downloaded file to /root/stanza_resources/resources.json
INFO:stanza:Downloading default packages for language: en (English) ...


Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.10.0/models/default.zip:   0%|          | …

INFO:stanza:Downloaded file to /root/stanza_resources/en/default.zip
INFO:stanza:Finished downloading models and saved to /root/stanza_resources
INFO:stanza:Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.10.0.json:   0%|  …

INFO:stanza:Downloaded file to /root/stanza_resources/resources.json
INFO:stanza:Loading these models for language: en (English):
| Processor | Package                   |
-----------------------------------------
| tokenize  | combined                  |
| mwt       | combined                  |
| ner       | ontonotes-ww-multi_charlm |

INFO:stanza:Using device: cuda
INFO:stanza:Loading: tokenize
INFO:stanza:Loading: mwt
INFO:stanza:Loading: ner
INFO:stanza:Done loading processors!


In [4]:
# Clone our FASDH25-portfolio2 repository so that we can run NER on its collection of articles
!git clone https://github.com/yasirrauf-123/FASDH25-portfolio2.git

Cloning into 'FASDH25-portfolio2'...
remote: Enumerating objects: 4430, done.
remote: Counting objects: 100% (36/36), done.
remote: Compressing objects: 100% (19/19), done.
remote: Total 4430 (delta 28), reused 17 (delta 17), pack-reused 4394 (from 2)
Receiving objects: 100% (4430/4430), 17.80 MiB | 49.93 MiB/s, done.
Resolving deltas: 100% (47/47), done.


In [5]:
# Adapted from an earlier version of our code with the help of ChatGPT (see ChatGPT Solution No. 8 in AI Documentation)
# Importing os to interact with the operating system (e.g., accessing files and directories)
import os

# Define the path to the folder where article text files are stored
folder = "/content/FASDH25-portfolio2/articles"

# Initialize an empty dictionary to store place names and their frequencies
places = {}

# Initialize a counter to keep track of how many articles were published in January 2024
jan_2024_count = 0

# Loop through each file in the folder
for filename in os.listdir(folder):

    # Check if the file name indicates it was published in January 2024
    if "2024-01" in filename:

        # If the condition is true, increment the January 2024 article counter
        jan_2024_count += 1

        # Create the full path to the current file
        path = os.path.join(folder, filename)

        # Open the file and read its contents as a string
        with open(path, encoding="utf-8") as file:
            text = file.read()

        # Use the Stanza NLP pipeline to analyze the text
        doc = nlp(text)

        # Loop through each sentence in the analyzed document
        for sent in doc.sentences:
            # Within each sentence, loop through identified named entities
            for ent in sent.ents:
                # Check if the entity is a place (either a GPE: geopolitical entity, or LOC: location)
                if ent.type in ["GPE", "LOC"]:
                    # Update the dictionary: increment count if it exists, or add it with count 1
                    places[ent.text] = places.get(ent.text, 0) + 1

# After processing all files, print the total number of relevant articles
print(f"Number of articles published in January 2024 in the collection: {jan_2024_count}")

# Print the dictionary of places and their occurrence counts
print(places)

Number of articles published in January 2024 in the collection: 326
{'Israel': 1593, 'Gaza': 1605, 'Palestine': 124, 'the United States': 97, 'Welch’s': 1, 'US': 706, 'Iraq': 62, 'United States': 40, 'West': 24, 'the Global South': 2, 'Qatar': 64, 'Gulf': 10, 'Egypt': 43, 'East Jerusalem': 23, 'Netanyahu’s': 7, 'Gaza Strip': 31, 'the Gaza Strip': 123, 'South Africa': 200, 'Russia': 43, 'Ukraine': 47, 'China': 28, 'South Africa’s': 8, 'Malaysia': 8, 'Turkey': 25, 'Jordan': 42, 'Bolivia': 4, 'Maldives': 1, 'Namibia': 10, 'Pakistan': 24, 'Columbia': 3, 'Khan Younis': 23, 'Middle East': 25, 'The Hague': 33, 'Bangladesh': 2, 'Comoros': 2, 'Djibouti': 4, 'Netherlands': 14, 'The United States': 21, 'The United Kingdom': 3, 'Myanmar': 6, 'Beirut': 84, 'Dahiyeh': 6, 'Lebanon': 175, 'Iran': 206, 'Yemen': 182, 'Beirut’s Shatila': 1, 'Red Sea': 50, 'Africa': 29, 'the Red Sea': 194, 'Gulf of Aden': 4, 'the Cape of Good Hope': 12, 'Singapore': 2, 'the Gulf of Aden': 23, 'The Red Sea': 5, 'Mediterran
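The two decisions in the cell above (which files count as January 2024, and which entities count as places) can be tried out without loading Stanza. The following is a minimal sketch, not the notebook's own method: it assumes filenames begin with an ISO date followed by an underscore (a hypothetical pattern; the real filenames may differ), and it uses made-up `(text, type)` pairs in place of the `ent.text` and `ent.type` values returned by the pipeline. It also swaps the plain dictionary for `collections.Counter`, which does the same counting.

```python
import re
from collections import Counter

# Assumed filename pattern (hypothetical): ISO date, then an underscore
JAN_2024 = re.compile(r"^2024-01-\d{2}_")

def is_january_2024(filename):
    """Return True if the filename starts with a January 2024 date."""
    return bool(JAN_2024.match(filename))

def count_places(entities, place_types=("GPE", "LOC")):
    """Count entity texts whose type is one of place_types.

    `entities` is an iterable of (text, type) pairs standing in for
    the (ent.text, ent.type) values produced by the NER pipeline.
    """
    return Counter(text for text, ent_type in entities if ent_type in place_types)

# Made-up sample data, for illustration only
filenames = ["2024-01-05_101.txt", "2023-12-31_99.txt", "2024-01-20_202.txt"]
entities = [("Gaza", "GPE"), ("Red Sea", "LOC"), ("Hamas", "ORG"), ("Gaza", "GPE")]

# Sorting makes the processing order reproducible across runs
january_files = sorted(f for f in filenames if is_january_2024(f))
place_counts = count_places(entities)

print(january_files)
print(place_counts.most_common())
```

Anchoring the date at the start of the filename is stricter than the substring test `"2024-01" in filename`, which would also accept a name that merely contains those characters somewhere else. Whether that matters depends on how the files are actually named.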

In [6]:
"""Our raw NER output is quite messy: for example, Gaza and Gaza’s are counted separately, even though they refer to the same place.
Similarly, Red Sea and The Red Sea are counted separately. The same issue affects United States and US, or Israel and State of Israel. The script
below cleans up these variant names using normalization, a concept we learnt in this course."""

# The code below was modified and fixed with the help of ChatGPT (see ChatGPT Solution No. 9 in AI Documentation)
# Import the regex library to perform normalization
import re
import re

# Create an empty dictionary to store normalized place names
normalized_places = {}

# Place normalization dictionary to handle known variations (e.g., "US" -> "United States")
# Keys must be lowercase and free of punctuation: the loop below strips punctuation
# before the lookup, so "U.S." arrives here as "us" and "U.K." as "uk"
normalization_map = {
    "uae": "United Arab Emirates",
    "united arab emirates": "United Arab Emirates",
    "us": "United States",
    "usa": "United States",
    "united states": "United States",
    "state of israel": "Israel",
    "israel": "Israel",
    "state of palestine": "Palestine",
    "palestine": "Palestine",
    "palestinian territories": "Palestine",
    "uk": "United Kingdom",
    "britain": "United Kingdom",
    "great britain": "United Kingdom",
    "india": "India",
    "indian basmati": "India",
    "michigan": "Michigan",
    "south east michigan": "Michigan",
    "south africa": "South Africa",
    "republic of south africa": "South Africa"
}

# Loop through the original dictionary of places and counts
for place, count in places.items():
    # Step 1: Remove possessives like 's (e.g., "Gaza’s" -> "Gaza")
    place = re.sub(r"[’'`]s\b", "", place)

    # Step 2: Remove punctuation (commas, periods, etc.) for consistency
    place = re.sub(r"[^\w\s]", "", place)

    # Step 3: Remove leading 'the' or "The" if it appears
    place = re.sub(r"^the\s+", "", place, flags=re.IGNORECASE)

    # Step 4: Normalize using the map. Lowercase the place name so that the lookup is case-insensitive,
    # which lets variations like "USA", "Us", or "U.S." all map to "United States". If the place is not in the map, fall back to a
    # title-cased version (e.g., "mexico city" → "Mexico City") to improve readability and increase the chances of successful geocoding.
    key = place.lower()
    normalized_name = normalization_map.get(key, place.title())  # Use title-cased version if not in map

    # Step 5: Merge counts
    if normalized_name in normalized_places:
        normalized_places[normalized_name] += count
    else:
        normalized_places[normalized_name] = count

# Print the cleaned and aggregated place names with counts
print(normalized_places)

{'Israel': 1632, 'Gaza': 1623, 'Palestine': 126, 'United States': 879, 'Welch': 1, 'Iraq': 64, 'West': 24, 'Global South': 2, 'Qatar': 65, 'Gulf': 10, 'Egypt': 44, 'East Jerusalem': 23, 'Netanyahu': 7, 'Gaza Strip': 160, 'South Africa': 209, 'Russia': 43, 'Ukraine': 47, 'China': 30, 'Malaysia': 8, 'Turkey': 25, 'Jordan': 43, 'Bolivia': 4, 'Maldives': 1, 'Namibia': 10, 'Pakistan': 24, 'Columbia': 3, 'Khan Younis': 23, 'Middle East': 102, 'Hague': 39, 'Bangladesh': 2, 'Comoros': 2, 'Djibouti': 4, 'Netherlands': 14, 'United Kingdom': 152, 'Myanmar': 6, 'Beirut': 87, 'Dahiyeh': 6, 'Lebanon': 178, 'Iran': 209, 'Yemen': 188, 'Beirut Shatila': 1, 'Red Sea': 250, 'Africa': 29, 'Gulf Of Aden': 27, 'Cape Of Good Hope': 12, 'Singapore': 2, 'Mediterranean': 12, 'Indian Ocean': 2, 'Europe': 30, 'Asia': 18, 'Spain': 7, 'Canada': 42, 'Australia': 13, 'Germany': 31, 'Italy': 10, 'Switzerland': 9, 'Finland': 3, 'Estonia': 1, 'Japan': 9, 'Austria': 3, 'Romania': 4, 'West Bank': 162, 'Syria': 84, 'Octobe
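To see what each cleaning step does to a single name, the loop above can be wrapped in a small function and run on a handful of inputs. This is a sketch for inspection only: `normalize_place` and `SAMPLE_MAP` are names introduced here for illustration, and the map is a reduced copy of a few entries from `normalization_map`.

```python
import re

# Reduced, illustrative map; the notebook's normalization_map has more entries
SAMPLE_MAP = {
    "us": "United States",
    "united states": "United States",
    "south africa": "South Africa",
}

def normalize_place(place, mapping=SAMPLE_MAP):
    """Apply the same four cleaning steps as the loop above to one name."""
    place = re.sub(r"[’'`]s\b", "", place)                      # Step 1: possessives
    place = re.sub(r"[^\w\s]", "", place)                       # Step 2: punctuation
    place = re.sub(r"^the\s+", "", place, flags=re.IGNORECASE)  # Step 3: leading "the"
    return mapping.get(place.lower(), place.title())            # Step 4: map or title-case

examples = ["Gaza’s", "the Red Sea", "U.S.", "South Africa’s", "The Hague", "the Gulf of Aden"]
results = {name: normalize_place(name) for name in examples}

for raw, clean in results.items():
    print(f"{raw!r} -> {clean!r}")
```

Two side effects are visible here and in the printed dictionary: "The Hague" loses its article and becomes "Hague", and title-casing turns "Gulf of Aden" into "Gulf Of Aden". If these forms cause problems when geocoding, one possible fix is to add explicit entries for them to the map (for example a "hague" key that maps to "The Hague").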

In [7]:
# Write the normalized places and their counts to a TSV file, which we will later use for mapping
# Written with help from ChatGPT (see ChatGPT Solution No. 10 in AI Documentation)
with open("ner_counts.tsv", "w", encoding="utf-8") as f:
    f.write("placename\tcount\n")  # Write header
    for place, count in normalized_places.items():
        f.write(f"{place}\t{count}\n")
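An alternative way to write the same kind of file is Python's built-in `csv` module with a tab delimiter, which also makes it easy to sort the rows and to read them back as dictionaries. The sketch below is an illustration, not a replacement for the cell above: it uses three counts copied from the printed dictionary as sample data, and writes to a temporary file with a hypothetical name so that `ner_counts.tsv` is left untouched.

```python
import csv
import os
import tempfile

# Sample data: three counts copied from the normalized output, for illustration
sample_counts = {"Gaza": 1623, "Israel": 1632, "Red Sea": 250}

# Write to a temporary directory so the real ner_counts.tsv is not overwritten
out_path = os.path.join(tempfile.mkdtemp(), "ner_counts_sorted.tsv")

with open(out_path, "w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerow(["placename", "count"])  # header row
    # Sort by count (descending), then by name, for a stable order
    for place, count in sorted(sample_counts.items(), key=lambda kv: (-kv[1], kv[0])):
        writer.writerow([place, count])

# Read the file back; each row becomes a dict keyed by the header names
with open(out_path, encoding="utf-8", newline="") as f:
    rows = list(csv.DictReader(f, delimiter="\t"))

print(rows[0])
```

Note that values read back from a TSV file are strings, so the counts need converting with `int()` before any arithmetic or sorting in a later mapping step.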

In [8]:
# Read the TSV file back in Colab to check its contents:
with open("/content/ner_counts.tsv", "r", encoding="utf-8") as file:
    print(file.read())

placename	count
Israel	1632
Gaza	1623
Palestine	126
United States	879
Welch	1
Iraq	64
West	24
Global South	2
Qatar	65
Gulf	10
Egypt	44
East Jerusalem	23
Netanyahu	7
Gaza Strip	160
South Africa	209
Russia	43
Ukraine	47
China	30
Malaysia	8
Turkey	25
Jordan	43
Bolivia	4
Maldives	1
Namibia	10
Pakistan	24
Columbia	3
Khan Younis	23
Middle East	102
Hague	39
Bangladesh	2
Comoros	2
Djibouti	4
Netherlands	14
United Kingdom	152
Myanmar	6
Beirut	87
Dahiyeh	6
Lebanon	178
Iran	209
Yemen	188
Beirut Shatila	1
Red Sea	250
Africa	29
Gulf Of Aden	27
Cape Of Good Hope	12
Singapore	2
Mediterranean	12
Indian Ocean	2
Europe	30
Asia	18
Spain	7
Canada	42
Australia	13
Germany	31
Italy	10
Switzerland	9
Finland	3
Estonia	1
Japan	9
Austria	3
Romania	4
West Bank	162
Syria	84
October7	2
Jerusalem	26
Dearborn	12
Michigan	12
Mackinac Island	1
Great Lakes	1
Lake Michigan	1
Afghanistan	7
Texas	3
Beit Nabala	1
Idlib	3
Hamas	6
Tel Aviv	51
Washington	62
Cairo	6
Doha	19
Nuseirat	11
Central Gaza Strip	2
Deir Elbalah	14
Rafah