# Using stanza for Named Entity Recognition (continued)

## Installation of Stanza

In [None]:
# Installing stanza
!pip install stanza



## Import library and download language model

After installing it, we import stanza into our notebook, together with the other libraries we need (`os`, `requests` and `time`).

In [None]:
import stanza
import os
import requests
import time

## Creating the pipeline

Download the English language model and build the pipeline (we specify that it should only tokenize the text, separate multiword tokens and perform Named Entity Recognition):


In [None]:

stanza.download("en")
nlp = stanza.Pipeline("en", processors="tokenize,ner")


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.10.0.json:   0%|  …

INFO:stanza:Downloaded file to /root/stanza_resources/resources.json
INFO:stanza:Downloading default packages for language: en (English) ...
INFO:stanza:File exists: /root/stanza_resources/en/default.zip
INFO:stanza:Finished downloading models and saved to /root/stanza_resources
INFO:stanza:Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.10.0.json:   0%|  …

INFO:stanza:Downloaded file to /root/stanza_resources/resources.json
INFO:stanza:Loading these models for language: en (English):
| Processor | Package                   |
-----------------------------------------
| tokenize  | combined                  |
| mwt       | combined                  |
| ner       | ontonotes-ww-multi_charlm |

INFO:stanza:Using device: cpu
INFO:stanza:Loading: tokenize
INFO:stanza:Loading: mwt
INFO:stanza:Loading: ner
INFO:stanza:Done loading processors!


## Cloning the portfolio repository

In [None]:
!git clone https://github.com/rifat-jahan123/FASDH25-portfolio2.git

Cloning into 'FASDH25-portfolio2'...
remote: Enumerating objects: 4358, done.
remote: Counting objects: 100% (4/4), done.
remote: Compressing objects: 100% (4/4), done.
remote: Total 4358 (delta 1), reused 0 (delta 0), pack-reused 4354 (from 2)
Receiving objects: 100% (4358/4358), 17.77 MiB | 31.92 MiB/s, done.
Resolving deltas: 100% (7/7), done.


Listing and filtering files to keep only January 2024 articles from the corpus folder

In [None]:
# Path to the articles folder in the cloned repository
path = "/content/FASDH25-portfolio2/articles"
# List all files in the folder
files = os.listdir(path)
# Keep articles from January 2024 only (file names start with the date)
jan_files = [f for f in files if f.startswith("2024-01-")]

print("January files found:", len(jan_files))

January files found: 326


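Cleaning and normalizing the place names

We remove possessive endings (like “’s”) from place names to keep them consistent: for example, "Gaza’s" becomes "Gaza", so the same place is not counted under two different entries. This cell relies on the `places` dictionary built in the extraction step, so it raises a `NameError` if it is run before that step.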


In [1]:
import re

# Create an empty dictionary to store cleaned place names
normalized_places = {}

# Iterating through each place and its count
for place, count in places.items():
    # Removing possessive endings (e.g., Gaza's to Gaza)
    place = re.sub(r"[’'`]s\b", "", place)

    # Removing any remaining punctuation (anything that is not a letter, digit, underscore or whitespace)
    place = re.sub(r"[^\w\s]", "", place)

    # Removing "The" at the beginning of the name
    place = re.sub(r"^the\s+", "", place, flags=re.IGNORECASE)

    # Adding the count to the cleaned place name
    if place in normalized_places:
        normalized_places[place] += count
    else:
        normalized_places[place] = count

# Print the cleaned place names with their total counts
print(normalized_places)

NameError: name 'places' is not defined
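To see what the cleaning step does without running the full extraction first, here is a minimal sketch on a small hand-made dictionary. The helper names `normalize_place` and `normalize_counts` are hypothetical (they do not appear in the notebook); the regular expressions are the same three used in the cell above, and the sample counts are copied from the TSV output shown later in this notebook.

```python
import re

def normalize_place(name):
    """Strip a possessive ending, punctuation, and a leading 'the' (hypothetical helper)."""
    name = re.sub(r"[’'`]s\b", "", name)       # Gaza’s -> Gaza
    name = re.sub(r"[^\w\s]", "", name)        # drop remaining punctuation
    name = re.sub(r"^the\s+", "", name, flags=re.IGNORECASE)  # the West Bank -> West Bank
    return name.strip()

def normalize_counts(places):
    """Merge the counts of all names that normalize to the same string."""
    merged = {}
    for place, count in places.items():
        key = normalize_place(place)
        merged[key] = merged.get(key, 0) + count
    return merged

# Sample counts taken from the TSV output in this notebook
sample = {"Gaza’s": 241, "Gaza": 17008, "the West Bank": 1701, "West Bank": 3220}
result = normalize_counts(sample)
print(result)  # {'Gaza': 17249, 'West Bank': 4921}
```

One side effect worth keeping in mind: the punctuation step also removes hyphens and commas, so a name like "al-Nasr" becomes "alNasr". If that is not wanted, the second pattern would need to be narrowed.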

### Extracting the place names from January 2024 articles using stanza NER



In [None]:
# Extracting place names from January 2024 articles using Stanza NER

# Initializing an empty dictionary to store place names and their counts
places = {}
# Path to the articles folder in our portfolio repo
folder = "/content/FASDH25-portfolio2/articles"

# Selecting only the first 5 January 2024 files to keep this test run short
jan_files = [f for f in os.listdir(folder) if f.startswith("2024-01")][:5]

# Iterating through the selected files
for filename in jan_files:
    path = os.path.join(folder, filename)
    # Opening the file and reading its content
    with open(path, encoding="utf-8") as file:
        text = file.read()
    # Applying the Stanza pipeline to extract named entities
    doc = nlp(text)
    # Looping through the named entities
    for e in doc.entities:
        # Keeping only geopolitical entities (GPE) and other locations (LOC)
        if e.type in ["GPE", "LOC"]:
            # Cleaning the entity text and updating its count
            place = e.text.strip()
            places[place] = places.get(place, 0) + 1

# Display the counts for the first 10 places found
for place, count in list(places.items())[:10]:
    print(f"{place}: {count}")


Morocco: 13
Israel: 28
Gaza: 27
Rabat: 3
United States: 4
the United Arab Emirates: 1
UAE: 3
Bahrain: 1
Sudan: 1
US: 16


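The counts above list some places under more than one name, for example "United States" and "US", or "the United Arab Emirates" and "UAE". We could come up with ways to fix inconsistencies like these.

One option would be to create a dictionary of known variants,
so that when we loop through the entities, we can replace each variant with one preferred name.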
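A minimal sketch of that idea follows. The `known_variants` mapping and the `canonical_name` helper are hypothetical names chosen for illustration, and the preferred spellings are an assumption, not a fixed standard. The sample counts are copied from the output of the five-file run above.

```python
# Hypothetical mapping from variant spellings to one preferred name
known_variants = {
    "US": "United States",
    "the United States": "United States",
    "UAE": "United Arab Emirates",
    "the United Arab Emirates": "United Arab Emirates",
}

def canonical_name(place, variants=known_variants):
    """Return the preferred name for a place, or the place itself if it is not listed."""
    return variants.get(place, place)

# Sample counts taken from the five-file run above
counts = {"United States": 4, "US": 16, "UAE": 3,
          "the United Arab Emirates": 1, "Rabat": 3}

# Merge the counts of all variants under their preferred name
merged = {}
for place, count in counts.items():
    name = canonical_name(place)
    merged[name] = merged.get(name, 0) + count

print(merged)  # {'United States': 20, 'United Arab Emirates': 4, 'Rabat': 3}
```

In the extraction loop, the same lookup could be applied to `e.text.strip()` before the count is updated, so the variants never end up as separate keys.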

### Multiple files

Since we can do this in one file, we can also do this for a large number of files!

Let's download our FASDH25 git repository here. Because we don't use Python to clone a git repository, we add an exclamation mark before the command `git` in Colab (as we did with `pip`). Complete the command below and run it:



In [None]:
# cloning our FASDH25 portfolio 2 folder here:
!git clone https://github.com/rifat-jahan123/FASDH25-portfolio2.git

fatal: destination path 'FASDH25-portfolio2' already exists and is not an empty directory.


We can now loop through the articles in the folder as we did when we were using regex to find filenames:

In [None]:
# Extracting and counting place names from all January 2024 articles using Stanza
import os

# Initializing an empty dictionary to store place names and their counts
places = {}
# Defining the path to the folder containing all article files
folder = "/content/FASDH25-portfolio2/articles"
# Initializing a counter to keep track of how many January 2024 articles are processed
jan_2024_article_count = 0

# Looping through each file in the specified folder
for filename in os.listdir(folder):
    # Checking if the file name starts with "2024-01" to filter January 2024 articles
    if filename.startswith("2024-01"):
        jan_2024_article_count += 1
        # Creating the full path to the file
        path = os.path.join(folder, filename)
        # Opening and reading the file
        with open(path, encoding="utf-8") as file:
            text = file.read()
        # Running Stanza's NLP pipeline on the article text
        doc = nlp(text)
        # Counting every geopolitical entity (GPE) and location (LOC)
        for e in doc.entities:
            if e.type in ["GPE", "LOC"]:
                place = e.text.strip()
                places[place] = places.get(place, 0) + 1

# Print how many articles were found from January 2024
print("Number of articles from January 2024:", jan_2024_article_count)
print(places)



KeyboardInterrupt: 

### Storing data in a tsv file

We can now store the counts in a tsv file, so we can reuse it in a different script.

Let's create a tsv file with two columns: "name" and "frequency".
We'll create the tsv file in two steps:

1. we create the header: that is, the column names, separated by tabs
2. we loop through all the place names, and we create a new row in the table for each place. Each row will contain the place name and its frequency, separated by a tab. Each row will have to start on a new line, so we'll also have to add a newline character \n to the row; should we add it at the beginning or end of the line, or both?

Fill in the blanks:

In [None]:
# Defining the name of the output file where results will be saved
filename = "ner_counts.tsv"

# Open the file so we can write into it, using UTF-8 text format
with open(filename, mode="w", encoding="utf-8") as file:
    # Create the first line of the file with column names, separated by a tab
    header = "name\tfrequency\n"
    # Write the header to the file
    file.write(header)
    # Looping through each place and its count in the cleaned, normalized dictionary
    for entity, count in normalized_places.items():
        # Creating a row string with place name and its count, separated by a tab
        row = f"{entity}\t{count}\n"
        # Writing the row to the file
        file.write(row)


The file is now stored in our Colab session environment. You can see it by clicking the folder icon in the left-hand toolbar in Colab. Double-click the file to view it in Colab, or right-click it and choose "Download" to download it.

To access it in your script, use the path `/content/ner_counts.tsv`

In [None]:
# Opening the file from the path where it's stored
with open("/content/ner_counts.tsv", encoding="utf-8") as file:
    print(file.read())

name	frequency
Gaza’s	241
Israel	21962
Gaza	17008
US	7347
Hamas’s	72
Israel’s	321
Shati	25
Gaza Strip	795
Tel Aviv	563
West Bank	3220
the West Bank	1701
the Gaza Strip	2226
Egypt	1344
South Africa’s	18
Friday’s	3
Strip	245
al-Nasr	4
Deir El Balah	1
Deir al-Balah	17
Ashkelon	31
Jerusalem	1414
Bat Ayin	1
Qatar	603
Silwan	79
East Jerusalem	1519
Al Bustan	2
City of David	4
Bustan	2
Batan Al Hawa	1
Area C	67
Iran	1399
Jenin	1346
Lebanon	1313
Syria	610
Rafah	1416
Malta	28
southern Gaza Strip	19
Gaza City	808
Nablus	629
Beit Hanoon	41
Beit Lahiya	26
Jabalia	202
Ramallah	768
Shatila	28
Sabra	14
Beirut	259
Palestine	2288
Tehran	256
Washington	898
DC	166
the United States	1056
New York	169
Texas	31
Huwara	129
@yuhline	1
New York City	31
@YousefMunayyer	1
Qabatiya	11
Qabatyah	1
el-Bireh	10
Beitin	1
Ukraine	531
Los Angeles	29
California	47
Osage	1
Mariupol	6
Russia	474
Alabama	7
Dura	8
Hebron	269
Tulkarem	72
Washington, DC	189
The United States	270
Saudi Arabia	417
Jeddah	13
Riyadh	79
the Middle E

## 3. Create a gazetteer for the NER places




Using the GeoNames API to retrieve coordinates for each place name

In [None]:
# Importing the necessary libraries
import requests
import time

# Setting my GeoNames username for authentication
geonames_username = "rifat.jahan"

# Function to retrieve coordinates for a given place using the GeoNames API
def get_coordinates(place, username=geonames_username, fuzzy=0, timeout=1):
  """This function gets a single set of coordinates from the geonames API.

  Args:
    place (str): the place name
    username (str): your geonames user name
    fuzzy (int): 0 = exact matching, 1 = fuzzy matching (allow similar but not exact matches)
    timeout (int): number of seconds to wait before a call to the geonames API
      (to avoid being blocked for overloading the server)

  Returns:
    dictionary: keys: latitude, longitude (None if no match was found)
  """
  # Adding a delay to avoid overloading the server:
  time.sleep(timeout)
  # Constructing the API request URL and parameters
  url = "http://api.geonames.org/searchJSON?"
  params = {"q": place,
            "username": username,
            "fuzzy": fuzzy,
            "maxRows": 1,
            "isNameRequired": True
  }
  # Sending the GET request to the GeoNames API
  response = requests.get(url, params=params)
  # Parsing the response as a JSON object
  results = response.json()
  # Printing the raw response (useful for debugging, but very verbose for many places)
  print(results)
  # Extracting the first match from the geocoding results:
  try:
    result = results["geonames"][0]
    return {"latitude": result["lat"], "longitude": result["lng"]}
  except (IndexError, KeyError):
    print("No results found for your API call", response.request.url)
    return None
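The parsing step of `get_coordinates()` can be tried out without calling the API. The sketch below uses a hypothetical helper, `extract_coordinates`, on a response dictionary trimmed from the Antwerp result printed further down in this notebook. Unlike the function above, it converts the values to floats, since the API returns `lat` and `lng` as strings. The shape of the error response (a `status` key instead of `geonames`) is an assumption about how GeoNames reports problems such as an exceeded quota.

```python
def extract_coordinates(results):
    """Return the first match as floats, or None if there is no match (hypothetical helper)."""
    matches = results.get("geonames", [])
    if not matches:
        return None
    first = matches[0]
    # GeoNames returns coordinates as strings, so convert them to numbers
    return {"latitude": float(first["lat"]), "longitude": float(first["lng"])}

# Response trimmed from the Antwerp result shown in the output below
sample = {"totalResultsCount": 14,
          "geonames": [{"name": "Antwerp", "lat": "51.22047", "lng": "4.40026"}]}
print(extract_coordinates(sample))

# A search with no matches, and an assumed error-style response without a "geonames" key
print(extract_coordinates({"totalResultsCount": 0, "geonames": []}))
print(extract_coordinates({"status": {"message": "assumed error message"}}))
```

Both failure cases return `None`, which is what the gazetteer loop relies on when it writes `NA` for places without coordinates.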

Reading place names, retrieving coordinates, and writing them to a gazetteer file



In [None]:
# Input and output filenames for reading place names and writing the gazetteer
input_filename = "ner_counts.tsv"
output_filename = "ner_gazetteer.tsv"

# Reading place names from ner_counts.tsv
with open(input_filename, "r", encoding="utf-8") as file:
    lines = file.readlines()

# Skipping the header line and extracting place names from the first column
place_names = [line.strip().split("\t")[0] for line in lines[1:]]

# Opening the output file for writing the final gazetteer with coordinates
with open(output_filename, "w", encoding="utf-8") as out_file:
    # Writing the header row with column names
    out_file.write("Name\tLatitude\tLongitude\n")
    # Looping through each place name to get and write its coordinates
    for name in place_names:
        # Calling the get_coordinates() function to fetch lat/lon
        coordinates = get_coordinates(name)
        if coordinates:
            lat = coordinates['latitude']
            lon = coordinates['longitude']
            out_file.write(f"{name}\t{lat}\t{lon}\n")
        else:
            # No match found: mark the coordinates as not available
            out_file.write(f"{name}\tNA\tNA\n")

# Display the file
with open(output_filename, encoding="utf-8") as file:
    print(file.read())

Streaming output truncated to the last 5000 lines.
No results found for your API call http://api.geonames.org/searchJSON?q=Jalaa&username=rifat.jahan&fuzzy=0&maxRows=1&isNameRequired=True
{'totalResultsCount': 14, 'geonames': [{'adminCode1': 'VLG', 'lng': '4.40026', 'geonameId': 2803138, 'toponymName': 'Antwerpen', 'countryId': '2802361', 'fcl': 'P', 'population': 529247, 'countryCode': 'BE', 'name': 'Antwerp', 'fclName': 'city, village,...', 'adminCodes1': {'ISO3166_2': 'VLG'}, 'countryName': 'Belgium', 'fcodeName': 'populated place', 'adminName1': 'Flanders', 'lat': '51.22047', 'fcode': 'PPL'}]}
{'totalResultsCount': 1, 'geonames': [{'adminCode1': '11', 'lng': '36.04028', 'geonameId': 273617, 'toponymName': 'Jebaa', 'countryId': '272103', 'fcl': 'P', 'population': 0, 'countryCode': 'LB', 'name': 'Jebaa', 'fclName': 'city, village,...', 'adminCodes1': {'ISO3166_2': 'BH'}, 'countryName': 'Lebanon', 'fcodeName': 'populated place', 'adminName1': 'Baalbek-Hermel Governorate'