# Using stanza for Named Entity Recognition (continued)

## Installation

Run the code cell below to install stanza:

In [None]:
!pip install stanza

Collecting stanza
  Downloading stanza-1.10.1-py3-none-any.whl.metadata (13 kB)
Collecting emoji (from stanza)
  Downloading emoji-2.14.1-py3-none-any.whl.metadata (5.7 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch>=1.3.0->stanza)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch>=1.3.0->stanza)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch>=1.3.0->stanza)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch>=1.3.0->stanza)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch>=1.3.0->stanza)
  Downloading nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_64.whl.metadata 

## Import library and download language model

After installing it, we import stanza into our notebook.

In [None]:
import stanza
import os
import pandas as pd
import re



## Cloning the repository



In [None]:
!git clone https://github.com/syedarslan476/FASDH25-portfolio2.git


Cloning into 'FASDH25-portfolio2'...
remote: Enumerating objects: 4396, done.
remote: Counting objects: 100% (21/21), done.
remote: Compressing objects: 100% (13/13), done.
remote: Total 4396 (delta 12), reused 11 (delta 8), pack-reused 4375 (from 2)
Receiving objects: 100% (4396/4396), 17.79 MiB | 20.09 MiB/s, done.
Resolving deltas: 100% (23/23), done.


In [None]:
# Download the language model:
stanza.download("en")
#took help from slides 11.1

# Create the pipeline, specifying the language:
nlp = stanza.Pipeline(lang="en", processors='tokenize,mwt,ner')

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.10.0.json:   0%|  …

INFO:stanza:Downloaded file to /root/stanza_resources/resources.json
INFO:stanza:Downloading default packages for language: en (English) ...


Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.10.0/models/default.zip:   0%|          | …

INFO:stanza:Downloaded file to /root/stanza_resources/en/default.zip
INFO:stanza:Finished downloading models and saved to /root/stanza_resources
INFO:stanza:Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.10.0.json:   0%|  …

INFO:stanza:Downloaded file to /root/stanza_resources/resources.json
INFO:stanza:Loading these models for language: en (English):
| Processor | Package                   |
-----------------------------------------
| tokenize  | combined                  |
| mwt       | combined                  |
| ner       | ontonotes-ww-multi_charlm |

INFO:stanza:Using device: cpu
INFO:stanza:Loading: tokenize
INFO:stanza:Loading: mwt
INFO:stanza:Loading: ner
INFO:stanza:Done loading processors!


## Filtering articles from January 2024



In [None]:
import os

# Set the folder path
folder = "/content/FASDH25-portfolio2/articles"

# List all files and filter only January 2024 articles
files = os.listdir(folder)
jan_files = [file for file in files if file.startswith("2024-01-") and file.endswith(".txt")]

# Show how many January articles were found
print("January files found:", len(jan_files))

# Create an empty dictionary to hold the places and their frequencies
places = {}
# Took some help from presentation 9.2
# Loop through all the January files
for filename in jan_files:
    # Create the path to the file
    path = f"{folder}/{filename}"
    # Open and read the file
    with open(path, encoding="utf-8") as file:
        text = file.read()
        # Use the NLP pipeline to analyze the text
        doc = nlp(text)
        # Select only entities that are place names
        for e in doc.entities:
            if e.type in ["GPE", "LOC"]:
                # Add 1 to the count of the place in the dictionary
                places[e.text] = places.get(e.text, 0) + 1

# Print the resulting places dictionary
print(places)


January files found: 326
{'Morocco': 13, 'Israel': 1593, 'Gaza': 1605, 'Rabat': 3, 'United States': 40, 'the United Arab Emirates': 13, 'UAE': 7, 'Bahrain': 11, 'Sudan': 3, 'US': 706, 'Western Sahara': 3, 'Washington': 60, 'Tel Aviv': 49, 'Algeria': 7, 'Marrakesh': 1, 'the Western Sahara': 1, 'Morocco’s': 1, 'Maghreb': 1, 'Ukraine': 47, 'Saudi Arabia': 39, 'California': 3, 'West Bank': 120, 'Dena': 1, 'Israel’s': 31, 'Oakland': 1, 'the United States': 97, 'South Africa': 200, 'Jordan': 42, 'Jerusalem': 26, 'East Jerusalem': 23, 'Egypt': 43, 'Qatar': 64, 'Kuala Lumpur': 4, 'Malaysia': 8, 'Palestine': 124, 'Indonesia’s': 1, 'Jakarta': 2, 'Johannesburg': 4, 'London': 17, 'Paris': 8, 'Vienna': 1, 'Berlin': 5, 'Amman': 6, 'Washington DC': 3, 'UK': 95, 'Manchester': 1, 'Yemen': 182, 'Washington, DC': 4, 'India': 50, 'Hyderabad': 1, 'Colombo’s Kollupitiya': 1, 'Namibia': 10, 'Germany': 31, 'Palestinian Territories': 1, 'Sweden': 2, 'Iran': 206, 'Kerman': 6, 'Lebanon': 175, 'Bethlehem': 4, 'Na

### Cleaning place names

The raw counts contain several variants of the same place (e.g., `Israel’s` and `Israel`, or `the United States` and `United States`). The code below cleans the names (removing possessive endings, punctuation, and a leading "the"), maps known variants to a standard name, and adds up the counts of names that end up identical:

In [None]:
import re
# Mostly took help from class presentation for the code
normalized_places = {}  # This will store cleaned place names with their total counts

# For standard naming conventions dictionary
# took help from Kulsoom
standard_names = {
    'gaza': 'Gaza',
    'u.s.': 'United States',
    'usa': 'United States',
    'uk': 'United Kingdom',
    'uae': 'United Arab Emirates',
    'britain': 'United Kingdom',
    'state of israel': 'Israel',
    'islamic republic of iran': 'Iran',
    'republic of yemen': 'Yemen',
    'state of palestine': 'Palestine',
    'beruit': 'Beirut',
    'dahiyeb': 'Dahiyeh',
    'tel israel': 'Tel Aviv',
    'westbank': 'West Bank',
    'gaza city': 'Gaza'
}
# Used presentation 9.2 for help
for place, count in places.items(): #took some help from chatgpt as well
    place = re.sub(r"[’'`]s\b", "", place)  # Remove possessive 's (like "Ali's Town" → "Ali Town")
    place = re.sub(r"[^\w\s]", "", place)   # Remove punctuation, including dots and hyphens (so "U.S." becomes "US")
    place = re.sub(r"^the\s+", "", place, flags=re.IGNORECASE)  # Remove 'The' at the beginning (case-insensitive)
    place = place.replace('\n', ' ')  # Convert newlines to spaces
    place = place.strip()  # Remove extra spaces

    # consider Gaza as a special case
    if re.search(r'gaza', place.lower()):
        normalized = standard_names['gaza']
    else:
        # Use the standardized name if available
        normalized = standard_names.get(place.lower(), place)

    # Add or update count for the normalized name
    if normalized in normalized_places:
        normalized_places[normalized] += count
    else:
        normalized_places[normalized] = count

# Show the final cleaned and combined place names with counts
print(normalized_places)


{'Morocco': 14, 'Israel': 1632, 'Gaza': 1830, 'Rabat': 3, 'United States': 162, 'United Arab Emirates': 21, 'Bahrain': 11, 'Sudan': 3, 'US': 717, 'Western Sahara': 4, 'Washington': 62, 'Tel Aviv': 52, 'Algeria': 7, 'Marrakesh': 1, 'Maghreb': 1, 'Ukraine': 47, 'Saudi Arabia': 39, 'California': 3, 'West Bank': 164, 'Dena': 1, 'Oakland': 1, 'South Africa': 208, 'Jordan': 43, 'Jerusalem': 26, 'East Jerusalem': 23, 'Egypt': 44, 'Qatar': 65, 'Kuala Lumpur': 4, 'Malaysia': 8, 'Palestine': 125, 'Indonesia': 3, 'Jakarta': 2, 'Johannesburg': 4, 'London': 17, 'Paris': 8, 'Vienna': 1, 'Berlin': 5, 'Amman': 6, 'Washington DC': 7, 'United Kingdom': 152, 'Manchester': 1, 'Yemen': 189, 'India': 50, 'Hyderabad': 1, 'Colombo Kollupitiya': 1, 'Namibia': 10, 'Germany': 31, 'Palestinian Territories': 1, 'Sweden': 3, 'Iran': 210, 'Kerman': 6, 'Lebanon': 178, 'Bethlehem': 4, 'Nairoukh': 1, 'China': 30, 'Italy': 10, 'Spain': 7, 'Turkey': 25, 'Shawawra': 1, 'Hague': 39, 'Khan Younis': 23, 'Syria': 84, 'Mazzeh'

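The cleaned output still contains errors: for example, `US` and `United States` are counted separately, and so are `BabelMandeb Strait` and `Bab alMandeb Strait`.

One option to fix errors like these is to create a dictionary of known errors,
so that we can correct them when we loop through the entities.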
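
A minimal sketch of this idea, assuming a small hypothetical `known_variants` dictionary and toy counts (not the real results from the articles): each variant is mapped to a preferred name, and the counts of names that get merged are added together.

```python
# Hypothetical mapping from known variants/errors to the preferred name
known_variants = {
    "US": "United States",
    "BabelMandeb Strait": "Bab al-Mandeb Strait",
    "Bab alMandeb Strait": "Bab al-Mandeb Strait",
}

def merge_variants(counts, variants):
    """Return a new dictionary in which variant names are merged into their preferred name."""
    merged = {}
    for name, count in counts.items():
        # Fall back to the name itself if it is not a known variant
        preferred = variants.get(name, name)
        merged[preferred] = merged.get(preferred, 0) + count
    return merged

# Toy counts for illustration only
toy_counts = {"United States": 3, "US": 5, "Gaza": 7}
print(merge_variants(toy_counts, known_variants))
```

The same function could be applied to `normalized_places` before the counts are written to a file; the dictionary would have to be extended by hand as new errors are spotted.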

### Storing data in a tsv file

We can now store the counts in a tsv file, so we can reuse it in a different script.

Let's create a tsv file with two columns: "Place" (the place name) and "Count" (its frequency).
We'll create the tsv file in two steps:

1. we create the header: that is, the column names, separated by tabs
2. we loop through all the place names, and we create a new row in the table for each place. Each row will contain the place name and its frequency, separated by a tab. Each row will have to start on a new line, so we'll also have to add a newline character \n to the row; should we add it at the beginning or end of the line, or both?
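
The two steps above can be sketched on a tiny, made-up dictionary (`toy_places` is a hypothetical stand-in for the real counts). In this sketch the newline character goes at the end of the header and of every row, so the text does not start with an empty line and each row ends cleanly:

```python
# Toy data for illustration only
toy_places = {"Gaza": 7, "Rafah": 2}

# Step 1: the header, with the column names separated by a tab
tsv_text = "Place\tCount\n"

# Step 2: one row per place, each ending with a newline
for place, count in toy_places.items():
    tsv_text += f"{place}\t{count}\n"

print(tsv_text)
```

Building the text in memory first makes it easy to inspect before anything is written to disk.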

Fill in the blanks:

In [None]:
filename = "ner_counts.tsv"
# open the file in writing mode and with unicode UTF-8 encoding:
with open(filename, mode="w", encoding="utf-8") as file:
    # create the header of the tsv file, which consists of the column names separated by a tab:
    header = "Place\tCount\n"
    # write the header to the file:
    file.write(header)
    # now, loop through the normalized_places dictionary and create a new row for each item
    for place, count in normalized_places.items():
        row = f"{place}\t{count}\n"
        # finally, write the row to the file:
        file.write(row)

The file is now stored in our Colab session environment. You can see it by clicking the folder icon in the left-hand toolbar in Colab. Double-click it to view it in Colab. Right-click it and choose "Download" to download the file.

To access it in your script, use the path `/content/ner_counts.tsv`.

In [None]:
with open("/content/ner_counts.tsv", encoding="utf-8") as file:
  print(file.read())

Place	Count
Morocco	14
Israel	1632
Gaza	1830
Rabat	3
United States	162
United Arab Emirates	21
Bahrain	11
Sudan	3
US	717
Western Sahara	4
Washington	62
Tel Aviv	52
Algeria	7
Marrakesh	1
Maghreb	1
Ukraine	47
Saudi Arabia	39
California	3
West Bank	164
Dena	1
Oakland	1
South Africa	208
Jordan	43
Jerusalem	26
East Jerusalem	23
Egypt	44
Qatar	65
Kuala Lumpur	4
Malaysia	8
Palestine	125
Indonesia	3
Jakarta	2
Johannesburg	4
London	17
Paris	8
Vienna	1
Berlin	5
Amman	6
Washington DC	7
United Kingdom	152
Manchester	1
Yemen	189
India	50
Hyderabad	1
Colombo Kollupitiya	1
Namibia	10
Germany	31
Palestinian Territories	1
Sweden	3
Iran	210
Kerman	6
Lebanon	178
Bethlehem	4
Nairoukh	1
China	30
Italy	10
Spain	7
Turkey	25
Shawawra	1
Hague	39
Khan Younis	23
Syria	84
Mazzeh	2
Damascus	17
Houthis	3
Red Sea	250
BabelMandeb Strait	1
Gulf of Aden	27
Sanaa	15
Hodeidah	5
Taiz	2
Dhamar	1
alBayda	1
Saada	3
Arabian Sea	6
Bab alMandeb Strait	9
Asia	18
Europe	30
Kuwait	2
Middle East	102
Ankara	7
West	24
Tehran	25
South

# Geocoding

Geocoding is the process of finding coordinates for a place.

The process uses APIs, Application Programming Interfaces,
which are internet services that are designed not for human reading
but for being called by applications.

There are many APIs that provide geocoding services. They typically have a database of place names and their coordinates. If you send a geocoding API a place name, it will return its coordinates (and perhaps some other data). Many of them are not free. In our case, we'll use the free GeoNames API to find the coordinates of our place names.

First, try it out by pasting the following URL in your browser (make sure to replace `<your_user_name>` with your GeoNames user name):

`http://api.geonames.org/searchJSON?q=Gaza&maxRows=5&username=<your_user_name>`

Paste the response here:

{"totalResultsCount":5276,"geonames":[{"adminCode1":"GZ","lng":"34.46672","geonameId":281133,"toponymName":"Gaza","countryId":"6254930","fcl":"P","population":410000,"countryCode":"PS","name":"Gaza","fclName":"city, village,...","adminCodes1":{},"countryName":"Palestine","fcodeName":"seat of a first-order administrative division","adminName1":"Gaza Strip","lat":"31.50161","fcode":"PPLA"},{"adminCode1":"GZ","lng":"34.48347","geonameId":281129,"toponymName":"Jabālyā","countryId":"6254930","fcl":"P","population":168568,"countryCode":"PS","name":"Jabalia","fclName":"city, village,...","adminCodes1":{},"countryName":"Palestine","fcodeName":"populated place","adminName1":"Gaza Strip","lat":"31.5272","fcode":"PPL"},{"adminCode1":"GZ","lng":"34.30627","geonameId":281124,"toponymName":"Khān Yūnis","countryId":"6254930","fcl":"P","population":173183,"countryCode":"PS","name":"Khan Yunis","fclName":"city, village,...","adminCodes1":{},"countryName":"Palestine","fcodeName":"seat of a second-order administrative division","adminName1":"Gaza Strip","lat":"31.34018","fcode":"PPLA2"},{"adminCode1":"02","lng":"33","geonameId":1046058,"toponymName":"Gaza Province","countryId":"1036973","fcl":"A","population":1422460,"countryCode":"MZ","name":"Gaza Province","fclName":"country, state, region,...","adminCodes1":{"ISO3166_2":"G"},"countryName":"Mozambique","fcodeName":"first-order administrative division","adminName1":"Gaza Province","lat":"-23.5","fcode":"ADM1"},{"adminCode1":"GZ","lng":"34.24357","geonameId":281102,"toponymName":"Rafaḩ","countryId":"6254930","fcl":"P","population":126305,"countryCode":"PS","name":"Rafah","fclName":"city, village,...","adminCodes1":{},"countryName":"Palestine","fcodeName":"seat of a second-order administrative division","adminName1":"Gaza Strip","lat":"31.29722","fcode":"PPLA2"}]}
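
As a rough sketch of what happens with such a response in code: the example below builds the same query URL from its parameters and then reads the coordinates of the first result. It makes no network call; `sample_response` is an abbreviated copy of the response above (first result only, a few keys), and `your_user_name` is a placeholder.

```python
import json
from urllib.parse import urlencode

# Build the query URL from its parameters (the user name is a placeholder)
base_url = "http://api.geonames.org/searchJSON"
params = {"q": "Gaza", "maxRows": 5, "username": "your_user_name"}
query_url = f"{base_url}?{urlencode(params)}"
print(query_url)

# Abbreviated copy of the response above: first result only, a few keys
sample_response = (
    '{"totalResultsCount":5276,"geonames":[{"lng":"34.46672",'
    '"geonameId":281133,"name":"Gaza","countryName":"Palestine",'
    '"lat":"31.50161"}]}'
)

# Convert the JSON text into a Python dictionary
data = json.loads(sample_response)
# The results are a list under the "geonames" key; take the first one
first = data["geonames"][0]
# The API returns coordinates as strings; convert them to numbers if needed
latitude = float(first["lat"])
longitude = float(first["lng"])
print(first["name"], latitude, longitude)
```

Note that the first result is not always the intended place: the full response above also contains Gaza Province in Mozambique.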



I have created a function, `get_coordinates`, that takes a place name and your GeoNames user name as arguments and returns the coordinates. Please fill in your user name and run the code cell to make the function available:

In [None]:
import requests
import time
# used presentation slide 11.2
geonames_username = "abdus.salam2"

def get_coordinates(place, username=geonames_username, fuzzy=0, timeout=1):
  """This function gets a single set of coordinates from the geonames API.

  Args:
    place (str): the place name
    username (str): your geonames user name
    fuzzy (int): 0 = exact matching, 1 = fuzzy matching (allow similar but not exact matches)
    timeout (int): number of seconds to wait before a call to the geonames API
      (to avoid being blocked for overloading the server)

  Returns:
    dictionary: keys: latitude, longitude (values are strings),
      or None if no result was found
  """
  # wait a short while, so that we don't overload the server:
  time.sleep(timeout)
  # make the API call:
  url = "http://api.geonames.org/searchJSON?"
  params = {"q": place, "username": username, "fuzzy": fuzzy, "maxRows": 1, "isNameRequired": True}
  response = requests.get(url, params=params)
  # convert the response into a dictionary:
  results = response.json()
  print(results)
  # get the first result:
  try:
    result = results["geonames"][0]
    return {"latitude": result["lat"], "longitude": result["lng"]}
  except (IndexError, KeyError):
    print("No results found for your API call", response.request.url)
    # make the "no result" case explicit for the caller:
    return None

In [None]:
# Input and output filenames
input_filename = "ner_counts.tsv"
output_filename = "ner_gazetteer.tsv"

# Read place names from ner_counts.tsv
with open(input_filename, "r", encoding="utf-8") as file:
    lines = file.readlines()

# Skip the header and extract place names
place_names = [line.strip().split("\t")[0] for line in lines[1:]]

# Write results to ner_gazetteer.tsv
with open(output_filename, "w", encoding="utf-8") as out_file:
    out_file.write("Name\tLatitude\tLongitude\n")

    for name in place_names:
        coordinates = get_coordinates(name)
        if coordinates:
            lat = coordinates['latitude']
            lon = coordinates['longitude']
            out_file.write(f"{name}\t{lat}\t{lon}\n")
        else:
            out_file.write(f"{name}\tNA\tNA\n")

# Display the file
with open(output_filename, encoding="utf-8") as file:
    print(file.read())

{'totalResultsCount': 10, 'geonames': [{'adminCode1': '00', 'lng': '-10', 'geonameId': 2542007, 'toponymName': 'Kingdom of Morocco', 'countryId': '2542007', 'fcl': 'A', 'population': 36029138, 'countryCode': 'MA', 'name': 'Morocco', 'fclName': 'country, state, region,...', 'countryName': 'Morocco', 'fcodeName': 'independent political entity', 'adminName1': '', 'lat': '28.5', 'fcode': 'PCLI'}]}
{'totalResultsCount': 33, 'geonames': [{'adminCode1': '00', 'lng': '34.75', 'geonameId': 294640, 'toponymName': 'State of Israel', 'countryId': '294640', 'fcl': 'A', 'population': 8883800, 'countryCode': 'IL', 'name': 'Israel', 'fclName': 'country, state, region,...', 'countryName': 'Israel', 'fcodeName': 'independent political entity', 'adminName1': '', 'lat': '31.5', 'fcode': 'PCLI'}]}
{'totalResultsCount': 40, 'geonames': [{'adminCode1': 'GZ', 'lng': '34.46672', 'geonameId': 281133, 'toponymName': 'Gaza', 'countryId': '6254930', 'fcl': 'P', 'population': 410000, 'countryCode': 'PS', 'name': 'G

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive
