# Using stanza for Named Entity Recognition (continued)

In [1]:
#installing stanza library
!pip install stanza

Collecting stanza
  Downloading stanza-1.10.1-py3-none-any.whl.metadata (13 kB)
Collecting emoji (from stanza)
  Downloading emoji-2.14.1-py3-none-any.whl.metadata (5.7 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch>=1.3.0->stanza)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch>=1.3.0->stanza)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch>=1.3.0->stanza)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch>=1.3.0->stanza)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch>=1.3.0->stanza)
  Downloading nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_64.whl.metadata 

## Import library and download language model

After installing it, we import stanza into our notebook.

In [2]:
# import stanza, the NLP library we use for Named Entity Recognition (NER)
import stanza

## Creating the pipeline

Download the English language model and build the pipeline (we specify that it should only tokenize the text, separate multiword tokens and perform Named Entity Recognition):


In [3]:
# Download the language model:
stanza.download("en")

# Create the pipeline, specifying the language:
nlp = stanza.Pipeline(lang="en", processors='tokenize,mwt,ner')

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.10.0.json:   0%|  …

INFO:stanza:Downloaded file to /root/stanza_resources/resources.json
INFO:stanza:Downloading default packages for language: en (English) ...


Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.10.0/models/default.zip:   0%|          | …

INFO:stanza:Downloaded file to /root/stanza_resources/en/default.zip
INFO:stanza:Finished downloading models and saved to /root/stanza_resources
INFO:stanza:Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.10.0.json:   0%|  …

INFO:stanza:Downloaded file to /root/stanza_resources/resources.json
INFO:stanza:Loading these models for language: en (English):
| Processor | Package                   |
-----------------------------------------
| tokenize  | combined                  |
| mwt       | combined                  |
| ner       | ontonotes-ww-multi_charlm |

INFO:stanza:Using device: cuda
INFO:stanza:Loading: tokenize
INFO:stanza:Loading: mwt
INFO:stanza:Loading: ner
INFO:stanza:Done loading processors!


# Multiple files
Since we can do this for one file, we can also do it for a large number of files!
Let's download our FASDH25-portfolio2 git repository here. `git` is a shell command, not Python, so in Colab we put an exclamation mark before it (as we did with `pip`). Complete the command below and run it:


In [4]:
#now, cloning the repository
!git clone https://github.com/sara-baig/FASDH25-portfolio2.git

Cloning into 'FASDH25-portfolio2'...
remote: Enumerating objects: 4404, done.
remote: Counting objects: 100% (27/27), done.
remote: Compressing objects: 100% (23/23), done.
remote: Total 4404 (delta 14), reused 5 (delta 4), pack-reused 4377 (from 2)
Receiving objects: 100% (4404/4404), 19.43 MiB | 17.01 MiB/s, done.
Resolving deltas: 100% (26/26), done.


# Keeping only the articles from January 2024

In [5]:
import os

# create an empty dictionary that will contain our places with their frequencies:
places = {}
# folder that contains the articles:
folder = "/content/FASDH25-portfolio2/articles"
# counter for the number of January 2024 articles:
jan_2024_article_count = 0

# loop through all the files in the folder, keeping only those from January 2024:
for filename in os.listdir(folder):
  if "2024-01" in filename:
    jan_2024_article_count += 1
    # build the path to the file:
    path = os.path.join(folder, filename)
    #open and read the file:
    with open(path, encoding="utf-8") as file:
      text = file.read()
      #use the nlp pipeline to analyze the text:
      doc = nlp(text) #help from chatgpt (position of the code)
      # select only the entities that are place names:
      for e in doc.entities:
        if e.type in ["GPE", "LOC"]:
          place = e.text.strip()
          places[place] = places.get(place, 0) + 1 #help from chatgpt

print("Number of articles from January 2024:", jan_2024_article_count)
print(places)

Number of articles from January 2024: 326
{'Israel': 1593, 'Gaza': 1605, 'Palestine': 124, 'the United States': 97, 'Welch’s': 1, 'US': 706, 'Iraq': 62, 'United States': 40, 'West': 24, 'the Global South': 2, 'Qatar': 64, 'Gulf': 10, 'Egypt': 43, 'East Jerusalem': 23, 'Netanyahu’s': 7, 'Gaza Strip': 31, 'the Gaza Strip': 123, 'South Africa': 200, 'Russia': 43, 'Ukraine': 47, 'China': 28, 'South Africa’s': 8, 'Malaysia': 8, 'Turkey': 25, 'Jordan': 42, 'Bolivia': 4, 'Maldives': 1, 'Namibia': 10, 'Pakistan': 24, 'Columbia': 3, 'Khan Younis': 23, 'Middle East': 25, 'The Hague': 33, 'Bangladesh': 2, 'Comoros': 2, 'Djibouti': 4, 'Netherlands': 14, 'The United States': 21, 'The United Kingdom': 3, 'Myanmar': 6, 'Beirut': 84, 'Dahiyeh': 6, 'Lebanon': 175, 'Iran': 206, 'Yemen': 182, 'Beirut’s Shatila': 1, 'Red Sea': 50, 'Africa': 29, 'the Red Sea': 194, 'Gulf of Aden': 4, 'the Cape of Good Hope': 12, 'Singapore': 2, 'the Gulf of Aden': 23, 'The Red Sea': 5, 'Mediterranean': 11, 'the Indian Ocea
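
The counting pattern in the loop above (`places.get(place, 0) + 1`) can also be written with `collections.Counter`. The following is a minimal sketch, not the notebook's own code: `Entity` is a hypothetical stand-in for the objects in `doc.entities` (it only mimics their `text` and `type` attributes), so the example runs without downloading a model. The sample entities are made up.

```python
from collections import Counter, namedtuple

# Hypothetical stand-in for a stanza entity: only the two attributes
# used in the loop above (text and type) are reproduced here.
Entity = namedtuple("Entity", ["text", "type"])

def count_places(entities, place_types=("GPE", "LOC")):
    """Count how often each place name occurs among the given entities."""
    counts = Counter()
    for e in entities:
        if e.type in place_types:
            counts[e.text.strip()] += 1
    return counts

# Made-up sample entities, for illustration only:
sample = [
    Entity("Gaza", "GPE"),
    Entity("Netanyahu", "PERSON"),
    Entity("the Red Sea", "LOC"),
    Entity("Gaza", "GPE"),
]
place_counts = count_places(sample)
print(place_counts.most_common())
```

A `Counter` behaves like a dictionary, so it could be passed to the later cells in place of `places`; its `most_common()` method additionally returns the places sorted by frequency.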

# Cleaning named entities


In [6]:
import re

normalized_places = {}

for place, count in places.items(): #help from chat gpt

    # removing possessive endings like 's or ’s
    place = re.sub(r"[’'`]s\b", "", place)

    # stripping punctuation characters
    place = re.sub(r"[^\w\s]", "", place)

    # removing leading 'the', e.g., "The United States" to "United States"
    place = re.sub(r"^the\s+", "", place, flags=re.IGNORECASE)

    # combining the counts of places that share the same normalized name
    if place in normalized_places:# help from chatgpt
        normalized_places[place] += count
    else:
        normalized_places[place] = count

# printing normalized place names with their total counts
print(normalized_places)

{'Israel': 1625, 'Gaza': 1623, 'Palestine': 124, 'United States': 160, 'Welch': 1, 'US': 717, 'Iraq': 64, 'West': 24, 'Global South': 2, 'Qatar': 65, 'Gulf': 10, 'Egypt': 44, 'East Jerusalem': 23, 'Netanyahu': 7, 'Gaza Strip': 159, 'South Africa': 208, 'Russia': 43, 'Ukraine': 47, 'China': 30, 'Malaysia': 8, 'Turkey': 25, 'Jordan': 43, 'Bolivia': 4, 'Maldives': 1, 'Namibia': 10, 'Pakistan': 24, 'Columbia': 3, 'Khan Younis': 23, 'Middle East': 102, 'Hague': 39, 'Bangladesh': 2, 'Comoros': 2, 'Djibouti': 4, 'Netherlands': 14, 'United Kingdom': 43, 'Myanmar': 6, 'Beirut': 87, 'Dahiyeh': 6, 'Lebanon': 178, 'Iran': 209, 'Yemen': 188, 'Beirut Shatila': 1, 'Red Sea': 250, 'Africa': 29, 'Gulf of Aden': 27, 'Cape of Good Hope': 12, 'Singapore': 2, 'Mediterranean': 12, 'Indian Ocean': 2, 'Europe': 30, 'Asia': 18, 'Spain': 7, 'Canada': 42, 'Australia': 13, 'Britain': 14, 'Germany': 31, 'Italy': 10, 'Switzerland': 9, 'Finland': 3, 'Estonia': 1, 'Japan': 9, 'Austria': 3, 'Romania': 4, 'West Bank': 
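
The three substitutions above can be collected in one small function, which makes it easy to check their effect on individual names. This is a sketch that reuses the same regular expressions as the cell above; the function name `normalize_place` and the sample counts are illustrative assumptions, not part of the original notebook.

```python
import re
from collections import Counter

def normalize_place(place):
    """Apply the same three clean-up steps as the cell above."""
    place = re.sub(r"[’'`]s\b", "", place)                      # possessive endings
    place = re.sub(r"[^\w\s]", "", place)                       # punctuation
    place = re.sub(r"^the\s+", "", place, flags=re.IGNORECASE)  # leading "the"
    return place

# Made-up counts, for illustration only:
raw_counts = {"the Red Sea": 3, "Red Sea": 2, "South Africa’s": 1, "South Africa": 4}

# Add up the counts of all variants that normalize to the same name:
merged = Counter()
for place, count in raw_counts.items():
    merged[normalize_place(place)] += count

print(dict(merged))
print(normalize_place("The Hague"))
print(normalize_place("Deir el-Balah"))
```

Because punctuation is deleted rather than replaced by a space, a hyphenated name such as "Deir el-Balah" comes out as "Deir elBalah", and "The Hague" loses its article, which is consistent with the `Hague` entry in the output above.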

### Storing data in a tsv file

We can now store the counts in a tsv file, so we can reuse it in a different script.

Let's create a tsv file with two columns: "Place" (the name) and "Count" (the frequency).
We'll create the tsv file in two steps:

1. we create the header: that is, the column names, separated by tabs
2. we loop through all the place names, and we create a new row in the table for each place. Each row will contain the place name and its frequency, separated by a tab. Each row will have to start on a new line, so we'll also have to add a newline character \n to the row; should we add it at the beginning or end of the line, or both?

Fill in the blanks:

In [7]:
filename = "ner_counts.tsv"
# open the file in writing mode and with unicode UTF-8 encoding:
with open(filename, mode= "w", encoding= "utf-8") as file:
  # create the header of the tsv file, which consists of the column names separated by a tab:
  header = "Place\tCount\n"
  # write the header to the file:
  file.write(header)
  # now loop through the normalized_places dictionary and create a new row for each item:
  for place, count in normalized_places.items():#help from chatgpt
    row = f"{place}\t{count}\n"
    # finally, write the row to the file:
    file.write(row)

The file is now stored in the Colab session environment. You can see it by clicking the folder icon in the left-hand toolbar in Colab. Double-click the file to view it in Colab, or right-click it and choose "Download" to download it.

To access it in your script, use the path `/content/ner_counts.tsv`:

In [8]:
with open("/content/ner_counts.tsv", encoding="utf-8") as file:
  print(file.read())

Place	Count
Israel	1625
Gaza	1623
Palestine	124
United States	160
Welch	1
US	717
Iraq	64
West	24
Global South	2
Qatar	65
Gulf	10
Egypt	44
East Jerusalem	23
Netanyahu	7
Gaza Strip	159
South Africa	208
Russia	43
Ukraine	47
China	30
Malaysia	8
Turkey	25
Jordan	43
Bolivia	4
Maldives	1
Namibia	10
Pakistan	24
Columbia	3
Khan Younis	23
Middle East	102
Hague	39
Bangladesh	2
Comoros	2
Djibouti	4
Netherlands	14
United Kingdom	43
Myanmar	6
Beirut	87
Dahiyeh	6
Lebanon	178
Iran	209
Yemen	188
Beirut Shatila	1
Red Sea	250
Africa	29
Gulf of Aden	27
Cape of Good Hope	12
Singapore	2
Mediterranean	12
Indian Ocean	2
Europe	30
Asia	18
Spain	7
Canada	42
Australia	13
Britain	14
Germany	31
Italy	10
Switzerland	9
Finland	3
Estonia	1
Japan	9
Austria	3
Romania	4
West Bank	162
Syria	84
October7	2
Jerusalem	26
Dearborn	12
Michigan	12
Mackinac Island	1
Great Lakes	1
Lake Michigan	1
Afghanistan	7
Texas	3
Beit Nabala	1
Idlib	3
Hamas	6
Tel Aviv	51
Washington	62
Cairo	6
Doha	19
Nuseirat	11
Central Gaza Strip	2
Deir elB
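
To reuse the counts in a different script, the tsv file has to be parsed back into a dictionary. Below is a minimal sketch using Python's `csv` module. It first writes a small made-up sample to a temporary file so that it runs on its own; the file name `sample_counts.tsv` is an assumption, and the column names `Place` and `Count` follow the header written above.

```python
import csv
import os
import tempfile

# Made-up sample rows, for illustration only:
sample_counts = {"Gaza": 10, "Red Sea": 4}

# Write a small tsv file in the same layout as ner_counts.tsv:
path = os.path.join(tempfile.mkdtemp(), "sample_counts.tsv")
with open(path, mode="w", encoding="utf-8", newline="") as file:
    writer = csv.writer(file, delimiter="\t", lineterminator="\n")
    writer.writerow(["Place", "Count"])
    writer.writerows(sample_counts.items())

# Read it back: DictReader uses the header row as the keys of each row.
loaded = {}
with open(path, encoding="utf-8", newline="") as file:
    for row in csv.DictReader(file, delimiter="\t"):
        loaded[row["Place"]] = int(row["Count"])  # counts are read as strings

print(loaded)
```

Everything read from a tsv file is a string, so the counts are converted back to integers with `int()` before they are used.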

# Geocoding

Geocoding is the process of finding coordinates for a place.

The process uses APIs, Application Programming Interfaces,
which are internet services that are designed not for human reading
but for being called by applications.

There are many APIs that provide geocoding services. They typically have a database of place names and their coordinates. If you send a geocoding API a place name, it will return its coordinates (and perhaps some other data). Many of them are not free. In our case, we'll use the free GeoNames API to look up the coordinates of our place names.
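
A call to such an API is an ordinary URL whose query string carries the search parameters. The sketch below builds a GeoNames search URL with the standard library, without sending it; `demo_user` is a placeholder, not a real account.

```python
from urllib.parse import urlencode

base_url = "http://api.geonames.org/searchJSON"

# q = the place name to search for, maxRows = maximum number of results.
# "demo_user" is a placeholder: replace it with your own GeoNames user name.
params = {"q": "Gaza", "maxRows": 5, "username": "demo_user"}

# urlencode joins the parameters as key=value pairs separated by "&":
url = base_url + "?" + urlencode(params)
print(url)
```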

First, try it out by pasting the following URL in your browser (make sure to replace `<your_user_name>` with your GeoNames user name):

`http://api.geonames.org/searchJSON?q=Gaza&maxRows=5&username=<your_user_name>`

Paste the response here:
{
  "totalResultsCount": 5276,
  "geonames": [
    {
      "adminCode1": "GZ",
      "lng": "34.46672",
      "geonameId": 281133,
      "toponymName": "Gaza",
      "countryId": "6254930",
      "fcl": "P",
      "population": 410000,
      "countryCode": "PS",
      "name": "Gaza",
      "fclName": "city, village,...",
      "adminCodes1": {

      },
      "countryName": "Palestine",
      "fcodeName": "seat of a first-order administrative division",
      "adminName1": "Gaza Strip",
      "lat": "31.50161",
      "fcode": "PPLA"
    },
    {
      "adminCode1": "GZ",
      "lng": "34.48347",
      "geonameId": 281129,
      "toponymName": "Jabālyā",
      "countryId": "6254930",
      "fcl": "P",
      "population": 168568,
      "countryCode": "PS",
      "name": "Jabalia",
      "fclName": "city, village,...",
      "adminCodes1": {

      },
      "countryName": "Palestine",
      "fcodeName": "populated place",
      "adminName1": "Gaza Strip",
      "lat": "31.5272",
      "fcode": "PPL"
    },
    {
      "adminCode1": "GZ",
      "lng": "34.30627",
      "geonameId": 281124,
      "toponymName": "Khān Yūnis",
      "countryId": "6254930",
      "fcl": "P",
      "population": 173183,
      "countryCode": "PS",
      "name": "Khan Yunis",
      "fclName": "city, village,...",
      "adminCodes1": {

      },
      "countryName": "Palestine",
      "fcodeName": "seat of a second-order administrative division",
      "adminName1": "Gaza Strip",
      "lat": "31.34018",
      "fcode": "PPLA2"
    },
    {
      "adminCode1": "02",
      "lng": "33",
      "geonameId": 1046058,
      "toponymName": "Gaza Province",
      "countryId": "1036973",
      "fcl": "A",
      "population": 1422460,
      "countryCode": "MZ",
      "name": "Gaza Province",
      "fclName": "country, state, region,...",
      "adminCodes1": {
        "ISO3166_2": "G"
      },
      "countryName": "Mozambique",
      "fcodeName": "first-order administrative division",
      "adminName1": "Gaza Province",
      "lat": "-23.5",
      "fcode": "ADM1"
    },
    {
      "adminCode1": "GZ",
      "lng": "34.24357",
      "geonameId": 281102,
      "toponymName": "Rafaḩ",
      "countryId": "6254930",
      "fcl": "P",
      "population": 126305,
      "countryCode": "PS",
      "name": "Rafah",
      "fclName": "city, village,...",
      "adminCodes1": {

      },
      "countryName": "Palestine",
      "fcodeName": "seat of a second-order administrative division",
      "adminName1": "Gaza Strip",
      "lat": "31.29722",
      "fcode": "PPLA2"
    }
  ]
}
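
The response is JSON: a `geonames` list with one dictionary per match. Below is a minimal sketch of pulling the name, country and coordinates out of such a response, using a shortened copy of the first and fourth results above. Note that `lat` and `lng` arrive as strings.

```python
import json

# Shortened copy of two results from the response above:
response_text = """
{
  "totalResultsCount": 5276,
  "geonames": [
    {"name": "Gaza", "countryName": "Palestine", "lat": "31.50161", "lng": "34.46672"},
    {"name": "Gaza Province", "countryName": "Mozambique", "lat": "-23.5", "lng": "33"}
  ]
}
"""

# convert the JSON text into a Python dictionary:
results = json.loads(response_text)

matches = []
for item in results["geonames"]:
    # convert the coordinate strings to numbers:
    matches.append((item["name"], item["countryName"], float(item["lat"]), float(item["lng"])))

for match in matches:
    print(match)
```

The second match shows why the first result is not always the intended place: a search for "Gaza" also returns Gaza Province in Mozambique.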


I have created a function, `get_coordinates`, that takes a place name and your GeoNames user name as arguments and returns the coordinates. Please fill in your user name and run the code cell to make the function available:

In [9]:
import requests
import time

geonames_username = "dilawaiz.deeyadidar"

def get_coordinates(place, username=geonames_username, fuzzy=0, timeout=1): # Help from chatgpt
  """This function gets a single set of coordinates from the geonames API.

  Args:
    place (str): the place name
    username (str): your geonames user name
    fuzzy (int): 0 = exact matching, 1 = fuzzy matching (allow similar but not exact matches)
    timeout (int): number of seconds to wait before a call to the geonames API
      (to avoid being blocked for overloading the server)

  Returns:
    dictionary with the keys "latitude" and "longitude",
      or None if the API returned no result
  """
  # wait a short while, so that we don't overload the server:
  time.sleep(timeout) #help from chatgpt
  # make the API call:
  url = "http://api.geonames.org/searchJSON?" #help from chatgpt
  params = {"q": place, "username": username, "fuzzy": fuzzy, "maxRows": 1, "isNameRequired": True}#help from chatgpt
  response = requests.get(url, params=params) #help from chatgpt
  # convert the response into a dictionary:
  results = response.json()
  print(results)
  # get the first result:
  try: #help from chatgpt
    result = results["geonames"][0]
    return {"latitude": result["lat"], "longitude": result["lng"]}
  except (IndexError, KeyError):
    print("No results found for your API call", response.request.url)
    # make the "no result" case explicit for the caller:
    return None

Now, reuse the code above to get the coordinates for the place names we stored in the `ner_counts.tsv` file.

Write a new tsv file, `ner_gazetteer.tsv`, which contains three columns: name, latitude, longitude.

In [10]:
#input file
input_file = "ner_counts.tsv"
#output_file
output_file = "ner_gazetteer.tsv"

#read the place names from ner_counts.tsv
with open(input_file, "r", encoding="utf-8") as file:
  lines = file.readlines()

#skip the header and extract place names
place_names = [line.strip().split("\t")[0] for line in lines[1:]]

#write results to ner_gazetteer.tsv
with open(output_file, "w", encoding="utf-8") as out_file: #help from chatgpt
  out_file.write("Name\tLatitude\tLongitude\n") #help from chatgpt

  for name in place_names:
      coordinates = get_coordinates(name)
      if coordinates:
        lat = coordinates['latitude']
        lon = coordinates['longitude']
        out_file.write(f"{name}\t{lat}\t{lon}\n")
      else:
        out_file.write(f"{name}\tNA\tNA\n")

#display the file
with open(output_file, encoding="utf-8") as file:
  print(file.read())



{'totalResultsCount': 33, 'geonames': [{'adminCode1': '00', 'lng': '34.75', 'geonameId': 294640, 'toponymName': 'State of Israel', 'countryId': '294640', 'fcl': 'A', 'population': 8883800, 'countryCode': 'IL', 'name': 'Israel', 'fclName': 'country, state, region,...', 'countryName': 'Israel', 'fcodeName': 'independent political entity', 'adminName1': '', 'lat': '31.5', 'fcode': 'PCLI'}]}
{'totalResultsCount': 40, 'geonames': [{'adminCode1': 'GZ', 'lng': '34.46672', 'geonameId': 281133, 'toponymName': 'Gaza', 'countryId': '6254930', 'fcl': 'P', 'population': 410000, 'countryCode': 'PS', 'name': 'Gaza', 'fclName': 'city, village,...', 'adminCodes1': {}, 'countryName': 'Palestine', 'fcodeName': 'seat of a first-order administrative division', 'adminName1': 'Gaza Strip', 'lat': '31.50161', 'fcode': 'PPLA'}]}
{'totalResultsCount': 49, 'geonames': [{'adminCode1': '00', 'lng': '35.20329', 'geonameId': 6254930, 'toponymName': 'Palestine', 'countryId': '6254930', 'fcl': 'A', 'population': 45690