# Using stanza for Named Entity Recognition (continued)

## Installation

Run the code cell below to install stanza:

In [5]:
# Install the stanza library
!pip install stanza



## Import library

After installing it, we import stanza into our notebook.

In [6]:
import stanza
import os
import requests
import time

## Creating the pipeline

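Download the English language model and build the pipeline. We request only the `tokenize` and `ner` processors, so the pipeline tokenizes the text and performs Named Entity Recognition; as the log below shows, stanza also loads the multi-word token (`mwt`) processor alongside them: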


In [7]:
# Download the language model:
stanza.download("en")

# Create the pipeline, specifying the language:
nlp = stanza.Pipeline(lang="en", processors="tokenize,ner")

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.10.0.json:   0%|  …

INFO:stanza:Downloaded file to /root/stanza_resources/resources.json
INFO:stanza:Downloading default packages for language: en (English) ...
INFO:stanza:File exists: /root/stanza_resources/en/default.zip
INFO:stanza:Finished downloading models and saved to /root/stanza_resources
INFO:stanza:Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.10.0.json:   0%|  …

INFO:stanza:Downloaded file to /root/stanza_resources/resources.json
INFO:stanza:Loading these models for language: en (English):
| Processor | Package                   |
-----------------------------------------
| tokenize  | combined                  |
| mwt       | combined                  |
| ner       | ontonotes-ww-multi_charlm |

INFO:stanza:Using device: cuda
INFO:stanza:Loading: tokenize
INFO:stanza:Loading: mwt
INFO:stanza:Loading: ner
INFO:stanza:Done loading processors!


## Cloning the repository



In [8]:
!git clone https://github.com/wajali7/FASDH25-portfolio2.git

fatal: destination path 'FASDH25-portfolio2' already exists and is not an empty directory.
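
The `fatal` message above appears because the folder was already cloned in an earlier run. One way to make the cell safe to re-run is to check for the folder first. The sketch below is only an illustration: `clone_if_missing` is a hypothetical helper name, not part of the original notebook, and it assumes `git` is available on the system path.

```python
import os
import subprocess

def clone_if_missing(url: str, dest: str) -> bool:
    """Clone `url` into `dest` unless `dest` already exists (hypothetical helper).

    Returns True if a clone was attempted, False if it was skipped.
    """
    if os.path.isdir(dest):
        # The folder is already there, for example after re-running the notebook.
        return False
    # check=True raises CalledProcessError if git reports a failure.
    subprocess.run(["git", "clone", url, dest], check=True)
    return True

# Possible usage in the notebook (URL taken from the cell above):
# clone_if_missing("https://github.com/wajali7/FASDH25-portfolio2.git", "FASDH25-portfolio2")
```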


Loop through the articles published in January 2024, run each one through the `nlp` pipeline to create a stanza document, and count every entity tagged as a place (`GPE` or `LOC`):

### Articles from January 2024


In [10]:
# Set up the file path
import os

# Prepare an empty dictionary to track how frequently each place occurs:
places = {}
# Go through each file located in the folder:
folder = "/content/FASDH25-portfolio2/articles"
jan_2024_article_count = 0
for filename in os.listdir(folder):
  # Keep only the articles whose filename contains "2024-01":
  if "2024-01" in filename:
    jan_2024_article_count += 1
    # Build the full path to the file:
    path = os.path.join(folder, filename)
    # Open the file and load its contents:
    with open(path, encoding="utf-8") as file:
      text = file.read()
    # Process the text using the NLP pipeline:
    doc = nlp(text)
    # Keep only the entities that represent locations:
    for e in doc.entities:
      if e.type in ["GPE", "LOC"]:
        place = e.text.strip()
        places[place] = places.get(place, 0) + 1

print("Number of articles from January 2024:", jan_2024_article_count)
print(places)


Number of articles from January 2024: 326
{'Israel': 1593, 'Gaza': 1605, 'Palestine': 124, 'the United States': 97, 'Welch’s': 1, 'US': 706, 'Iraq': 62, 'United States': 40, 'West': 24, 'the Global South': 2, 'Qatar': 64, 'Gulf': 10, 'Egypt': 43, 'East Jerusalem': 23, 'Netanyahu’s': 7, 'Gaza Strip': 31, 'the Gaza Strip': 123, 'South Africa': 200, 'Russia': 43, 'Ukraine': 47, 'China': 28, 'South Africa’s': 8, 'Malaysia': 8, 'Turkey': 25, 'Jordan': 42, 'Bolivia': 4, 'Maldives': 1, 'Namibia': 10, 'Pakistan': 24, 'Columbia': 3, 'Khan Younis': 23, 'Middle East': 25, 'The Hague': 33, 'Bangladesh': 2, 'Comoros': 2, 'Djibouti': 4, 'Netherlands': 14, 'The United States': 21, 'The United Kingdom': 3, 'Myanmar': 6, 'Beirut': 84, 'Dahiyeh': 6, 'Lebanon': 175, 'Iran': 206, 'Yemen': 182, 'Beirut’s Shatila': 1, 'Red Sea': 50, 'Africa': 29, 'the Red Sea': 194, 'Gulf of Aden': 4, 'the Cape of Good Hope': 12, 'Singapore': 2, 'the Gulf of Aden': 23, 'The Red Sea': 5, 'Mediterranean': 11, 'the Indian Oce
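
The filter-and-count step can be tried out without loading any model. The sketch below uses hand-written `(text, type)` pairs as a stand-in for the entities stanza returns; the sample data and the names `sample_entities`, `PLACE_TYPES` and `count_places` are made up for illustration. It uses `collections.Counter` instead of a plain dictionary, which is a swapped-in alternative that adds a `most_common()` method for ranking.

```python
from collections import Counter

# Hypothetical (text, type) pairs standing in for stanza's doc.entities.
sample_entities = [
    ("Gaza", "GPE"),
    ("the Red Sea", "LOC"),
    ("Netanyahu", "PERSON"),
    ("Gaza", "GPE"),
    ("Hamas", "ORG"),
]

# Entity types treated as places, as in the cell above.
PLACE_TYPES = {"GPE", "LOC"}

def count_places(entities):
    """Count how often each place-type entity text occurs."""
    counts = Counter()
    for text, ent_type in entities:
        if ent_type in PLACE_TYPES:
            counts[text.strip()] += 1
    return counts

place_counts = count_places(sample_entities)
print(place_counts.most_common())  # [('Gaza', 2), ('the Red Sea', 1)]
```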

### Place names


In [11]:
import re

#create a dictionary for cleaned place names:
normalized_places = {}

for place, count in places.items():
  #eliminate endings like 's or ’s from words: - help from ChatGPT (Conversation 1)
  place = re.sub(r"[’'`]s\b", "", place)
  #remove the article 'the' from the start of place names:
  place = re.sub(r"^the\s+", "", place, flags=re.IGNORECASE)
  #adding up counts for places that have the same normalized form:
  if place in normalized_places:
    normalized_places[place] += count
  else:
    normalized_places[place] = count
#print each normalized place name with how many times it appeared:
print(normalized_places)


{'Israel': 1625, 'Gaza': 1623, 'Palestine': 124, 'United States': 158, 'Welch': 1, 'US': 706, 'Iraq': 64, 'West': 24, 'Global South': 2, 'Qatar': 65, 'Gulf': 10, 'Egypt': 44, 'East Jerusalem': 23, 'Netanyahu': 7, 'Gaza Strip': 159, 'South Africa': 208, 'Russia': 43, 'Ukraine': 47, 'China': 30, 'Malaysia': 8, 'Turkey': 25, 'Jordan': 43, 'Bolivia': 4, 'Maldives': 1, 'Namibia': 10, 'Pakistan': 24, 'Columbia': 3, 'Khan Younis': 23, 'Middle East': 102, 'Hague': 39, 'Bangladesh': 2, 'Comoros': 2, 'Djibouti': 4, 'Netherlands': 14, 'United Kingdom': 43, 'Myanmar': 6, 'Beirut': 87, 'Dahiyeh': 6, 'Lebanon': 178, 'Iran': 209, 'Yemen': 188, 'Beirut Shatila': 1, 'Red Sea': 249, 'Africa': 29, 'Gulf of Aden': 27, 'Cape of Good Hope': 12, 'Singapore': 2, 'Mediterranean': 12, 'Indian Ocean': 2, 'Europe': 30, 'Asia': 18, 'Spain': 7, 'Canada': 42, 'Australia': 13, 'Britain': 14, 'Germany': 31, 'Italy': 10, 'Switzerland': 9, 'Finland': 3, 'Estonia': 1, 'Japan': 9, 'Austria': 3, 'Romania': 4, 'West Bank': 
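
To see how the merging works on a small scale, the sketch below applies the same two regular expressions to a handful of entries copied from the raw output above. The helper name `normalize_place` is introduced here only for illustration.

```python
import re

def normalize_place(name: str) -> str:
    """Strip a possessive ending and a leading 'the' (hypothetical helper)."""
    name = re.sub(r"[’'`]s\b", "", name)
    name = re.sub(r"^the\s+", "", name, flags=re.IGNORECASE)
    return name

# A few raw counts taken from the output of the previous cell.
raw_counts = {
    "the United States": 97,
    "United States": 40,
    "The United States": 21,
    "South Africa": 200,
    "South Africa’s": 8,
}

merged = {}
for name, count in raw_counts.items():
    key = normalize_place(name)
    # Add up the counts of all variants that share a normalized form.
    merged[key] = merged.get(key, 0) + count

print(merged)  # {'United States': 158, 'South Africa': 208}
```

Because the second pattern ignores case, `normalize_place("The Hague")` returns `"Hague"`, which is why the output above lists `Hague` rather than `The Hague`. Names in which the article is part of the name may need to be exempted by hand.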

### Data in File (TSV)

In [12]:
filename = 'ner_counts.tsv'

#open the file for writing using UTF-8 encoding:
with open(filename, mode='w', encoding='utf-8') as file:
  #write the TSV file header with column titles separated by tabs:
  header = 'Place\tCount\n'
  #save the header line into the file:
  file.write(header)
  #go through each item in the normalized_places dictionary and write it as a new row:
  for place, count in normalized_places.items():
    row = f"{place}\t{count}\n"
    #write the current row into the file:
    file.write(row)
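
As an alternative to building each row by hand, Python's standard `csv` module can write tab-separated files and takes care of quoting if a value ever contains a tab or a newline. The sketch below is a possible variant, not the notebook's own method; `write_counts_tsv` is a hypothetical helper, and it writes to an in-memory buffer so that it does not touch `ner_counts.tsv`.

```python
import csv
import io

def write_counts_tsv(counts, file):
    """Write a {place: count} dictionary to `file` as tab-separated rows."""
    writer = csv.writer(file, delimiter="\t", lineterminator="\n")
    writer.writerow(["Place", "Count"])
    for place, count in counts.items():
        writer.writerow([place, count])

# Two entries taken from the normalized output above.
buffer = io.StringIO()
write_counts_tsv({"Israel": 1625, "Gaza": 1623}, buffer)
print(buffer.getvalue())
```

When writing to a real file instead of a buffer, the `csv` documentation recommends opening it with `newline=""`, for example `open("ner_counts.tsv", "w", encoding="utf-8", newline="")`.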


The unexecuted cell below writes the raw, un-normalized `places` dictionary to the same `ner_counts.tsv` file. Running it after the previous cell would overwrite the normalized counts:

In [None]:
filename = "ner_counts.tsv"
# open the file in writing mode and with unicode UTF-8 encoding:
with open(filename, mode="w", encoding="utf-8") as file:
  # create a header of the tsv files, which consists of the column names separated by a tab:
  header = "name\tfrequency\n"
  # write the header to the file:
  file.write(header)
  # Now, loop through the places dictionary and create a new row for each item in the dictionary
  for name, frequency in places.items():
    row = f"{name}\t{frequency}\n"
    # finally, write the row to the file:
    file.write(row)

In [13]:
with open("/content/ner_counts.tsv", encoding="utf-8") as file:
  print(file.read())

Place	Count
Israel	1625
Gaza	1623
Palestine	124
United States	158
Welch	1
US	706
Iraq	64
West	24
Global South	2
Qatar	65
Gulf	10
Egypt	44
East Jerusalem	23
Netanyahu	7
Gaza Strip	159
South Africa	208
Russia	43
Ukraine	47
China	30
Malaysia	8
Turkey	25
Jordan	43
Bolivia	4
Maldives	1
Namibia	10
Pakistan	24
Columbia	3
Khan Younis	23
Middle East	102
Hague	39
Bangladesh	2
Comoros	2
Djibouti	4
Netherlands	14
United Kingdom	43
Myanmar	6
Beirut	87
Dahiyeh	6
Lebanon	178
Iran	209
Yemen	188
Beirut Shatila	1
Red Sea	249
Africa	29
Gulf of Aden	27
Cape of Good Hope	12
Singapore	2
Mediterranean	12
Indian Ocean	2
Europe	30
Asia	18
Spain	7
Canada	42
Australia	13
Britain	14
Germany	31
Italy	10
Switzerland	9
Finland	3
Estonia	1
Japan	9
Austria	3
Romania	4
West Bank	162
Syria	84
#October7	2
Jerusalem	26
Dearborn	12
Michigan	12
Mackinac Island	1
Great Lakes	1
Lake Michigan	1
Afghanistan	7
Texas	3
Beit Nabala	1
Idlib	3
Hamas	6
Tel Aviv	51
Washington	62
Cairo	6
Doha	19
Nuseirat	11
Central Gaza Strip	2
Deir el