<a href="https://colab.research.google.com/github/rickrari/Creating-a-Custom-spaCy-NER-Model-to-recognize-lowercase-path-/blob/main/PATH_NER_Analysis_CustomModel_Codebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Data Analysis**

This section continues the explanation of the project, showing how the author tested spaCy’s ability to recognize PATH as an entity, both capitalized and in lowercase. It also shows how the author created a custom spaCy model: annotating data by hand, converting it into spaCy’s JSON format, realizing the data was not structured properly, re-formatting and re-annotating the data by hand, re-converting it into a JSON file, and ultimately creating a spaCy model that recognizes lowercase ‘path’ as an organization.

The following seven lines of code are adapted from [Krisel, NER_Workshop, 2024](https://github.com/rskrisel/NER_workshop/blob/main/NER_workshop.ipynb). This code imports the libraries and models necessary to run a Named Entity Recognition (NER) analysis.

In [1]:
!pip install -U spacy

Collecting spacy
  Downloading spacy-3.8.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (27 kB)
Collecting thinc<8.4.0,>=8.3.0 (from spacy)
  Downloading thinc-8.3.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (15 kB)
Collecting blis<1.2.0,>=1.1.0 (from thinc<8.4.0,>=8.3.0->spacy)
  Downloading blis-1.1.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.7 kB)
Downloading spacy-3.8.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (29.1 MB)
Downloading thinc-8.3.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.7 MB)
Downloading blis-1.1.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (9.2 MB)

In [2]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
Installing collected packages: en-core-web-sm
  Attempting uninstall: en-core-web-sm
    Found existing installation: en-core-web-sm 3.7.1
    Uninstalling en-core-web-sm-3.7.1:
      Successfully uninstalled en-core-web-sm-3.7.1
Successfully installed en-core-web-sm-3.8.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


This code imports all necessary libraries.

In [3]:
import spacy
from spacy import displacy
import en_core_web_sm
import pandas as pd
pd.options.display.max_rows = 600
pd.options.display.max_colwidth = 400
import glob
from pathlib import Path
import requests
import pprint
from bs4 import BeautifulSoup

en_core_web_sm is a general-purpose English spaCy model capable of recognizing parts of speech, dependencies, and named entities. It is the smallest of the English models, suited to general-purpose analysis that prioritizes speed. spaCy also offers medium (en_core_web_md) and large (en_core_web_lg) models that require more memory and time to run but can provide greater accuracy.

In [4]:
nlp = en_core_web_sm.load()

This code defines a variable called "filepath" and assigns it the path of the text file, then reads the file's contents into a variable called "text", decoding them as UTF-8 so the characters are read correctly. Finally, the code processes the text with the spaCy nlp model loaded above and stores the result as an object called "doc".

In [None]:
filepath = "/content/drive/MyDrive/PythonCourse/FinalProject/factiva/text_file_1.txt" #filepath created in PATH_NER_DataCollection Codebook, replace as relevant
text = open(filepath, encoding='utf-8').read()
doc = nlp(text)

This code renders all entities spaCy recognizes in the "doc" object, showing how spaCy has categorized each one. Note that spaCy recognizes PATH as an organization (ORG) most of the time, but not always: in the first instance in this text, spaCy categorized PATH as a person.

In [None]:
displacy.render(doc, style="ent")

This code provides a list of all entities spaCy identifies.

In [None]:
doc.ents

(a quarter century,
 the Port Authority Trans-Hudson,
 PATH,
 Douglas John Bowen,
 the New Jersey Association of Railroad Passengers,
 West Orange,
 rider advocacy group,
 more than 200,
 the eve of the 25th anniversary,
 the Port Authority's,
 PATH,
 210,000,
 New Jersey,
 Lower,
 Manhattan,
 Hudson River,
 the Port Authority of New York,
 New Jersey,
 PATH,
 $200 million,
 95,
 Two thirds,
 20 percent,
 248,
 PATH,
 1965,
 next spring,
 About 80,
 the Port Authority,
 1986,
 1,431,
 190,
 1984,
 121,
 a month,
 July,
 120,
 1986,
 Richard R. Kelly,
 the PATH Corporation,
 PATH,
 38,
 Kelly,
 the Port Authority,
 two years ago,
 Jersey City,
 Jersey City,
 four years ago,
 PATH,
 Hyman Sprekman,
 Equitable Life,
 Hoboken,
 one morning,
 last week,
 West 33rd Street,
 Manhattan,
 half-dozen,
 PATH,
 14-mile,
 1,130,
 Hudson River,
 Newark,
 Harrison,
 Jersey City,
 Hoboken,
 Manhattan,
 the World Trade Center,
 Christopher Street,
 33rd Street,
 Hudson,
 Manhattan Railroad,
 the Hudson

To assess how well the pre-loaded spaCy model identifies PATH as an ORG, the author re-ran the two cells below for each file: the first creates a doc object from the UTF-8 text of the specified file, and the second renders how spaCy categorizes the entities in that file. In total, the author selected 10 random files and checked them manually, using Ctrl+F in Microsoft Word to find every instance of PATH and verify whether the spaCy model had labeled it as an ORG. This process (files 1, 100, 16, 45, 73, 64, 28, 55, 87, and 34) yielded 48 instances of "PATH", 46 (96%) of which the model correctly identified as an ORG.

The count can be seen in this [spreadsheet](https://docs.google.com/spreadsheets/d/1fdYyYqKSd--WGtIIpc6jfCu8v772JPej/edit?usp=sharing&ouid=117069048546080524054&rtpof=true&sd=true) (entries labeled UPPER).
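
Part of the manual tally could also be approximated in code. The sketch below is not the author's method: it assumes the entities have already been pulled out of a doc as `(text, label)` pairs, for example with `[(ent.text, ent.label_) for ent in doc.ents]`, and the function names and sample data are hypothetical. It only sees mentions that spaCy tagged as an entity with exactly the text "PATH", so instances the model missed entirely would still need a manual check.

```python
from collections import Counter

def label_counts(entities, target="PATH"):
    """Count the labels assigned to entities whose text equals `target`."""
    return Counter(label for text, label in entities if text == target)

def share_with_label(entities, target="PATH", label="ORG"):
    """Return the fraction of `target` entities tagged `label`, or None if there are none."""
    counts = label_counts(entities, target)
    total = sum(counts.values())
    return counts[label] / total if total else None

# Hypothetical sample; in the notebook the pairs could come from
# [(ent.text, ent.label_) for ent in doc.ents]
sample = [("PATH", "PERSON"), ("PATH", "ORG"), ("Hoboken", "GPE"), ("PATH", "ORG")]
print(label_counts(sample))
print(share_with_label(sample))
```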

In [None]:
filepath = "/content/drive/MyDrive/PythonCourse/FinalProject/factiva/text_file_100.txt"
text = open(filepath, encoding='utf-8').read()
doc = nlp(text)

In [None]:
displacy.render(doc, style="ent")

In [None]:
doc.ents

($154 million,
 the New York City Housing Authority,
 Philip B. Banks III,
 New York City,
 two years,
 Eric Adams,
 Banks,
 Banks,
 years earlier,
 millions of dollars,
 The New York Times,
 City Safe Partners,
 $154 million,
 the New York City Housing Authority,
 January 2024,
 Brooklyn,
 Manhattan,
 Bronx,
 Sheena Wright,
 first,
 Adams,
 Banks,
 the housing authority's,
 Banks,
 this month,
 Banks,
 David Banks,
 Terence Banks,
 Philip Banks,
 one,
 at least four,
 Adams,
 City Hall,
 Adams,
 City Safe Partners,
 Banks,
 Adams,
 U.S.,
 the Southern District,
 New York,
 three,
 four,
 Department of Investigation,
 Banks,
 2015,
 City Hall,
 Times,
 last year,
 2018,
 Banks,
 City Safe Partners,
 Banks,
 100 percent,
 July 9, 2018,
 Benjamin Brafman,
 Banks,
 City Safe,
 Overwatch Services LLC,
 Banks,
 Overwatch,
 less than two years,
 Brafman,
 Friday,
 Liz Garcia,
 Banks,
 years,
 Sheena Wright,
 Xavier R. Donaldson,
 2010,
 Dwayne Montgomery,
 Adams,
 the Police Department,
 Mon

In [None]:
filepath = "/content/drive/MyDrive/PythonCourse/FinalProject/factiva/text_file_16.txt"
text = open(filepath, encoding='utf-8').read()
doc = nlp(text)

In [None]:
displacy.render(doc, style="ent")

In [None]:
for named_entity in doc.ents:
    print(named_entity, named_entity.label_)

first ORDINAL
1 MONEY
the George Washington Bridge ORG
Lincoln ORG
Holland Tunnels ORG
PATH ORG
the Port Authority Trans-Hudson ORG
Hudson LOC
PATH ORG
Stephen Berger PERSON
the Port Authority ORG
trans-Hudson NORP
PATH ORG
the Port Authority's ORG
trans-Hudson NORP
PATH ORG
$5.8 billion MONEY
nearly half CARDINAL
three CARDINAL
1987 DATE
Berger PERSON
5-year DATE
Next year DATE
$1.3 billion MONEY
Berger PERSON
only $200 million MONEY
The Port Authority's ORG
Berger PERSON
New York City's GPE
the Williamsburg Bridge ORG
Berger PERSON
$300 million MONEY
Newark Airport FAC
First ORDINAL
Berger PERSON
$300 million MONEY
second ORDINAL
Newark GPE
Elizabeth PERSON
second ORDINAL
two CARDINAL
1987 DATE
29 million CARDINAL
45 million CARDINAL
the year 2000 DATE
Today DATE
21 million CARDINAL
a year DATE
People Express ORG
one CARDINAL
the year 2000 DATE
only a decade DATE
Newark Airport FAC
The Port Authority ORG
one CARDINAL
the Metropolitan Region LOC
$25 million MONEY
Brooklyn GPE
the Fult

In [None]:
filepath = "/content/drive/MyDrive/PythonCourse/FinalProject/factiva/text_file_45.txt"
text = open(filepath, encoding='utf-8').read()
doc = nlp(text)

In [None]:
displacy.render(doc, style="ent")

In [None]:
filepath = "/content/drive/MyDrive/PythonCourse/FinalProject/factiva/text_file_73.txt"
text = open(filepath, encoding='utf-8').read()
doc = nlp(text)

In [None]:
displacy.render(doc, style="ent")

In [None]:
filepath = "/content/drive/MyDrive/PythonCourse/FinalProject/factiva/text_file_64.txt"
text = open(filepath, encoding='utf-8').read()
doc = nlp(text)

In [None]:
displacy.render(doc, style="ent")

In [None]:
filepath = "/content/drive/MyDrive/PythonCourse/FinalProject/factiva/text_file_28.txt"
text = open(filepath, encoding='utf-8').read()
doc = nlp(text)

In [None]:
displacy.render(doc, style="ent")

In [None]:
filepath = "/content/drive/MyDrive/PythonCourse/FinalProject/factiva/text_file_55.txt"
text = open(filepath, encoding='utf-8').read()
doc = nlp(text)

In [None]:
displacy.render(doc, style="ent")

In [None]:
filepath = "/content/drive/MyDrive/PythonCourse/FinalProject/factiva/text_file_87.txt"
text = open(filepath, encoding='utf-8').read()
doc = nlp(text)

In [None]:
displacy.render(doc, style="ent")

In [None]:
filepath = "/content/drive/MyDrive/PythonCourse/FinalProject/factiva/text_file_34.txt"
text = open(filepath, encoding='utf-8').read()
doc = nlp(text)

In [None]:
displacy.render(doc, style="ent")
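
The ten pairs of cells above differ only in the file number. As a hedged alternative, the paths could be built in a loop. The folder and file IDs below are taken from the cells above; the helper names `sample_paths` and `read_existing` are hypothetical, and files that are not present are simply skipped.

```python
from pathlib import Path

# Folder and file IDs taken from the cells above
BASE_DIR = Path("/content/drive/MyDrive/PythonCourse/FinalProject/factiva")
SAMPLE_IDS = [1, 100, 16, 45, 73, 64, 28, 55, 87, 34]

def sample_paths(base_dir=BASE_DIR, ids=SAMPLE_IDS):
    """Build the path of each sampled text file, in the order it was checked."""
    return [Path(base_dir) / f"text_file_{n}.txt" for n in ids]

def read_existing(paths):
    """Yield (path, text) for every path that exists, decoding the file as UTF-8."""
    for path in paths:
        if path.is_file():
            yield path, path.read_text(encoding="utf-8")

for path, text in read_existing(sample_paths()):
    # In the notebook, this is where doc = nlp(text) and displacy.render(...) would go
    print(path.name, len(text))
```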

This code, generated by Gemini based on the included prompt, created a lowercase copy of all the Factiva text files and stored the copies in a new folder in the same Drive.

In [None]:
# prompt: Hello, how can I turn all the text stored in the files in this folder [/content/drive/MyDrive/PythonCourse/FinalProject/factiva_lowercase_copies] to lower case?

import os
import glob

def convert_to_lowercase(input_dir, output_dir):
    """Converts all text files in a directory to lowercase and saves them to a new directory.
    """

    # Create the output directory if it doesn't exist
    os.makedirs(output_dir, exist_ok=True)

    for filename in glob.glob(os.path.join(input_dir, "*.txt")):
        with open(filename, 'r', encoding='utf-8') as infile:
            try:
                text = infile.read().lower()
            except UnicodeDecodeError:
                print(f"Skipping file due to decoding error: {filename}")
                continue  # Skip the file if decoding fails

        output_filename = os.path.join(output_dir, os.path.basename(filename))
        with open(output_filename, 'w', encoding='utf-8') as outfile:
            outfile.write(text)

# Example usage:
input_directory = "/content/drive/MyDrive/PythonCourse/FinalProject/factiva_lowercase_copies"
output_directory = "/content/drive/MyDrive/PythonCourse/FinalProject/factiva_lowercase_copies_output"  # Change to your desired output directory

convert_to_lowercase(input_directory, output_directory)

The author repeated the process outlined above to check manually whether the spaCy model could recognize "path" as an ORG in the lowercase files. The author checked the same 10 files and found that the spaCy model did not correctly identify a single instance of lowercase "path". This result demonstrates the need for a custom spaCy model capable of recognizing lowercase "path".

The count can be seen in this [spreadsheet](https://docs.google.com/spreadsheets/d/1fdYyYqKSd--WGtIIpc6jfCu8v772JPej/edit?usp=sharing&ouid=117069048546080524054&rtpof=true&sd=true) (entries labeled lower).

In [None]:
filepath = "/content/drive/MyDrive/PythonCourse/FinalProject/factiva_lowercase_copies_output/Copy of text_file_1.txt"
text = open(filepath, encoding='utf-8').read()
doc = nlp(text)

In [None]:
displacy.render(doc, style="ent")

In [None]:
filepath = "/content/drive/MyDrive/PythonCourse/FinalProject/factiva_lowercase_copies_output/Copy of text_file_100.txt"
text = open(filepath, encoding='utf-8').read()
doc = nlp(text)

In [None]:
displacy.render(doc, style="ent")

In [None]:
filepath = "/content/drive/MyDrive/PythonCourse/FinalProject/factiva_lowercase_copies_output/Copy of text_file_16.txt"
text = open(filepath, encoding='utf-8').read()
doc = nlp(text)

In [None]:
displacy.render(doc, style="ent")

In [None]:
filepath = "/content/drive/MyDrive/PythonCourse/FinalProject/factiva_lowercase_copies_output/Copy of text_file_45.txt"
text = open(filepath, encoding='utf-8').read()
doc = nlp(text)

The following cells are part of a failed attempt to create a spaCy model capable of recognizing "path". Based on advice from [Gemini](https://docs.google.com/document/d/1xD7D5rlBM9Xy49AWZIpf2ai_DVQl4h_J/edit?usp=sharing&ouid=117069048546080524054&rtpof=true&sd=true), the author used Python to find the beginning and ending position of each instance of "path" in the sampled text files above. The author then stored that information in a [spreadsheet](https://docs.google.com/spreadsheets/d/13XCvm59AuoztD2aH7qN5MSkv4Qqb4ziJyzB4MkpHDnM/edit?usp=sharing) with the text in one column, followed by three columns for each instance of path (entity, start position, and end position). The author also added two self-generated practice texts with 16 instances of "path", five of which were the common noun "path", included to train the model against false positives.

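The longest text had 23 instances of "path", so the sheet needed 23 × 3 = 69 entity columns in addition to the text and labeling columns; the DataFrame printed below has 72 columns in total (File, Notes, Text, plus the 69 entity columns). The complexity of this structure is likely why this attempt failed.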

In [None]:
import re
# Uses the re (regular expressions) module to find the beginning and end position
# of each "path" in a text file (based on code from ChatGPT). These positions were
# used to manually encode each instance of path in a Google Sheet. The sheet (as a CSV)
# is later loaded into a pandas DataFrame and converted to JSON to train a new spaCy model.

# Filepath to your text file
file_path = "/content/drive/MyDrive/PythonCourse/FinalProject/factiva_lowercase_copies_output/Copy of text_file_1.txt"

# Read the content of the file
with open(file_path, 'r', encoding='utf-8') as file:
    text = file.read()

# Use regular expressions to find all occurrences of "path"
matches = re.finditer(r'\bpath\b', text)  # \b ensures we match "path" as a whole word

# Store the positions (start and end)
positions = [(match.start(), match.end()) for match in matches]

# Print the results
print("Positions of 'path':")
for start, end in positions:
    print(f"Start: {start}, End: {end}")


Positions of 'path':
Start: 130, End: 134
Start: 474, End: 478
Start: 872, End: 876
Start: 1155, End: 1159
Start: 2178, End: 2182
Start: 2231, End: 2235
Start: 2771, End: 2775
Start: 3136, End: 3140
Start: 3679, End: 3683
Start: 3865, End: 3869
Start: 4075, End: 4079
Start: 4463, End: 4467
Start: 5075, End: 5079
Start: 5622, End: 5626
Start: 6750, End: 6754


In [None]:
# Filepath to your text file
file_path = "/content/drive/MyDrive/PythonCourse/FinalProject/factiva_lowercase_copies_output/Copy of text_file_16.txt"

# Read the content of the file
with open(file_path, 'r', encoding='utf-8') as file:
    text = file.read()

# Use regular expressions to find all occurrences of "path"
matches = re.finditer(r'\bpath\b', text)  # \b ensures we match "path" as a whole word

# Store the positions (start and end)
positions = [(match.start(), match.end()) for match in matches]

# Print the results
print("Positions of 'path':")
for start, end in positions:
    print(f"Start: {start}, End: {end}")

Positions of 'path':
Start: 302, End: 306
Start: 481, End: 485
Start: 660, End: 664
Start: 1016, End: 1020


In [None]:
# Filepath to your text file
file_path = "/content/drive/MyDrive/PythonCourse/FinalProject/factiva_lowercase_copies_output/Copy of text_file_73.txt"

# Read the content of the file
with open(file_path, 'r', encoding='utf-8') as file:
    text = file.read()

# Use regular expressions to find all occurrences of "path"
matches = re.finditer(r'\bpath\b', text)  # \b ensures we match "path" as a whole word

# Store the positions (start and end)
positions = [(match.start(), match.end()) for match in matches]

# Print the results
print("Positions of 'path':")
for start, end in positions:
    print(f"Start: {start}, End: {end}")

Positions of 'path':
Start: 588, End: 592


In [None]:
# Filepath to your text file
file_path = "/content/drive/MyDrive/PythonCourse/FinalProject/factiva_lowercase_copies_output/Copy of text_file_64.txt"

# Read the content of the file
with open(file_path, 'r', encoding='utf-8') as file:
    text = file.read()

# Use regular expressions to find all occurrences of "path"
matches = re.finditer(r'\bpath\b', text)  # \b ensures we match "path" as a whole word

# Store the positions (start and end)
positions = [(match.start(), match.end()) for match in matches]

# Print the results
print("Positions of 'path':")
for start, end in positions:
    print(f"Start: {start}, End: {end}")

Positions of 'path':
Start: 4037, End: 4041
Start: 4768, End: 4772
Start: 4965, End: 4969


Based on input from [Gemini](https://docs.google.com/document/d/1xD7D5rlBM9Xy49AWZIpf2ai_DVQl4h_J/edit?usp=sharing&ouid=117069048546080524054&rtpof=true&sd=true), the following nine lines of code load the training data from the spreadsheet into a pandas DataFrame, then convert it to JSON for use with spaCy.

In [5]:
#uploads the training data from the google sheet csv into a DataFrame
import pandas as pd

df = pd.read_csv('/content/drive/MyDrive/PythonCourse/FinalProject/NER_Training_Sheet_csv.csv')

In [6]:
print(df)

                   File                                              Notes  \
0  Copy of text_file_45                                                NaN   
1   Copy of text_file_1                                                NaN   
2  Copy of text_file_16                  Augmented with 1 non-entity PATH    
3  Copy of text_file_73                                                NaN   
4  Copy of text_file_64                                                NaN   
5  Copy of text_file_34                  Augmented with 1 non-entity PATH    
6        Practice_Text1  Contains 8 instances of path, 5 entity-instances    
7        Practice_Text1  Contains 8 instances of path, 6 entity-instances    

                                                Text Entity 1  Start 1  \
0  in a dispute that could cause tensions between...     path    239.0   
1  it has taken a quarter century of sporadic tra...     path    130.0   
2  the first reaction to the proposed $1 toll inc...     path    302.0   
3

In [9]:
print(df.columns)

Index(['File', 'Notes', 'Text', 'Entity 1', 'Start 1', 'End 1', 'Entity 2',
       'Start 2', 'End 2', 'Entity 3', 'Start 3', 'End 3', 'Entity 4',
       'Start 4', 'End 4', 'Entity 5', 'Start 5', 'End 5', 'Entity 6',
       'Start 6', 'End 6', 'Entity 7', 'Start 7', 'End 7', 'Entity 8',
       'Start 8', 'End 8', 'Entity 9', 'Start 9', 'End 9', 'Entity 10',
       'Start 10', 'End 10', 'Entity 11', 'Start 11', 'End 11', 'Entity 12',
       'Start 12', 'End 12', 'Entity 13', 'Start 13', 'End 13', 'Entity 14',
       'Start 14', 'End 14', 'Entity 15', 'Start 15', 'End 15', 'Entity 16',
       'Start 16', 'End 16', 'Entity 17', 'Start 17', 'End 17', 'Entity 18',
       'Start 18', 'End 18', 'Entity 19', 'Start 19', 'End 19', 'Entity 20',
       'Start 20', 'End 20', 'Entity 21', 'Start 21', 'End 21', 'Entity 22',
       'Start 22', 'End 22', 'Entity 23', 'Start 23', 'End 23'],
      dtype='object')


In [7]:
json_result = df.to_json()

In [8]:
print(json_result)

{"File":{"0":"Copy of text_file_45","1":"Copy of text_file_1","2":"Copy of text_file_16","3":"Copy of text_file_73","4":"Copy of text_file_64","5":"Copy of text_file_34","6":"Practice_Text1","7":"Practice_Text1"},"Notes":{"0":null,"1":null,"2":"Augmented with 1 non-entity PATH ","3":null,"4":null,"5":"Augmented with 1 non-entity PATH ","6":"Contains 8 instances of path, 5 entity-instances ","7":"Contains 8 instances of path, 6 entity-instances "},"Text":{"0":"in a dispute that could cause tensions between their republican governors, new york and new jersey are increasingly at odds over whether to raise the tolls on six bridges and tunnels that connect the two states, as well as the fare on the path commuter railway. officials in new jersey are lining up against a higher path fare and tolls, fearful of angering their constituents, who use these services more than new yorkers do. but many of their new york counterparts favor raising at least some of these fees, which, they maintain, are 

In [10]:
import json

training_data = []
for index, row in df.iterrows():
    entities = []
    for n in range(1, 24):  # columns "Entity 1" ... "Entity 23" with matching Start/End
        entity = row[f'Entity {n}']
        if pd.notna(entity):  # skip unused (NaN) entity slots
            start = int(row[f'Start {n}'])
            end = int(row[f'End {n}'])
            entities.append({'start': start, 'end': end, 'label': entity})
    training_data.append({'text': row['Text'], 'spans': entities})

with open('train_data.json', 'w') as f:
    json.dump(training_data, f, indent=4)  # 'indent' for readability

In [12]:
print(training_data)

[{'text': 'in a dispute that could cause tensions between their republican governors, new york and new jersey are increasingly at odds over whether to raise the tolls on six bridges and tunnels that connect the two states, as well as the fare on the path commuter railway. officials in new jersey are lining up against a higher path fare and tolls, fearful of angering their constituents, who use these services more than new yorkers do. but many of their new york counterparts favor raising at least some of these fees, which, they maintain, are being kept artificially low. the issue is flaring now because of a budget deficit at the port authority of new york and new jersey, the bistate agency that runs the path and the six crossings: the george washington bridge, the lincoln and holland tunnels and the three bridges between staten island and new jersey. the port authority tolls have not risen since 1991, and the path, or port authority trans-hudson, fare has been $1 since 1987. and earlier

In [31]:
# Open the file with a with-statement and read its contents into a variable with json.load()
with open('train_data.json', 'r') as f:
    train_data = json.load(f)

In [30]:
print(train_data)

[{'text': 'in a dispute that could cause tensions between their republican governors, new york and new jersey are increasingly at odds over whether to raise the tolls on six bridges and tunnels that connect the two states, as well as the fare on the path commuter railway. officials in new jersey are lining up against a higher path fare and tolls, fearful of angering their constituents, who use these services more than new yorkers do. but many of their new york counterparts favor raising at least some of these fees, which, they maintain, are being kept artificially low. the issue is flaring now because of a budget deficit at the port authority of new york and new jersey, the bistate agency that runs the path and the six crossings: the george washington bridge, the lincoln and holland tunnels and the three bridges between staten island and new jersey. the port authority tolls have not risen since 1991, and the path, or port authority trans-hudson, fare has been $1 since 1987. and earlier

Based on input from [Gemini](https://docs.google.com/document/d/1xD7D5rlBM9Xy49AWZIpf2ai_DVQl4h_J/edit?usp=sharing&ouid=117069048546080524054&rtpof=true&sd=true), the following 11 lines of code attempt to use the JSON-formatted training data to create a custom spaCy model. The attempt ultimately failed with an assertion error, which suggests that something went wrong when converting the data to the format spaCy expects.

In [9]:
!pip install -U spacy



In [10]:
!pip install -U spacy-transformers



In [16]:
!python -m spacy init config config.cfg --lang en --pipeline ner --optimize efficiency --force
# Added --force so the command overwrites the config.cfg generated by an earlier attempt.
# The intent was to enable GPU support after installing spacy-transformers, but note that
# the output below still reports "Hardware: CPU" and "Transformer: None": the generated
# config only targets GPU when the --gpu flag is passed.

  _torch_pytree._register_pytree_node(
  _torch_pytree._register_pytree_node(
ℹ Generated config template specific for your use case
- Language: en
- Pipeline: ner
- Optimize for: efficiency
- Hardware: CPU
- Transformer: None
✔ Auto-filled config with all values
✔ Saved config
config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


In [4]:
!python -m spacy train config.cfg --output ./output --paths.train train_data.json --paths.dev train_data.json

ℹ Saving to output directory: output
ℹ Using CPU
  _torch_pytree._register_pytree_node(
  _torch_pytree._register_pytree_node(
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/spacy/__main__.py", line 4, in <module>
    setup_cli()
  File "/usr/local/lib/python3.10/dist-packages/spacy/cli/_util.py", line 87, in setup_cli
    command(prog_name=COMMAND)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/typer/core.py", line 743, in main
    return _main(
  File "/usr/local/lib/python3.10/dist-packages/typer/core.py", line 198, in _main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.10/dist-

In [24]:
!python -m spacy debug data config.cfg

  _torch_pytree._register_pytree_node(
  _torch_pytree._register_pytree_node(
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/spacy/__main__.py", line 4, in <module>
    setup_cli()
  File "/usr/local/lib/python3.10/dist-packages/spacy/cli/_util.py", line 87, in setup_cli
    command(prog_name=COMMAND)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/typer/core.py", line 743, in main
    return _main(
  File "/usr/local/lib/python3.10/dist-packages/typer/core.py", line 198, in _main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1688, in invoke
    return _process_result(su

In [None]:
nlp = spacy.load("/content/output")  # Replace with your model path
doc = nlp("The path train was delayed.")
displacy.render(doc, style="ent")

OSError: [E053] Could not read meta.json from /content/output

In [None]:
filepath = "/content/drive/MyDrive/PythonCourse/FinalProject/factiva/text_file_1.txt"
text = open(filepath, encoding='utf-8').read()
doc = nlp(text)

In [32]:
!pwd

/content


In [33]:
!ls -l

total 56
-rw-r--r-- 1 root root  2719 Dec 17 15:24 config.cfg
drwx------ 7 root root  4096 Dec 17 14:12 drive
drwxr-xr-x 2 root root  4096 Dec 17 14:18 output
drwxr-xr-x 1 root root  4096 Dec 12 14:22 sample_data
-rw-r--r-- 1 root root 40698 Dec 17 14:16 train_data.json


In [2]:
# Restart runtime (from the Runtime menu)

# After restart:
import spacy

# If using spacy.util.load_config():
config = spacy.util.load_config("config.cfg")

# Run training command:
!python -m spacy train config.cfg --output ./output --paths.train train_data.json --paths.dev train_data.json

ℹ Saving to output directory: output
ℹ Using CPU
  _torch_pytree._register_pytree_node(
  _torch_pytree._register_pytree_node(
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/spacy/__main__.py", line 4, in <module>
    setup_cli()
  File "/usr/local/lib/python3.10/dist-packages/spacy/cli/_util.py", line 87, in setup_cli
    command(prog_name=COMMAND)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/typer/core.py", line 743, in main
    return _main(
  File "/usr/local/lib/python3.10/dist-packages/typer/core.py", line 198, in _main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.10/dist-
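
One possible contributing factor, separate from the layout of the spreadsheet, is the file type passed to the training command. The `init config` output above ends with a hint that names `./train.spacy` and `./dev.spacy`, and in spaCy v3 the `train` command reads binary `.spacy` (DocBin) files. The sketch below is an assumption-laden illustration, not the author's code: it assumes examples shaped as `(text, {"entities": [(start, end, label)]})`, and the function name `to_docbin` and the output filename are hypothetical.

```python
import spacy
from spacy.tokens import DocBin

def to_docbin(examples, lang="en"):
    """Convert (text, {"entities": [(start, end, label), ...]}) pairs into a DocBin."""
    nlp = spacy.blank(lang)  # tokenizer only; no trained components are needed
    doc_bin = DocBin()
    for text, annotations in examples:
        doc = nlp.make_doc(text)
        spans = []
        for start, end, label in annotations["entities"]:
            span = doc.char_span(start, end, label=label)
            if span is None:
                # Offsets that do not line up with token boundaries cannot be used
                print(f"Skipping misaligned span ({start}, {end}) in: {text[:40]!r}")
                continue
            spans.append(span)
        doc.ents = spans
        doc_bin.add(doc)
    return doc_bin

# Hypothetical one-sentence example in the lowercase style used in this project
examples = [("the path train was delayed.", {"entities": [(4, 8, "ORG")]})]
doc_bin = to_docbin(examples)
# doc_bin.to_disk("train.spacy")  # hypothetical filename, then pass it via --paths.train
```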

Realizing that the structure of the training data in the above attempt was likely too complex, the author tried again with simplified training data in which each text example is a single sentence containing one instance of "path". The author added 53 improvised examples of "path" as an ORG and removed all non-ORG "path" examples. The new [spreadsheet](https://docs.google.com/spreadsheets/d/1zfVMfk40XIxD_A_7jWCGxcRIcVTEZRmjJ0Q0FNP5YA0/edit?usp=sharing) has 102 rows of data and only six columns: two labeling columns, the text, the entity label (ORG), and the start and end positions of "path".
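
The author built the one-sentence rows by hand. As a rough illustration of the idea only, the sketch below splits a text into sentences and records the position of "path" relative to each sentence rather than to the whole text. The function name `one_sentence_rows` and the demo text are hypothetical, the sentence split is deliberately naive (it would break on abbreviations), and every match is labeled ORG, so common-noun uses of "path" would still have to be removed by hand. A sentence with two matches yields two rows.

```python
import re

def one_sentence_rows(text, word="path", label="ORG"):
    """Return one (sentence, start, end, label) row per whole-word match of `word`.

    Offsets are relative to the sentence, not to the full text.
    """
    rows = []
    pattern = re.compile(rf"\b{re.escape(word)}\b")
    # Naive split: a sentence ends at ., ! or ? followed by whitespace
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        for match in pattern.finditer(sentence):
            rows.append((sentence, match.start(), match.end(), label))
    return rows

demo = "fares on the path rose. officials objected. the path carries commuters."
for row in one_sentence_rows(demo):
    print(row)
```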

Following input from [ChatGPT](https://chatgpt.com/c/6761b0bb-bd7c-800e-ad42-2d8d36219139) (also available [here](https://docs.google.com/document/d/1M2MxxGUtpZKZgdzQLhae9BQgsllu5n40J5cSQlUYRSI/edit?usp=sharing)), the author repeated the process of loading the training data into a pandas DataFrame and then converting it to JSON format.

In [1]:
import pandas as pd

# Path to your CSV file in Google Drive
file_path = '/content/drive/MyDrive/PythonCourse/FinalProject/CSV_Single_Line_Training_Data - Sheet1.csv'

# Load the CSV into a DataFrame
df = pd.read_csv(file_path)

# Preview the DataFrame to confirm it's loaded correctly
print(df.head())


                 source  chunk  \
0  Copy of text_file_45  45.01   
1  Copy of text_file_45  45.02   
2  Copy of text_file_45  45.03   
3  Copy of text_file_45  45.04   
4  Copy of text_file_45  45.05   

                                                text entity  start  end  
0  in a dispute that could cause tensions between...    ORG    239  243  
1  officials in new jersey are lining up against ...    ORG     55   59  
2  but many of their new york counterparts favor ...    ORG    274  278  
3  the port authority tolls have not risen since ...    ORG     60   64  
4  many new york officials say they are acutely a...    ORG     92   96  


In [2]:
import json

# Prepare spaCy training data
training_data = []

for _, row in df.iterrows():
    text = row['text']
    start = row['start']
    end = row['end']
    label = row['entity']  # Should be "ORG"

    # Append in spaCy's required format
    entities = {"entities": [(start, end, label)]}
    training_data.append((text, entities))

# Save the training data to a JSON file
output_file = '/content/drive/MyDrive/PythonCourse/FinalProject/path_training_data.json'
with open(output_file, 'w') as f:
    json.dump(training_data, f)

print(f"Training data saved to {output_file}")


Training data saved to /content/drive/MyDrive/PythonCourse/FinalProject/path_training_data.json


This code checks that the data was successfully written to the JSON file.

In [3]:
# Open and read the JSON file
with open('/content/drive/MyDrive/PythonCourse/FinalProject/path_training_data.json', 'r') as f:
    data = json.load(f)

# Print the first 5 entries to check that the JSON file is properly written
for i, entry in enumerate(data[:5]):  # Adjust the number to preview more
    print(f"Entry {i + 1}: {entry}")


Entry 1: ['in a dispute that could cause tensions between their republican governors, new york and new jersey are increasingly at odds over whether to raise the tolls on six bridges and tunnels that connect the two states, as well as the fare on the path commuter railway.', {'entities': [[239, 243, 'ORG']]}]
Entry 2: ['officials in new jersey are lining up against a higher path fare and tolls, fearful of angering their constituents, who use these services more than new yorkers do.', {'entities': [[55, 59, 'ORG']]}]
Entry 3: ['but many of their new york counterparts favor raising at least some of these fees, which, they maintain, are being kept artificially low. the issue is flaring now because of a budget deficit at the port authority of new york and new jersey, the bistate agency that runs the path and the six crossings: the george washington bridge, the lincoln and holland tunnels and the three bridges between staten island and new jersey.', {'entities': [[274, 278, 'ORG']]}]
Entry 4
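The preview shows each entity span in square brackets, although the earlier cell built the spans as tuples. This is expected: JSON has no tuple type, so tuples are written as arrays and read back as lists. The short sketch below demonstrates the round trip on a made-up example sentence (not taken from the training data) and shows one way to convert the spans back to tuples if needed.

```python
import json

# One training example in the (text, {"entities": [(start, end, label)]}) shape
example = ("a higher path fare", {"entities": [(9, 13, "ORG")]})

# Write to a JSON string and read it back, as the file-based cells do
restored = json.loads(json.dumps([example]))[0]
text, annotations = restored
print(restored)

# Convert the spans back to tuples if downstream code expects them
spans = [tuple(span) for span in annotations["entities"]]
print(spans)
```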

Based on input from [ChatGPT](https://chatgpt.com/c/6761b0bb-bd7c-800e-ad42-2d8d36219139), the following code creates a new spaCy model from the JSON-formatted training data.

The code then saves the new model as FinalProject/path_ner_model, in the same Drive folder as the training data.

In [4]:
pip install spacy




In [5]:
import json
import random

import spacy
from spacy.training import Example

# Load a blank model for English
nlp = spacy.blank("en")

# Add a blank Named Entity Recognizer (NER) component, or reuse an existing one
if "ner" not in nlp.pipe_names:
    ner = nlp.add_pipe("ner")
else:
    ner = nlp.get_pipe("ner")

# Add the "ORG" label to the NER component
ner.add_label("ORG")

# Load the training data
training_data_path = '/content/drive/MyDrive/PythonCourse/FinalProject/path_training_data.json'
with open(training_data_path, 'r') as f:
    training_data = json.load(f)

# Convert data to spaCy's Example format
examples = []
for text, annotations in training_data:
    examples.append(Example.from_dict(nlp.make_doc(text), annotations))

# Train the NER pipeline
optimizer = nlp.initialize()

for epoch in range(30):  # Adjust the number of epochs if needed
    random.shuffle(examples)
    for example in examples:
        # Update the model one example at a time, with 50% dropout
        nlp.update([example], drop=0.5, sgd=optimizer)

# Save the trained model
output_dir = "/content/drive/MyDrive/PythonCourse/FinalProject/path_ner_model"
nlp.to_disk(output_dir)
print(f"Model saved to {output_dir}")




Model saved to /content/drive/MyDrive/PythonCourse/FinalProject/path_ner_model


This code loads the newly trained spaCy model and tests it on a simple example sentence. The results show the model recognizes "path" as an ORG.

In [6]:
# Load the trained model
import spacy

model_path = "/content/drive/MyDrive/PythonCourse/FinalProject/path_ner_model"
nlp = spacy.load(model_path)

# Test the model
test_text = "I am traveling via path to New Jersey."
doc = nlp(test_text)

for ent in doc.ents:
    print(ent.text, ent.label_)


path ORG


The remaining code allowed the author to check the efficacy of the new model, using the technique outlined above: rendering the entities spaCy recognizes in each text and manually counting the instances of "path". A review of 10 sample documents plus another four randomly selected texts yielded 65 instances of the word "path", 64 of which referred to Port Authority Trans-Hudson (i.e. "path") and one of which was the common word. The new model correctly identified 63 (98%) of the 64 instances of "path" as ORGs, and incorrectly identified the one instance of the common word as an ORG.

The count can be seen in this [spreadsheet](https://docs.google.com/spreadsheets/d/1fdYyYqKSd--WGtIIpc6jfCu8v772JPej/edit?usp=sharing&ouid=117069048546080524054&rtpof=true&sd=true) (entries labeled lower_new_model).
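The counts reported above can also be expressed as precision and recall. The author does not use this framing; the sketch below simply applies the standard definitions to the reported numbers (64 ORG instances, 63 of them identified, and one common-word instance wrongly labeled ORG).

```python
true_positives = 63   # ORG uses of "path" that the model labeled ORG
false_negatives = 1   # ORG uses of "path" that the model missed (64 - 63)
false_positives = 1   # the common word "path" that the model labeled ORG

# Precision: share of the model's ORG labels that were correct
precision = true_positives / (true_positives + false_positives)
# Recall: share of the true ORG instances that the model found
recall = true_positives / (true_positives + false_negatives)
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Both values come out to 63/64, or about 98%. With only one common-word instance in the sample, the false-positive behavior rests on a single observation.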


In [9]:
from spacy import displacy


In [10]:
filepath = "/content/drive/MyDrive/PythonCourse/FinalProject/factiva_lowercase_copies_output/Copy of text_file_16.txt"
text = open(filepath, encoding='utf-8').read()
doc = nlp(text)

In [11]:
displacy.render(doc, style="ent")

In [12]:
filepath = "/content/drive/MyDrive/PythonCourse/FinalProject/factiva_lowercase_copies_output/Copy of text_file_1.txt"
text = open(filepath, encoding='utf-8').read()
doc = nlp(text)

In [13]:
displacy.render(doc, style="ent")

In [14]:
filepath = "/content/drive/MyDrive/PythonCourse/FinalProject/factiva_lowercase_copies_output/Copy of text_file_45.txt"
text = open(filepath, encoding='utf-8').read()
doc = nlp(text)

In [15]:
displacy.render(doc, style="ent")

In [16]:
filepath = "/content/drive/MyDrive/PythonCourse/FinalProject/factiva_lowercase_copies_output/Copy of text_file_100.txt"
text = open(filepath, encoding='utf-8').read()
doc = nlp(text)

In [17]:
displacy.render(doc, style="ent")

In [18]:
filepath = "/content/drive/MyDrive/PythonCourse/FinalProject/factiva_lowercase_copies_output/Copy of text_file_73.txt"
text = open(filepath, encoding='utf-8').read()
doc = nlp(text)

In [19]:
displacy.render(doc, style="ent")

In [20]:
filepath = "/content/drive/MyDrive/PythonCourse/FinalProject/factiva_lowercase_copies_output/Copy of text_file_64.txt"
text = open(filepath, encoding='utf-8').read()
doc = nlp(text)

In [21]:
displacy.render(doc, style="ent")

In [22]:
filepath = "/content/drive/MyDrive/PythonCourse/FinalProject/factiva_lowercase_copies_output/Copy of text_file_28.txt"
text = open(filepath, encoding='utf-8').read()
doc = nlp(text)

In [23]:
displacy.render(doc, style="ent")



In [24]:
filepath = "/content/drive/MyDrive/PythonCourse/FinalProject/factiva_lowercase_copies_output/Copy of text_file_34.txt"
text = open(filepath, encoding='utf-8').read()
doc = nlp(text)

In [25]:
displacy.render(doc, style="ent")

In [28]:
filepath = "/content/drive/MyDrive/PythonCourse/FinalProject/factiva_lowercase_copies_output/Copy of text_file_38.txt"
text = open(filepath, encoding='utf-8').read()
doc = nlp(text)

In [29]:
displacy.render(doc, style="ent")

In [30]:
filepath = "/content/drive/MyDrive/PythonCourse/FinalProject/factiva_lowercase_copies_output/Copy of text_file_44.txt"
text = open(filepath, encoding='utf-8').read()
doc = nlp(text)

In [31]:
displacy.render(doc, style="ent")

In [32]:
filepath = "/content/drive/MyDrive/PythonCourse/FinalProject/factiva_lowercase_copies_output/Copy of text_file_84.txt"
text = open(filepath, encoding='utf-8').read()
doc = nlp(text)

In [33]:
displacy.render(doc, style="ent")

In [35]:
filepath = "/content/drive/MyDrive/PythonCourse/FinalProject/factiva_lowercase_copies_output/Copy of text_file_7.txt"
text = open(filepath, encoding='utf-8').read()
doc = nlp(text)

In [36]:
displacy.render(doc, style="ent")
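The cells above repeat the same load-and-render steps for each file. As a possible shortcut, the tallying could be partly automated with a helper like the one below. This is a sketch under assumptions: `count_path_mentions` is a hypothetical name, and `nlp` is expected to be the loaded custom model (or any callable returning an object with an `ents` attribute). It does not replace the manual review, since deciding whether each "path" refers to the transit system still requires reading the text.

```python
import re
from pathlib import Path


def count_path_mentions(nlp, filepaths):
    """For each file, count occurrences of the word "path" in the raw text
    and the ORG entities whose text is exactly "path"."""
    counts = {}
    for filepath in filepaths:
        text = Path(filepath).read_text(encoding="utf-8")
        doc = nlp(text)
        word_count = len(re.findall(r"\bpath\b", text))
        org_count = sum(
            1 for ent in doc.ents if ent.label_ == "ORG" and ent.text == "path"
        )
        counts[Path(filepath).name] = (word_count, org_count)
    return counts


# Example call, assuming `nlp` and a list of file paths are already defined:
# counts = count_path_mentions(nlp, [filepath])
```

A file where the two numbers differ is a natural candidate for closer inspection with `displacy.render`.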

**Conclusions, Limitations, and Next Steps**

The model is now capable of identifying "path" as an entity, even when spelled in lowercase letters. However, the model should be further trained with more data to increase its accuracy in new contexts. Further training data should include instances of the common word "path", so that the model will not give false positives (incorrectly identifying the common word as an ORG). Once the model is properly trained, it should be combined with an existing spaCy model capable of recognizing other entities (possibly the en_core_web_sm model used at the beginning of this project). Once such a combined model exists, researchers can perform all necessary data cleaning on textual data containing "PATH". Doing so would allow researchers to better identify public sentiment toward PATH, related topics mentioned alongside PATH, and leaders or communities that hold strong opinions of the transit system.
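As a rough illustration of the suggested next step, common-word examples could be added to the training data in the same `(text, {"entities": [...]})` shape, with an empty entity list. The sentences below are invented for illustration and are not from the corpus. How the trainer treats tokens without annotations should be confirmed against the spaCy documentation before relying on this approach.

```python
# Existing style of positive example: "path" at characters 55-59 is an ORG
positive = (
    "officials in new jersey are lining up against a higher path fare and tolls",
    {"entities": [(55, 59, "ORG")]},
)

# Hypothetical sentences that use the common word "path" (invented examples)
common_word_sentences = [
    "the hikers followed a narrow path through the woods.",
    "the storm's path shifted east overnight.",
]

# An empty entity list marks the sentence as containing no ORG span
negatives = [(sentence, {"entities": []}) for sentence in common_word_sentences]

training_data = [positive] + negatives
print(len(training_data))
```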