<a href="https://colab.research.google.com/github/seerosem7/text_analysis_final_project/blob/main/see_final_project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Training NER to Process Alt-Right Data**


****



## Introduction
***




## Original Project
  Project Veritas (PV) is a self-described citizen journalist organization founded by James O’Keefe that produces controversial, undercover content with an alt-right orientation. Since 2016, numerous scholars, particularly disinformation researchers, have used data from social media to map out alternative and right-wing media networks, including Starbird (2017) and Lewis (2018). Although PV and O’Keefe are situated within what Lewis (2018) calls the “alternative influence network,” it is unclear what the exact structure of their digital network is. Because PV is orientated towards informal “citizen” journalism, it lacks the concrete organizational structure of mainstream journalistic institutions. Rather than being anomalous, the "slippery" structure of PV is indicative of broader challenges in defining what exactly the "alt-right" is and who can be considered part of it (Crosset 2018). Boundaries between groups are often in flux, their ideologies mix and mingle, and their network structures and participants may overlap (Lewis 2018, Marwick and Lewis 2017).


The fluid nature of alt-right digital communities poses challenges for qualitative digital researchers. In particular, digital ethnographers may struggle to define their “field sites” in the same way that their physical counterparts can. Over the past year, I encountered this challenge while conducting a digital ethnography of Project Veritas, James O'Keefe, and their followers. This research project initially aimed to situate PV and O’Keefe within the alt-right ecosystem and thereby map the “field site” of my ethnography. I intended not only to better define my own field site, but also to suggest a methodology that other digital ethnographers and qualitative researchers could build upon.


## Challenges and Project 2.0



To map the network around O'Keefe and PV, I planned to scrape articles from the websites of Project Veritas and two other alt-right publishers - Breitbart and The Daily Caller. Then, I planned to run Named Entity Recognition (NER) using the Python library spaCy to build a database of individuals and organizations referenced in relation to PV and O'Keefe. Finally, I intended to develop a social network based on "entity document co-occurrence," following Brandon Rose's model.
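The "entity document co-occurrence" idea can be sketched in a few lines. This is a minimal illustration rather than Brandon Rose's implementation: the function name `cooccurrence_edges` and the toy input are assumptions made for the example. Two entities are linked whenever they appear in the same article, and the edge weight is the number of articles they share.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_edges(doc_entities):
    """Count how many documents each unordered pair of entities shares."""
    edges = Counter()
    for entities in doc_entities:
        # Deduplicate within a document so a pair counts once per document
        unique = sorted(set(entities))
        for a, b in combinations(unique, 2):
            edges[(a, b)] += 1
    return edges

# Hypothetical toy input: one list of extracted entity names per article
docs = [
    ["Project Veritas", "James O'Keefe", "Breitbart"],
    ["Project Veritas", "James O'Keefe"],
    ["Breitbart", "Project Veritas", "Project Veritas"],
]
edges = cooccurrence_edges(docs)
for pair, weight in edges.most_common():
    print(pair, weight)
```

The resulting weighted pairs could then be loaded into a network library as an edge list.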

However, I discovered that spaCy struggles to extract entities from data taken from alt-right publishers. spaCy's models are trained on large, annotated datasets taken from “news, broadcast, talk shows, weblogs, usenet newsgroups, and conversational telephone speech.” A model makes predictions about how entities appear based on what it has learned about how language functions from its training data (Walsh, Intro to Cultural Analytics). This makes it a powerful tool for processing large amounts of text by recognizing and classifying well-known, meaningful entities, such as common names and brands.

But because spaCy makes predictions based on its training data, it is less accurate at processing domain-specific textual data (Kiselov 2022). As such, programmers working with domain-specific data may manually annotate a set of textual data and use this to train a custom language model, resulting in more accurate predictions. Alt-right content often includes its own domain-specific linguistic style and terminology. Content published on Breitbart, Project Veritas, and The Daily Caller reads very differently from content produced by professional journalists. Organizations and individuals are often referred to in shorthand, slang, or abbreviations. The way entities appear in the text is also inconsistent, with the same entity written differently within a single article.

Although I still intend to develop a social network of my field site, doing so with the default spaCy model would result in a partial dataset. I therefore decided to use this term paper to train a custom spaCy model that I can later use to extract entities from my dataset and build a social network.



## Significance
***

This study is significant for its subject matter as well as its methodology. Regarding its subject matter, O’Keefe and PV are influential actors in the alt-right ecosystem. O'Keefe's group has an outsize influence despite their small size. Their video-first content often “goes viral,” influencing the agenda of mainstream news sites and even impacting policymaking, as in the case of the ACORN controversy in 2009 (Dreier and Martin 2010). Despite the influence of O’Keefe, no scholars, to my knowledge, have mapped out this particular network of self-described journalists or identified their exact place in the broader alt-right news ecosystem.


Methodologically, this study provides a potential toolkit for digital ethnographers and qualitative online researchers to map their networked field sites. In particular, it provides an example of how Named Entity Recognition models can be trained on domain-specific data. This is methodologically useful for researchers of internet subcultures and alternative or radical political movements, showing how language models can be trained to better recognize subgroup-specific references, individuals, organizations, and abbreviations.


## Research Questions
***

1. What individuals and organizations are mentioned frequently in articles produced by or about PV and O’Keefe?
2. How are Project Veritas and James O’Keefe situated within the broader structure of the alt-right news ecosystem?
3. How will training spaCy on domain-specific articles impact the output of NER?

## Hypotheses
***

1. I expect to find that the individuals and actors most frequently mentioned in conversation with PV and O’Keefe will fall into three categories:

 > 1A. Actors and organizations in the broader alt right news ecosystem

 > 1B. Liberal organizations and politicians who are the target of PV’s journalistic operations

 > 1C. Scattered mentions of individual “citizen journalists” and “whistleblowers” who produce PV’s content.

2. I expect that my custom model of spaCy will be able to recognize a greater number of entities and organizations with a higher degree of accuracy than the default spaCy library.


## Methods
***

## Data Gathering

Data for this project consists of articles scraped from the websites of three alt-right publishers: Project Veritas, Breitbart, and the Gateway Pundit. Breitbart and the Gateway Pundit were selected because they are two prominent content producers in the alt-right ecosystem. They are also some of the few websites that allow scraping of articles and do not have paywalls. Other sites (Infowars, the Daily Caller, and the Epoch Times) were excluded because they were inaccessible.

I included articles in my study that fell within the timeframe of January 1, 2018 through January 1, 2023. Because of its small size, PV produces less content than mainstream news sites. This five year time frame will allow me to capture a larger and more representative dataset.

The methods of web scraping varied by site. Project Veritas does not have a traditional “archive” of articles on its website, but rather a collection of what it calls “landmark undercover investigations.” With the assistance of bardeen.ai I extracted and saved the links for each of these 43 articles.
Both the Gateway Pundit and Breitbart have searchable and filterable archives. I pulled up all articles containing the tags "James O'Keefe" and "Project Veritas" and saved the links to each using bardeen.ai. I pulled 275 articles from Breitbart and 92 from the Daily Caller.

I compiled these links into a separate CSV file for each publisher. I chose to save them in separate CSVs rather than a single CSV because each publisher's content contained unique patterns in how the scraped text was displayed, which became relevant in my data cleaning (next section). Next I used the requests library to extract the text from each link and save it as a new column in the CSV. I then used Beautiful Soup to strip away the HTML markup. My code was taken from what we learned during class lectures. I also consulted ChatGPT for some modifications when I was having difficulty saving each of my articles individually in the correct directory.




In [None]:
import pandas as pd
from bs4 import BeautifulSoup
import requests
import os

from google.colab import drive
drive.mount('/content/drive')

#repeated each of the below steps for breitbart CSVs and daily caller CSVs

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
veritas_df = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/final_project/veritas_urls.csv", delimiter=',', encoding='utf-8')

In [None]:
def scrape_article(url):
    # A timeout keeps one unresponsive page from stalling the whole scrape
    response = requests.get(url, timeout=30)
    response.encoding = 'utf-8'
    html_string = response.text
    return html_string

In [None]:
veritas_df['text'] = veritas_df['url'].apply(scrape_article)

In [None]:
folder_path = "/content/drive/MyDrive/Colab Notebooks/final_project/test"
os.makedirs(folder_path, exist_ok=True)

In [None]:
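article_id = 0  # avoid shadowing the built-in id()
for text in veritas_df['text']:
    # Name the parser explicitly so results do not depend on what is installed
    soup = BeautifulSoup(text, "html.parser")
    article = soup.get_text()

    article_id += 1
    try:
        # Save each article as <folder_path>/<article_id>.txt
        with open(os.path.join(folder_path, f"{article_id}.txt"), "w", encoding="utf-8") as file:
            file.write(str(article))
    except Exception as e:
        print(f"Error saving article {article_id}: {e}")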

In [None]:
veritas_df

## Data Cleaning

These articles had a very messy format and took a long time to clean. I cleaned the data for each publisher (Veritas, Daily Caller, and Breitbart) separately because there were different patterns in the text of each publisher.

I removed duplicate articles, removed stopwords, made all text lowercase, removed special characters, ran spellcheck, and finally lemmatized the data using parts of speech tags. I used the methods we had developed in class. I also consulted ChatGPT to help modify the scripts to iterate through each article in my folders. I used the Python os module as well as regular expressions to remove the irrelevant, non-article content at the head and foot of each article. The methods differed slightly for each publisher.
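The character-level steps described above (special characters, extra white space, lowercasing) can also be expressed as one function that works on a string rather than on files. This is a sketch under that assumption; `clean_text` is a name introduced here for illustration, and the regular expressions mirror the ones used in the cells that follow. Working on strings makes it easy to check the result on a sample before overwriting any article on disk.

```python
import re

def clean_text(text):
    """Strip special characters, collapse whitespace, and lowercase a string."""
    text = re.sub(r"[^a-zA-Z0-9\s]", "", text)  # drop punctuation and symbols
    text = re.sub(r"[ \t]+", " ", text)         # collapse runs of spaces and tabs
    text = re.sub(r"\n\s*\n", "\n\n", text)     # collapse runs of blank lines
    return text.lower().strip()

sample = "Dr. Fauci's   TESTIMONY!\n\n\n\nDOD"
print(repr(clean_text(sample)))
```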


In [None]:
import os
import re

In [None]:
#Remove special characters - repeated separately for articles from each of the three publishers.
#I asked ChatGPT to write a for loop that would run through each file in my folder and apply the remove special characters function

folder_path = '/content/drive/MyDrive/Colab Notebooks/final_project/articles_all/veritas_articles'

txt_files = [f for f in os.listdir(folder_path) if f.endswith('.txt')]

# Define a function to remove special characters from a string
def remove_special_characters(text):
    return re.sub(r'[^a-zA-Z0-9\s]', '', text)

# Loop through each .txt file and remove special characters
for txt_file in txt_files:
    file_path = os.path.join(folder_path, txt_file)

    with open(file_path, 'r') as file:
        content = file.read()

    cleaned_content = remove_special_characters(content)

    with open(file_path, 'w') as file:
        file.write(cleaned_content)

In [None]:
#Next I remove extra white space. As with the last step, I prompted ChatGPT to supply code that would iterate through every file in my folder and remove white space.
#The path variable is named breitbart_folder_path, but in this run it points to the Veritas articles folder.

breitbart_folder_path = '/content/drive/MyDrive/Colab Notebooks/final_project/articles_all/veritas_articles'

def remove_white_space(file_path):
    # Read the contents of the file
    with open(file_path, 'r', encoding='utf-8') as file:
        content = file.read()

    # Remove extra spaces
    content = re.sub(' +', ' ', content)

    # Remove extra blank lines
    content = re.sub(r'\n\s*\n', '\n\n', content)

    # Write the cleaned content back to the file
    with open(file_path, 'w', encoding='utf-8') as file:
        file.write(content)


# Iterate through all .txt files in the folder
for filename in os.listdir(breitbart_folder_path):
    if filename.endswith('.txt'):
        file_path = os.path.join(breitbart_folder_path, filename)
        remove_white_space(file_path)

print("Text files cleaned successfully.")

In [None]:
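# Remove capitalization
# daily_caller_folder_path is assumed to have been set to the Daily Caller articles folder, in the same way as the other folder paths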

for filename in os.listdir(daily_caller_folder_path):
    if filename.endswith('.txt'):
        file_path = os.path.join(daily_caller_folder_path, filename)

        with open(file_path, 'r', encoding='utf-8') as file:
            content = file.read()

        modified_content = content.lower()

        with open(file_path, 'w', encoding='utf-8') as file:
            file.write(modified_content)

In [None]:
# I asked ChatGPT to create a spellcheck loop that will run through all of my files, spellcheck, and correct them.
# I applied this to each publisher.

from spellchecker import SpellChecker

def spellcheck_and_modify(file_path):
    # Read the contents of the file
    with open(file_path, 'r', encoding='utf-8') as file:
        content = file.read()


    words = content.split()
    spell = SpellChecker()
    misspelled = spell.unknown(words)
    corrected_words = [spell.correction(word) if word in misspelled else word for word in words]
    corrected_words = [word if word is not None else original_word for word, original_word in zip(corrected_words, words)]

    content_corrected = ' '.join(corrected_words)

    with open(file_path, 'w', encoding='utf-8') as file:
        file.write(content_corrected)

def process_files_in_folder(breitbart_folder_path):
    # Iterate through all .txt files in the folder
    for filename in os.listdir(breitbart_folder_path):
        if filename.endswith('.txt'):
            file_path = os.path.join(breitbart_folder_path, filename)
            spellcheck_and_modify(file_path)


process_files_in_folder(breitbart_folder_path)


In [None]:
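#Remove stopwords

import os
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('stopwords')
nltk.download('punkt')  # tokenizer data needed by word_tokenize (newer NLTK releases may ask for 'punkt_tab' instead)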

stop_words = set(stopwords.words('english'))

for filename in os.listdir(daily_caller_folder_path):
    if filename.endswith('.txt'):
        file_path = os.path.join(daily_caller_folder_path, filename)

        with open(file_path, 'r', encoding='utf-8') as file:
            content = file.read()

        words = word_tokenize(content)

        filtered_words = [word for word in words if word.lower() not in stop_words]

        modified_content = ' '.join(filtered_words)

        with open(file_path, 'w', encoding='utf-8') as file:
            file.write(modified_content)

In [None]:
# To remove duplicated content I used the following code (repeated for each of the three publishers), using pandas and drop_duplicates.
# As written, this drops repeated lines within each file; it does not compare whole files against each other.
veritas_folder_path = "/content/drive/MyDrive/Colab Notebooks/final_project/articles_all/veritas_articles"

def remove_duplicates(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        lines = file.readlines()

    df = pd.DataFrame({'lines': lines})

    df_unique = df.drop_duplicates()

    with open(file_path, 'w', encoding='utf-8') as file:
        file.writelines(df_unique['lines'])

def process_files_in_folder(folder_path):
    # Use the folder passed in as an argument rather than a global path
    for filename in os.listdir(folder_path):
        if filename.endswith('.txt'):
            file_path = os.path.join(folder_path, filename)
            remove_duplicates(file_path)

process_files_in_folder(veritas_folder_path)
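Because the cell above deduplicates lines inside each file, detecting whole duplicate articles needs a file-level comparison. The sketch below is one possible approach, not the method used in this project: it hashes each file's bytes and reports files whose content matches an earlier file. The function name `find_duplicate_files` is an assumption introduced for illustration, and it only catches byte-for-byte duplicates.

```python
import hashlib
import os

def find_duplicate_files(folder_path):
    """Return paths of .txt files whose content duplicates an earlier file."""
    seen = {}
    duplicates = []
    # Sort so that "earlier" is deterministic across runs
    for filename in sorted(os.listdir(folder_path)):
        if not filename.endswith(".txt"):
            continue
        file_path = os.path.join(folder_path, filename)
        with open(file_path, "rb") as file:
            digest = hashlib.sha256(file.read()).hexdigest()
        if digest in seen:
            duplicates.append(file_path)
        else:
            seen[digest] = file_path
    return duplicates
```

The function only reports duplicates; reviewing the returned list before deleting anything keeps the step reversible.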

In [None]:
# These articles contained a lot of irrelevant text at the top and bottom that was pulled when I made the requests
# I manually identified patterns in the text data
# An example of a pattern: Breitbart has non-article text that starts with "COMMENTS\nPlease let us know if youre having issues with commenting" and ends with "Copyright 2023 Breitbart"
# I asked ChatGPT how I could go about removing this irrelevant content. It suggested the regular expressions library. It took a few iterations of prompts to get a script that worked.
# I ran modified versions of this script based on patterns found in other areas of my data

import os
import re

breitbart_folder_path = '/content/drive/MyDrive/Colab Notebooks/final_project/articles_all/breitbart_articles'
test_article_breitbart = '/content/drive/MyDrive/Colab Notebooks/final_project/articles_all/breitbart_articles/breitbart_articles_1.txt'

start_lines_text = 'COMMENTS\nPlease let us know if youre having issues with commenting\n'
end_line_text = 'Copyright 2023 Breitbart'

def remove_end_content(test_article_breitbart, start_lines, end_line):
    # Read the contents of the file
    with open(test_article_breitbart, 'r', encoding='utf-8') as file:
        lines = file.readlines()

    # Combine consecutive lines
    concatenated_lines = ''.join(lines)

    # Construct a regular expression to match the content between start and end lines
    pattern = re.compile(re.escape(start_lines) + '(.*?)' + re.escape(end_line), re.DOTALL)

    # Find the starting and ending indices of the content to remove
    match = pattern.search(concatenated_lines)

    if match:
        start_index = match.start()
        end_index = match.end()

        # Remove the content between start and end lines
        modified_content = concatenated_lines[:start_index] + concatenated_lines[end_index:]

        # Write the modified content back to the file
        with open(test_article_breitbart, 'w', encoding='utf-8') as file:
            file.write(modified_content)

def remove_end_content_folder(folder_path, start_lines, end_line):
    # Iterate through all .txt files in the folder
    for filename in os.listdir(folder_path):
        if filename.endswith('.txt'):
            file_path = os.path.join(folder_path, filename)
            remove_end_content(file_path, start_lines, end_line)


remove_end_content_folder(breitbart_folder_path, start_lines_text, end_line_text)



In [None]:
#Tag each word in each file with part of speech to prepare for lemmatizing

from nltk import pos_tag
from nltk.corpus import wordnet
import nltk
nltk.download('wordnet')

nltk.download('averaged_perceptron_tagger')

def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    return tag_dict.get(tag, wordnet.NOUN)

def tag_pos_in_files(daily_caller_folder_path):
    for filename in os.listdir(daily_caller_folder_path):
        if filename.endswith('.txt'):
            file_path = os.path.join(daily_caller_folder_path, filename)

            with open(file_path, 'r', encoding='utf-8') as file:
                content = file.read()

            words = word_tokenize(content)
            pos_tags = [get_wordnet_pos(word) for word in words]
            word_pos_tuples = list(zip(words, pos_tags))

            print(f"File: {filename}")
            print(word_pos_tuples)
            print("\n")

tag_pos_in_files(daily_caller_folder_path)

In [None]:
#lemmatize based on the POS tags

from nltk.stem import WordNetLemmatizer

def lemmatize_files(daily_caller_folder_path):
    lemmatizer = WordNetLemmatizer()

    for filename in os.listdir(daily_caller_folder_path):
        if filename.endswith('.txt'):
            file_path = os.path.join(daily_caller_folder_path, filename)

            with open(file_path, 'r', encoding='utf-8') as file:
                content = file.read()

            words = word_tokenize(content)

            pos_tags = [get_wordnet_pos(word) for word in words]

            lemmatized_words = [lemmatizer.lemmatize(word, pos=pos_tag) for word, pos_tag in zip(words, pos_tags)]

            lemmatized_content = ' '.join(lemmatized_words)

            with open(file_path, 'w', encoding='utf-8') as file:
                file.write(lemmatized_content)

lemmatize_files(daily_caller_folder_path)


## Running NER with spaCy's default large language model
***



When I was finished with data cleaning, I tried running spaCy on a sample of my dataset. I instantly encountered the problem that led me to redefine my project. Even though I was running spaCy's largest language model, many of the entities in my data were not being picked up. As can be seen below, spaCy failed to detect numerous entities, primarily "organizations." This appears to be because PV, Breitbart, and the Daily Caller often shorten the names of organizations in atypical and inconsistent ways.

Below is the set of organizations, institutions, and government agencies that spaCy should have detected in my "test" article and applied the "ORG" label to. However, these entities were not detected by spaCy.

1. DARPA [Defense Advanced Research Projects Agency]
2. Project Veritas
3. Department Defense [US Department of Defense]
4. NAIAD [National Institute of Allergy and Infectious Diseases]
5. Inspector general [Office of the Inspector General]
6. NIH [National Institutes of Health]
7. DOD [Department of Defense]
8. USDR [US Digital Response]

It is also apparent that the "sensitivity" of spaCy to changes in how individual names are listed is not high enough to detect the inconsistent ways that individuals are referred to in my dataset. For example, "Dr. Fauci" is correctly labeled, but "Fauci" by itself is not detected.

These numerous oversights indicate that even spaCy's large model cannot accurately predict how named entities appear in my domain-specific texts. Running this NER on my entire dataset would thus result in an incomplete and inaccurate representation of the named entities. This would make my map of the social network incomplete.

## Training Custom Named Entity Recognition Model
***



## Process
***

Rather than using a default model, I decided to pivot my project and train a custom NER model. Training a model with my own datasets allows spaCy to better predict how entities will appear in the specific domain, and will ideally allow spaCy to detect named entities written in the inconsistent, shorthand terminology of alt-right articles.

I started out by basing my method on Nisanth N's article, "Training Custom NER" (2020) in Towards Data Science. This is a bare-bones outline of the essential steps needed to train a model. To understand the process in greater depth, I consulted Deepak John Reji's YouTube and GitHub tutorials. For further specificity, I asked ChatGPT to break down each step of code, troubleshoot errors, and/or modify Nisanth and Reji's lines of code when I encountered challenges. I built my model on top of spaCy's preloaded large language model.

The most time consuming aspect of this process resulted from mistakenly running incompatible versions of Python libraries. The tutorials I consulted were based on an older version of spaCy that required training data to be in tuple format. However, the newer version, 3.7.2, required "example" format. I also ran into issues with the version of spaCy's large language model that I had loaded. I used ChatGPT to very slowly rectify these version issues. While my code ultimately worked, there are likely redundant and/or inefficient sections of it due to many rounds of troubleshooting back and forth with ChatGPT.


## Data Annotation
***

NER models are trained on datasets that are annotated by hand to indicate where specific entities appear in a body of text. The first step to training a model is thus to annotate domain-specific data so that it can be uploaded for training. Annotating entails identifying an entity, applying a label (e.g., PERSON) to that entity, and indicating numerically where the entity appears in the text.
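To make "indicating numerically where the entity appears" concrete, the sketch below builds one training pair of the form `(text, {"entities": [(start, end, label)]})`, where `start` and `end` are character offsets. It is an illustration only: the helper `annotate` and the sample sentence are invented here, and in this project the offsets came from the annotation tool rather than from code like this.

```python
import re

def annotate(text, entity_labels):
    """Build a (text, {"entities": [...]}) pair from a {surface form: label} map."""
    spans = []
    for surface, label in entity_labels.items():
        # Whole-word matches only, so "dod" does not match inside a longer token
        for match in re.finditer(r"\b" + re.escape(surface) + r"\b", text):
            spans.append((match.start(), match.end(), label))
    return text, {"entities": sorted(spans)}

# Hypothetical sentence in the style of the cleaned, lowercased articles
text = "darpa sent the report to the dod inspector general"
example = annotate(text, {"darpa": "ORG", "dod": "ORG", "inspector general": "ORG"})

for start, end, label in example[1]["entities"]:
    print(start, end, label, repr(text[start:end]))
```

This simple matcher does not resolve overlapping spans, which an NER trainer typically rejects, so hand annotation or a review pass is still needed.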

I used an open-source annotator, https://tecoholic.github.io/ner-annotator/, to annotate my data in tuple format. I started by annotating one test article (previously pictured) in my dataset with "PERSON" and "ORG" tags. Once this test article was annotated, I saved it in JSON format and followed Deepak John Reji's YouTube tutorial to ensure it was in the correct tuple format.

In [None]:
# The following scripts are based on Deepak John Reji's YouTube tutorial about annotating NER training data.
# I asked ChatGPT to explain Reji's code to me so I would have a better understanding of what it was doing.
# When I ran into errors in my code, I asked ChatGPT to identify the errors and modify the lines of code

# Install one pinned spaCy version; running `pip install -U spacy` afterwards would override the pin
!pip install spacy==3.7.2 spacy-lookups-data
!python -m spacy download en_core_web_lg

import json
import os
import random
from pathlib import Path

import spacy
import en_core_web_lg
from tqdm import tqdm

In [None]:
# Import annotated test training article in JSON format

json_path = '/content/drive/MyDrive/Colab Notebooks/final_project/train_data/annotations_org_1.json'

with open(json_path, 'r') as f:
    data = json.load(f)

In [None]:
# This loop makes sure that the JSON data is in the correct tuples format to be readable by the model

# Each annotation is a [text, {"entities": [[start, end, label], ...]}] pair;
# skip any empty (None) entries that the export may contain
train_data = [tuple(item) for item in data['annotations'] if item is not None]

for text, annotations in train_data:
    # Convert every entity span to a tuple; an empty list is left as-is
    annotations['entities'] = [tuple(ent) for ent in annotations['entities']]

In [None]:
train_data

In [None]:
# Transitioning to training the model, building on the large language model

model = en_core_web_lg
output_dir=Path("/content/drive/MyDrive/Colab Notebooks/final_project/conNER")
n_iter=100

In [None]:
model_name = "en_core_web_lg"

if model_name is not None:
    nlp = spacy.load(model_name)
    print("Loaded model '%s'" % model_name)
else:
    nlp = spacy.blank('en')
    print("Created blank 'en' model")

# Set up the pipeline
if 'ner' not in nlp.pipe_names:
    # In spaCy v3, add_pipe takes the component name and returns the component
    ner = nlp.add_pipe('ner', last=True)
else:
    ner = nlp.get_pipe('ner')


Loaded model 'en_core_web_lg'


In [None]:
model =  ('/content/drive/MyDrive/Colab Notebooks/final_project/conNER')
n_iter=100
custom_model_name = "conNER"


if output_dir is not None:
    output_dir = Path(output_dir)
    if not output_dir.exists():
        output_dir.mkdir()
    custom_model_path = output_dir / custom_model_name
    nlp.to_disk(custom_model_path)

In [None]:
# I initially ran this loop in tuple format, but the newer version of spaCy requires "example" format
# ChatGPT provided a modified loop that processes my tuples into examples that can be read by the model

from pathlib import Path
from spacy.training.example import Example

# Assuming train_data is a list of tuples (text, annotations)
examples = []

for text, annotations in train_data:
    example = Example.from_dict(nlp.make_doc(text), annotations)
    examples.append(example)

other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
with nlp.select_pipes(disable=other_pipes):  # only train NER
    # resume_training() keeps the pretrained weights; in spaCy v3, begin_training()
    # is a deprecated alias for initialize(), which resets them
    optimizer = nlp.resume_training()
    for itn in range(n_iter):
        random.shuffle(examples)
        losses = {}
        for example in tqdm(examples):
            nlp.update(
                [example],  # Pass a list containing the Example object
                sgd=optimizer,
                drop=0.5,
                losses=losses
            )
        print(losses)


# Save the model after the entire training loop
output_dir = Path("/content/drive/MyDrive/Colab Notebooks/final_project/conNER")
nlp.to_disk(output_dir / "conNER_updated")

### Testing the Model
***
After this first round of training, I annotated the entities on one more article and ran it through my model. With two sets of training data processed, conNER (or Conservative Named Entity Recognition) was ready to be tested. In the following code, I ran conNER on the same test article that I used earlier as an example of the default spaCy model's limited predictive ability. I wanted a direct comparison of how the model's predictions changed as a result of the training.

In [None]:
import spacy
from spacy import displacy
import en_core_web_lg
import pandas as pd
pd.options.display.max_rows = 600
pd.options.display.max_colwidth = 400
import glob
from pathlib import Path
import requests
import pprint
from bs4 import BeautifulSoup

In [None]:
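# Load the trained pipeline that the training cell saved to output_dir / "conNER_updated"
model_path = "/content/drive/MyDrive/Colab Notebooks/final_project/conNER/conNER_updated"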

In [None]:
nlp = spacy.load(model_path)

In [None]:
filepath = "/content/drive/MyDrive/Colab Notebooks/final_project/articles_all/article_1.txt"
text = open(filepath, encoding='utf-8').read()
doc = nlp(text)

In [None]:
displacy.render(doc, jupyter=True, style="ent")

### Model Output
***
Despite being trained on only two articles' worth of domain-specific data, conNER was noticeably more accurate at detecting entities than the default spaCy model. I ran the custom model, conNER, on the same test article that I ran the default spaCy model on. It correctly tagged all of the organizations that were not identified by the default model. These include:

1. DARPA
2. Project Veritas
3. Department Defense
4. Inspector General
5. NAIAD
6. NIH
7. USDR
8. DOD

However, conNER incorrectly identified "fauci testimony" as an organization, rather than labeling "fauci" as an individual. It also labeled "Washington DC" as an organization. This indicates that, while the custom model may be better at correctly identifying organizations, it is prone to false positives that identify individuals as organizations. Given that I have only trained it on one "PERSON" annotated article and one "ORG" annotated article, these are encouraging results. Further training would likely increase the model's ability to predict named entities and make its identification more accurate.
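The trade-off described above, more organizations found but some false positives, can be summarized with precision and recall. The sketch below is illustrative only: it uses just the entities named in this write-up, not a full hand-checked evaluation of the test article, and the function name `precision_recall` is introduced here for the example.

```python
def precision_recall(predicted, gold):
    """Compute precision and recall for two collections of entity strings."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Organizations listed above as expected in the test article
gold_orgs = {
    "DARPA", "Project Veritas", "Department Defense", "Inspector General",
    "NAIAD", "NIH", "USDR", "DOD",
}
# Assumption for illustration: conNER's ORG predictions are the eight correct
# tags plus the two false positives mentioned in the text
predicted_orgs = gold_orgs | {"fauci testimony", "Washington DC"}

precision, recall = precision_recall(predicted_orgs, gold_orgs)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Tracking both numbers across training rounds would show whether additional annotated articles reduce the false positives without losing the newly detected organizations.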

## Conclusions, Limitations, and Future Directions
***

Although my custom language model is only in its preliminary stages, the early results are promising. The model is able to more accurately predict when organizations appear in the text in abbreviated or atypical ways - such as "DOD" for Department of Defense or "Inspector General" for "Office of the Inspector General." This shows increased fluency in the terminology of alt-right news sites, which often shirk the formal linguistic styles and terminology of professional news organizations.


However, conNER appears to be over-inclusive in what it identifies as an "organization," such as labeling "Fauci testimony" as an organization. This tendency to over-identify named entities as "organizations" might be rectified with further training. So, while H2 appears to be supported, the model is not yet complete enough to run my initially proposed analysis.


As described in my introduction, I started this project with research questions about the nature of the network around Project Veritas and James O'Keefe. My original goal was to map the social network of this self-described citizen journalism organization. However, issues with the spaCy NER model's ability to accurately predict entity occurrence in my domain datasets complicated my objectives and methods.


At this point, my custom model is limited. I have not had time to train conNER on enough data to feel comfortable applying it to my entire corpus. I intend to train conNER on a larger number of datasets annotated with "PERSON" and "ORG" labels to increase its predictive accuracy. Once I am satisfied with the model's predictions, I intend to run my initially proposed analysis, encompassing research questions RQ1 and RQ2 and hypotheses H1, H1A, H1B, and H1C.


Because I am so new to coding, I built my custom NER model with the assistance of a variety of tutorials and ChatGPT suggestions. I encountered many errors during the programming process and primarily used ChatGPT to assist me in fixing them. Although ChatGPT was a very useful tool, I think it is likely that relying on its suggestions made my code clunkier and more inconsistent than it would have been if I had the skill set to program it from scratch. Thus, the major limitation of my study is that I am not fully confident in the quality of my code. To confidently assess and deploy my model, I need to improve both my conceptual knowledge of coding and my technical skills.


Despite these limitations, my study provides some methodological guidance for digital ethnographers and qualitative researchers seeking to map the networked field sites they are studying. Natural language processing is a powerful tool. However, because it operates by prediction based on given datasets, it is less adept at parsing the terminology of specific internet or political subcultures. Training custom models provides a way for researchers of digital communities and subcultural, fringe, or extremist groups to harness more accurate language models.


Moving forward in my own ethnographic work, I look forward to integrating natural language processing as a way to map the digital field sites I am studying. Improving my custom language model will allow me to map the social network of my ethnographic field site.







