# EPA IDS Data
We use EPA’s Incident Data System (IDS) to supplement formal ECHO violation records with self-reported pesticide-related incidents. Using topic modeling and text analysis, we filter for agriculture-related exposure cases and aggregate them by state and year. The resulting counts serve as a proxy for potential underregulation that can be compared against official enforcement outcomes across State Lead Agencies (SLAs).
## Data Overview 

This dataset contains self-reported pesticide-related incidents from 2014 to 2024. Each record captures the type of incident (I focus on human-related incidents, one of roughly five types), the state where it occurred, and the county when available. It offers an alternative view to the ECHO data on how incidents in the agricultural sector are handled.

## Data Collection
Data were obtained by scraping the EPA IDS website.
Query fields were left blank so that every available incident was returned.

## Data Structure 
Incident Date (DateTime Object): When the incident was reported  
Reason for Report (String): Two-word summary of the incident  
Impact of Incident (String): Two-letter code for the incident's impact, e.g. HC for Human - Moderate  
Country (String): Country of occurrence  
State (String): State of occurrence  
County (String): County of occurrence  
City (String): City of occurrence  
Product Reg # (String): Registration number of the pesticide involved, if one was reported  
Product Name (String): Name of the pesticide product involved, if any  
PC Code (Int): EPA-designated code for each pesticide  
Active Ingredients (String): Active ingredient(s) in the pesticide  
Clean (String): Cleaned version of the incident description  
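
The `Impact of Incident` field can hold several codes in one cell (for example `HB - Human - Major, HC - Human - Moderate` in the filtered preview later in this notebook). A minimal sketch of pulling out the two-letter codes, assuming every code is two capital letters followed by ` - `. The helpers `extract_impact_codes` and `has_human_impact` are hypothetical and not part of the notebook:

```python
import re

# Assumption: each impact code is two capital letters followed by " - ".
CODE_RE = re.compile(r"\b([A-Z]{2}) - ")

def extract_impact_codes(value):
    """Return the two-letter impact codes found in one cell, in order."""
    if not isinstance(value, str):
        return []  # missing values (NaN/None) carry no codes
    return CODE_RE.findall(value)

def has_human_impact(value):
    """True if any extracted code starts with 'H' (HA, HB, HC, HD, ...)."""
    return any(code.startswith("H") for code in extract_impact_codes(value))

print(extract_impact_codes("HB - Human - Major, HC - Human - Moderate"))
print(has_human_impact("ON - Other Nontarget"))
```

Matching on extracted codes is stricter than searching the whole string for a capital "H", which may matter if any non-human label happens to contain that letter.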

In [1]:
import pandas as pd

# Load raw data
df = pd.read_csv("IDS.csv")

# Show the dataset size and the first 5 rows
print("dataset size: ", len(df))
df.head()


dataset size:  18609


Unnamed: 0,Incident Number,Incident Date,Reason for Report,Impact of Incident,Country,State,County,City,Product Registration Number(s),Product Names,PC Codes,Active Ingredient(s),Overall Submission Description (may describe multiple incidents)
0,026567-00010,02/26/2014,Adverse Reaction,ON - Other Nontarget,US,CA,Merced,,,,35001.0,Dimethoate,"US EPA Region 9: 21 reports including updates,..."
1,026567-00018,05/08/2014,Adverse Reaction,ON - Other Nontarget,US,CA,Tulare,,,,,Unknown Ingredients,"US EPA Region 9: 21 reports including updates,..."
2,026800-00005,05/14/2014,Adverse Reaction,ON - Other Nontarget,CN,,Ontario,Grey,,,44309.0,Clothianidin,"8 Canadian bee kill incidents, Ontario, 2012, ..."
3,026800-00005,05/14/2014,Adverse Reaction,ON - Other Nontarget,CN,,Ontario,Grey,,,109302.0,Fluvalinate,"8 Canadian bee kill incidents, Ontario, 2012, ..."
4,026800-00005,05/14/2014,Adverse Reaction,ON - Other Nontarget,CN,,Ontario,Grey,,,99050.0,Acetamiprid,"8 Canadian bee kill incidents, Ontario, 2012, ..."


## Data Processing 
The data were then run through topic modeling and an LLM to narrow them to agriculture-related incidents.

The reduced dataset was imported into Python and cleaned further with regular expressions: keeping only U.S. incidents, dropping unnecessary columns, and keeping only incidents that involve humans.

## Data Quality & Limitations

The biggest limitation is that the data are self-reported. Many incident descriptions lack detail, and many incidents likely go unreported. To address this, I plan to pair the IDS data with the ECHO data and use IDS as a check on which states are most compliant with policy.
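
One way the planned pairing could work is a per-state, per-year ratio of IDS incidents to ECHO violations. The sketch below is only an illustration: the layout of the ECHO data is not described in this notebook, so `echo_counts`, the function name, and every number are made-up placeholders.

```python
# All counts below are made-up placeholders, not real IDS or ECHO figures.
ids_counts = {("CA", 2014): 40, ("TX", 2014): 10}
echo_counts = {("CA", 2014): 20, ("TX", 2014): 0}

def incidents_per_violation(ids_counts, echo_counts):
    """Ratio of self-reported incidents to formal violations per (state, year).

    Returns None where no violations were recorded, since the ratio is undefined.
    """
    ratios = {}
    for key, n_ids in ids_counts.items():
        n_echo = echo_counts.get(key, 0)
        ratios[key] = n_ids / n_echo if n_echo else None
    return ratios

ratios = incidents_per_violation(ids_counts, echo_counts)
print(ratios)
```

A high ratio, or incidents with no recorded violations at all, would flag state-years worth a closer look. It would not by itself show underregulation, because reporting rates also differ by state.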

In [2]:
# Strip whitespace from column names and normalize country codes
df.columns = df.columns.str.strip()
df["Country"] = df["Country"].str.strip().str.upper()

# Filter for human-related incidents (impact codes such as HA, HB, HC contain a capital "H")
df_human = df[df["Impact of Incident"].str.contains("H", na=False)]

# Filter for U.S.-based incidents; .copy() avoids SettingWithCopyWarning when columns are added later
df_human = df_human[df_human["Country"] == "US"].copy()

# Filtered dataset size
print("Filtered dataset size: ", len(df_human))
df_human.head()

Filtered dataset size:  13011


Unnamed: 0,Incident Number,Incident Date,Reason for Report,Impact of Incident,Country,State,County,City,Product Registration Number(s),Product Names,PC Codes,Active Ingredient(s),Overall Submission Description (may describe multiple incidents)
5,027212-00004,11/07/2014,Adverse Reaction,"HB - Human - Major, HC - Human - Moderate",US,MS,,,000100-00990,Demon Wp Insecticide,109702,Cypermethrin,"Syngenta: 1 H-B and 4 H-C individual reports, ..."
6,027220-00001,11/06/2014,Product Defect,HC - Human - Moderate,US,VA,Augusta,Weyers Cave,072959-00006,Degesch Fumi Cel,66504,Magnesium phosphide,Degesch America: 1 H-C: trauma & severe bruisi...
7,027251-00001,11/17/2014,Adverse Reaction,HA - Human Fatality,US,CO,,Briscoe,005185-00505-080306,The Works Toilet Bowl Cleaner,45901,Hydrochloric acid,Biolab: 1 H-A: male's suicidal death involving...
8,027251-00001,11/17/2014,Adverse Reaction,HA - Human Fatality,US,CO,,Briscoe,,Hi Yield Lime Sulfur Spray,76702,Lime sulfur,Biolab: 1 H-A: male's suicidal death involving...
9,027261-00001,11/23/2014,Adverse Reaction,"HC - Human - Moderate, HD - Human - Minor",US,KY,,Science Hill,,Total Pest Control,109701,Permethrin,NPIC: woman and her daughter experienced sympt...


In [3]:
import re

def remove_years(text):
    # Remove 4-digit years from 1980–2029
    text = re.sub(r'\b(20[0-2][0-9]|19[8-9][0-9])\b', '', text)
    # Remove remaining standalone numbers (digits attached to letters, e.g. "3rd", are kept)
    text = re.sub(r'\b\d+\b', '', text)
    return text

MONTHS = [
    "jan", "january", "feb", "february", "mar", "march", "apr", "april",
    "may", "jun", "june", "jul", "july", "aug", "august", "sep", "sept", "september",
    "oct", "october", "nov", "november", "dec", "december", "week"]
BRANDS = ["monsanto", "basf", "bayer", "jk", "sc", "johnson", "benckiser", "reckitt", "otts", "lonza", "pb", "engenia",
         "mgk"]
COMMON_NOISE = ["report", "reports", "update", "summaries", "entered", "annual", "aggregate", "approximately", "il", "region", "products", "scotts", "unite"]

def remove_noise_words(text):
    text = text.lower()
    for word in BRANDS + COMMON_NOISE + MONTHS:
        text = re.sub(rf'\b{re.escape(word)}\b', '', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text

In [4]:
from nltk.tokenize import TreebankWordTokenizer
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
tokenizer = TreebankWordTokenizer()

def lemmatize_text(text):
    words = tokenizer.tokenize(text)
    return ' '.join([lemmatizer.lemmatize(word, pos='v') for word in words])

def clean_for_lda(text):
    text = str(text).lower()
    text = remove_years(text)
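    # Noise words are removed before lemmatizing, so inflected forms
    # (e.g. "reported", "united") survive and reappear as "report", "unite"
    text = remove_noise_words(text)
    text = lemmatize_text(text)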
    text = re.sub(r'[^\w\s]', '', text)  # remove punctuation
    return text
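
The order of the cleaning steps matters: `report` and `unite` are in `COMMON_NOISE`, yet both show up in the topic keywords further down, most likely because inflected forms are only reduced to those base forms after the noise filter has run. A minimal sketch of the effect, using a hypothetical two-word noise list and a toy lemma map in place of the NLTK lemmatizer:

```python
import re

# Hypothetical stand-ins: a tiny noise list and lemma map, not the notebook's NLTK setup.
NOISE = ["report", "unite"]
TOY_LEMMAS = {"reported": "report", "reports": "report", "united": "unite"}

# One compiled whole-word pattern instead of one re.sub call per noise word.
NOISE_RE = re.compile(r"\b(?:" + "|".join(re.escape(w) for w in NOISE) + r")\b")

def toy_lemmatize(text):
    """Map each token to its base form if the toy map knows it."""
    return " ".join(TOY_LEMMAS.get(tok, tok) for tok in text.split())

def strip_noise(text):
    """Drop noise words, then collapse the leftover whitespace."""
    return re.sub(r"\s+", " ", NOISE_RE.sub("", text)).strip()

sample = "worker reported rash at united farms"
noise_then_lemma = toy_lemmatize(strip_noise(sample))  # order used in clean_for_lda
lemma_then_noise = strip_noise(toy_lemmatize(sample))  # swapped order

print(noise_then_lemma)
print(lemma_then_noise)
```

With the swapped order the base forms are filtered out. Applying the same swap in `clean_for_lda` would change the fitted topics, so the outputs shown in this notebook would need to be regenerated.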


In [5]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Use 'Clean' if available, otherwise fall back to the submission description
text_col = "Clean" if "Clean" in df.columns else "Overall Submission Description (may describe multiple incidents)"

# Fill missing values with empty strings
texts_cleaned = df_human[text_col].fillna("").apply(clean_for_lda)
df_human['Clean'] = texts_cleaned  # store result back into df_human

# Convert text into vectorized form (bag of words)
vectorizer = CountVectorizer(stop_words='english', max_df=0.95, min_df=10)
X = vectorizer.fit_transform(df_human['Clean'])

# Fit an LDA model with 5 topics
lda = LatentDirichletAllocation(n_components=5, random_state=42)
lda.fit(X)

# Print the top words in each topic (ascending weight: the last word is the strongest)
def print_topics(model, vectorizer, top_n=10):
    feature_names = vectorizer.get_feature_names_out()  # look up the vocabulary once
    for idx, topic in enumerate(model.components_):
        top_words = [feature_names[i] for i in topic.argsort()[-top_n:]]
        print(f"Topic {idx}: {top_words}")

print_topics(lda, vectorizer)

Topic 0: ['burn', 'ocular', 'rash', 'pa', 'pain', 'hb', 'irritation', 'hc', 'symptoms', 'include']
Topic 1: ['health', 'animal', 'syngenta', 'ea', 'industries', 'unite', 'symptoms', 'hb', 'include', 'hc']
Topic 2: ['pet', 'pa', 'roundup', 'glyphosate', 'human', 'hb', 'hc', 'individual', 'incidents', 'incident']
Topic 3: ['orchard', 'agricultural', 'pesticide', 'experience', 'cac', 'symptoms', 'workers', 'county', 'field', 'application']
Topic 4: ['incident', 'report', 'expose', 'hospital', 'yearold', 'npic', 'male', 'symptoms', 'pesticide', 'experience']


### Topic 0 (Keep)
['burn', 'ocular', 'rash', 'pa', 'pain', 'hb', 'irritation', 'hc', 'symptoms', 'include']

#### Analysis:
Focused on physical symptoms and mild-to-moderate health effects
Contains "ocular", "burn", "rash", "irritation" — all clear human symptoms
Includes "hb", "hc" = health severity codes → remove from topic modeling, but important for labeling

#### Interpretation:
Human Symptom Reports (likely low/moderate severity)

### Topic 1 (Drop)
['health', 'animal', 'syngenta', 'ea', 'industries', 'unite', 'symptoms', 'hb', 'include', 'hc']

#### Analysis:
Mixed content: "animal", "industries", "unite" don’t form a strong agricultural theme
"syngenta" is a brand → noise
"ea" unclear, likely noise
"symptoms" and "health" are general-purpose

#### Interpretation:
Mixed or off-topic, maybe non-agriculture product-related or brand-related incidents

### Topic 2 (Keep)
['pet', 'pa', 'roundup', 'glyphosate', 'human', 'hb', 'hc', 'individual', 'incidents', 'incident']

#### Analysis:
"pet", "roundup", "glyphosate" = household or home-and-garden products
"human" is vague; "pa" is a state code → noise
"hb"/"hc" still present

#### Interpretation:
Household/Pet-related pesticide exposure, not always agriculture
We could extract only relevant records.

### Topic 3 (Keep)
['orchard', 'agricultural', 'pesticide', 'experience', 'cac', 'symptoms', 'workers', 'county', 'field', 'application']

#### Analysis:
This is the best and clearest topic.
Strong keywords: "orchard", "agricultural", "pesticide", "application", "field", "workers"

#### Interpretation:
Agricultural exposure, the main focus of this analysis

### Topic 4 (Drop)
['incident', 'report', 'expose', 'hospital', 'yearold', 'npic', 'male', 'symptoms', 'pesticide', 'experience']

#### Analysis:
This topic feels generic
"hospital", "yearold", "male" — repeated structure of incident reports
"npic" and "report" → possible system-level noise

#### Interpretation:
Metadata or general incident structure, not conceptually useful on its own

In [6]:
# Get topic distribution per incident
topic_distribution = lda.transform(X)

# Assign the most probable topic to each record
df_human['Topic'] = topic_distribution.argmax(axis=1)

# Show how many records per topic
df_human['Topic'].value_counts()

Topic
0    5408
1    2615
2    2432
4    1497
3    1059
Name: count, dtype: int64

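After training the LDA model, I reviewed each topic’s top keywords. Topic 3 contained the most agriculture-specific terms, such as 'orchard', 'agricultural', 'field', and 'workers'. Topics 0 and 2 describe human symptoms and household or garden products, so they are only partly agricultural, while Topics 1 and 4 were mixed or generic and were dropped. I therefore kept incidents assigned to Topics 0, 2, and 3 and then applied a keyword filter to narrow them to agricultural exposure.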
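
Assigning each record to its single most probable topic hides how confident that assignment was. A minimal sketch in plain Python of the argmax step with an optional confidence floor. The probabilities and the `min_prob` threshold are made up for illustration, and `KEEP_TOPICS` mirrors the topic ids kept in the next cell:

```python
KEEP_TOPICS = {0, 2, 3}  # same topic ids the notebook keeps in the next cell

def dominant_topic(probs):
    """Return (topic_index, probability) for the most probable topic."""
    idx = max(range(len(probs)), key=probs.__getitem__)
    return idx, probs[idx]

def keep_record(probs, min_prob=0.0):
    """Keep a record if its dominant topic is relevant and confident enough."""
    idx, p = dominant_topic(probs)
    return idx in KEEP_TOPICS and p >= min_prob

# Made-up topic distributions for three records (each row sums to 1).
rows = [
    [0.70, 0.10, 0.10, 0.05, 0.05],  # clearly topic 0
    [0.22, 0.21, 0.19, 0.19, 0.19],  # topic 0 by a hair
    [0.05, 0.80, 0.05, 0.05, 0.05],  # clearly topic 1
]

print([dominant_topic(r)[0] for r in rows])
print([keep_record(r, min_prob=0.5) for r in rows])
```

With `min_prob=0.0` this reduces to the plain argmax rule used in the notebook. A higher floor would drop records like the second row, whose topic assignment is nearly arbitrary.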

In [7]:
# Keep Topics 0, 2, and 3
df_relevant = df_human[df_human['Topic'].isin([0, 2, 3])].copy()

# Filter agriculture exposure
AG_EXPOSURE_KEYWORDS = [
    "pesticide", "application", "spray", "expose", "exposure", "mix", "handle",
    "farm", "field", "crop", "greenhouse", "orchard", "tractor", "vineyard",
    "worker", "farmworker", "applicator", "labor", "symptoms", "irritation",
    "rash", "nausea", "vomiting", "residue", "treated", "chemical", "drift"
]

# Flag records whose cleaned text contains any keyword (substring match, so "labor" also matches "laboratory")
def is_agriculture_related(text):
    text = str(text).lower()
    return any(word in text for word in AG_EXPOSURE_KEYWORDS)

df_filtered = df_relevant[df_relevant['Clean'].apply(is_agriculture_related)].copy()
print(f"Filtered incidents with relevant topics + ag exposure keywords: {len(df_filtered)}")

Filtered incidents with relevant topics + ag exposure keywords: 6002
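
`is_agriculture_related` uses substring matching, so a keyword such as `labor` also matches `laboratory` and `mix` matches `mixture`. A sketch of a whole-word alternative built from one compiled pattern, using a shortened keyword list for illustration. The function names are hypothetical:

```python
import re

KEYWORDS = ["labor", "mix", "field", "spray"]  # shortened list for illustration

# \b anchors make each keyword match only as a whole word.
KEYWORD_RE = re.compile(r"\b(?:" + "|".join(map(re.escape, KEYWORDS)) + r")\b")

def substring_match(text):
    """Same rule as is_agriculture_related: any keyword anywhere in the text."""
    text = str(text).lower()
    return any(word in text for word in KEYWORDS)

def whole_word_match(text):
    """Stricter rule: a keyword must appear as a complete word."""
    return KEYWORD_RE.search(str(text).lower()) is not None

sample = "laboratory staff handle a cleaning mixture"
print(substring_match(sample), whole_word_match(sample))
```

The trade-off is recall: whole-word matching would miss forms the cleaning step leaves inflected, such as `workers` for the keyword `worker`, so the keyword list would need those variants added.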


In [8]:
# Extract the year from 'Incident Date' (unparseable dates become NaN)
df_filtered['year'] = pd.to_datetime(df_filtered['Incident Date'], errors='coerce').dt.year

# Group by state and year
df_ids_summary = df_filtered.groupby(['State', 'year']).size().reset_index(name='ids_event_count')
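
Note that `size()` counts rows, and in the preview above one incident number (027251-00001) appears on two rows, one per product. A plain-Python sketch of the difference between counting rows and counting distinct incidents per state and year. The first three rows are taken from the preview, and the fourth has a deliberately broken date that is hypothetical:

```python
from collections import Counter
from datetime import datetime

rows = [
    ("027212-00004", "11/07/2014", "MS"),
    ("027251-00001", "11/17/2014", "CO"),
    ("027251-00001", "11/17/2014", "CO"),  # same incident, second product
    ("027261-00001", "not a date", "KY"),  # hypothetical unparseable date
]

def parse_year(date_text):
    """Return the year of an MM/DD/YYYY date, or None if it cannot be parsed."""
    try:
        return datetime.strptime(date_text, "%m/%d/%Y").year
    except (TypeError, ValueError):
        return None

row_counts = Counter()
incident_sets = {}
for incident_id, date_text, state in rows:
    year = parse_year(date_text)
    if year is None:
        continue  # mirrors groupby dropping rows whose year is NaN
    row_counts[(state, year)] += 1
    incident_sets.setdefault((state, year), set()).add(incident_id)

incident_counts = {key: len(ids) for key, ids in incident_sets.items()}
print(row_counts[("CO", 2014)], incident_counts[("CO", 2014)])
```

Whether rows or distinct incidents are the right unit depends on the comparison being made. Row counts weight multi-product incidents more heavily.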

In [9]:
# save the outputs
df_filtered.to_csv("IDS_agriculture.csv", index=False)
df_ids_summary.to_csv("IDS_state_year_count.csv", index=False)
print("Saved filtered incidents and state-year counts.")

OSError: Cannot save file into a non-existent directory: 'IDS'
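
The error above names a directory `IDS`, which suggests the cell was last run with output paths inside an `IDS/` folder that did not exist. That is a guess, since the paths shown in the cell have no directory component. A minimal sketch of creating the parent directory before writing, using the standard library `csv` module and a temporary folder so it runs anywhere. The helper `save_rows` and the single data row are made up for illustration:

```python
import csv
import tempfile
from pathlib import Path

def save_rows(rows, header, out_path):
    """Write rows to a CSV file, creating missing parent directories first."""
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    with out_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
    return out_path

base = Path(tempfile.mkdtemp())  # stand-in for the project folder
saved = save_rows(
    [("CA", 2014, 3)],  # made-up row
    ["State", "year", "ids_event_count"],
    base / "IDS" / "IDS_state_year_count.csv",
)
print(saved.exists())
```

In the notebook itself, the same `mkdir(parents=True, exist_ok=True)` call on the target folder before the `to_csv` lines should be enough.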