# EPA Pesticide Applied Acreage Data
## Data Overview
This dataset provides state-level estimates of pesticide-treated acreage for 2012 and 2017, as reported to the U.S. Environmental Protection Agency (EPA). It shows how much agricultural land in each state was treated, broken down by treatment target (insects; weeds, grass, or brush; nematodes; diseases in crops and orchards), and so helps characterize the intensity of chemical applications in farming operations.

## Data Structure
- Unnamed: 0 (int): Row index carried over from the raw CSV
- state (string): Name of the U.S. state, in upper case
- year (int): Year of report (2012 or 2017)
- Insects (numeric): Acreage treated to control insects
- Weeds, grass, or brush (numeric): Acreage treated to control weeds, grass, or brush
- Nematodes (numeric): Acreage treated to control nematodes
- Diseases in crops and orchards (numeric): Acreage treated to control diseases in crops and orchards
- Pesticide_applied_acreage (numeric): Total pesticide-applied acreage

Values are stored as numeric or string fields. Some cells contain special placeholders such as (D), which indicates that a value is undisclosed, and some numbers are stored as strings with comma thousands separators. Both must be handled before the columns can be treated as numeric.
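As a minimal sketch of that conversion, the snippet below parses a handful of raw cell values. The values in `raw` are illustrative stand-ins chosen for this example, not rows taken from the dataset.

```python
import pandas as pd

# Hypothetical raw cells as they might appear in the CSV (illustrative only)
raw = pd.Series(["1,124,965", "(D)", "408", "NA"])

# Strip thousands separators, then coerce anything non-numeric to NaN
parsed = pd.to_numeric(raw.str.replace(",", "", regex=False), errors="coerce")

# Placeholders such as (D) and NA end up as missing values
print(parsed)
print("Missing values:", int(parsed.isna().sum()))
```

Using errors="coerce" means any unexpected placeholder also becomes a missing value instead of raising an error, which is convenient but can hide typos in the raw file.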

## Data Collection & Processing
This data was obtained from EPA summaries of pesticide usage across U.S. states. Cleaning standardized the state names, kept only the 50 U.S. states, and kept only the years 2012 and 2017 for comparison. Textual placeholders such as (D) and "NA" were converted to missing values, and comma-formatted numbers were parsed into numeric values for analysis and visualization.

In [1]:
# Import required libraries
import pandas as pd

# Read the raw pesticide dataset
df_pesticide = pd.read_csv("Pesticide applied acreage.csv")

# Print the number of rows and preview the structure
print(f"Dataset size: {len(df_pesticide)} rows")
df_pesticide.head(5)

Dataset size: 150 rows


Unnamed: 0,state,year,Insects,"Weeds, grass, or brush",Nematodes,Diseases in crops and orchards,Pesticide_applied_acreage
0,ALABAMA,2017,1124965,2080369,152793,378018,3736145
1,ALASKA,2017,408,11071,31,81,11591
2,ARIZONA,2017,746365,828911,51916,79397,1706589
3,ARKANSAS,2017,3915540,6433092,371970,1809370,12529972
4,CALIFORNIA,2017,6513981,7007896,913554,2686889,17122320


In [2]:
# Define valid U.S. states for filtering
US_STATES = [
    'ALABAMA', 'ALASKA', 'ARIZONA', 'ARKANSAS', 'CALIFORNIA', 'COLORADO', 'CONNECTICUT',
    'DELAWARE', 'FLORIDA', 'GEORGIA', 'HAWAII', 'IDAHO', 'ILLINOIS', 'INDIANA', 'IOWA',
    'KANSAS', 'KENTUCKY', 'LOUISIANA', 'MAINE', 'MARYLAND', 'MASSACHUSETTS', 'MICHIGAN',
    'MINNESOTA', 'MISSISSIPPI', 'MISSOURI', 'MONTANA', 'NEBRASKA', 'NEVADA',
    'NEW HAMPSHIRE', 'NEW JERSEY', 'NEW MEXICO', 'NEW YORK', 'NORTH CAROLINA',
    'NORTH DAKOTA', 'OHIO', 'OKLAHOMA', 'OREGON', 'PENNSYLVANIA', 'RHODE ISLAND',
    'SOUTH CAROLINA', 'SOUTH DAKOTA', 'TENNESSEE', 'TEXAS', 'UTAH', 'VERMONT',
    'VIRGINIA', 'WASHINGTON', 'WEST VIRGINIA', 'WISCONSIN', 'WYOMING'
]

In [3]:
# Define the cleaning function
def clean_pesticide_data(file_path: str) -> pd.DataFrame:
    '''
    Loads and cleans pesticide usage data. Filters for 50 U.S. states and the years 2012 and 2017.
    Converts numeric columns, removes special values, and returns cleaned DataFrame.

    Parameters:
        file_path (str): Path to raw pesticide data CSV file.

    Returns:
        pd.DataFrame: Cleaned dataset ready for analysis.
    '''
    df = pd.read_csv(file_path)

    # Standardize state names
    df["state"] = df["state"].str.upper().str.strip()

    # Filter to 50 US states and years 2012 and 2017
    df = df[
        (df["state"].isin(US_STATES)) &
        (df["year"].isin([2012, 2017]))
    ].copy()

    # Replace known non-numeric strings with NaN
    non_numeric_values = ["(D)", "NA", "NaN", "N/A", "<NA>"]
    df.replace(non_numeric_values, pd.NA, inplace=True)

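    # Clean and convert the acreage columns to numeric,
    # skipping the index, state, and year columns
    id_cols = ["Unnamed: 0", "state", "year"]
    value_cols = [col for col in df.columns if col not in id_cols]
    for col in value_cols:
        df[col] = (
            df[col]
            .astype(str)
            .str.replace(",", "", regex=False)
        )
        # Anything still non-numeric becomes NaN
        df[col] = pd.to_numeric(df[col], errors="coerce")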

    return df

In [4]:
# Apply the function to your CSV
cleaned_df = clean_pesticide_data("Pesticide applied acreage.csv")

# Save the cleaned file
cleaned_df.to_csv("cleaned_pesticide_data.csv", index=False)

# Preview
print("Cleaned data preview:")
cleaned_df.head()

Cleaned data preview:


Unnamed: 0,state,year,Insects,"Weeds, grass, or brush",Nematodes,Diseases in crops and orchards,Pesticide_applied_acreage
0,ALABAMA,2017,1124965,2080369,152793.0,378018.0,3736145
1,ALASKA,2017,408,11071,31.0,81.0,11591
2,ARIZONA,2017,746365,828911,51916.0,79397.0,1706589
3,ARKANSAS,2017,3915540,6433092,371970.0,1809370.0,12529972
4,CALIFORNIA,2017,6513981,7007896,913554.0,2686889.0,17122320
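Since the cleaned data keeps 2012 and 2017 for comparison, one possible next step is to put the two years side by side for each state. The sketch below uses a small toy frame with made-up acreage values and the same column names as the cleaned data; `toy`, `by_year`, and the `pct_change` column are names introduced only for this illustration.

```python
import pandas as pd

# Toy stand-in for cleaned_df: same column names, made-up acreage values
toy = pd.DataFrame({
    "state": ["ALABAMA", "ALABAMA", "ALASKA", "ALASKA"],
    "year": [2012, 2017, 2012, 2017],
    "Pesticide_applied_acreage": [1000.0, 1200.0, 50.0, 40.0],
})

# One row per state, one column per year
by_year = toy.pivot(index="state", columns="year", values="Pesticide_applied_acreage")

# Percent change from 2012 to 2017; a missing value in either year stays missing
by_year["pct_change"] = (by_year[2017] - by_year[2012]) / by_year[2012] * 100

print(by_year)
```

With only two time points, this describes a difference between two reports and not a trend.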


## Data Quality & Limitations
- Disclosure flags: Many cells contain (D), which marks a value withheld for privacy or statistical reliability; these become missing values after cleaning.
- Time scope: Only two years (2012 and 2017) are available, which limits trend analysis.
- Coverage: The dataset aggregates treated acres and does not break pesticide use down by crop type or chemical name.
- Overlap: The treatment categories may overlap, since the same acre can be treated for more than one target (e.g., insects and weeds). In the preview rows above, Pesticide_applied_acreage equals the sum of the four category columns, so the total may count such acres more than once and should not be read as unique treated acres.
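
One way to see how the total column relates to the category columns, and how many values are missing, is to compare the total with the row-wise category sum. The sketch below does this for the first three rows of the preview shown earlier; `preview`, `category_cols`, and the other variable names are introduced here for illustration, and the same checks could be run on the full cleaned data.

```python
import pandas as pd

# First three rows of the raw data preview shown earlier in this notebook
preview = pd.DataFrame({
    "state": ["ALABAMA", "ALASKA", "ARIZONA"],
    "Insects": [1124965, 408, 746365],
    "Weeds, grass, or brush": [2080369, 11071, 828911],
    "Nematodes": [152793, 31, 51916],
    "Diseases in crops and orchards": [378018, 81, 79397],
    "Pesticide_applied_acreage": [3736145, 11591, 1706589],
})

category_cols = [
    "Insects",
    "Weeds, grass, or brush",
    "Nematodes",
    "Diseases in crops and orchards",
]

# min_count makes the row sum missing if any category value is missing
category_sum = preview[category_cols].sum(axis=1, min_count=len(category_cols))

# True where the reported total equals the sum of the four categories
matches = category_sum.eq(preview["Pesticide_applied_acreage"])

# Number of missing values per category column (e.g. from (D) flags)
missing_per_column = preview[category_cols].isna().sum()

print(preview.assign(category_sum=category_sum, matches=matches)[["state", "category_sum", "matches"]])
print(missing_per_column)
```

Rows where the sum cannot be computed because of missing values are reported as non-matching, so they should be inspected separately.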