# Weighting Decision and ML Target Dataset

**Purpose:** This notebook documents our choice of sample weighting for counties with low monitoring coverage, then produces the ML-ready dataset: county identifiers plus the target variable (Median AQI) and sample weights for model training.

**Output:** A CSV with `State`, `County`, `Year`, `median_aqi`, and `sample_weight`, ready to merge with socioeconomic features and train the model.

## 1. Summary of the Weighting Decision

We chose **sample weighting** over exclusion. The reasoning and the resulting setup are:

1. **Retention:** Excluding low-coverage counties would bias our sample toward well-monitored (often urban) areas. Weighting lets us retain all counties while reducing the influence of less reliable targets.

2. **Method:** We define `sample_weight = min(1, Days with AQI / T)` where *T* is a reference threshold. Counties with fewer than *T* days are down-weighted proportionally.

3. **Reference threshold:** We use **180 days** (≈ half a year) as *T*. With this threshold:
   - Counties with ≥180 days receive full weight (1.0)
   - Counties with 90 days receive weight 0.5
   - Counties with 30 days receive weight ≈0.17

4. **Implementation:** Many estimators in scikit-learn, and XGBoost's scikit-learn-style wrappers, accept a `sample_weight` argument in `fit()`. We pass this column when training.
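
The weighting rule above can be checked on a few day counts. The sketch below is illustrative only: `coverage_weight` is a hypothetical helper name (the notebook computes the same expression inline), and the day counts are examples, not rows from the dataset.

```python
import numpy as np

REFERENCE_DAYS = 180  # reference threshold T from the text


def coverage_weight(days_with_aqi, reference_days=REFERENCE_DAYS):
    """Hypothetical helper: min(1, days / reference_days), elementwise."""
    days = np.asarray(days_with_aqi, dtype=float)
    return np.minimum(1.0, days / reference_days)


# Illustrative day counts (not taken from the dataset)
example_days = [365, 180, 90, 30]
example_weights = coverage_weight(example_days)

for days, weight in zip(example_days, example_weights):
    print(f"{days:>3} days -> weight {weight:.2f}")
```

Counties at or above the threshold are capped at 1.0, so additional monitoring days beyond 180 do not give a county extra influence.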

## 2. Load Data and Compute Target and Weights

In [7]:
```python
import pandas as pd
import numpy as np

try:
    df = pd.read_csv('../aqi-datasets/Access_to_a_Livable_Planet_Dataset_cleaned.csv')
except FileNotFoundError:
    df = pd.read_csv('../aqi-datasets/Access_to_a_Livable_Planet_Dataset.csv')
    df['County'] = df['County'].str.strip()
    df['State'] = df['State'].str.strip()

# Target: Median AQI (already in dataset; 0-500 scale, higher = worse air quality)
df['median_aqi'] = df['Median AQI']

# Sample weight: down-weight low-coverage counties (reference = 180 days)
REFERENCE_DAYS = 180
df['sample_weight'] = np.minimum(1.0, df['Days with AQI'] / REFERENCE_DAYS)

df.head()
```

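| | State | County | Year | Days with AQI | Good Days | Moderate Days | Unhealthy for Sensitive Groups Days | Unhealthy Days | Very Unhealthy Days | Hazardous Days | Max AQI | 90th Percentile AQI | Median AQI | Days CO | Days NO2 | Days Ozone | Days PM2.5 | Days PM10 | median_aqi | sample_weight |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Alabama | Baldwin | 2025 | 241 | 174 | 67 | 0 | 0 | 0 | 0 | 87 | 56 | 42 | 0 | 0 | 91 | 150 | 0 | 42 | 1.0 |
| 1 | Alabama | Clay | 2025 | 239 | 204 | 34 | 1 | 0 | 0 | 0 | 133 | 52 | 32 | 0 | 0 | 0 | 239 | 0 | 32 | 1.0 |
| 2 | Alabama | DeKalb | 2025 | 243 | 191 | 52 | 0 | 0 | 0 | 0 | 93 | 55 | 42 | 0 | 0 | 156 | 87 | 0 | 42 | 1.0 |
| 3 | Alabama | Elmore | 2025 | 177 | 172 | 5 | 0 | 0 | 0 | 0 | 64 | 46 | 32 | 0 | 0 | 177 | 0 | 0 | 32 | 0.983333 |
| 4 | Alabama | Etowah | 2025 | 241 | 153 | 88 | 0 | 0 | 0 | 0 | 87 | 58 | 45 | 0 | 0 | 72 | 169 | 0 | 45 | 1.0 |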


## 3. Create ML Target Dataset

Extract only the columns needed for modeling: the county identifiers (`State`, `County`), `Year`, the target (`median_aqi`), and `sample_weight`. This dataset will be merged with socioeconomic features in the next step.

In [8]:
```python
ml_target = df[['State', 'County', 'Year', 'median_aqi', 'sample_weight']].copy()

print("Shape:", ml_target.shape)
print("\nSample:")
ml_target.head(10)
```

Shape: (978, 5)

Sample:


| | State | County | Year | median_aqi | sample_weight |
|---|---|---|---|---|---|
| 0 | Alabama | Baldwin | 2025 | 42 | 1.0 |
| 1 | Alabama | Clay | 2025 | 32 | 1.0 |
| 2 | Alabama | DeKalb | 2025 | 42 | 1.0 |
| 3 | Alabama | Elmore | 2025 | 32 | 0.983333 |
| 4 | Alabama | Etowah | 2025 | 45 | 1.0 |
| 5 | Alabama | Jefferson | 2025 | 53 | 1.0 |
| 6 | Alabama | Lawrence | 2025 | 23 | 0.105556 |
| 7 | Alabama | Madison | 2025 | 42 | 1.0 |
| 8 | Alabama | Mobile | 2025 | 43 | 1.0 |
| 9 | Alabama | Montgomery | 2025 | 44 | 1.0 |


In [9]:
```python
# Save to CSV for use in model pipeline
OUTPUT_PATH = '../aqi-datasets/ml_target_dataset.csv'
ml_target.to_csv(OUTPUT_PATH, index=False)
print(f"Saved to {OUTPUT_PATH}")
print(f"Rows: {len(ml_target)}")
print(f"Columns: {list(ml_target.columns)}")
```

Saved to ../aqi-datasets/ml_target_dataset.csv
Rows: 978
Columns: ['State', 'County', 'Year', 'median_aqi', 'sample_weight']


## 4. Next Step

Merge `ml_target_dataset.csv` with county-level socioeconomic data (e.g., median income, education, poverty, race/ethnicity) on `State` and `County` (and on `Year` as well if the socioeconomic data varies by year). Use `median_aqi` as the target and pass the `sample_weight` column as the weights, e.g. `model.fit(X, y, sample_weight=weights)`.
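
A minimal sketch of this merge-and-train step, under several assumptions: the target rows are copied from the sample output above, while the `socio` table, its column names (`median_income`, `poverty_rate`), and its values are made up purely for illustration. `LinearRegression` is only a stand-in here; the notebook does not state which model will be used.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Small stand-in for ml_target_dataset.csv (rows copied from the sample output)
ml_target = pd.DataFrame({
    "State": ["Alabama", "Alabama", "Alabama", "Alabama"],
    "County": ["Baldwin", "Clay", "Elmore", "Lawrence"],
    "Year": [2025, 2025, 2025, 2025],
    "median_aqi": [42, 32, 32, 23],
    "sample_weight": [1.0, 1.0, 0.983333, 0.105556],
})

# Hypothetical socioeconomic features: column names and values are invented
socio = pd.DataFrame({
    "State": ["Alabama", "Alabama", "Alabama", "Alabama"],
    "County": ["Baldwin", "Clay", "Elmore", "Lawrence"],
    "median_income": [64000, 41000, 58000, 47000],
    "poverty_rate": [0.10, 0.18, 0.12, 0.16],
})

# validate="one_to_one" raises if a (State, County) key repeats on either side
merged = ml_target.merge(
    socio, on=["State", "County"], how="inner", validate="one_to_one"
)

feature_cols = ["median_income", "poverty_rate"]
X = merged[feature_cols]
y = merged["median_aqi"]
weights = merged["sample_weight"]

# Low-coverage counties (e.g. Lawrence) contribute less to the fitted loss
model = LinearRegression()
model.fit(X, y, sample_weight=weights)
predictions = model.predict(X)
```

The `validate` argument is a cheap guard against accidental row duplication during the merge. If the target file ever holds more than one year per county, the join keys would need to include `Year` for the one-to-one check to pass.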