# AQI Dataset — Data Cleaning & Validation

This notebook cleans and validates the Access to a Livable Planet AQI dataset. Each step states its rationale so that reviewers can follow the reasoning.

## 1. Load the Data

Load the dataset and inspect its structure to understand what we're working with.

In [13]:
import pandas as pd
import numpy as np

# Load the dataset
df = pd.read_csv('../aqi-datasets/Access_to_a_Livable_Planet_Dataset.csv')

print("Dataset shape:", df.shape)
df.head(10)

Dataset shape: (978, 18)


Unnamed: 0,State,County,Year,Days with AQI,Good Days,Moderate Days,Unhealthy for Sensitive Groups Days,Unhealthy Days,Very Unhealthy Days,Hazardous Days,Max AQI,90th Percentile AQI,Median AQI,Days CO,Days NO2,Days Ozone,Days PM2.5,Days PM10
0,Alabama,Baldwin,2025,241,174,67,0,0,0,0,87,56,42,0,0,91,150,0
1,Alabama,Clay,2025,239,204,34,1,0,0,0,133,52,32,0,0,0,239,0
2,Alabama,DeKalb,2025,243,191,52,0,0,0,0,93,55,42,0,0,156,87,0
3,Alabama,Elmore,2025,177,172,5,0,0,0,0,64,46,32,0,0,177,0,0
4,Alabama,Etowah,2025,241,153,88,0,0,0,0,87,58,45,0,0,72,169,0
5,Alabama,Jefferson,2025,182,67,114,1,0,0,0,112,70,53,1,2,23,156,0
6,Alabama,Lawrence,2025,19,19,0,0,0,0,0,36,34,23,0,0,0,19,0
7,Alabama,Madison,2025,181,136,45,0,0,0,0,72,56,42,0,0,66,115,0
8,Alabama,Mobile,2025,241,168,73,0,0,0,0,80,55,43,0,0,74,167,0
9,Alabama,Montgomery,2025,237,155,82,0,0,0,0,90,57,44,0,0,49,188,0


## 2. Initial Data Exploration

Understanding column types and the range of values helps us choose appropriate validation checks.

In [14]:
df.info()
print("\n--- Basic statistics ---")
df.describe()

<class 'pandas.DataFrame'>
RangeIndex: 978 entries, 0 to 977
Data columns (total 18 columns):
 #   Column                               Non-Null Count  Dtype
---  ------                               --------------  -----
 0   State                                978 non-null    str  
 1   County                               978 non-null    str  
 2   Year                                 978 non-null    int64
 3   Days with AQI                        978 non-null    int64
 4   Good Days                            978 non-null    int64
 5   Moderate Days                        978 non-null    int64
 6   Unhealthy for Sensitive Groups Days  978 non-null    int64
 7   Unhealthy Days                       978 non-null    int64
 8   Very Unhealthy Days                  978 non-null    int64
 9   Hazardous Days                       978 non-null    int64
 10  Max AQI                              978 non-null    int64
 11  90th Percentile AQI                  978 non-null    int64
 12  Media

Unnamed: 0,Year,Days with AQI,Good Days,Moderate Days,Unhealthy for Sensitive Groups Days,Unhealthy Days,Very Unhealthy Days,Hazardous Days,Max AQI,90th Percentile AQI,Median AQI,Days CO,Days NO2,Days Ozone,Days PM2.5,Days PM10
count,978.0,978.0,978.0,978.0,978.0,978.0,978.0,978.0,978.0,978.0,978.0,978.0,978.0,978.0,978.0,978.0
mean,2025.0,216.485685,161.377301,53.014315,1.680982,0.328221,0.0409,0.043967,113.176892,57.610429,38.50409,0.767894,1.780164,115.142127,92.556237,6.239264
std,0.0,68.272456,56.851544,42.410648,4.747511,2.01977,0.323688,0.532572,93.49294,14.554052,10.653013,6.809464,13.355379,86.394535,79.511388,29.134426
min,2025.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,8.0,8.0,3.0,0.0,0.0,0.0,0.0,0.0
25%,2025.0,181.0,132.0,18.0,0.0,0.0,0.0,0.0,80.0,51.0,36.0,0.0,0.0,33.5,6.5,0.0
50%,2025.0,237.0,167.0,42.5,0.0,0.0,0.0,0.0,99.0,58.0,40.0,0.0,0.0,112.0,87.5,0.0
75%,2025.0,273.0,203.0,80.0,2.0,0.0,0.0,0.0,126.0,64.0,44.0,0.0,0.0,184.0,155.0,0.0
max,2025.0,317.0,293.0,223.0,69.0,38.0,6.0,12.0,2122.0,164.0,90.0,100.0,241.0,304.0,304.0,288.0


In [15]:
# Check for missing values — we preserve NaN for pollutant-specific counts as they indicate lack of monitoring
print("Missing values per column:")
df.isna().sum()

Missing values per column:


State                                  0
County                                 0
Year                                   0
Days with AQI                          0
Good Days                              0
Moderate Days                          0
Unhealthy for Sensitive Groups Days    0
Unhealthy Days                         0
Very Unhealthy Days                    0
Hazardous Days                         0
Max AQI                                0
90th Percentile AQI                    0
Median AQI                             0
Days CO                                0
Days NO2                               0
Days Ozone                             0
Days PM2.5                             0
Days PM10                              0
dtype: int64

## 3. Verify Uniqueness of (State, County, Year)

The composite key (State, County, Year) should uniquely identify each row — otherwise we have duplicates that would bias downstream analysis.

In [16]:
key_cols = ['State', 'County', 'Year']
n_rows = len(df)
n_unique = df[key_cols].drop_duplicates().shape[0]

print(f"Total rows: {n_rows}")
print(f"Unique (State, County, Year) combinations: {n_unique}")
print(f"Unique? {n_rows == n_unique}")

if n_rows != n_unique:
    dupes = df[df.duplicated(subset=key_cols, keep=False)]
    print("\nDuplicate rows:")
    display(dupes.sort_values(key_cols))
else:
    print("\n✓ No duplicates. Each (State, County, Year) appears exactly once.")

Total rows: 978
Unique (State, County, Year) combinations: 978
Unique? True

✓ No duplicates. Each (State, County, Year) appears exactly once.
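
As a minimal sketch of the same idea on a made-up frame, the duplicate lookup can be wrapped in a small helper. The helper name `find_duplicate_keys` and the toy rows are hypothetical and are not part of the dataset or the notebook.

```python
import pandas as pd

def find_duplicate_keys(frame, key_cols):
    """Return every row whose key combination appears more than once."""
    dup_mask = frame.duplicated(subset=key_cols, keep=False)
    return frame.loc[dup_mask].sort_values(key_cols)

# Illustrative rows only: Baldwin is deliberately repeated.
toy = pd.DataFrame({
    "State": ["Alabama", "Alabama", "Alabama"],
    "County": ["Baldwin", "Clay", "Baldwin"],
    "Year": [2025, 2025, 2025],
    "Days with AQI": [241, 239, 240],
})

dupes = find_duplicate_keys(toy, ["State", "County", "Year"])
print(len(dupes))  # both Baldwin rows are returned because keep=False
```

Using `keep=False` marks every member of a duplicate group, which makes it easier to compare the conflicting rows side by side.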


## 4. Confirm All Day-Count Columns Are Non-Negative

Day counts cannot be negative — negative values would indicate data entry errors.

In [17]:
day_count_cols = [
    'Days with AQI', 'Good Days', 'Moderate Days',
    'Unhealthy for Sensitive Groups Days', 'Unhealthy Days', 'Very Unhealthy Days', 'Hazardous Days',
    'Days CO', 'Days NO2', 'Days Ozone', 'Days PM2.5', 'Days PM10'
]

negative_mask = (df[day_count_cols] < 0).any(axis=1)
n_negative = negative_mask.sum()

print(f"Rows with any negative day count: {n_negative}")

if n_negative > 0:
    print("\nRows with negative values:")
    display(df.loc[negative_mask, day_count_cols])
else:
    print("\n✓ All day-count columns are non-negative.")

Rows with any negative day count: 0

✓ All day-count columns are non-negative.
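
When the check does find negatives, a per-column breakdown can help locate the problem. The sketch below uses invented values (one deliberate negative) purely for illustration.

```python
import pandas as pd

# Illustrative values only; the -3 is a deliberate error.
toy = pd.DataFrame({
    "Good Days": [174, -3, 191],
    "Moderate Days": [67, 34, 52],
})

# Count negative entries per column, then select rows containing any negative.
negatives_per_col = (toy < 0).sum()
negative_rows = toy[(toy < 0).any(axis=1)]

print(negatives_per_col.to_dict())
print(negative_rows.index.tolist())
```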


## 5. Ensure No Day-Count Column Exceeds Days with AQI

For each county-year, the total days with AQI is the ceiling — no individual pollutant or category count can exceed it.

In [18]:
days_with_aqi = df['Days with AQI']
exceeds_mask = pd.Series(False, index=df.index)

for col in day_count_cols:
    if col == 'Days with AQI':
        continue
    exceeds_mask |= (df[col] > days_with_aqi)

n_exceeds = exceeds_mask.sum()
print(f"Rows where any day-count exceeds 'Days with AQI': {n_exceeds}")

if n_exceeds > 0:
    print("\nViolating rows:")
    violate_df = df.loc[exceeds_mask].copy()
    for col in day_count_cols:
        if col != 'Days with AQI':
            violate_df[f"{col}_exceeds"] = violate_df[col] > violate_df['Days with AQI']
    display(violate_df)
else:
    print("\n✓ No day-count column exceeds 'Days with AQI'.")

Rows where any day-count exceeds 'Days with AQI': 0

✓ No day-count column exceeds 'Days with AQI'.
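
The loop above can also be written as a single vectorized comparison. This is a sketch on made-up rows, assuming only that the ceiling column is named `Days with AQI` as in the dataset. `DataFrame.gt` with `axis=0` compares each column against the ceiling row by row.

```python
import pandas as pd

# Illustrative rows only; 25 > 19 in the second row is a deliberate violation.
toy = pd.DataFrame({
    "Days with AQI": [241, 19, 182],
    "Good Days": [174, 19, 67],
    "Days Ozone": [91, 25, 23],
})

other_cols = ["Good Days", "Days Ozone"]

# axis=0 aligns the ceiling Series with the row index of the frame.
exceeds = toy[other_cols].gt(toy["Days with AQI"], axis=0)
violating_rows = toy[exceeds.any(axis=1)]

print(exceeds.sum().to_dict())
print(violating_rows.index.tolist())
```

The boolean frame `exceeds` also shows which column caused each violation, without adding helper columns to the data.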


## 6. AQI Missingness — Preserve NaN, Do Not Impute

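Pollutant-specific day counts (Days CO, Days NO2, etc.) may be missing when a pollutant was not monitored. We preserve such values as NaN rather than imputing them, since imputation would introduce artificial data. The check in Section 2 found no missing values in any column, so for this file the policy is precautionary.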

In [20]:
pollutant_cols = ['Days CO', 'Days NO2', 'Days Ozone', 'Days PM2.5', 'Days PM10']

print("Missing pollutant day counts (likely due to lack of monitoring):")
missing_by_pollutant = df[pollutant_cols].isna().sum()
print(missing_by_pollutant)

print("\n✓ Policy: We preserve NaN. Do not impute unless analytically justified.")

Missing pollutant day counts (likely due to lack of monitoring):
Days CO       0
Days NO2      0
Days Ozone    0
Days PM2.5    0
Days PM10     0
dtype: int64

✓ Policy: We preserve NaN. Do not impute unless analytically justified.
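
To illustrate why imputing is avoided, the sketch below compares a mean computed with NaN preserved against one where NaN is replaced by 0. The three values are invented for the example.

```python
import numpy as np
import pandas as pd

# Illustrative values only: one true zero, one count, one missing value.
toy = pd.DataFrame({"Days PM10": [0.0, 12.0, np.nan]})

missing = int(toy["Days PM10"].isna().sum())

# pandas skips NaN by default: (0 + 12) / 2
mean_preserved = toy["Days PM10"].mean()

# Imputing 0 treats "not monitored" as "zero days": (0 + 12 + 0) / 3
mean_imputed = toy["Days PM10"].fillna(0).mean()

print(missing, mean_preserved, mean_imputed)
```

Filling with 0 would make an unmonitored county indistinguishable from one with a true count of zero, which is the artificial data the policy is meant to avoid.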


## 7. Consistency Check — Sum of AQI Category Days vs Days with AQI

The sum of Good + Moderate + Unhealthy for Sensitive Groups + Unhealthy + Very Unhealthy + Hazardous should equal Days with AQI (each day falls into exactly one category).

In [21]:
aqi_cat_cols = ['Good Days', 'Moderate Days', 'Unhealthy for Sensitive Groups Days',
                'Unhealthy Days', 'Very Unhealthy Days', 'Hazardous Days']

df['_sum_categories'] = df[aqi_cat_cols].sum(axis=1)
mismatch = df['_sum_categories'] != df['Days with AQI']
n_mismatch = mismatch.sum()

print(f"Rows where sum of AQI categories ≠ Days with AQI: {n_mismatch}")

if n_mismatch > 0:
    print("\nSample of mismatched rows:")
    display(df.loc[mismatch, ['State', 'County', 'Days with AQI', '_sum_categories'] + aqi_cat_cols].head(10))
else:
    print("\n✓ Sum of AQI category days equals Days with AQI for all rows.")

df.drop(columns=['_sum_categories'], inplace=True)

Rows where sum of AQI categories ≠ Days with AQI: 0

✓ Sum of AQI category days equals Days with AQI for all rows.
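
A variant of this check computes the gap directly, which avoids adding and dropping a temporary column. The sketch uses invented rows and only three of the six categories to stay short.

```python
import pandas as pd

# Illustrative rows only; the third row is deliberately short by 5 days.
toy = pd.DataFrame({
    "Days with AQI": [241, 239, 100],
    "Good Days": [174, 204, 60],
    "Moderate Days": [67, 34, 30],
    "Unhealthy for Sensitive Groups Days": [0, 1, 5],
})

cat_cols = ["Good Days", "Moderate Days", "Unhealthy for Sensitive Groups Days"]

# Positive gap: categories undercount; negative gap: categories overcount.
gap = toy["Days with AQI"] - toy[cat_cols].sum(axis=1)
mismatched = toy.loc[gap != 0].assign(gap=gap)

print(gap.tolist())
print(mismatched.index.tolist())
```

Keeping the signed gap shows the direction and size of any discrepancy, not only that one exists.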


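## 8. Strip Leading and Trailing Whitespace from State and County Names

Names with leading or trailing spaces silently break joins and groupings, because a value such as "Clay " is treated as a different key from "Clay". We strip whitespace from both State and County. The count printed below covers County only.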

In [22]:
before = (df['County'] != df['County'].str.strip()).sum()
print(f"Counties with leading/trailing whitespace: {before}")

df['County'] = df['County'].str.strip()
df['State'] = df['State'].str.strip()

print("✓ State and County names cleaned.")

Counties with leading/trailing whitespace: 8
✓ State and County names cleaned.
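
Stripping whitespace can merge keys that previously looked distinct, so it is worth evaluating uniqueness on the cleaned frame as well (the summary in Section 9 does this). The sketch below shows the effect on two made-up rows.

```python
import pandas as pd

# Illustrative rows only; the second county name has a trailing space.
toy = pd.DataFrame({
    "State": ["Alabama", "Alabama"],
    "County": ["Clay", "Clay "],
    "Year": [2025, 2025],
})
key_cols = ["State", "County", "Year"]

# Before cleaning, the two keys compare as different strings.
dupes_before = int(toy.duplicated(subset=key_cols).sum())

toy["County"] = toy["County"].str.strip()

# After cleaning, the second row duplicates the first.
dupes_after = int(toy.duplicated(subset=key_cols).sum())

print(dupes_before, dupes_after)
```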


## 9. Summary and Export Cleaned Data

Summarize validation results and save the cleaned dataset for downstream use.

In [23]:
_exceeds = any((df[col] > df['Days with AQI']).any() for col in day_count_cols if col != 'Days with AQI')

print("=== DATA CLEANING SUMMARY ===")
print(f"Total rows: {len(df)}")
print(f"1. Uniqueness (State, County, Year): {'PASS' if df[key_cols].drop_duplicates().shape[0] == len(df) else 'REVIEW'}")
print(f"2. Non-negative day counts: {'PASS' if not (df[day_count_cols] < 0).any().any() else 'REVIEW'}")
print(f"3. No day-count exceeds Days with AQI: {'PASS' if not _exceeds else 'REVIEW'}")
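# Assumption: "low coverage" means fewer than 30 days with AQI, inferred from the recorded output
print(f"4. Low-coverage counties (<30 days): {(df['Days with AQI'] < 30).sum()} flagged")
print("5. Missingness: NaN preserved in pollutant columns, no imputation")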

# Export cleaned data to a new CSV (original file is never modified)
df.to_csv('../aqi-datasets/Access_to_a_Livable_Planet_Dataset_cleaned.csv', index=False)
print("\nCleaned data saved to ../aqi-datasets/Access_to_a_Livable_Planet_Dataset_cleaned.csv")

=== DATA CLEANING SUMMARY ===
Total rows: 978
1. Uniqueness (State, County, Year): PASS
2. Non-negative day counts: PASS
3. No day-count exceeds Days with AQI: PASS
4. Low-coverage counties (<30 days): 36 flagged
5. Missingness: NaN preserved in pollutant columns, no imputation

Cleaned data saved to ../aqi-datasets/Access_to_a_Livable_Planet_Dataset_cleaned.csv
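
The export passes `index=False`. As a sketch of why that matters, the example below round-trips a one-row toy frame through an in-memory buffer (used here instead of a file, as an assumption for illustration) with and without the index.

```python
import io
import pandas as pd

# Illustrative single row.
toy = pd.DataFrame({
    "State": ["Alabama"],
    "County": ["Baldwin"],
    "Year": [2025],
    "Days with AQI": [241],
})

# Export without the index, as in the notebook.
buffer_no_index = io.StringIO()
toy.to_csv(buffer_no_index, index=False)
buffer_no_index.seek(0)
cols_without_index = pd.read_csv(buffer_no_index).columns.tolist()

# Export with the default index=True for comparison.
buffer_with_index = io.StringIO()
toy.to_csv(buffer_with_index)
buffer_with_index.seek(0)
cols_with_index = pd.read_csv(buffer_with_index).columns.tolist()

print(cols_without_index)
print(cols_with_index)
```

With the default setting, the row index is written as an unnamed first column, which `read_csv` then loads as `Unnamed: 0`. Passing `index=False` keeps the reloaded columns identical to the original ones.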


In [24]:
df.head()

Unnamed: 0,State,County,Year,Days with AQI,Good Days,Moderate Days,Unhealthy for Sensitive Groups Days,Unhealthy Days,Very Unhealthy Days,Hazardous Days,Max AQI,90th Percentile AQI,Median AQI,Days CO,Days NO2,Days Ozone,Days PM2.5,Days PM10
0,Alabama,Baldwin,2025,241,174,67,0,0,0,0,87,56,42,0,0,91,150,0
1,Alabama,Clay,2025,239,204,34,1,0,0,0,133,52,32,0,0,0,239,0
2,Alabama,DeKalb,2025,243,191,52,0,0,0,0,93,55,42,0,0,156,87,0
3,Alabama,Elmore,2025,177,172,5,0,0,0,0,64,46,32,0,0,177,0,0
4,Alabama,Etowah,2025,241,153,88,0,0,0,0,87,58,45,0,0,72,169,0
