In [4]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import warnings
import functions
warnings.simplefilter(action='ignore', category=FutureWarning)
# Load the dataset
# df = pd.read_csv('data/Merged_AQI_Income_Poverty_Unemployment_Livability.csv')

## **Data**

**Datasets:**  
In this project, we have used two main types of datasets: **Air Quality Index (AQI)** and **Income**.

The links to access these datasets are:  
- **AQI:** https://aqs.epa.gov/aqsweb/airdata/download_files.html#Daily  
- **Income:** https://www.census.gov/data/tables/2024/demo/income-poverty/p60-282.html

**Datasets Description:**  
The AQI datasets from the United States Environmental Protection Agency (EPA) provide a wide variety of records. In this study, we use datasets for four main pollutants: **Ozone (O₃), Sulfur Dioxide (SO₂), Carbon Monoxide (CO), and Nitrogen Dioxide (NO₂)**, along with four key meteorological variables: **Wind Speed/Direction, Temperature, Barometric Pressure, and Relative Humidity/Dewpoint**.

These pollutants were selected because they are among the most common and impactful urban air pollutants, with well-documented health and environmental effects. The meteorological datasets were included due to their strong influence on pollutant dispersion, chemical transformation, and accumulation in the atmosphere, which are critical for understanding variations in air quality over time and across locations.

## **Data Cleaning**

A single function, `combine_datasets_POLLUTANTS`, handles both the cleaning and the merging of the pollutant datasets.  
First, the function reads in each pollutant-specific CSV file and cleans the data using the `clean_pollutant_dataset` helper function.  
This step involves filtering by pollutant standard where applicable, converting the date column to a proper datetime format, and removing any invalid or duplicate entries.  
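
The cleaning step described above could look roughly like the sketch below. It is not the project's `clean_pollutant_dataset`: the function name `clean_pollutant_sketch` is hypothetical, the column names (`Pollutant Standard`, `Date Local`, `Arithmetic Mean`) are assumed to match the EPA daily files, and "invalid" is assumed to mean a missing date or missing mean. The toy rows exist only to exercise the function.

```python
import pandas as pd

# Assumed join keys; they mirror the merge attributes named in this notebook.
KEYS = ["State Name", "County Name", "Date Local", "Local Site Name"]

def clean_pollutant_sketch(df, pollutant_standard=None):
    """Hypothetical stand-in for clean_pollutant_dataset (illustration only)."""
    out = df
    # Keep only rows for the requested standard, when one is given.
    if pollutant_standard is not None:
        out = out[out["Pollutant Standard"] == pollutant_standard]
    out = out.copy()
    # Parse dates; values that cannot be parsed become NaT.
    out["Date Local"] = pd.to_datetime(out["Date Local"], errors="coerce")
    # Assumption: a row is invalid if its date or its mean is missing.
    out = out.dropna(subset=["Date Local", "Arithmetic Mean"])
    # Keep one record per site and day.
    out = out.drop_duplicates(subset=KEYS)
    return out.reset_index(drop=True)

# Toy input: one valid row, its duplicate, another standard, and a bad date.
raw = pd.DataFrame({
    "State Name": ["Alabama"] * 4,
    "County Name": ["Jefferson"] * 4,
    "Local Site Name": ["North Birmingham"] * 4,
    "Date Local": ["2023-01-01", "2023-01-01", "2023-01-02", "not a date"],
    "Pollutant Standard": ["SO2 1-hour 2010", "SO2 1-hour 2010",
                           "SO2 3-hour 1971", "SO2 1-hour 2010"],
    "Arithmetic Mean": [0.5, 0.5, 0.7, 0.9],
})
cleaned = clean_pollutant_sketch(raw, "SO2 1-hour 2010")
print(cleaned)
```

Filtering by standard before dropping duplicates matters here, because a site can report the same day under more than one standard.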

After cleaning, the key measurement columns of each dataset are renamed with pollutant-specific suffixes (e.g., `Arithmetic Mean_ozone`, `1st Max Value_so2`) to avoid naming conflicts during merging.  

The cleaned datasets are then merged on common attributes such as **State Name**, **County Name**, **Date Local**, and **Local Site Name** using an inner join, which ensures that only records present in all datasets are retained.  

Finally, the merged dataset is sorted by location and date, and saved as a new CSV file for further analysis.
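
As a rough illustration of the rename, merge, and sort steps above, the sketch below tags the measurement columns with a suffix, inner-joins the frames on the four shared attributes, and sorts the result. The names `add_suffix_to_values`, `merge_pollutants_sketch`, and `toy` are hypothetical, and the values are toy data; the real logic lives in `functions.combine_datasets_POLLUTANTS`.

```python
from functools import reduce

import pandas as pd

KEYS = ["State Name", "County Name", "Date Local", "Local Site Name"]
VALUE_COLS = ["Arithmetic Mean", "1st Max Value", "1st Max Hour"]

def add_suffix_to_values(df, suffix):
    """Rename measurement columns, e.g. 'Arithmetic Mean' -> 'Arithmetic Mean_ozone'."""
    renamed = {col: f"{col}_{suffix}" for col in VALUE_COLS}
    return df[KEYS + VALUE_COLS].rename(columns=renamed)

def merge_pollutants_sketch(frames_by_suffix):
    """Inner-join all frames on KEYS, then sort by location and date."""
    tagged = [add_suffix_to_values(df, s) for s, df in frames_by_suffix.items()]
    merged = reduce(lambda left, right: left.merge(right, on=KEYS, how="inner"), tagged)
    order = ["State Name", "County Name", "Local Site Name", "Date Local"]
    return merged.sort_values(order).reset_index(drop=True)

def toy(dates, mean):
    """Build a tiny single-site frame (toy values, not real measurements)."""
    return pd.DataFrame({
        "State Name": "Arizona",
        "County Name": "Maricopa",
        "Date Local": pd.to_datetime(dates),
        "Local Site Name": "CENTRAL PHOENIX",
        "Arithmetic Mean": mean,
        "1st Max Value": mean * 2,
        "1st Max Hour": 9,
    })

ozone = toy(["2024-01-02", "2024-01-01"], 0.01)
so2 = toy(["2024-01-01", "2024-01-03"], 0.5)
merged = merge_pollutants_sketch({"ozone": ozone, "so2": so2})
print(merged)  # only the date present in both frames survives the inner join
```

When saving such a frame, `merged.to_csv(path, index=False)` would keep the row index out of the file, which avoids an extra `Unnamed: 0` column when the CSV is read back.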

In [5]:
functions.combine_datasets_POLLUTANTS(
    "daily_OZONE_2023.csv",  "daily_SO2_2023.csv", "daily_CO_2023.csv", "daily_NO2_2023.csv", "merged_ALL_POLLUTANTS_2023.csv"
)

  df1 = clean_pollutant_dataset(pd.read_csv(file1))
  df2 = clean_pollutant_dataset(pd.read_csv(file2), 'SO2 1-hour 2010')
  df3 = clean_pollutant_dataset(pd.read_csv(file3), 'CO 8-hour 1971')
  df4 = clean_pollutant_dataset(pd.read_csv(file4))


Saved: merged_ALL_POLLUTANTS_2023.csv


In [6]:
functions.combine_datasets_POLLUTANTS(
    "daily_44201_2024.csv",  "daily_42401_2024.csv", "daily_42101_2024.csv", "daily_42602_2024.csv", "merged_ALL_POLLUTANTS_2024.csv"
)

  df1 = clean_pollutant_dataset(pd.read_csv(file1))
  df2 = clean_pollutant_dataset(pd.read_csv(file2), 'SO2 1-hour 2010')
  df3 = clean_pollutant_dataset(pd.read_csv(file3), 'CO 8-hour 1971')


Saved: merged_ALL_POLLUTANTS_2024.csv


Because the raw datasets are too large to upload to GitHub, we have uploaded the output of the previous step instead, which is smaller and fully processed.

In [7]:
df_POLLUTANT_2023 = pd.read_csv("merged_ALL_POLLUTANTS_2023.csv")
df_POLLUTANT_2023.head(3)

Unnamed: 0,State Name,County Name,Date Local,Local Site Name,Arithmetic Mean_ozone,1st Max Value_ozone,1st Max Hour_ozone,Arithmetic Mean_so2,1st Max Value_so2,1st Max Hour_so2,Arithmetic Mean_co,1st Max Value_co,1st Max Hour_co,Arithmetic Mean_no2,1st Max Value_no2,1st Max Hour_no2
0,Alabama,Jefferson,2023-01-01,North Birmingham,0.031588,0.041,9,-0.0875,0.0,11,0.126316,0.2,19,5.2375,16.5,18
1,Alabama,Jefferson,2023-01-02,North Birmingham,0.019824,0.023,22,0.329167,0.9,8,0.1375,0.3,0,4.975,9.6,7
2,Alabama,Jefferson,2023-01-03,North Birmingham,0.021471,0.023,22,0.0,0.3,6,0.1125,0.2,17,5.4625,9.9,17


In [8]:
df_POLLUTANT_2024 = pd.read_csv("merged_ALL_POLLUTANTS_2024.csv")
df_POLLUTANT_2024.head(3)

Unnamed: 0,State Name,County Name,Date Local,Local Site Name,Arithmetic Mean_ozone,1st Max Value_ozone,1st Max Hour_ozone,Arithmetic Mean_so2,1st Max Value_so2,1st Max Hour_so2,Arithmetic Mean_co,1st Max Value_co,1st Max Hour_co,Arithmetic Mean_no2,1st Max Value_no2,1st Max Hour_no2
0,Arizona,Maricopa,2024-01-01,CENTRAL PHOENIX,0.007882,0.016,9,0.583333,2.0,0,1.057895,2.6,5,22.083333,34.0,20
1,Arizona,Maricopa,2024-01-01,JLG SUPERSITE,0.009,0.021,10,0.95,2.3,21,0.757895,1.3,8,22.2,37.3,18
2,Arizona,Maricopa,2024-01-02,CENTRAL PHOENIX,0.012765,0.024,10,0.5,2.0,11,0.495833,0.7,23,25.041667,40.0,19


The cleaning process for the meteorological datasets mirrors that of the pollutant datasets.

In [9]:
functions.combine_datasets_METEO("daily_PRESS_2023.csv", "daily_RH_DP_2023.csv", "daily_TEMP_2023.csv", "daily_WIND_2023.csv", "merged_ALL_METEO_2023.csv")
functions.combine_datasets_METEO("daily_PRESS_2024.csv", "daily_RH_DP_2024.csv", "daily_TEMP_2024.csv", "daily_WIND_2024.csv", "merged_ALL_METEO_2024.csv")

Saved: merged_ALL_METEO_2023.csv
Saved: merged_ALL_METEO_2024.csv


In [10]:
df_METEO_2023 = pd.read_csv("merged_ALL_METEO_2023.csv")
df_METEO_2023.head(3)

Unnamed: 0,State Name,County Name,Date Local,Local Site Name,Arithmetic Mean_PRESS,1st Max Value_PRESS,1st Max Hour_PRESS,Arithmetic Mean_RH_DP,1st Max Value_RH_DP,1st Max Hour_RH_DP,Arithmetic Mean_TEMP,1st Max Value_TEMP,1st Max Hour_TEMP,Arithmetic Mean_WIND_SPEED,1st Max Value_WIND_SPEED,1st Max Hour_WIND_SPEED,Arithmetic Mean_WIND_DIRECTION,1st Max Value_WIND_DIRECTION,1st Max Hour_WIND_DIRECTION
0,Alabama,Escambia,2023-01-01,PCI MET1,1008.641667,1010.4,9,80.345833,96.8,7,63.320833,77.3,14,1.970833,4.3,14,117.375,307.0,8
1,Alabama,Escambia,2023-01-02,PCI MET1,1008.7375,1010.9,10,87.804167,97.4,9,70.845833,79.8,13,4.904167,7.3,19,64.125,100.0,11
2,Alabama,Escambia,2023-01-03,PCI MET1,1003.883333,1006.6,0,85.5125,93.0,6,73.4375,78.7,11,7.2875,10.7,13,90.625,120.0,15


In [11]:
df_METEO_2024 = pd.read_csv("merged_ALL_METEO_2024.csv")
df_METEO_2024.head(3)

Unnamed: 0,State Name,County Name,Date Local,Local Site Name,Arithmetic Mean_PRESS,1st Max Value_PRESS,1st Max Hour_PRESS,Arithmetic Mean_RH_DP,1st Max Value_RH_DP,1st Max Hour_RH_DP,Arithmetic Mean_TEMP,1st Max Value_TEMP,1st Max Hour_TEMP,Arithmetic Mean_WIND_SPEED,1st Max Value_WIND_SPEED,1st Max Hour_WIND_SPEED,Arithmetic Mean_WIND_DIRECTION,1st Max Value_WIND_DIRECTION,1st Max Hour_WIND_DIRECTION
0,Alabama,Escambia,2024-01-01,PCI MET1,1012.133333,1015.4,23,73.829167,95.1,7,46.3375,58.7,13,4.225,8.2,15,277.333333,337.0,21
1,Alabama,Escambia,2024-01-02,PCI MET1,1013.7875,1016.4,8,71.008333,91.2,22,38.925,54.5,15,3.1875,8.0,9,107.958333,320.0,17
2,Alabama,Escambia,2024-01-03,PCI MET1,1008.520833,1010.0,9,83.125,93.3,23,42.4875,47.9,13,4.779167,10.6,12,97.541667,209.0,22


Finally, since many states and counties are missing in either the 2023 or 2024 datasets, we retain only the records for states and counties that appear in **both** years to maintain consistency between the datasets.

To achieve this, we first merge the meteorological, pollutant, and county-level AQI files for each year. We then identify the states common to the 2023 and 2024 results, drop any records outside this shared subset, and combine the two filtered years into a single file, `AQI.csv`.
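
The idea of keeping only locations present in both years can be sketched on in-memory frames as follows. This is a simplified illustration at the state level: `common_states_sketch` and `filter_to_states` are hypothetical names, the `AQI` column and its values are toy data, and the project's own helpers in `functions` operate on CSV paths instead.

```python
import pandas as pd

def common_states_sketch(df_a, df_b):
    """Return the sorted list of states that appear in both frames."""
    return sorted(set(df_a["State Name"]) & set(df_b["State Name"]))

def filter_to_states(df, states):
    """Keep only rows whose state is in the shared list."""
    return df[df["State Name"].isin(states)].reset_index(drop=True)

# Toy frames standing in for the per-year merged files.
df_2023 = pd.DataFrame({"State Name": ["Alabama", "Arizona", "Texas"],
                        "AQI": [40, 55, 60]})
df_2024 = pd.DataFrame({"State Name": ["Arizona", "Texas", "Ohio"],
                        "AQI": [52, 58, 45]})

shared = common_states_sketch(df_2023, df_2024)
both_years = pd.concat(
    [filter_to_states(df_2023, shared), filter_to_states(df_2024, shared)],
    ignore_index=True,
)
print(shared)
print(both_years)
```

A set intersection keeps the comparison independent of row order and of how many records each state contributes.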

In [12]:
functions.combine_all_data("merged_ALL_METEO_2023.csv", "merged_ALL_POLLUTANTS_2023.csv", "daily_aqi_by_county_2023.csv", "all_attributes_2023.csv")
functions.combine_all_data("merged_ALL_METEO_2024.csv", "merged_ALL_POLLUTANTS_2024.csv", "daily_aqi_by_county_2024.csv", "all_attributes_2024.csv")

commonStates = functions.get_common_states("all_attributes_2023.csv", "all_attributes_2024.csv")
print("Common states: ", commonStates)

functions.filter_common_states("all_attributes_2023.csv", "24StateAQI_2023.csv")
functions.filter_common_states("all_attributes_2024.csv", "24StateAQI_2024.csv")

functions.combine_csv_files("24StateAQI_2023.csv", "24StateAQI_2024.csv", "AQI.csv")

Saved: all_attributes_2023.csv
Saved: all_attributes_2024.csv
Common states:  ['Arizona', 'California', 'Connecticut', 'Georgia', 'Idaho', 'Illinois', 'Indiana', 'Louisiana', 'Maryland', 'Massachusetts', 'Michigan', 'Missouri', 'Nevada', 'New Hampshire', 'New Mexico', 'North Carolina', 'North Dakota', 'Ohio', 'Pennsylvania', 'Rhode Island', 'Texas', 'Virginia', 'Washington', 'Wyoming']
Saved filtered state file: 24StateAQI_2023.csv
Saved filtered state file: 24StateAQI_2024.csv
