# Gavin Kendal-Freedman

## Research question/interests

**My main interest is how different chemicals affect air quality ratings beyond the usual PM2.5, PM10, CO2, and NOx: for example, heavy metals, residuals from traditional fuels used in motor vehicles and stoves (natural gas, gasoline, diesel), and common organic solvents used in manufacturing (e.g. toluene), and how each of these pollutants affects the AQI. Further, I want to see whether air quality has changed over time at significantly different rates in urban and rural areas (not just whether urban areas have better or worse air quality), based on the specific pollutants from the first part, and whether asthma rates in different areas correlate with the pollutant levels there.**

### Rough Plan for Data Analysis

1. (Already done as part of loading) Combine all EPA data files into one dataframe
2. Remove rows for data that's not applicable, e.g. sample data, wind speed, precipitation, solar radiation, temperature
3. Remove columns that are mostly null/NA/NaN/missing values or are not interesting to the analysis
4. Remove rows for data that's not adjusted/corrected to align with the rest of the data
5. Remove rows for parameters with too few data points to be usable for analysis, probably fewer than a few hundred/thousand, or rows that contain data that's unique to one city (data unique to one city could be used for a time-based analysis of a single city, but that's more specific than I'm interested in)
6. From the remaining parameters, narrow down to the top 10 to 25(?) parameters to analyze
7. Try to see if there are any obvious correlations via heatmap(s)/basic plots/basic statistical analysis
8. From there, try and find out where the real and observable correlations are to then visualize
9. Unknown as of now
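
Steps 2 through 7 of the plan could be sketched roughly as follows. This is only an illustration on a small made-up dataframe, not the actual analysis: the values, the `Mostly Missing` column, the list of unwanted parameters, and the thresholds (50% missing, at least 4 rows, top 25) are all assumptions chosen to keep the example small. `Arithmetic Mean` is assumed to be the concentration column of interest.

```python
import re

import numpy as np
import pandas as pd

# Made-up stand-in for the merged EPA data (hypothetical values throughout).
rows = [
    (2011, "A", "Ozone", 0.040), (2011, "A", "Toluene", 1.0),
    (2011, "B", "Ozone", 0.050), (2011, "B", "Toluene", 1.4),
    (2012, "A", "Ozone", 0.042), (2012, "A", "Toluene", 1.1),
    (2012, "B", "Ozone", 0.048), (2012, "B", "Toluene", 1.3),
    (2011, "A", "Wind Speed - Resultant", 3.2),
    (2012, "B", "Outdoor Temperature", 55.0),
]
toy = pd.DataFrame(
    rows, columns=["Year", "CBSA Name", "Parameter Name", "Arithmetic Mean"]
)
toy["Mostly Missing"] = np.nan       # hypothetical column that is almost empty
toy.loc[0, "Mostly Missing"] = 1.0

# Step 2: drop rows for parameters that are not pollutants (assumed list).
unwanted = ["Wind Speed", "Temperature"]
pattern = "|".join(re.escape(term) for term in unwanted)
is_unwanted = toy["Parameter Name"].str.contains(pattern, case=False, regex=True)
cleaned = toy[~is_unwanted]

# Step 3: drop columns where more than half of the values are missing.
missing_share = cleaned.isna().mean()
cleaned = cleaned.loc[:, missing_share <= 0.5]

# Steps 5-6: keep parameters with enough rows, then the most common ones.
counts = cleaned["Parameter Name"].value_counts()
top_parameters = counts[counts >= 4].head(25).index
cleaned = cleaned[cleaned["Parameter Name"].isin(top_parameters)]

# Step 7: one column per parameter, then pairwise correlations.
wide = cleaned.pivot_table(
    index=["Year", "CBSA Name"],
    columns="Parameter Name",
    values="Arithmetic Mean",
    aggfunc="mean",
)
corr = wide.corr()
print(corr)
```

The resulting `corr` matrix is what a heatmap would be drawn from, for example with `sns.heatmap(corr, annot=True)`.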

In [2]:
# Imports
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

In [15]:
# Load the Data
# First, let's open and concatenate every year's AQI data (2011 through 2022), using a generator to read each file, since the file names follow a pattern
aqi = pd.concat(
    (pd.read_csv(f"../data/raw/annual_aqi_by_cbsa_{x}.csv") for x in range(2011, 2023))
)
# Second, let's open and concatenate the concentration data for each year in the same way as the AQI data
concentration = pd.concat(
    (
        pd.read_csv(f"../data/raw/annual_conc_by_monitor_{x}.csv")
        for x in range(2011, 2023)
    )
)
# Now, let's merge the two dataframes on the Year and CBSA name columns using an inner join
# An inner join keeps only rows whose Year and CBSA name values match in both the left and right dataframes
# CBSA is a location identifier (Core Based Statistical Areas) - more info on them can be found here:
#   https://aqs.epa.gov/aqsweb/documents/codetables/cbsas.html
combined = concentration.merge(
    aqi, how="inner", left_on=["Year", "CBSA Name"], right_on=["Year", "CBSA"]
).drop(columns=["CBSA"])
# Now, let's display the first 5 rows to make sure the join worked as expected
combined.head()


| | Parameter Code | POC | Parameter Name | Sample Duration | Pollutant Standard | Metric Used | Method Name | Year | Units of Measure | Event Type | ... | Very Unhealthy Days | Hazardous Days | Max AQI | 90th Percentile AQI | Median AQI | Days CO | Days NO2 | Days Ozone | Days PM2.5 | Days PM10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 44201 | 1 | Ozone | 1 HOUR | Ozone 1-hour 1979 | Daily maxima of observed hourly values (betwee... | INSTRUMENTAL - ULTRA VIOLET | 2011 | Parts per million | No Events | ... | 0 | 0 | 126 | 74 | 42 | 0 | 0 | 209 | 73 | 0 |
| 1 | 44201 | 1 | Ozone | 8-HR RUN AVG BEGIN HOUR | Ozone 8-Hour 1997 | Daily maximum of 8 hour running average of obs... | | 2011 | Parts per million | No Events | ... | 0 | 0 | 126 | 74 | 42 | 0 | 0 | 209 | 73 | 0 |
| 2 | 44201 | 1 | Ozone | 8-HR RUN AVG BEGIN HOUR | Ozone 8-Hour 2008 | Daily maximum of 8 hour running average of obs... | | 2011 | Parts per million | No Events | ... | 0 | 0 | 126 | 74 | 42 | 0 | 0 | 209 | 73 | 0 |
| 3 | 44201 | 1 | Ozone | 8-HR RUN AVG BEGIN HOUR | Ozone 8-hour 2015 | Daily maximum of 8-hour running average | | 2011 | Parts per million | No Events | ... | 0 | 0 | 126 | 74 | 42 | 0 | 0 | 209 | 73 | 0 |
| 4 | 68101 | 1 | Sample Flow Rate- CV | 24 HOUR | | Observed Values | R & P Model 2025 PM2.5 Sequent - Calculation | 2011 | Percent | No Events | ... | 0 | 0 | 126 | 74 | 42 | 0 | 0 | 209 | 73 | 0 |
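
Because an inner join silently drops rows that have no partner, it can be worth checking how many rows on each side actually matched. The sketch below shows one way to do that with the `indicator` and `validate` options of `DataFrame.merge`. It runs on two tiny made-up frames (`conc_toy` and `aqi_toy` are hypothetical names and values, not the EPA files), and it assumes the AQI data has at most one row per Year and CBSA.

```python
import pandas as pd

# Made-up stand-ins for the concentration and AQI frames.
conc_toy = pd.DataFrame(
    {
        "Year": [2011, 2011, 2012],
        "CBSA Name": ["Alpha", "Alpha", "Gamma"],
        "Parameter Name": ["Ozone", "Toluene", "Ozone"],
    }
)
aqi_toy = pd.DataFrame(
    {
        "Year": [2011, 2012],
        "CBSA": ["Alpha", "Beta"],
        "Median AQI": [42, 40],
    }
)

# An outer join keeps unmatched rows from both sides, and indicator=True adds
# a "_merge" column saying where each row came from. validate="many_to_one"
# raises an error if the right-hand keys are not unique.
check = conc_toy.merge(
    aqi_toy,
    how="outer",
    left_on=["Year", "CBSA Name"],
    right_on=["Year", "CBSA"],
    indicator=True,
    validate="many_to_one",
)

# Rows labelled "both" are the ones an inner join would keep.
match_counts = check["_merge"].value_counts()
print(match_counts)
```

A large `left_only` count would suggest that the CBSA names are spelled differently in the two sources, or that many monitors sit outside any CBSA.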
