# Air Quality in the US from 2012-2022
This is a Jupyter notebook analyzing the air quality in the United States over a decade, from 2012 to 2022. All data is taken from the [Environmental Protection Agency Website](https://www.epa.gov/) and consolidated into one dataset.

The purpose of this notebook is to analyze air quality trends across the US and to explore the relationships between air quality and other factors, such as year, geographic location, and level of particulate matter.

A visualization of this data is currently in development on Tableau.

### Feature Dictionary

**State**: US state in which the data was collected

**County**: County within the state

**Days with AQI**: Number of days in the year having an Air Quality Index value. This is the number of days on which measurements from any monitoring site in the county or MSA were reported to the AQS database.*

**Good Days**: Number of days in the year having an AQI value 0 through 50.*

**Moderate Days**: Number of days in the year having an AQI value 51 through 100.*

**Unhealthy for Sensitive Groups Days**: Number of days in the year having an AQI value 101 through 150.*

**Unhealthy Days**: Number of days in the year having an AQI value 151 through 200.*

**Very Unhealthy Days**: Number of days in the year having an AQI value 201 through 300.*

**Hazardous Days**: Number of days in the year having an AQI value 301 or higher. Note: The official AQI hazardous category range is 301-500. Values above 500 are considered “Beyond the AQI” and are included in the Hazardous Days count.*

**Max AQI**: The highest daily AQI value in the year.*

**90th Percentile AQI**: 90 percent of daily AQI values during the year were less than or equal to the 90th percentile value.*

**Median AQI**: Half of daily AQI values during the year were less than or equal to the median value, and half equaled or exceeded it.*

**Days CO**: Number of days Carbon Monoxide was the main pollutant

**Days NO2**: Number of days Nitrogen Dioxide was the main pollutant

**Days Ozone**: Number of days Ozone was the main pollutant

**Days PM2.5**: Number of days PM2.5 was the main pollutant

**Days PM10**: Number of days PM10 was the main pollutant

\* Adapted from the [EPA website](https://www.epa.gov/outdoor-air-quality-data/about-air-data-reports)
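The category ranges in the feature dictionary can be written as a small lookup. The following is a minimal sketch, not part of the original analysis; `AQI_CATEGORIES` and `aqi_category` are hypothetical names introduced here for illustration.

```python
# Upper bounds of each AQI category, taken from the feature dictionary above.
AQI_CATEGORIES = [
    (50, "Good"),
    (100, "Moderate"),
    (150, "Unhealthy for Sensitive Groups"),
    (200, "Unhealthy"),
    (300, "Very Unhealthy"),
]


def aqi_category(value):
    """Return the category name for a daily AQI value (hypothetical helper)."""
    if value < 0:
        raise ValueError("AQI values cannot be negative")
    for upper_bound, name in AQI_CATEGORIES:
        if value <= upper_bound:
            return name
    # 301 and above, including "Beyond the AQI" values over 500
    return "Hazardous"


# Max AQI values from the first three rows of the dataset
for max_aqi in (112, 72, 136):
    print(max_aqi, aqi_category(max_aqi))
```

A helper like this could be used to label the `Max AQI` or `Median AQI` columns with their category.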

## An Initial Look at the Data

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

In [2]:
# Import the data from the consolidated set
aqi = pd.read_csv('data/annual_aqi_by_county_2012-2022.csv')

aqi.head()

Unnamed: 0,State,County,Year,Days with AQI,Good Days,Moderate Days,Unhealthy for Sensitive Groups Days,Unhealthy Days,Very Unhealthy Days,Hazardous Days,Max AQI,90th Percentile AQI,Median AQI,Days CO,Days NO2,Days Ozone,Days PM2.5,Days PM10
0,Alabama,Baldwin,2012,284,226,56,2,0,0,0,112,61,38,0,0,210,74,0
1,Alabama,Clay,2012,121,99,22,0,0,0,0,72,56,35,0,0,0,121,0
2,Alabama,Colbert,2012,283,222,55,6,0,0,0,136,62,40,0,0,209,74,0
3,Alabama,DeKalb,2012,361,282,74,5,0,0,0,115,64,40,0,0,320,41,0
4,Alabama,Elmore,2012,245,212,33,0,0,0,0,100,54,40,0,0,245,0,0


Check the data for missing values and inspect the available features.

In [3]:
aqi.isnull().sum()

State                                  0
County                                 0
Year                                   0
Days with AQI                          0
Good Days                              0
Moderate Days                          0
Unhealthy for Sensitive Groups Days    0
Unhealthy Days                         0
Very Unhealthy Days                    0
Hazardous Days                         0
Max AQI                                0
90th Percentile AQI                    0
Median AQI                             0
Days CO                                0
Days NO2                               0
Days Ozone                             0
Days PM2.5                             0
Days PM10                              0
dtype: int64
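A null check only confirms that no values are missing. A further consistency check is sketched below, under the assumption (which holds for the rows shown in `aqi.head()`, but is not stated by the source) that both the category day counts and the main-pollutant day counts should add up to `Days with AQI`. The `sample` frame copies the first three rows of the dataset; in the notebook the same expressions could be applied to `aqi` directly.

```python
import pandas as pd

# First three rows of the dataset, copied from the aqi.head() output
sample = pd.DataFrame({
    "County": ["Baldwin", "Clay", "Colbert"],
    "Days with AQI": [284, 121, 283],
    "Good Days": [226, 99, 222],
    "Moderate Days": [56, 22, 55],
    "Unhealthy for Sensitive Groups Days": [2, 0, 6],
    "Unhealthy Days": [0, 0, 0],
    "Very Unhealthy Days": [0, 0, 0],
    "Hazardous Days": [0, 0, 0],
    "Days CO": [0, 0, 0],
    "Days NO2": [0, 0, 0],
    "Days Ozone": [210, 0, 209],
    "Days PM2.5": [74, 121, 74],
    "Days PM10": [0, 0, 0],
})

category_cols = [
    "Good Days", "Moderate Days", "Unhealthy for Sensitive Groups Days",
    "Unhealthy Days", "Very Unhealthy Days", "Hazardous Days",
]
pollutant_cols = ["Days CO", "Days NO2", "Days Ozone", "Days PM2.5", "Days PM10"]

# Assumption: each group of columns partitions the days that have an AQI value
category_ok = sample[category_cols].sum(axis=1) == sample["Days with AQI"]
pollutant_ok = sample[pollutant_cols].sum(axis=1) == sample["Days with AQI"]

# Rows that fail either check would deserve a closer look
inconsistent = sample[~(category_ok & pollutant_ok)]
print(f"{len(inconsistent)} inconsistent rows")
```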

In [4]:
columns = aqi.columns
columns

Index(['State', 'County', 'Year', 'Days with AQI', 'Good Days',
       'Moderate Days', 'Unhealthy for Sensitive Groups Days',
       'Unhealthy Days', 'Very Unhealthy Days', 'Hazardous Days', 'Max AQI',
       '90th Percentile AQI', 'Median AQI', 'Days CO', 'Days NO2',
       'Days Ozone', 'Days PM2.5', 'Days PM10'],
      dtype='object')

In [5]:
states = aqi['State'].unique()
states

array(['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California',
       'Colorado', 'Connecticut', 'Country Of Mexico', 'Delaware',
       'District Of Columbia', 'Florida', 'Georgia', 'Hawaii', 'Idaho',
       'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Kentucky', 'Louisiana',
       'Maine', 'Maryland', 'Massachusetts', 'Michigan', 'Minnesota',
       'Mississippi', 'Missouri', 'Montana', 'Nebraska', 'Nevada',
       'New Hampshire', 'New Jersey', 'New Mexico', 'New York',
       'North Carolina', 'North Dakota', 'Ohio', 'Oklahoma', 'Oregon',
       'Pennsylvania', 'Puerto Rico', 'Rhode Island', 'South Carolina',
       'South Dakota', 'Tennessee', 'Texas', 'Utah', 'Vermont',
       'Virgin Islands', 'Virginia', 'Washington', 'West Virginia',
       'Wisconsin', 'Wyoming'], dtype=object)
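The `State` column holds 54 distinct values, four of which are not US states: `Country Of Mexico`, `District Of Columbia`, `Puerto Rico`, and `Virgin Islands`. If a later step needs only the 50 states, one possible filter is sketched below. The `NON_STATES` list is a name introduced here, and whether to exclude the District of Columbia is a judgment call rather than something the source decides.

```python
import pandas as pd

# Entries from the unique() output that are not US states (assumed exclusion list)
NON_STATES = [
    "Country Of Mexico",
    "District Of Columbia",
    "Puerto Rico",
    "Virgin Islands",
]

# A few values from the State column, enough to show the filter
state_names = pd.Series(
    ["Alabama", "Country Of Mexico", "District Of Columbia",
     "Puerto Rico", "Virgin Islands", "Wyoming"],
    name="State",
)

is_state = ~state_names.isin(NON_STATES)
print(state_names[is_state].tolist())

# In the notebook, the same mask could be applied to the full frame:
# aqi_states_only = aqi[~aqi['State'].isin(NON_STATES)]
```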

Let's create a new DataFrame to store the averages of every numeric column for each state.

In [6]:
averages = pd.DataFrame()

In [7]:
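for state in states:
    # numeric_only=True skips the text columns (State, County);
    # without it, pandas 2.x raises a TypeError when averaging them
    averages[state] = aqi[aqi['State'] == state].mean(numeric_only=True).drop('Year')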

In [8]:
averages = averages.T
averages

Unnamed: 0,Days with AQI,Good Days,Moderate Days,Unhealthy for Sensitive Groups Days,Unhealthy Days,Very Unhealthy Days,Hazardous Days,Max AQI,90th Percentile AQI,Median AQI,Days CO,Days NO2,Days Ozone,Days PM2.5,Days PM10
Alabama,271.086486,217.545946,52.491892,0.951351,0.091892,0.005405,0.0,97.210811,54.935135,38.178378,0.005405,0.243243,154.637838,114.048649,2.151351
Alaska,273.164706,234.494118,31.670588,3.764706,3.082353,0.105882,0.047059,103.741176,49.176471,21.0,1.141176,0.905882,72.658824,162.847059,35.611765
Arizona,350.048951,231.097902,104.517483,10.909091,1.853147,1.300699,0.370629,186.482517,71.566434,44.174825,0.020979,1.426573,205.531469,33.776224,109.293706
Arkansas,237.574803,196.259843,40.519685,0.748031,0.047244,0.0,0.0,90.181102,55.23622,36.244094,0.0,2.275591,138.669291,96.559055,0.070866
California,341.032479,203.454701,109.435897,20.683761,6.374359,0.642735,0.441026,259.536752,84.454701,47.694017,0.128205,4.116239,206.17094,116.136752,14.480342
Colorado,289.449848,213.276596,69.905775,5.762918,0.462006,0.036474,0.006079,118.693009,60.644377,37.00304,0.069909,7.364742,205.541033,25.495441,50.978723
Connecticut,306.113636,241.284091,56.636364,6.954545,1.238636,0.0,0.0,142.795455,64.25,40.204545,0.215909,7.181818,210.306818,88.102273,0.306818
Country Of Mexico,239.25,140.85,73.65,16.8,6.9,0.55,0.5,208.4,99.0,52.95,12.5,5.55,119.55,69.75,31.9
Delaware,336.030303,254.878788,76.727273,4.030303,0.393939,0.0,0.0,121.727273,63.030303,41.363636,0.030303,2.242424,232.0,101.757576,0.0
District Of Columbia,351.454545,214.363636,130.454545,5.909091,0.727273,0.0,0.0,142.272727,69.454545,46.454545,0.090909,27.636364,166.727273,157.0,0.0
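The per-state averages can also be computed in a single step with `groupby`, which avoids the explicit loop and the transpose. The sketch below uses two Alabama rows from `aqi.head()` and one clearly made-up row (`Placeholder State`) so that there is more than one group; it is an alternative formulation, not the notebook's own code.

```python
import pandas as pd

sample = pd.DataFrame({
    # The first two rows come from aqi.head(); the third is a placeholder
    "State": ["Alabama", "Alabama", "Placeholder State"],
    "County": ["Baldwin", "Clay", "Placeholder County"],
    "Year": [2012, 2012, 2012],
    "Max AQI": [112, 72, 100],
    "Median AQI": [38, 35, 40],
})

# Group by state, average the numeric columns, and drop the averaged Year
state_averages = (
    sample.groupby("State")
          .mean(numeric_only=True)
          .drop(columns="Year")
)
print(state_averages)
```

The result is indexed by state with one column per feature, the same layout as the transposed `averages` frame.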


Save the state averages to a new CSV file.

In [9]:
averages.to_csv('data/annual_aqi_by_county_averages.csv')
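By default `to_csv` writes the index (here, the state names) as an unnamed first column, so reading the file back needs `index_col=0` to restore the states as row labels. A minimal round-trip sketch follows, using an in-memory buffer instead of a file and placeholder values instead of the real averages.

```python
import io

import pandas as pd

# Small stand-in for the averages frame; the values are placeholders
averages_demo = pd.DataFrame(
    {"Max AQI": [92.0, 100.0], "Median AQI": [36.5, 40.0]},
    index=["Alabama", "Placeholder State"],
)

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
averages_demo.to_csv(buffer)
buffer.seek(0)

# index_col=0 turns the unnamed first column back into the row index
restored = pd.read_csv(buffer, index_col=0)
print(restored)
```

Without `index_col=0`, the state names would come back as an ordinary column labeled `Unnamed: 0`.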