# CMSC 320 Final Project &ndash; US Pollution

Contributors: Michael Wang, Trueman Phambdo, Jason Lee

***

<img src="../input/pollution.png" alt="pollution">
<i>image by <a href="https://dribbble.com/Furmanczuk">Magda</a></i>

## Introduction: <a class="anchor" id="intro"></a>

***

### &ndash;  Air pollution basics &ndash; 
Air pollution is a drop in the quality of the air you breathe, caused by harmful, unwanted substances in the air known as **pollutants**.

In this tutorial, we will mainly explore four of the five pollutant types listed below, since particulate matter ($PM$) is not in our dataset:

* Sulphur dioxide ($SO_2$) : This contaminant is mainly emitted during the combustion of fossil fuels such as crude oil and coal.
* Carbon monoxide ($CO$) : This gas forms during the incomplete combustion of fuels, for example when a car engine runs in a closed room.
* Nitrogen dioxide ($NO_2$) : This contaminant is emitted by traffic, combustion installations and industry.
* Ozone ($O_3$) : Ozone is created through the influence of ultraviolet (UV) sunlight on pollutants in the outside air.
* Particulate Matter ( $PM$ ) : Particulate matter is the sum of all solid and liquid particles suspended in air. This complex mixture includes both organic and inorganic particles, such as dust, pollen, soot, smoke, and liquid droplets. These particles vary greatly in size, composition, and origin.

Predictions might have been more accurate if the dataset also contained particulate matter ($PM$) levels, since particle pollution causes similarly serious health issues.

### &ndash;  Necessary libraries &ndash; 
* pandas
* numpy
* matplotlib

### &ndash;  About this dataset &ndash; 
This dataset deals with pollution in the U.S. and is scraped from the <a href="https://aqsdr1.epa.gov/aqsweb/aqstmp/airdata/download_files.html">U.S. EPA database</a>. It contains daily data on four major pollutants (Nitrogen Dioxide, Sulphur Dioxide, Carbon Monoxide and Ozone) from 2000 to 2016.

## Table of Contents:

***

* [Introduction](#intro)

* [1. Data preparation](#data)

* [2. Exploratory data analysis](#eda)

* [3. Analysis](#analysis)

* [4. Insights](#insights)

## 1. Data preparation<a class="anchor" id="data"></a>

***

In [1]:
## import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime

In [2]:
## read csv
df = pd.read_csv("../input/pollution_us_2000_2016.csv")
print('** successfully imported csv as dataframe **')

** successfully imported csv as dataframe **


### A peek at the dataset

The air quality level of each category ($NO_2$, etc.) is defined by the Air Quality Index (AQI). More detailed information on the index can be found on the <a href="https://cfpub.epa.gov/airnow/index.cfm?action=aqibasics.aqi">EPA Air Quality Basics page</a>.
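
To make the AQI scale easier to read, the sketch below maps AQI values to named categories with `pd.cut`. The breakpoints and labels are the standard EPA ranges as we understand them and should be checked against the linked page; the names `aqi_bins`, `aqi_labels` and `sample_aqi` are illustrative and not part of the dataset.

```python
import pandas as pd

# Assumed EPA AQI breakpoints (upper edge of each category) and their labels.
aqi_bins = [0, 50, 100, 150, 200, 300, 500]
aqi_labels = [
    "Good",
    "Moderate",
    "Unhealthy for Sensitive Groups",
    "Unhealthy",
    "Very Unhealthy",
    "Hazardous",
]

# Hypothetical AQI readings used only to demonstrate the mapping.
sample_aqi = pd.Series([37, 75, 160])

# include_lowest=True makes the first bin closed on the left, so an AQI of 0 counts as "Good".
categories = pd.cut(sample_aqi, bins=aqi_bins, labels=aqi_labels, include_lowest=True)
print(categories.tolist())
```

The same call could be applied to any of the AQI columns in `df` once the data has been cleaned.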

-- Remove unnecessary columns --

* Index column (Unnamed: 0) serves no purpose.
* State Code, County Code, Site Num and Address are removed since the State, County and City columns are sufficient.
* All the units columns (NO2 Units, etc.) are removed since they only have 1 unique value representing their unit.
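
One way to confirm that a column carries a single constant value is to count its distinct values with `nunique`. The frame below is a small stand-in for the raw CSV, and its unit strings are illustrative assumptions rather than values read from the file.

```python
import pandas as pd

# Toy frame standing in for the raw data; the unit strings are assumed for illustration.
sample = pd.DataFrame({
    "NO2 Units": ["Parts per billion"] * 3,
    "O3 Units": ["Parts per million"] * 3,
    "NO2 Mean": [19.04, 22.96, 38.13],
})

# Count distinct values per column; a count of 1 means the column is constant.
unique_counts = sample.nunique()
constant_columns = unique_counts[unique_counts == 1].index.tolist()
print(constant_columns)
```

Running the same check on the full dataframe before dropping columns would show whether each units column really holds only one value.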

In [3]:
df = df.drop(['Unnamed: 0','State Code','County Code','Site Num','Address','NO2 Units','O3 Units','SO2 Units','CO Units'],axis=1)
df.head()

Unnamed: 0,State,County,City,Date Local,NO2 Mean,NO2 1st Max Value,NO2 1st Max Hour,NO2 AQI,O3 Mean,O3 1st Max Value,O3 1st Max Hour,O3 AQI,SO2 Mean,SO2 1st Max Value,SO2 1st Max Hour,SO2 AQI,CO Mean,CO 1st Max Value,CO 1st Max Hour,CO AQI
0,Arizona,Maricopa,Phoenix,2000-01-01,19.041667,49.0,19,46,0.0225,0.04,10,34,3.0,9.0,21,13.0,1.145833,4.2,21,
1,Arizona,Maricopa,Phoenix,2000-01-01,19.041667,49.0,19,46,0.0225,0.04,10,34,3.0,9.0,21,13.0,0.878947,2.2,23,25.0
2,Arizona,Maricopa,Phoenix,2000-01-01,19.041667,49.0,19,46,0.0225,0.04,10,34,2.975,6.6,23,,1.145833,4.2,21,
3,Arizona,Maricopa,Phoenix,2000-01-01,19.041667,49.0,19,46,0.0225,0.04,10,34,2.975,6.6,23,,0.878947,2.2,23,25.0
4,Arizona,Maricopa,Phoenix,2000-01-02,22.958333,36.0,19,34,0.013375,0.032,10,27,1.958333,3.0,22,4.0,0.85,1.6,23,


As seen above, some entries are **duplicates** that share the same observation date (Date Local column). Since the dataset gives no explanation for these duplicates, we first drop entries with missing values and then replace the remaining duplicates with their mean values for each date and city.

Also, delete entries with 'Country Of Mexico' as their state, since we are only dealing with pollution in the U.S.

Lastly, convert Date Local from string to datetime.

In [9]:
# remove entries with NA
df = df.dropna(axis='rows')

# replace duplications with a single entry of mean values
df = df.groupby(['State','County','City','Date Local']).mean().reset_index()

# remove Mexico (copy so the column assignment below does not act on a slice)
df = df[df.State!='Country Of Mexico'].copy()

# Change date from string to date value
df['Date Local'] = pd.to_datetime(df['Date Local'], format='%Y-%m-%d')
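
As a rough illustration of the de-duplication step, the sketch below applies the same `groupby(...).mean()` pattern to a tiny hand-made frame. The rows and values are invented for the example and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical rows: two readings for the same city and date, plus one for the next day.
toy = pd.DataFrame({
    "State": ["Arizona"] * 3,
    "City": ["Phoenix"] * 3,
    "Date Local": ["2000-01-01", "2000-01-01", "2000-01-02"],
    "NO2 AQI": [46.0, 46.0, 34.0],
    "CO AQI": [25.0, 27.0, 18.0],
})

# Collapse duplicate (State, City, Date Local) rows into one row of mean values.
daily = toy.groupby(["State", "City", "Date Local"]).mean().reset_index()

# Convert the date strings to datetime values, as done for the real dataframe.
daily["Date Local"] = pd.to_datetime(daily["Date Local"], format="%Y-%m-%d")
print(daily)
```

The two readings for 2000-01-01 collapse into a single row whose CO AQI is the mean of 25 and 27.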

According to the <a href="https://www3.epa.gov/airnow/aqi_brochure_02_14.pdf">Air Quality Index guide</a> provided by the EPA (page 3), the highest of the per-pollutant AQI values is reported as the overall AQI value.

Create a new column to record the overall AQI.

In [12]:
df['AQI'] = df[['NO2 AQI', 'O3 AQI', 'SO2 AQI', 'CO AQI']].max(axis=1)
df.head()

Unnamed: 0,State,County,City,Date Local,NO2 Mean,NO2 1st Max Value,NO2 1st Max Hour,NO2 AQI,O3 Mean,O3 1st Max Value,...,O3 AQI,SO2 Mean,SO2 1st Max Value,SO2 1st Max Hour,SO2 AQI,CO Mean,CO 1st Max Value,CO 1st Max Hour,CO AQI,AQI
0,Alabama,Jefferson,Birmingham,2013-12-01,17.208333,39.3,18.0,37.0,0.013542,0.026,...,24.0,0.313636,1.0,11.0,1.0,0.266667,0.5,0.0,6.0,37.0
1,Alabama,Jefferson,Birmingham,2013-12-02,20.6875,32.4,7.0,30.0,0.009375,0.013,...,12.0,0.53,2.4,11.0,3.0,0.4,0.5,0.0,6.0,30.0
2,Alabama,Jefferson,Birmingham,2013-12-03,14.9125,22.4,17.0,21.0,0.008167,0.012,...,11.0,0.305263,2.3,11.0,3.0,0.258333,0.3,0.0,3.0,21.0
3,Alabama,Jefferson,Birmingham,2013-12-04,7.825,19.3,17.0,18.0,0.011125,0.014,...,13.0,0.131818,1.3,17.0,1.0,0.116667,0.2,20.0,2.0,18.0
4,Alabama,Jefferson,Birmingham,2013-12-05,8.004762,16.0,7.0,15.0,0.010083,0.014,...,13.0,0.0,1.1,0.0,1.0,0.108333,0.2,19.0,2.0,15.0
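
To make the row-wise maximum behind the overall AQI concrete, here is a minimal sketch on a two-row frame. The first row reuses the AQI values of the first Birmingham entry shown above; the second row and the `Dominant` column are hypothetical additions for illustration.

```python
import pandas as pd

# Row 0 mirrors the first row of the output above; row 1 is made up.
toy = pd.DataFrame({
    "NO2 AQI": [37.0, 12.0],
    "O3 AQI": [24.0, 45.0],
    "SO2 AQI": [1.0, 3.0],
    "CO AQI": [6.0, 2.0],
})
aqi_columns = ["NO2 AQI", "O3 AQI", "SO2 AQI", "CO AQI"]

# Overall AQI is the largest per-pollutant AQI in each row.
toy["AQI"] = toy[aqi_columns].max(axis=1)

# idxmax reports which column supplied that maximum (hypothetical extra column).
toy["Dominant"] = toy[aqi_columns].idxmax(axis=1)
print(toy[["AQI", "Dominant"]])
```

Knowing which pollutant drives the overall value can be useful later, when comparing pollutants across cities or years.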


## 2. Exploratory data analysis<a class="anchor" id="eda"></a>

***

## 3. Analysis<a class="anchor" id="analysis"></a>

***

## 4. Insights<a class="anchor" id="insights"></a>

***