# Context
This dataset covers air pollution in the U.S., which has been well documented by the U.S. EPA. It collects daily measurements of four major pollutants (Nitrogen Dioxide, Sulphur Dioxide, Carbon Monoxide and Ozone) from 2000 to 2016 and places them neatly in a single CSV file.

# Content
There is a total of 28 fields:

1. State Code : The code allocated by US EPA to each state
2. County Code : The code of counties in a specific state allocated by US EPA
3. Site Num : The site number in a specific county allocated by US EPA
4. Address: Address of the monitoring site
5. State : State of monitoring site
6. County : County of monitoring site
7. City : City of the monitoring site
8. Date Local : Date of monitoring

The four pollutants (NO2, O3, SO2 and CO) each have 5 specific columns. For instance, for NO2:

- NO2 Units : The units measured for NO2
- NO2 Mean : The arithmetic mean of concentration of NO2 within a given day
- NO2 AQI : The calculated air quality index of NO2 within a given day
- NO2 1st Max Value : The maximum value obtained for NO2 concentration in a given day
- NO2 1st Max Hour : The hour when the maximum NO2 concentration was recorded in a given day

The dataset contains over 1.4 million observations.
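
As a quick sanity check on the field list above, the 28 column names can be rebuilt from the 8 identifying fields plus 5 columns per pollutant. This is only a sketch: the helper names `expected_columns` and `missing_columns` are made up for illustration, and the suffix order is assumed to follow the CSV header shown in the preview further down.

```python
# Sketch: rebuild the 28 documented field names and compare them to a header.
ID_FIELDS = ["State Code", "County Code", "Site Num", "Address",
             "State", "County", "City", "Date Local"]
POLLUTANTS = ["NO2", "O3", "SO2", "CO"]
# Assumed order, taken from the CSV header preview.
SUFFIXES = ["Units", "Mean", "1st Max Value", "1st Max Hour", "AQI"]


def expected_columns():
    """Return the 8 identifying fields followed by 5 columns per pollutant."""
    cols = list(ID_FIELDS)
    for pollutant in POLLUTANTS:
        cols.extend(f"{pollutant} {suffix}" for suffix in SUFFIXES)
    return cols


def missing_columns(actual):
    """List documented fields that are absent from `actual` (e.g. df.columns)."""
    present = set(actual)
    return [c for c in expected_columns() if c not in present]


print(len(expected_columns()))
```

Calling `missing_columns(df.columns)` after loading the CSV would list any documented field that the file does not contain.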

# Acknowledgements
All the data was scraped from the U.S. EPA database: https://aqsdr1.epa.gov/aqsweb/aqstmp/airdata/download_files.html

The purpose is to practice ML algorithms (clustering in particular) on an unfamiliar dataset.

Dataset source: https://www.kaggle.com/sogun3/uspollution

In [1]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt
%matplotlib inline  

Having looked through the Kaggle kernels, I must acknowledge that I follow some similar data cleaning methods from the following source: https://www.kaggle.com/jaeyoonpark/animation-basemap-plotly-for-air-quality-index

## 1. Preview dataset

In [2]:
df = pd.read_csv('pollution_us_2000_2016.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,State Code,County Code,Site Num,Address,State,County,City,Date Local,NO2 Units,...,SO2 Units,SO2 Mean,SO2 1st Max Value,SO2 1st Max Hour,SO2 AQI,CO Units,CO Mean,CO 1st Max Value,CO 1st Max Hour,CO AQI
0,0,4,13,3002,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2000-01-01,Parts per billion,...,Parts per billion,3.0,9.0,21,13.0,Parts per million,1.145833,4.2,21,
1,1,4,13,3002,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2000-01-01,Parts per billion,...,Parts per billion,3.0,9.0,21,13.0,Parts per million,0.878947,2.2,23,25.0
2,2,4,13,3002,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2000-01-01,Parts per billion,...,Parts per billion,2.975,6.6,23,,Parts per million,1.145833,4.2,21,
3,3,4,13,3002,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2000-01-01,Parts per billion,...,Parts per billion,2.975,6.6,23,,Parts per million,0.878947,2.2,23,25.0
4,4,4,13,3002,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2000-01-02,Parts per billion,...,Parts per billion,1.958333,3.0,22,4.0,Parts per million,0.85,1.6,23,
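
The preview already shows gaps in the AQI columns. A minimal sketch of how to count them, using a few columns from the preview rows inlined as text so it runs without the CSV (the column subset is my own choice; swap `SAMPLE` for the real file path to run it on the full data):

```python
import io

import pandas as pd

# A few columns from the first preview rows, inlined so the sketch is self-contained.
SAMPLE = io.StringIO(
    "State,City,Date Local,SO2 Mean,SO2 AQI,CO Mean,CO AQI\n"
    "Arizona,Phoenix,2000-01-01,3.0,13.0,1.145833,\n"
    "Arizona,Phoenix,2000-01-01,3.0,13.0,0.878947,25.0\n"
    "Arizona,Phoenix,2000-01-01,2.975,,1.145833,\n"
    "Arizona,Phoenix,2000-01-01,2.975,,0.878947,25.0\n"
    "Arizona,Phoenix,2000-01-02,1.958333,4.0,0.85,\n"
)

# parse_dates converts 'Date Local' to datetime while reading.
sample = pd.read_csv(SAMPLE, parse_dates=["Date Local"])

# Count missing values per column and show only the columns that have gaps.
missing = sample.isna().sum()
print(missing[missing > 0])
```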


## 1.1 Data Clean
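Drop columns that are not useful for the analysis, such as state code, county code and site number, as they are identifiers rather than measurements. The unit columns are dropped as well, but we should state the units in the plots to remind ourselves (for example, CO is in parts per million, while NO2 and SO2 are in parts per billion).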

In [3]:
df = df.drop(['Unnamed: 0','State Code','County Code','Site Num','Address','NO2 Units','O3 Units','SO2 Units','CO Units'],axis=1)
df.head()

Unnamed: 0,State,County,City,Date Local,NO2 Mean,NO2 1st Max Value,NO2 1st Max Hour,NO2 AQI,O3 Mean,O3 1st Max Value,O3 1st Max Hour,O3 AQI,SO2 Mean,SO2 1st Max Value,SO2 1st Max Hour,SO2 AQI,CO Mean,CO 1st Max Value,CO 1st Max Hour,CO AQI
0,Arizona,Maricopa,Phoenix,2000-01-01,19.041667,49.0,19,46,0.0225,0.04,10,34,3.0,9.0,21,13.0,1.145833,4.2,21,
1,Arizona,Maricopa,Phoenix,2000-01-01,19.041667,49.0,19,46,0.0225,0.04,10,34,3.0,9.0,21,13.0,0.878947,2.2,23,25.0
2,Arizona,Maricopa,Phoenix,2000-01-01,19.041667,49.0,19,46,0.0225,0.04,10,34,2.975,6.6,23,,1.145833,4.2,21,
3,Arizona,Maricopa,Phoenix,2000-01-01,19.041667,49.0,19,46,0.0225,0.04,10,34,2.975,6.6,23,,0.878947,2.2,23,25.0
4,Arizona,Maricopa,Phoenix,2000-01-02,22.958333,36.0,19,34,0.013375,0.032,10,27,1.958333,3.0,22,4.0,0.85,1.6,23,


Some dates have several entries for the same location. As there is no specific explanation for these duplicates, and the questions about them on the forum went unanswered, I'll take the mean values for each date and location (state and city in the code below).

In [4]:
## Prepare all 4 AQIs against state and date 
## Too laggy so only showed .head()
dfNO2 = df[['State','City','Date Local','NO2 Mean','NO2 AQI']]
dfO3 = df[['State', 'City', 'Date Local','O3 Mean', 'O3 AQI']]
dfCO = df[['State','City','Date Local','CO Mean','CO AQI']]
dfSO2 = df[['State', 'City', 'Date Local','SO2 Mean', 'SO2 AQI']]

In [5]:
dfNO2.head()

Unnamed: 0,State,City,Date Local,NO2 Mean,NO2 AQI
0,Arizona,Phoenix,2000-01-01,19.041667,46
1,Arizona,Phoenix,2000-01-01,19.041667,46
2,Arizona,Phoenix,2000-01-01,19.041667,46
3,Arizona,Phoenix,2000-01-01,19.041667,46
4,Arizona,Phoenix,2000-01-02,22.958333,34


In [6]:
## Prepare all 4 AQIs
dfNO2 = dfNO2.dropna(axis='rows')
dfO3 = dfO3.dropna(axis='rows')
dfCO = dfCO.dropna(axis='rows')
dfSO2 = dfSO2.dropna(axis='rows')

In [7]:
# Delete Mexico; .copy() makes each result an independent frame so the later column assignment does not warn
dfNO2 = dfNO2[dfNO2.State!='Country Of Mexico'].copy()
dfO3 = dfO3[dfO3.State!='Country Of Mexico'].copy()
dfCO = dfCO[dfCO.State!='Country Of Mexico'].copy()
dfSO2 = dfSO2[dfSO2.State!='Country Of Mexico'].copy()

In [8]:
dfNO2['Date Local'] = pd.to_datetime(dfNO2['Date Local'],format='%Y-%m-%d')  # Change date from string to date value
dfO3['Date Local'] = pd.to_datetime(dfO3['Date Local'],format='%Y-%m-%d')  # Change date from string to date value
dfCO['Date Local'] = pd.to_datetime(dfCO['Date Local'],format='%Y-%m-%d')  # Change date from string to date value
dfSO2['Date Local'] = pd.to_datetime(dfSO2['Date Local'],format='%Y-%m-%d')  # Change date from string to date value
# Remove duplicates
'''dfNO2Grp = dfNO2.groupby(['State','City','Date Local']).mean()
dfOGrp = dfO3.groupby(['State','City','Date Local']).mean()
dfCOGrp = dfCO.groupby(['State','City','Date Local']).mean()
dfSO2Grp = dfSO2.groupby(['State','City','Date Local']).mean()''';
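
The disabled `groupby` above is the step that would average the repeated entries. A minimal sketch of what it does, on the SO2 values from the first preview rows (the frame name `rows` is just for illustration):

```python
import pandas as pd

# SO2 columns from the first five preview rows: four entries for 2000-01-01, one for 2000-01-02.
rows = pd.DataFrame({
    "State": ["Arizona"] * 5,
    "City": ["Phoenix"] * 5,
    "Date Local": pd.to_datetime(["2000-01-01"] * 4 + ["2000-01-02"]),
    "SO2 Mean": [3.0, 3.0, 2.975, 2.975, 1.958333],
    "SO2 AQI": [13.0, 13.0, None, None, 4.0],
})

# One row per (state, city, date); mean() skips missing values by default.
daily = (rows.groupby(["State", "City", "Date Local"], as_index=False)
             .mean(numeric_only=True))
print(daily)
```

Because `mean()` skips missing values, this could also be applied before `dropna`, so that a date keeps its AQI as long as at least one of its entries has one.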

In [None]:
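# NOTE: drop_duplicates() returns a new DataFrame and the result is not assigned here,
# so dfNO2 still contains the repeated rows visible in the output below.
dfNO2.drop_duplicates()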
dfNO2Val = dfNO2[['NO2 Mean', 'NO2 AQI']].values
dfNO2Val

array([[ 19.041667,  46.      ],
       [ 19.041667,  46.      ],
       [ 19.041667,  46.      ],
       ..., 
       [  0.93913 ,   1.      ],
       [  0.93913 ,   1.      ],
       [  0.93913 ,   1.      ]])

Goal: Time series clustering on United States Map

Animation from date x to date y of clustering on United States map 
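
One possible way to make the time series goal tractable (an assumption on my part, not something this notebook has run) is to average each state's values per month and pivot them into a state-by-month matrix. Each row is then one state's time series, so there are only a few dozen rows to cluster. The function name `state_month_matrix` and the toy values below are made up for illustration.

```python
import pandas as pd


def state_month_matrix(frame, value_col):
    """Pivot daily rows into one row per state and one column per month."""
    return (frame
            .assign(Month=frame["Date Local"].dt.to_period("M"))
            .groupby(["State", "Month"])[value_col]
            .mean()
            .unstack("Month"))


# Toy data: made-up values, only to show the shape of the result.
toy = pd.DataFrame({
    "State": ["Arizona", "Arizona", "Arizona", "Texas", "Texas", "Texas"],
    "Date Local": pd.to_datetime(["2000-01-01", "2000-01-02", "2000-02-01",
                                  "2000-01-01", "2000-02-01", "2000-02-02"]),
    "NO2 AQI": [46, 34, 30, 20, 10, 30],
})

matrix = state_month_matrix(toy, "NO2 AQI")
print(matrix)
```

The rows of `matrix` could then be passed to a clustering routine in place of the raw daily observations, after deciding how to handle months with no data for a state.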

In [None]:
import scipy.cluster.hierarchy as sch
dendrogram = sch.dendrogram(sch.linkage(dfNO2Val, method = 'ward'))
plt.title('Dendrogram')
plt.xlabel('Observations')
plt.ylabel('Euclidean Distance')
plt.show()
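
With this many rows a full dendrogram is neither fast to compute nor readable. A hedged sketch of a lighter version: build the linkage on a smaller array and truncate the tree to its last merges. The data here is synthetic and only stands in for `dfNO2Val`; the size 500 and `p=12` are arbitrary choices.

```python
import numpy as np
import scipy.cluster.hierarchy as sch

rng = np.random.default_rng(0)
# Synthetic stand-in for dfNO2Val (two columns); not real pollution data.
values = rng.uniform(0, 100, size=(500, 2))

# Ward linkage on the small array: one row of Z per merge.
Z = sch.linkage(values, method="ward")

# truncate_mode='lastp' collapses the tree to the last p merged clusters.
# no_plot=True only returns the layout; remove it to draw on the current axes.
tree = sch.dendrogram(Z, truncate_mode="lastp", p=12, no_plot=True)
print(len(tree["leaves"]))
```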

In [None]:
from sklearn.cluster import AgglomerativeClustering
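# Ward linkage always uses Euclidean distance, so no metric is passed
# (the old `affinity` argument was renamed to `metric` in newer scikit-learn releases)
hc = AgglomerativeClustering(n_clusters = 5, linkage = 'ward')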
dfNO2_hc = hc.fit_predict(dfNO2Val)
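
Hierarchical clustering with Ward linkage works from pairwise distances, so time and memory grow roughly with the square of the number of rows. A minimal sketch of clustering a random subset instead, assuming a subset is representative enough for exploration. The array is synthetic and stands in for `dfNO2Val`, and `sample_size = 1000` is an arbitrary choice.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
# Synthetic stand-in for dfNO2Val (two columns); not real pollution data.
values = rng.uniform(0, 100, size=(20000, 2))

# Draw a random subset of rows without replacement.
sample_size = 1000
idx = rng.choice(len(values), size=sample_size, replace=False)
sample = values[idx]

# Same model as above, fitted on the subset only.
hc_sample = AgglomerativeClustering(n_clusters=5, linkage="ward")
labels = hc_sample.fit_predict(sample)

# Number of sampled rows assigned to each of the 5 clusters.
print(np.bincount(labels))
```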

In [None]:
# Visualising the clusters
# dfNO2Val[dfNO2_hc == k, 0] picks 'NO2 Mean' for the rows assigned to cluster k,
# dfNO2Val[dfNO2_hc == k, 1] picks 'NO2 AQI' for the same rows.
colors = ['red', 'blue', 'green', 'cyan', 'magenta']
for k, color in enumerate(colors):
    plt.scatter(dfNO2Val[dfNO2_hc == k, 0], dfNO2Val[dfNO2_hc == k, 1], s = 100, c = color, label = 'Cluster %d' % k)
plt.title('Clusters of NO2 observations')
plt.xlabel('NO2 Mean (parts per billion)')
plt.ylabel('NO2 AQI')
plt.legend()
plt.show()

There is too much data, so clustering takes too long. I was able to cluster the data once, but the plot came out as a straight line. I also don't really understand how the df[df_hc == 'some value'] indexing selects the (x, y) points for each cluster.
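
For the indexing question, here is a small sketch of what `array[labels == k, column]` does on a NumPy array. The four rows and their labels are illustrative stand-ins for `dfNO2Val` and `dfNO2_hc`, not real clustering output.

```python
import numpy as np

# Tiny stand-in for dfNO2Val (columns: NO2 Mean, NO2 AQI) and its cluster labels.
values = np.array([[19.0, 46.0],
                   [22.9, 34.0],
                   [0.9, 1.0],
                   [1.2, 2.0]])
labels = np.array([0, 0, 1, 1])

# Comparing the label array to a cluster id gives one True/False per row.
mask = labels == 1

# The mask selects rows; the second index selects a column.
x = values[mask, 0]   # NO2 Mean of the rows in cluster 1
y = values[mask, 1]   # NO2 AQI of the rows in cluster 1
print(x, y)
```

This form of indexing works on the NumPy array returned by `.values`. It does not work on a pandas DataFrame such as the grouped frames, which would need `.values` or `.iloc` first.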