# Time series Analysis - Air Quality #


<img src="https://timedotcom.files.wordpress.com/2016/01/beijing-air-pollution.jpeg" width="300" height="200"/>

##### What's in the dataset? #####
Hourly measurements from an array of 5 metal oxide chemical sensors embedded in an Air Quality Chemical Multisensor Device installed at road level.

##### Location - Some city in Italy #####
##### Time Period -  March 2004 to February 2005 #####

More details about the dataset can be found here - [link](https://archive.ics.uci.edu/ml/datasets/Air+Quality)

***

**Objective:**

Analyze individual features to understand their distributions, examine the relationships between features, and assess whether we can make predictions from them.

Before analyzing any dataset, it is a good idea to first write down what your intuition says you will find. It may turn out to be incorrect; that doesn't matter. The aim of this step is to avoid exploring the whole dataset without any direction.

This process is also called *Hypothesis Generation*.

**Hypotheses**

1. Over time, the temperature should show an increasing trend once seasonal variations are excluded.

***

*Import Libraries*

In [49]:
import pandas as pd
import numpy as np

In [50]:
df = pd.read_excel('../data/AirQualityUCI/AirQualityUCI.xlsx')

In [51]:
print('Record count', df.shape[0])

Record count 9357


*Let's look at the type of each column*

In [52]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9357 entries, 0 to 9356
Data columns (total 15 columns):
Date             9357 non-null datetime64[ns]
Time             9357 non-null object
CO(GT)           9357 non-null float64
PT08.S1(CO)      9357 non-null float64
NMHC(GT)         9357 non-null int64
C6H6(GT)         9357 non-null float64
PT08.S2(NMHC)    9357 non-null float64
NOx(GT)          9357 non-null float64
PT08.S3(NOx)     9357 non-null float64
NO2(GT)          9357 non-null float64
PT08.S4(NO2)     9357 non-null float64
PT08.S5(O3)      9357 non-null float64
T                9357 non-null float64
RH               9357 non-null float64
AH               9357 non-null float64
dtypes: datetime64[ns](1), float64(12), int64(1), object(1)
memory usage: 1.1+ MB


Most of the columns are float. The exceptions are 'Date' (datetime64), 'NMHC(GT)' (int64) and 'Time' (object). We'll combine 'Time' with the 'Date' field to get a timestamp for every record.

Combining them lets us look at trends at different levels, such as hourly, daily and monthly.

In [53]:
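# Join date and time into one 'YYYY-MM-DD HH:MM:SS' string per row
df['DateTime'] = df['Date'].astype(str) + ' ' + df['Time'].astype(str)
# Parse the combined string into a proper datetime64 column
df['DateTime'] = pd.to_datetime(df['DateTime'])
df['DateTime'].head(2)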

0   2004-03-10 18:00:00
1   2004-03-10 19:00:00
Name: DateTime, dtype: datetime64[ns]
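
Once every record has a timestamp, viewing the data at a different level comes down to grouping or resampling on that column. Below is a minimal sketch of the idea. It uses a synthetic temperature series rather than the real data, so the `demo` frame and its values are assumptions made up for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the notebook's df: 60 days of hourly readings.
stamps = pd.date_range("2004-03-10 18:00", periods=24 * 60,
                       freq=pd.Timedelta(hours=1))
rng = np.random.default_rng(0)
demo = pd.DataFrame({"DateTime": stamps,
                     "T": 15 + rng.normal(0, 2, len(stamps))})

# A datetime index is what makes resample() available.
ts = demo.set_index("DateTime")["T"]

hourly_profile = ts.groupby(ts.index.hour).mean()  # mean per hour of day (0-23)
daily_mean = ts.resample("D").mean()               # one value per calendar day
monthly_mean = ts.resample("MS").mean()            # one value per month, labelled by month start

print(hourly_profile.head())
print(daily_mean.head())
print(monthly_mean)
```

The same pattern should carry over to the real `df` by indexing it on `DateTime`, although the -200 placeholders in the real data would have to be handled first so that they do not distort the means.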

*Displaying top 2 rows*

In [54]:
df.head(2)

| | Date | Time | CO(GT) | PT08.S1(CO) | NMHC(GT) | C6H6(GT) | PT08.S2(NMHC) | NOx(GT) | PT08.S3(NOx) | NO2(GT) | PT08.S4(NO2) | PT08.S5(O3) | T | RH | AH | DateTime |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2004-03-10 | 18:00:00 | 2.6 | 1360.0 | 150 | 11.881723 | 1045.5 | 166.0 | 1056.25 | 113.0 | 1692.0 | 1267.5 | 13.6 | 48.875001 | 0.757754 | 2004-03-10 18:00:00 |
| 1 | 2004-03-10 | 19:00:00 | 2.0 | 1292.25 | 112 | 9.397165 | 954.75 | 103.0 | 1173.75 | 92.0 | 1558.75 | 972.25 | 13.3 | 47.7 | 0.725487 | 2004-03-10 19:00:00 |


*Let's look at the summary of the values for every column*

In [55]:
df.describe()

| | CO(GT) | PT08.S1(CO) | NMHC(GT) | C6H6(GT) | PT08.S2(NMHC) | NOx(GT) | PT08.S3(NOx) | NO2(GT) | PT08.S4(NO2) | PT08.S5(O3) | T | RH | AH |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 9357.0 | 9357.0 | 9357.0 | 9357.0 | 9357.0 | 9357.0 | 9357.0 | 9357.0 | 9357.0 | 9357.0 | 9357.0 | 9357.0 | 9357.0 |
| mean | -34.207524 | 1048.869652 | -159.090093 | 1.865576 | 894.475963 | 168.6042 | 794.872333 | 58.135898 | 1391.363266 | 974.951534 | 9.7766 | 39.483611 | -6.837604 |
| std | 77.65717 | 329.817015 | 139.789093 | 41.380154 | 342.315902 | 257.424561 | 321.977031 | 126.931428 | 467.192382 | 456.922728 | 43.203438 | 51.215645 | 38.97667 |
| min | -200.0 | -200.0 | -200.0 | -200.0 | -200.0 | -200.0 | -200.0 | -200.0 | -200.0 | -200.0 | -200.0 | -200.0 | -200.0 |
| 25% | 0.6 | 921.0 | -200.0 | 4.004958 | 711.0 | 50.0 | 637.0 | 53.0 | 1184.75 | 699.75 | 10.95 | 34.05 | 0.692275 |
| 50% | 1.5 | 1052.5 | -200.0 | 7.886653 | 894.5 | 141.0 | 794.25 | 96.0 | 1445.5 | 942.0 | 17.2 | 48.55 | 0.976823 |
| 75% | 2.6 | 1221.25 | -200.0 | 13.636091 | 1104.75 | 284.2 | 960.25 | 133.0 | 1662.0 | 1255.25 | 24.075 | 61.875 | 1.296223 |
| max | 11.9 | 2039.75 | 1189.0 | 63.741476 | 2214.0 | 1479.0 | 2682.75 | 339.7 | 2775.0 | 2522.75 | 44.6 | 88.725 | 2.231036 |


We can see all of them have -200 as the minimum value. This is because the authors of the dataset used -200 to indicate missing values. 

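A naive approach would be to replace them with 0. But be careful: 0 is a valid reading for the T (temperature, in °C) column, so doing this would turn missing values into real-looking measurements and change the meaning of the column. We must understand the scale of each column before doing any imputation.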
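
One common alternative to zero-filling is to mark the placeholder as `NaN`, which pandas skips when computing statistics. The sketch below compares the two on a tiny made-up frame. The `demo` values are invented for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up readings; -200 stands for "missing", 0.0 is a genuine temperature.
demo = pd.DataFrame({
    "T":  [13.6, -200.0, 0.0, 11.9],
    "RH": [48.9, 47.7, -200.0, 60.0],
})

zero_filled = demo.replace(-200, 0)    # naive: missing becomes a fake 0 reading
cleaned = demo.replace(-200, np.nan)   # missing is marked explicitly as NaN

# NaN is ignored by mean(), so only the three real temperatures are averaged.
print("mean T, zero-filled:", zero_filled["T"].mean())
print("mean T, NaN-marked: ", cleaned["T"].mean())
print(cleaned.isna().sum())
```

With `NaN` in place, the genuine 0.0 reading stays distinguishable from the missing one, and any later imputation can be chosen per column.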

Just by looking at the summary, we can start finding insights:
1. NMHC(GT) has a negative mean, and its minimum and all three quartiles are -200; only the max is a real reading. This means at least 75% of the rows in this column are missing, so we barely have any values to work with. In a professional setting, we would ask the authors what went wrong while collecting the data for this column.
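
The share of placeholder values can also be measured directly instead of being inferred from the percentiles. Here is a minimal sketch on made-up data. The `demo` frame is hypothetical, and the same expression could be applied to the numeric columns of `df`.

```python
import pandas as pd

# Made-up sample in which one column is mostly the -200 placeholder.
demo = pd.DataFrame({
    "NMHC(GT)": [150, -200, -200, -200, -200],
    "CO(GT)":   [2.6, 2.0, -200.0, 1.5, 0.6],
})

# (demo == -200) is a boolean frame; the mean of booleans is the fraction of True.
missing_share = (demo == -200).mean().sort_values(ascending=False)
print(missing_share)
```

Sorting in descending order puts the worst-affected columns first, which helps when deciding which ones to drop and which ones to impute.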