# Project Air Quality

### Domain Name: Environment Air quality

### Abstract: 

Contains the responses of a gas multisensor device deployed in the field in an Italian city. Hourly response averages are recorded along with gas concentration references from a certified analyzer.

### Dataset: Air quality of an Italian city 
(https://archive.ics.uci.edu/ml/datasets/Air+quality)

The dataset contains 9358 instances of hourly averaged responses from an array of 5 metal oxide chemical sensors embedded in an Air Quality Chemical Multisensor Device. The device was located in the field in a significantly polluted area, at road level, within an Italian city. Data were recorded from March 2004 to February 2005 (one year), representing the longest freely available recordings of field-deployed air quality chemical sensor device responses. Ground truth hourly averaged concentrations for CO, Non-Methane Hydrocarbons, Benzene, Total Nitrogen Oxides (NOx) and Nitrogen Dioxide (NO2) were provided by a co-located certified reference analyzer.

Evidence of cross-sensitivities, as well as of both concept and sensor drift, is present, as described in De Vito et al., Sens. and Act. B, Vol. 129, 2, 2008 (citation required), and this eventually affects the sensors' concentration estimation capabilities. Missing values are tagged with the value -200.
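
Because -200 is a sentinel rather than a real reading, it has to be converted to a missing value before any statistic is computed. The following is a minimal, dependency-free sketch of that idea; the readings and the helper names `clean` and `mean_ignoring_missing` are hypothetical and only illustrate what `na_values=-200` achieves later in the notebook.

```python
# Minimal sketch: treat the dataset's -200 sentinel as "missing".
MISSING = -200  # sentinel documented in the dataset description


def clean(values):
    """Replace the -200 sentinel with None so it cannot skew statistics."""
    return [None if v == MISSING else v for v in values]


def mean_ignoring_missing(values):
    """Average only the readings that are actually present."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present) if present else None


# Hypothetical CO readings in mg/m^3; the -200 entry stands for a missing reading.
co_raw = [2.0, -200, 4.0]
co_clean = clean(co_raw)

naive_mean = sum(co_raw) / len(co_raw)         # pulled far below zero by the sentinel
robust_mean = mean_ignoring_missing(co_clean)  # uses only the two real readings
```

Averaging the raw list gives a large negative concentration, which is physically meaningless; averaging only the readings that are present gives a usable value.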

**Attributes of the dataset are:**

| Sl No | Attribute | Description |
|-|-|-|
| 0 | Date | Date (DD/MM/YYYY) |
| 1 | Time | Time (HH.MM.SS) |
| 2 | CO(GT) | True hourly averaged CO concentration in mg/m^3 (reference analyzer) |
| 3 | PT08.S1(CO) | PT08.S1 (tin oxide) hourly averaged sensor response (nominally CO targeted) |
| 4 | NMHC(GT) | True hourly averaged overall Non-Methane Hydrocarbons concentration in microg/m^3 (reference analyzer) |
| 5 | C6H6(GT) | True hourly averaged Benzene concentration in microg/m^3 (reference analyzer) |
| 6 | PT08.S2(NMHC) | PT08.S2 (titania) hourly averaged sensor response (nominally NMHC targeted) |
| 7 | NOx(GT) | True hourly averaged NOx concentration in ppb (reference analyzer) |
| 8 | PT08.S3(NOx) | PT08.S3 (tungsten oxide) hourly averaged sensor response (nominally NOx targeted) |
| 9 | NO2(GT) | True hourly averaged NO2 concentration in microg/m^3 (reference analyzer) |
| 10 | PT08.S4(NO2) | PT08.S4 (tungsten oxide) hourly averaged sensor response (nominally NO2 targeted) |
| 11 | PT08.S5(O3) | PT08.S5 (indium oxide) hourly averaged sensor response (nominally O3 targeted) |
| 12 | T | Temperature in °C |
| 13 | RH | Relative Humidity (%) |
| 14 | AH | Absolute Humidity |


### Problem:

Humans are very sensitive to humidity, as the skin relies on the air to get rid of moisture. Sweating is the body's attempt to keep cool and maintain its current temperature. If the air is at 100 percent relative humidity, sweat will not evaporate into the air. As a result, we feel much hotter than the actual temperature when the relative humidity is high. If the relative humidity is low, we can feel much cooler than the actual temperature because our sweat evaporates easily, cooling us off. For example, if the air temperature is 75 degrees Fahrenheit (24 degrees Celsius) and the relative humidity is zero percent, the air temperature feels like 69 degrees Fahrenheit (21 C) to our bodies. If the air temperature is 75 degrees Fahrenheit (24 C) and the relative humidity is 100 percent, it feels like 80 degrees Fahrenheit (27 C).

### Objective:

We will therefore **predict the Relative Humidity** at a given point in time from all the other attributes that affect RH.


### <u>Content:</u>

[1) Load data](#load_data)

[2) Basic statistics](#stat)

[3) Data Cleaning](#hr)
    
[4) Correlation between variables](#corr)

[5) Influence of features on output-RH](#lin)

[6) Baseline Linear Regression](#LR)

[6a) Conclusion of Baseline Linear Regression](#LRcon)

[7) Feature Engineering and testing model](#FE)

[7a) Conclusion of Feature Engineering and testing](#FEcon)

[8) Decision Tree Regression ](#DT)

[9) Random Forest Regression](#RF)

[10) Support Vector Machine](#SVM)

[11) Conclusion](#conclusion)


In [19]:
#Import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from matplotlib.pylab import rcParams
import seaborn as sns
rcParams['figure.figsize']=10,8

In [20]:
#Local path
local_path='../input/'

#### 1) Load data<a name="load_data"></a>

In [21]:
#define header
col=['DATE','TIME','CO_GT','PT08_S1_CO','NMHC_GT','C6H6_GT','PT08_S2_NMHC',
     'NOX_GT','PT08_S3_NOX','NO2_GT','PT08_S4_NO2','PT08_S5_O3','T','RH','AH']

#column positions to read from the csv (one per header name)
use=list(range(len(col)))

#read the data from csv
df_air=pd.read_csv(local_path+'AirQualityUCI.csv',header=None,skiprows=1,names=col,na_filter=True,
                   na_values=-200,usecols=use)
df_air.head()

In [22]:
#See the end records of dataframe
df_air.tail()

In [23]:
df_air.dtypes

In [24]:
#drop end rows with NaN values
df_air.dropna(how='all',inplace=True)
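#drop rows with fewer than 10 non-NaN values (this removes the rows where the sensor readings, including RH, are missing)
df_air.dropna(thresh=10,axis=0,inplace=True)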

In [25]:
df_air.shape

#### 2) Basic statistics<a name="stat"></a>

In [26]:
df_air.describe()

#### 3) Data Cleaning<a name="hr"></a>

In [27]:
#Split hour from time into new column
df_air['HOUR']=df_air['TIME'].apply(lambda x: int(x.split(':')[0]))
df_air.HOUR.head()

##### How many missing values now?

In [28]:
print('Count of missing values:\n',df_air.shape[0]-df_air.count())

##### Fill missing value strategy

- CO_GT, NOX_GT and NO2_GT will be filled with the monthly average for that particular hour

- NMHC_GT will be dropped, as 90% of its values are missing

In [29]:
df_air['DATE']=pd.to_datetime(df_air.DATE, format='%m/%d/%Y')   #Format date column

In [30]:
# set the index as date
df_air.set_index('DATE',inplace=True)

In [31]:
df_air['MONTH']=df_air.index.month     #Create month column (Run once)
df_air.reset_index(inplace=True)
#df_air.head()

##### Drop column NMHC_GT; it has 90% missing data

In [32]:
df_air.drop('NMHC_GT',axis=1,inplace=True)    #drop col

##### Fill NaN values with monthly average of particular hour

In [33]:
df_air['CO_GT']=df_air['CO_GT'].fillna(df_air.groupby(['MONTH','HOUR'])['CO_GT'].transform('mean'))
df_air['NOX_GT']=df_air['NOX_GT'].fillna(df_air.groupby(['MONTH','HOUR'])['NOX_GT'].transform('mean'))
df_air['NO2_GT']=df_air['NO2_GT'].fillna(df_air.groupby(['MONTH','HOUR'])['NO2_GT'].transform('mean'))

In [34]:
print('Remaining missing values:',df_air.shape[0]-df_air.count() )

##### Fill remaining NaN values with the hourly average

In [35]:
df_air['CO_GT']=df_air['CO_GT'].fillna(df_air.groupby(['HOUR'])['CO_GT'].transform('mean'))
df_air['NOX_GT']=df_air['NOX_GT'].fillna(df_air.groupby(['HOUR'])['NOX_GT'].transform('mean'))
df_air['NO2_GT']=df_air['NO2_GT'].fillna(df_air.groupby(['HOUR'])['NO2_GT'].transform('mean'))
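
The two-stage fill used above (first the mean of the same month and hour, then the mean of the same hour) can be sketched without pandas as follows. This is only an illustration on hypothetical toy rows; `group_means` and `impute` are names introduced here, not part of the notebook.

```python
from collections import defaultdict


def group_means(rows, keys, field):
    """Mean of `field` per group, skipping missing (None) values."""
    sums = defaultdict(lambda: [0.0, 0])
    for r in rows:
        if r[field] is not None:
            acc = sums[tuple(r[k] for k in keys)]
            acc[0] += r[field]
            acc[1] += 1
    return {group: total / count for group, (total, count) in sums.items()}


def impute(rows, field):
    """Fill gaps with the (MONTH, HOUR) mean, then with the HOUR mean."""
    by_month_hour = group_means(rows, ("MONTH", "HOUR"), field)
    # Stage 1: use the mean of the same month and hour where one exists.
    stage1 = [
        dict(r, **{field: by_month_hour.get((r["MONTH"], r["HOUR"]))})
        if r[field] is None else dict(r)
        for r in rows
    ]
    # Stage 2: hourly means are computed after stage 1, as in the notebook.
    by_hour = group_means(stage1, ("HOUR",), field)
    return [
        by_hour.get((r["HOUR"],)) if r[field] is None else r[field]
        for r in stage1
    ]


# Hypothetical toy rows, not taken from the dataset.
rows = [
    {"MONTH": 3, "HOUR": 8, "CO_GT": 2.0},
    {"MONTH": 3, "HOUR": 8, "CO_GT": 4.0},
    {"MONTH": 3, "HOUR": 8, "CO_GT": None},  # filled from the March 08h mean
    {"MONTH": 4, "HOUR": 8, "CO_GT": None},  # no April 08h data: falls back to the 08h mean
]
filled = impute(rows, "CO_GT")
```

The fallback stage matters only when an entire month-hour group is missing, which is why the notebook checks the remaining NaN count between the two fills.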

#### 4) Understand correlation between variables<a name="corr"></a>

In [36]:
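#Use heatmap to see correlation between the numeric variables
#(DATE and TIME are excluded explicitly; recent pandas versions raise an error on non-numeric columns)
sns.heatmap(df_air.select_dtypes('number').corr(),annot=True,cmap='viridis')
plt.title('Heatmap of correlation between variables',fontsize=16)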
plt.show()
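
Each cell of the heatmap is a Pearson correlation coefficient, which is what `DataFrame.corr()` computes by default. As a rough sketch of the calculation behind a single cell, assuming two hypothetical columns of equal length:

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical values: humidity falls by the same amount for every step in temperature.
temperature = [10.0, 15.0, 20.0, 25.0]
humidity = [80.0, 70.0, 60.0, 50.0]
r = pearson(temperature, humidity)  # perfectly linear and decreasing, so r is -1
```

Values near +1 or -1 indicate a strong linear relationship with RH; values near 0 indicate that a linear model will get little from that feature on its own.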

#### 5) Try to understand degree of linearity between RH output and other input features<a name="lin"></a>

In [37]:
#plot all X-features against output variable RH
col_=df_air.columns.tolist()[2:]     #all columns after DATE and TIME
for i in col_:
    sns.lmplot(x=i,y='RH',data=df_air,markers='.')

### 6) Linear Regression<a name="LR"></a>

In [38]:
from sklearn.preprocessing import StandardScaler         #import normalisation package
from sklearn.model_selection import train_test_split      #import train test split
from sklearn.linear_model import LinearRegression         #import linear regression package
from sklearn.metrics import mean_squared_error,mean_absolute_error   #import mean squared error and mean absolute error

##### Define Feature (X) and Target (y)

In [39]:
X=df_air[col_].drop('RH',axis=1)     #X-input features
y=df_air['RH']                    #y-target variable

##### Normalize Feature variable

In [40]:
ss=StandardScaler()     #initialise

In [41]:
X_std=ss.fit_transform(X)     #apply standardisation
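
Standardisation rescales each feature to zero mean and unit variance; `StandardScaler` uses the population standard deviation (divisor n) for this. The sketch below reproduces the arithmetic for one hypothetical column. Note that it estimates the mean and standard deviation from the training values only and reuses them for the test value; that is a common variant and an assumption here, whereas the notebook fits the scaler on all rows before splitting.

```python
from math import sqrt


def fit_scaler(column):
    """Return the mean and population standard deviation (divisor n) of a column."""
    n = len(column)
    mean = sum(column) / n
    std = sqrt(sum((v - mean) ** 2 for v in column) / n)
    return mean, std


def transform(column, mean, std):
    """Apply the z-score (v - mean) / std with previously fitted statistics."""
    return [(v - mean) / std for v in column]


# Hypothetical temperatures, split into training and test values.
train_t = [10.0, 20.0, 30.0]
test_t = [40.0]

mean, std = fit_scaler(train_t)
train_scaled = transform(train_t, mean, std)  # centred on zero
test_scaled = transform(test_t, mean, std)    # scaled with the training statistics
```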

##### Train test split

In [42]:
#split the data into train and test sets (70% train, 30% test)
X_train, X_test, y_train, y_test=train_test_split(X_std,y,test_size=0.3, random_state=42)

In [43]:
print('Training data size:',X_train.shape)
print('Test data size:',X_test.shape)

##### Train the model

In [44]:
lr=LinearRegression()
lr_model=lr.fit(X_train,y_train)          #fit the linear model on train data

In [45]:
print('Intercept:',lr_model.intercept_)
print('--------------------------------')
print('Slope:')
list(zip(X.columns.tolist(),lr_model.coef_))

##### Prediction

In [46]:
y_pred=lr_model.predict(X_test)                      #predict using the model
rmse=np.sqrt(mean_squared_error(y_test,y_pred))      #calculate rmse
print('Baseline RMSE of model:',rmse)
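
RMSE is the square root of the mean squared difference between the true and predicted values, so it is expressed in the same unit as RH (percentage points). A small sketch of the calculation, using hypothetical RH values rather than the notebook's predictions:

```python
from math import sqrt


def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical RH values in percent.
y_true = [50.0, 60.0, 70.0]
y_pred = [53.0, 56.0, 70.0]
error = rmse(y_true, y_pred)  # errors of 3, 4 and 0 give sqrt(25/3)
```

Because the errors are squared before averaging, a few large misses raise the RMSE more than many small ones.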

#### <u>6a) Conclusion of baseline linear regression model:</u><a name="LRcon"></a>

Using all the features together, the linear model predicts RH with an **RMSE of 6.01**. We will call this the baseline model.

### 7) Feature engineering and testing model:<a name="FE"></a>

Try several feature combinations and see whether the RMSE improves.

##### Build RMSE function

In [47]:
# write function to measure RMSE
def train_test_RMSE(feature):
    X=df_air[feature]
    y=df_air['RH']
    X_std_one=ss.fit_transform(X)
    X_trainR,X_testR,y_trainR,y_testR=train_test_split(X_std_one,y,test_size=0.3,random_state=42)
    lr_model_one=lr.fit(X_trainR,y_trainR)
    y_predR=lr_model_one.predict(X_testR)
    return np.sqrt(mean_squared_error(y_testR,y_predR))

In [48]:
col_.remove('RH')        #remove output

In [49]:
print('List of features:',col_)    #print list of features

In [50]:
print('RMSE with Features as',col_[0:2],train_test_RMSE(col_[0:2]))
print('-------------------------')
print('RMSE with Features as',col_[0:6],train_test_RMSE(col_[0:6]))
print('-------------------------')
print('RMSE with Features as',col_[0:9],train_test_RMSE(col_[0:9]))
print('-------------------------')
print('RMSE with Features as',col_[2:9],train_test_RMSE(col_[2:9]))
print('-------------------------')
print('RMSE with Features as',col_[0:11],train_test_RMSE(col_[0:11]))
print('-------------------------')
print('RMSE with Features as',col_[1:12],train_test_RMSE(col_[1:12]))
print('-------------------------')
print('RMSE with Features as',col_[0:13],train_test_RMSE(col_[0:13]))
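
The cell above tries a few hand-picked slices of the feature list. One possible way to make the search more systematic is to enumerate subsets with `itertools.combinations` and keep the one with the lowest score. The sketch below is an assumption about how that could look, not what the notebook does: `best_subset` and `toy_score` are hypothetical, and in the notebook `train_test_RMSE` could play the role of `score`.

```python
from itertools import combinations


def best_subset(features, score, max_size):
    """Score every subset of up to max_size features; return (lowest score, subset)."""
    candidates = (
        subset
        for size in range(1, max_size + 1)
        for subset in combinations(features, size)
    )
    return min((score(list(subset)), subset) for subset in candidates)


def toy_score(subset):
    """Hypothetical stand-in for an RMSE function: rewards T and AH, penalises size."""
    return 10.0 - 2.0 * ("T" in subset) - 1.5 * ("AH" in subset) + 0.1 * len(subset)


features = ["CO_GT", "T", "AH", "HOUR"]
best_rmse, best_features = best_subset(features, toy_score, max_size=2)
```

With 13 features there are 8191 non-empty subsets, so an exhaustive search means that many model fits; capping the subset size as above keeps the run time manageable.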

#### <u>7a) Conclusion of Feature Engineering and testing:</u><a name="FEcon"></a>

After this experiment, the baseline model, which uses all the features, appears to perform best.

### 8) Decision Tree Regression<a name="DT"></a>

Let us apply decision tree regression and see whether the RMSE improves.

In [51]:
from sklearn.tree import DecisionTreeRegressor         #Decision tree regression model
from sklearn.model_selection import cross_val_score    #import cross validation score (sklearn.cross_validation was removed in newer scikit-learn releases)
from sklearn.model_selection import GridSearchCV        #import grid search cv
dt_one_reg=DecisionTreeRegressor()

##### Fit the DT model and predict:

In [52]:
dt_model=dt_one_reg.fit(X_train,y_train)         #fit the model
y_pred_dtone=dt_model.predict(X_test)            #predict

##### RMSE of RH prediction

In [53]:
#calculate RMSE
print('RMSE of Decision Tree Regression:',np.sqrt(mean_squared_error(y_test,y_pred_dtone)))

#### <u>Conclusion:</u> (Decision Tree Regression)

With decision tree regression, the **RMSE improves significantly, to 1.36**.

### 9) Random Forest Regression<a name="RF"></a>

Let us apply Random Forest regression and measure RMSE

In [54]:
from sklearn.ensemble import RandomForestRegressor           #import random forest regressor
rf_reg=RandomForestRegressor()

##### Fit the RF model and predict

In [55]:
rf_model=rf_reg.fit(X_train,y_train)         #fit model   
y_pred_rf=rf_model.predict(X_test)           #predict

##### RMSE of RH prediction

In [56]:
#Calculate RMSE
print('RMSE of predicted RH in RF model:',np.sqrt(mean_squared_error(y_test,y_pred_rf)))

##### Let's try to improve on the baseline RF model

In [57]:
#define rf parameters
rf_params={'n_estimators':[10,20],'max_depth':[8,10],'max_leaf_nodes':[70,90]}
#define rf grid search
rf_grid=GridSearchCV(rf_reg,rf_params,cv=10)
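
Before fitting, it can help to see how much work this grid implies: `GridSearchCV` evaluates every combination of the listed values once per fold. The sketch below only counts those combinations with `itertools.product`; it reuses the `rf_params` values and `cv=10` from the cell above and does not train anything.

```python
from itertools import product

# Same grid and fold count as in the notebook cell above.
rf_params = {'n_estimators': [10, 20], 'max_depth': [8, 10], 'max_leaf_nodes': [70, 90]}
cv_folds = 10

# Expand the grid into one dict per parameter combination.
names = list(rf_params)
combos = [dict(zip(names, values)) for values in product(*rf_params.values())]

n_cv_fits = len(combos) * cv_folds  # 8 combinations x 10 folds
```

The grid therefore costs 80 cross-validation fits, plus one final refit on the full training set with the default `refit=True`. It also only explores depth-limited and leaf-limited forests, which the default `RandomForestRegressor` is not.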

In [58]:
rf_model_two=rf_grid.fit(X_train,y_train)     #fit the model with all grid parameters

In [59]:
y_pred_rf_two=rf_model_two.predict(X_test)        #predict

In [60]:
#Calculate RMSE
print('RMSE using RF grid search method',np.sqrt(mean_squared_error(y_test,y_pred_rf_two)))  

#### <u>Conclusion: Random Forest</u>

With Random Forest regression, the **RMSE improves to 0.86**. The default Random Forest configuration gives a better RMSE than the parameter combinations tried in the grid search.

### 10) Support Vector Machine<a name="SVM"></a>

In [61]:
from sklearn.svm import SVR           #import support vector regressor
sv_reg=SVR()

In [62]:
sv_model=sv_reg.fit(X_train,y_train)    #train the model

In [63]:
y_pred_sv=sv_model.predict(X_test)         #predict

In [64]:
#Calculate RMSE of SVR
print('RMSE of SVR model:',np.sqrt(mean_squared_error(y_test,y_pred_sv)))

## Conclusion:<a name="conclusion"></a>

To build a model for predicting RH, I applied Linear Regression, Decision Tree, Random Forest and Support Vector Machine regression. Evaluated on the test data, the algorithms gave the following RMSE values:

**RMSE:** 

- Linear Regression: 6.01

- Decision Tree: 1.36

- **Random Forest: 0.86**

- Support Vector Machine: 3.89

<u>Hence Random Forest algorithm is selected for the prediction of RH using the features.</u>
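
The same selection can be expressed in a few lines. This sketch simply ranks the four RMSE values reported above and computes the improvement of the best model over the linear baseline; the variable names are introduced here for illustration.

```python
# RMSE values reported in the conclusion above.
rmse_by_model = {
    "Linear Regression": 6.01,
    "Decision Tree": 1.36,
    "Random Forest": 0.86,
    "Support Vector Machine": 3.89,
}

# Lower RMSE is better, so the best model is the one with the minimum value.
best_model = min(rmse_by_model, key=rmse_by_model.get)
ranking = sorted(rmse_by_model.items(), key=lambda item: item[1])

# Relative reduction in RMSE compared with the linear regression baseline.
baseline = rmse_by_model["Linear Regression"]
improvement = (baseline - rmse_by_model[best_model]) / baseline
```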

**Future:** 
Going forward, I would like to test whether applying PCA and adding the day of the month and the month of the year as features improves the prediction RMSE.
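
As a starting point for the calendar features mentioned above, the day and month can be extracted from a date string with the standard library. This is a minimal sketch; `calendar_features` is a hypothetical helper, and the `%m/%d/%Y` format is assumed from the `pd.to_datetime` call earlier in the notebook.

```python
from datetime import datetime


def calendar_features(date_text, fmt="%m/%d/%Y"):
    """Parse a date string and return day-of-month and month-of-year features."""
    parsed = datetime.strptime(date_text, fmt)
    return {"DAY": parsed.day, "MONTH": parsed.month}


# Hypothetical example date in the assumed month/day/year format.
features = calendar_features("03/10/2004")
```

In pandas, the equivalent values are available as `df_air['DATE'].dt.day` and `df_air['DATE'].dt.month` once the column has been converted to datetime.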