# 4 Pre-Processing and Training Data<a id='4_Pre-Processing_and_Training_Data'></a>

## 4.1 Contents<a id='4.1_Contents'></a>
* [4 Pre-Processing and Training Data](#4_Pre-Processing_and_Training_Data)
  * [4.1 Contents](#4.1_Contents)
  * [4.2 Introduction](#4.2_Introduction)
  * [4.3 Imports](#4.3_Imports)
  * [4.4 Load Data](#4.4_Load_Data)
  * [4.5 Train/Test Split](#4.6_Train/Test_Split)
  * [4.6 Initial Not-Even-A-Model](#4.6_Initial_Not-Even-A-Model)
    * [4.7.1 Metrics](#4.7.1_Metrics)
      * [4.7.1.1 R-squared, or coefficient of determination](#4.7.1.1_R-squared,_or_coefficient_of_determination)
      * [4.7.1.2 Mean Absolute Error](#4.7.1.2_Mean_Absolute_Error)
      * [4.7.1.3 Mean Squared Error](#4.7.1.3_Mean_Squared_Error)
  * [4.13 Save best model object from pipeline](#4.13_Save_best_model_object_from_pipeline)
  * [4.14 Summary](#4.14_Summary)


## 4.2 Introduction<a id='4.2_Introduction'></a>

The previous two notebooks covered data collection, cleaning, and exploratory data analysis. We have two datasets: one with weather observations at Kathmandu airport and one with daily air quality measurements from the US embassy at Phora Durbar. In this notebook, I build a few models to predict air quality on test data and measure the performance of each of them.

## 4.3 Imports<a id='4.3_Imports'></a>

In [1]:
import pandas as pd
import numpy as np
import os
import pickle
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import __version__ as sklearn_version
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
from sklearn.model_selection import train_test_split, cross_validate, GridSearchCV, learning_curve
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectKBest, f_regression
import datetime

## 4.4 Load Data<a id='4.4_Load_Data'></a>

Load the data that was saved at the end of the exploratory data analysis.

data = pd.read_csv('../data/realdata/exploratory_air_quality.csv',index_col='date')
data.head()

In [3]:
data.shape

(1675, 8)

### 4.4.1 Encoding season variable<a id='4.4.1_Encoding_Season'></a>

The `season` variable holds integer values. Let's check the value counts of this variable.

In [17]:
# season value and their percentage on total data
data.season.value_counts(), data.season.value_counts()/len(data)


(3    459
 4    431
 2    423
 1    362
 Name: season, dtype: int64,
 3    0.274030
 4    0.257313
 2    0.252537
 1    0.216119
 Name: season, dtype: float64)

- There are no data for January and February 2017. The main point, though, is that `season` takes only four integer values. Let's one-hot encode it so that the model does not treat the season codes as ordered numeric quantities.

In [19]:
# one hot encoding with pandas get_dummies function
data = pd.get_dummies(data, columns=['season'], prefix='season')
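
To see what `get_dummies` does to a column of integer codes, here is a minimal sketch on a made-up frame. The `toy` data and its values are assumptions for illustration, not the air quality data. Passing `dtype=int` is optional; it only forces the indicator columns to be 0/1 integers whatever the default of the installed pandas version is.

```python
import pandas as pd

# Hypothetical toy data: season codes 1-4 alongside one numeric feature
toy = pd.DataFrame({"T": [12.5, 24.0, 27.3, 18.1], "season": [1, 2, 3, 4]})

# Each distinct season code becomes its own 0/1 indicator column,
# and the original `season` column is dropped
encoded = pd.get_dummies(toy, columns=["season"], prefix="season", dtype=int)

print(list(encoded.columns))
# ['T', 'season_1', 'season_2', 'season_3', 'season_4']
```

Every row has exactly one indicator set to 1, so the model sees four separate on/off features rather than a single number whose size would appear to matter.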

## 4.5 Train/Test Split<a id='4.6_Train/Test_Split'></a>

So far, we've treated our data as a single entity. If we train a model on all of the data, nothing is set aside to evaluate its performance. We could keep building more and more complex models that fit the data better and better without realising they are overfitting to that one set of samples. By partitioning the data into training and testing splits, and not letting a model (or missing-value imputation) learn anything from the test split, we get a somewhat independent assessment of how our model might perform in the future. An often overlooked subtlety is that people all too frequently use the test set to assess model performance and then compare multiple models to pick the best. Their overall model selection process is then fitting to one specific data set, namely the test split. You could keep going, trying to get better and better performance on that one data set; this is where cross-validation becomes especially useful, because candidate models can be compared using the training data alone. Once models have been trained and selected, the test split serves as a final check on expected future performance.
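
The workflow described above can be sketched roughly as follows. This is only an illustration on synthetic data: the arrays `X` and `y`, the candidate names, and the choice of 5 folds are all assumptions, not the air quality data or the settings used later in this notebook.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate, train_test_split

# Synthetic stand-in data (an assumption for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=200)

# Hold out the test split once; it plays no part in model comparison
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=47)

# Compare candidate models with cross-validation on the training split only
candidates = {
    "mean baseline": DummyRegressor(strategy="mean"),
    "linear regression": LinearRegression(),
}
cv_r2 = {
    name: cross_validate(model, X_tr, y_tr, cv=5, scoring="r2")["test_score"].mean()
    for name, model in candidates.items()
}
best_name = max(cv_r2, key=cv_r2.get)

# Only the chosen model is scored on the test split, as a final check
final_r2 = candidates[best_name].fit(X_tr, y_tr).score(X_te, y_te)
print(best_name, round(final_r2, 3))
```

The test split is touched exactly once, after the choice between models has already been made.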

What partition sizes would you have with a 70/30 train/test split?
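
A quick way to work this out by hand, assuming (as I understand scikit-learn's behaviour) that `train_test_split` rounds the test size up and gives the remaining rows to training when only `test_size` is supplied:

```python
import math

n_samples = 1675        # number of rows reported by data.shape
test_fraction = 0.3

# 0.3 * 1675 = 502.5, which is rounded up for the test split
n_test = math.ceil(test_fraction * n_samples)
# the training split receives whatever is left
n_train = n_samples - n_test

print(n_train, n_test)  # 1172 503
```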

In [20]:
# use train_test_split function to split data
X_train, X_test, y_train, y_test = train_test_split(data.drop(columns='AQI'), 
                                                    data.AQI, test_size=0.3, 
                                                    random_state=47)

In [21]:
# check the data size
X_train.shape, X_test.shape

((1172, 10), (503, 10))

In [22]:
# also check training and testing label size.
y_train.shape, y_test.shape

((1172,), (503,))

In [23]:
#Check the `dtypes` attribute of `X_train` to verify all features are numeric
X_train.dtypes

T           float64
H           float64
PP          float64
VV          float64
V           float64
VM          float64
season_1      uint8
season_2      uint8
season_3      uint8
season_4      uint8
dtype: object

In [24]:
#Repeat this check for the test split in `X_test`
X_test.dtypes

T           float64
H           float64
PP          float64
VV          float64
V           float64
VM          float64
season_1      uint8
season_2      uint8
season_3      uint8
season_4      uint8
dtype: object

We have only numeric features in our X now!
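
With only ten columns, reading the `dtypes` output by eye is fine. For wider frames, the same check could be done programmatically. A minimal sketch, using a made-up frame (`toy` and its `station` column are hypothetical) into which one non-numeric column has slipped:

```python
import pandas as pd

# Hypothetical frame with one non-numeric column slipped in
toy = pd.DataFrame({"T": [12.5, 24.0], "season_1": [1, 0], "station": ["a", "b"]})

# Columns whose dtype is not numeric; an empty list means every feature is numeric
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()

print(non_numeric)  # ['station']
```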

## 4.6 Initial Not-Even-A-Model<a id='4.6_Initial_Not-Even-A-Model'></a>

A good place to start is to see how good the mean is as a predictor. In other words, what if you simply say your best guess is the average value of air quality?

In [25]:
#Calculate the mean of `y_train`
train_mean = y_train.mean()
train_mean

110.70136518771331

`sklearn`'s `DummyRegressor` can be used to do the same thing easily.

In [26]:
# Fit the dummy regressor on the training data, then check that its
# `constant_` attribute matches the mean computed above
dumb_reg = DummyRegressor(strategy='mean')
dumb_reg.fit(X_train, y_train)
dumb_reg.constant_

array([[110.70136519]])

Use the `DummyRegressor` to predict on the training data, and use the same constant value for the test data, to see what baseline score we get.

In [27]:
y_tr_pred = dumb_reg.predict(X_train)

In [28]:
# get the baseline R-squared value from this prediction
r2_score(y_train, y_tr_pred)

0.0

Exactly as expected, if you use the average value as your prediction, you get an $R^2$ of zero on our training set. What if you use this "model" to predict unseen values from the test set? Remember, of course, that your "model" is trained on the training set; you still use the training set mean as your prediction.

In [29]:
y_te_pred = train_mean * np.ones(len(y_test))
r2_score(y_test, y_te_pred)

-0.0004345281997877315

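It is slightly negative. That makes sense: on the training set, predicting the training mean gives an $R^2$ of exactly 0.0. On the test set, the training mean differs a little from the test set's own mean, so it does slightly worse than predicting the test mean would, and the score falls just below zero.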
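
To make this concrete, recall that $R^2 = 1 - SS_{res}/SS_{tot}$, where $SS_{res}$ is the sum of squared prediction errors and $SS_{tot}$ is the sum of squared deviations of the true values from their own mean. The sketch below computes it by hand on small made-up arrays. The helper `r_squared` and the toy values are hypothetical and only for illustration.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, computed directly from the definition."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Made-up values: the training mean is 115, the test mean is 110
toy_train = np.array([100.0, 110.0, 120.0, 130.0])
toy_test = np.array([90.0, 110.0, 130.0])

# The "model" always predicts the training mean
pred_train = np.full(len(toy_train), toy_train.mean())
pred_test = np.full(len(toy_test), toy_train.mean())

print(r_squared(toy_train, pred_train))  # 0.0
print(r_squared(toy_test, pred_test))    # -0.09375
```

On the training data $SS_{res}$ and $SS_{tot}$ are the same quantity, so the score is exactly zero. On the test data the constant prediction is off-centre, $SS_{res}$ exceeds $SS_{tot}$, and the score goes negative.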

## 4.7 Linear Regression Model<a id='4.7_Linear_Regression_Model'></a>

In [34]:
regressor = LinearRegression()
regressor.fit(X_train,y_train)

LinearRegression()

In [35]:
# for sklearn's LinearRegression, the `.score()` method returns the R-squared value
regressor.score(X_train,y_train)

0.7961188673535644
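
As a sanity check on what `.score()` reports, the following sketch compares it with `r2_score` applied to the model's own predictions. It uses synthetic data (`X_demo` and `y_demo` are assumptions for illustration, not the air quality features), so the numbers themselves carry no meaning for this project.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic stand-in data (an assumption for illustration only)
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(100, 2))
y_demo = 3.0 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(scale=0.3, size=100)

model = LinearRegression().fit(X_demo, y_demo)

# Two routes to the same number
score_via_method = model.score(X_demo, y_demo)
score_via_metric = r2_score(y_demo, model.predict(X_demo))

print(np.isclose(score_via_method, score_via_metric))  # True
```

The same pattern applies to held-out data: passing a test split to `.score()` gives the test-set $R^2$.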