# Parts 3 and 4:

### In Parts 1 & 2, we explored the data and experimented with transformations.
### In Parts 3 & 4, we use a pipeline for model selection, tune the hyperparameters
### of the selected model, and build the prediction functions


## Part 3: Pipeline and Model selection
### Build the data transformation pipeline
### Use the pipeline in Cross Validation to select the best model

## Part 4: Hyperparameter tuning and Prediction Function
### Use GridSearchCV to tune the parameters of the selected model
### Use the tuned model to evaluate the test set
### Save the final_model as a pickle file
### Build 2 Prediction App Functions: predict_mpg and predict_mpg_web

## Part 3: Pipeline and Model selection

### Build the data transformation pipeline

In [2]:
# In Part 2 we identified the following data preparation steps:
#     1. Encode Origin values as categories
#     2. One-hot encode these Origin categories
#     3. Impute missing values with the median
#     4. Add 2 new interaction features: acc_and_power and wt_and_cyl
#     5. Scale numeric values with MinMaxScaler

In [3]:
# We will use cross validation in model selection.

In [4]:
# To avoid information leakage during validation, we put transformations 2-5
# into a pipeline that is fed into the cross-validation loop, so each
# transformation is fit only on the training folds.
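The leakage argument can be illustrated with a minimal sketch. It uses synthetic data and a plain `LinearRegression` as stand-ins (both are assumptions, not the auto-mpg setup): because the scaler is a step of the pipeline, `cross_val_score` refits it on the training folds of every split, and the held-out fold never influences the scaling.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (an assumption; not the auto-mpg set).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=60)

# The scaler lives inside the pipeline, so each CV split fits it
# on that split's training folds only.
leak_free = Pipeline([
    ("scaler", MinMaxScaler()),
    ("model", LinearRegression()),
])

cv = KFold(n_splits=5, shuffle=True, random_state=42)
neg_mse = cross_val_score(leak_free, X, y, cv=cv,
                          scoring="neg_mean_squared_error")
rmse_per_fold = np.sqrt(-neg_mse)
print(rmse_per_fold.mean())
```

Scaling the whole of `X` before calling `cross_val_score` would instead let the minimum and maximum of the held-out rows shape the training features.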

In [5]:
# To keep this notebook independent from Part 1 and 2, we are going to start fresh

# Import Libraries

In [6]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
import warnings
import math
warnings.filterwarnings('ignore')

In [7]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

In [8]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.ensemble import AdaBoostRegressor

# Load the data

In [9]:
# defining the column names
cols = ['MPG','Cylinders','Displacement','Horsepower','Weight',
                'Acceleration', 'Model Year', 'Origin']
# reading the .data file using pandas
df = pd.read_csv('./auto-mpg.data', names=cols, na_values = "?",
                comment = '\t',
                sep= " ",
                skipinitialspace=True)
#making a copy of the dataframe for exploration
data = df.copy()

# Split data into train and test sets

In [10]:
from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(data, data["Cylinders"]):
    strat_train_set = data.loc[train_index]
    strat_test_set = data.loc[test_index]
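As a rough check of what stratifying buys, the sketch below splits a toy frame (assumed data, not auto-mpg) and compares the category proportions before and after. Note that `split()` yields positional indices; `.loc` works in the cell above because the frame has the default integer index, while `.iloc` is the safer accessor in general.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Toy frame with an imbalanced category column (assumed data, not auto-mpg).
toy = pd.DataFrame({
    "Cylinders": [4] * 60 + [6] * 25 + [8] * 15,
    "value": np.arange(100),
})

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(split.split(toy, toy["Cylinders"]))

# split() yields positional indices, so iloc is the safe accessor.
train, test = toy.iloc[train_idx], toy.iloc[test_idx]

# The test proportions should track the proportions of the full frame.
print(toy["Cylinders"].value_counts(normalize=True).sort_index())
print(test["Cylinders"].value_counts(normalize=True).sort_index())
```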

In [11]:
# Further split train and test into X and y

In [12]:
y_train=strat_train_set['MPG']
X_train=strat_train_set.drop('MPG', axis=1)

In [13]:
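# The test split must come from strat_test_set. The outputs recorded further down
# (X_test.head(), test RMSE) were captured from a run that used training rows,
# so expect different values on a rerun.
y_test=strat_test_set['MPG']
X_test=strat_test_set.drop('MPG', axis=1)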

In [14]:
X_test.head()

Unnamed: 0,Cylinders,Displacement,Horsepower,Weight,Acceleration,Model Year,Origin
145,4,83.0,61.0,2003.0,19.0,74,3
151,4,79.0,67.0,2000.0,16.0,74,2
388,4,156.0,92.0,2585.0,14.5,82,1
48,6,250.0,88.0,3139.0,14.5,71,1
114,4,98.0,90.0,2265.0,15.5,73,2


In [15]:
# Redefine the function preprocess_origin_cols and the custom transformer CustomAttrAdder

In [16]:
# preprocess_origin_cols as defined in Part 2
# (note: it modifies the DataFrame passed in and returns that same object)
def preprocess_origin_cols(df):
    df["Origin"] = df["Origin"].map({1: "India", 2: "USA", 3: "Germany"})
    return df

In [17]:
from sklearn.base import BaseEstimator, TransformerMixin

acc_ix, wt_ix, hpower_ix, cyl_ix = 4, 3, 2, 0

##custom class inheriting the BaseEstimator and TransformerMixin
class CustomAttrAdder(BaseEstimator, TransformerMixin):
    def __init__(self, acc_and_power=True):
        self.acc_and_power = acc_and_power  # new optional variable
    def fit(self, X, y=None):
        return self  # nothing else to do
    def transform(self, X):
        wt_and_cyl = X[:, wt_ix] * X[:, cyl_ix] # required new variable
        if self.acc_and_power:
            acc_and_power = X[:, acc_ix] * X[:, hpower_ix]
            return np.c_[X, acc_and_power, wt_and_cyl] # returns a 2D array
        
        return np.c_[X, wt_and_cyl]
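To make the column arithmetic concrete, here is a small NumPy-only sketch of what the transform computes. `add_interactions` is a hypothetical stand-alone helper written for illustration; it mirrors `CustomAttrAdder.transform` with `acc_and_power=True` and is applied to two rows taken from the table shown earlier.

```python
import numpy as np

# Column positions in the numeric block, as in the cell above.
acc_ix, wt_ix, hpower_ix, cyl_ix = 4, 3, 2, 0

def add_interactions(X):
    """Hypothetical stand-alone version of CustomAttrAdder.transform."""
    acc_and_power = X[:, acc_ix] * X[:, hpower_ix]
    wt_and_cyl = X[:, wt_ix] * X[:, cyl_ix]
    return np.c_[X, acc_and_power, wt_and_cyl]

# Columns: Cylinders, Displacement, Horsepower, Weight, Acceleration, Model Year
X = np.array([
    [4.0, 83.0, 61.0, 2003.0, 19.0, 74.0],
    [6.0, 250.0, 88.0, 3139.0, 14.5, 71.0],
])
out = add_interactions(X)
print(out.shape)    # (2, 8)
print(out[:, 6:])   # acc_and_power and wt_and_cyl for each row
```

The hard-coded indices only hold while the numeric columns keep this order, which is why the transformer sits after the imputer in the numeric pipeline and never sees the Origin column.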

In [18]:
# Apply preprocess_origin_cols to X_train

In [19]:
preprocessed_X_train = preprocess_origin_cols(X_train)
preprocessed_X_train.head()

Unnamed: 0,Cylinders,Displacement,Horsepower,Weight,Acceleration,Model Year,Origin
145,4,83.0,61.0,2003.0,19.0,74,Germany
151,4,79.0,67.0,2000.0,16.0,74,USA
388,4,156.0,92.0,2585.0,14.5,82,India
48,6,250.0,88.0,3139.0,14.5,71,India
114,4,98.0,90.0,2265.0,15.5,73,USA


In [23]:
# Build Pipeline

In [24]:
#numerics = ['float64', 'int64']

#num_attrs = preprocessed_X_train.select_dtypes(include=numerics)

#num_pipeline = Pipeline([
#        ('imputer', SimpleImputer(strategy="median")),
#        ('attrs_adder', CustomAttrAdder()),
#        ('minmax_scaler', MinMaxScaler()),
#        ])

In [25]:
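numerics = ['float64', 'int64']

# Numeric columns only; Origin is now a string column and is handled separately
num_attrs = preprocessed_X_train.select_dtypes(include=numerics)

num_pipeline = Pipeline([
        # fill_value only applies to strategy="constant"; with "median" it has no effect
        ('imputer', SimpleImputer(strategy="median", fill_value=0)),
        ('attrs_adder', CustomAttrAdder()),
        ('minmax_scaler', MinMaxScaler()),
        ])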

In [26]:
cat_attrs = ["Origin"]

In [27]:
full_pipeline = ColumnTransformer([
        ("num", num_pipeline, list(num_attrs)),
        ("cat", OneHotEncoder(), cat_attrs),
        ])

### Use the pipeline in Cross Validation to select the best model

In [28]:
# Cross validate with LinearRegression, DecisionTreeRegressor, RandomForestRegressor, SVR, AdaBoostRegressor

In [29]:
# Setup the models list

In [30]:
models=[]

In [31]:
models.append(['LR', LinearRegression()])

In [32]:
models.append(['DT', DecisionTreeRegressor()])

In [33]:
models.append(['SVR', SVR(kernel='linear')])

In [34]:
models.append(['RF', RandomForestRegressor()])

In [35]:
models.append(['AdB', AdaBoostRegressor()])

In [36]:
# Initialize the scores and mdnames lists

In [37]:
scores=[]
mdnames=[]

In [38]:
# The same shuffled 10-fold splitter is used for every model
kfold = KFold(n_splits=10, random_state=42, shuffle=True)
for mdname, model in models:
    # The transformations are refit inside every fold because they are part of the pipeline
    full_pipeline_model = Pipeline(steps=[('full_pipeline', full_pipeline),
                                          ('model', model)])
    score = cross_val_score(full_pipeline_model, preprocessed_X_train, y_train,
                            cv=kfold, scoring='neg_mean_squared_error')
    reg_rmse_scores = np.sqrt(-score)  # scores are negative MSE, one per fold
    print(mdname, reg_rmse_scores.mean())
    mdnames.append(mdname)
    scores.append(reg_rmse_scores.mean())

LR 2.9593880614663837
DT 3.6686167586621545
SVR 3.499709377844303
RF 2.75583995070712
AdB 2.969122902382881
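One detail worth keeping in mind when comparing numbers: the loop above averages the per-fold RMSE values, whereas the grid-search cell in Part 4 prints `np.sqrt(-mean_score)`, the square root of the averaged MSE. The two aggregates are close but not identical, as this sketch with made-up fold scores shows.

```python
import numpy as np

# Made-up per-fold scores in scikit-learn's "neg_mean_squared_error" convention.
neg_mse = np.array([-9.0, -4.0])

mean_of_rmse = np.sqrt(-neg_mse).mean()   # average the per-fold RMSEs
rmse_of_mean = np.sqrt(-neg_mse.mean())   # RMSE of the averaged MSE

print(mean_of_rmse)  # 2.5
print(rmse_of_mean)  # about 2.55
```

The mean of the per-fold RMSEs can never exceed the RMSE of the mean MSE, so small gaps between the two kinds of score do not by themselves indicate a better or worse model.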


In [39]:
# We are going to select RandomForestRegressor as the best model
# since it has the lowest rmse

## Part 4: Hyperparameter tuning and Prediction Function

### Use GridSearchCV to tune the parameters of the selected model

In [41]:

from sklearn.model_selection import GridSearchCV
full_pipeline_RF=Pipeline(steps= [ ('full_pipeline', full_pipeline),
                                 ('model', RandomForestRegressor(random_state=42))])

In [42]:
param_grid = [
    {'model__n_estimators': [3, 10, 30], 'model__max_features': [2, 4, 6, 8]},
    {'model__bootstrap': [False], 'model__n_estimators': [3, 10], 'model__max_features': [2, 3, 4]},
  ]
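The size of this search can be counted before running it. The sketch below uses `ParameterGrid` from scikit-learn on the same grid; the fold count of 10 matches the `cv=10` passed to the grid search.

```python
from sklearn.model_selection import ParameterGrid

param_grid = [
    {"model__n_estimators": [3, 10, 30], "model__max_features": [2, 4, 6, 8]},
    {"model__bootstrap": [False], "model__n_estimators": [3, 10],
     "model__max_features": [2, 3, 4]},
]

n_candidates = len(ParameterGrid(param_grid))  # 3*4 + 1*2*3
n_folds = 10  # matches cv=10 used in the grid search
print(n_candidates)            # 18
print(n_candidates * n_folds)  # 180
```

So the search trains 180 pipelines during cross validation, plus one final refit of the best candidate on the full training set (the default `refit=True` behaviour).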

In [43]:
preprocessed_X_train.head()

Unnamed: 0,Cylinders,Displacement,Horsepower,Weight,Acceleration,Model Year,Origin
145,4,83.0,61.0,2003.0,19.0,74,Germany
151,4,79.0,67.0,2000.0,16.0,74,USA
388,4,156.0,92.0,2585.0,14.5,82,India
48,6,250.0,88.0,3139.0,14.5,71,India
114,4,98.0,90.0,2265.0,15.5,73,USA


In [44]:
grid_search = GridSearchCV(estimator=full_pipeline_RF, param_grid=param_grid,
                           scoring='neg_mean_squared_error',
                           return_train_score=True,
                           cv=10,
                          )

In [None]:
## Use the transforms + estimator to fit the train data

In [45]:
grid_search.fit(preprocessed_X_train, y_train)

GridSearchCV(cv=10,
             estimator=Pipeline(steps=[('full_pipeline',
                                        ColumnTransformer(transformers=[('num',
                                                                         Pipeline(steps=[('imputer',
                                                                                          SimpleImputer(fill_value=0,
                                                                                                        strategy='median')),
                                                                                         ('attrs_adder',
                                                                                          CustomAttrAdder()),
                                                                                         ('minmax_scaler',
                                                                                          MinMaxScaler())]),
                                                                    

In [46]:
grid_search.best_params_

{'model__max_features': 6, 'model__n_estimators': 30}

In [47]:
cv_scores=grid_search.cv_results_

In [48]:
## Printing all the parameters with their scores

In [49]:
for mean_score, params in zip(cv_scores['mean_test_score'], cv_scores['params']):
    print (np.sqrt(-mean_score), params)

3.6131583096274795 {'model__max_features': 2, 'model__n_estimators': 3}
3.129967137890459 {'model__max_features': 2, 'model__n_estimators': 10}
2.863684277878617 {'model__max_features': 2, 'model__n_estimators': 30}
3.263823201504794 {'model__max_features': 4, 'model__n_estimators': 3}
2.9652320123553184 {'model__max_features': 4, 'model__n_estimators': 10}
2.7696854751123614 {'model__max_features': 4, 'model__n_estimators': 30}
3.2780296152722825 {'model__max_features': 6, 'model__n_estimators': 3}
2.8796273227876483 {'model__max_features': 6, 'model__n_estimators': 10}
2.743371725256909 {'model__max_features': 6, 'model__n_estimators': 30}
3.104744696128202 {'model__max_features': 8, 'model__n_estimators': 3}
2.86127489454918 {'model__max_features': 8, 'model__n_estimators': 10}
2.7508379572133803 {'model__max_features': 8, 'model__n_estimators': 30}
3.3013823371462907 {'model__bootstrap': False, 'model__max_features': 2, 'model__n_estimators': 3}
2.9059875618586912 {'model__bootstra

In [50]:
np.sqrt(-grid_search.best_score_)

2.743371725256909

In [51]:
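# best_estimator_ is the fitted Pipeline; take the forest from its 'model' step.
# (Indexing the forest again, as in best_estimator_[1][1], returns a single tree;
# the array recorded below came from that form, so a rerun will give different values.)
feature_importances=grid_search.best_estimator_.named_steps['model'].feature_importances_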

In [52]:
feature_importances

array([0.00000000e+00, 1.35591374e-02, 2.02335155e-02, 2.19930400e-01,
       1.89172237e-02, 1.34431498e-01, 2.30440337e-02, 5.66131052e-01,
       3.40881815e-04, 1.70964069e-03, 1.70261697e-03])

In [53]:
extra_attrs=["acc_and_power","wt_and_cycl"]
numerics=['float64', 'int64']
num_attrs=list(preprocessed_X_train.select_dtypes(include=numerics))
attrs=num_attrs+ extra_attrs
# zip stops at the shorter list, so the three one-hot Origin columns are left out
# sort by importance (second element of each pair), highest first
sorted(zip(attrs, feature_importances), key=lambda pair: pair[1], reverse=True)

[('wt_and_cycl', 0.5661310523638698),
 ('Weight', 0.21993039961963212),
 ('Model Year', 0.1344314982762287),
 ('acc_and_power', 0.023044033727378375),
 ('Horsepower', 0.020233515467125117),
 ('Acceleration', 0.01891722365654645),
 ('Displacement', 0.013559137416755306),
 ('Cylinders', 0.0)]
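The importance array has 11 entries while only 8 names are listed, because the last three entries belong to the one-hot Origin columns. A sketch of a full ranking follows; the `Origin_*` labels are hypothetical names chosen for illustration, ordered alphabetically as `OneHotEncoder` orders its categories by default.

```python
import numpy as np

# Importances copied from the array printed above (11 values).
importances = np.array([
    0.00000000e+00, 1.35591374e-02, 2.02335155e-02, 2.19930400e-01,
    1.89172237e-02, 1.34431498e-01, 2.30440337e-02, 5.66131052e-01,
    3.40881815e-04, 1.70964069e-03, 1.70261697e-03,
])

names = ["Cylinders", "Displacement", "Horsepower", "Weight",
         "Acceleration", "Model Year", "acc_and_power", "wt_and_cyl",
         # Hypothetical labels; OneHotEncoder sorts categories alphabetically.
         "Origin_Germany", "Origin_India", "Origin_USA"]

# zip would silently truncate if the lengths differed, so check first.
assert len(names) == len(importances)

# Putting the score first makes sorted() rank by importance, not by name.
ranked = sorted(zip(importances, names), reverse=True)
for score, name in ranked[:3]:
    print(f"{name}: {score:.3f}")
```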

In [54]:
# In the values above, wt_and_cycl has by far the highest importance (about 0.57), followed by
# Weight (0.22) and Model Year (0.13); acc_and_power contributes only about 0.02

### Use the tuned model to evaluate the test set

In [55]:
# First the test data is preprocessed to convert Origin integers to category names
# Then the fitted grid search is used to predict and score the preprocessed test data

In [56]:
preprocessed_X_test = preprocess_origin_cols(X_test)
preprocessed_X_test.head()

Unnamed: 0,Cylinders,Displacement,Horsepower,Weight,Acceleration,Model Year,Origin
145,4,83.0,61.0,2003.0,19.0,74,Germany
151,4,79.0,67.0,2000.0,16.0,74,USA
388,4,156.0,92.0,2585.0,14.5,82,India
48,6,250.0,88.0,3139.0,14.5,71,India
114,4,98.0,90.0,2265.0,15.5,73,USA


In [57]:
final_predictions = grid_search.predict(preprocessed_X_test)

In [58]:
from sklearn.metrics import mean_squared_error

In [59]:
final_mse=mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)

In [60]:
final_rmse

1.0930832663862664

In [61]:
result= grid_search.score(preprocessed_X_test, y_test)

In [62]:
rmse=np.sqrt(-result)
rmse

1.0930832663862664

In [63]:
# This matches the final_rmse computed above with mean_squared_error
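The agreement is expected: when `scoring` is set, `GridSearchCV.score` applies that scorer to the refit best estimator, so with `neg_mean_squared_error` it returns the negated MSE of its own predictions. A small sketch on synthetic data, with `Ridge` as an assumed stand-in estimator:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (an assumption; not the auto-mpg set).
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 2))
y = 3.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.2, size=40)

search = GridSearchCV(Ridge(), {"alpha": [0.1, 1.0]},
                      scoring="neg_mean_squared_error", cv=4)
search.fit(X, y)

# score() returns negative MSE, so negate it before taking the root.
rmse_from_score = np.sqrt(-search.score(X, y))
rmse_from_metric = np.sqrt(mean_squared_error(y, search.predict(X)))
print(np.isclose(rmse_from_score, rmse_from_metric))  # True
```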

In [64]:
# Test it on a single user input configuration

In [65]:
user_input = {
    'Cylinders': [4],
    'Displacement': [155.0],
    'Horsepower': [93.0],
    'Weight': [2500.0],
    'Acceleration': [15.0],
    'Model Year': [81],
    'Origin': [3]
}

In [66]:
user_input_df=pd.DataFrame(user_input)

In [67]:
user_input_df

Unnamed: 0,Cylinders,Displacement,Horsepower,Weight,Acceleration,Model Year,Origin
0,4,155.0,93.0,2500.0,15.0,81,3


In [68]:
preprocessed_user_input_df=preprocess_origin_cols(user_input_df)

In [69]:
grid_search.predict(preprocessed_user_input_df)

array([31.06666667])

In [70]:
grid_search.predict(preprocessed_user_input_df)[0]

31.066666666666666

In [71]:
# Capture the fitted grid search as pipeline_m and apply it to the same user input

In [72]:
pipeline_m=grid_search

In [73]:
pipeline_m.predict(preprocessed_user_input_df)[0]

31.066666666666666

### Save the final_model as a pickle file

In [75]:
import pickle

## dump pipeline_m into a file
with open("final_model.pkl", 'wb') as f_out:
    pickle.dump(pipeline_m, f_out) # write into final_model.pkl
    # no explicit close needed: the with block closes the file

In [76]:
##loading the model from the saved file and test it again
with open('final_model.pkl', 'rb') as f_in:
    model = pickle.load(f_in)

In [77]:
model.predict(preprocessed_user_input_df)[0]

31.066666666666666
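The save-and-reload round trip can be sketched in isolation with a tiny stand-in model and a temporary directory (both assumptions, so the sketch does not touch `final_model.pkl`).

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny stand-in model (an assumption; the notebook pickles the fitted grid search).
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
fitted = LinearRegression().fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pkl")
    with open(path, "wb") as f_out:   # the file is closed when the block exits
        pickle.dump(fitted, f_out)
    with open(path, "rb") as f_in:
        restored = pickle.load(f_in)

# The restored model should reproduce the original predictions.
same = np.allclose(fitted.predict(X), restored.predict(X))
print(same)  # True
```

Two caveats apply to the real pipeline: unpickling needs the `CustomAttrAdder` class to be importable under the same module path it had when saved, and scikit-learn advises loading a pickle with the same library version that wrote it.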

### Build 2 Prediction App Functions: predict_mpg and predict_mpg_web

### Build predict_mpg

In [78]:
# predict_mpg() returns MPG predictions for a vehicle config and a model.
# Inputs: config, model
#       config: dictionary or DataFrame containing vehicle configuration information,
#               with Origin coded as integers.
#       model: a pipeline that includes the data transformations and the model estimator
#
# This function is NOT to be used in the Streamlit app, since it expects the Origin
# column to be coded with integers.
# Config is passed through preprocess_origin_cols before model.predict is called.

def predict_mpg(config, model):
    if isinstance(config, dict):
        df = pd.DataFrame(config)
    else:
        df = config.copy()  # copy so the caller's DataFrame is not modified

    preproc_df = preprocess_origin_cols(df)
    # Note: the model is in the form of pipeline_m, including both transforms and the estimator
    y_pred = model.predict(preproc_df)
    return y_pred

In [79]:
# We define another form of the prediction function, predict_mpg_web, to handle web user inputs where Origin is already a country name

In [80]:
# predict_mpg_web() returns MPG predictions for a vehicle config and a model.
# Inputs: config, model
#       config: dictionary or DataFrame containing vehicle configuration information,
#               with Origin given as a country name: "India", "USA", "Germany"
#       model: a pipeline that includes the data transformations and the model estimator
#
# This function is to be used in the Streamlit app, since it expects the Origin
# column to already hold country names.
# Config does not have to be preprocessed by preprocess_origin_cols before model.predict.

def predict_mpg_web(config, model):
    if isinstance(config, dict):
        df = pd.DataFrame(config)
    else:
        df = config
    # Note: the model is in the form of pipeline_m, including both transforms and the estimator
    # The config already has Origin as a country name
    y_pred = model.predict(df)
    return y_pred
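Both prediction functions hand the config straight to `model.predict`, so a misspelled or missing column name typically surfaces only as an error raised from inside the pipeline. One option is a small check up front. `check_config` and `REQUIRED_COLUMNS` below are hypothetical names introduced for this sketch; they are not part of the notebook.

```python
# Column names the fitted pipeline was trained on (taken from the cells above).
REQUIRED_COLUMNS = ["Cylinders", "Displacement", "Horsepower", "Weight",
                    "Acceleration", "Model Year", "Origin"]

def check_config(config):
    """Hypothetical helper: raise early if a config dict is malformed."""
    missing = [c for c in REQUIRED_COLUMNS if c not in config]
    if missing:
        raise KeyError(f"missing columns: {missing}")
    lengths = {len(config[c]) for c in REQUIRED_COLUMNS}
    if len(lengths) != 1:
        raise ValueError("all columns must have the same number of rows")
    return True

good = {"Cylinders": [6], "Displacement": [193.0], "Horsepower": [104],
        "Weight": [2970], "Acceleration": [16], "Model Year": [76],
        "Origin": ["USA"]}
bad = dict(good)
bad["Acceelation"] = bad.pop("Acceleration")  # misspelled key

print(check_config(good))  # True
try:
    check_config(bad)
except KeyError as err:
    print(err)  # reports that 'Acceleration' is missing
```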

In [81]:
# Test predict_mpg_web on user input in DataFrame form, with Origin given as categorical values

In [82]:
predict_mpg_web(preprocessed_user_input_df, model)[0]

31.066666666666666

In [83]:
# Test predict_mpg_web using user input in dictionary form, with Origin given as categorical values.
# Note that this is the format in which web app users will enter the data.

In [84]:
Orig=["USA"]
Cyl=[6.0]
Disp=[193.0]
Power=[104]
WT=[2970]
Acc=[16]
MY=[76]

In [85]:
vehicle={"Origin": Orig, "Cylinders": Cyl, "Displacement": Disp, "Horsepower": Power,
             "Weight":WT, "Acceleration": Acc, "Model Year": MY
            }

In [86]:
type(vehicle)

dict

In [87]:
predict_mpg_web(vehicle, model)[0]

20.296666666666667

In [88]:
# Test predict_mpg with config in dictionary form with Origin coded with integers

In [89]:
#vehicle config
vehicle_config = {
    'Cylinders': [4, 6, 8],
    'Displacement': [155.0, 160.0, 165.5],
    'Horsepower': [93.0, 130.0, 98.0],
    'Weight': [2500.0, 3150.0, 2600.0],
    'Acceleration': [15.0, 14.0, 16.0],
    'Model Year': [81, 80, 78],
    'Origin': [3, 2, 1]
}

In [90]:
predict_mpg(vehicle_config, model)

array([31.06666667, 22.05333333, 20.62      ])

### Build the predict_mpg_web1

In [91]:
# Try a different version of predict_mpg_web, named predict_mpg_web1, that loads the model itself

In [92]:
# predict_mpg_web1() returns MPG predictions for a vehicle config.
# Inputs: config
#       config: dictionary or DataFrame containing vehicle configuration information,
#               with Origin given as a country name: "India", "USA", "Germany"
#
# This function is to be used in the Streamlit app, since it expects the Origin
# column to already hold country names.
# Config does not have to be preprocessed by preprocess_origin_cols before model.predict.

def predict_mpg_web1(config):
    # Load the saved pipeline; the with block closes the file afterwards
    with open('final_model.pkl', 'rb') as pickle_in:
        model = pickle.load(pickle_in)
    if isinstance(config, dict):
        df = pd.DataFrame(config)
    else:
        df = config

    # Note: the model is in the form of pipeline_m, including both transforms and the estimator
    # The config already has Origin as a country name
    y_pred = model.predict(df)
    return y_pred
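Because `predict_mpg_web1` unpickles the model on every call, a web app that predicts repeatedly would reread the file each time. One possible refinement is to cache the loaded object. `load_model` is a hypothetical helper introduced for this sketch, and the demo pickles a plain dictionary so it runs without `final_model.pkl`.

```python
import os
import pickle
import tempfile
from functools import lru_cache

@lru_cache(maxsize=1)
def load_model(path):
    """Hypothetical helper: unpickle once per path, then reuse the object."""
    with open(path, "rb") as f_in:
        return pickle.load(f_in)

# Stand-in payload so the sketch runs without final_model.pkl.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo.pkl")
with open(demo_path, "wb") as f_out:
    pickle.dump({"name": "demo"}, f_out)

first = load_model(demo_path)
second = load_model(demo_path)
print(first is second)  # True: the second call hits the cache
```

The trade-off is that a cached model is not refreshed if the file changes on disk; calling `load_model.cache_clear()` forces a reload.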


In [93]:
predict_mpg_web1(vehicle)[0]

20.296666666666667