# Section 5. Modeling 

In this notebook, we show examples of building machine learning models and improving model performance through hyperparameter tuning.

### CONTENTS
* <a href='04 - DSC 2022 Feature Engineering.ipynb#top'>**Section 4. Feature Engineering**</a> 
* <a href='05 - DSC 2022 Modeling.ipynb#top'>**Section 5. Modeling**</a>
  * [1. Machine learning](#ml)
  * [2. Improving model performance](#improve)
* <a href='06 - DSC 2022 Modeling with Deep Learning.ipynb#top'>**Section 6. Modeling with Deep Learning**</a>
* <a href='07 - DSC 2022 Submission.ipynb#top'>**Section 7. Submission**</a>


- For **best model performance**, the metric we will measure you on is __[Mean Squared Error](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html)__(__MSE__). The equation is:
\begin{equation*}
MSE   = \frac{\sum_{i = 1}^{N}MSE_i}{N}, \text{where N is the number of observations}.
\end{equation*} That is, the MSE score is the average MSE you get across all observations. The lower the MSE, the better the model.

    For each observation $i$, $MSE_i$ is calculated as 
\begin{equation*}
MSE_i   = \frac{\sum_{t \in \{1, 7, 30, 90, 180\}}(y^t_{i,true}-y^t_{i,pred})^2}{5}.
\end{equation*} 

- For **best model explainability**, we are looking for models that best answer the initial hypotheses and make the most sense from the business side. In addition to MSE, our evaluation function returns an **accuracy score**. In practice, we care more about whether we get the direction of returns right (that is, the sign of the returns). The accuracy metric measures how well your model captures that direction. The higher the accuracy, the better the model. 

You can find the evaluation function in **evaluation.py**.
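
As a rough illustration of the two metrics, the sketch below computes them with NumPy on a tiny made-up example. It is not the code in **evaluation.py**: the function name `sketch_evaluation`, the toy arrays, and the way zero returns are treated in the sign comparison are all assumptions made for illustration.

```python
import numpy as np

def sketch_evaluation(y_true, y_pred):
    """Hypothetical re-implementation of the two metrics; evaluation.py may differ."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # MSE_i: mean squared error over the horizons (columns) of one observation
    mse_per_obs = ((y_true - y_pred) ** 2).mean(axis=1)
    # Overall MSE: average of MSE_i over all observations
    mse = mse_per_obs.mean()
    # Directional accuracy: share of (observation, horizon) pairs with matching sign
    # (assumption: np.sign is used, so a zero only matches another zero)
    acc = (np.sign(y_true) == np.sign(y_pred)).mean()
    return {'MSE': float(mse), 'ACC': float(acc)}

# Toy data: 2 observations x 5 horizons (t = 1, 7, 30, 90, 180)
y_true = np.array([[ 1.0, -2.0, 3.0,  0.5, -1.0],
                   [-0.5,  1.5, 2.0, -3.0,  4.0]])
y_pred = np.array([[ 0.5, -1.0, 2.0, -0.5, -2.0],
                   [ 0.5,  1.0, 1.0, -2.0,  3.0]])
scores = sketch_evaluation(y_true, y_pred)
print(scores)
```

In this toy case a prediction can be close in value yet wrong in sign (or the reverse), which is why the two metrics can rank models differently.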

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import Image
import warnings
from feature_engineering import *
from evaluation import evaluation
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import GridSearchCV

In [2]:
cmg = pd.read_excel('cmg.xlsx', index_col = 'offeringId')
X_train, X_test, y_train, y_test = feature_engineering(cmg, test_frac = 0.3)
print('Train set has {} observations'.format(X_train.shape[0]))
print('Test set has {} observations'.format(X_test.shape[0]))

Train set has 6989 observations
Test set has 2996 observations


Prior to fitting any models, it is important to develop a baseline. If our model cannot beat a dummy baseline, then we might have a problem! Let's develop a dummy baseline by predicting each post-deal return to be the average of that return in the training set. 

In [3]:
# Baseline prediction: the training-set mean of each return column, repeated for every row
baseline = y_train.mean().to_numpy().reshape(1, -1)
print(evaluation(y_train.to_numpy(), np.repeat(baseline, y_train.shape[0], axis = 0)))
print(evaluation(y_test.to_numpy(), np.repeat(baseline, y_test.shape[0], axis = 0)))

{'MSE': 15.615496710489984, 'ACC': 0.6514808985548788}
{'MSE': 25.39174798757925, 'ACC': 0.6447263017356438}
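
The same mean baseline can also be obtained from scikit-learn's `DummyRegressor`. The sketch below uses synthetic arrays (`X_demo` and `y_demo` are made up and stand in for the real features and returns) and checks that the dummy model's predictions match the hand-built array of column means.

```python
import numpy as np
from sklearn.dummy import DummyRegressor

rng = np.random.default_rng(0)
# Synthetic stand-ins for the real data: 100 rows, 3 features, 5 return horizons
X_demo = rng.normal(size=(100, 3))
y_demo = rng.normal(size=(100, 5))

# strategy='mean' ignores the features and always predicts the column means of y
dummy = DummyRegressor(strategy='mean').fit(X_demo, y_demo)
dummy_pred = dummy.predict(X_demo)

# The same baseline built by hand: column means repeated for every row
manual_pred = np.repeat(y_demo.mean(axis=0).reshape(1, -1), y_demo.shape[0], axis=0)
print(np.allclose(dummy_pred, manual_pred))
```

Using an estimator object for the baseline can be convenient because it plugs into the same `fit`/`predict` workflow as the models that follow.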


<a id='ml'></a>
## 1. Machine Learning 

### Linear Regression

Let's start with the simplest regression model: linear regression. On the test set, linear regression is slightly better than our baseline in terms of MSE (23.54 versus 25.39), although its accuracy is lower (0.60 versus 0.64). 

In [4]:
lr_model = LinearRegression().fit(X_train, y_train)
print(evaluation(y_train.to_numpy(), lr_model.predict(X_train)))
print(evaluation(y_test.to_numpy(), lr_model.predict(X_test)))

{'MSE': 9.604357631932416, 'ACC': 0.6125339819716721}
{'MSE': 23.53859908896831, 'ACC': 0.5990654205607456}


### K-nearest neighbors 

The KNN algorithm uses 'feature similarity' to predict the values of new data points. This means that a new point is assigned a value based on how closely it resembles the points in the training set. The target is predicted by local interpolation of the targets associated with the nearest neighbors in the training set.

In [5]:
knn_model = KNeighborsRegressor().fit(X_train, y_train)
print('KNN on train set:\n', evaluation(y_train.to_numpy(), knn_model.predict(X_train)))
print('KNN on test set:\n', evaluation(y_test.to_numpy(), knn_model.predict(X_test)))

KNN on train set:
 {'MSE': 8.624586397283258, 'ACC': 0.7299470596652045}
KNN on test set:
 {'MSE': 23.893491749984697, 'ACC': 0.6489986648865107}
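
To make the KNN prediction rule concrete, here is a minimal sketch on synthetic data (the arrays `X_demo`, `y_demo` and `x_new` are made up). It averages the targets of the k nearest training points by hand and compares the result with `KNeighborsRegressor`, assuming the default settings of uniform weights and Euclidean distance.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))   # synthetic features
y_demo = rng.normal(size=(50, 5))   # synthetic multi-horizon targets
x_new = rng.normal(size=(1, 2))     # one new point to predict
k = 5  # KNeighborsRegressor's default n_neighbors

# By hand: Euclidean distance to every training point, then average the k closest targets
distances = np.linalg.norm(X_demo - x_new, axis=1)
nearest = np.argsort(distances)[:k]
manual_pred = y_demo[nearest].mean(axis=0)

# Same prediction from scikit-learn (defaults: uniform weights, Euclidean metric)
knn_pred = KNeighborsRegressor(n_neighbors=k).fit(X_demo, y_demo).predict(x_new)[0]
print(np.allclose(manual_pred, knn_pred))
```

Because the prediction depends on distances, KNN is sensitive to feature scale, which is one reason normalization matters for this model.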


### Tree methods 

The decision tree regressor breaks down a dataset into smaller and smaller subsets while at the same time a decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes. 

How do we make splits then? The default criterion in sklearn for measuring the quality of a split is mean squared error, which is equivalent to variance reduction as a feature selection criterion and minimizes the L2 loss using the mean of each terminal node. You can read more [here](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html).

In [6]:
tree_model = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
print('Decision tree on train set:\n', evaluation(y_train.to_numpy(), tree_model.predict(X_train)))
print('Decision tree on test set:\n', evaluation(y_test.to_numpy(), tree_model.predict(X_test)))

Decision tree on train set:
 {'MSE': 4.962948389259151, 'ACC': 0.9446272714265321}
Decision tree on test set:
 {'MSE': 26.04377182530149, 'ACC': 0.6652202937249643}
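
The split criterion described above can be sketched for a single feature as follows. The helper `best_split` is hypothetical and much simpler than scikit-learn's implementation: it tries every midpoint between consecutive sorted values and keeps the one with the lowest weighted child variance. The toy data are made up.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def best_split(x, y):
    """Return the threshold on x that minimizes the weighted MSE (variance) of the two children."""
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    best_threshold, best_score = None, np.inf
    for i in range(1, len(x_sorted)):
        if x_sorted[i] == x_sorted[i - 1]:
            continue  # no threshold can separate identical values
        left, right = y_sorted[:i], y_sorted[i:]
        # Each child predicts its own mean, so its MSE is its variance
        score = (len(left) * left.var() + len(right) * right.var()) / len(y_sorted)
        if score < best_score:
            best_threshold = (x_sorted[i - 1] + x_sorted[i]) / 2
            best_score = score
    return best_threshold, best_score

# Toy data with an obvious break between x = 3 and x = 10
x_demo = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y_demo = np.array([1.0, 1.2, 0.8, 5.0, 5.2, 4.8])
threshold, score = best_split(x_demo, y_demo)

# A depth-1 scikit-learn tree (a "stump") should pick the same threshold
stump = DecisionTreeRegressor(max_depth=1, random_state=0).fit(x_demo.reshape(-1, 1), y_demo)
print(threshold, stump.tree_.threshold[0])
```

A full tree repeats this search over every feature and then recursively inside each child node until a stopping rule such as `max_depth` is reached.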


<a id='improve'></a>
## 2. Improving model performance 

There are numerous ways to improve model performance. For example, 

- Try different models 
- Feature engineering 
- Regularization
- Hyperparameter tuning through cross validation
...

We will provide code examples for **hyperparameter tuning through cross validation** in the cells below.

You might notice that, with the decision tree regressor, the model performs much better on the train set than on the test set. This problem is called **overfitting** and is common in practice. Overfitting happens when the model has learned the train set so closely that it no longer generalizes to other data. The figure below is a great illustration of underfitting and overfitting. 

How do we know we have overfit the train data? An alarming signal is high train performance but low test performance. 

<img src="fig/overfit.png" width=600 height=400 />

How do we solve overfitting? If we cannot get a bigger and richer data set, we could make the model 'simpler', that is, to decrease the complexity of the model. Using decision tree regressor as an example, we could decrease the max depth of the tree so that we get a smaller tree by stopping early. 
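
The effect of tree depth on overfitting can be sketched on synthetic data. The noisy sine curve below is an assumption chosen for illustration and has nothing to do with the cmg data. A fully grown tree memorizes the train set, so its train MSE falls to zero while its test MSE typically stays well above it.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
# Synthetic data: a noisy sine curve (not the cmg data)
X_demo = rng.uniform(0, 6, size=(300, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.5, size=300)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

results = {}
for depth in [2, 5, None]:  # None lets the tree grow until every leaf is pure
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    train_mse = mean_squared_error(y_tr, tree.predict(X_tr))
    test_mse = mean_squared_error(y_te, tree.predict(X_te))
    results[depth] = (train_mse, test_mse)
    print(f'max_depth={depth}: train MSE={train_mse:.3f}, test MSE={test_mse:.3f}')
```

Comparing the train and test columns across depths gives a quick read on where the model starts fitting noise rather than signal.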

But then you might ask: how do I decide the optimal depth? The answer is **cross-validation**. Say we are doing k-fold cross-validation. We split the dataset into k groups. Each group is used as the test set once, while a model with the specified hyperparameters is fitted on the remaining k-1 groups; the model performance is then evaluated on that test set. **The aim of cross-validation is not to fit a bunch of models, but to find the best hyperparameters**. 
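
Written out as an explicit loop, k-fold cross-validation looks roughly like the sketch below. The data are synthetic and the candidate depths (2 and 10) are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))                        # synthetic features
y_demo = X_demo[:, 0] + rng.normal(scale=0.5, size=200)  # synthetic target

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
cv_mse = {}
for depth in [2, 10]:
    fold_scores = []
    for train_idx, val_idx in kfold.split(X_demo):
        # Fit on k-1 folds, evaluate on the held-out fold
        tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
        tree.fit(X_demo[train_idx], y_demo[train_idx])
        fold_scores.append(mean_squared_error(y_demo[val_idx], tree.predict(X_demo[val_idx])))
    cv_mse[depth] = np.mean(fold_scores)

# Pick the hyperparameter value with the lowest average held-out MSE
best_depth = min(cv_mse, key=cv_mse.get)
print(cv_mse, best_depth)
```

`GridSearchCV` automates this loop and, with `refit = True`, refits the best configuration on the whole training set afterwards.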

<img src="fig/cv.png" width=600 height=400 />

The code below is an example of 5-fold cross-validation, as illustrated in the figure above. 
Here, instead of normalizing the entire training set using the feature_engineering function we created in Section 4, we wrap normalization and the model into a single pipeline and pass it to GridSearchCV. The reason is that each fold (group) will now be used as a test set in cross-validation, so it should be normalized based on the mean and variance of the other 4 folds, rather than the mean and variance of the entire training set. 

In our example, the hyperparameter that we are trying to tune is max depth of a decision tree; the values we are trying out are 10, 50 and 100. 
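
The fold-wise normalization described above can be illustrated in isolation. In this sketch (synthetic data, a single made-up feature), a `StandardScaler` is fitted on the training part of each fold only, which is what a pipeline does inside cross-validation; the resulting means differ from the mean of the full training set.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(loc=10.0, scale=3.0, size=(100, 1))  # synthetic feature

# Mean that would be used if we normalized the whole training set up front
full_mean = StandardScaler().fit(X_demo).mean_[0]

fold_means = []
for train_idx, val_idx in KFold(n_splits=5).split(X_demo):
    # Inside a pipeline, the scaler only ever sees the training folds
    scaler = StandardScaler().fit(X_demo[train_idx])
    fold_means.append(scaler.mean_[0])
    # The held-out fold is transformed with those training-fold statistics
    X_val_scaled = scaler.transform(X_demo[val_idx])

print(full_mean, fold_means)
```

Normalizing up front would let information from the held-out fold leak into the statistics used to scale the training folds, which can make cross-validation scores look better than they should.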

In [7]:
X_train, X_test, y_train, y_test = feature_engineering(cmg, normalize = False)

In [8]:
# preprocessor 
numerical_cols = list(X_train.select_dtypes(include=np.number))
categorical_cols = [col for col in list(X_train) if col not in numerical_cols]
numerical_transformer = StandardScaler()
categorical_transformer = OneHotEncoder(handle_unknown='ignore')
preprocessor = ColumnTransformer(
        transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)])
# model 
model = DecisionTreeRegressor(random_state=0)

In [9]:
# putting everything together 
pipe = Pipeline(steps=[("pre", preprocessor), ("model", model)])
param_grid = {
    'model__max_depth':[10, 50, 100]}
tree_model_cv = GridSearchCV(pipe, param_grid, n_jobs=1, scoring='neg_mean_squared_error', cv = 5, refit = True)
tree_model_cv.fit(X_train, y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('pre',
                                        ColumnTransformer(transformers=[('num',
                                                                         StandardScaler(),
                                                                         ['offeringDiscountToLastTrade',
                                                                          'offeringPrice',
                                                                          'pre15_Price_Normalized',
                                                                          'pre14_Price_Normalized',
                                                                          'pre13_Price_Normalized',
                                                                          'pre12_Price_Normalized',
                                                                          'pre11_Price_Normalized',
                                                                   

Let's print out the results from cross-validation. As explained above, since we are doing 5-fold cross-validation, the train set is split into 5 groups. For each hyperparameter combination (here, we are tuning a single hyperparameter, max_depth), each group is used as the test set for a model with the specified hyperparameter combination fitted on the other four groups. We then get a mean test score for each hyperparameter combination by averaging the test scores from the five groups. Because the scoring function is the negative MSE, a higher (less negative) score is better. We would choose max_depth = 10 since it has the highest mean test score, that is, the lowest mean MSE, in cross-validation. 

In [10]:
pd.DataFrame(tree_model_cv.cv_results_)

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_model__max_depth | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.079575 | 0.004995 | 0.006127 | 0.000575 | 10 | {'model__max_depth': 10} | -37.313538 | -28.952039 | -10.745832 | -8.398568 | -19.315139 | -20.945023 | 10.919104 | 1 |
| 1 | 0.195951 | 0.009761 | 0.006299 | 0.000529 | 50 | {'model__max_depth': 50} | -38.773212 | -29.021863 | -12.560251 | -9.632726 | -19.919786 | -21.981567 | 10.742729 | 3 |
| 2 | 0.192644 | 0.0141 | 0.006214 | 0.000746 | 100 | {'model__max_depth': 100} | -38.519847 | -27.760295 | -13.133614 | -9.63151 | -18.821158 | -21.573285 | 10.458234 | 2 |
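
Since the scores are negated MSEs, it can help to flip the sign before reading the table. The sketch below does this on synthetic data with a plain decision tree (no pipeline) and an arbitrary two-value grid; it also reads the winning setting from `best_params_` and `best_score_`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))                        # synthetic features
y_demo = X_demo[:, 0] + rng.normal(scale=0.5, size=200)  # synthetic target

search = GridSearchCV(DecisionTreeRegressor(random_state=0),
                      {'max_depth': [2, 10]},
                      scoring='neg_mean_squared_error', cv=5)
search.fit(X_demo, y_demo)

# Scores are negated MSEs, so flip the sign to read them as MSE
summary = pd.DataFrame(search.cv_results_)[['param_max_depth', 'mean_test_score', 'rank_test_score']]
summary['mean_cv_mse'] = -summary['mean_test_score']
print(summary)
print(search.best_params_, -search.best_score_)
```

The row with `rank_test_score` equal to 1 is the one with the smallest `mean_cv_mse`.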


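After tuning, our decision tree model performs much better on the test set in terms of MSE (about 14.74, versus 26.04 for the untuned tree), and the gap between train and test MSE is much smaller. 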

In [11]:
print('Decision tree tuned on train set:\n', evaluation(y_train.to_numpy(), tree_model_cv.predict(X_train)))
print('Decision tree tuned on test set:\n', evaluation(y_test.to_numpy(), tree_model_cv.predict(X_test)))

Decision tree tuned on train set:
 {'MSE': 10.917166645380743, 'ACC': 0.6733600400601004}
Decision tree tuned on test set:
 {'MSE': 14.739607806319494, 'ACC': 0.6636955433149685}
