# Section 5. Modeling 

### CONTENTS
* <a href='04- DSC 2022 Feature Engineering .ipynb#top'>**Section 4. Feature Engineering**</a> 
* <a href='05- DSC 2022 Modeling .ipynb#top'>**Section 5. Modeling**</a>
  * [1. Machine learning](#ml)
  * [2. Improving machine learning](#improve)
  * [3. Model explainability](#improve)

In this notebook, we show examples of building machine learning models, improving model performance through hyperparameter tuning, and explaining what a model has learned.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import Image
import warnings
from feature_engineering import feature_engineering
from evaluation import evaluation

In [2]:
cmg = pd.read_excel('cmg.xlsx', index_col = 'offeringId')
X_train, X_test, y_train, y_test = feature_engineering(cmg)

The metric we will measure you on is __[Mean Absolute Error](https://en.wikipedia.org/wiki/Mean_absolute_error)__ (__MAE__). The equation is:
\begin{equation*}
MAE = \frac{\sum_{i=1}^{N}|y_{true,i}-y_{pred,i}|}{N}
\end{equation*}
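
As a quick check of the formula, the sketch below computes MAE by hand with NumPy. The function name `mae` and the toy arrays are made up for illustration and are not part of the competition's `evaluation` module; `sklearn.metrics.mean_absolute_error` should give the same number.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error: the average of |y_true_i - y_pred_i| over all entries."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

# Toy values, made up for illustration only
y_true_toy = np.array([1.0, 2.0, 3.0, 4.0])
y_pred_toy = np.array([1.5, 2.0, 2.0, 5.0])

# Absolute errors are 0.5, 0.0, 1.0, 1.0, so the mean is 2.5 / 4
mae_toy = mae(y_true_toy, y_pred_toy)
print(mae_toy)  # 0.625
```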

Prior to fitting any models, it is important to develop a baseline. The baseline below predicts the mean target value for every observation.

In [3]:
print(evaluation(y_train.to_numpy(), np.repeat(y_train.mean().to_numpy().reshape(1, -1), y_train.shape[0], axis = 0)))
print(evaluation(y_test.to_numpy(), np.repeat(y_test.mean().to_numpy().reshape(1, -1), y_test.shape[0], axis = 0)))

{'MAE': 0.8099266522052686, 'ACC': 0.6208128804241142}
{'MAE': 0.9431166922629328, 'ACC': 0.6050058892815046}
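
The cells above score each split against its own mean. A baseline that could be used at prediction time would typically learn the mean from the training targets only. The sketch below shows one way to do that with scikit-learn's `DummyRegressor`; the synthetic arrays `X_tr`, `X_te`, `y_tr`, `y_te` are stand-ins assumed for illustration, not the notebook's data.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
# Synthetic stand-ins for features and (two-column) targets; an assumption for this sketch
X_tr, X_te = rng.normal(size=(80, 3)), rng.normal(size=(20, 3))
y_tr, y_te = rng.normal(size=(80, 2)), rng.normal(size=(20, 2))

# The dummy model ignores the features and always predicts the training mean
baseline = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
pred_te = baseline.predict(X_te)

baseline_mae = mean_absolute_error(y_te, pred_te)
print("Baseline test MAE:", baseline_mae)
```

Any model worth keeping should beat this number on the test set.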


<a id='ml'></a>
## 1. Machine Learning 

In the previous time series example, we used only the time series features in the data set. The data set contains additional information, so let's make use of it.

### Linear Regression

The linear regression model turns out to be worse than our naive baseline: its test MAE is about 1.02, compared with about 0.94 for the baseline.

In [4]:
from sklearn.linear_model import LinearRegression
lr_model = LinearRegression().fit(X_train, y_train)

print(evaluation(y_train.to_numpy(), lr_model.predict(X_train)))
print(evaluation(y_test.to_numpy(), lr_model.predict(X_test)))

{'MAE': 0.863177581246797, 'ACC': 0.5705870803063041}
{'MAE': 1.0169777863399454, 'ACC': 0.5643698468786792}


### K-nearest neighbors 

In [6]:
from sklearn.neighbors import KNeighborsRegressor
knn_model = KNeighborsRegressor().fit(X_train, y_train)
print('KNN on train set:\n', evaluation(y_train.to_numpy(), knn_model.predict(X_train)))
print('KNN on test set:\n', evaluation(y_test.to_numpy(), knn_model.predict(X_test)))

KNN on train set:
 {'MAE': 0.6188757294856497, 'ACC': 0.6882387590810978}
KNN on test set:
 {'MAE': 0.8255116504504054, 'ACC': 0.5836866902237886}
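
To make the KNN prediction rule concrete, here is a minimal sketch on synthetic data (the arrays and the choice `k = 5` are assumptions for illustration). With scikit-learn's default settings (uniform weights, Euclidean distance), the prediction for a query point is the average target of its `k` nearest training points.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
# Synthetic data, made up for this sketch
X_toy = rng.normal(size=(50, 2))
y_toy = X_toy[:, 0] + rng.normal(scale=0.1, size=50)
query = np.array([[0.0, 0.0]])

k = 5
knn = KNeighborsRegressor(n_neighbors=k).fit(X_toy, y_toy)
sk_pred = knn.predict(query)[0]

# Manual version: distance to every training point, then average the k closest targets
dist = np.linalg.norm(X_toy - query, axis=1)
nearest = np.argsort(dist)[:k]
manual_pred = y_toy[nearest].mean()

print(sk_pred, manual_pred)
```

Because the rule depends on distances, KNN is sensitive to feature scale, which is one reason to normalize features before fitting it.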


### Tree methods 

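After applying the tree methods, we discover that an unconstrained decision tree overfits: the train MAE is about 0.32, while the test MAE is about 0.84.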

In [8]:
from sklearn.tree import DecisionTreeRegressor
tree_model = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
print('Decision tree on train set:\n', evaluation(y_train.to_numpy(), tree_model.predict(X_train)))
print('Decision tree on test set:\n', evaluation(y_test.to_numpy(), tree_model.predict(X_test)))

Decision tree on train set:
 {'MAE': 0.3249011513490006, 'ACC': 0.9357549577851982}
Decision tree on test set:
 {'MAE': 0.8369329443934509, 'ACC': 0.5494699646643096}
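
The gap between train and test error can be reproduced on synthetic data. The sketch below is illustrative only (the data and the depth values are assumptions): it fits trees of increasing depth and records train and test MAE for each.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
# Noisy sine curve, made up for this sketch
X_syn = rng.uniform(-3, 3, size=(400, 1))
y_syn = np.sin(X_syn[:, 0]) + rng.normal(scale=0.5, size=400)
X_a, X_b, y_a, y_b = train_test_split(X_syn, y_syn, test_size=0.25, random_state=0)

gaps = {}
for depth in [2, 5, None]:  # None lets the tree grow until every leaf is pure
    t = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X_a, y_a)
    train_mae = mean_absolute_error(y_a, t.predict(X_a))
    test_mae = mean_absolute_error(y_b, t.predict(X_b))
    gaps[depth] = (train_mae, test_mae)
    print(f"max_depth={depth}: train MAE={train_mae:.3f}, test MAE={test_mae:.3f}")
```

The fully grown tree memorizes the training noise, so its train MAE drops to zero while its test MAE does not.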


<a id='improve'></a>
## 2. Improving model performance 

- Different models 
- Feature engineering 
- Regularization
- Hyperparameter tuning through cross validation
...

<img src="fig/overfit.png" width=600 height=400 />

<img src="fig/cv.png" width=600 height=400 />
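
As a minimal sketch of k-fold cross validation, the block below scores one fixed model on five folds of synthetic data. The data and the `max_depth=4` setting are assumptions for illustration. Note that scikit-learn reports `neg_mean_absolute_error`, so scores are negative and the sign is flipped to recover MAE.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
# Synthetic regression data, made up for this sketch
X_syn = rng.normal(size=(200, 4))
y_syn = X_syn[:, 0] - 2 * X_syn[:, 1] + rng.normal(scale=0.3, size=200)

# Each of the 5 folds is held out once while the model trains on the other 4
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(
    DecisionTreeRegressor(max_depth=4, random_state=0),
    X_syn,
    y_syn,
    scoring="neg_mean_absolute_error",
    cv=cv,
)

cv_mae = -scores.mean()
print("Per-fold MAE:", -scores)
print("Mean CV MAE:", cv_mae)
```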

In [9]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import GridSearchCV

In [10]:
X_train, X_test, y_train, y_test = feature_engineering(cmg, normalize = False)

In [11]:
# preprocessor 
numerical_cols = list(X_train.select_dtypes(include=np.number))
categorical_cols = [col for col in list(X_train) if col not in numerical_cols]
numerical_transformer = StandardScaler()
categorical_transformer = OneHotEncoder(handle_unknown='ignore')
preprocessor = ColumnTransformer(
        transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)])
# model 
model = DecisionTreeRegressor(random_state=0)

In [12]:
# putting everything together 
pipe = Pipeline(steps=[("pre", preprocessor), ("model", model)])
param_grid = {
    'model__max_depth':[10, 50, 100], 
    'model__min_samples_leaf': [2,8,32]}
tree_model_cv = GridSearchCV(pipe, param_grid,n_jobs=1, scoring='neg_mean_absolute_error', cv = 5, refit = True)
tree_model_cv.fit(X_train, y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('pre',
                                        ColumnTransformer(transformers=[('num',
                                                                         StandardScaler(),
                                                                         ['offeringDiscountToLastTrade',
                                                                          'offeringPrice',
                                                                          'pre15_Price_Normalized',
                                                                          'pre14_Price_Normalized',
                                                                          'pre13_Price_Normalized',
                                                                          'pre12_Price_Normalized',
                                                                          'pre11_Price_Normalized',
                                                                   

In [13]:
pd.DataFrame(tree_model_cv.cv_results_)

| | `mean_fit_time` | `std_fit_time` | `mean_score_time` | `std_score_time` | `param_model__max_depth` | `param_model__min_samples_leaf` | `params` | `split0_test_score` | `split1_test_score` | `split2_test_score` | `split3_test_score` | `split4_test_score` | `mean_test_score` | `std_test_score` | `rank_test_score` |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.055397 | 0.005847 | 0.005061 | 0.000443 | 10 | 2 | `{'model__max_depth': 10, 'model__min_samples_l...` | -0.717843 | -0.599893 | -0.842228 | -0.876997 | -0.740543 | -0.755501 | 0.098121 | 2 |
| 1 | 0.054909 | 0.00293 | 0.005634 | 0.000368 | 10 | 8 | `{'model__max_depth': 10, 'model__min_samples_l...` | -0.827358 | -0.568586 | -0.733282 | -0.879687 | -0.740199 | -0.749822 | 0.105915 | 1 |
| 2 | 0.051647 | 0.000964 | 0.005323 | 0.000177 | 10 | 32 | `{'model__max_depth': 10, 'model__min_samples_l...` | -0.788514 | -0.676042 | -0.777624 | -0.915088 | -0.765418 | -0.784537 | 0.076477 | 5 |
| 3 | 0.126682 | 0.008508 | 0.004972 | 0.00026 | 50 | 2 | `{'model__max_depth': 50, 'model__min_samples_l...` | -0.786142 | -0.656231 | -0.896415 | -0.975363 | -0.81137 | -0.825104 | 0.107599 | 8 |
| 4 | 0.101621 | 0.018593 | 0.005569 | 0.000742 | 50 | 8 | `{'model__max_depth': 50, 'model__min_samples_l...` | -0.844816 | -0.593713 | -0.762839 | -0.921025 | -0.770291 | -0.778537 | 0.108784 | 3 |
| 5 | 0.064107 | 0.00862 | 0.004589 | 5.1e-05 | 50 | 32 | `{'model__max_depth': 50, 'model__min_samples_l...` | -0.788743 | -0.679221 | -0.780706 | -0.922504 | -0.771453 | -0.788525 | 0.077782 | 6 |
| 6 | 0.121205 | 0.008119 | 0.004983 | 0.000115 | 100 | 2 | `{'model__max_depth': 100, 'model__min_samples_...` | -0.786142 | -0.657505 | -0.920204 | -0.971518 | -0.816655 | -0.830405 | 0.109562 | 9 |
| 7 | 0.088713 | 0.007724 | 0.004693 | 0.000139 | 100 | 8 | `{'model__max_depth': 100, 'model__min_samples_...` | -0.844816 | -0.593713 | -0.762839 | -0.921025 | -0.770291 | -0.778537 | 0.108784 | 3 |
| 8 | 0.063598 | 0.008925 | 0.00476 | 0.000204 | 100 | 32 | `{'model__max_depth': 100, 'model__min_samples_...` | -0.788743 | -0.679221 | -0.780706 | -0.922504 | -0.771453 | -0.788525 | 0.077782 | 6 |
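
In the results above, the best-ranked combination is `max_depth=10` with `min_samples_leaf=8` (mean score -0.749822, i.e. a cross-validated MAE of about 0.75). Rather than reading the full table, the key columns can be pulled out and sorted. The sketch below does this on a small synthetic problem; the data, the grid, and the `summary` and `mean_mae` names are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
# Synthetic data, made up for this sketch
X_syn = rng.normal(size=(150, 3))
y_syn = X_syn[:, 0] ** 2 + rng.normal(scale=0.2, size=150)

search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    {"max_depth": [2, 4, 8], "min_samples_leaf": [2, 8]},
    scoring="neg_mean_absolute_error",
    cv=5,
).fit(X_syn, y_syn)

results = pd.DataFrame(search.cv_results_)
# Keep the informative columns, flip the sign to get MAE, best candidate first
summary = (
    results[["param_max_depth", "param_min_samples_leaf",
             "mean_test_score", "std_test_score", "rank_test_score"]]
    .assign(mean_mae=lambda d: -d["mean_test_score"])
    .sort_values("rank_test_score")
)
print(summary)
print("Best parameters:", search.best_params_)
print("Best CV MAE:", -search.best_score_)
```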


In [14]:
print('Decision tree tuned on train set:\n', evaluation(y_train.to_numpy(), tree_model_cv.predict(X_train)))
print('Decision tree tuned on test set:\n', evaluation(y_test.to_numpy(), tree_model_cv.predict(X_test)))

Decision tree tuned on train set:
 {'MAE': 0.6774857126774676, 'ACC': 0.6450814843903426}
Decision tree tuned on test set:
 {'MAE': 0.8163092159468363, 'ACC': 0.6262661955241434}


Recall that additional market data can be included, namely the price of SPY from 2013 onwards.

## 3. Model Explainability 

In [84]:
from sklearn import tree
# Drawing every node of the unpruned tree is very slow (the original run was
# interrupted before the figure rendered), so plot only the top levels.
fig = plt.figure(figsize=(25,20))
_ = tree.plot_tree(tree_model, max_depth=3, filled=True)
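
When a full tree plot is too large to read, two lighter-weight views may help: a text dump of a shallow tree and a ranking of feature importances. The sketch below uses synthetic data with hypothetical feature names (`signal`, `weak`, `noise`); it is not fitted on the notebook's data. It compares the tree's built-in impurity-based importances with permutation importance scored by MAE.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(0)
# Hypothetical features: one strong driver, one weak driver, one pure noise column
feature_names = ["signal", "weak", "noise"]
X_syn = rng.normal(size=(300, 3))
y_syn = 3 * X_syn[:, 0] + 0.5 * X_syn[:, 1] + rng.normal(scale=0.1, size=300)

small_tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_syn, y_syn)

# Text rendering of the split rules; readable even without a figure
print(export_text(small_tree, feature_names=feature_names))

# Impurity-based importances (these sum to 1 across features)
impurity_imp = dict(zip(feature_names, small_tree.feature_importances_))

# Permutation importance: how much MAE worsens when one column is shuffled
perm = permutation_importance(
    small_tree, X_syn, y_syn,
    scoring="neg_mean_absolute_error", n_repeats=5, random_state=0,
)
perm_imp = dict(zip(feature_names, perm.importances_mean))

print(impurity_imp)
print(perm_imp)
```

Permutation importance is usually computed on held-out data; the training data is reused here only to keep the sketch short.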