# **Tuning Hyperparameters**

## Objectives

* We will tune the hyperparameters of for Logistic Regression, Ada Boost, and Gradient Boosting models.

## Inputs

* Training and Testing data sets from notebook 04.
* Insights developed in the previous notebook.

## Outputs

* We will have saved models with tuned hyper parameters at the end of this notebook.

## Additional Comments

* In case you have any additional comments that don't fit in the previous bullets, please state them here. 


---

# Change working directory
We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [1]:
import os

home_dir = '/workspace/pp5-ml-dashboard'
os.chdir(home_dir)
current_dir = os.getcwd()
print(current_dir)

/workspace/pp5-ml-dashboard


We now load our training and test sets, as well as some of the packages that we will be using.

In [2]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
from src.utils import get_df, save_df

train_dir = 'datasets/train/csv'
X_TrainSet = get_df('X_TrainSet',train_dir)
Y_TrainSet = get_df('Y_TrainSet',train_dir)

test_dir = 'datasets/test/csv'
X_TestSet = get_df('X_TestSet',test_dir)
Y_TestSet = get_df('Y_TestSet',test_dir)

## Section 1: Pipeline and Grid Search set up
We recall the code for building our pipelines and the grid search that we performed in the last notebook. Note that some of the constants have changed. 

We have modified the pipeline to see how feateure selection impacts the performance. Note that setting `thresh=1` essentially removes the `'corr_selector'` step from the pipeline. We will eventually remove this step from the pipeline once we have selected a value for `thresh`.

In [3]:
from sklearn.preprocessing import StandardScaler
from feature_engine import transformation as vt
from feature_engine.selection import DropFeatures, SmartCorrelatedSelection
from sklearn.pipeline import Pipeline


# Constants needed for feature engineering
TO_DROP = ['ftm_away', 'plus_minus_home', 'fg3m_away', 'pts_away', 'play_off',
           'fgm_away', 'pts_home', 'fg3m_home', 'ftm_home', 'fgm_home',
           'season']
THRESH = 0.6
TRANSFORMS = {'box_cox':(vt.BoxCoxTransformer,False),
              'yeo_johnson':(vt.YeoJohnsonTransformer,False)}
TRANSFORM_ASSIGNMENTS = {
    'yeo_johnson': ['dreb_away', 'blk_home', 'oreb_away', 'fta_away',
                    'dreb_home', 'ast_home', 'stl_away', 'stl_home',
                    'reb_away', 'oreb_home', 'pf_away', 'pf_home'],
    'box_cox': ['ast_away', 'fta_home']
                            }


def base_pipeline(thresh=THRESH):
    pipeline = Pipeline([
        ('dropper', DropFeatures(features_to_drop=TO_DROP)),
        ('corr_selector', SmartCorrelatedSelection(method="pearson",
                                                   threshold=thresh,
                                                   selection_method="variance")
                                                   )
                        ])
    return pipeline

    
def add_transformations(pipeline, transform_assignments):
    # This needs to be called after the above is fit so that the correlation selector has that attr
    dropping = pipeline['corr_selector'].features_to_drop_
    
    new_assignments = { key: [val for val in value if val not in dropping] 
                       for key,value in transform_assignments.items()}
    for transform, targets in new_assignments.items():
        if not targets:
            continue
        pipeline.steps.append(
            (transform, TRANSFORMS[transform][0](variables=targets))
            )
    pipeline.steps.append(('scaler', StandardScaler()))
    return pipeline

In [4]:
from sklearn.feature_selection import SelectFromModel

# ML algorithms
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier,AdaBoostClassifier


MODELS = {
    'LogisticRegression': LogisticRegression,
    'GradientBoosting': GradientBoostingClassifier,
    'AdaBoost': AdaBoostClassifier,
}

def create_pipe(model_name, feature_selection=True, random_state=42):
    model = MODELS[model_name](random_state=random_state)
    base_pipe = base_pipeline()
    base_pipe.fit(X_TrainSet)
    pipe= add_transformations(base_pipe,TRANSFORM_ASSIGNMENTS)
    if feature_selection:
        pipe.steps.append(("feat_selection", SelectFromModel(model)))
    pipe.steps.append(('model',model))
    pipe.model_type = model_name
    pipe.name = model_name
    return pipe


Next, we have the code for our grid search. As we will be treating `thresh` as a hyperparameter, it will be slightly different.

In [5]:
from sklearn.model_selection import GridSearchCV
# to suppress warnings, I think
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
import logging
logging.captureWarnings(True)
os.environ['PYTHONWARNINGS']='ignore'


def grid_search(X_train, y_train,pipe,param_grid={},verbosity=1):
    print(f"### Beginning grid search for {pipe.name} ###") 
    grid=GridSearchCV(estimator=pipe,
                    param_grid=param_grid,
                    cv=5,
                    n_jobs=-2,
                    verbose=verbosity,
                    scoring=['accuracy','precision'],
                    refit='precision')
    grid.fit(X_train,y_train)
    return grid


## Section 2: Selecting Hyperparameters
We will look at the different possible hyperparameters for each model.

### Subsection 2.1: Logistic Regression
Logistic Regression models have many hyperparameters. We will focus on:
* `penalty` is a regularization parameter.
* `solver` specifies the type of algorithm used.
* `C` controls the strength of the penalty.

Not all penalties work for each solver.

These will be our initial choice for hyperparameters. They will help us narrow down our search and find other ranges to test.

In [6]:
thresholds = [round(0.1*i,2) for i in range(5,11)]
C = [10**(2*i+1) for i in range(-2,2)]
solver = ['newton-cg', 'newton-cholesky', 'lbfgs', 'liblinear', 'sag', 'saga']
penalty = ['none', 'l1', 'l2', 'elasticnet']
param_grid = [
            {'C':C,
             'solver':['lbfgs','newton-cg','newton-cholesky','sag'],
             'penalty':['l2',None]},
            {'C':C,
             'solver':['liblinear'],
             'penalty':['l1','l2']},
            {'C':C,
             'solver':['saga'],
             'penalty':['l1','l2',None,'elasticnet']}
              ]
logistic_param_grid = [{'model__'+key:value 
                                for key,value in param_dict.items()}
                                for param_dict in param_grid]
for params in logistic_param_grid:
    params['corr_selector__threshold']=thresholds


Now we are ready to do the grid search. We expect this to go well since Logistic Regression was our best performing model without any tuning. This initial training will help us establish a range to further tune the hyperparamters in.

In [62]:
from src.utils import save_df, get_df

try:
    logistic_grids_w_select_df_v1 = get_df('logistic_grids_w_select_results_v1',
                                        'experiment_results/tuning/grids')
except FileNotFoundError:
    logistic_pipe_w_select = create_pipe('LogisticRegression', feature_selection=True)
    logistic_grids_w_select = grid_search(X_TrainSet, Y_TrainSet, logistic_pipe_w_select, param_grid=logistic_param_grid, verbosity=4)
    
    logistic_grids_w_select_df_v1 = pd.DataFrame(logistic_grids_w_select.cv_results_)
    save_df(logistic_grids_w_select_df_v1,
        'logistic_grids_w_select_results_v1',
        'experiment_results/tuning/grids')


### Beginning grid search for LogisticRegression ###
Fitting 5 folds for each of 350 candidates, totalling 1750 fits
[CV 2/5] END corr_selector__threshold=0.78, model__C=0.001, model__penalty=None, model__solver=lbfgs; accuracy: (test=0.866) precision: (test=0.879) total time=   2.1s
[CV 4/5] END corr_selector__threshold=0.78, model__C=0.001, model__penalty=None, model__solver=lbfgs; accuracy: (test=0.874) precision: (test=0.891) total time=   2.2s
[CV 1/5] END corr_selector__threshold=0.78, model__C=0.001, model__penalty=None, model__solver=lbfgs; accuracy: (test=0.870) precision: (test=0.885) total time=   5.7s
[CV 5/5] END corr_selector__threshold=0.78, model__C=0.001, model__penalty=None, model__solver=lbfgs; accuracy: (test=0.878) precision: (test=0.895) total time=   5.4s
[CV 4/5] END corr_selector__threshold=0.78, model__C=0.001, model__penalty=l2, model__solver=saga; accuracy: (test=0.869) precision: (test=0.868) total time=   6.6s
[CV 3/5] END corr_selector__threshold=0.78, mo


Let's look at what the best choices are at this stage.

In [12]:
from src.model_eval import get_best_params_df
import ast


def get_best_params_df(df):
    res = (df.sort_values(by=['mean_test_precision','mean_test_accuracy'],ascending=False)
           .filter(['params','mean_test_precision','mean_test_accuracy'])
           .values)
    intro = f"Best parameters for current model:"
    print(intro)
    for i in range(3):
        params = ast.literal_eval(res[i][0])
        for key,value in params.items():
            print(f"{key.split('__')[1]}: {value}")
        print(f"Avg. Precision: {res[0][1]*100}%.")
        print(f"Avg. Accuracy: {res[0][2]*100}%.")
        print()

get_best_params_df(logistic_grids_w_select_df)

Best parameters for current model:
threshold: 0.8
C: 0.001
penalty: None
solver: lbfgs
Avg. Precision: 88.70480236491775%.
Avg. Accuracy: 87.18030112459309%.

threshold: 0.8
C: 0.001
penalty: None
solver: newton-cg
Avg. Precision: 88.70480236491775%.
Avg. Accuracy: 87.18030112459309%.

threshold: 0.8
C: 0.001
penalty: None
solver: newton-cholesky
Avg. Precision: 88.70480236491775%.
Avg. Accuracy: 87.18030112459309%.



Clearly, the best correlation threshold is 0.8. The models have the same scores. Lets collect all of the parameters that have the same scores.

In [41]:
#from src.model_eval import collect_like_estimators
import math
def collect_like_estimators(results_df):
  grouped_dict = {}
  for _, result in results_df.iterrows():
    p_score = result['mean_test_precision']
    a_score = result['mean_test_accuracy']
    if math.isnan(p_score):
      p_score = -1
    if math.isnan(a_score):
      a_score = -1
    score = (p_score,a_score)
    if score in grouped_dict.keys():
      grouped_dict[score].append(result.to_dict())
    else:
      grouped_dict[score] = [result.to_dict()]
  return grouped_dict

grouped_estimator_params = collect_like_estimators(logistic_grids_w_select_df)
grouped_estimator_params.keys()

def parameter_stats(grouped_params):
  most = (0,0,0)
  maxim = (0,0)
  max_p = 0
  max_a = 0
  for key,value in grouped_params.items():
    if key[0]>max_p:
      max_p = key[0]
    if key[1]>max_a:
      max_a = key[1]
    if len(value)>most[2]:
      most = (key[0],key[1],len(value))
    if key[0]>maxim[0]:
      maxim = key
    elif key[0]==maxim[0] and key[1]>maxim[1]:
      maxim = key
  print('most',most)
  print('max',maxim,len(grouped_params[maxim]))
  print(f"{max_p=}, {max_a=}")
  return maxim

best_score = parameter_stats(grouped_estimator_params)


most (-1, -1, 76)
max (0.8870480236491776, 0.8718030112459308) 33
max_p=0.8870480236491776, max_a=0.871917694255368


The most common scores are `nan` which is strange
# attention, why is that
33 different choices of parameters had the best performance. We would like to see what these estimators had in common and look at neighborhoods around these parameters to see if we can improve the performance before moving on to the next model.

In [53]:
best_params = grouped_estimator_params[best_score]
param_count = {}
for param in best_params:
    param_dict = ast.literal_eval(param['params'])
    for key,value in param_dict.items():
        if (key,value) in param_count:
            param_count[(key,value)]+=1
        else:
            param_count[(key,value)]=1
for key,value in param_count.items():
    print(key,value)

('corr_selector__threshold', 0.8) 33
('model__C', 0.001) 5
('model__penalty', None) 20
('model__solver', 'lbfgs') 6
('model__solver', 'newton-cg') 6
('model__solver', 'newton-cholesky') 6
('model__solver', 'sag') 6
('model__C', 0.1) 5
('model__C', 10) 11
('model__penalty', 'l2') 11
('model__C', 1000) 12
('model__solver', 'liblinear') 1
('model__solver', 'saga') 8
('model__penalty', 'l1') 2


We will make the following modifications to `logistic_param_grid`:
* remove `'liblinear'` as only one model used it,
* pick a neighborhood around 0.8 for correlation threshold,
* focus on penalties `'l2'` and `'None'`

Since we are decreasing the number of combinations, we will redo the analysis with an enlarged value for `C`.

In [57]:
thresholds = [0.8+i*0.01 for i in range(-2,3)]
C = [10**i for i in range(-3,4)]
logistic_param_grid=[{'model__C': C,
  'model__solver': ['lbfgs', 'newton-cg', 'newton-cholesky', 'sag','saga'],
  'model__penalty': ['l2', None],
  'corr_selector__threshold': thresholds}]

In [58]:
try:
    logistic_grids_w_select_df_v2 = get_df('logistic_grids_w_select_results_v2',
                                        'experiment_results/tuning/grids')
except FileNotFoundError:
    logistic_pipe_w_select = create_pipe('LogisticRegression', 
                                         feature_selection=True)
    logistic_grids_w_select = grid_search(X_TrainSet, Y_TrainSet, 
                                          logistic_pipe_w_select, 
                                          param_grid=logistic_param_grid, 
                                          verbosity=4)
    
    logistic_grids_w_select_df_v2 = pd.DataFrame(logistic_grids_w_select.cv_results_)
    save_df(logistic_grids_w_select_df_v2,
        'logistic_grids_w_select_results_v2',
        'experiment_results/tuning/grids')


In [76]:
from src.model_eval import get_best_params

get_best_params()

ImportError: cannot import name 'get_best_params' from 'src.model_eval' (/workspace/pp5-ml-dashboard/src/model_eval.py)

Clearly, the correlation threshold of 0.8 is best. We will examine a small neighborhood of this further. For `C`, the best range seems to be between 0.01 and 0.001. We will also look at a neighborhood of this. Don't forget that the above is with the feature selection step of the pipeline. 

Let's examine if removing that step will improve performance.

In [13]:
from src.utils import save_df, get_df


try:
    logistic_grids_wo_select_df = get_df('logistic_grids_wo_select_results',
                                        'experiment_results/tuning/grids')
except FileNotFoundError:
    logistic_pipe_wo_select = create_pipe('LogisticRegression', feature_selection=False)
    logistic_grids_wo_select = grid_search(X_TrainSet, Y_TrainSet, logistic_pipe_wo_select, param_grid=logistic_param_grid, verbosity=4)
    
    logistic_grids_wo_select_df = pd.DataFrame(logistic_grids_wo_select.cv_results_)
    save_df(logistic_grids_wo_select_df,
        'logistic_grids_wo_select_results',
        'experiment_results/tuning/grids')


### Beginning grid search for LogisticRegression ###
Fitting 5 folds for each of 336 candidates, totalling 1680 fits
[CV 4/5] END corr_selector__threshold=0.5, model__C=0.001, model__penalty=l2, model__solver=sag; accuracy: (test=nan) precision: (test=nan) total time=   1.9s
[CV 4/5] END corr_selector__threshold=0.5, model__C=0.001, model__penalty=l2, model__solver=newton-cholesky; accuracy: (test=nan) precision: (test=nan) total time=   2.1s
[CV 5/5] END corr_selector__threshold=0.5, model__C=0.001, model__penalty=l2, model__solver=newton-cg; accuracy: (test=nan) precision: (test=nan) total time=   2.4s
[CV 3/5] END corr_selector__threshold=0.5, model__C=0.001, model__penalty=None, model__solver=newton-cg; accuracy: (test=nan) precision: (test=nan) total time=   2.6s
[CV 1/5] END corr_selector__threshold=0.5, model__C=0.001, model__penalty=None, model__solver=lbfgs; accuracy: (test=nan) precision: (test=nan) total time=   2.6s
[CV 4/5] END corr_selector__threshold=0.5, model__C=0.001,

In [None]:
get_best_params(logistic_grids)


Best parameters for current model:
threshold: 0.8
C: 0.01
penalty: l1
solver: liblinear
Avg. Precision: 88.74215959289508%.
Avg. Accuracy: 87.2433806846207%.

threshold: 0.8
C: 0.001
penalty: None
solver: lbfgs
Avg. Precision: 88.74215959289508%.
Avg. Accuracy: 87.2433806846207%.

threshold: 0.8
C: 0.001
penalty: None
solver: newton-cg
Avg. Precision: 88.74215959289508%.
Avg. Accuracy: 87.2433806846207%.

threshold: 0.8
C: 0.001
penalty: None
solver: newton-cholesky
Avg. Precision: 88.74215959289508%.
Avg. Accuracy: 87.2433806846207%.

threshold: 0.8
C: 0.001
penalty: None
solver: sag
Avg. Precision: 88.74215959289508%.
Avg. Accuracy: 87.2433806846207%.

threshold: 0.8
C: 0.01
penalty: None
solver: lbfgs
Avg. Precision: 88.74215959289508%.
Avg. Accuracy: 87.2433806846207%.

threshold: 0.8
C: 0.01
penalty: None
solver: newton-cg
Avg. Precision: 88.74215959289508%.
Avg. Accuracy: 87.2433806846207%.

threshold: 0.8
C: 0.01
penalty: None
solver: newton-cholesky
Avg. Precision: 88.742159592

The performance of the best estimators is identical (check this)

In [None]:
print(logistic_grids.cv_results_['mean_fit_time'].mean())
print(logistic_grids.cv_results_['mean_fit_time'].std())


9.581649013873943
1.6498879688367896
0.8004611973625935
0.35958273424097664


This does not seem so bad for a training time.

In [None]:
best_logistic_pipe = logistic_grid.best_estimator_
from src.model_eval import clf_performance

Now let's see what the best choices of hyperparameters are for the Logistic Regression model.

In [None]:

get_best_params(logistic_grids)
print(logistic_grids.cv_results_.keys())

Best parameters for current model:
threshold: 0.8
C: 0.01
penalty: l1
solver: liblinear
Avg. Precision: 88.74215959289508%.
Avg. Accuracy: 87.2433806846207%.

threshold: 0.8
C: 0.001
penalty: None
solver: lbfgs
Avg. Precision: 88.74215959289508%.
Avg. Accuracy: 87.2433806846207%.

threshold: 0.8
C: 0.001
penalty: None
solver: newton-cg
Avg. Precision: 88.74215959289508%.
Avg. Accuracy: 87.2433806846207%.

threshold: 0.8
C: 0.001
penalty: None
solver: newton-cholesky
Avg. Precision: 88.74215959289508%.
Avg. Accuracy: 87.2433806846207%.

threshold: 0.8
C: 0.001
penalty: None
solver: sag
Avg. Precision: 88.74215959289508%.
Avg. Accuracy: 87.2433806846207%.

threshold: 0.8
C: 0.01
penalty: None
solver: lbfgs
Avg. Precision: 88.74215959289508%.
Avg. Accuracy: 87.2433806846207%.

threshold: 0.8
C: 0.01
penalty: None
solver: newton-cg
Avg. Precision: 88.74215959289508%.
Avg. Accuracy: 87.2433806846207%.

threshold: 0.8
C: 0.01
penalty: None
solver: newton-cholesky
Avg. Precision: 88.742159592

### Subsection 2.2: AdaBoost
We will now do a grid search with AdaBoost. After an initial search to determine a range, we will investigate more closely.

With AdaBoost, we can vary the base estimator. The default is a Decision tree, but we have seen how successful Logistic Regression has been, so we will also try that.

In [77]:
base_logistic = LogisticRegression(random_state=42)

model = AdaBoostClassifier(base_estimator=base_logistic,random_state=42)
model.fit(X_TrainSet,Y_TrainSet)



The default base estimator for AdaBoost is a Decision tree. the hyperparamters for that kind of model are
* `n_estimators` the max number of estimators
* `learning_rate` which weights the estimators
* `algorithm` of which there are two choices (one has been deprecated)


In [81]:
ada_params = {'corr_selector__threshold': [round(0.05*i,2) for i in range(12,21)],
              'model__n_estimators': [5*i for i in range(1,11)],
              'model__learning_rate': [0.5*i for i in range(1,10)],
              'model__algorithm': ['SAMME', 'SAMME.R']
}

In [82]:
ada_pipe_w_feat_selection = create_pipe('AdaBoost', feature_selection=True)

ada_grid = grid_search(X_TrainSet, Y_TrainSet, ada_pipe_w_feat_selection, param_grid=ada_params, verbosity=2)
print('='*20)
print('finish with feature selection case')
print('='*20)
from src.utils import save_df
target_dir = '/workspace/pp5-ml-dashboard/outputs/experiment_results/grid_search_cv'
ada_w_feat_selection_cv_res = pd.DataFrame(ada_grid.cv_results_)
save_df(ada_w_feat_selection_cv_res,'ada_w_feat_selection_cv_res', target_dir,)
#ada_boost_wo_feat_selection = create_pipe('AdaBoost', feature_selection=False)
#ada_grid_wo_feat_selection = grid_search(X_TrainSet, Y_TrainSet, ada_boost_wo_feat_selection, param_grid=ada_params, verbosity=2)

### Beginning grid search for AdaBoost ###
Fitting 5 folds for each of 1620 candidates, totalling 8100 fits
[CV] END corr_selector__threshold=0.6, model__algorithm=SAMME, model__learning_rate=0.5, model__n_estimators=5; total time=  23.3s
[CV] END corr_selector__threshold=0.6, model__algorithm=SAMME, model__learning_rate=0.5, model__n_estimators=5; total time=  23.4s
[CV] END corr_selector__threshold=0.6, model__algorithm=SAMME, model__learning_rate=0.5, model__n_estimators=15; total time=  23.6s
[CV] END corr_selector__threshold=0.6, model__algorithm=SAMME, model__learning_rate=0.5, model__n_estimators=10; total time=  22.2s
[CV] END corr_selector__threshold=0.6, model__algorithm=SAMME, model__learning_rate=0.5, model__n_estimators=10; total time=  24.5s
[CV] END corr_selector__threshold=0.6, model__algorithm=SAMME, model__learning_rate=0.5, model__n_estimators=5; total time=  23.6s
[CV] END corr_selector__threshold=0.6, model__algorithm=SAMME, model__learning_rate=0.5, model__n_estim