# **Tuning Hyperparameters**

## Objectives

* We will tune the hyperparameters of the Logistic Regression, AdaBoost, and Gradient Boosting models.

## Inputs

* Training and Testing data sets from notebook 04.
* Insights developed in the previous notebook.

## Outputs

* We will have saved models with tuned hyperparameters at the end of this notebook.


---

# Change working directory
We change the working directory to the project root so that the relative paths used below resolve correctly.
* We confirm the change with `os.getcwd()`.

In [1]:
import os

home_dir = '/workspace/pp5-ml-dashboard'
os.chdir(home_dir)
current_dir = os.getcwd()
print(current_dir)

/workspace/pp5-ml-dashboard
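
The hard-coded path above only works on a machine where that folder exists. A minimal sketch of a guard that fails early with a clear message is shown below; `enter_project_root` is a hypothetical helper name, not part of the project.

```python
import os
from pathlib import Path


def enter_project_root(path):
    """Change into `path` and return the new working directory.

    Hypothetical helper (an assumption, not part of `src.utils`).
    """
    root = Path(path).expanduser().resolve()
    if not root.is_dir():
        # Fail before any relative path is used further down the notebook
        raise FileNotFoundError(f"Project root not found: {root}")
    os.chdir(root)
    return Path.cwd()
```

Calling `enter_project_root('/workspace/pp5-ml-dashboard')` would then replace the `os.chdir` call in the cell above.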


We now load our training and test sets, as well as some of the packages that we will be using.

In [2]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
from src.utils import get_df, save_df

train_dir = 'train/csv'
X_TrainSet = get_df('X_TrainSet',train_dir)
Y_TrainSet = get_df('Y_TrainSet',train_dir)

test_dir = 'test/csv'
X_TestSet = get_df('X_TestSet',test_dir)
Y_TestSet = get_df('Y_TestSet',test_dir)

## Section 1: Pipeline and Grid Search set up
We recall the code for building our pipelines and the grid search that we performed in the last notebook. Note that some of the constants have changed. 

We have modified the pipeline to see how feature selection impacts performance. Note that setting `thresh=1` essentially disables the `'corr_selector'` step, since no pair of features can have an absolute correlation above 1. We will eventually remove this step from the pipeline once we have selected a value for `thresh`.
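
To see why a threshold of 1 leaves every feature in place, here is a simplified, pure-Python sketch of correlation-based selection. It is a greedy stand-in written for illustration, not the algorithm `SmartCorrelatedSelection` implements, and the toy data and the function names `pearson` and `drop_correlated` are made up.

```python
from statistics import mean, pvariance


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def drop_correlated(columns, threshold):
    """Greedy filter: visit features from highest to lowest variance and keep
    one only if |r| with every already-kept feature is <= threshold.
    Returns the names of the features that would be dropped."""
    by_variance = sorted(columns, key=lambda name: pvariance(columns[name]),
                         reverse=True)
    kept = []
    for name in by_variance:
        if all(abs(pearson(columns[name], columns[k])) <= threshold
               for k in kept):
            kept.append(name)
    return [name for name in columns if name not in kept]


# Toy data (made up): 'b' is almost a rescaled copy of 'a'; 'c' is unrelated.
toy = {
    'a': [1.0, 2.0, 3.0, 4.0, 5.0],
    'b': [2.1, 3.9, 6.2, 8.0, 9.9],
    'c': [5.0, 1.0, 4.0, 2.0, 3.0],
}
print(drop_correlated(toy, threshold=0.6))  # ['a']: 'b' has the higher variance
print(drop_correlated(toy, threshold=1.0))  # []: nothing exceeds |r| = 1
```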

In [6]:
from sklearn.preprocessing import StandardScaler
from feature_engine import transformation as vt
from feature_engine.selection import DropFeatures, SmartCorrelatedSelection
from sklearn.pipeline import Pipeline


# Constants needed for feature engineering
TO_DROP = ['ftm_away', 'plus_minus_home', 'fg3m_away', 'pts_away', 'play_off',
           'fgm_away', 'pts_home', 'fg3m_home', 'ftm_home', 'fgm_home',
           'season']
THRESH = 0.6
TRANSFORMS = {'box_cox':(vt.BoxCoxTransformer,False),
              'yeo_johnson':(vt.YeoJohnsonTransformer,False)}
TRANSFORM_ASSIGNMENTS = {
    'yeo_johnson': ['dreb_away', 'blk_home', 'oreb_away', 'fta_away',
                    'dreb_home', 'ast_home', 'stl_away', 'stl_home',
                    'reb_away', 'oreb_home', 'pf_away', 'pf_home'],
    'box_cox': ['ast_away', 'fta_home']
                            }


def base_pipeline(thresh=THRESH):
    pipeline = Pipeline([
        ('dropper', DropFeatures(features_to_drop=TO_DROP)),
        ('corr_selector', SmartCorrelatedSelection(method="pearson",
                                                   threshold=thresh,
                                                   selection_method="variance")
                                                   )
                        ])
    return pipeline

    
def add_transformations(pipeline, transform_assignments):
    # Call this only after `pipeline` has been fit: the correlation selector
    # exposes `features_to_drop_` only once fitted.
    dropping = pipeline['corr_selector'].features_to_drop_
    
    new_assignments = { key: [val for val in value if val not in dropping] 
                       for key,value in transform_assignments.items()}
    for transform, targets in new_assignments.items():
        if not targets:
            continue
        pipeline.steps.append(
            (transform, TRANSFORMS[transform][0](variables=targets))
            )
    pipeline.steps.append(('scaler', StandardScaler()))
    return pipeline
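
The filtering inside `add_transformations` matters because a transformer pointed at a column that the correlation selector has already removed would typically fail at fit time. A small sketch of that filtering step on its own, using a hypothetical set of dropped columns:

```python
# Hypothetical set of columns removed by the correlation selector.
dropped = {'reb_away', 'fta_home'}

assignments = {
    'yeo_johnson': ['dreb_away', 'reb_away', 'pf_home'],
    'box_cox': ['fta_home'],
}

# Keep only the targets that survived feature selection.
surviving = {
    name: [col for col in cols if col not in dropped]
    for name, cols in assignments.items()
}
# Transformers whose target list is now empty are skipped entirely.
active = {name: cols for name, cols in surviving.items() if cols}

print(surviving)  # {'yeo_johnson': ['dreb_away', 'pf_home'], 'box_cox': []}
print(active)     # {'yeo_johnson': ['dreb_away', 'pf_home']}
```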

In [3]:
from sklearn.feature_selection import SelectFromModel

# ML algorithms
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier,AdaBoostClassifier


MODELS = {
    'LogisticRegression': LogisticRegression,
    'GradientBoosting': GradientBoostingClassifier,
    'AdaBoost': AdaBoostClassifier,
}


def add_feat_selection_n_model(pipeline,model_name,random_state=42, feature_selection=True):
    model = MODELS[model_name](random_state=random_state)
    if feature_selection:
        pipeline.steps.append(("feat_selection", SelectFromModel(model)))
    pipeline.steps.append(('model',model))
    pipeline.model_type = model_name
    return pipeline

Let us now create our list of models that we are going to train.

In [32]:
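def create_pipe(model_name, feature_selection=True):
    base_pipe = base_pipeline()
    # Fit first so that the correlation selector exposes `features_to_drop_`
    base_pipe.fit(X_TrainSet)
    pipe_w_transforms = add_transformations(base_pipe, TRANSFORM_ASSIGNMENTS)
    # Pass `feature_selection` by keyword; positionally it would be taken
    # as `random_state`
    pipe = add_feat_selection_n_model(pipe_w_transforms,
                                      model_name,
                                      feature_selection=feature_selection)
    pipe.name = model_name
    return pipe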



Next, we have the code for our grid search. As we will be treating `thresh` as a hyperparameter, it will be slightly different.

In [33]:
from sklearn.model_selection import GridSearchCV
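# Suppress FutureWarnings in this process
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
# Route warnings through the logging system instead of printing them
import logging
logging.captureWarnings(True)
# Also silence warnings in the worker processes spawned by n_jobs
os.environ['PYTHONWARNINGS']='ignore'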


def grid_search(X_train, y_train,pipe,param_grid={},verbosity=1):
    print(f"### Beginning grid search for {pipe.name} ###") 
    grid=GridSearchCV(estimator=pipe,
                    param_grid=param_grid,
                    cv=5,
                    n_jobs=-2,
                    verbose=verbosity,
                    scoring=['accuracy','precision'],
                    refit='precision')
    grid.fit(X_train,y_train)
    return grid


## Section 2: Selecting Hyperparameters
We will look at the different possible hyperparameters for each model.

### Subsection 2.1: Logistic Regression
Logistic Regression models have many hyperparameters. We will focus on:
* `penalty` sets the type of regularization (`l1`, `l2`, `elasticnet`, or none).
* `solver` specifies the optimization algorithm used to fit the model.
* `C` is the inverse of the regularization strength: smaller values mean a stronger penalty.

Not every solver supports every penalty, so we split the grid by solver and try the following hyperparameters.

In [47]:
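thresholds = [round(0.05*i,2) for i in range(12,21)]  # 0.6, 0.65, ..., 1.0
C = [10**(i) for i in range(-3,4)]  # 0.001, 0.01, ..., 1000
# For reference only: all available solvers and penalties (not used below)
solvers = ['newton-cg', 'newton-cholesky', 'lbfgs', 'liblinear', 'sag', 'saga']
penaltys = [None, 'l1', 'l2', 'elasticnet']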
param_grid = [
            {'C':C,
             'solver':['lbfgs','newton-cg','newton-cholesky','sag'],
             'penalty':['l2',None]},
            {'C':C,
             'solver':['liblinear'],
             'penalty':['l1','l2']},
            {'C':C,
             'solver':['saga'],
             'penalty':['l1','l2',None,'elasticnet']}
              ]
logistic_param_grid = [{'model__'+key:value 
                                for key,value in param_dict.items()}
                                for param_dict in param_grid]
for params in logistic_param_grid:
    params['corr_selector__threshold']=thresholds
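
As a sanity check on the grid above, the sketch below verifies each solver/penalty pair against the combinations scikit-learn's `LogisticRegression` accepts and counts the resulting candidates. The `SUPPORTED` table and the name `sub_grids` are written out here for illustration; the grid values are copied from the cell above.

```python
from itertools import product

# Penalties each solver accepts in scikit-learn's LogisticRegression.
SUPPORTED = {
    'lbfgs': {'l2', None},
    'newton-cg': {'l2', None},
    'newton-cholesky': {'l2', None},
    'sag': {'l2', None},
    'liblinear': {'l1', 'l2'},
    'saga': {'l1', 'l2', 'elasticnet', None},
}

thresholds = [round(0.05 * i, 2) for i in range(12, 21)]  # 0.6 ... 1.0
C = [10 ** i for i in range(-3, 4)]                       # 0.001 ... 1000
sub_grids = [
    {'C': C, 'solver': ['lbfgs', 'newton-cg', 'newton-cholesky', 'sag'],
     'penalty': ['l2', None]},
    {'C': C, 'solver': ['liblinear'], 'penalty': ['l1', 'l2']},
    {'C': C, 'solver': ['saga'], 'penalty': ['l1', 'l2', None, 'elasticnet']},
]

n_candidates = 0
for grid in sub_grids:
    # Every solver/penalty pair in a sub-grid must be a supported combination
    for solver, penalty in product(grid['solver'], grid['penalty']):
        assert penalty in SUPPORTED[solver], (solver, penalty)
    combos = len(grid['C']) * len(grid['solver']) * len(grid['penalty'])
    n_candidates += combos * len(thresholds)

print(n_candidates)      # 882
print(n_candidates * 5)  # 4410 fits with cv=5
```

One caveat: with `penalty='elasticnet'`, `saga` also expects an `l1_ratio` value, which the grid leaves unset, so those candidates may fail to fit unless `model__l1_ratio` is added to that sub-grid.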


Now we are ready to run the grid search. We expect it to go well, since Logistic Regression was our best-performing model without any tuning. This initial search will help us establish a range in which to tune the hyperparameters further.

In [67]:
logistic_pipe = create_pipe('LogisticRegression', feature_selection=True)
logistic_grids = grid_search(X_TrainSet, Y_TrainSet, logistic_pipe, param_grid=logistic_param_grid, verbosity=2)

### Beginning grid search for LogisticRegression###
Fitting 5 folds for each of 882 candidates, totalling 4410 fits
[CV] END corr_selector__threshold=0.6, model__C=0.001, model__penalty=l2, model__solver=newton-cholesky; total time=   2.5s
[CV] END corr_selector__threshold=0.6, model__C=0.001, model__penalty=l2, model__solver=newton-cholesky; total time=   2.6s
[CV] END corr_selector__threshold=0.6, model__C=0.001, model__penalty=l2, model__solver=lbfgs; total time=   2.7s
[CV] END corr_selector__threshold=0.6, model__C=0.001, model__penalty=None, model__solver=lbfgs; total time=   2.8s
[CV] END corr_selector__threshold=0.6, model__C=0.001, model__penalty=None, model__solver=newton-cg; total time=   3.0s
[CV] END corr_selector__threshold=0.6, model__C=0.001, model__penalty=l2, model__solver=newton-cg; total time=   3.1s
[CV] END corr_selector__threshold=0.6, model__C=0.001, model__penalty=None, model__solver=lbfgs; total time=   3.1s
[CV] END corr_selector__threshold=0.6, model__C=0.00

KeyboardInterrupt: 

Let's look at what the best choices are at this stage.

In [56]:
#from src.model_eval import get_best_params
def get_best_params(grid, n_best=10):
    # Rank candidates by mean precision, breaking ties by mean accuracy
    res = (pd.DataFrame(grid.cv_results_)
         .sort_values(by=['mean_test_precision','mean_test_accuracy'],ascending=False)
         .filter(['params','mean_test_precision','mean_test_accuracy'])
         .values)
    print("Best parameters for current model:")
    for i in range(min(n_best, len(res))):
        params = res[i][0]
        for key,value in params.items():
            print(f"{key.split('__')[1]}: {value}")
        # Report the scores of candidate i (not of the top candidate)
        print(f"Avg. Precision: {res[i][1]*100}%.")
        print(f"Avg. Accuracy: {res[i][2]*100}%.")
        print()

get_best_params(logistic_grids)

Best parameters for current model:
threshold: 0.8
C: 0.01
penalty: l1
solver: liblinear
Avg. Precision: 88.74215959289508%.
Avg. Accuracy: 87.2433806846207%.

threshold: 0.8
C: 0.001
penalty: None
solver: lbfgs
Avg. Precision: 88.74215959289508%.
Avg. Accuracy: 87.2433806846207%.

threshold: 0.8
C: 0.001
penalty: None
solver: newton-cg
Avg. Precision: 88.74215959289508%.
Avg. Accuracy: 87.2433806846207%.

threshold: 0.8
C: 0.001
penalty: None
solver: newton-cholesky
Avg. Precision: 88.74215959289508%.
Avg. Accuracy: 87.2433806846207%.

threshold: 0.8
C: 0.001
penalty: None
solver: sag
Avg. Precision: 88.74215959289508%.
Avg. Accuracy: 87.2433806846207%.

threshold: 0.8
C: 0.01
penalty: None
solver: lbfgs
Avg. Precision: 88.74215959289508%.
Avg. Accuracy: 87.2433806846207%.

threshold: 0.8
C: 0.01
penalty: None
solver: newton-cg
Avg. Precision: 88.74215959289508%.
Avg. Accuracy: 87.2433806846207%.

threshold: 0.8
C: 0.01
penalty: None
solver: newton-cholesky
Avg. Precision: 88.742159592
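
Because several candidates can tie, it helps to print each candidate's own scores next to its parameters. The following is a minimal, pandas-free sketch of such a ranking; `top_candidates` is a hypothetical helper name and the scores are made up, shaped like `GridSearchCV.cv_results_`.

```python
def top_candidates(cv_results, k=3):
    """Return the k best candidates, ranked by precision then accuracy."""
    rows = zip(cv_results['params'],
               cv_results['mean_test_precision'],
               cv_results['mean_test_accuracy'])
    ranked = sorted(rows, key=lambda row: (row[1], row[2]), reverse=True)
    report = []
    for params, precision, accuracy in ranked[:k]:
        # rpartition keeps only the final segment, so a nested name such as
        # 'model__estimator__C' is reported as 'C'.
        short = {key.rpartition('__')[2]: value
                 for key, value in params.items()}
        report.append((short, precision, accuracy))
    return report


# Made-up scores, shaped like GridSearchCV.cv_results_.
fake_results = {
    'params': [{'model__C': 0.01, 'corr_selector__threshold': 0.8},
               {'model__C': 1, 'corr_selector__threshold': 0.6},
               {'model__C': 0.1, 'corr_selector__threshold': 0.8}],
    'mean_test_precision': [0.88, 0.85, 0.88],
    'mean_test_accuracy': [0.87, 0.86, 0.86],
}
for short, precision, accuracy in top_candidates(fake_results):
    print(short, f"precision={precision:.2%}", f"accuracy={accuracy:.2%}")
```

Taking the last `__` segment rather than the second one keeps the labels readable once nested parameters (for example those of an AdaBoost base estimator) enter the grid.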

The top-ranked candidates all use a correlation threshold of 0.8, so we will examine a small neighborhood of this value further. For `C`, the best values appear to lie between 0.001 and 0.01, although `C` has no effect when `penalty` is `None`; we will also look at a neighborhood of this range. Keep in mind that these results were obtained with the feature selection step in the pipeline.
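
A refined grid around these values could look like the sketch below. The step sizes, the number of points, and the restriction to `liblinear` are illustrative assumptions, not choices made in this notebook.

```python
# Hypothetical refinement: ranges and step sizes are illustrative choices.
# Thresholds in steps of 0.025 around 0.8
fine_thresholds = [round(0.75 + 0.025 * i, 3) for i in range(5)]
# C on a log scale between 0.001 and 0.01
fine_C = [round(10 ** (-3 + i / 4), 5) for i in range(5)]

refined_param_grid = {
    'corr_selector__threshold': fine_thresholds,
    'model__C': fine_C,
    # C only matters when a penalty is applied, so penalty=None is left out
    'model__solver': ['liblinear'],
    'model__penalty': ['l1', 'l2'],
}

print(fine_thresholds)  # [0.75, 0.775, 0.8, 0.825, 0.85]
print(fine_C)           # [0.001, 0.00178, 0.00316, 0.00562, 0.01]
```

This grid has 5 × 5 × 1 × 2 = 50 candidates, far fewer than the initial search.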

Let's examine if removing that step will improve performance.

In [None]:
logistic_pipe_wo_select = create_pipe('LogisticRegression', feature_selection=False)
logistic_grids_wo_select = grid_search(X_TrainSet, Y_TrainSet, logistic_pipe_wo_select, param_grid=logistic_param_grid, verbosity=2)

### Beginning grid search for LogisticRegression###
Fitting 5 folds for each of 882 candidates, totalling 4410 fits


In [62]:
get_best_params(logistic_grids)


Best parameters for current model:
threshold: 0.8
C: 0.01
penalty: l1
solver: liblinear
Avg. Precision: 88.74215959289508%.
Avg. Accuracy: 87.2433806846207%.

threshold: 0.8
C: 0.001
penalty: None
solver: lbfgs
Avg. Precision: 88.74215959289508%.
Avg. Accuracy: 87.2433806846207%.

threshold: 0.8
C: 0.001
penalty: None
solver: newton-cg
Avg. Precision: 88.74215959289508%.
Avg. Accuracy: 87.2433806846207%.

threshold: 0.8
C: 0.001
penalty: None
solver: newton-cholesky
Avg. Precision: 88.74215959289508%.
Avg. Accuracy: 87.2433806846207%.

threshold: 0.8
C: 0.001
penalty: None
solver: sag
Avg. Precision: 88.74215959289508%.
Avg. Accuracy: 87.2433806846207%.

threshold: 0.8
C: 0.01
penalty: None
solver: lbfgs
Avg. Precision: 88.74215959289508%.
Avg. Accuracy: 87.2433806846207%.

threshold: 0.8
C: 0.01
penalty: None
solver: newton-cg
Avg. Precision: 88.74215959289508%.
Avg. Accuracy: 87.2433806846207%.

threshold: 0.8
C: 0.01
penalty: None
solver: newton-cholesky
Avg. Precision: 88.742159592

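The performance of the best estimators appears identical. However, the cell above prints the results for `logistic_grids` again rather than `logistic_grids_wo_select`, so this still needs to be checked.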

In [66]:
print(logistic_grids.cv_results_['mean_fit_time'].mean())
print(logistic_grids.cv_results_['mean_fit_time'].std())


9.581649013873943
1.6498879688367896
0.8004611973625935
0.35958273424097664


This does not seem so bad for a training time.

In [None]:
from src.model_eval import clf_performance

best_logistic_pipe = logistic_grids.best_estimator_

Now let's see what the best choices of hyperparameters are for the Logistic Regression model.

In [63]:

get_best_params(logistic_grids)
print(logistic_grids.cv_results_.keys())


### Subsection 2.2: AdaBoost
We will now do a grid search with AdaBoost. After an initial search to determine a range, we will investigate more closely.

With AdaBoost, we can vary the base estimator. The default is a decision stump (a decision tree of depth 1), but we have seen how successful Logistic Regression has been, so we will also try that.

In [49]:
base_logistic = LogisticRegression(random_state=42)

# `base_estimator` was renamed to `estimator` in scikit-learn 1.2
model = AdaBoostClassifier(estimator=base_logistic,random_state=42)
model.fit(X_TrainSet,Y_TrainSet)
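
For the initial AdaBoost search, a parameter grid could be built in the same `model__`-prefixed style as before. The sketch below is hypothetical: the values are illustrative, and the `estimator__C` entry assumes the pipeline's AdaBoost step was constructed with a Logistic Regression base estimator, which `create_pipe` as written does not do.

```python
from math import prod

# Hypothetical starting grid for the AdaBoost pipeline; the values are
# illustrative and would be narrowed after a first coarse search.
adaboost_params = {
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.01, 0.1, 1.0],
    # Double underscores reach into the nested base estimator
    # (the parameter is named `estimator` in scikit-learn >= 1.2).
    'estimator__C': [0.001, 0.01, 0.1],
}
adaboost_param_grid = {'model__' + key: values
                       for key, values in adaboost_params.items()}
adaboost_param_grid['corr_selector__threshold'] = [0.75, 0.8, 0.85]

n_candidates = prod(len(values) for values in adaboost_param_grid.values())
print(sorted(adaboost_param_grid))
print(n_candidates)  # 81
```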

