# **Tuning Hyperparameters**

## Objectives

* We will tune the hyperparameters of the Logistic Regression, AdaBoost, and Gradient Boosting models.

## Inputs

* Training and Testing data sets from notebook 04.
* Insights developed in the previous notebook.

## Outputs

* We will have saved models with tuned hyperparameters at the end of this notebook.


---

# Change working directory
We change the working directory to the project folder, so that the relative paths used below (for example `train/csv`) resolve correctly.
* We confirm the new working directory with `os.getcwd()`

In [1]:
import os

home_dir = '/workspace/pp5-ml-dashboard'
os.chdir(home_dir)
current_dir = os.getcwd()
print(current_dir)

/workspace/pp5-ml-dashboard


We now load our training and test sets, as well as some of the packages that we will be using.

In [2]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
from src.utils import get_df, save_df

train_dir = 'train/csv'
X_TrainSet = get_df('X_TrainSet',train_dir)
Y_TrainSet = get_df('Y_TrainSet',train_dir)

test_dir = 'test/csv'
X_TestSet = get_df('X_TestSet',test_dir)
Y_TestSet = get_df('Y_TestSet',test_dir)

## Section 1: Pipeline and Grid Search set up
We recall the code for building our pipelines and the grid search that we performed in the last notebook. Note that some of the constants have changed. 

We have modified the pipeline to see how feature selection impacts performance. Note that setting `thresh=1` essentially removes the `'corr_selector'` step from the pipeline, since no pair of features can have a correlation above 1. We will eventually remove this step from the pipeline once we have selected a value for `thresh`.
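
To illustrate why a threshold of 1 leaves every feature in place, here is a minimal sketch in plain Python. It assumes the selector flags a pair only when its absolute Pearson correlation is strictly greater than the threshold. The toy columns and the helper names `pearson` and `correlated_pairs` are hypothetical and are not part of the project code.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical toy columns, not taken from the project data
toy = {
    'a': [1, 2, 3, 4, 5, 6],
    'b': [2, 4, 6, 8, 10, 13],  # almost a multiple of 'a'
    'c': [5, 1, 4, 2, 6, 3],    # only weakly related to 'a' and 'b'
}


def correlated_pairs(columns, thresh):
    """Pairs whose absolute correlation strictly exceeds thresh (assumed rule)."""
    return [(f1, f2) for f1, f2 in combinations(columns, 2)
            if abs(pearson(columns[f1], columns[f2])) > thresh]


pairs_at_06 = correlated_pairs(toy, 0.6)  # 'a' and 'b' are flagged
pairs_at_1 = correlated_pairs(toy, 1.0)   # nothing can exceed 1
print(pairs_at_06)
print(pairs_at_1)
```

With no flagged pairs there is nothing for the selector to drop, which is why `thresh=1` behaves like a pipeline without the step.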

In [6]:
from sklearn.preprocessing import StandardScaler
from feature_engine import transformation as vt
from feature_engine.selection import DropFeatures, SmartCorrelatedSelection
from sklearn.pipeline import Pipeline


# Constants needed for feature engineering
TO_DROP = ['ftm_away', 'plus_minus_home', 'fg3m_away', 'pts_away', 'play_off',
           'fgm_away', 'pts_home', 'fg3m_home', 'ftm_home', 'fgm_home',
           'season']
THRESH = 0.6
TRANSFORMS = {'box_cox':(vt.BoxCoxTransformer,False),
              'yeo_johnson':(vt.YeoJohnsonTransformer,False)}
TRANSFORM_ASSIGNMENTS = {
    'yeo_johnson': ['dreb_away', 'blk_home', 'oreb_away', 'fta_away',
                    'dreb_home', 'ast_home', 'stl_away', 'stl_home',
                    'reb_away', 'oreb_home', 'pf_away', 'pf_home'],
    'box_cox': ['ast_away', 'fta_home']
                            }


def base_pipeline(thresh=THRESH):
    pipeline = Pipeline([
        ('dropper', DropFeatures(features_to_drop=TO_DROP)),
        ('corr_selector', SmartCorrelatedSelection(method="pearson",
                                                   threshold=thresh,
                                                   selection_method="variance")
                                                   )
                        ])
    return pipeline

    
def add_transformations(pipeline, transform_assignments):
    # Call this only after the base pipeline has been fit, so that
    # 'corr_selector' exposes its `features_to_drop_` attribute
    dropping = pipeline['corr_selector'].features_to_drop_
    
    new_assignments = { key: [val for val in value if val not in dropping] 
                       for key,value in transform_assignments.items()}
    for transform, targets in new_assignments.items():
        if not targets:
            continue
        pipeline.steps.append(
            (transform, TRANSFORMS[transform][0](variables=targets))
            )
    pipeline.steps.append(('scaler', StandardScaler()))
    return pipeline
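
The filtering step inside `add_transformations` can be looked at in isolation. The sketch below uses hypothetical stand-ins for `TRANSFORM_ASSIGNMENTS` and for the selector's `features_to_drop_`; it only shows how dropped columns are removed from each transformer's target list and how a transformer with no remaining targets is skipped.

```python
# Hypothetical stand-ins for TRANSFORM_ASSIGNMENTS and features_to_drop_
assignments = {
    'yeo_johnson': ['dreb_away', 'blk_home', 'oreb_away'],
    'box_cox': ['ast_away'],
}
dropped = ['dreb_away', 'oreb_away', 'ast_away']

# Keep only the columns that survived the correlation selector
remaining = {name: [col for col in cols if col not in dropped]
             for name, cols in assignments.items()}

# A transformer with an empty target list would not be added to the pipeline
steps_to_add = [name for name, cols in remaining.items() if cols]

print(remaining)
print(steps_to_add)
```

Skipping empty target lists matters because the transformers are constructed with an explicit `variables` list, and an empty list would no longer describe the intended columns.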

In [3]:
from sklearn.feature_selection import SelectFromModel

# ML algorithms
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier,AdaBoostClassifier


MODELS = {
    'LogisticRegression': LogisticRegression,
    'GradientBoosting': GradientBoostingClassifier,
    'AdaBoost': AdaBoostClassifier,
}


def add_feat_selection_n_model(pipeline,model_name,random_state=42, feature_selection=True):
    model = MODELS[model_name](random_state=random_state)
    if feature_selection:
        pipeline.steps.append(("feat_selection", SelectFromModel(model)))
    pipeline.steps.append(('model',model))
    pipeline.model_type = model_name
    return pipeline

Let us now create the pipelines that we are going to train, one for each correlation threshold.

In [7]:
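def create_pipelines(model_name, thresh_range=(THRESH,), feature_selection=True):
    # Build one pipeline per threshold, keyed as '<model_name>_<thresh>'
    pipelines = {}
    for thresh in thresh_range:
        base_pipe = base_pipeline(thresh)
        # Fit first so that 'corr_selector' exposes features_to_drop_
        base_pipe.fit(X_TrainSet)
        pipe_w_transforms = add_transformations(base_pipe, TRANSFORM_ASSIGNMENTS)
        # Pass feature_selection by keyword; passed positionally it would
        # be bound to random_state instead
        pipelines[f'{model_name}_{thresh}'] = add_feat_selection_n_model(
            pipe_w_transforms, model_name, feature_selection=feature_selection)
    return pipelines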



Next, we have the code for our grid search. As we will be treating `thresh` as a hyperparameter, it differs slightly from the version in the previous notebook.

In [12]:
from sklearn.model_selection import GridSearchCV
# Suppress warnings (including those raised in worker processes) to keep the grid search output readable
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
import logging
logging.captureWarnings(True)
os.environ['PYTHONWARNINGS']='ignore'


def grid_search(X_train, y_train, pipelines,param_grid={}):
    GRIDS = {}
    count = 0
    for pipe_name_thresh, pipe in pipelines.items():
        pipe_name, thresh = pipe_name_thresh.split('_')
        print(f"### Beginning grid search for {pipe_name} with correlation threshold={thresh} ###") 
        grid=GridSearchCV(estimator=pipe,
                    param_grid=param_grid,
                    cv=5,
                    n_jobs=-2,
                    verbose=3,
                    scoring=['accuracy','precision'],
                    refit='precision')
        grid.fit(X_train,y_train)
        GRIDS[pipe_name_thresh] = grid
        count +=1
        print(f"Finished with model {count}.")
        print(f"{len(pipelines.keys())-count} models remaining.")
    return GRIDS
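
The combination of `scoring=['accuracy','precision']` and `refit='precision'` means that both metrics are recorded for every candidate, while the best candidate is chosen by precision alone. The following is a minimal sketch of that behaviour, assuming synthetic data from `make_classification` and a small two-step pipeline; both are stand-ins and not the project's data or pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data (a stand-in for the real training set)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(random_state=42)),
])

grid = GridSearchCV(
    estimator=pipe,
    # Parameters of a pipeline step are addressed as '<step>__<parameter>'
    param_grid={'model__C': [0.1, 1, 10]},
    cv=5,
    scoring=['accuracy', 'precision'],
    refit='precision',  # the best candidate is selected by mean precision
)
grid.fit(X, y)

results = grid.cv_results_
print(results['mean_test_accuracy'])
print(results['mean_test_precision'])
print(grid.best_params_)
```

Because accuracy is still stored in `cv_results_`, it can be inspected afterwards to check that the precision-optimal candidate does not give up too much accuracy.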


## Section 2: Selecting Hyperparameters
We will look at the different possible hyperparameters for each model.

### Subsection 2.1: Logistic Regression
Logistic Regression models have many hyperparameters. We will focus on:
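* `penalty` selects the type of regularization (for example `l1` or `l2`).
* `solver` specifies the optimization algorithm used to fit the model.
* `C` is the inverse of the regularization strength: smaller values mean a stronger penalty.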

Not all penalties work for each solver. We will try the following hyperparameters.

In [10]:
C = [10**(i) for i in range(-3,4)]
solvers = ['newton-cg', 'newton-cholesky', 'lbfgs', 'liblinear', 'sag', 'saga']
penaltys = ['none', 'l1', 'l2', 'elasticnet']

logistic_param_grid = [
            {'C':C,
             'solver':['lbfgs','newton-cg','newton-cholesky','sag'],
             'penalty':['l2',None]},
            {'C':C,
             'solver':['liblinear'],
             'penalty':['l1','l2']},
            {'C':C,
             'solver':['saga'],
             'penalty':['l1','l2',None,'elasticnet']}
              ]
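
A list of dictionaries is treated as a set of separate sub-grids, so the number of candidates is the sum of the sizes of the sub-grids rather than one large product. As a rough sanity check, the sketch below counts the candidates in plain Python; the helper `count_candidates` is hypothetical and the 5 folds are taken from the `cv=5` setting in `grid_search`.

```python
from itertools import product

C = [10 ** i for i in range(-3, 4)]

# Same structure as logistic_param_grid
param_grid = [
    {'C': C,
     'solver': ['lbfgs', 'newton-cg', 'newton-cholesky', 'sag'],
     'penalty': ['l2', None]},
    {'C': C,
     'solver': ['liblinear'],
     'penalty': ['l1', 'l2']},
    {'C': C,
     'solver': ['saga'],
     'penalty': ['l1', 'l2', None, 'elasticnet']},
]


def count_candidates(grid):
    """Sum, over the sub-grids, of the number of parameter combinations."""
    return sum(len(list(product(*sub.values()))) for sub in grid)


n_candidates = count_candidates(param_grid)
n_fits = n_candidates * 5  # one fit per candidate and fold
print(n_candidates, n_fits)
```

Splitting the grid by solver keeps unsupported solver and penalty pairs out of the search, at the cost of a slightly longer definition.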

Now we are ready to run the grid search. We expect this to go well, since Logistic Regression was our best-performing model before any tuning.

In [17]:
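# Round to avoid floating-point artefacts such as 0.6000000000000001 in the pipeline names
thresholds = [round(0.05*i, 2) for i in range(10,18)]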
logistic_pipes = create_pipelines('LogisticRegression',thresh_range=thresholds, feature_selection=True)
logistic_grids = grid_search(X_TrainSet, Y_TrainSet, logistic_pipes, param_grid=logistic_param_grid)
sample = logistic_pipes['LogisticRegression_0.5']
sample['model'].__dict__


### Beginning grid search for LogisticRegression with correlation threshold=0.5 ###
Fitting 5 folds for each of 98 candidates, totalling 490 fits


ValueError: Invalid parameter 'C' for estimator Pipeline(steps=[('dropper',
                 DropFeatures(features_to_drop=['ftm_away', 'plus_minus_home',
                                                'fg3m_away', 'pts_away',
                                                'play_off', 'fgm_away',
                                                'pts_home', 'fg3m_home',
                                                'ftm_home', 'fgm_home',
                                                'season'])),
                ('corr_selector',
                 SmartCorrelatedSelection(selection_method='variance',
                                          threshold=0.5,
                                          variables=['fga_home', 'fg3a_home',
                                                     'fta_home', 'oreb_home',
                                                     'dreb_home', 'reb_hom...
                                                     'stl_away', 'blk_away',
                                                     'tov_away', 'pf_away'])),
                ('yeo_johnson',
                 YeoJohnsonTransformer(variables=['blk_home', 'fta_away',
                                                  'ast_home', 'reb_away'])),
                ('box_cox',
                 BoxCoxTransformer(variables=['ast_away', 'fta_home'])),
                ('scaler', StandardScaler()),
                ('feat_selection',
                 SelectFromModel(estimator=LogisticRegression(random_state=True))),
                ('model', LogisticRegression(random_state=True))]). Valid parameters are: ['memory', 'steps', 'verbose'].
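
The `ValueError` above arises because the estimator passed to `GridSearchCV` is the whole `Pipeline`, whose own parameters are only `memory`, `steps`, and `verbose`. Parameters of an individual step have to be addressed as `<step name>__<parameter>`, for example `model__C`. One possible way to adapt an existing grid is sketched below; the helper `prefix_param_grid` is hypothetical, and the step name `'model'` is taken from the pipeline built earlier.

```python
def prefix_param_grid(param_grid, step='model'):
    """Return a copy of param_grid with every key rewritten as '<step>__<key>'."""
    return [{f'{step}__{key}': values for key, values in sub.items()}
            for sub in param_grid]


C = [10 ** i for i in range(-3, 4)]

# One sub-grid in the same shape as logistic_param_grid
raw_grid = [
    {'C': C, 'solver': ['liblinear'], 'penalty': ['l1', 'l2']},
]

pipeline_grid = prefix_param_grid(raw_grid)
print(sorted(pipeline_grid[0]))
```

By the same naming convention, the parameters of the estimator wrapped by `SelectFromModel` would be reachable under `feat_selection__estimator__...`; whether those should be tuned as well is a separate decision.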

### Section 2.1: All features
We will now do a grid search using all features. We will then look at which features the models found most important.
