# **Tuning Hyperparameters**

## Objectives

* We will tune the hyperparameters of a Logistic Regression model and an Adaptive Boost (AdaBoost) model.

## Inputs

* Training and Testing data sets from notebook 04.
* Insights developed in the previous notebook.

## Outputs

* We will have saved models with tuned hyperparameters at the end of this notebook.

## Additional Comments

* We are making some philosophical assumptions about the nature of hyperparameters. The basic assumption is that models trained with hyperparameters that are "near enough" to each other will perform "similarly enough." In other words, the performance of the model depends _continuously_ on the hyperparameters. We in fact assume a certain amount of regularity in this dependence. In partial differential equations (PDEs), the kind of behavior we are assuming is characteristic of elliptic PDEs. We do not have a technical reason for believing this. We assume this is an active area of research for various models, but in general it is outside the scope of this project. It does influence how we go about searching for good hyperparameters.
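
In practice, the continuity assumption suggests a coarse-to-fine search: find the best point on a coarse grid, then search a finer grid in a small neighborhood of that point. The sketch below is only an illustration of that idea; `zoom_in` is a hypothetical helper and is not part of this project's code.

```python
def zoom_in(best, step, n_points=6):
    """Return n_points evenly spaced values covering [best - step/2, best + step/2]."""
    low = best - step / 2
    high = best + step / 2
    width = (high - low) / (n_points - 1)
    # Round to hide floating point noise such as 0.8099999999999999.
    return [round(low + i * width, 6) for i in range(n_points)]


# Suppose a coarse grid with spacing 0.1 found its best value at 0.8.
# The next pass then looks at a finer grid centered on 0.8.
fine_grid = zoom_in(best=0.8, step=0.1)
print(fine_grid)
```

If performance did not vary smoothly with the hyperparameters, there would be no reason to expect the finer grid to contain anything better than the coarse one.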


---

# Change working directory
We need to change the working directory from its current folder to the project root folder.
* We set it with `os.chdir()` and confirm the result with `os.getcwd()`.

In [1]:
import os

home_dir = '/workspace/pp5-ml-dashboard'
os.chdir(home_dir)
current_dir = os.getcwd()
print(current_dir)

/workspace/pp5-ml-dashboard


We now load our training and test sets, as well as some of the packages that we will be using.

In [2]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
from src.utils import get_df, save_df

train_dir = 'datasets/train/csv'
X_TrainSet = get_df('X_TrainSet',train_dir)
Y_TrainSet = get_df('Y_TrainSet',train_dir)

test_dir = 'datasets/test/csv'
X_TestSet = get_df('X_TestSet',test_dir)
Y_TestSet = get_df('Y_TestSet',test_dir)

## Section 1: Pipeline and Grid Search set up
We recall the code for building our pipelines and the grid search that we performed in the last notebook. Note that some of the constants have changed. 

We have modified the pipeline to see how feature selection impacts the performance. Note that setting `thresh=1` essentially removes the `'corr_selector'` step from the pipeline. We will eventually remove this step from the pipeline once we have selected a value for `thresh`.

In [3]:
from sklearn.preprocessing import StandardScaler
from feature_engine import transformation as vt
from feature_engine.selection import DropFeatures, SmartCorrelatedSelection
from sklearn.pipeline import Pipeline


# Constants needed for feature engineering
TO_DROP = ['ftm_away', 'plus_minus_home', 'fg3m_away', 'pts_away', 'play_off',
           'fgm_away', 'pts_home', 'fg3m_home', 'ftm_home', 'fgm_home',
           'season']
THRESH = 0.6
TRANSFORMS = {'box_cox':(vt.BoxCoxTransformer,False),
              'yeo_johnson':(vt.YeoJohnsonTransformer,False)}
TRANSFORM_ASSIGNMENTS = {
    'yeo_johnson': ['dreb_away', 'blk_home', 'oreb_away', 'fta_away',
                    'dreb_home', 'ast_home', 'stl_away', 'stl_home',
                    'reb_away', 'oreb_home', 'pf_away', 'pf_home'],
    'box_cox': ['ast_away', 'fta_home']
                            }


def base_pipeline(thresh=THRESH):
    pipeline = Pipeline([
        ('dropper', DropFeatures(features_to_drop=TO_DROP)),
        ('corr_selector', SmartCorrelatedSelection(method="pearson",
                                                   threshold=thresh,
                                                   selection_method="variance")
                                                   )
                        ])
    return pipeline

    
def add_transformations(pipeline, transform_assignments):
    # This needs to be called after the above is fit so that the correlation selector has that attr
    dropping = pipeline['corr_selector'].features_to_drop_
    
    new_assignments = { key: [val for val in value if val not in dropping] 
                       for key,value in transform_assignments.items()}
    for transform, targets in new_assignments.items():
        if not targets:
            continue
        pipeline.steps.append(
            (transform, TRANSFORMS[transform][0](variables=targets))
            )
    pipeline.steps.append(('scaler', StandardScaler()))
    return pipeline

In [4]:
from sklearn.feature_selection import SelectFromModel

# ML algorithms
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier,AdaBoostClassifier


MODELS = {
    'LogisticRegression': LogisticRegression,
    'GradientBoosting': GradientBoostingClassifier,
    'AdaBoost': AdaBoostClassifier,
}

def create_pipe(model_name, random_state=42, params=None):
    # Avoid a mutable default argument; None means "no extra parameters".
    model = MODELS[model_name](random_state=random_state, **(params or {}))
    base_pipe = base_pipeline()
    base_pipe.fit(X_TrainSet)
    pipe= add_transformations(base_pipe,TRANSFORM_ASSIGNMENTS)
    pipe.steps.append(("feat_selection", SelectFromModel(model)))
    pipe.steps.append(('model',model))
    pipe.model_type = model_name
    pipe.name = model_name
    return pipe


Next, we have the code for our grid search. As we will be treating `thresh` as a hyperparameter, it will be slightly different.

In [5]:
from sklearn.model_selection import GridSearchCV
# to suppress warnings
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
import logging
logging.captureWarnings(True)
os.environ['PYTHONWARNINGS']='ignore'


def grid_search(X_train, y_train,pipe,param_grid={},verbosity=1):
    print(f"### Beginning grid search for {pipe.name} ###") 
    grid=GridSearchCV(estimator=pipe,
                    param_grid=param_grid,
                    cv=5,
                    n_jobs=-2,
                    verbose=verbosity,
                    scoring=['accuracy','precision'],
                    refit='precision')
    grid.fit(X_train,y_train)
    return grid


## Section 2: Logistic Regression
Logistic Regression models have many hyperparameters. We will focus on:
* `penalty`, the type of regularization applied,
* `solver`, the optimization algorithm used,
* `C`, the inverse of the regularization strength (smaller values mean a stronger penalty).

Not all penalties work for each solver.

These will be our initial choice for hyperparameters. They will help us narrow down our search and find other ranges to test.
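
To make the solver and penalty restriction concrete, the sketch below lists the penalties each solver accepts, as described in the scikit-learn `LogisticRegression` documentation for recent versions, and filters a full cross product down to the valid pairs. The names `SUPPORTED_PENALTIES` and `valid_pairs` are illustrative and do not exist in scikit-learn or in this project.

```python
# Penalties accepted by each solver (per the scikit-learn LogisticRegression docs).
SUPPORTED_PENALTIES = {
    "lbfgs": {"l2", None},
    "newton-cg": {"l2", None},
    "newton-cholesky": {"l2", None},
    "sag": {"l2", None},
    "liblinear": {"l1", "l2"},
    "saga": {"l1", "l2", "elasticnet", None},
}


def valid_pairs(solvers, penalties):
    """Keep only the (solver, penalty) pairs that the solver supports."""
    return [(s, p) for s in solvers for p in penalties
            if p in SUPPORTED_PENALTIES[s]]


pairs = valid_pairs(SUPPORTED_PENALTIES, ["l1", "l2", "elasticnet", None])
print(len(pairs), "valid pairs out of", len(SUPPORTED_PENALTIES) * 4)
```

Splitting the grid into one dictionary per group of solvers, as the next cell does, is a way to encode the same restriction directly in `param_grid`. Note that `'elasticnet'` also expects an `l1_ratio` value, so a grid that omits it may fail to fit for those candidates.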

In [6]:
thresholds = [round(0.1*i,2) for i in range(5,11)]
C = [10**(2*i+1) for i in range(-2,2)]
solver = ['newton-cg', 'newton-cholesky', 'lbfgs', 'liblinear', 'sag', 'saga']
penalty = ['none', 'l1', 'l2', 'elasticnet']
param_grid = [
            {'C':C,
             'solver':['lbfgs','newton-cg','newton-cholesky','sag'],
             'penalty':['l2',None]},
            {'C':C,
             'solver':['liblinear'],
             'penalty':['l1','l2']},
            {'C':C,
             'solver':['saga'],
             'penalty':['l1','l2',None,'elasticnet']}
              ]
logistic_param_grid = [{'model__'+key:value 
                                for key,value in param_dict.items()}
                                for param_dict in param_grid]
for params in logistic_param_grid:
    params['corr_selector__threshold']=thresholds


Now we are ready to do the grid search. We expect this to go well since Logistic Regression was our best performing model without any tuning. This initial training will help us establish a range in which to further tune the hyperparameters.

In [7]:
from src.utils import save_df, get_df


def get_grid_results_df(pipe, name, dir, param_grid={}, verbosity=2):
    try:
        results_df = get_df(name, dir)
    except FileNotFoundError:
        pipe_grid_search = grid_search(
            X_TrainSet, Y_TrainSet, pipe, param_grid=param_grid, verbosity=verbosity
        )
        results_df = pd.DataFrame(pipe_grid_search.cv_results_)
        save_df(results_df, name, dir)
        # this normalizes the types
        results_df = get_df(name, dir)
    return results_df


logistic_pipe = create_pipe("LogisticRegression")
results_name = "logistic_grid_results_v1"
dir = "experiment_results/tuning/grids"

logistic_results_df_v1 = get_grid_results_df(
    logistic_pipe, results_name, dir, param_grid=logistic_param_grid
)


Let's look at what the best choices are at this stage.

In [8]:
from src.model_eval import get_best_params_df

get_best_params_df(logistic_results_df_v1)

Best parameters for current model:
threshold: 0.8
C: 0.001
penalty: None
solver: lbfgs
Avg. Precision: 88.70480236491775%.
Avg. Accuracy: 87.18030112459309%.

threshold: 0.8
C: 0.001
penalty: None
solver: newton-cg
Avg. Precision: 88.70480236491775%.
Avg. Accuracy: 87.18030112459309%.

threshold: 0.8
C: 0.001
penalty: None
solver: newton-cholesky
Avg. Precision: 88.70480236491775%.
Avg. Accuracy: 87.18030112459309%.



Clearly, the best correlation threshold is 0.8. The models have the same scores. Let's collect all of the parameter sets that share a score. We order the groups by score rather than by count.
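
The helpers in `src.model_eval` are not shown in this notebook, so the following is only a guess at the kind of grouping involved: parameter sets are bucketed by their (precision, accuracy) pair and the buckets are sorted by score. The function name, the row layout, and the example rows are all made up for illustration.

```python
import math
from collections import defaultdict


def group_by_score(rows):
    """Group parameter sets by (precision, accuracy), best score first."""
    groups = defaultdict(list)
    for row in rows:
        score = (row["precision"], row["accuracy"])
        # NaN never equals itself, so map failed fits to a sentinel score.
        if any(math.isnan(s) for s in score):
            score = (-1, -1)
        groups[score].append(row["params"])
    return dict(sorted(groups.items(), key=lambda kv: kv[0], reverse=True))


# Made-up rows; these are not results from this notebook.
rows = [
    {"precision": 0.88, "accuracy": 0.87, "params": {"solver": "lbfgs"}},
    {"precision": 0.88, "accuracy": 0.87, "params": {"solver": "saga"}},
    {"precision": float("nan"), "accuracy": float("nan"), "params": {"solver": "sag"}},
    {"precision": 0.90, "accuracy": 0.85, "params": {"solver": "liblinear"}},
]
grouped = group_by_score(rows)
for score, params in grouped.items():
    print(score, "count:", len(params))
```

Sorting tuples compares precision first and accuracy second, which matches a search that refits on precision.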

In [9]:
from src.model_eval import present_score_counts, score_stats


present_score_counts(logistic_results_df_v1)
best_score = score_stats(logistic_results_df_v1)


---  Score Counts  ---
Precision: 0.8870480236491776, Accuracy: 0.8718030112459308
Count: 34

Precision: 0.887042467962182, Accuracy: 0.8717743374108053
Count: 2

Precision: 0.8870152603635608, Accuracy: 0.8718030194666403
Count: 1

Precision: 0.8869924718004393, Accuracy: 0.8718890245305975
Count: 1

---  Score Stats  ---
Most Common: Precision: -1
             Accuracy: -1
             Count: 76
Max Score: Precision: 0.8870480236491776
           Accuracy: 0.8718030112459308
           Count: 34
Max Precision: 0.8870480236491776
Max Accuracy: 0.871917694255368


The most common scores are `nan` (we changed the score to be -1 if a `nan` value was showing up). This happens when a choice of parameters does not work well together. We won't worry about this as we are able to get quite good precision and accuracy with this first pass. Note that the best accuracy score is not far from the accuracy of the model with the best precision.

34 different choices of parameters had the best performance. We would like to see what these estimators had in common and look at neighborhoods around these parameters to see if we can improve the performance before moving on to the next model.
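
One simple way to see what the best estimators have in common is to count how often each parameter value appears among them. The sketch below assumes the parameter sets are available as plain dictionaries; `count_param_values` is a hypothetical stand-in, not the project's `present_param_counts`, and the example dictionaries are invented.

```python
from collections import Counter


def count_param_values(param_dicts):
    """Count occurrences of each (parameter name, value) pair."""
    counts = Counter()
    for params in param_dicts:
        for name, value in params.items():
            counts[(name, value)] += 1
    return counts


# Invented parameter sets for illustration only.
best = [
    {"model__solver": "saga", "model__penalty": None},
    {"model__solver": "saga", "model__penalty": "l2"},
    {"model__solver": "lbfgs", "model__penalty": None},
]
param_counts = count_param_values(best)
for (name, value), count in param_counts.most_common():
    print(f"{name}: {value}, Count: {count}")
```

Values that dominate the counts point to the regions of the grid worth searching more finely.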

In [10]:
from src.model_eval import present_param_counts

present_param_counts(logistic_results_df_v1, best_score)

model__solver: saga, Count: 8
model__solver: lbfgs, Count: 6
model__solver: newton-cg, Count: 6
model__solver: newton-cholesky, Count: 6
model__solver: sag, Count: 5
model__solver: liblinear, Count: 3
model__penalty: None, Count: 20
model__penalty: l2, Count: 10
model__penalty: l1, Count: 4
model__C: 1000, Count: 13
model__C: 10, Count: 11
model__C: 0.001, Count: 5
model__C: 0.1, Count: 5
corr_selector__threshold: 0.8, Count: 34


We will make the following modifications to `logistic_param_grid`:
* remove `'liblinear'` and `'sag'` as they were used the least,
* pick a neighborhood around 0.8 for correlation threshold,
* focus on penalties `'l2'` and `None`,
* focus on the range 1 to 1000 for `C`.

We will see if focusing gives us any improvement in score.
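
The next cell uses `divide_range` from `src.utils`, whose source is not shown in this notebook. Judging from the values that appear in later outputs (for example, `C` taking the values 1.0, 167.5, 334.0 and so on up to 1000.0), it appears to split an interval into equal parts and return both endpoints. The following is a guess at equivalent behavior under that assumption, not the project's implementation.

```python
def divide_range_sketch(start, stop, num):
    """Split [start, stop] into num equal parts and return the num + 1 boundary points."""
    step = (stop - start) / num
    return [start + i * step for i in range(num + 1)]


# Hypothetical reconstruction of the C values used in the second grid.
c_values = divide_range_sketch(1, 1000, 6)
print(c_values)
```

Under this reading, a call with `num=5` yields six candidate values per axis, which is worth keeping in mind when estimating the size of a grid.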

In [11]:
from src.utils import divide_range

thresholds = divide_range(0.75,0.85,5)
C = divide_range(1,1000,6)
logistic_param_grid_v2=[{'model__C': C,
  'model__solver': ['lbfgs', 'newton-cg', 'newton-cholesky','saga'],
  'model__penalty': ['l2', None],
  'corr_selector__threshold': thresholds}]

In [12]:
logistic_results_df_v2 = get_grid_results_df(logistic_pipe,
                                             'logistic_results_df_v2', dir, 
                                             param_grid=logistic_param_grid_v2,
                                             verbosity=3)

Let's proceed by doing the analysis we did above of the results of this grid search.

In [13]:
get_best_params_df(logistic_results_df_v2)

Best parameters for current model:
threshold: 0.77
C: 1.0
penalty: None
solver: lbfgs
Avg. Precision: 88.70480236491775%.
Avg. Accuracy: 87.18030112459309%.

threshold: 0.77
C: 1.0
penalty: None
solver: newton-cg
Avg. Precision: 88.70480236491775%.
Avg. Accuracy: 87.18030112459309%.

threshold: 0.77
C: 1.0
penalty: None
solver: newton-cholesky
Avg. Precision: 88.70480236491775%.
Avg. Accuracy: 87.18030112459309%.



Again, very similar scores. So let's analyze the scores that showed up as we did before.

In [14]:
present_score_counts(logistic_results_df_v2)
best_score = score_stats(logistic_results_df_v2)

---  Score Counts  ---
Precision: 0.8870480236491776, Accuracy: 0.8718030112459308
Count: 156

Precision: 0.8868889708447109, Accuracy: 0.871716997961264
Count: 3

Precision: 0.8868481163465816, Accuracy: 0.8716883282364934
Count: 9

Precision: 0.8753149006289227, Accuracy: 0.8584414808786294
Count: 78

---  Score Stats  ---
Most Common: Precision: 0.8870480236491776
             Accuracy: 0.8718030112459308
             Count: 156
Max Score: Precision: 0.8870480236491776
           Accuracy: 0.8718030112459308
           Count: 156
Max Precision: 0.8870480236491776
Max Accuracy: 0.8718030112459308


Our most common score is our best score. It seems like we have chosen a good range of parameters since many of the combinations yield good results.

In [15]:
present_param_counts(logistic_results_df_v2, best_score)

model__solver: lbfgs, Count: 39
model__solver: newton-cg, Count: 39
model__solver: newton-cholesky, Count: 39
model__solver: saga, Count: 39
model__penalty: None, Count: 84
model__penalty: l2, Count: 72
model__C: 167.5, Count: 24
model__C: 334.0, Count: 24
model__C: 500.5, Count: 24
model__C: 667.0, Count: 24
model__C: 833.5, Count: 24
model__C: 1000.0, Count: 24
model__C: 1.0, Count: 12
corr_selector__threshold: 0.77, Count: 52
corr_selector__threshold: 0.79, Count: 52
corr_selector__threshold: 0.8099999999999999, Count: 52


All of the choices of parameters seem to be performing equally well, with insignificant exceptions. We will have to look at other metrics to distinguish between these choices of parameters, such as training time statistics and the standard deviation of the scores.

In [16]:
best_results = logistic_results_df_v2.query(f'mean_test_precision == {best_score[0]} and mean_test_accuracy == {best_score[1]}')
std_score_counts = {}

for _, row in best_results.iterrows():
    std_score = (row['std_test_precision'], row['std_test_accuracy'])
    if std_score in std_score_counts:
        std_score_counts[std_score] += 1
    else:
        std_score_counts[std_score] = 1
for key, value in std_score_counts.items():
    print(f"std_test_precision: {key[0]}"
          f"\nstd_test_accuracy: {key[1]}"
          f"\ncount: {value}")

std_test_precision: 0.0054844417109899
std_test_accuracy: 0.004118208175772
count: 156


So standard deviation will not help us distinguish between them either. This is annoying, but good. We will see how the models perform on the test data set.

Note: If you are tinkering and running cells multiple times, we recommend commenting out the code in the following three cells. They take a while to run even after they have been run once, since we aren't saving the large number of pipelines and models.

In [17]:
import ast

def parameter_dicts(results_df, best_score):
    relevant = results_df.query(f'mean_test_precision == {best_score[0]} and mean_test_accuracy == {best_score[1]}')
    param_dicts = [ast.literal_eval(param_dict) for param_dict in relevant['params'].values]
    return param_dicts

best_params = parameter_dicts(logistic_results_df_v2, best_score)
'''
best_pipes = []
for param_dict in best_params:
    base_pipe = create_pipe('LogisticRegression')
    pipe = base_pipe.set_params(**param_dict)
    pipe.param_dict = param_dict
    best_pipes.append(pipe)

model_params = {key.split('__')[0]:key.split('__')[1]
                for key in best_params[0].keys()}
count = 0
for pipe in best_pipes:
    print(f"Pipe {count}:")
    for step in pipe.get_params()['steps']:
        if step[0] in model_params:
            param = model_params[step[0]]
            value = step[1].get_params()[param]
            print(f"{step[0]}"
                  f"\n{param}: {value}")
    print()
    count += 1
    if count>=2:
        break
'''


We will now train all of the above pipelines and evaluate them on the test dataset.

In [18]:
'''count = 0
for pipe in best_pipes:
    print(f"Training pipe {count}:")
    print(pipe.param_dict)
    pipe.fit(X_TrainSet, Y_TrainSet)
    count+=1
    '''


In [19]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

'''
def evaluate_param_on_test_set(pipe,X_test, Y_test):
    y_pred = pipe.predict(X_test)
    accuracy = accuracy_score(Y_test, y_pred)
    precision = precision_score(Y_test, y_pred)
    recall = recall_score(Y_test, y_pred)
    f1 = f1_score(Y_test, y_pred)
    return ((precision, accuracy, recall, f1), pipe.param_dict)


def evaluate_and_sort(fitted_pipes, X_test, Y_test):
    evaluations = [evaluate_param_on_test_set(pipe, X_test, Y_test)
               for pipe in fitted_pipes]
    evaluation_dict = {}
    for eval in evaluations:
        if eval[0] in evaluation_dict:
            evaluation_dict[eval[0]].append(eval[1])
        else:
            evaluation_dict[eval[0]] = [eval[1]]
    sorted_eval_dict = {k:v for k,v in sorted(evaluation_dict.items(),
                                         key=lambda item: item[0],
                                         reverse=True)}
    return sorted_eval_dict

sorted_evals = evaluate_and_sort(best_pipes, X_TestSet, Y_TestSet)
print(len(sorted_evals))
'''


It turns out that all of these best sets of parameters produce models that perform equally well with respect to the standard metrics. We will have to pick one to deploy, so we will look at the time it took to score each model during the grid search.
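
The tie-break amounts to picking the candidate with the smallest mean score time, using the standard deviation only to separate equal means. The sketch below expresses that with plain dictionaries; `pick_fastest` is a hypothetical helper, and the timing values are copied from the five rows printed in the output of the next cell.

```python
def pick_fastest(candidates):
    """Prefer the lowest mean score time, then the lowest std of score time."""
    return min(candidates,
               key=lambda c: (c["mean_score_time"], c["std_score_time"]))


# Timings taken from the printed head of the sorted results; row is the index label.
candidates = [
    {"row": 108, "mean_score_time": 0.377752, "std_score_time": 0.131118},
    {"row": 184, "mean_score_time": 0.399108, "std_score_time": 0.108866},
    {"row": 81, "mean_score_time": 0.322713, "std_score_time": 0.144490},
    {"row": 65, "mean_score_time": 0.383024, "std_score_time": 0.131624},
    {"row": 135, "mean_score_time": 0.378142, "std_score_time": 0.145327},
]
fastest = pick_fastest(candidates)
print(fastest["row"])
```

The gaps between these means are smaller than the reported standard deviations, so this choice is best read as a convenient tie-break rather than evidence that one configuration is reliably faster.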

In [20]:
time_results = best_results.filter(['mean_score_time','std_score_time','params'])

time_results = time_results.sort_values(by=['mean_score_time','std_score_time'])
print(time_results.head())
params_choice = time_results.iloc[0]['params']
print(params_choice)


     mean_score_time  std_score_time  \
81          0.322713        0.144490   
108         0.377752        0.131118   
135         0.378142        0.145327   
65          0.383024        0.131624   
184         0.399108        0.108866   

                                                params  
81   {'corr_selector__threshold': 0.77, 'model__C':...  
108  {'corr_selector__threshold': 0.77, 'model__C':...  
135  {'corr_selector__threshold': 0.79, 'model__C':...  
65   {'corr_selector__threshold': 0.77, 'model__C':...  
184  {'corr_selector__threshold': 0.809999999999999...  
{'corr_selector__threshold': 0.77, 'model__C': 500.5, 'model__penalty': 'l2', 'model__solver': 'newton-cg'}


We have the following choice of hyperparameters:
* correlation threshold: 0.77
* `C`: 500.5
* solver method: newton-cg
* penalty function: l2

Let's train the model and then look at the classification report. We don't need to list the penalty function since l2 is the default penalty function.

In [21]:
model_params = {'C':500.5, 'solver': 'newton-cg'}

final_logistic_pipe = create_pipe(model_name='LogisticRegression', params=model_params)
final_logistic_pipe.set_params(corr_selector__threshold=0.77)
final_logistic_pipe

Pipeline(steps=[('dropper',
                 DropFeatures(features_to_drop=['ftm_away', 'plus_minus_home',
                                                'fg3m_away', 'pts_away',
                                                'play_off', 'fgm_away',
                                                'pts_home', 'fg3m_home',
                                                'ftm_home', 'fgm_home',
                                                'season'])),
                ('corr_selector',
                 SmartCorrelatedSelection(selection_method='variance',
                                          threshold=0.77,
                                          variables=['fga_home', 'fg3a_home',
                                                     'fta_home', 'oreb_home',
                                                     'dreb_home', 'reb_ho...
                 YeoJohnsonTransformer(variables=['blk_home', 'oreb_away',
                                                  'fta_away', 'ast_home

Let's now see the importance of the different features according to our final model.

## Section 3: AdaBoost
We will now do a grid search with AdaBoost. After an initial search to determine a range, we will investigate more closely.

With AdaBoost, there are two types of parameters: parameters for AdaBoost itself and parameters for the weak learner it uses as a base estimator. The default base estimator is a decision tree.

AdaBoost has the following hyperparameters:
* `n_estimators`, the maximum number of estimators
* `learning_rate`, which weights the estimators
* `algorithm`, of which there are two choices

Decision tree classifiers have the following hyperparameters:
* `max_depth`
* `min_samples_split`
* `min_samples_leaf`

In order to get some early results, we will break the hyperparameter grid into smaller pieces along certain axes. We will first investigate how the hyperparameters of AdaBoost impact the models and then we will tune the hyperparameters of the inner Decision tree.
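
A rough count shows why splitting the grid helps. The sketch below compares the number of model fits needed for one joint grid against two staged grids, using 5-fold cross-validation as in `grid_search`. The axis sizes are hypothetical (six values per numeric axis, two algorithms) and `n_fits` is an illustrative helper.

```python
from math import prod


def n_fits(grid, cv=5):
    """Number of fits for an exhaustive grid: product of axis sizes times folds."""
    return prod(len(values) for values in grid.values()) * cv


# Hypothetical axis sizes; only the lengths matter for the count.
boost_axes = {
    "n_estimators": range(6),
    "learning_rate": range(6),
    "algorithm": range(2),
}
tree_axes = {
    "max_depth": range(6),
    "min_samples_split": range(6),
    "min_samples_leaf": range(6),
}

joint = n_fits({**boost_axes, **tree_axes})
staged = n_fits(boost_axes) + n_fits(tree_axes)
print(f"joint: {joint} fits, staged: {staged} fits")
```

The trade-off is that a staged search can miss interactions between the AdaBoost parameters and the tree parameters, since each stage holds the other group fixed.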

In [22]:
from src.utils import divide_range

thresholds = divide_range(0.55,0.95,5)

base_ada_params = [{'corr_selector__threshold': thresholds,
'model__n_estimators':[int(i) for i in divide_range(20,50,5)],
'model__learning_rate': divide_range(0.5,2,5),
'model__algorithm':['SAMME', 'SAMME.R']
}]


In [23]:
ada_pipe = create_pipe('AdaBoost')
base_ada_results_df_v1 = get_grid_results_df(ada_pipe,
                                             'base_ada_results_df_v1', dir, 
                                             param_grid=base_ada_params,
                                             verbosity=3)


Let's see what the top scores are.

In [24]:
present_score_counts(base_ada_results_df_v1)
best_score = score_stats(base_ada_results_df_v1)

---  Score Counts  ---
Precision: 0.8737315821516992, Accuracy: 0.8571224639110847
Count: 1

Precision: 0.8715598612273361, Accuracy: 0.8505849610338364
Count: 1

Precision: 0.871110904500443, Accuracy: 0.8542265339844135
Count: 1

Precision: 0.8702183448458646, Accuracy: 0.8482624954786097
Count: 1

---  Score Stats  ---
Most Common: Precision: -1
             Accuracy: -1
             Count: 72
Max Score: Precision: 0.8737315821516992
           Accuracy: 0.8571224639110847
           Count: 1
Max Precision: 0.8737315821516992
Max Accuracy: 0.8572946220117721


Let's look at the parameters associated with the top 5 scores.

In [25]:
from src.model_eval import collect_like_estimators

def top_n(results_df,num=5,exclude=None):
    estimators_by_score = collect_like_estimators(results_df)
    scores = list(estimators_by_score.keys())
    top = sorted(scores,reverse=True)[:num]
    for score in top:
        print(f"Score: {score}")
        present_param_counts(results_df, score, exclude)

top_n(base_ada_results_df_v1)

Score: (0.8737315821516992, 0.8571224639110847)
model__n_estimators: 50, Count: 1
model__learning_rate: 1.1, Count: 1
model__algorithm: SAMME.R, Count: 1
corr_selector__threshold: 0.79, Count: 1
Score: (0.8715598612273361, 0.8505849610338364)
model__n_estimators: 50, Count: 1
model__learning_rate: 1.4, Count: 1
model__algorithm: SAMME.R, Count: 1
corr_selector__threshold: 0.79, Count: 1
Score: (0.871110904500443, 0.8542265339844135)
model__n_estimators: 44, Count: 1
model__learning_rate: 1.1, Count: 1
model__algorithm: SAMME.R, Count: 1
corr_selector__threshold: 0.79, Count: 1
Score: (0.8702183448458646, 0.8482624954786097)
model__n_estimators: 44, Count: 1
model__learning_rate: 1.4, Count: 1
model__algorithm: SAMME.R, Count: 1
corr_selector__threshold: 0.79, Count: 1
Score: (0.8697431824167033, 0.8511012133767387)
model__n_estimators: 50, Count: 2
model__learning_rate: 1.1, Count: 2
model__algorithm: SAMME.R, Count: 2
corr_selector__threshold: 0.87, Count: 1
corr_selector__threshold: 

This gives us an idea of how to narrow down our hyperparameters.
* We will focus on a range of 40 to 70 estimators.
* We will use the algorithm `'SAMME.R'`.
* We will focus on learning rates from 0.9 to 1.6.
* We will focus on correlation thresholds from 0.77 to 0.95.

In [26]:
thresholds = divide_range(0.77,0.95,5)

base_ada_params_v2 = [{'corr_selector__threshold': thresholds,
'model__n_estimators':[int(i) for i in divide_range(40,70,5)],
'model__learning_rate': divide_range(0.9,1.6,5),
'model__algorithm':['SAMME.R']
}]

In [27]:
base_ada_results_df_v2 = get_grid_results_df(ada_pipe,'base_ada_results_df_v2',
                                             dir, param_grid=base_ada_params_v2,
                                             verbosity=3)

Let's see what the top scores are.

In [28]:
present_score_counts(base_ada_results_df_v2)
best_score = score_stats(base_ada_results_df_v2)

---  Score Counts  ---
Precision: 0.8781658063833186, Accuracy: 0.8609360469895762
Count: 2

Precision: 0.8778065356002938, Accuracy: 0.8612800919075335
Count: 2

Precision: 0.8776557534098555, Accuracy: 0.8600758524875867
Count: 2

Precision: 0.8774026088524034, Accuracy: 0.858470109499852
Count: 2

---  Score Stats  ---
Most Common: Precision: 0.8586301541127146
             Accuracy: 0.8456532134753871
             Count: 4
Max Score: Precision: 0.8781658063833186
           Accuracy: 0.8609360469895762
           Count: 2
Max Precision: 0.8781658063833186
Max Accuracy: 0.8616242354740061


Let's look at the parameters associated with the top 6 scores.

In [29]:
top_n(base_ada_results_df_v2,6)

Score: (0.8781658063833186, 0.8609360469895762)
model__n_estimators: 70, Count: 2
model__learning_rate: 1.18, Count: 2
model__algorithm: SAMME.R, Count: 2
corr_selector__threshold: 0.77, Count: 1
corr_selector__threshold: 0.806, Count: 1
Score: (0.8778065356002938, 0.8612800919075335)
model__n_estimators: 70, Count: 2
model__learning_rate: 1.04, Count: 2
model__algorithm: SAMME.R, Count: 2
corr_selector__threshold: 0.77, Count: 1
corr_selector__threshold: 0.806, Count: 1
Score: (0.8776557534098555, 0.8600758524875867)
model__n_estimators: 64, Count: 2
model__learning_rate: 1.18, Count: 2
model__algorithm: SAMME.R, Count: 2
corr_selector__threshold: 0.77, Count: 1
corr_selector__threshold: 0.806, Count: 1
Score: (0.8774026088524034, 0.858470109499852)
model__n_estimators: 70, Count: 2
model__learning_rate: 1.32, Count: 2
model__algorithm: SAMME.R, Count: 2
corr_selector__threshold: 0.77, Count: 1
corr_selector__threshold: 0.806, Count: 1
Score: (0.8762517429489284, 0.8603625374042286)
m

It seems that an increase in the number of estimators improves performance, so we will slide the window we look at up a bit. Unfortunately, this will increase the fit time. We will take the average of the correlation thresholds. We feel we didn't cast a wide enough net with respect to learning rate, so we will look at a larger window for this parameter as well.

In [30]:
base_ada_params_v3 = [{'corr_selector__threshold': [0.785],
'model__n_estimators':[int(i) for i in divide_range(60,90,5)],
'model__learning_rate': divide_range(1.1,5,5),
'model__algorithm':['SAMME.R']}]

base_ada_results_df_v3 = get_grid_results_df(ada_pipe,'base_ada_results_df_v3',
                                             dir, param_grid=base_ada_params_v3,
                                             verbosity=3)

In [31]:
present_score_counts(base_ada_results_df_v3)
best_score = score_stats(base_ada_results_df_v3)

---  Score Counts  ---
Precision: 0.8795073416589505, Accuracy: 0.8629717783039033
Count: 1

Precision: 0.8791597605984037, Accuracy: 0.8619968884614121
Count: 1

Precision: 0.8787641714687415, Accuracy: 0.8617101665515767
Count: 1

Precision: 0.8776698890961192, Accuracy: 0.8605919404162968
Count: 1

---  Score Stats  ---
Most Common: Precision: 0.4341568547564283
             Accuracy: 0.3400903949228898
             Count: 18
Max Score: Precision: 0.8795073416589505
           Accuracy: 0.8629717783039033
           Count: 1
Max Precision: 0.8795073416589505
Max Accuracy: 0.8629717783039033


The scores did improve, but not significantly. If the improvement is due to values for learning rate and estimators being outside of our earlier ranges, then we will have to continue searching.

In [32]:
top_n(base_ada_results_df_v3)

Score: (0.8795073416589505, 0.8629717783039033)
model__n_estimators: 90, Count: 1
model__learning_rate: 1.1, Count: 1
model__algorithm: SAMME.R, Count: 1
corr_selector__threshold: 0.785, Count: 1
Score: (0.8791597605984037, 0.8619968884614121)
model__n_estimators: 84, Count: 1
model__learning_rate: 1.1, Count: 1
model__algorithm: SAMME.R, Count: 1
corr_selector__threshold: 0.785, Count: 1
Score: (0.8787641714687415, 0.8617101665515767)
model__n_estimators: 78, Count: 1
model__learning_rate: 1.1, Count: 1
model__algorithm: SAMME.R, Count: 1
corr_selector__threshold: 0.785, Count: 1
Score: (0.8776698890961192, 0.8605919404162968)
model__n_estimators: 66, Count: 1
model__learning_rate: 1.1, Count: 1
model__algorithm: SAMME.R, Count: 1
corr_selector__threshold: 0.785, Count: 1
Score: (0.8775918870218895, 0.8609072950577094)
model__n_estimators: 72, Count: 1
model__learning_rate: 1.1, Count: 1
model__algorithm: SAMME.R, Count: 1
corr_selector__threshold: 0.785, Count: 1


So the number of estimators jumped a fair bit. In the case of the Logistic Regression model, we got a great many high-performing sets of parameters, which made it difficult to focus on a few small neighborhoods (in the parameter space) of high-performing sets of parameters. Here, the learning rate has stabilized, so we can focus on tuning it more carefully.

In [33]:
base_ada_params_v4 = [{'corr_selector__threshold': [0.785],
'model__n_estimators':[int(i) for i in divide_range(75,100,5)],
'model__learning_rate': divide_range(1.05,1.15),
'model__algorithm':['SAMME.R']}]

base_ada_results_df_v4 = get_grid_results_df(ada_pipe,'base_ada_results_df_v4',
                                             dir, param_grid=base_ada_params_v4,
                                             verbosity=3)
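The grid above is built with the notebook's `divide_range` helper, which is defined earlier and not repeated here. Judging only from values printed in this notebook (for example, `divide_range(2,10,3)` leads to `min_samples_split` values of 2, 4, 7 and 10), it appears to return evenly spaced points that include both endpoints. The following is a minimal sketch under that assumption; the name `divide_range_sketch` and the default of 4 divisions are our guesses, not the actual helper.

```python
def divide_range_sketch(start, stop, divisions=4):
    """Return divisions + 1 evenly spaced points from start to stop inclusive.

    Hypothetical stand-in for the notebook's divide_range helper.
    """
    if divisions < 1:
        raise ValueError("divisions must be at least 1")
    width = stop - start
    return [start + i * width / divisions for i in range(divisions + 1)]


# Integer grids are obtained by truncating, as in the cells of this notebook.
n_estimators_grid = [int(x) for x in divide_range_sketch(75, 100, 5)]
split_grid = [int(x) for x in divide_range_sketch(2, 10, 3)]
print(n_estimators_grid)
print(split_grid)
```

Truncating with `int` means the integer grid is not always evenly spaced, which is worth keeping in mind when reading the candidate values in the results.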

In [34]:
present_score_counts(base_ada_results_df_v4)
best_score = score_stats(base_ada_results_df_v4)

---  Score Counts  ---
Precision: 0.8814253559391133, Accuracy: 0.8642047285521686
Count: 1

Precision: 0.8813170270135815, Accuracy: 0.8640040363684193
Count: 1

Precision: 0.8810905251394029, Accuracy: 0.8635452550064121
Count: 1

Precision: 0.8809914508725031, Accuracy: 0.8642334681529709
Count: 1

---  Score Stats  ---
Most Common: Precision: 0.8778421580902369
             Accuracy: 0.8612513687481502
             Count: 1
Max Score: Precision: 0.8814253559391133
           Accuracy: 0.8642047285521686
           Count: 1
Max Precision: 0.8814253559391133
Max Accuracy: 0.8642334681529709
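
In the stats above, "Max Score" and "Max Accuracy" come from different candidates: the best-precision candidate has an accuracy of about 0.86420, while the best accuracy (about 0.86423) belongs to the fourth-ranked candidate. One reading, which is an assumption about how `score_stats` ranks results rather than something shown here, is that scores are compared as (precision, accuracy) pairs with precision first. Python tuples compare the same way, as this small sketch using the four printed scores illustrates.

```python
# (precision, accuracy) pairs copied from the v4 score counts above.
scores = [
    (0.8814253559391133, 0.8642047285521686),
    (0.8813170270135815, 0.8640040363684193),
    (0.8810905251394029, 0.8635452550064121),
    (0.8809914508725031, 0.8642334681529709),
]

# Tuples compare element by element, so precision decides first and
# accuracy only breaks exact ties in precision.
max_score = max(scores)
max_accuracy = max(acc for _, acc in scores)

print(max_score)
print(max_accuracy)
```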


In [35]:
top_n(base_ada_results_df_v4)

Score: (0.8814253559391133, 0.8642047285521686)
model__n_estimators: 100, Count: 1
model__learning_rate: 1.1, Count: 1
model__algorithm: SAMME.R, Count: 1
corr_selector__threshold: 0.785, Count: 1
Score: (0.8813170270135815, 0.8640040363684193)
model__n_estimators: 95, Count: 1
model__learning_rate: 1.05, Count: 1
model__algorithm: SAMME.R, Count: 1
corr_selector__threshold: 0.785, Count: 1
Score: (0.8810905251394029, 0.8635452550064121)
model__n_estimators: 95, Count: 1
model__learning_rate: 1.075, Count: 1
model__algorithm: SAMME.R, Count: 1
corr_selector__threshold: 0.785, Count: 1
Score: (0.8809914508725031, 0.8642334681529709)
model__n_estimators: 100, Count: 1
model__learning_rate: 1.15, Count: 1
model__algorithm: SAMME.R, Count: 1
corr_selector__threshold: 0.785, Count: 1
Score: (0.8809162905873291, 0.8638320426819244)
model__n_estimators: 95, Count: 1
model__learning_rate: 1.15, Count: 1
model__algorithm: SAMME.R, Count: 1
corr_selector__threshold: 0.785, Count: 1


We are closing in on a learning rate of 1.1. The best number of estimators sits at the top of our range (100), so we should slide that range higher. While we haven't settled on a number of estimators, we have at least narrowed down the learning rate, so the next grid only needs a few learning rate values.
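
The reasoning here, that a best value sitting on the edge of its search range suggests shifting the range, can be sketched as a small check. The function name `at_range_edge` is hypothetical; the numbers are the v4 search ranges and the best v4 result.

```python
def at_range_edge(best, low, high, tol=1e-9):
    """Return 'low', 'high', or None depending on where best sits in [low, high]."""
    if abs(best - low) <= tol:
        return "low"
    if abs(best - high) <= tol:
        return "high"
    return None


# Best v4 candidate: 100 estimators (range 75 to 100) and a learning rate
# of 1.1 (range 1.05 to 1.15).
print(at_range_edge(100, 75, 100))     # upper edge: extend the range upward
print(at_range_edge(1.1, 1.05, 1.15))  # interior: narrow the range around it
```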

In [50]:
base_ada_params_v5 = [{'corr_selector__threshold': [0.785],
'model__n_estimators':[int(i) for i in divide_range(95,125,10)],
'model__learning_rate': [1.08,1.1,1.12],
'model__algorithm':['SAMME.R']}]

base_ada_results_df_v5 = get_grid_results_df(ada_pipe,'base_ada_results_df_v5',
                                             dir, param_grid=base_ada_params_v5,
                                             verbosity=3)

### Beginning grid search for AdaBoost ###
Fitting 5 folds for each of 33 candidates, totalling 165 fits
[CV 2/5] END corr_selector__threshold=0.785, model__algorithm=SAMME.R, model__learning_rate=1.08, model__n_estimators=95; accuracy: (test=0.857) precision: (test=0.876) total time=  50.2s
[CV 1/5] END corr_selector__threshold=0.785, model__algorithm=SAMME.R, model__learning_rate=1.08, model__n_estimators=95; accuracy: (test=0.859) precision: (test=0.875) total time=  50.7s
[CV 4/5] END corr_selector__threshold=0.785, model__algorithm=SAMME.R, model__learning_rate=1.08, model__n_estimators=95; accuracy: (test=0.868) precision: (test=0.888) total time=  51.1s
[CV 2/5] END corr_selector__threshold=0.785, model__algorithm=SAMME.R, model__learning_rate=1.08, model__n_estimators=98; accuracy: (test=0.858) precision: (test=0.875) total time=  52.0s
[CV 4/5] END corr_selector__threshold=0.785, model__algorithm=SAMME.R, model__learning_rate=1.08, model__n_estimators=98; accuracy: (test=0.868
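
The candidate count reported in the log follows directly from the grid: a grid search tries every combination, so the count is the product of the list lengths. A minimal sketch, assuming `divide_range(95,125,10)` yields 11 values in steps of 3 (which is what 33 candidates over three learning rates implies):

```python
from math import prod

# Same shape as base_ada_params_v5; the n_estimators list is written out
# under the assumption that divide_range(95, 125, 10) gives steps of 3.
grid = {
    "corr_selector__threshold": [0.785],
    "model__n_estimators": list(range(95, 126, 3)),
    "model__learning_rate": [1.08, 1.1, 1.12],
    "model__algorithm": ["SAMME.R"],
}

n_candidates = prod(len(values) for values in grid.values())
n_folds = 5
print(n_candidates, n_candidates * n_folds)
```

At roughly 50 seconds per fit, as in the log lines above, this count is what drives the total run time of each pass.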

In [51]:
present_score_counts(base_ada_results_df_v5)
best_score = score_stats(base_ada_results_df_v5)
present_score_counts(base_ada_results_df_v4,2)
present_score_counts(base_ada_results_df_v5,2)

---  Score Counts  ---
Precision: 0.8815191703525593, Accuracy: 0.8645201818420967
Count: 1

Precision: 0.8815010258845912, Accuracy: 0.8643194773272829
Count: 1

Precision: 0.8814859714051714, Accuracy: 0.8645488556772222
Count: 1

Precision: 0.8814721006584776, Accuracy: 0.8646062074578277
Count: 1

---  Score Stats  ---
Most Common: Precision: 0.8802556012391856
             Accuracy: 0.8628284091282759
             Count: 1
Max Score: Precision: 0.8815191703525593
           Accuracy: 0.8645201818420967
           Count: 1
Max Precision: 0.8815191703525593
Max Accuracy: 0.8646062074578277
---  Score Counts  ---
Precision: 0.8814253559391133, Accuracy: 0.8642047285521686
Count: 1

Precision: 0.8813170270135815, Accuracy: 0.8640040363684193
Count: 1

---  Score Counts  ---
Precision: 0.8815191703525593, Accuracy: 0.8645201818420967
Count: 1

Precision: 0.8815010258845912, Accuracy: 0.8643194773272829
Count: 1



The performance is about the same. Let's look at the parameters.
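
To put "about the same" in numbers, the sketch below compares the best v4 and v5 scores and sets the gain against the fold-to-fold spread visible in the log above. Only three fold precisions for one candidate survive in the truncated log, so the spread is a rough indication, not a proper estimate.

```python
# Best (precision, accuracy) from v4 and v5, copied from the outputs above.
best_v4 = (0.8814253559391133, 0.8642047285521686)
best_v5 = (0.8815191703525593, 0.8645201818420967)

precision_gain = best_v5[0] - best_v4[0]
accuracy_gain = best_v5[1] - best_v4[1]

# Per-fold precision for learning_rate=1.08, n_estimators=95 (folds 2, 1, 4).
fold_precisions = [0.876, 0.875, 0.888]
fold_spread = max(fold_precisions) - min(fold_precisions)

print(f"precision gain: {precision_gain:.6f}")
print(f"accuracy gain:  {accuracy_gain:.6f}")
print(f"fold spread:    {fold_spread:.3f}")
```

The gain in precision is roughly two orders of magnitude smaller than the spread between folds, which supports treating the two searches as equivalent.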

In [52]:
standard_exclude = ['model__base_estimator',
           'corr_selector__threshold','model__algorithm']
top_n(base_ada_results_df_v5,exclude=standard_exclude)


Score: (0.8815191703525593, 0.8645201818420967)
model__n_estimators: 125, Count: 1
model__learning_rate: 1.12, Count: 1
Score: (0.8815010258845912, 0.8643194773272829)
model__n_estimators: 125, Count: 1
model__learning_rate: 1.08, Count: 1
Score: (0.8814859714051714, 0.8645488556772222)
model__n_estimators: 122, Count: 1
model__learning_rate: 1.12, Count: 1
Score: (0.8814721006584776, 0.8646062074578277)
model__n_estimators: 119, Count: 1
model__learning_rate: 1.12, Count: 1
Score: (0.8812508989725275, 0.86440548239124)
model__n_estimators: 113, Count: 1
model__learning_rate: 1.1, Count: 1


We will use 125 estimators and a learning rate of 1.12. Next we tune the hyperparameters specific to the decision tree base estimator.

In [36]:
from sklearn.tree import DecisionTreeClassifier


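# Fixed AdaBoost settings for this pass. Note that these are 85 estimators and
# a learning rate of 1.1, not the 125 and 1.12 chosen in the text above.
ada_params = {'corr_selector__threshold':[0.785],
            'model__n_estimators': [85],
            'model__learning_rate': [1.1],
            'model__algorithm':['SAMME.R']}
# Start with a single depth; a wider alternative is
# [int(i) for i in divide_range(5,30,3)]+[None]
depths = [5]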
dt_params = {'max_depth': depths,
              'min_samples_split': [int(i) for i in divide_range(2,10,3)],
              'min_samples_leaf': [int(i) for i in divide_range(1,5,3)]}
new_dt_params = {"model__base_estimator__"+key:value 
                 for key,value in dt_params.items()}

ada_params.update(new_dt_params)
ada_params['model__base_estimator'] = [DecisionTreeClassifier(random_state=42)]
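
The `model__base_estimator__` prefix follows scikit-learn's convention of joining nested parameter names with double underscores: `model` is the pipeline step, `base_estimator` is the AdaBoost parameter holding the tree, and the last part is the tree's own parameter. Below is a small pure-Python sketch of the prefixing step, using the values that appear in this notebook's results; the helper name `with_prefix` is ours. In scikit-learn 1.2 and later the AdaBoost parameter is named `estimator`, so the prefix would presumably become `model__estimator__` there.

```python
def with_prefix(params, prefix):
    """Return a copy of params with prefix prepended to every key."""
    return {prefix + key: value for key, value in params.items()}


dt_params = {
    "max_depth": [5],
    "min_samples_split": [2, 4, 7, 10],
    "min_samples_leaf": [1, 2, 3, 5],
}
nested = with_prefix(dt_params, "model__base_estimator__")

for name in nested:
    # Each "__" descends one level: pipeline step, then estimator parameter.
    print(name.split("__"))
```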

In [37]:
ada_dt_results_df_v1 = get_grid_results_df(ada_pipe,'ada_dt_results_df_v1',
                                             dir, param_grid=ada_params,
                                             verbosity=3)


In [39]:
present_score_counts(ada_dt_results_df_v1)
best_score = score_stats(ada_dt_results_df_v1)

---  Score Counts  ---
Precision: 0.8637159009070066, Accuracy: 0.836162408339088
Count: 4

Precision: 0.8630923358743281, Accuracy: 0.8356176054717043
Count: 1

Precision: 0.8627457251289586, Accuracy: 0.836678492157443
Count: 1

Precision: 0.862013626646332, Accuracy: 0.83487224606228
Count: 2

---  Score Stats  ---
Most Common: Precision: 0.8637159009070066
             Accuracy: 0.836162408339088
             Count: 4
Max Score: Precision: 0.8637159009070066
           Accuracy: 0.836162408339088
           Count: 4
Max Precision: 0.8637159009070066
Max Accuracy: 0.836678492157443


Four parameter sets share the best score. Since we have already fixed several hyperparameters, we exclude them from the counts.

In [40]:
standard_exclude = ['model__base_estimator',
           'model__n_estimators','corr_selector__threshold',
           'model__learning_rate','model__algorithm']
exclude = standard_exclude + ['model__base_estimator__max_depth']
present_param_counts(ada_dt_results_df_v1, best_score, exclude)

model__base_estimator__min_samples_split: 2, Count: 1
model__base_estimator__min_samples_split: 4, Count: 1
model__base_estimator__min_samples_split: 7, Count: 1
model__base_estimator__min_samples_split: 10, Count: 1
model__base_estimator__min_samples_leaf: 5, Count: 4


So far we have only learned that a `min_samples_leaf` of 5 performs well, regardless of `min_samples_split`. Let's try some other values for `max_depth` and see how the results compare.
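
A likely explanation for the four-way tie: as we understand scikit-learn's decision trees, a node is split only if it holds at least `min_samples_split` samples and each child would keep at least `min_samples_leaf`. A node therefore needs at least `2 * min_samples_leaf` samples to split, and any `min_samples_split` at or below that threshold has no effect. The sketch below (the helper name is ours) groups the grid by its effective settings.

```python
def effective_setting(min_samples_split, min_samples_leaf):
    """Collapse (split, leaf) pairs that constrain a tree identically."""
    return (max(min_samples_split, 2 * min_samples_leaf), min_samples_leaf)


splits = [2, 4, 7, 10]
leaves = [1, 2, 3, 5]

groups = {}
for leaf in leaves:
    for split in splits:
        groups.setdefault(effective_setting(split, leaf), []).append((split, leaf))

# With min_samples_leaf=5 all four split values collapse into one group.
print(groups[(10, 5)])
print(len(groups), "distinct settings out of", len(splits) * len(leaves))
```

This is consistent with the score counts above, where one score is shared by four candidates and another by two. It also suggests that part of the grid is redundant and could be skipped to save training time.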

In [41]:
ada_params["model__base_estimator__max_depth"] = [10,20]
ada_dt_results_df_v1_10_20 = get_grid_results_df(ada_pipe,'ada_dt_results_df_v1_10_20',
                                             dir, param_grid=ada_params,
                                             verbosity=3)


In [42]:
present_score_counts(ada_dt_results_df_v1_10_20)
best_score = score_stats(ada_dt_results_df_v1_10_20)

---  Score Counts  ---
Precision: 0.8646618680321028, Accuracy: 0.854255142053862
Count: 1

Precision: 0.8633051747789265, Accuracy: 0.855832153661504
Count: 1

Precision: 0.8627878490164538, Accuracy: 0.8566637853079477
Count: 1

Precision: 0.8622545494609278, Accuracy: 0.8566637647561738
Count: 1

---  Score Stats  ---
Most Common: Precision: 0.8573080662106897
             Accuracy: 0.8430727491697084
             Count: 4
Max Score: Precision: 0.8646618680321028
           Accuracy: 0.854255142053862
           Count: 1
Max Precision: 0.8646618680321028
Max Accuracy: 0.8571798280227549


The top performers in this grid search outperformed the top performers with `max_depth` set to 5, most visibly in accuracy (about 0.854 versus 0.836). So we will drop that value, even though deeper trees take longer to train.

In [43]:
top_n(ada_dt_results_df_v1_10_20, exclude=standard_exclude)

Score: (0.8646618680321028, 0.854255142053862)
model__base_estimator__min_samples_split: 2, Count: 1
model__base_estimator__min_samples_leaf: 1, Count: 1
model__base_estimator__max_depth: 20, Count: 1
Score: (0.8633051747789265, 0.855832153661504)
model__base_estimator__min_samples_split: 4, Count: 1
model__base_estimator__min_samples_leaf: 1, Count: 1
model__base_estimator__max_depth: 20, Count: 1
Score: (0.8627878490164538, 0.8566637853079477)
model__base_estimator__min_samples_split: 7, Count: 1
model__base_estimator__min_samples_leaf: 1, Count: 1
model__base_estimator__max_depth: 20, Count: 1
Score: (0.8622545494609278, 0.8566637647561738)
model__base_estimator__min_samples_split: 7, Count: 1
model__base_estimator__min_samples_leaf: 3, Count: 1
model__base_estimator__max_depth: 20, Count: 1
Score: (0.861991008224668, 0.8558895465456577)
model__base_estimator__min_samples_split: 2, Count: 1
model__base_estimator__min_samples_split: 4, Count: 1
model__base_estimator__min_samples_leaf

For the next pass, we will fix `max_depth` at 20 and see whether we can learn anything more about what the `min_samples_leaf` and `min_samples_split` parameters should be.

In [44]:
ada_params['model__base_estimator__min_samples_leaf'] = [1,2,3,5]
ada_params['model__base_estimator__min_samples_split'] = [2]  # 4, 7 and 10 are searched separately below
ada_params['model__base_estimator__max_depth'] = [20]

ada_dt_results_df_v2_20_2 = get_grid_results_df(ada_pipe,'ada_dt_results_df_v2_20_2',
                                             dir, param_grid=ada_params,
                                             verbosity=3)

ada_params['model__base_estimator__min_samples_split'] = [4]

ada_dt_results_df_v2_20_4 = get_grid_results_df(ada_pipe,'ada_dt_results_df_v2_20_4',
                                             dir, param_grid=ada_params,
                                             verbosity=3)

ada_params['model__base_estimator__min_samples_split'] = [7]

ada_dt_results_df_v2_20_7 = get_grid_results_df(ada_pipe,'ada_dt_results_df_v2_20_7',
                                             dir, param_grid=ada_params,
                                             verbosity=3)

ada_params['model__base_estimator__min_samples_split'] = [10]

ada_dt_results_df_v2_20_10 = get_grid_results_df(ada_pipe,'ada_dt_results_df_v2_20_10',
                                             dir, param_grid=ada_params,
                                             verbosity=3)


In [45]:
# min_samples_split = 2
present_score_counts(ada_dt_results_df_v2_20_2)
best_score = score_stats(ada_dt_results_df_v2_20_2)

top_n(ada_dt_results_df_v2_20_2, exclude=exclude)

---  Score Counts  ---
Precision: 0.8641704889968495, Accuracy: 0.8496101575087962
Count: 1

Precision: 0.8637161315700354, Accuracy: 0.8525061367597251
Count: 1

Precision: 0.8634434325246634, Accuracy: 0.8526495059353524
Count: 1

Precision: 0.8624377086686301, Accuracy: 0.8500689470915128
Count: 1

---  Score Stats  ---
Most Common: Precision: 0.8637161315700354
             Accuracy: 0.8525061367597251
             Count: 1
Max Score: Precision: 0.8641704889968495
           Accuracy: 0.8496101575087962
           Count: 1
Max Precision: 0.8641704889968495
Max Accuracy: 0.8526495059353524
Score: (0.8641704889968495, 0.8496101575087962)
model__base_estimator__min_samples_split: 2, Count: 1
model__base_estimator__min_samples_leaf: 5, Count: 1
Score: (0.8637161315700354, 0.8525061367597251)
model__base_estimator__min_samples_split: 2, Count: 1
model__base_estimator__min_samples_leaf: 1, Count: 1
Score: (0.8634434325246634, 0.8526495059353524)
model__base_estimator__min_samples_split: 

In [None]:
# min_samples_split = 4
present_score_counts(ada_dt_results_df_v2_20_4)
best_score = score_stats(ada_dt_results_df_v2_20_4)

---  Score Counts  ---
Precision: 0.8645257853006113, Accuracy: 0.8530796751175561
Count: 1

Precision: 0.8641704889968495, Accuracy: 0.8496101575087962
Count: 1

Precision: 0.8634434325246634, Accuracy: 0.8526495059353524
Count: 1

Precision: 0.8624377086686301, Accuracy: 0.8500689470915128
Count: 1

---  Score Stats  ---
Most Common: Precision: 0.8645257853006113
             Accuracy: 0.8530796751175561
             Count: 1
Max Score: Precision: 0.8645257853006113
           Accuracy: 0.8530796751175561
           Count: 1
Max Precision: 0.8645257853006113
Max Accuracy: 0.8530796751175561


Training time grows rapidly with the `max_depth` parameter of the decision trees. We will stick to a few relatively small values for this parameter while we continue to tune the others.

In order to get intermediate results, we will train the AdaBoost model with different base estimators separately.