## Hoe to improve our models

Most of the time, the first model you trains on a task you not be the last, but it is useful nonethless as a baseline which we will strive to improve on, ussualy this can be done in a feew ways

1. Collect more data
2. Improve our data (AKA Feature Engineering)
3. Select a better model
4. Tweak our model's hyperparameters

We will focus on 3 and 4 for now

Note: Hyperparameters are used by us to try and improve our model, they are diffrente from parameters which are learned by the model during training and used during the prediction process

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
heart_disease = pd.read_csv("resources/heart-disease.csv")
x = heart_disease.drop("target", axis=1)
y = heart_disease["target"]

In [3]:
from sklearn.ensemble import RandomForestClassifier
rfr = RandomForestClassifier()

In [4]:
# Lets see what are our hyperparameters for a RandomForestClassifiers
rfr.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

Now that we have a baseline model and score (accuracy in this case) we can tune our hyperparameters to see if we can do better, there are 3 ways to do that

1. Trial-and-error
2. Random Search with RandomSearchCV
3. Exhaustive Search with GridSearchCV

In any case, hyperparameter tuning introduces another dataset, the validation set, which is separated from the training and test sets

In [5]:
np.random.seed(42)
# Shuffle the data so we get random intances in each split
shuffled = heart_disease.sample(frac=1)

# Split x and y
x = heart_disease.drop("target", axis=1)
y = heart_disease["target"]

# Compute training and validation split points
train_split = int(0.7 * len(x))
val_split = train_split + int(0.15 * train_split)

# Splite the data
x_train, y_train = x[:train_split], y[:train_split]
x_val, y_val = x[train_split:val_split], y[train_split: val_split]
x_test, y_test = x[val_split:], y[val_split]

In [6]:
# Train and score the baseline classificator
baseline = RandomForestClassifier()
baseline.fit(x_train, y_train)
baseline.score(x_val, y_val)

0.7419354838709677

In [7]:
# Traing a classificator with the hyperparameters tuned by hand
by_hand = RandomForestClassifier(max_depth=10)
by_hand.fit(x_train, y_train)
by_hand.score(x_val, y_val)

0.7419354838709677

In this case, changing the hyperparameters did not improve the model's accuracy, also, tuning each hyperparameter by hand looking for the optimal combination is a lot of work, of course Sklearn can help us tune our models!

In [8]:
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# RandomSearchCV creates our validation tests by itself, no need to do it by hand
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size=0.3)

# the grid of possible values to choose from for each hyperparameter
# note that the dict keys match the hypeparameter names from .get_params()
grid = {
    'n_estimators': [10, 100, 200, 500, 1000, 1200],
    'max_depth': [None, 5, 10, 20, 30] ,
    'max_features': ["auto", "sqrt"],
    'min_samples_split': [2, 4, 6],
    'min_samples_leaf' : [1,2,4]
}

np.random.seed(42)

# RandomSearchCV will try random combinations of hyperparameters
random_cv = RandomizedSearchCV(RandomForestClassifier(n_jobs=-1), 
                               grid, 
                               cv=5, 
                               n_iter=10)
# Fit the model
random_cv.fit(x_train, y_train)

RandomizedSearchCV(cv=5, estimator=RandomForestClassifier(n_jobs=-1),
                   param_distributions={'max_depth': [None, 5, 10, 20, 30],
                                        'max_features': ['auto', 'sqrt'],
                                        'min_samples_leaf': [1, 2, 4],
                                        'min_samples_split': [2, 4, 6],
                                        'n_estimators': [10, 100, 200, 500,
                                                         1000, 1200]})

In [9]:
# Lets take a look at the best combination found
random_cv.best_params_

{'n_estimators': 1000,
 'min_samples_split': 2,
 'min_samples_leaf': 4,
 'max_features': 'sqrt',
 'max_depth': 20}

In [10]:
# And we can use the best estimator
random_cv.best_estimator_.score(x_test, y_test)

0.8131868131868132

As we can see, RandomSearchCV found a better set of hyperparameter and improved our model,
still there is one other way to optimize your hyperparameters: Grid Search

In [15]:
# Grid Search will try every possible combination of parameters from our grid
from sklearn.model_selection import GridSearchCV

# Since it does an exhaustive search, it will may take a long time
# specially if we have a really big dataset or Parameter Grid
grid_2 = {
    'n_estimators': [10, 100, 200],
    'max_depth': [None, 5, 10] ,
    'max_features': ["auto", "sqrt"],
    'min_samples_split': [2, 4],
    'min_samples_leaf' : [1, 2]
}

grid_cv = GridSearchCV(RandomForestClassifier(), grid_2, cv=5)

grid_cv.fit(x_train, y_train)

GridSearchCV(cv=5, estimator=RandomForestClassifier(),
             param_grid={'max_depth': [None, 5, 10],
                         'max_features': ['auto', 'sqrt'],
                         'min_samples_leaf': [1, 2],
                         'min_samples_split': [2, 4],
                         'n_estimators': [10, 100, 200]})

In [16]:
grid_cv.best_params_

{'max_depth': 5,
 'max_features': 'auto',
 'min_samples_leaf': 2,
 'min_samples_split': 2,
 'n_estimators': 10}

In [17]:
grid_cv.best_estimator_.score(x_test, y_test)

0.8241758241758241

Again, the GridSearchCV shows some improvement over the baseline model