# Build Grid Search functions

In data science it is worth building algorithms, models and processes 'from scratch' at least once, so you really understand what is happening at a deeper level. Of course there are mature packages and libraries for this work (and we will get to them very soon!), but building from scratch first makes it much easier to use those tools well.

In this exercise, you will create a function to take in 2 hyperparameters, build models and return results. You will use this function in a future exercise.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the credit card dataset
df = pd.read_csv("dataset/credit-card-full.csv")

# Separate the target column from the features
y = df['default payment next month']
X = df.drop('default payment next month', axis=1)

# Hold out 20% of the rows as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)

In [2]:
from sklearn.metrics import accuracy_score
from sklearn.ensemble import GradientBoostingClassifier

# Create the function
def gbm_grid_search(learning_rate, max_depth):

    # Create the model
    model = GradientBoostingClassifier(learning_rate=learning_rate, max_depth=max_depth)

    # Fit on the training set and make predictions on the test set
    predictions = model.fit(X_train, y_train).predict(X_test)

    # Return the hyperparameters and score
    return [learning_rate, max_depth, accuracy_score(y_test, predictions)]

# Iteratively tune multiple hyperparameters

In this exercise, you will build on the function you created previously, which takes in 2 hyperparameters, builds a model and returns the results. You will first use it to loop through some values, then extend both the function and the loop with another hyperparameter.

In [3]:
# Create the relevant lists
results_list = []
learn_rate_list = [0.01, 0.1, 0.5]
max_depth_list = [2, 4, 6]

# Create the for loop
for learn_rate in learn_rate_list:
    for max_depth in max_depth_list:
        results_list.append(gbm_grid_search(learn_rate, max_depth))

# Print the results
print(results_list)

[[0.01, 2, 0.8146666666666667], [0.01, 4, 0.811], [0.01, 6, 0.8068333333333333], [0.1, 2, 0.8191666666666667], [0.1, 4, 0.8178333333333333], [0.1, 6, 0.8166666666666667], [0.5, 2, 0.818], [0.5, 4, 0.8025], [0.5, 6, 0.789]]
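The nested list is hard to scan by eye. As a minimal sketch, the results printed above can be ranked by accuracy with plain Python; the values below are copied from that output, and the variable names `ranked`, `best_learn_rate`, `best_max_depth` and `best_accuracy` are only illustrative.

```python
# Results copied from the printed output above: [learning_rate, max_depth, accuracy]
results_list = [
    [0.01, 2, 0.8146666666666667], [0.01, 4, 0.811], [0.01, 6, 0.8068333333333333],
    [0.1, 2, 0.8191666666666667], [0.1, 4, 0.8178333333333333], [0.1, 6, 0.8166666666666667],
    [0.5, 2, 0.818], [0.5, 4, 0.8025], [0.5, 6, 0.789],
]

# Rank the combinations from highest to lowest accuracy (index 2 of each entry)
ranked = sorted(results_list, key=lambda row: row[2], reverse=True)

# The first entry is the best-scoring combination on this test set
best_learn_rate, best_max_depth, best_accuracy = ranked[0]
print(f"Best: learning_rate={best_learn_rate}, max_depth={best_max_depth}, "
      f"accuracy={best_accuracy:.4f}")
```

On this run the shallowest trees with the middle learning rate score highest, and the deepest trees with the largest learning rate score lowest.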


In [4]:
results_list = []
learn_rate_list = [0.01, 0.1, 0.5]
max_depth_list = [2, 4, 6]

# Extend the function input
def gbm_grid_search_extended(learn_rate, max_depth, subsample):

    # Extend the model creation section
    model = GradientBoostingClassifier(learning_rate=learn_rate, max_depth=max_depth, subsample=subsample)

    predictions = model.fit(X_train, y_train).predict(X_test)

    # Extend the return part with the new hyperparameter
    return [learn_rate, max_depth, subsample, accuracy_score(y_test, predictions)]

In [5]:
results_list = []

# Create the new list to test
subsample_list = [0.4, 0.6]

for learn_rate in learn_rate_list:
    for max_depth in max_depth_list:

        # Extend the for loop
        for subsample in subsample_list:

            # Extend the results to include the new hyperparameter
            results_list.append(gbm_grid_search_extended(learn_rate, max_depth, subsample))

# Print results
print(results_list)

[[0.01, 2, 0.4, 0.812], [0.01, 2, 0.6, 0.8135], [0.01, 4, 0.4, 0.8118333333333333], [0.01, 4, 0.6, 0.8121666666666667], [0.01, 6, 0.4, 0.8095], [0.01, 6, 0.6, 0.8098333333333333], [0.1, 2, 0.4, 0.8188333333333333], [0.1, 2, 0.6, 0.8195], [0.1, 4, 0.4, 0.8156666666666667], [0.1, 4, 0.6, 0.8168333333333333], [0.1, 6, 0.4, 0.8121666666666667], [0.1, 6, 0.6, 0.8196666666666667], [0.5, 2, 0.4, 0.8173333333333334], [0.5, 2, 0.6, 0.8155], [0.5

# How Many Models?

Each hyperparameter or value you add increases the number of models created, and the growth is multiplicative rather than linear: the total is the product of the number of values tried for each hyperparameter.

How many models would be created when running a grid search over the following hyperparameters and values for a GBM algorithm?


In [6]:
learning_rate = [0.001, 0.01, 0.05, 0.1, 0.2, 0.3, 0.5, 1, 2]
max_depth = [4, 6, 8, 10, 12, 14, 16, 18, 20]
subsample = [0.4, 0.6, 0.7, 0.8, 0.9]
max_features = ['auto', 'sqrt', 'log2']

print(len(learning_rate)*len(max_depth)*len(subsample)*len(max_features))

1215
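The same count can be reached by enumerating every combination, which is essentially what a grid search does. The sketch below uses only the standard library; the `cv_folds` value is an assumption added for illustration, to show that cross-validation multiplies the number of fits again.

```python
from itertools import product
from math import prod

# The same hyperparameter values as in the cell above
param_values = {
    'learning_rate': [0.001, 0.01, 0.05, 0.1, 0.2, 0.3, 0.5, 1, 2],
    'max_depth': [4, 6, 8, 10, 12, 14, 16, 18, 20],
    'subsample': [0.4, 0.6, 0.7, 0.8, 0.9],
    'max_features': ['auto', 'sqrt', 'log2'],
}

# The grid size is the product of the list lengths: 9 * 9 * 5 * 3
n_models = prod(len(values) for values in param_values.values())

# Enumerating every combination explicitly gives the same count
combinations = list(product(*param_values.values()))

# Assumed for illustration: with k-fold cross-validation each combination is fitted k times
cv_folds = 5
n_fits = n_models * cv_folds

print(n_models, len(combinations), n_fits)
```

Adding a single extra `subsample` value would add 9 * 9 * 3 = 243 more models, which is why large grids become expensive so quickly.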


# GridSearchCV inputs

Let's test your knowledge of `GridSearchCV` inputs by answering the question below.

Three `GridSearchCV` objects are available in the console, named `model1`, `model2` and `model3`. Note that there is no data available to fit these models. Instead, you must answer by looking at how they are constructed.

Which of these `GridSearchCV` objects would not work when we try to fit it?

In [7]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier


model1 = GridSearchCV(
    estimator=RandomForestClassifier(),
    param_grid={'max_depth': [2, 4, 8, 15], 'max_features': ['auto', 'sqrt']},
    scoring='roc_auc',
    n_jobs=4,
    cv=5,
    refit=True,
    return_train_score=True)

model2 = GridSearchCV(
    estimator=KNeighborsClassifier(),
    param_grid={'n_neighbors': [5, 10, 20], 'algorithm': ['ball_tree', 'brute']},
    scoring='accuracy',
    n_jobs=8,
    cv=10,
    refit=False)

model3 = GridSearchCV(
    estimator=GradientBoostingClassifier(),
    param_grid={'number_attempts': [2, 4, 6], 'max_depth': [3, 6, 9, 12]},
    scoring='accuracy',
    n_jobs=2,
    cv=7,
    refit=True)


In [8]:
model1.fit(X_train, y_train)

20 fits failed out of a total of 40.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
14 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\88016\AppData\Roaming\Python\Python38\site-packages\sklearn\model_selection\_validation.py", line 729, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\88016\AppData\Roaming\Python\Python38\site-packages\sklearn\base.py", line 1145, in wrapper
    estimator._validate_params()
  File "C:\Users\88016\AppData\Roaming\Python\Python38\site-packages\sklearn\base.py", line 638, in _validate_params
    validate_parameter_constraints(
  File "C:\Users\88016\AppData\Roaming\Python\Python38\site-packages\sklearn\utils\_param_validation.py

In [9]:
model2.fit(X_train, y_train)


In [10]:
model3.fit(X_train, y_train)


ValueError: Invalid parameter 'number_attempts' for estimator GradientBoostingClassifier(). Valid parameters are: ['ccp_alpha', 'criterion', 'init', 'learning_rate', 'loss', 'max_depth', 'max_features', 'max_leaf_nodes', 'min_impurity_decrease', 'min_samples_leaf', 'min_samples_split', 'min_weight_fraction_leaf', 'n_estimators', 'n_iter_no_change', 'random_state', 'subsample', 'tol', 'validation_fraction', 'verbose', 'warm_start'].
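The error above comes from a key in `param_grid` that the estimator does not recognise. As a rough sketch, such keys can be caught before fitting by comparing the grid against the estimator's `get_params()`; the helper name `unknown_grid_keys` is hypothetical and not part of scikit-learn. Value-level problems, such as the failed fits reported for `model1`, only surface at fit time and are not covered by this check.

```python
from sklearn.ensemble import GradientBoostingClassifier

def unknown_grid_keys(estimator, param_grid):
    """Return the param_grid keys the estimator does not accept (hypothetical helper)."""
    valid_names = set(estimator.get_params().keys())
    return sorted(set(param_grid) - valid_names)

# The same grid that was passed to model3 above
param_grid = {'number_attempts': [2, 4, 6], 'max_depth': [3, 6, 9, 12]}

bad_keys = unknown_grid_keys(GradientBoostingClassifier(), param_grid)
print(bad_keys)
```

An empty list means every key is a valid hyperparameter name for that estimator, although the values themselves may still be rejected during fitting.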

# GridSearchCV with Scikit Learn

The `GridSearchCV` class from scikit-learn provides many useful features for running a grid search efficiently. You will now put your learning into practice by creating a `GridSearchCV` object with certain parameters.

In [13]:
from sklearn.metrics import roc_auc_score

# Create a Random Forest Classifier with specified criterion
rf_class = RandomForestClassifier(criterion='entropy')

# Create the parameter grid
param_grid = {'max_depth': [2, 4, 8, 15]}  # , 'max_features': ['auto', 'sqrt']}

# Create a GridSearchCV object
# Note: scoring normally takes a scorer name such as 'roc_auc'; a bare metric
# function like roc_auc_score does not have the scorer signature (estimator, X, y)
grid_rf_class = GridSearchCV(
    estimator=rf_class,
    param_grid=param_grid,
    scoring=roc_auc_score,
    n_jobs=4,
    cv=2,
    refit=True,
    return_train_score=True)
print(grid_rf_class)

GridSearchCV(cv=2, estimator=RandomForestClassifier(criterion='entropy'),
             n_jobs=4, param_grid={'max_depth': [2, 4, 8, 15]},
             return_train_score=True,
             scoring=<function roc_auc_score at 0x0000027F2E841700>)
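The object above passes the metric function `roc_auc_score` directly as `scoring`. `GridSearchCV` expects either the name of a built-in scorer or a callable with the signature `scorer(estimator, X, y)`, so a bare metric function is likely to fail on every evaluation and be recorded as `NaN`. The sketch below shows the string form, `scoring='roc_auc'`, on a small synthetic dataset; the data, the reduced `n_estimators` and the `_demo` names are assumptions made to keep the example self-contained and fast.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption: not the credit card dataset used above)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

grid_demo = GridSearchCV(
    estimator=RandomForestClassifier(criterion='entropy', n_estimators=20, random_state=0),
    param_grid={'max_depth': [2, 4]},
    scoring='roc_auc',  # string name of a built-in scorer
    cv=2,
    refit=True,
    return_train_score=True)
grid_demo.fit(X_demo, y_demo)

# With a valid scorer, every parameter setting gets a numeric mean test score
mean_scores = grid_demo.cv_results_['mean_test_score']
print(mean_scores)
```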


# Using the best outputs

Which of the following parameters must be set in order to be able to directly use the `best_estimator_` property for predictions?

- `refit = True`
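A small sketch of why `refit` matters: with `refit=False` the search still records scores and the best parameters, but it never retrains a final model, so `best_estimator_` is not set. The synthetic data and the decision tree are assumptions chosen only to keep the example quick.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption for illustration)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
param_grid = {'max_depth': [2, 4]}

with_refit = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=3, refit=True)
without_refit = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=3, refit=False)
with_refit.fit(X_demo, y_demo)
without_refit.fit(X_demo, y_demo)

# best_estimator_ only exists when the best model was refitted on the full training data
has_best_with = hasattr(with_refit, 'best_estimator_')
has_best_without = hasattr(without_refit, 'best_estimator_')
print(has_best_with, has_best_without)

# The best parameters are still recorded for a single-metric search without refit
print(without_refit.best_params_)
```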

# Exploring the grid search results

You will now explore the `cv_results_` property of the `GridSearchCV` object created above. It is a dictionary that can be read into a pandas DataFrame, and it contains a lot of useful information about the grid search we just undertook.



In [14]:
grid_rf_class.fit(X_train, y_train)



In [15]:
# Read the cv_results property into a dataframe & print it out
cv_results_df = pd.DataFrame(grid_rf_class.cv_results_)
print(cv_results_df)

   mean_fit_time  std_fit_time  mean_score_time  std_score_time  \
0       1.391439      0.006984         0.000780        0.000234   
1       2.450631      0.009475         0.000957        0.000024   
2       4.517208      0.029454         0.000706        0.000454   
3       7.498887      0.014232         0.000997        0.000000   

  param_max_depth             params  split0_test_score  split1_test_score  \
0               2   {'max_depth': 2}                NaN                NaN   
1               4   {'max_depth': 4}                NaN                NaN   
2               8   {'max_depth': 8}                NaN                NaN   
3              15  {'max_depth': 15}                NaN                NaN   

   mean_test_score  std_test_score  rank_test_score  split0_train_score  \
0              NaN             NaN                1                 NaN   
1              NaN             NaN                1                 NaN   
2              NaN             NaN              

In [23]:


# Extract and print the column with a dictionary of hyperparameters used
column = cv_results_df.loc[:,'params']
print(column)



0     {'max_depth': 2}
1     {'max_depth': 4}
2     {'max_depth': 8}
3    {'max_depth': 15}
Name: params, dtype: object


In [16]:
# Extract and print the row that had the best mean test score
best_row = cv_results_df[cv_results_df['rank_test_score'] == 1 ]
print(best_row)

   mean_fit_time  std_fit_time  mean_score_time  std_score_time  \
0       1.391439      0.006984         0.000780        0.000234   
1       2.450631      0.009475         0.000957        0.000024   
2       4.517208      0.029454         0.000706        0.000454   
3       7.498887      0.014232         0.000997        0.000000   

  param_max_depth             params  split0_test_score  split1_test_score  \
0               2   {'max_depth': 2}                NaN                NaN   
1               4   {'max_depth': 4}                NaN                NaN   
2               8   {'max_depth': 8}                NaN                NaN   
3              15  {'max_depth': 15}                NaN                NaN   

   mean_test_score  std_test_score  rank_test_score  split0_train_score  \
0              NaN             NaN                1                 NaN   
1              NaN             NaN                1                 NaN   
2              NaN             NaN              

# Analyzing the best results

At the end of the day, we primarily care about the best performing 'square' in a grid search. Luckily, scikit-learn's `GridSearchCV` objects have a number of attributes that provide key information on just the best square (or row in `cv_results_`).

In [33]:
# Print out the ROC_AUC score from the best-performing square
best_score = grid_rf_class.best_score_
print(best_score)

# Create a variable from the row related to the best-performing square
cv_results_df = pd.DataFrame(grid_rf_class.cv_results_)
best_row = cv_results_df.loc[grid_rf_class.best_index_]
print(best_row)

# Get the n_estimators parameter from the best-performing square and print
best_n_estimators = grid_rf_class.best_estimator_.n_estimators
print(best_n_estimators)


nan
mean_fit_time                 1.391439
std_fit_time                  0.006984
mean_score_time                0.00078
std_score_time                0.000234
param_max_depth                      2
params                {'max_depth': 2}
split0_test_score                  NaN
split1_test_score                  NaN
mean_test_score                    NaN
std_test_score                     NaN
rank_test_score                      1
split0_train_score                 NaN
split1_train_score                 NaN
mean_train_score                   NaN
std_train_score                    NaN
Name: 0, dtype: object
100
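Note that `n_estimators` was not part of `param_grid`, so the value printed above is simply the `RandomForestClassifier` default rather than a tuned result. To see how the "best" attributes relate to each other when the scores are valid, here is a minimal sketch on synthetic data; the dataset, the `'roc_auc'` scorer string and the `_demo` names are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption: not the credit card dataset used above)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

grid_demo = GridSearchCV(
    estimator=RandomForestClassifier(n_estimators=20, random_state=0),
    param_grid={'max_depth': [2, 4, 8]},
    scoring='roc_auc',
    cv=3,
    refit=True)
grid_demo.fit(X_demo, y_demo)

# best_index_ points at the row of cv_results_ that holds the best square
cv_results_demo = pd.DataFrame(grid_demo.cv_results_)
best_row_demo = cv_results_demo.loc[grid_demo.best_index_]

# These three views of the best square agree with each other
print(grid_demo.best_params_)
print(best_row_demo['params'], best_row_demo['mean_test_score'])
print(grid_demo.best_score_)
```

Here `best_params_` contains only the hyperparameters that were searched over, while anything left out of the grid keeps the value given to the estimator.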


# Using the best results

While it is interesting to analyze the results of our grid search, our final goal is practical in nature; we want to make predictions on our test set using our estimator object.

We can access this object through the `best_estimator_` property of our grid search object.

In [35]:
from sklearn.metrics import confusion_matrix
# See what type of object the best_estimator_ property is
print(type(grid_rf_class.best_estimator_))

# Create an array of predictions directly using the best_estimator_ property
predictions = grid_rf_class.best_estimator_.predict(X_test)

# Take a look to confirm it worked, this should be an array of 1's and 0's
print(predictions[0:5])

# Now create a confusion matrix 
print("Confusion Matrix \n", confusion_matrix(y_test, predictions))

# Get the ROC-AUC score
predictions_proba = grid_rf_class.best_estimator_.predict_proba(X_test)[:,1]
print("ROC-AUC Score \n", roc_auc_score(y_test, predictions_proba))


<class 'sklearn.ensemble._forest.RandomForestClassifier'>
[0 0 0 0 0]
Confusion Matrix 
 [[4580   58]
 [1186  176]]
ROC-AUC Score 
 0.7579401376232476
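The confusion matrix can be read further with a little arithmetic. The sketch below copies the four counts from the output above and derives accuracy, precision and recall, assuming class 1 marks a default and using scikit-learn's layout of true classes in rows and predicted classes in columns.

```python
# Counts copied from the confusion matrix above: rows are true classes, columns are predictions
tn, fp = 4580, 58     # true class 0: predicted 0, predicted 1
fn, tp = 1186, 176    # true class 1: predicted 0, predicted 1

total = tn + fp + fn + tp

# Share of all test rows that were classified correctly
accuracy = (tn + tp) / total

# Of the rows predicted as class 1, the share that really were class 1
precision = tp / (tp + fp)

# Of the rows that really were class 1, the share the model caught
recall = tp / (tp + fn)

print(f"accuracy={accuracy:.4f}, precision={precision:.4f}, recall={recall:.4f}")
```

The accuracy looks reasonable, but the low recall shows that most rows in class 1 are missed, which is easy to overlook when only a single summary score is reported.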
