# Randomly Sample Hyperparameters

To run a random search, we first need to randomly sample our hyperparameter space.

In this exercise, you will first create lists of hyperparameter values and combine them into a list of lists with `itertools.product`. Then you will randomly sample hyperparameter combinations in preparation for running a random search.
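As a minimal sketch of the idea, the snippet below uses two deliberately tiny lists so the whole space fits on screen. The names `toy_rates`, `toy_leaves`, `toy_space` and `toy_sample` and the values in them are illustrative assumptions, not part of the exercise. The full space is the Cartesian product of the lists, and a random search draws a few of its members without replacement.

```python
import random
from itertools import product

# Hypothetical toy values, chosen only to keep the output small
toy_rates = [0.01, 0.1, 1.0]
toy_leaves = [10, 20]

# Every (learning_rate, min_samples_leaf) pair: 3 * 2 = 6 combinations
toy_space = [list(pair) for pair in product(toy_rates, toy_leaves)]

# Draw 3 distinct combinations; the seed only makes the draw repeatable
rng = random.Random(0)
toy_sample = rng.sample(toy_space, 3)

print(len(toy_space))
print(toy_sample)
```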

In [1]:
import pandas as pd
df = pd.read_csv("dataset/credit-card-full.csv")
# df.head()
# df.select_dtypes(include="int")
# df['default payment next month']

# from sklearn.linear_model import LogisticRegression
y = df['default payment next month']
X = df.drop('default payment next month', axis=1)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)
# log_reg_clf = LogisticRegression()
# log_reg_clf.fit(X_train, y_train)

In [3]:
from itertools import product
import numpy as np
# Create a list of values for the learning_rate hyperparameter
learn_rate_list = list(np.linspace(0.01,1.5,200))

# Create a list of values for the min_samples_leaf hyperparameter
min_samples_list = list(range(10,41))

# Combination list
combinations_list = [list(x) for x in product(learn_rate_list, min_samples_list)]

# Sample hyperparameter combinations for a random search.
random_combinations_index = np.random.choice(range(0, len(combinations_list)), 250, replace=False)
combinations_random_chosen = [combinations_list[x] for x in random_combinations_index]

# Print the result
print(combinations_random_chosen)

[[1.1555778894472362, 20], [0.968391959798995, 25], [0.3918592964824121, 30], [1.3577386934673366, 13], [1.3802010050251257, 12], [0.12979899497487438, 24], [0.5790452261306532, 16], [1.4101507537688442, 18], [1.1406030150753768, 40], [0.287035175879397, 16], [0.9384422110552764, 14], [1.4401005025125628, 12], [1.245427135678392, 38], [0.039949748743718594, 15], [1.1106532663316582, 31], [1.0432663316582915, 29], [1.327788944723618, 37], [0.03246231155778895, 28], [1.4026633165829145, 19], [1.2604020100502513, 21], [0.6314572864321608, 35], [0.6314572864321608, 17], [0.09236180904522612, 13], [0.19718592964824122, 31], [0.15974874371859296, 13], [0.38437185929648243, 33], [0.968391959798995, 26], [1.4775376884422111, 37], [1.455075376884422, 32], [1.2379396984924622, 24], [1.1406030150753768, 33], [0.2271356783919598, 27], [0.7587437185929649, 34], [0.6239698492462311, 38], [1.327788944723618, 18], [0.6838693467336683, 32], [0.7287939698492463, 15], [0.6389447236180904, 34], [0.5940201
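No seed is set in the cell above, so the sampled combinations change on every run. One possible way to make the draw repeatable is NumPy's `default_rng` generator, sketched below as an alternative to `np.random.choice`. The seed value `42` and the names `idx_first` and `idx_second` are arbitrary choices for illustration.

```python
import numpy as np

# Size of combinations_list in the cell above: 200 learning rates x 31 leaf sizes
n_combinations = 200 * 31

# Two generators created with the same (arbitrary) seed
rng = np.random.default_rng(42)
idx_first = rng.choice(n_combinations, size=250, replace=False)

rng_again = np.random.default_rng(42)
idx_second = rng_again.choice(n_combinations, size=250, replace=False)

# Same seed, same indexes, so the sampled combinations would match too
print(np.array_equal(idx_first, idx_second))
```

The resulting indexes can be used exactly like `random_combinations_index` to pick items out of `combinations_list`.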

# Randomly Search with Random Forest

To solidify your knowledge of random sampling, let's try a similar exercise but using different hyperparameters and a different algorithm.

As before, create some lists of hyperparameter values and combine them into a list of lists with `itertools.product`. You will use the hyperparameters `criterion`, `max_depth` and `max_features` of the random forest algorithm. Then you will randomly sample hyperparameter combinations in preparation for running a random search.

In [5]:
import random
# Create lists for criterion and max_features
criterion_list = ['gini' , 'entropy']
max_feature_list = ["auto", "sqrt", "log2", None]

# Create a list of values for the max_depth hyperparameter
max_depth_list = list(range(3,56))

# Combination list
combinations_list = [list(x) for x in product(criterion_list, max_feature_list, max_depth_list)]

# Sample hyperparameter combinations for a random search
combinations_random_chosen = random.sample(combinations_list, 150)

# Print the result
print(combinations_random_chosen)

[['gini', None, 47], ['gini', 'log2', 7], ['gini', 'sqrt', 19], ['gini', 'sqrt', 27], ['gini', 'log2', 47], ['entropy', None, 7], ['gini', None, 10], ['gini', 'sqrt', 53], ['entropy', None, 10], ['gini', 'log2', 20], ['gini', 'sqrt', 37], ['gini', 'log2', 29], ['entropy', None, 55], ['entropy', 'auto', 23], ['gini', 'sqrt', 46], ['gini', 'auto', 43], ['entropy', 'auto', 32], ['gini', 'log2', 48], ['entropy', 'log2', 42], ['entropy', 'auto', 33], ['entropy', 'sqrt', 50], ['gini', None, 19], ['entropy', None, 21], ['entropy', 'log2', 54], ['gini', 'log2', 13], ['gini', None, 49], ['entropy', 'log2', 50], ['gini', 'log2', 39], ['entropy', 'auto', 46], ['entropy', None, 3], ['entropy', None, 26], ['gini', 'log2', 54], ['gini', None, 38], ['gini', 'sqrt', 36], ['entropy', 'auto', 24], ['gini', 'sqrt', 38], ['entropy', None, 4], ['entropy', None, 37], ['gini', 'sqrt', 28], ['gini', 'log2', 41], ['gini', 'auto', 42], ['entropy', 'log2', 51], ['entropy', None, 48], ['entropy', None, 53], ['ent

# Visualizing a Random Search

Visualizing the combinations that a random search samples shows how much of the search space the technique covers, and therefore how your sample size affects that coverage.

In this exercise you will use several different samples of hyperparameter combinations and produce visualizations of the search space.

The function `sample_and_visualize_hyperparameters()` takes a single argument (number of combinations to sample) and then randomly samples hyperparameter combinations, just like you did in the last exercise! The function will then visualize the combinations.
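To get a feel for coverage without opening a plot window, here is a small text-only sketch. It is not the exercise's function: the 12 x 6 toy grid and the helper name `ascii_coverage` are assumptions made for illustration, with each grid cell standing in for one combination of two hyperparameters.

```python
import random
from itertools import product

# Hypothetical toy space: 6 rows x 12 columns stand in for two hyperparameters
n_rows, n_cols = 6, 12
cells = list(product(range(n_rows), range(n_cols)))

def ascii_coverage(n_samples, seed=0):
    """Return a text grid with '#' for sampled cells and '.' for the rest."""
    chosen = set(random.Random(seed).sample(cells, n_samples))
    lines = []
    for row in range(n_rows):
        lines.append("".join("#" if (row, col) in chosen else "."
                             for col in range(n_cols)))
    return "\n".join(lines)

# Small, medium and complete samples of the toy space
for n_samples in [5, 20, len(cells)]:
    print(f"{n_samples} of {len(cells)} combinations")
    print(ascii_coverage(n_samples))
    print()
```

Small samples leave large empty regions, and sampling every combination fills the grid completely, which is the same effect the scatter plots show.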

In [6]:
import matplotlib.pyplot as plt
def sample_and_visualize_hyperparameters(n_samples):
  # Note: relies on the globals combinations_list, x_lims and y_lims.
  # x_lims and y_lims are only defined in a later cell (In [16]).

  # If asking for all combinations, just return the entire list.
  if n_samples == len(combinations_list):
    combinations_random_chosen = combinations_list
  else:
    random_combinations_index = np.random.choice(range(0, len(combinations_list)), n_samples, replace=False)
    combinations_random_chosen = [combinations_list[x] for x in random_combinations_index]

  # Pull out the x (learn_rate) and y (min_samples_leaf) values to plot
  x_values = [x[0] for x in combinations_random_chosen]
  y_values = [x[1] for x in combinations_random_chosen]

  # Plot
  plt.clf()
  plt.scatter(x_values, y_values, c=['blue']*len(combinations_random_chosen))
  plt.gca().set(xlabel='learn_rate', ylabel='min_samples_leaf', title='Random Search Hyperparameters')
  plt.gca().set_xlim(x_lims)
  plt.gca().set_ylim(y_lims)
  plt.show()

In [8]:
# # Confirm how many hyperparameter combinations & print
# number_combs = len(combinations_list)
# print(number_combs)

# # Sample and visualise specified combinations
# for x in [50, 500, 1500]:
#     sample_and_visualize_hyperparameters(x)
    
# # Sample all the hyperparameter combinations & visualise
# sample_and_visualize_hyperparameters(number_combs)

# RandomizedSearchCV inputs

Let's test your knowledge of how RandomizedSearchCV differs from GridSearchCV.

You can check the documentation on scikit-learn's website to compare these two classes.

Which of these parameters is only for a `RandomizedSearchCV`?

- `n_iter`
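`n_iter` sets how many candidate combinations `RandomizedSearchCV` draws, whereas `GridSearchCV` evaluates every combination in its grid. A rough count of cross-validation fits makes the difference concrete. The sizes below are taken from the gradient boosting search in the next section; the variable names are illustrative, and the count leaves out the one extra fit done when `refit=True`.

```python
# Sizes taken from the gradient boosting search in the next section
n_learning_rates = 150      # np.linspace(0.1, 2, 150)
n_min_samples_leaf = 45     # range(20, 65)
cv_folds = 5
n_iter = 10

# A grid search would try every combination, once per fold
grid_candidates = n_learning_rates * n_min_samples_leaf
grid_fits = grid_candidates * cv_folds

# A random search tries only n_iter combinations, once per fold
random_fits = n_iter * cv_folds

print(grid_fits, random_fits)  # prints: 33750 50
```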

# The RandomizedSearchCV Object

Just like the `GridSearchCV` class from scikit-learn, `RandomizedSearchCV` provides many useful features to assist with efficiently undertaking a random search. You're going to create a `RandomizedSearchCV` object, making the small adjustments needed from the `GridSearchCV` object: the grid is passed as `param_distributions` and `n_iter` sets how many combinations are tried.

In [9]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, KFold
# Create the parameter grid
param_grid = {'learning_rate': np.linspace(0.1,2,150), 'min_samples_leaf': list(range(20,65))} 

# Create a random search object
random_GBM_class = RandomizedSearchCV(
    estimator = GradientBoostingClassifier(),
    param_distributions= param_grid,
    n_iter = 10,
    scoring='accuracy', n_jobs=4, cv = 5, refit=True, return_train_score = True)

# Fit to the training data
random_GBM_class.fit(X_train , y_train)

# Print the values used for both hyperparameters
print(random_GBM_class.cv_results_['param_learning_rate'])
print(random_GBM_class.cv_results_['param_min_samples_leaf'])

In [10]:
random_GBM_class.cv_results_.keys()

dict_keys(['mean_fit_time', 'std_fit_time', 'mean_score_time', 'std_score_time', 'param_min_samples_leaf', 'param_learning_rate', 'params', 'split0_test_score', 'split1_test_score', 'split2_test_score', 'split3_test_score', 'split4_test_score', 'mean_test_score', 'std_test_score', 'rank_test_score', 'split0_train_score', 'split1_train_score', 'split2_train_score', 'split3_train_score', 'split4_train_score', 'mean_train_score', 'std_train_score'])
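The keys above are parallel sequences with one entry per sampled combination, so they can be zipped together and sorted by `rank_test_score`. The helper below is a sketch of that idea. `top_candidates` is a hypothetical name, and `fake_results` is a hand-written stand-in with invented numbers, not output from the search above.

```python
def top_candidates(cv_results, k=3):
    """Return the k best (rank, mean_test_score, params) rows."""
    rows = zip(cv_results["rank_test_score"],
               cv_results["mean_test_score"],
               cv_results["params"])
    # Rank 1 is the best candidate, so sort ascending by rank
    return sorted(rows, key=lambda row: row[0])[:k]

# Invented values that only mimic the shape of cv_results_
fake_results = {
    "rank_test_score": [2, 1, 3],
    "mean_test_score": [0.80, 0.82, 0.78],
    "params": [
        {"learning_rate": 0.5, "min_samples_leaf": 30},
        {"learning_rate": 0.1, "min_samples_leaf": 25},
        {"learning_rate": 1.9, "min_samples_leaf": 60},
    ],
}

for rank, score, params in top_candidates(fake_results, k=2):
    print(rank, score, params)
```

With a fitted search, `random_GBM_class.cv_results_` could be passed in place of `fake_results`.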

# RandomizedSearchCV in scikit-learn

Let's practice building a `RandomizedSearchCV` object using scikit-learn.

In [11]:
from sklearn.ensemble import RandomForestClassifier
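# Create the parameter grid
# Note: 'auto' for max_features was deprecated in scikit-learn 1.1 and removed in 1.3.
# On newer versions, every fit that samples it fails and is scored as nan.
param_grid = {'max_depth': list(range(5,26)), 'max_features': ['auto' , 'sqrt']} 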

# Create a random search object
random_rf_class = RandomizedSearchCV(
    estimator = RandomForestClassifier(n_estimators=80),
    param_distributions = param_grid, n_iter = 5,
    scoring='roc_auc', n_jobs=4, cv = 3, refit=True, return_train_score = True )

# Fit to the training data
random_rf_class.fit(X_train, y_train)

# Print the values used for both hyperparameters
print(random_rf_class.cv_results_['param_max_depth'])
print(random_rf_class.cv_results_['param_max_features'])

6 fits failed out of a total of 15.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
6 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\88016\AppData\Roaming\Python\Python38\site-packages\sklearn\model_selection\_validation.py", line 729, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\88016\AppData\Roaming\Python\Python38\site-packages\sklearn\base.py", line 1145, in wrapper
    estimator._validate_params()
  File "C:\Users\88016\AppData\Roaming\Python\Python38\site-packages\sklearn\base.py", line 638, in _validate_params
    validate_parameter_constraints(
  File "C:\Users\88016\AppData\Roaming\Python\Python38\site-packages\sklearn\utils\_param_validation.py",

[5 16 23 16 19]
['sqrt' 'auto' 'sqrt' 'sqrt' 'auto']
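The 6 failed fits match the two sampled candidates that use `'auto'` (2 candidates x 3 folds), which suggests that this scikit-learn version rejects that value. A possible fix, sketched here on synthetic data rather than the credit-card data, is to list only values that current releases accept. The names `X_demo`, `y_demo`, `demo_grid` and `demo_search`, the smaller forest, and the seeds are assumptions made to keep the example quick and repeatable.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data; the credit-card CSV is not needed for this sketch
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

# 'auto' is left out; 'sqrt' and 'log2' are accepted values for max_features
demo_grid = {"max_depth": list(range(5, 26)), "max_features": ["sqrt", "log2"]}

demo_search = RandomizedSearchCV(
    estimator=RandomForestClassifier(n_estimators=10, random_state=0),
    param_distributions=demo_grid,
    n_iter=5,
    scoring="roc_auc",
    cv=3,
    random_state=0,
    error_score="raise",  # raise on an invalid setting instead of scoring nan
)
demo_search.fit(X_demo, y_demo)

# Print the values used for both hyperparameters
print(demo_search.cv_results_["param_max_depth"])
print(demo_search.cv_results_["param_max_features"])
```

Setting `error_score="raise"` is the option the warning itself recommends for debugging; it stops at the first failing fit instead of recording `nan`.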


# Comparing Random & Grid Search

You just studied some of the advantages and disadvantages of random search as compared to grid search.

Which of the following is an advantage of random search?

- It is more computationally efficient than Grid Search.
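One common way to reason about this efficiency is to ask how likely a random search is to land at least once among the best few percent of combinations. The sketch below assumes each trial is an independent, uniformly random draw, which is a simplification of sampling without replacement from a finite list. The function name `chance_of_hitting_top` and the 5% threshold are illustrative choices.

```python
def chance_of_hitting_top(fraction_good, n_trials):
    """Probability that at least one of n_trials independent uniform draws
    falls in the best `fraction_good` share of the search space."""
    return 1 - (1 - fraction_good) ** n_trials

# How the chance of finding a top-5% combination grows with the number of trials
for n_trials in [10, 30, 60, 100]:
    print(n_trials, round(chance_of_hitting_top(0.05, n_trials), 3))
```

Under these assumptions, around 60 trials already give roughly a 95% chance of hitting the top 5%, regardless of how large the full grid is.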

# Grid and Random Search Side by Side

Visualizing the search space of random and grid search together allows you to easily see the coverage that each technique has and therefore brings to life their specific advantages and disadvantages.

In this exercise, you will sample hyperparameter combinations in a grid search way as well as a random search way, then plot these to see the difference.
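Before plotting, the difference can also be seen in numbers. The sketch below assumes a two-parameter space with the same shape as the first exercise (200 learning rates by 31 `min_samples_leaf` values) and compares the first 300 combinations in product order with 300 randomly drawn ones. The names `grid_like`, `random_like` and `distinct_rates` and the seed are illustrative.

```python
import random
from itertools import product

# Assumed two-parameter space, same shape as the first exercise
learn_rates = [round(0.01 + i * (1.5 - 0.01) / 199, 6) for i in range(200)]
min_samples = list(range(10, 41))
space = list(product(learn_rates, min_samples))

# "Grid" slice: the first 300 combinations, in product order
grid_like = space[:300]

# Random draw of the same size; the seed is arbitrary
random_like = random.Random(7).sample(space, 300)

def distinct_rates(combos):
    """Count how many different learning rates appear in the combinations."""
    return len({rate for rate, _ in combos})

print("grid slice :", distinct_rates(grid_like), "distinct learning rates")
print("random draw:", distinct_rates(random_like), "distinct learning rates")
```

Because the product varies the last parameter fastest, the first 300 combinations only reach 10 of the 200 learning rates, while a random draw of the same size spreads across far more of them.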

In [16]:
x_lims = [0.01, 3.0]
y_lims = [5, 24]
def visualize_search(grid_combinations_chosen, random_combinations_chosen):
  # First element of each combination goes on the x-axis, second on the y-axis
  grid_x, grid_y = [x[0] for x in grid_combinations_chosen], [x[1] for x in grid_combinations_chosen]
  rand_x, rand_y = [x[0] for x in random_combinations_chosen], [x[1] for x in random_combinations_chosen]

  # Plot all together: grid points in red, random points in blue
  colors = ['red']*len(grid_combinations_chosen) + ['blue']*len(random_combinations_chosen)
  plt.scatter(grid_x + rand_x, grid_y + rand_y, c=colors)
  plt.gca().set(xlabel='learn_rate', ylabel='min_samples_leaf', title='Grid and Random Search Hyperparameters')
  plt.gca().set_xlim(x_lims)
  plt.gca().set_ylim(y_lims)
  plt.show()
  


In [17]:
# Sample grid coordinates
grid_combinations_chosen = combinations_list[0:300]

# Print result
print(grid_combinations_chosen)

[['gini', 'auto', 3], ['gini', 'auto', 4], ['gini', 'auto', 5], ['gini', 'auto', 6], ['gini', 'auto', 7], ['gini', 'auto', 8], ['gini', 'auto', 9], ['gini', 'auto', 10], ['gini', 'auto', 11], ['gini', 'auto', 12], ['gini', 'auto', 13], ['gini', 'auto', 14], ['gini', 'auto', 15], ['gini', 'auto', 16], ['gini', 'auto', 17], ['gini', 'auto', 18], ['gini', 'auto', 19], ['gini', 'auto', 20], ['gini', 'auto', 21], ['gini', 'auto', 22], ['gini', 'auto', 23], ['gini', 'auto', 24], ['gini', 'auto', 25], ['gini', 'auto', 26], ['gini', 'auto', 27], ['gini', 'auto', 28], ['gini', 'auto', 29], ['gini', 'auto', 30], ['gini', 'auto', 31], ['gini', 'auto', 32], ['gini', 'auto', 33], ['gini', 'auto', 34], ['gini', 'auto', 35], ['gini', 'auto', 36], ['gini', 'auto', 37], ['gini', 'auto', 38], ['gini', 'auto', 39], ['gini', 'auto', 40], ['gini', 'auto', 41], ['gini', 'auto', 42], ['gini', 'auto', 43], ['gini', 'auto', 44], ['gini', 'auto', 45], ['gini', 'auto', 46], ['gini', 'auto', 47], ['gini', 'auto',

In [20]:

# Create a list of sample indexes
sample_indexes = list(range(0,len(combinations_list)))

# Randomly sample 300 indexes
random_indexes = np.random.choice(sample_indexes, 300, replace=False)

# Use indexes to create random sample
random_combinations_chosen = [combinations_list[index] for index in random_indexes]


# # Call the function to produce the visualization
# visualize_search(grid_combinations_chosen, random_combinations_chosen)