Hyperparameter tuning is a critical step in optimizing the performance of machine learning models. RandomizedSearchCV, from scikit-learn, tunes hyperparameters by randomly sampling a fixed number of combinations from predefined ranges or distributions and scoring each one with cross-validation. The process works as follows:

1. Define Hyperparameter Search Space:
Hyperparameters: Identify the hyperparameters of the machine learning algorithm that you want to optimize. These could include parameters like learning rate, regularization strength, tree depth, etc.

Parameter Distributions: Define distributions or ranges for each hyperparameter that RandomizedSearchCV will sample from during the search process.
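
As a rough sketch of what "distributions or ranges" can look like: a plain list is sampled uniformly, while any object with an `rvs` method (for example the `scipy.stats` distributions) is sampled by calling it. The snippet uses scikit-learn's `ParameterSampler` to draw a few candidates. The bounds shown here are illustrative assumptions, not the values tuned later in this notebook.

```python
from scipy.stats import randint
from sklearn.model_selection import ParameterSampler

# Illustrative search space (assumed bounds, chosen only for this sketch)
param_distributions = {
    "n_estimators": randint(200, 1001),    # integers in [200, 1000]; upper bound is exclusive
    "min_samples_split": randint(2, 11),   # integers in [2, 10]
    "max_depth": [10, 60, 110, None],      # lists are sampled uniformly
    "criterion": ["gini", "entropy"],
}

# Draw 5 random combinations; random_state makes the draw reproducible
samples = list(ParameterSampler(param_distributions, n_iter=5, random_state=42))
for sample in samples:
    print(sample)
```

The same dictionary can be passed to `RandomizedSearchCV` through its `param_distributions` argument.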

2. Set Up Cross-Validation:
Cross-Validation Strategy: Choose a cross-validation strategy (e.g., k-fold cross-validation) to evaluate the performance of different hyperparameter configurations.

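Scoring Metric: Define a performance metric (e.g., accuracy, F1-score, ROC AUC) to optimize during the hyperparameter search. If no metric is specified, RandomizedSearchCV falls back to the estimator's own score method, which is accuracy for classifiers.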
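
The two choices above can be sketched as follows, assuming synthetic stand-in data from `make_classification` rather than the project's own `X_train` and `y_train`. The fold count and the `f1_macro` metric are example choices, not settings taken from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (assumption: used only to keep the sketch self-contained)
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Stratified 3-fold CV keeps the class proportions similar in every fold
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

# Score one fixed configuration with the chosen metric
model = RandomForestClassifier(n_estimators=20, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="f1_macro")
print(scores, scores.mean())
```

The same `cv` object and `scoring` string can be handed to `RandomizedSearchCV` through its `cv` and `scoring` parameters.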

3. Perform Randomized Search:
Random Sampling: RandomizedSearchCV samples a specified number of hyperparameter combinations from the defined search space.

Model Training and Evaluation: For each sampled combination of hyperparameters, RandomizedSearchCV trains the model on the training folds and evaluates it on the held-out fold, repeating this for every cross-validation split and averaging the resulting scores.

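Selection of Best Model: After evaluating all sampled combinations, RandomizedSearchCV selects the combination with the best mean cross-validation score for the specified metric. By default (refit=True), it then refits a model with those hyperparameters on the full training data.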
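
A compact end-to-end sketch of these three stages, again on synthetic data and with a deliberately tiny search space so that it runs quickly. The parameter values are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (assumption)
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Small illustrative search space: 2 * 2 * 2 = 8 possible combinations
param_distributions = {
    "n_estimators": [10, 20],
    "max_depth": [3, None],
    "min_samples_leaf": [1, 2],
}

search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=4,            # sample 4 of the 8 combinations
    cv=3,                # 3-fold cross-validation per combination
    scoring="accuracy",
    random_state=42,
)
search.fit(X, y)

# List the sampled combinations from best to worst mean CV score
results = search.cv_results_
ranked = sorted(
    zip(results["rank_test_score"], results["mean_test_score"], results["params"]),
    key=lambda row: row[0],
)
for rank, mean_score, params in ranked:
    print(rank, round(float(mean_score), 3), params)

print("Best:", search.best_params_, search.best_score_)

# best_estimator_ is already fitted on all of X, y because refit=True by default
predictions = search.best_estimator_.predict(X[:5])
```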

4. Refinement and Optimization:
Refinement Iterations: Iterate the process by adjusting the search space or increasing the number of iterations to further refine the hyperparameters and improve model performance.
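
One possible way to adjust the search space is to centre a narrower range on the best value found so far. The helper below, `narrowed_grid`, is hypothetical (it is not part of scikit-learn), and the spreads are arbitrary choices. The starting values 600 and 4 are the `n_estimators` and `min_samples_leaf` reported as best later in this notebook.

```python
def narrowed_grid(best_value, spread, num=3, minimum=1):
    """Return up to `num` evenly spaced integers centred on `best_value`.

    Hypothetical helper for illustration; `num` must be at least 2.
    """
    low = max(minimum, best_value - spread)
    high = best_value + spread
    step = (high - low) / (num - 1)
    # A set removes duplicates that rounding may create for small ranges
    return sorted({int(round(low + i * step)) for i in range(num)})


# Second-round search space built around the first-round best values
refined_grid = {
    "n_estimators": narrowed_grid(600, spread=200),
    "min_samples_leaf": narrowed_grid(4, spread=2),
}
print(refined_grid)
```

The refined dictionary could then be passed to a new `RandomizedSearchCV` run, optionally with a larger `n_iter`.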

In [1]:
# Import required libraries (details are available in the README.md file)
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
import os
import matplotlib.pyplot as plt
import pickle
import numpy as np


In [2]:
# Get the current working directory
current_dir = os.getcwd()

# Move one level up from the working directory
current_dir = os.path.dirname(current_dir)

# Move up one more level, so parent_dir is two levels above the working directory
parent_dir = os.path.dirname(current_dir)

# Print the resulting directory
print("Parent Directory:", parent_dir)

Parent Directory: E:\upgrade_capston_project-main


In [3]:
preprocessed_data_dir = os.path.join(parent_dir, 'datasets', 'processed_dataset')

In [4]:
# Load the preprocessed X_train from file
with open(os.path.join(preprocessed_data_dir,'X_train.pkl'), 'rb') as f:
    X_train = pickle.load(f)

# Load y_train from file
with open(os.path.join(preprocessed_data_dir,'y_train.pkl'), 'rb') as f:
    y_train = pickle.load(f)


In [5]:
# random forest classifier - hyperparameter tuning

# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 1000, num = 3)]

# Function used to measure the quality of a split
criterion = ['gini', 'entropy']

# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 3)]
max_depth.append(None)

# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]

# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]

# Method of selecting samples for training each tree
bootstrap = [True, False]

# class weight
class_weight = ['balanced', 'balanced_subsample', None]


# Create the random grid
param_grid = {'n_estimators': n_estimators,
               'criterion': criterion,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap,
               'class_weight':class_weight}
             


print(param_grid)


{'n_estimators': [200, 600, 1000], 'criterion': ['gini', 'entropy'], 'max_depth': [10, 60, 110, None], 'min_samples_split': [2, 5, 10], 'min_samples_leaf': [1, 2, 4], 'bootstrap': [True, False], 'class_weight': ['balanced', 'balanced_subsample', None]}
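
To put the size of this grid in perspective, the short calculation below counts every possible combination and compares the fitting cost of an exhaustive grid search with a random search. The dictionary is copied from the printed output above; `n_iter = 10` and `cv = 3` match the settings used in the search cell of this notebook.

```python
import math

# Copied from the printed param_grid above
param_grid = {
    "n_estimators": [200, 600, 1000],
    "criterion": ["gini", "entropy"],
    "max_depth": [10, 60, 110, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "bootstrap": [True, False],
    "class_weight": ["balanced", "balanced_subsample", None],
}

# Total number of distinct combinations in the full grid
n_combinations = math.prod(len(values) for values in param_grid.values())

n_iter = 10  # combinations sampled by the random search
cv = 3       # folds per combination

print(n_combinations)                      # 1296
print(f"{n_iter / n_combinations:.2%}")    # share of the grid that is sampled
print("grid search fits:", n_combinations * cv)
print("random search fits:", n_iter * cv)
```

With these settings the random search covers well under 1% of the grid, which is one reason to rerun it with a larger `n_iter` or a narrower space when time allows.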


In [6]:
# baseline random forest classifier (for hyperparameter tuning)
rfc_t = RandomForestClassifier()

In [7]:
# Random search of parameters, using 3-fold cross-validation,
# search across 10 randomly sampled combinations (n_iter), and use all available cores
rf_random = RandomizedSearchCV(estimator = rfc_t
                               ,param_distributions = param_grid
                               ,n_iter = 10
                               ,cv = 3
                               ,verbose=0
                               ,random_state=42
                               ,n_jobs = -1)


In [8]:
# Fit the random search model
rf_random.fit(X_train,y_train)

In [9]:
# get the best params
rf_random.best_params_

{'n_estimators': 600,
 'min_samples_split': 2,
 'min_samples_leaf': 4,
 'max_depth': 60,
 'criterion': 'gini',
 'class_weight': None,
 'bootstrap': False}

In [10]:
# random forest classifier - tuned
rfc_t = rf_random.best_estimator_

In [11]:
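# best_estimator_ was already refit on the full training data (refit=True by default),
# so this additional fit is optional; it retrains the same hyperparameter configuration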
rfc_t.fit(X_train,y_train)

In [12]:
# Save the fitted search object (which contains best_estimator_) to a file
model_dir = os.path.join(parent_dir, 'models')
with open(os.path.join(model_dir,'rfc_hyperTuningRandomSerachCV.pkl'), 'wb') as f:
    pickle.dump(rf_random, f)
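
Because the pickled object is the whole search rather than only the forest, it can be loaded later and used directly for prediction or to read back `best_params_`. The round trip below is a minimal sketch on synthetic data with a temporary file; the file name `search_demo.pkl` and the tiny search space are assumptions made for the example.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data and a tiny search (assumptions for this sketch)
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={"n_estimators": [10, 20], "max_depth": [3, None]},
    n_iter=2,
    cv=3,
    random_state=42,
)
search.fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "search_demo.pkl")  # hypothetical file name

    # Save the fitted search object
    with open(path, "wb") as f:
        pickle.dump(search, f)

    # Load it back, as a later notebook or script would
    with open(path, "rb") as f:
        loaded = pickle.load(f)

# The loaded object predicts through its refitted best_estimator_
same_predictions = bool((loaded.predict(X) == search.predict(X)).all())
print(loaded.best_params_, same_predictions)
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that wrote them.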