
This notebook shows a real-world application of hyperparameter tuning on the breast cancer dataset bundled with scikit-learn. It uses a Random Forest classifier and tunes it with RandomizedSearchCV, which explores the hyperparameter space efficiently by sampling a fixed number of combinations instead of trying every one.

The code builds a Random Forest classifier for breast cancer classification, optimizes its hyperparameters with RandomizedSearchCV, and evaluates the tuned model on a held-out test set.

The goal is to find the combination of hyperparameters that maximizes the model's ability to accurately classify breast tumors as either malignant or benign.

In [1]:
# Import necessary libraries
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

In [2]:
# Load the Breast Cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target

In [3]:
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


- The hyperparameter space to search is defined in a dictionary named param_dist. It lists candidate values for the number of trees in the forest (n_estimators), the maximum depth of each tree (max_depth), the minimum number of samples required to split an internal node (min_samples_split), the minimum number of samples required at a leaf node (min_samples_leaf), and whether each tree is trained on a bootstrap sample (bootstrap).
- RandomizedSearchCV is used for the tuning. It samples a specified number of combinations (n_iter) from this space and evaluates each one with 5-fold cross-validation.

In [4]:
# Define the Random Forest model
# (no random_state is set on the model, so results can vary slightly between runs)
rf_model = RandomForestClassifier()

# Define the hyperparameter distribution to search
param_dist = {
    'n_estimators': [50, 100, 200, 300],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'bootstrap': [True, False]
}
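
To make the size of this search space concrete, the sketch below counts every combination in param_dist and draws 10 candidates from it. It uses scikit-learn's ParameterGrid and ParameterSampler helpers, which the notebook itself does not call; the variable names n_combinations and sampled are illustrative, and param_dist is repeated only so the block runs on its own.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Same search space as param_dist above, repeated so this block is self-contained
param_dist = {
    'n_estimators': [50, 100, 200, 300],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'bootstrap': [True, False],
}

# Number of combinations an exhaustive grid search would have to evaluate:
# 4 * 4 * 3 * 3 * 2 = 288
n_combinations = len(ParameterGrid(param_dist))
print("Combinations in the full grid:", n_combinations)

# Draw 10 candidates at random; when every entry is a list,
# the sampler picks combinations without replacement
sampled = list(ParameterSampler(param_dist, n_iter=10, random_state=42))
for params in sampled:
    print(params)
```

With n_iter=10, a randomized search evaluates only a small fraction of the 288 possible combinations, which is where the time saving over an exhaustive grid comes from.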

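RandomizedSearchCV samples n_iter=10 combinations from param_dist and scores each one with 5-fold cross-validation on the training set. That amounts to 50 model fits, plus a final refit of the best combination on the full training set.
After the search is complete, we print the best hyperparameters.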

In [5]:
# Use RandomizedSearchCV to find the best hyperparameters
random_search = RandomizedSearchCV(rf_model, param_distributions=param_dist, n_iter=10, cv=5, random_state=42)
random_search.fit(X_train, y_train)

# Print the best hyperparameters
print("Best hyperparameters: ", random_search.best_params_)

Best hyperparameters:  {'n_estimators': 100, 'min_samples_split': 2, 'min_samples_leaf': 1, 'max_depth': 10, 'bootstrap': False}
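
The best hyperparameters alone do not show how close the other candidates came. As a minimal sketch, the block below reads the best_score_ and cv_results_ attributes of a fitted search and lists every sampled candidate with its mean validation accuracy. To keep it quick it assumes a reduced search space (small_param_dist) and n_iter=4, which are not the notebook's settings; in the notebook the same attributes would be read from random_search.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Assumption: a deliberately small space and n_iter so the sketch runs fast
small_param_dist = {
    'n_estimators': [20, 50],
    'max_depth': [None, 10],
    'min_samples_leaf': [1, 4],
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=small_param_dist,
    n_iter=4,
    cv=5,
    random_state=42,
)
search.fit(X_train, y_train)

# Mean cross-validated accuracy of the winning candidate
print(f"Best CV accuracy: {search.best_score_:.3f}")

# List all sampled candidates from best to worst rank
results = search.cv_results_
order = results['rank_test_score'].argsort()
for i in order:
    mean = results['mean_test_score'][i]
    std = results['std_test_score'][i]
    print(f"{mean:.3f} +/- {std:.3f}  {results['params'][i]}")
```

If several candidates score within one standard deviation of each other, the choice between them is probably not meaningful, and a simpler or faster configuration may be just as good.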


In [6]:
# Make predictions on the test set using the best model
best_model = random_search.best_estimator_
y_pred = best_model.predict(X_test)

Evaluate the performance of the best model on the test set and display a classification report.

In [7]:
# Evaluate the model performance
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy on the test set: {accuracy:.2f}")

# Display additional metrics
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

Accuracy on the test set: 0.96

Classification Report:
              precision    recall  f1-score   support

           0       0.98      0.93      0.95        43
           1       0.96      0.99      0.97        71

    accuracy                           0.96       114
   macro avg       0.97      0.96      0.96       114
weighted avg       0.97      0.96      0.96       114
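
The report above labels the classes 0 and 1. In this dataset, target_names is ['malignant', 'benign'], so 0 means malignant and 1 means benign. The sketch below passes those names to classification_report and adds a confusion matrix. It trains a stand-in RandomForestClassifier with fixed settings so that it runs on its own; this is an assumption for illustration, and in the notebook the predictions would come from best_model instead.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

# Stand-in model (assumption); in the notebook, use best_model here
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Label 0 = malignant, label 1 = benign
class_names = list(data.target_names)
print(classification_report(y_test, y_pred, target_names=class_names))

# Rows are true classes, columns are predicted classes, in label order 0, 1
cm = confusion_matrix(y_test, y_pred)
print(cm)
```

Reading the matrix by row makes the two error types explicit: the off-diagonal entry in the first row counts malignant tumors predicted as benign, which is usually the costlier mistake in this setting.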

