# Hyperparameter Tuning

## Tree hyperparameters
In the following exercises, you'll revisit the Indian Liver Patient dataset, which was introduced in a previous chapter.

Your task is to tune the hyperparameters of a classification tree. Given that this dataset is imbalanced, you'll be using the ROC AUC score as a metric instead of accuracy.
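
To see why accuracy can mislead on imbalanced data, here is a minimal sketch with made-up labels (the 70/30 class ratio and the constant score are illustrative assumptions, not values from the liver dataset). A model that gives every observation the same score gets a high accuracy by always predicting the majority class, yet its ROC AUC is 0.5, no better than random ranking. The helper `roc_auc` is a hypothetical pure-Python version of the pairwise-ranking definition of AUC, written only for illustration.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Hypothetical imbalanced labels: 7 positives, 3 negatives
y_true = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]

# A "model" that gives every observation the same score
constant_scores = [0.9] * len(y_true)
constant_preds = [1 if s >= 0.5 else 0 for s in constant_scores]

accuracy = sum(p == y for p, y in zip(constant_preds, y_true)) / len(y_true)
auc = roc_auc(y_true, constant_scores)

print(f"Accuracy: {accuracy:.2f}")  # 0.70
print(f"ROC AUC:  {auc:.2f}")       # 0.50
```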

We have instantiated a DecisionTreeClassifier with sklearn's default hyperparameters and assigned it to dt. You can inspect the hyperparameters of dt in your console.

Which of the following is not a hyperparameter of dt?

In [4]:
from sklearn.tree import DecisionTreeClassifier

# Instantiate dt
dt = DecisionTreeClassifier(max_depth=2, random_state=1)

# get_params is a method: call it to get the hyperparameter dictionary
print(dt.get_params())
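
Note that `get_params` is a method, so it has to be called (with parentheses) to return the hyperparameters. A minimal sketch of inspecting them as a dictionary follows; the name `'not_a_real_param'` is hypothetical and deliberately invalid, used only to show the membership test.

```python
from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier(max_depth=2, random_state=1)

# Calling get_params() returns a dict mapping hyperparameter names to values
params = dt.get_params()

for name in sorted(params):
    print(f"{name}: {params[name]}")

# Membership tests answer "is this a hyperparameter of dt?"
print('min_samples_leaf' in params)   # True
print('not_a_real_param' in params)   # False
```

The exact set of keys depends on the installed scikit-learn version, so the printed list may differ slightly between environments.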


## Set the tree's hyperparameter grid
In this exercise, you'll manually set the grid of hyperparameters that will be used to tune the classification tree dt and find the optimal classifier in the next exercise.


* Define a grid of hyperparameters as a Python dictionary called params_dt with:
  * the key 'max_depth' set to the list of values 2, 3, and 4
  * the key 'min_samples_leaf' set to the list of values 0.12, 0.14, 0.16, 0.18
  * the key 'max_features' set to the list of values 0.2, 0.4, 0.6, 0.8

In [5]:
# Define params_dt
params_dt = {
    'max_depth': [2, 3, 4],
    'min_samples_leaf': [0.12, 0.14, 0.16, 0.18],
    'max_features': [0.2, 0.4, 0.6, 0.8]
}
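
A grid like this expands to every combination of one value per key. The sketch below enumerates the candidates with `itertools.product` to show how quickly the search grows; the fold count of 5 is an assumption taken from the next exercise.

```python
from itertools import product

params_dt = {
    'max_depth': [2, 3, 4],
    'min_samples_leaf': [0.12, 0.14, 0.16, 0.18],
    'max_features': [0.2, 0.4, 0.6, 0.8],
}

# Every combination of one value per key is a candidate model
names = list(params_dt)
candidates = [dict(zip(names, values)) for values in product(*params_dt.values())]

n_folds = 5  # assumed number of cross-validation folds
print(len(candidates))            # 48 candidates
print(len(candidates) * n_folds)  # 240 model fits
print(candidates[0])
```

By default GridSearchCV also refits the best candidate once on the full training set, so the total is one fit higher than the product above.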

## Search for the optimal tree
In this exercise, you'll perform grid search using 5-fold cross-validation to find dt's optimal hyperparameters. Note that because grid search is an exhaustive process, it may take a lot of time to train the model. Here you'll only instantiate the GridSearchCV object without fitting it to the training set. As discussed in the video, you can train such an object like any scikit-learn estimator by using the .fit() method:

grid_object.fit(X_train, y_train)

An untuned classification tree dt as well as the dictionary params_dt that you defined in the previous exercise are available in your workspace.
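
As a rough illustration of the instantiate-then-fit pattern, the sketch below runs a small grid search end to end. The synthetic data from `make_classification`, the reduced grid, and the names `X_demo`, `y_demo`, and `demo_grid` are assumptions made for the example; they are not the exercise's data or objects.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the liver dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

demo_grid = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=1),
    param_grid={'max_depth': [2, 3], 'min_samples_leaf': [0.1, 0.2]},
    scoring='roc_auc',
    cv=5,
)

# fit() cross-validates every candidate, then refits the best one on all the data
demo_grid.fit(X_demo, y_demo)

print(demo_grid.best_params_)
print(round(demo_grid.best_score_, 3))
```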


* Import GridSearchCV from sklearn.model_selection.

* Instantiate a GridSearchCV object using 5-fold CV by setting the parameters estimator to dt, param_grid to params_dt, and scoring to 'roc_auc'.

In [11]:
# Import GridSearchCV
from sklearn.model_selection import GridSearchCV

# Instantiate grid_dt
grid_dt = GridSearchCV(estimator=dt,
                       param_grid=params_dt,
                       scoring='roc_auc',
                       cv=5,
                       n_jobs=-1)

## Evaluate the optimal tree
In this exercise, you'll evaluate the test set ROC AUC score of grid_dt's optimal model.

In order to do so, you will first determine the probability of obtaining the positive label for each test set observation. You can use the predict_proba() method of an sklearn classifier to compute a 2D array whose columns contain the probabilities of the negative and positive class labels, respectively.
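
The column layout of `predict_proba()` can be checked on a tiny example. The data below is made up for illustration, and the names `X_toy`, `y_toy`, and `clf` are hypothetical; the point is that the columns follow the order of `classes_`, so column 1 is the positive class when the labels are 0 and 1.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Tiny made-up dataset: one feature, binary labels
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_toy = np.array([0, 0, 1, 0, 1, 1])

clf = DecisionTreeClassifier(max_depth=1, random_state=1).fit(X_toy, y_toy)

proba = clf.predict_proba(X_toy)
print(clf.classes_)   # column order of predict_proba, here [0 1]
print(proba.shape)    # (6, 2): one row per observation, one column per class

# Column 1 holds the probability of the positive class (label 1)
positive_proba = proba[:, 1]
print(positive_proba)
```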

The dataset is already loaded and processed for you (numerical features are standardized); it is split into 80% train and 20% test. X_test, y_test are available in your workspace. In addition, we have also loaded the trained GridSearchCV object grid_dt that you instantiated in the previous exercise. Note that grid_dt was trained as follows:

grid_dt.fit(X_train, y_train)

* Import roc_auc_score from sklearn.metrics.

* Extract the .best_estimator_ attribute from grid_dt and assign it to best_model.

* Predict the test set probabilities of obtaining the positive class y_pred_proba.

* Compute the test set ROC AUC score test_roc_auc of best_model.

In [14]:
import pandas as pd
import numpy as np
liver_dataset = pd.read_csv('/kaggle/input/indian-liver-patient-dataset/Indian Liver Patient Dataset (ILPD).csv')
liver_dataset.head()

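|   | age | gender | tot_bilirubin | direct_bilirubin | tot_proteins | albumin | ag_ratio | sgpt | sgot | alkphos | is_patient |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 65 | Female | 0.7 | 0.1 | 187 | 16 | 18 | 6.8 | 3.3 | 0.9 | 1 |
| 1 | 62 | Male | 10.9 | 5.5 | 699 | 64 | 100 | 7.5 | 3.2 | 0.74 | 1 |
| 2 | 62 | Male | 7.3 | 4.1 | 490 | 60 | 68 | 7.0 | 3.3 | 0.89 | 1 |
| 3 | 58 | Male | 1.0 | 0.4 | 182 | 14 | 20 | 6.8 | 3.4 | 1.0 | 1 |
| 4 | 72 | Male | 3.9 | 2.0 | 195 | 27 | 59 | 7.3 | 2.4 | 0.4 | 1 |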


In [15]:
# pandas, numpy and DecisionTreeClassifier were imported in earlier cells
from sklearn.model_selection import train_test_split


# Display basic info about the dataset
print("Dataset shape:", liver_dataset.shape)
print("\nDataset info:")
print(liver_dataset.info())
print("\nFirst few rows:")
print(liver_dataset.head())

# Data preprocessing
# Check for missing values
print("\nMissing values:")
print(liver_dataset.isnull().sum())

# Handle missing values if any
liver_dataset = liver_dataset.dropna()

# Convert categorical variables (like gender) to numerical
liver_dataset['gender'] = liver_dataset['gender'].map({'Female': 0, 'Male': 1})

# Prepare features (X) and target (y)
# Using all features for prediction
X = liver_dataset.drop('is_patient', axis=1)  # Assuming 'is_patient' is the target
y = liver_dataset['is_patient']

# Split the data into 70% train and 30% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

# Display split results
print(f"\nTraining set shape: {X_train.shape}")
print(f"Test set shape: {X_test.shape}")
print("Target distribution in training set:")
print(y_train.value_counts(normalize=True))
print("Target distribution in test set:")
print(y_test.value_counts(normalize=True))

Dataset shape: (583, 11)

Dataset info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 583 entries, 0 to 582
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   age               583 non-null    int64  
 1   gender            583 non-null    object 
 2   tot_bilirubin     583 non-null    float64
 3   direct_bilirubin  583 non-null    float64
 4   tot_proteins      583 non-null    int64  
 5   albumin           583 non-null    int64  
 6   ag_ratio          583 non-null    int64  
 7   sgpt              583 non-null    float64
 8   sgot              583 non-null    float64
 9   alkphos           579 non-null    float64
 10  is_patient        583 non-null    int64  
dtypes: float64(5), int64(5), object(1)
memory usage: 50.2+ KB
None

First few rows:
   age  gender  tot_bilirubin  direct_bilirubin  tot_proteins  albumin  \
0   65  Female            0.7               0.1           187       16   
1   62    M

In [18]:
# Fit grid_dt to the training data
grid_dt.fit(X_train, y_train)

# Extract best hyperparameters from grid_dt
best_hyperparams = grid_dt.best_params_

print("Best hyperparameters:\n", best_hyperparams)

# Extract best CV score from grid_dt
# (scoring='roc_auc', so this is a ROC AUC score, not an accuracy)
best_CV_score = grid_dt.best_score_
print('Best CV ROC AUC: {:.3f}'.format(best_CV_score))

Best hyperparameters:
 {'max_depth': 3, 'max_features': 0.6, 'min_samples_leaf': 0.12}


In [19]:
# Import roc_auc_score from sklearn.metrics
from sklearn.metrics import roc_auc_score

# Extract the best estimator
best_model = grid_dt.best_estimator_

# Predict the test set probabilities of the positive class
y_pred_proba = best_model.predict_proba(X_test)[:,1]

# Compute test_roc_auc
test_roc_auc = roc_auc_score(y_test, y_pred_proba)

# Print test_roc_auc
print('Test set ROC AUC score: {:.3f}'.format(test_roc_auc))

Test set ROC AUC score: 0.727


## Random forests hyperparameters
In the following exercises, you'll revisit the Bike Sharing Demand dataset that was introduced in a previous chapter. Recall that your task is to predict bike rental demand using historical weather data from the Capital Bikeshare program in Washington, D.C. For this purpose, you'll tune the hyperparameters of a random forest regressor.

We have instantiated a RandomForestRegressor called rf using sklearn's default hyperparameters. You can inspect the hyperparameters of rf in your console.

Which of the following is not a hyperparameter of rf?

In [20]:
# Import RandomForestRegressor
from sklearn.ensemble import RandomForestRegressor

# Instantiate rf
rf = RandomForestRegressor(n_estimators=25,
                           random_state=2)

# get_params is a method: call it to get the hyperparameter dictionary
print(rf.get_params())


## Set the hyperparameter grid of RF
In this exercise, you'll manually set the grid of hyperparameters that will be used to tune rf and find the optimal regressor. For this purpose, you will construct a grid of hyperparameters and tune the number of estimators, the maximum number of features used when splitting each node, and the minimum number of samples (or fraction) per leaf.


* Define a grid of hyperparameters as a Python dictionary called params_rf with:
  * the key 'n_estimators' set to the list of values 100, 350, 500
  * the key 'max_features' set to the list of values 'log2', 'auto', 'sqrt' (in scikit-learn 1.3 and later, 'auto' is no longer accepted; for regressors, use 1.0 instead)
  * the key 'min_samples_leaf' set to the list of values 2, 10, 30

In [21]:
bike_dataset = pd.read_csv('/kaggle/input/london-bike-sharing-dataset/london_merged.csv')
bike_dataset.head()

|   | timestamp | cnt | t1 | t2 | hum | wind_speed | weather_code | is_holiday | is_weekend | season |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2015-01-04 00:00:00 | 182 | 3.0 | 2.0 | 93.0 | 6.0 | 3.0 | 0.0 | 1.0 | 3.0 |
| 1 | 2015-01-04 01:00:00 | 138 | 3.0 | 2.5 | 93.0 | 5.0 | 1.0 | 0.0 | 1.0 | 3.0 |
| 2 | 2015-01-04 02:00:00 | 134 | 2.5 | 2.5 | 96.5 | 0.0 | 1.0 | 0.0 | 1.0 | 3.0 |
| 3 | 2015-01-04 03:00:00 | 72 | 2.0 | 2.0 | 100.0 | 0.0 | 1.0 | 0.0 | 1.0 | 3.0 |
| 4 | 2015-01-04 04:00:00 | 47 | 2.0 | 0.0 | 93.0 | 6.5 | 1.0 | 0.0 | 1.0 | 3.0 |


In [22]:
from sklearn.model_selection import train_test_split

# bike_dataset was already loaded in the previous cell

# Prepare the features and target variable
# Assuming 'cnt' is the target variable (bike count)
X = bike_dataset.drop(['timestamp', 'cnt'], axis=1)  # Features (excluding timestamp and target)
y = bike_dataset['cnt']  # Target variable

# Split the dataset into 80% training and 20% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [24]:
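# Define the dictionary 'params_rf'
# Note: max_features='auto' was deprecated in scikit-learn 1.1 and removed in 1.3;
# for regressors, 1.0 (all features) is the equivalent setting
params_rf = {
    'n_estimators': [100, 350, 500],
    'max_features': ['log2', 1.0, 'sqrt'],
    'min_samples_leaf': [2, 10, 30]
}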

## Search for the optimal forest
In this exercise, you'll perform grid search using 3-fold cross validation to find rf's optimal hyperparameters. To evaluate each model in the grid, you'll be using the negative mean squared error metric.
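
The metric is negated because scikit-learn scorers follow a "higher is better" convention, so an error that should be minimized is reported with its sign flipped. A minimal sketch with made-up scores (the candidate names and values are hypothetical) shows how the best candidate is picked and how to turn its score back into an RMSE:

```python
import math

# Hypothetical mean CV scores under scoring='neg_mean_squared_error'
cv_scores = {
    'candidate_a': -1200.0,
    'candidate_b': -950.0,
    'candidate_c': -1010.0,
}

# Grid search keeps the candidate with the highest score,
# which is the one with the lowest mean squared error
best_name = max(cv_scores, key=cv_scores.get)
best_neg_mse = cv_scores[best_name]

# Flip the sign to recover the MSE, then take the square root for the RMSE
best_rmse = math.sqrt(-best_neg_mse)

print(best_name)            # candidate_b
print(round(best_rmse, 2))  # 30.82
```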

Note that because grid search is an exhaustive search process, it may take a lot of time to train the model. Here you'll only instantiate the GridSearchCV object without fitting it to the training set. As discussed in the video, you can train such an object like any scikit-learn estimator by using the .fit() method:

grid_object.fit(X_train, y_train)

The untuned random forests regressor model rf as well as the dictionary params_rf that you defined in the previous exercise are available in your workspace.


* Import GridSearchCV from sklearn.model_selection.

* Instantiate a GridSearchCV object using 3-fold CV by using negative mean squared error as the scoring metric.

In [25]:
# Import GridSearchCV
from sklearn.model_selection import GridSearchCV

# Instantiate grid_rf
grid_rf = GridSearchCV(estimator=rf,
                       param_grid=params_rf,
                       scoring='neg_mean_squared_error',
                       cv=3,
                       verbose=1,
                       n_jobs=-1)

                       

In [26]:
# Fit grid_rf to the training set    
grid_rf.fit(X_train, y_train) 

Fitting 3 folds for each of 27 candidates, totalling 81 fits




In [27]:
# Extract best hyperparameters from grid_rf

best_hyperparams = grid_rf.best_params_

print('Best hyperparams:\n', best_hyperparams)

Best hyperparams:
 {'max_features': 'sqrt', 'min_samples_leaf': 2, 'n_estimators': 500}


## Evaluate the optimal forest
In this last exercise of the course, you'll evaluate the test set RMSE of grid_rf's optimal model.

The dataset is already loaded and processed for you and is split into 80% train and 20% test. X_test, y_test, and the function mean_squared_error from sklearn.metrics (under the alias MSE) are available in your environment. In addition, we have loaded the trained GridSearchCV object grid_rf that you instantiated in the previous exercise. Note that grid_rf was trained as follows:

grid_rf.fit(X_train, y_train)

* Import mean_squared_error as MSE from sklearn.metrics.

* Extract the best estimator from grid_rf and assign it to best_model.

* Predict best_model's test set labels and assign the result to y_pred.

* Compute best_model's test set RMSE.




In [28]:
# Import mean_squared_error from sklearn.metrics as MSE 
from sklearn.metrics import mean_squared_error as MSE

# Extract the best estimator
best_model = grid_rf.best_estimator_

# Predict test set labels
y_pred = best_model.predict(X_test)

# Compute rmse_test
rmse_test = MSE(y_test, y_pred) ** (1/2)

# Print rmse_test
print('Test RMSE of best model: {:.3f}'.format(rmse_test)) 

Test RMSE of best model: 904.129
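
For reference, the RMSE calculation can be worked through by hand. The sketch below uses made-up target values and predictions (not the bike data) to show the two steps: average the squared residuals, then take the square root.

```python
import math

# Made-up true values and predictions
y_true = [100.0, 250.0, 400.0, 50.0]
y_hat = [110.0, 240.0, 430.0, 70.0]

# Mean squared error: average of the squared residuals
squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_hat)]
mse = sum(squared_errors) / len(squared_errors)

# RMSE is the square root of the MSE, in the same units as the target
rmse = math.sqrt(mse)

print(mse)             # 375.0
print(round(rmse, 3))  # 19.365
```

Because RMSE is expressed in the units of the target, a test RMSE is easiest to judge against the typical size of the predicted quantity. Recent scikit-learn releases (1.4 and later) also provide `sklearn.metrics.root_mean_squared_error`, which returns this value directly.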
