# Grid Search

This section uses the GridSearchCV class provided by scikit-learn to find the best hyperparameters for a random forest model.

Please read the following page for an overview of GridSearchCV before proceeding with the exercise.

GridSearchCV: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

The following link describes the random forest classifier and its usage in scikit-learn.

RandomForest: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

The exercise uses the following dataset.

Heart Disease Cleveland: https://www.kaggle.com/datasets/ritwikb3/heart-disease-cleveland

In [6]:
# Load the necessary libraries

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn import metrics

In [4]:
# suppress warning messages

import warnings
warnings.filterwarnings('ignore')  # suppress warnings that are not a concern for this exercise
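Ignoring every warning can also hide messages that matter. As a minimal sketch, assuming only one warning category is noise, the filter can be narrowed to that category. The `noisy` function below is hypothetical and exists only to emit two kinds of warnings.

```python
import warnings


def noisy():
    # Hypothetical helper that emits one warning of each kind
    warnings.warn("this API will change", FutureWarning)
    warnings.warn("check your input", UserWarning)
    return "done"


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # start from a known filter state
    # Ignore only FutureWarning; every other category is still reported
    warnings.filterwarnings("ignore", category=FutureWarning)
    result = noisy()

caught_categories = [w.category for w in caught]
print(caught_categories)
```

Only the `UserWarning` is recorded here, so a genuine problem with the input would still be visible.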

In [8]:
# Load the dataset as a Pandas dataframe and display the head

df = pd.read_csv('./Datasets/3-2-Dataset/Heart_disease_cleveland_new.csv')
df.head()

   age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope  ca  thal  target
0   63    1   0       145   233    1        2      150      0      2.3      2   0     2       0
1   67    1   3       160   286    0        2      108      1      1.5      1   3     1       1
2   67    1   3       120   229    0        2      129      1      2.6      1   2     3       1
3   37    1   2       130   250    0        0      187      0      3.5      2   0     1       0
4   41    0   1       130   204    0        2      172      0      1.4      0   0     1       0


In [9]:
# Check for the null values

df.isnull().sum()

age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0
slope       0
ca          0
thal        0
target      0
dtype: int64

In [10]:
# Separate the feature columns and the target using pandas

X = df.drop('target',axis=1)
y = df['target']

In [14]:
#shape of data features
X.shape

(303, 13)

In [15]:
#shape of the target column
y.shape

(303,)

In [16]:
# Split dataset into train and test sets

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=42)
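The split above is purely random. A possible variation, shown here as a sketch on synthetic data (the `make_classification` stand-in and the `_demo` names are assumptions, not the heart disease data), passes `stratify` so that both splits keep roughly the same class balance.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as the notebook's data (assumption)
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=42)

# stratify=y_demo keeps the class proportions similar in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

train_pos_rate = y_tr.mean()  # fraction of class 1 in the training split
test_pos_rate = y_te.mean()   # fraction of class 1 in the test split
print(X_tr.shape, X_te.shape)
print(round(train_pos_rate, 3), round(test_pos_rate, 3))
```

With a small dataset such as this one, stratifying reduces the chance that the test set ends up with an unrepresentative share of positive cases.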

In [17]:
# Print train dataset size

print(X_train.shape, y_train.shape)

(212, 13) (212,)


In [19]:
# Print test dataset size

print(X_test.shape,y_test.shape)

(91, 13) (91,)


In [23]:
# Scale the data using standard scaler

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
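The scaler is fit on the training split only and then reused on the test split. The sketch below, on made-up random data, illustrates what that means: the training columns end up with zero mean and unit variance, while the test columns are shifted and scaled with the training statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data for illustration only
rng = np.random.default_rng(0)
train = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
test = rng.normal(loc=50.0, scale=10.0, size=(20, 3))

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train
test_scaled = scaler.transform(test)        # reuses the training mean and std

# The same transformation written out by hand
manual_test_scaled = (test - train.mean(axis=0)) / train.std(axis=0)

print(train_scaled.mean(axis=0).round(6), train_scaled.std(axis=0).round(6))
print(np.allclose(test_scaled, manual_test_scaled))
```

Fitting the scaler on the test data as well would leak information about the test set into preprocessing, which is why only `transform` is called on it.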

In [24]:
# Define the random forest classifier with the default parameters

rfc = RandomForestClassifier()

In [25]:
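# Define the parameter grid for the grid search
# Refer to the GridSearchCV documentation
# Note: max_features=0 is not a valid setting, so those candidates score nan in the search log;
# starting the range at 1 would avoid the wasted fits

forest_params = [{'max_depth': list(range(10,15)), 'max_features':list(range(0,14))}]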
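To see why the search log reports 70 candidates and 350 fits, the grid can be expanded by hand. This sketch uses only the standard library and mirrors the grid above; the `candidates`, `n_folds`, and `invalid` names are illustrative.

```python
from itertools import product

forest_params = {
    "max_depth": list(range(10, 15)),    # 5 values: 10..14
    "max_features": list(range(0, 14)),  # 14 values: 0..13
}

# Every combination of one value per parameter is one candidate
candidates = [
    dict(zip(forest_params, values))
    for values in product(*forest_params.values())
]

n_folds = 5
total_fits = len(candidates) * n_folds

# max_features must be at least 1 when given as an integer
invalid = [c for c in candidates if c["max_features"] < 1]

print(len(candidates), total_fits, len(invalid))  # 70 350 5
```

The five candidates with `max_features=0` account for the `score=nan` lines in the log; a grid built from `range(1, 14)` would have 65 candidates and 325 fits.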

In [27]:
# Perform Grid Search to identify optimal parameters
# Use cv = 5

clf = GridSearchCV(rfc,forest_params, cv=5, scoring='accuracy', verbose=3)

In [28]:
clf.fit(X_train_scaled, y_train)

Fitting 5 folds for each of 70 candidates, totalling 350 fits
[CV 1/5] END ........max_depth=10, max_features=0;, score=nan total time=   0.0s
[CV 2/5] END ........max_depth=10, max_features=0;, score=nan total time=   0.0s
[CV 3/5] END ........max_depth=10, max_features=0;, score=nan total time=   0.0s
[CV 4/5] END ........max_depth=10, max_features=0;, score=nan total time=   0.0s
[CV 5/5] END ........max_depth=10, max_features=0;, score=nan total time=   0.0s
[CV 1/5] END ......max_depth=10, max_features=1;, score=0.837 total time=   0.1s
[CV 2/5] END ......max_depth=10, max_features=1;, score=0.814 total time=   0.0s
[CV 3/5] END ......max_depth=10, max_features=1;, score=0.833 total time=   0.0s
[CV 4/5] END ......max_depth=10, max_features=1;, score=0.762 total time=   0.0s
[CV 5/5] END ......max_depth=10, max_features=1;, score=0.833 total time=   0.1s
[CV 1/5] END ......max_depth=10, max_features=2;, score=0.837 total time=   0.1s
[CV 2/5] END ......max_depth=10, max_features=2
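The `score=nan` entries at the start of the log come from the candidates with `max_features=0`. A hedged sketch of one way to avoid them is shown below on synthetic data, with a deliberately small grid and `n_estimators=25` to keep it fast (both are assumptions, not the notebook's settings). Setting `error_score="raise"` makes an invalid candidate fail loudly instead of silently scoring nan, and `random_state` makes the run repeatable.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data (assumption)
X_demo, y_demo = make_classification(n_samples=200, n_features=13, random_state=0)

# Only valid values: max_features is between 1 and the number of features
param_grid = {"max_depth": [10, 12], "max_features": [1, 5, 13]}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=25, random_state=0),
    param_grid,
    cv=5,
    scoring="accuracy",
    error_score="raise",  # stop on a failing candidate instead of recording nan
)
search.fit(X_demo, y_demo)

mean_scores = search.cv_results_["mean_test_score"]
print(len(mean_scores), np.isnan(mean_scores).any())
```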

In [29]:
# Print best hyperparameters detected from the Grid Search

clf.best_params_

{'max_depth': 12, 'max_features': 5}

In [30]:
# Print the mean cross-validated score of the best estimator

clf.best_score_

0.8207087486157253
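The value of `best_score_` is the average of the per-fold scores of the winning candidate. The sketch below recomputes it from `cv_results_` on synthetic data; the small grid, the `n_estimators=25` setting, and the `_demo` names are assumptions made to keep the example quick.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data (assumption)
X_demo, y_demo = make_classification(n_samples=200, n_features=13, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(n_estimators=25, random_state=0),
    {"max_depth": [3, 10]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

results = search.cv_results_
best = search.best_index_  # row of cv_results_ that holds the best candidate

# One accuracy value per fold for the best candidate
fold_scores = [results[f"split{i}_test_score"][best] for i in range(5)]
manual_mean = float(np.mean(fold_scores))

print(results["params"][best], manual_mean, search.best_score_)
```

Looking at the individual fold scores also shows how much the estimate varies from fold to fold, which a single mean hides.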

In [31]:
# Use the best estimator to obtain the accuracy on the training and test sets

print(clf.best_estimator_.score(X_train_scaled, y_train))
print(clf.best_estimator_.score(X_test_scaled, y_test))

1.0
0.8461538461538461
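A training accuracy of 1.0 next to a test accuracy of about 0.85 suggests the forest fits the training data more closely than it generalizes. Accuracy alone does not show which kinds of errors are made, so a confusion matrix and per-class report can help. The sketch below uses the `metrics` module on synthetic data (an assumption, so the numbers will differ from the notebook's) with the hyperparameters found above.

```python
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the heart disease data (assumption)
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# Hyperparameters taken from the grid search result above
model = RandomForestClassifier(max_depth=12, max_features=5, random_state=42)
model.fit(X_tr, y_tr)

y_pred = model.predict(X_te)
test_accuracy = metrics.accuracy_score(y_te, y_pred)
cm = metrics.confusion_matrix(y_te, y_pred)  # rows: true class, columns: predicted

print(test_accuracy)
print(cm)
print(metrics.classification_report(y_te, y_pred))
```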
