## **Hyperparameter Tuning - Random Forest**

In [34]:
from sklearn.ensemble import RandomForestClassifier

In [35]:
rf = RandomForestClassifier()

In [36]:
import warnings
warnings.filterwarnings('ignore')

In [37]:
import pandas as pd
df = pd.read_csv('diabetes.csv')

In [38]:
df.head()

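| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |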


## Outcome column

- `Outcome` is the dependent feature (the target).

- All remaining columns are independent features.

In [39]:
# Feature engineering: treat 0 as a missing value and replace it with the column median.
# Note: each median is computed over the whole column, zeros included.

import numpy as np
df['Glucose'] = np.where(df['Glucose'] == 0, df['Glucose'].median(), df['Glucose'])
df['Insulin'] = np.where(df['Insulin'] == 0, df['Insulin'].median(), df['Insulin'])
df['SkinThickness'] = np.where(df['SkinThickness'] == 0, df['SkinThickness'].median(), df['SkinThickness'])
df.head()

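| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148.0 | 72 | 35.0 | 30.5 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85.0 | 66 | 29.0 | 30.5 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183.0 | 64 | 23.0 | 30.5 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89.0 | 66 | 23.0 | 94.0 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137.0 | 40 | 35.0 | 168.0 | 43.1 | 2.288 | 33 | 1 |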
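The imputed `Insulin` value of 30.5 is low because the median in the cell above is taken over the whole column, zeros included. A possible alternative (not what the notebook does) is to take the median over the non-zero values only. The sketch below uses a tiny made-up `toy` frame, which is an assumption for illustration and not the diabetes data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy column; half of the readings are missing (coded as 0).
toy = pd.DataFrame({"Insulin": [0, 0, 94, 168, 0, 88]})

# Median as computed in the notebook: zeros take part in the calculation.
median_all = toy["Insulin"].median()

# Alternative: ignore the zeros when computing the median.
median_nonzero = toy.loc[toy["Insulin"] != 0, "Insulin"].median()

# Replace zeros with the non-zero median, mirroring the np.where pattern.
toy["Insulin_imputed"] = np.where(toy["Insulin"] == 0, median_nonzero, toy["Insulin"])

print(median_all, median_nonzero)
print(toy)
```

Which variant is preferable depends on how many zeros a column has; with many zeros, the all-values median is pulled toward zero.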


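## **Do I need feature scaling in Random Forest?**

- Random forests do not need feature scaling.
- A random forest is an ensemble of decision trees.
- Each tree only creates branches by comparing one feature against a threshold.
- Rescaling a feature preserves the ordering of its values, so the splits, and therefore the predictions, are essentially unaffected.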
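One way to check the points above empirically is to train the same forest on raw and on standardized features and compare the predictions. This is a minimal sketch on synthetic data from `make_classification` (an assumption; it is not the diabetes dataset), and the names `rf_raw`, `rf_scaled`, and `agreement` are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption, not the notebook's dataset).
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

# Fit the scaler on the training split only, then apply it to both splits.
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)

# Same seed for both forests so bootstrapping and feature sampling match.
rf_raw = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
rf_scaled = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr_scaled, y_tr)

acc_raw = rf_raw.score(X_te, y_te)
acc_scaled = rf_scaled.score(X_te_scaled, y_te)

# Fraction of test rows on which the two forests predict the same class.
agreement = (rf_raw.predict(X_te) == rf_scaled.predict(X_te_scaled)).mean()
print(acc_raw, acc_scaled, agreement)
```

The two accuracies should be the same or very close; any small difference would come from floating-point effects near split thresholds rather than from the scaling itself.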



In [40]:
X = df.drop('Outcome', axis = 1)
y = df['Outcome']

In [41]:
print(X.head())
print(y.head())

   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0            6    148.0             72           35.0     30.5  33.6   
1            1     85.0             66           29.0     30.5  26.6   
2            8    183.0             64           23.0     30.5  23.3   
3            1     89.0             66           23.0     94.0  28.1   
4            0    137.0             40           35.0    168.0  43.1   

   DiabetesPedigreeFunction  Age  
0                     0.627   50  
1                     0.351   31  
2                     0.672   32  
3                     0.167   21  
4                     2.288   33  
0    1
1    0
2    1
3    0
4    1
Name: Outcome, dtype: int64


In [42]:
pd.DataFrame(X,columns=df.columns[:-1])


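| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age |
|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148.0 | 72 | 35.0 | 30.5 | 33.6 | 0.627 | 50 |
| 1 | 1 | 85.0 | 66 | 29.0 | 30.5 | 26.6 | 0.351 | 31 |
| 2 | 8 | 183.0 | 64 | 23.0 | 30.5 | 23.3 | 0.672 | 32 |
| 3 | 1 | 89.0 | 66 | 23.0 | 94.0 | 28.1 | 0.167 | 21 |
| 4 | 0 | 137.0 | 40 | 35.0 | 168.0 | 43.1 | 2.288 | 33 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 763 | 10 | 101.0 | 76 | 48.0 | 180.0 | 32.9 | 0.171 | 63 |
| 764 | 2 | 122.0 | 70 | 27.0 | 30.5 | 36.8 | 0.340 | 27 |
| 765 | 5 | 121.0 | 72 | 23.0 | 112.0 | 26.2 | 0.245 | 30 |
| 766 | 1 | 126.0 | 60 | 23.0 | 30.5 | 30.1 | 0.349 | 47 |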


## Train test split

In [43]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.20,random_state=3)

In [44]:
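# Baseline model: a small forest of 10 trees, all other settings left at their defaults.
# No random_state is set, so the scores below can change slightly between runs.
from sklearn.ensemble import RandomForestClassifier
rf_classifier = RandomForestClassifier(n_estimators=10).fit(X_train, y_train)
prediction = rf_classifier.predict(X_test)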

## Checking whether the dataset is imbalanced

In [45]:
y.value_counts()

Outcome
0    500
1    268
Name: count, dtype: int64
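The counts above correspond to roughly 65% negatives and 35% positives, a moderate imbalance. The sketch below rebuilds a label vector with those counts and shows how `stratify` in `train_test_split` keeps that ratio in both splits. The names `y_demo` and `X_demo` and the placeholder feature are assumptions for illustration; the notebook's own split does not use `stratify`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Rebuild labels with the class counts shown above (500 zeros, 268 ones).
y_demo = pd.Series([0] * 500 + [1] * 268, name="Outcome")
X_demo = pd.DataFrame({"row_id": range(len(y_demo))})  # placeholder feature

# Class shares instead of raw counts.
proportions = y_demo.value_counts(normalize=True)
print(proportions.round(3))

# stratify=y_demo keeps roughly the same class ratio in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=3, stratify=y_demo
)
print(round(y_tr.mean(), 3), round(y_te.mean(), 3))
```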

## Model Evaluation

In [46]:

from sklearn.metrics import confusion_matrix,classification_report,accuracy_score
print(confusion_matrix(y_test,prediction))
print(accuracy_score(y_test,prediction))
print(classification_report(y_test,prediction))

[[81 11]
 [30 32]]
0.7337662337662337
              precision    recall  f1-score   support

           0       0.73      0.88      0.80        92
           1       0.74      0.52      0.61        62

    accuracy                           0.73       154
   macro avg       0.74      0.70      0.70       154
weighted avg       0.74      0.73      0.72       154
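The figures in the report can be recomputed by hand from the confusion matrix, which helps when reading it. A short sketch using the matrix printed above (rows are true classes, columns are predicted classes, as in scikit-learn's `confusion_matrix`):

```python
import numpy as np

# Confusion matrix from the output above.
cm = np.array([[81, 11],
               [30, 32]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()       # share of all correct predictions
precision_1 = tp / (tp + fp)          # of predicted positives, how many are right
recall_1 = tp / (tp + fn)             # of actual positives, how many are found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(round(accuracy, 4), round(precision_1, 2), round(recall_1, 2), round(f1_1, 2))
```

The recall of about 0.52 for class 1 means that 30 of the 62 diabetic cases in the test set were missed, which the overall accuracy of 0.73 does not show on its own.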



## **The main parameters used in RandomForestClassifier**

- criterion = the function used to evaluate the quality of a split
- max_depth = maximum number of levels allowed in each tree
- max_features = maximum number of features considered when splitting a node
- min_samples_leaf = minimum number of samples required in a leaf node
- min_samples_split = minimum number of samples a node must contain before it can be split
- n_estimators = number of trees in the ensemble
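As a quick illustration, the parameters listed above can be set explicitly and read back with `get_params()`. The values below are, to my knowledge, the defaults in recent scikit-learn releases, but they are written out here as assumptions, and `random_state=0` is a hypothetical seed added for repeatability:

```python
from sklearn.ensemble import RandomForestClassifier

rf_explicit = RandomForestClassifier(
    n_estimators=100,        # number of trees in the ensemble
    criterion="gini",        # split-quality function
    max_depth=None,          # no depth limit: grow until other limits stop the tree
    max_features="sqrt",     # features sampled at each split
    min_samples_leaf=1,      # minimum samples in a leaf
    min_samples_split=2,     # minimum samples needed to split a node
    random_state=0,          # hypothetical seed, not from the notebook
)

# Read the settings back from the estimator.
params = rf_explicit.get_params()
for name in ["criterion", "max_depth", "max_features",
             "min_samples_leaf", "min_samples_split", "n_estimators"]:
    print(name, "=", params[name])
```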

## Manual hyperparameter tuning: the worse way

In [51]:
model=RandomForestClassifier(n_estimators=100,criterion='gini',
                             max_features='sqrt',min_samples_leaf=10,random_state=100).fit(X_train,y_train)
predictions=model.predict(X_test)
print(confusion_matrix(y_test,predictions))
print(accuracy_score(y_test,predictions))
print(classification_report(y_test,predictions))

[[79 13]
 [30 32]]
0.7207792207792207
              precision    recall  f1-score   support

           0       0.72      0.86      0.79        92
           1       0.71      0.52      0.60        62

    accuracy                           0.72       154
   macro avg       0.72      0.69      0.69       154
weighted avg       0.72      0.72      0.71       154




## *Randomized Search CV vs Grid Search CV*
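- Suppose 1 million people live across four areas (A, B, C, D).

- Now I need to find one person (Vicky).

- Here the areas stand for hyperparameter combinations and Vicky for the best-performing one.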

## *Grid Search CV*
- Checks every area in turn: A, B, C, D.
- Thorough but time-consuming.
- The order of the visits does not matter (for example A, D, C, B); every area still gets searched.
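
In scikit-learn this exhaustive search is `GridSearchCV`: every combination in the grid is cross-validated. A minimal sketch, assuming synthetic data from `make_classification` as a stand-in for the diabetes data and an illustrative `param_grid` (neither comes from the original notebook):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Illustrative grid: 2 x 2 x 2 = 8 combinations, all of which are evaluated.
param_grid = {
    "n_estimators": [20, 50],
    "max_depth": [3, None],
    "min_samples_leaf": [1, 5],
}

grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid=param_grid,
    cv=3,                 # 3-fold cross-validation per combination
    scoring="accuracy",
)
grid.fit(X_demo, y_demo)

n_candidates = len(grid.cv_results_["params"])
print(n_candidates)           # number of combinations tried
print(grid.best_params_)
print(grid.best_score_)
```

The cost grows multiplicatively with every value added to the grid, which is the "time-consuming" part of the analogy.
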
## *Randomized Search CV*
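- Randomly selects a few areas to search, e.g., B and C.
- Only the sampled areas are searched, which narrows down the search.
- Faster, but it can miss Vicky if she is not in one of the sampled areas.
- Still efficient even if Vicky is missed at first: the cost stays low, and the results hint at where to look next.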
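
The scikit-learn counterpart is `RandomizedSearchCV`, which samples a fixed number of combinations (`n_iter`) from lists or distributions. A minimal sketch, again assuming synthetic stand-in data and illustrative ranges that are not from the original notebook:

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Illustrative search space: 50 x 4 x 10 = 2000 possible combinations.
param_distributions = {
    "n_estimators": randint(10, 60),      # integers 10..59
    "max_depth": [3, 5, 10, None],
    "min_samples_leaf": randint(1, 11),   # integers 1..10
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,             # only 5 sampled combinations are evaluated
    cv=3,
    random_state=0,       # makes the sampling repeatable
)
search.fit(X_demo, y_demo)

n_sampled = len(search.cv_results_["params"])
print(n_sampled)
print(search.best_params_)
print(search.best_score_)
```

Only `n_iter` combinations are tried, so the best one found is the best among those sampled, not necessarily the best in the whole space.
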
## *Hybrid Approach*
- First, use Randomized Search CV to narrow down the search areas.
- Then, use Grid Search CV within these narrowed-down areas.
- Makes it much more likely that you find Vicky, though only within the areas you actually searched.
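
The two-stage idea can be sketched as a coarse `RandomizedSearchCV` followed by a small `GridSearchCV` centred on its best result. The data, ranges, step sizes, and the names `coarse`, `fine`, and `fine_grid` are all assumptions for illustration, not the notebook's own settings:

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Synthetic stand-in data (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Stage 1: coarse random search over a wide space.
coarse = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={
        "n_estimators": randint(10, 60),
        "min_samples_leaf": randint(1, 11),
    },
    n_iter=5,
    cv=3,
    random_state=0,
).fit(X_demo, y_demo)

best_n = int(coarse.best_params_["n_estimators"])
best_leaf = int(coarse.best_params_["min_samples_leaf"])

# Stage 2: small exhaustive grid around the stage-1 winner.
fine_grid = {
    "n_estimators": sorted({max(5, best_n - 5), best_n, best_n + 5}),
    "min_samples_leaf": sorted({max(1, best_leaf - 1), best_leaf, best_leaf + 1}),
}
fine = GridSearchCV(
    RandomForestClassifier(random_state=0), param_grid=fine_grid, cv=3
).fit(X_demo, y_demo)

print(coarse.best_params_, coarse.best_score_)
print(fine.best_params_, fine.best_score_)
```

Because the fine grid contains the stage-1 winner, its best cross-validated score should not be lower than the coarse one in this setup. It can still miss a better combination that lies outside both searches.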