## 1. RandomizedSearchCV
- It is usually worth running RandomizedSearchCV before GridSearchCV. RandomizedSearchCV samples parameter combinations at random and checks with cross-validation whether they give better results, which narrows down the promising region of the parameter space. GridSearchCV can then search that narrowed region thoroughly, evaluating every combination in the grid.

- The search draws combinations at random, fits the model on each one and measures its cross-validated accuracy. Because only a sample of the space is tried, a few combinations that could have been optimal may be missed. In exchange, random search takes far less time and often lands on a solution that is close to optimal, which usually makes it a good trade-off.
### Parameters in RandomizedSearchCV()
------------------------------------------------------
1. **`estimator`** - the model instance to tune.
2. **`param_distributions`** - dictionary with parameter names as keys and lists of candidate values (or distributions to sample from) as values; the best values are chosen from these.

3. **`n_iter`** - number of parameter combinations that RandomizedSearchCV samples and evaluates. Because it does not search every possible combination, it tries only this many randomly chosen ones and keeps the best of them.

4. **`cv`** - number of cross-validation folds. The training data is split into this many folds, and each candidate is fitted once per fold with that fold held out for validation.

5. **`verbose`** - controls how much logging is printed during the search; higher values print more detail.

6. **`random_state`** - seed for the random sampling, so that the same combinations are drawn again on a rerun.

7. **`n_jobs`** - number of CPU cores to use for the computation (`-1` means all available cores).


**Number of fits = `cv` × `n_iter`**
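To make the roles of `n_iter`, `random_state` and `cv` concrete, here is a minimal pure-Python sketch of the sampling idea. It is a simplified illustration, not scikit-learn's actual implementation: the helper `sample_candidates` and the toy search space are hypothetical, and the sketch assumes every parameter is given as a list, in which case combinations are drawn without replacement.

```python
import itertools
import random


def sample_candidates(param_lists, n_iter, seed):
    """Draw up to n_iter distinct parameter combinations from lists of values."""
    names = sorted(param_lists)
    # Enumerate every combination, then sample without replacement.
    all_combos = list(itertools.product(*(param_lists[name] for name in names)))
    rng = random.Random(seed)
    chosen = rng.sample(all_combos, k=min(n_iter, len(all_combos)))
    return [dict(zip(names, combo)) for combo in chosen]


# Hypothetical toy search space, for illustration only.
space = {"max_depth": [5, 10, 20], "min_samples_leaf": [1, 2, 4, 8]}

first = sample_candidates(space, n_iter=5, seed=42)
second = sample_candidates(space, n_iter=5, seed=42)

cv = 3
print(len(first))        # 5 candidates are tried
print(first == second)   # True: the same seed reproduces the same candidates
print(cv * len(first))   # 15 fits = cv * n_iter
```

Only 5 of the 12 possible combinations are evaluated here, which is where both the time saving and the risk of missing the best combination come from.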

# 1. Importing necessary libraries and data, Preprocessing and Defining the Model

In [None]:
# Loading Libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
import warnings
warnings.filterwarnings('ignore')

# Loading Data
df = pd.read_csv('preprocessed_diabetes.csv')

# Splitting into Features and Target
x = df.drop(["Outcome"], axis=1)
y = df["Outcome"]

# Splitting into Train Test
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size=0.2, random_state=33)

# Defining the RandomForest Classifier
rf = RandomForestClassifier()

In [None]:
from sklearn.model_selection import RandomizedSearchCV

# 2. Creating the dictionary of hyperparameters

In [9]:
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 1000, num = 5)]

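# Number of features to consider at every split
# Note: newer scikit-learn releases no longer accept 'auto'; for classifiers it was equivalent to 'sqrt'
max_features = ['auto', 'sqrt', 'log2']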

# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 150,10)]

# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10,14]

# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4,6,8]

params = { 'n_estimators': n_estimators,
           'max_features': max_features,
           'max_depth': max_depth,
           'min_samples_split': min_samples_split,
           'min_samples_leaf': min_samples_leaf,
           'criterion':['entropy','gini']}
print(params)

{'n_estimators': [200, 400, 600, 800, 1000], 'max_features': ['auto', 'sqrt', 'log2'], 'max_depth': [10, 25, 41, 56, 72, 87, 103, 118, 134, 150], 'min_samples_split': [2, 5, 10, 14], 'min_samples_leaf': [1, 2, 4, 6, 8], 'criterion': ['entropy', 'gini']}
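It helps to know how large this search space is before choosing a sampling budget. The sketch below copies the lists printed above and counts the combinations; the budget of 100 candidates is taken from the `n_iter` value used in the search that follows.

```python
import math

# Values copied from the dictionary printed above.
params = {
    "n_estimators": [200, 400, 600, 800, 1000],
    "max_features": ["auto", "sqrt", "log2"],
    "max_depth": [10, 25, 41, 56, 72, 87, 103, 118, 134, 150],
    "min_samples_split": [2, 5, 10, 14],
    "min_samples_leaf": [1, 2, 4, 6, 8],
    "criterion": ["entropy", "gini"],
}

# An exhaustive grid would contain the product of the list lengths.
total_combinations = math.prod(len(values) for values in params.values())

n_iter = 100  # sampling budget used by the randomized search in this notebook
fraction = n_iter / total_combinations

print(total_combinations)   # 6000
print(f"{fraction:.2%}")    # 1.67%
```

So the randomized search looks at under 2% of the combinations an exhaustive grid over the same lists would have to fit.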


# 3. Defining RandomizedSearchCV with the necessary parameters

In [10]:
rf_randomcv = RandomizedSearchCV(estimator=rf, param_distributions=params, n_iter=100, cv=2, verbose=2, refit=True)

rf_randomcv.fit(x_train, y_train)

Fitting 2 folds for each of 100 candidates, totalling 200 fits
[CV] END criterion=entropy, max_depth=10, max_features=sqrt, min_samples_leaf=2, min_samples_split=5, n_estimators=600; total time=   1.6s
[CV] END criterion=entropy, max_depth=10, max_features=sqrt, min_samples_leaf=2, min_samples_split=5, n_estimators=600; total time=   1.6s
[CV] END criterion=entropy, max_depth=118, max_features=auto, min_samples_leaf=6, min_samples_split=14, n_estimators=200; total time=   0.5s
[CV] END criterion=entropy, max_depth=118, max_features=auto, min_samples_leaf=6, min_samples_split=14, n_estimators=200; total time=   0.4s
[CV] END criterion=gini, max_depth=150, max_features=sqrt, min_samples_leaf=6, min_samples_split=2, n_estimators=1000; total time=   2.4s
[CV] END criterion=gini, max_depth=150, max_features=sqrt, min_samples_leaf=6, min_samples_split=2, n_estimators=1000; total time=   2.2s
[CV] END criterion=gini, max_depth=103, max_features=log2, min_samples_leaf=4, min_samples_split=14, 

[CV] END criterion=gini, max_depth=87, max_features=auto, min_samples_leaf=1, min_samples_split=2, n_estimators=1000; total time=   2.3s
[CV] END criterion=entropy, max_depth=103, max_features=sqrt, min_samples_leaf=6, min_samples_split=2, n_estimators=800; total time=   1.7s
[CV] END criterion=entropy, max_depth=103, max_features=sqrt, min_samples_leaf=6, min_samples_split=2, n_estimators=800; total time=   1.8s
[CV] END criterion=gini, max_depth=41, max_features=auto, min_samples_leaf=8, min_samples_split=10, n_estimators=600; total time=   1.2s
[CV] END criterion=gini, max_depth=41, max_features=auto, min_samples_leaf=8, min_samples_split=10, n_estimators=600; total time=   1.2s
[CV] END criterion=gini, max_depth=103, max_features=auto, min_samples_leaf=8, min_samples_split=2, n_estimators=400; total time=   0.9s
[CV] END criterion=gini, max_depth=103, max_features=auto, min_samples_leaf=8, min_samples_split=2, n_estimators=400; total time=   1.1s
[CV] END criterion=gini, max_depth=

[CV] END criterion=entropy, max_depth=56, max_features=log2, min_samples_leaf=6, min_samples_split=5, n_estimators=600; total time=   1.3s
[CV] END criterion=entropy, max_depth=103, max_features=sqrt, min_samples_leaf=8, min_samples_split=10, n_estimators=800; total time=   1.7s
[CV] END criterion=entropy, max_depth=103, max_features=sqrt, min_samples_leaf=8, min_samples_split=10, n_estimators=800; total time=   1.7s
[CV] END criterion=entropy, max_depth=72, max_features=sqrt, min_samples_leaf=2, min_samples_split=10, n_estimators=800; total time=   2.1s
[CV] END criterion=entropy, max_depth=72, max_features=sqrt, min_samples_leaf=2, min_samples_split=10, n_estimators=800; total time=   1.7s
[CV] END criterion=gini, max_depth=10, max_features=auto, min_samples_leaf=4, min_samples_split=2, n_estimators=200; total time=   0.3s
[CV] END criterion=gini, max_depth=10, max_features=auto, min_samples_leaf=4, min_samples_split=2, n_estimators=200; total time=   0.3s
[CV] END criterion=gini, ma

[CV] END criterion=entropy, max_depth=41, max_features=auto, min_samples_leaf=1, min_samples_split=14, n_estimators=1000; total time=   2.5s
[CV] END criterion=gini, max_depth=103, max_features=log2, min_samples_leaf=1, min_samples_split=14, n_estimators=600; total time=   1.3s
[CV] END criterion=gini, max_depth=103, max_features=log2, min_samples_leaf=1, min_samples_split=14, n_estimators=600; total time=   1.1s
[CV] END criterion=gini, max_depth=87, max_features=sqrt, min_samples_leaf=4, min_samples_split=2, n_estimators=200; total time=   0.3s
[CV] END criterion=gini, max_depth=87, max_features=sqrt, min_samples_leaf=4, min_samples_split=2, n_estimators=200; total time=   0.3s
[CV] END criterion=entropy, max_depth=41, max_features=log2, min_samples_leaf=4, min_samples_split=2, n_estimators=200; total time=   0.3s
[CV] END criterion=entropy, max_depth=41, max_features=log2, min_samples_leaf=4, min_samples_split=2, n_estimators=200; total time=   0.3s
[CV] END criterion=entropy, max_d

RandomizedSearchCV(cv=2, estimator=RandomForestClassifier(), n_iter=100,
                   param_distributions={'criterion': ['entropy', 'gini'],
                                        'max_depth': [10, 25, 41, 56, 72, 87,
                                                      103, 118, 134, 150],
                                        'max_features': ['auto', 'sqrt',
                                                         'log2'],
                                        'min_samples_leaf': [1, 2, 4, 6, 8],
                                        'min_samples_split': [2, 5, 10, 14],
                                        'n_estimators': [200, 400, 600, 800,
                                                         1000]},
                   verbose=2)

In [11]:
rf_randomcv.best_params_

{'n_estimators': 200,
 'min_samples_split': 10,
 'min_samples_leaf': 4,
 'max_features': 'auto',
 'max_depth': 118,
 'criterion': 'entropy'}
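`best_params_` reports only the winning combination. A fitted search object also exposes `best_score_` and `cv_results_`, which show how close the other sampled candidates came. The sketch below is self-contained and runs on synthetic data from `make_classification` rather than the diabetes CSV, with a deliberately tiny search space so that it finishes quickly; the specific values are illustrative assumptions, not the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data; the notebook's dataset is not used here.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_distributions={
        "n_estimators": [10, 20, 30],
        "max_depth": [2, 4, 8],
        "min_samples_leaf": [1, 2, 4],
    },
    n_iter=5,
    cv=3,
    random_state=0,  # makes the sampled candidates reproducible
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))  # mean cross-validated accuracy of the best candidate

# List every sampled candidate, best first, with its mean validation score.
results = search.cv_results_
for i in results["rank_test_score"].argsort():
    print(results["params"][i], round(results["mean_test_score"][i], 3))
```

If several candidates score almost the same, that is a hint that the corresponding parameters matter little on this data and the follow-up grid can stay small.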

In [12]:
y_pred = rf_randomcv.predict(x_test)

## Evaluation of RandomizedSearchCV

In [13]:
rf_randomcv.score(x_test, y_test)

0.7662337662337663

In [14]:
confusion_matrix(y_test, y_pred)

array([[87, 12],
       [24, 31]], dtype=int64)

In [15]:
cf = classification_report(y_test, y_pred)
print(cf)

              precision    recall  f1-score   support

           0       0.78      0.88      0.83        99
           1       0.72      0.56      0.63        55

    accuracy                           0.77       154
   macro avg       0.75      0.72      0.73       154
weighted avg       0.76      0.77      0.76       154



## 2. GridSearchCV 

- RandomizedSearchCV finds good values by sampling the parameters at random, so we then search the neighborhood of those values thoroughly using GridSearchCV.

- The parameter lists passed to GridSearchCV contain values close to the best parameters obtained from RandomizedSearchCV.

- GridSearchCV has no `n_iter` parameter, because it checks every possible combination in the grid.

- **`param_distributions`** is replaced with **`param_grid`**.

### Parameters in GridSearchCV()
------------------------------------------------------
1. **`estimator`** - the model instance to tune.

2. **`param_grid`** - dictionary with parameter names as keys and lists of candidate values as values; every combination of these values is evaluated.

3. **`cv`** - number of cross-validation folds. The training data is split into this many folds, and each candidate is fitted once per fold with that fold held out for validation.

4. **`verbose`** - controls how much logging is printed during the search; higher values print more detail.

5. **`n_jobs`** - number of CPU cores to use for the computation (`-1` means all available cores).

Unlike RandomizedSearchCV, GridSearchCV takes no `random_state`, because the grid is enumerated rather than sampled.


**Number of fits = (product of the number of values in each parameter list) × `cv`**
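The number of candidates in a grid is the product of the list lengths, since every value of one parameter is paired with every value of the others. A quick check, using the grid values that this notebook prints in the next cell, reproduces the "75 candidates, totalling 750 fits" line of the fit log:

```python
import math

# Values copied from the param_grid printed in this notebook.
param_grid = {
    "criterion": ["entropy"],
    "max_depth": [118],
    "max_features": ["auto"],
    "min_samples_leaf": [4, 6, 8],
    "min_samples_split": [8, 9, 10, 11, 12],
    "n_estimators": [0, 100, 200, 300, 400],
}
cv = 10

n_candidates = math.prod(len(values) for values in param_grid.values())
n_fits = n_candidates * cv

# For contrast: adding the list lengths does not match the log.
sum_of_lengths = sum(len(values) for values in param_grid.values())

print(n_candidates)    # 75
print(n_fits)          # 750
print(sum_of_lengths)  # 16
```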


In [16]:
from sklearn.model_selection import GridSearchCV

para = rf_randomcv.best_params_

param_grid = {
    'criterion': [para['criterion']],
    
    'max_depth': [para['max_depth']],
    
    'max_features': [para['max_features']],
    
    'min_samples_leaf': [para['min_samples_leaf'], 
                         para['min_samples_leaf']+2, 
                         para['min_samples_leaf'] + 4],
    
    'min_samples_split': [para['min_samples_split'] - 2,
                          para['min_samples_split'] - 1,
                          para['min_samples_split'], 
                          para['min_samples_split'] +1,
                          para['min_samples_split'] + 2],
    
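    # Note: when the best n_estimators is 200, the first entry below becomes 0,
    # which is not a valid number of trees, so those candidates fail to fit.
    'n_estimators': [para['n_estimators'] - 200,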
                     para['n_estimators'] - 100, 
                     para['n_estimators'], 
                     para['n_estimators'] + 100, 
                     para['n_estimators'] + 200]
}
print(param_grid)

{'criterion': ['entropy'], 'max_depth': [118], 'max_features': ['auto'], 'min_samples_leaf': [4, 6, 8], 'min_samples_split': [8, 9, 10, 11, 12], 'n_estimators': [0, 100, 200, 300, 400]}
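The printed grid contains `n_estimators = 0`, which a random forest cannot use. One way to avoid such values is to build each neighborhood through a small helper that drops anything below a valid minimum. The helper `neighbors` below is hypothetical (it is not part of scikit-learn), and the minimums are assumptions based on the usual lower bounds for these parameters.

```python
def neighbors(center, offsets, minimum):
    """Return sorted, de-duplicated values center + offset that are >= minimum."""
    values = {center + offset for offset in offsets}
    return sorted(v for v in values if v >= minimum)


# Best values reported by the randomized search in this notebook.
best = {"min_samples_leaf": 4, "min_samples_split": 10, "n_estimators": 200}

grid = {
    "min_samples_leaf": neighbors(best["min_samples_leaf"], [0, 2, 4], minimum=1),
    "min_samples_split": neighbors(best["min_samples_split"], [-2, -1, 0, 1, 2], minimum=2),
    "n_estimators": neighbors(best["n_estimators"], [-200, -100, 0, 100, 200], minimum=1),
}
print(grid)
# {'min_samples_leaf': [4, 6, 8], 'min_samples_split': [8, 9, 10, 11, 12], 'n_estimators': [100, 200, 300, 400]}
```

With the invalid value removed, the grid shrinks from 75 to 60 candidates and no fits are wasted on a configuration that cannot train.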


In [17]:
rf = RandomForestClassifier()

rf_gridcv = GridSearchCV(estimator = rf, param_grid= param_grid, cv=10, n_jobs=-1, verbose=2, refit=True)

rf_gridcv.fit(x_train, y_train)

Fitting 10 folds for each of 75 candidates, totalling 750 fits


GridSearchCV(cv=10, estimator=RandomForestClassifier(), n_jobs=-1,
             param_grid={'criterion': ['entropy'], 'max_depth': [118],
                         'max_features': ['auto'],
                         'min_samples_leaf': [4, 6, 8],
                         'min_samples_split': [8, 9, 10, 11, 12],
                         'n_estimators': [0, 100, 200, 300, 400]},
             verbose=2)

In [18]:
y_pred = rf_gridcv.predict(x_test)

In [19]:
rf_gridcv.best_params_

{'criterion': 'entropy',
 'max_depth': 118,
 'max_features': 'auto',
 'min_samples_leaf': 4,
 'min_samples_split': 12,
 'n_estimators': 100}

## Evaluation of GridSearchCV Model

In [20]:
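# Note: this is accuracy on the training set; score on x_test, y_test to get a
# figure comparable with the RandomizedSearchCV evaluation above.
rf_gridcv.score(x_train, y_train)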

0.9234527687296417

In [21]:
confusion_matrix(y_test, y_pred)

array([[85, 14],
       [23, 32]], dtype=int64)
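The test-set accuracy of the grid-searched model can be worked out directly from this confusion matrix, along with the per-class precision and recall that the classification report shows. The sketch assumes scikit-learn's convention that rows are true classes and columns are predicted classes.

```python
# Confusion matrix from the cell above: rows are true classes, columns are predictions.
cm = [[85, 14],
      [23, 32]]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total


def precision(cm, k):
    """Correct predictions of class k divided by all predictions of class k."""
    return cm[k][k] / sum(row[k] for row in cm)


def recall(cm, k):
    """Correct predictions of class k divided by all true members of class k."""
    return cm[k][k] / sum(cm[k])


print(f"{accuracy:.2f}")  # 0.76
for k in (0, 1):
    print(f"{k} {precision(cm, k):.2f} {recall(cm, k):.2f}")
# 0 0.79 0.86
# 1 0.70 0.58
```

The test accuracy of about 0.76 is well below the 0.92 obtained on the training data, so the training score should not be read as an improvement over the randomized search.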

In [22]:
cf = classification_report(y_test, y_pred)
print(cf)

              precision    recall  f1-score   support

           0       0.79      0.86      0.82        99
           1       0.70      0.58      0.63        55

    accuracy                           0.76       154
   macro avg       0.74      0.72      0.73       154
weighted avg       0.75      0.76      0.75       154

