# Hyperparameter Tuning

## RandomizedSearchCV

`RandomizedSearchCV` is useful for hyperparameter tuning when the optimal values have to be searched for over a very wide range. It reduces the time the search takes, but the trade-off is that it only sometimes hits the optimal values exactly; more often it lands close to them without finding the true optimum.


**Note (Hybrid Approach):** First run `RandomizedSearchCV` to find the vicinity of the best combinations. Then, based on its output, narrow the ranges and run `GridSearchCV` to pin down the actual optimal values.

* The problem with `GridSearchCV` is that the number of combinations grows multiplicatively with each additional hyperparameter.
* With several hyperparameters and a large dataset, the whole grid search takes a long time.
* `RandomizedSearchCV` is used to work around these problems.
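
To make the growth in combinations concrete, the sketch below counts how many candidates an exhaustive grid would contain for the search space used later in this notebook. It uses scikit-learn's `ParameterGrid` purely for counting; the variable names are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Same search space as params1 used later in this notebook
param_grid = {
    'max_depth': list(range(2, 50)),              # 48 values (2 to 49)
    'min_samples_split': [5, 10, 15, 20, 25, 30], # 6 values
}

n_combinations = len(ParameterGrid(param_grid))   # 48 * 6 = 288
n_folds = 5
n_fits = n_combinations * n_folds                 # one model fit per combination per fold

print(n_combinations, n_fits)                     # 288 1440
```

By comparison, a randomized search with its default of 10 sampled combinations and 5 folds needs 50 cross-validation fits.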

**RandomizedSearchCV**: Instead of trying every combination of hyperparameter values, it randomly selects a few of the combinations and tunes over those.
* Because fewer combinations are evaluated, it yields near-optimal results faster.

- A randomized search tells you in which vicinity of values the optimal results lie.
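
As a rough illustration of what "selecting a few of the combinations" means, the sketch below draws candidates with scikit-learn's `ParameterSampler`, the utility that produces random parameter settings from a dictionary of lists. The `random_state=0` seed and the variable names are assumptions made only so the example is repeatable.

```python
from sklearn.model_selection import ParameterSampler

# Same search space as params1 used later in this notebook
param_distributions = {
    'max_depth': list(range(2, 50)),
    'min_samples_split': [5, 10, 15, 20, 25, 30],
}

# Draw 10 of the 288 possible combinations (10 is also RandomizedSearchCV's default n_iter).
# random_state=0 is an arbitrary seed chosen for repeatability.
sampled = list(ParameterSampler(param_distributions, n_iter=10, random_state=0))

for combo in sampled:
    print(combo)
```

When every parameter is given as a list, as here, the combinations are sampled without replacement, so no candidate is evaluated twice.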

In [12]:
import numpy as np, pandas as pd, matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV, cross_val_score, cross_validate
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.linear_model import LinearRegression


In [2]:
data = pd.read_csv('/Users/sylvia/Desktop/datasets/insurance.csv')
data.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,expenses
0,19,female,27.9,0,yes,southwest,16884.92
1,18,male,33.8,1,no,southeast,1725.55
2,28,male,33.0,3,no,southeast,4449.46
3,33,male,22.7,0,no,northwest,21984.47
4,32,male,28.9,0,no,northwest,3866.86


In [3]:
# Binary Variables - sex and smoker
data['sex'] = data['sex'].replace({'female':1, 'male':0})
data['smoker'] = data['smoker'].replace({'yes':1, 'no':0})
data.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,expenses
0,19,1,27.9,0,1,southwest,16884.92
1,18,0,33.8,1,0,southeast,1725.55
2,28,0,33.0,3,0,southeast,4449.46
3,33,0,22.7,0,0,northwest,21984.47
4,32,0,28.9,0,0,northwest,3866.86


In [4]:
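# Multiclass variable - region (one-hot encoded)
# dtype=int keeps the dummy columns as 0/1; newer pandas versions default to True/False
data_ohe = pd.get_dummies(data, dtype=int)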
data_ohe = data_ohe.reindex(columns = [col for col in data_ohe.columns if col != 'expenses'] + ['expenses'])
data_ohe.head()

Unnamed: 0,age,sex,bmi,children,smoker,region_northeast,region_northwest,region_southeast,region_southwest,expenses
0,19,1,27.9,0,1,0,0,0,1,16884.92
1,18,0,33.8,1,0,0,0,1,0,1725.55
2,28,0,33.0,3,0,0,0,1,0,4449.46
3,33,0,22.7,0,0,0,1,0,0,21984.47
4,32,0,28.9,0,0,0,1,0,0,3866.86


**Split into Features and Target**

In [5]:
X = data_ohe.drop('expenses', axis = 1)
y = data_ohe['expenses']

In [6]:
xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size = .2, random_state=3)
xtrain.shape, ytrain.shape, xtest.shape, ytest.shape

((1070, 9), (1070,), (268, 9), (268,))

**Modelling using RandomizedSearchCV**

In [13]:
# Although the max_depth range is wide here (2 to 49), only a few combinations are
# evaluated, because the search samples them at random (n_iter defaults to 10).

params1 = {'max_depth':list(range(2,50)), 'min_samples_split':[5,10,15,20,25,30]}

dtr = DecisionTreeRegressor()

dtr_rs = RandomizedSearchCV(estimator=dtr,
                            param_distributions=params1,
                            scoring='r2', 
                            cv=5).fit(xtrain, ytrain)


In [14]:
dtr_rs.best_params_, dtr_rs.best_score_

({'min_samples_split': 25, 'max_depth': 7}, 0.8334201385198703)

Convert `dtr_rs.cv_results_` from a dictionary to a DataFrame to make the results easier to read.

In [15]:
df_cv_results = pd.DataFrame(dtr_rs.cv_results_)
df_cv_results.head()


Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_min_samples_split,param_max_depth,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,0.004848,0.000734,0.001962,0.000537,15,16,"{'min_samples_split': 15, 'max_depth': 16}",0.850398,0.805958,0.786564,0.79262,0.781189,0.803346,0.024933,8
1,0.00304,0.000356,0.001263,0.000165,30,41,"{'min_samples_split': 30, 'max_depth': 41}",0.874524,0.840528,0.807294,0.837733,0.795084,0.831032,0.027865,2
2,0.002169,0.000181,0.000923,5.7e-05,30,34,"{'min_samples_split': 30, 'max_depth': 34}",0.874524,0.840528,0.807294,0.837733,0.795084,0.831032,0.027865,2
3,0.001794,0.000123,0.000766,4.9e-05,30,40,"{'min_samples_split': 30, 'max_depth': 40}",0.874524,0.840528,0.807294,0.837733,0.795084,0.831032,0.027865,2
4,0.001691,3.1e-05,0.000714,1.1e-05,25,46,"{'min_samples_split': 25, 'max_depth': 46}",0.86818,0.82864,0.797667,0.826872,0.799745,0.824221,0.025547,7
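
One way to read this table is to sort it by `rank_test_score` and keep only the columns of interest. The sketch below rebuilds the five rows shown above by hand so that it runs on its own; in the notebook the same calls would be applied to `df_cv_results` directly. The names `results`, `ranked` and `distinct_scores` are illustrative.

```python
import pandas as pd

# Values copied from the five rows of cv_results_ shown above
results = pd.DataFrame({
    'param_min_samples_split': [15, 30, 30, 30, 25],
    'param_max_depth': [16, 41, 34, 40, 46],
    'mean_test_score': [0.803346, 0.831032, 0.831032, 0.831032, 0.824221],
    'std_test_score': [0.024933, 0.027865, 0.027865, 0.027865, 0.025547],
    'rank_test_score': [8, 2, 2, 2, 7],
})

# Lower rank is better; tied candidates share the same rank
ranked = results.sort_values('rank_test_score')
print(ranked[['param_max_depth', 'param_min_samples_split',
              'mean_test_score', 'rank_test_score']])

# How many distinct mean scores does each min_samples_split value produce?
distinct_scores = results.groupby('param_min_samples_split')['mean_test_score'].nunique()
print(distinct_scores)
```

The three rows with `min_samples_split=30` have identical scores even though `max_depth` is 34, 40 and 41. This suggests, without proving it, that the tree stops splitting before it reaches those depths, so `max_depth` has no effect in that region.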


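Because this is a random search, only a few of all the possible combinations are picked. To see which combinations were tried, look at the `params` column of the results (`cv_results_['params']`). The output below comes from a separate run of the search, so the sampled combinations differ from those in the table above.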

In [11]:
df_cv_results[['params']]

Unnamed: 0,params
0,"{'min_samples_split': 10, 'max_depth': 30}"
1,"{'min_samples_split': 10, 'max_depth': 48}"
2,"{'min_samples_split': 15, 'max_depth': 7}"
3,"{'min_samples_split': 25, 'max_depth': 32}"
4,"{'min_samples_split': 30, 'max_depth': 40}"
5,"{'min_samples_split': 5, 'max_depth': 30}"
6,"{'min_samples_split': 25, 'max_depth': 25}"
7,"{'min_samples_split': 25, 'max_depth': 45}"
8,"{'min_samples_split': 20, 'max_depth': 5}"
9,"{'min_samples_split': 25, 'max_depth': 31}"
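
The hybrid approach mentioned in the note at the top could be sketched as follows. This is a minimal illustration, not the notebook's own code. It uses synthetic data from `make_regression` as a stand-in for `xtrain` and `ytrain`, fixes `random_state=0` for repeatability, and picks arbitrary window widths around the best random-search result. All of these are assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption); the notebook would use xtrain, ytrain instead
X_demo, y_demo = make_regression(n_samples=300, n_features=9, noise=10.0, random_state=0)

# Step 1: coarse randomized search over the wide space
coarse_space = {
    'max_depth': list(range(2, 50)),
    'min_samples_split': [5, 10, 15, 20, 25, 30],
}
coarse = RandomizedSearchCV(
    estimator=DecisionTreeRegressor(random_state=0),
    param_distributions=coarse_space,
    n_iter=10,
    scoring='r2',
    cv=5,
    random_state=0,
).fit(X_demo, y_demo)

best_depth = coarse.best_params_['max_depth']
best_split = coarse.best_params_['min_samples_split']

# Step 2: exhaustive grid in a narrow window around the coarse optimum.
# The window widths (+/-2 and +/-3) are arbitrary choices for this sketch.
fine_grid = {
    'max_depth': list(range(max(2, best_depth - 2), best_depth + 3)),
    'min_samples_split': list(range(max(2, best_split - 3), best_split + 4)),
}
fine = GridSearchCV(
    estimator=DecisionTreeRegressor(random_state=0),
    param_grid=fine_grid,
    scoring='r2',
    cv=5,
).fit(X_demo, y_demo)

print('coarse:', coarse.best_params_, coarse.best_score_)
print('fine:  ', fine.best_params_, fine.best_score_)
```

Because the fine grid contains the best combination from the random search, the grid search score should be at least as good as the coarse one when the estimator seed and folds are the same.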
