#### Objective : 
The objective of this case study is to demonstrate how the prediction accuracy of the Random Forest Classifier can be enhanced through grid-search-assisted hyperparameter tuning with GridSearchCV().

#### Data Source : https://archive.ics.uci.edu/ml/machine-learning-databases/cmc/cmc.data

#### About the dataset :

This dataset is a subset of the 1987 National Indonesia Contraceptive Prevalence Survey. The samples are married women who were either not pregnant or did not know whether they were at the time of the interview. The problem is to predict a woman's current contraceptive method choice (no use, long-term methods, or short-term methods) from her demographic and socio-economic characteristics.

#### Input Attributes :
1. Wife's age (numerical) 
2. Wife's education (categorical) 1=low, 2, 3, 4=high 
3. Husband's education (categorical) 1=low, 2, 3, 4=high 
4. Number of children ever born (numerical) 
5. Wife's religion (binary) 0=Non-Islam, 1=Islam 
6. Wife's now working? (binary) 0=Yes, 1=No 
7. Husband's occupation (categorical) 1, 2, 3, 4 
8. Standard-of-living index (categorical) 1=low, 2, 3, 4=high 
9. Media exposure (binary) 0=Good, 1=Not good 

#### Target Attribute :

1. Contraceptive method used (class attribute) 1=No-use, 2=Long-term, 3=Short-term

#### 1) Importing the relevant libraries :

In [1]:
import pandas as pd
import numpy as np

#### 2) Loading and formatting the dataframe :

In [2]:
# Column names follow the attribute order listed above; the file itself has no header row
columns=['wife_age','wife_education','husband_education','children_count','wife_religion','wife_work_status','husband_occupation',
     'standard_of_living_ix','media_exposure','contraceptive']
contraceptive_data=pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/cmc/cmc.data',header=None,names=columns)


In [3]:
contraceptive_data.head()

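|   | wife_age | wife_education | husband_education | children_count | wife_religion | wife_work_status | husband_occupation | standard_of_living_ix | media_exposure | contraceptive |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 24 | 2 | 3 | 3 | 1 | 1 | 2 | 3 | 0 | 1 |
| 1 | 45 | 1 | 3 | 10 | 1 | 1 | 3 | 4 | 0 | 1 |
| 2 | 43 | 2 | 3 | 7 | 1 | 1 | 3 | 4 | 0 | 1 |
| 3 | 42 | 3 | 2 | 9 | 1 | 1 | 3 | 3 | 0 | 1 |
| 4 | 36 | 3 | 3 | 8 | 1 | 1 | 3 | 2 | 0 | 1 |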


In [4]:
contraceptive_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1473 entries, 0 to 1472
Data columns (total 10 columns):
wife_age                 1473 non-null int64
wife_education           1473 non-null int64
husband_education        1473 non-null int64
children_count           1473 non-null int64
wife_religion            1473 non-null int64
wife_work_status         1473 non-null int64
husband_occupation       1473 non-null int64
standard_of_living_ix    1473 non-null int64
media_exposure           1473 non-null int64
contraceptive            1473 non-null int64
dtypes: int64(10)
memory usage: 115.2 KB


#### 3) Splitting the data into input variables and target variables: 
   

In [5]:
# The first nine columns are the input attributes; the last column is the target
X=contraceptive_data.loc[:,'wife_age':'media_exposure'].values
Y=contraceptive_data.loc[:,'contraceptive'].values

#### 4) Splitting the data further into training set and testing set:

In [6]:
from sklearn.model_selection import train_test_split
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,random_state=45,test_size=200)
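
The split above is not stratified, so the class proportions in the 200-row test set can drift from those of the full data. A minimal sketch of a stratified alternative follows. It runs on synthetic stand-in arrays (`X_demo` and `Y_demo` are hypothetical placeholders with arbitrary class proportions, not the survey data), and only the `stratify` argument differs from the call used in this case study.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in (hypothetical): 1473 rows, 9 integer features,
# labels 1/2/3 drawn with arbitrary, uneven probabilities.
rng = np.random.RandomState(0)
X_demo = rng.randint(0, 5, size=(1473, 9))
Y_demo = rng.choice([1, 2, 3], size=1473, p=[0.40, 0.25, 0.35])

# stratify=Y_demo keeps the class proportions nearly identical in both subsets.
X_tr, X_te, Y_tr, Y_te = train_test_split(
    X_demo, Y_demo, test_size=200, random_state=45, stratify=Y_demo
)

print(X_tr.shape, X_te.shape)
print('Class shares in full data:', np.bincount(Y_demo)[1:] / len(Y_demo))
print('Class shares in test set :', np.bincount(Y_te)[1:] / len(Y_te))
```

Whether stratification changes the conclusions here is not something the original experiment tested; it is simply a common precaution when class sizes are uneven.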

#### 5) Applying cross-validation-assisted grid search to choose the best hyperparameters for the Random Forest Classifier:

In this stage we tune the hyperparameters of the RandomForestClassifier (RFC) and determine which of the combinations passed to the GridSearchCV object gives the best cross-validated accuracy on the training set. The hyperparameters we vary are the number of estimators in the ensemble (n_estimators), the split criterion (criterion), the maximum number of features considered at each split (max_features), the maximum depth of each tree (max_depth), and the minimum number of samples a node must contain before it can be split (min_samples_split).

In [7]:
from sklearn.ensemble import RandomForestClassifier as RFC
from sklearn.model_selection import GridSearchCV 
hyperparams={'n_estimators':[25,30,35,40],
            'criterion':['gini','entropy'],
            'max_features':[6,7,8,9],
            'max_depth':[6,7,8,9,10,11],
            'min_samples_split':[2,3,4,5,6]}
grid_search_object=GridSearchCV(estimator=RFC(),param_grid=hyperparams,cv=10,scoring='accuracy',n_jobs=-1,verbose=3)
grid_search_object.fit(X_train,Y_train)

Fitting 10 folds for each of 960 candidates, totalling 9600 fits


[Parallel(n_jobs=-1)]: Done  16 tasks      | elapsed:    2.1s
[Parallel(n_jobs=-1)]: Done 167 tasks      | elapsed:    3.8s
[Parallel(n_jobs=-1)]: Done 487 tasks      | elapsed:    7.4s
[Parallel(n_jobs=-1)]: Done 935 tasks      | elapsed:   15.4s
[Parallel(n_jobs=-1)]: Done 1511 tasks      | elapsed:   26.4s
[Parallel(n_jobs=-1)]: Done 2215 tasks      | elapsed:   39.6s
[Parallel(n_jobs=-1)]: Done 3047 tasks      | elapsed:   56.0s
[Parallel(n_jobs=-1)]: Done 4007 tasks      | elapsed:  1.2min
[Parallel(n_jobs=-1)]: Done 5095 tasks      | elapsed:  1.6min
[Parallel(n_jobs=-1)]: Done 6311 tasks      | elapsed:  1.9min
[Parallel(n_jobs=-1)]: Done 7655 tasks      | elapsed:  2.4min
[Parallel(n_jobs=-1)]: Done 9127 tasks      | elapsed:  2.9min
[Parallel(n_jobs=-1)]: Done 9585 out of 9600 | elapsed:  3.0min remaining:    0.2s
[Parallel(n_jobs=-1)]: Done 9600 out of 9600 | elapsed:  3.0min finished


GridSearchCV(cv=10, error_score='raise',
       estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False),
       fit_params=None, iid=True, n_jobs=-1,
       param_grid={'n_estimators': [25, 30, 35, 40], 'criterion': ['gini', 'entropy'], 'max_features': [6, 7, 8, 9], 'max_depth': [6, 7, 8, 9, 10, 11], 'min_samples_split': [2, 3, 4, 5, 6]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='accuracy', verbose=3)
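
The candidate and fit counts reported in the log can be checked without training anything. The sketch below rebuilds the same grid and counts its combinations with scikit-learn's `ParameterGrid`; the variable names `n_candidates` and `cv_folds` are introduced here only for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as passed to GridSearchCV in this case study.
hyperparams = {
    'n_estimators': [25, 30, 35, 40],
    'criterion': ['gini', 'entropy'],
    'max_features': [6, 7, 8, 9],
    'max_depth': [6, 7, 8, 9, 10, 11],
    'min_samples_split': [2, 3, 4, 5, 6],
}

# Number of distinct combinations: 4 * 2 * 4 * 6 * 5
n_candidates = len(ParameterGrid(hyperparams))
cv_folds = 10

# Each candidate is fitted once per fold.
print(n_candidates, n_candidates * cv_folds)  # 960 9600
```

Because the cost grows multiplicatively with every added value, it is worth running this count before launching a larger search.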

#### 6) Extracting the optimal set of hyperparameters :

In [8]:
grid_search_object.best_params_

{'criterion': 'gini',
 'max_depth': 7,
 'max_features': 9,
 'min_samples_split': 5,
 'n_estimators': 25}
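
Besides `best_params_`, a fitted GridSearchCV object also exposes `best_score_` (the mean cross-validated score of the winning combination) and, because `refit=True` is the default, `best_estimator_` (a model already refitted on the whole training set). A small sketch of these attributes follows. It uses synthetic data from `make_classification` and a deliberately tiny grid so that it runs quickly; `X_demo`, `y_demo`, `small_grid` and `search` are hypothetical names, not objects from this case study.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic 3-class data standing in for the survey data (hypothetical).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=9, n_informative=5, n_classes=3, random_state=0
)

# Tiny illustrative grid; random_state is fixed so repeated runs agree.
small_grid = {'n_estimators': [10, 20], 'max_depth': [3, 5]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid=small_grid, cv=3, scoring='accuracy',
)
search.fit(X_demo, y_demo)

print('Best parameters       :', search.best_params_)
print('Mean CV accuracy      :', round(search.best_score_, 3))

# best_estimator_ is ready to use; no manual refit is required.
best_model = search.best_estimator_
print('Predictions (5 rows)  :', best_model.predict(X_demo[:5]))
```

Using `best_estimator_` directly avoids copying the chosen values by hand into a new classifier.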

#### 7) Evaluating the performance of the Random Forest Classifier with default (untuned) hyperparameters :

In [9]:
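from sklearn.metrics import confusion_matrix
# Baseline: Random Forest with default hyperparameters
rfc_clf=RFC()
rfc_clf.fit(X_train,Y_train)
# Rows are true classes, columns are predicted classes;
# indices 0, 1, 2 correspond to the labels 1, 2, 3
cm=confusion_matrix(Y_test,rfc_clf.predict(X_test))
print('Confusion matrix before Hyperparameter_Optimization:\n',cm)
# Per-class accuracy (recall): correct predictions divided by the row total
print('Accuracy for class 0:',100*cm[0,0]/(cm[0,0]+cm[0,1]+cm[0,2]))
print('Accuracy for class 1:',100*cm[1,1]/(cm[1,0]+cm[1,1]+cm[1,2]))
print('Accuracy for class 2:',100*cm[2,2]/(cm[2,0]+cm[2,1]+cm[2,2]))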

Confusion matrix before Hyperparameter_Optimization:
 [[60 10 16]
 [ 8 16 17]
 [23  7 43]]
Accuracy for class 0: 69.76744186046511
Accuracy for class 1: 39.02439024390244
Accuracy for class 2: 58.9041095890411
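
Each "Accuracy for class" figure above is the diagonal entry of the confusion matrix divided by its row total, which is the per-class recall. The three hand-written ratios can be condensed into one vectorised expression, sketched here on the matrix printed above (`cm_before` is a name introduced for illustration):

```python
import numpy as np

# Confusion matrix reported above (rows: true class, columns: predicted class).
cm_before = np.array([[60, 10, 16],
                      [ 8, 16, 17],
                      [23,  7, 43]])

# Diagonal = correct predictions per class; row sums = true samples per class.
per_class_recall = 100 * cm_before.diagonal() / cm_before.sum(axis=1)
print(per_class_recall)
```

The same values, as fractions rather than percentages, are returned by `sklearn.metrics.recall_score` with `average=None`.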


#### 8) Evaluating the performance of the Random Forest Classifier with tuned hyperparameters :

In [11]:
from sklearn.metrics import confusion_matrix
# Refit a Random Forest with the best hyperparameters found by the grid search
final_clf=RFC(n_estimators=25,criterion='gini',max_depth=7,max_features=9,min_samples_split=5)
final_clf.fit(X_train,Y_train)
cm=confusion_matrix(Y_test,final_clf.predict(X_test))
print('Confusion Matrix after Hyperparameter_Optimization:\n',cm)
# Per-class accuracy (recall): correct predictions divided by the row total
print('Accuracy for class 0:',100*cm[0,0]/(cm[0,0]+cm[0,1]+cm[0,2]))
print('Accuracy for class 1:',100*cm[1,1]/(cm[1,0]+cm[1,1]+cm[1,2]))
print('Accuracy for class 2:',100*cm[2,2]/(cm[2,0]+cm[2,1]+cm[2,2]))

Confusion Matrix after Hyperparameter_Optimization:
 [[65  4 17]
 [ 8 16 17]
 [15  9 49]]
Accuracy for class 0: 75.5813953488372
Accuracy for class 1: 39.02439024390244
Accuracy for class 2: 67.12328767123287


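From the above results, the tuned model has a higher per-class accuracy than the untuned one for two of the three classes: class 0 rises from 69.8% to 75.6% and class 2 from 58.9% to 67.1%, while class 1 is unchanged at 39.0%. (The indices 0, 1 and 2 correspond to the target labels 1=No-use, 2=Long-term and 3=Short-term.) Since neither classifier fixes random_state and the comparison rests on a single 200-sample test split, the improvement should be read as indicative rather than proven.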
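
The per-class figures can be complemented by the overall test accuracy, which follows directly from the two confusion matrices printed above: the sum of the diagonal divided by the total number of test samples. A short sketch (the names `cm_before`, `cm_after` and `overall_accuracy` are introduced here for illustration):

```python
import numpy as np

# Confusion matrices reported above (rows: true class, columns: predicted class).
cm_before = np.array([[60, 10, 16],
                      [ 8, 16, 17],
                      [23,  7, 43]])
cm_after = np.array([[65,  4, 17],
                     [ 8, 16, 17],
                     [15,  9, 49]])

def overall_accuracy(cm):
    """Percentage of test samples on the diagonal (correctly classified)."""
    return 100 * np.trace(cm) / cm.sum()

print('Before tuning:', overall_accuracy(cm_before))  # 59.5
print('After tuning :', overall_accuracy(cm_after))   # 65.0
```

On this split the tuned model classifies 130 of the 200 test samples correctly against 119 for the untuned one.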