### Objective : 

The objective of this case study is to demonstrate **Cross Validation assisted Grid Search**, a technique for determining the best combination of hyperparameter values for a classification model. Given a set of candidate combinations, the technique evaluates the model on each one using cross-validation and returns the combination for which the cross-validated performance is best.

In this case study, the best combination of hyperparameters for a Logistic Regression classifier will be arrived at using **Cross Validation assisted Grid Search**.
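
As a rough illustration of what the technique does internally, the sketch below scores each candidate value with cross-validation and keeps the best one. The synthetic data from `make_classification`, the three candidate values of `C` and the 5-fold setting are assumptions made for this example only; they are not the settings used later in this case study.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ParameterGrid, cross_val_score

# synthetic stand-in data (assumption: not the adult income dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# every combination in the grid is one candidate model
demo_grid = ParameterGrid({'C': [0.1, 1.0, 10.0]})

results = []
for params in demo_grid:
    model = LogisticRegression(max_iter=1000, **params)
    # mean accuracy over 5 cross-validation folds for this candidate
    fold_scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring='accuracy')
    results.append((fold_scores.mean(), params))

# the candidate with the highest mean cross-validated accuracy wins
best_score, best_params = max(results, key=lambda item: item[0])
print('best parameters:', best_params)
print('mean cross-validated accuracy:', round(best_score, 3))
```

`GridSearchCV` wraps this loop and, by default, also refits the winning candidate on the full training data.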

The dataset used for the demonstration is popularly known as the **US adult income** dataset. The classification goal is to predict, from the attribute values of an instance, whether the corresponding annual income is greater than 50,000 dollars.

### Data :

#### Data Source : https://www.kaggle.com/wenruliu/adult-income-dataset

#### Attributes of the dataset :

***Input Features*** :

1) age : Continuous

2) workclass : Categorical

3) fnlwgt : Continuous : Final sampling weight, i.e. an estimate of the number of people in the population that the record represents

4) education : Categorical

5) educational-num : Continuous

6) marital-status : Categorical

7) occupation : Categorical

8) relationship : Categorical

9) race : Categorical

10) gender : Categorical

11) capital-gain : Continuous

12) capital-loss : Continuous

13) hours-per-week : Continuous

14) native-country : Categorical

***Target Feature***:

income : Categorical : whether the annual income is at most 50K (`<=50K`) or greater than 50K (`>50K`)

#### 1) Importing the relevant libraries: 

In [1]:
import pandas as pd
import numpy as np

#### 2) Loading the dataset :

In [2]:
income_data=pd.read_csv('adult.csv',skipinitialspace=True,na_values='?')

#### 3) Checking the dataset for missing values :

In [3]:
income_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 15 columns):
age                48842 non-null int64
workclass          46043 non-null object
fnlwgt             48842 non-null int64
education          48842 non-null object
educational-num    48842 non-null int64
marital-status     48842 non-null object
occupation         46033 non-null object
relationship       48842 non-null object
race               48842 non-null object
gender             48842 non-null object
capital-gain       48842 non-null int64
capital-loss       48842 non-null int64
hours-per-week     48842 non-null int64
native-country     47985 non-null object
income             48842 non-null object
dtypes: int64(6), object(9)
memory usage: 5.6+ MB


The number of non-null values is not the same for every attribute: **workclass**, **occupation** and **native-country** contain missing values.
We will therefore remove the observations/instances with missing values.
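
The missing values can also be counted directly rather than read off the `info()` output. The sketch below does this on a tiny hypothetical CSV (the three rows are made up for illustration) that marks missing entries with `?`, as the dataset does.

```python
import io
import pandas as pd

# hypothetical three-row sample; '?' marks a missing entry
csv_text = """age,workclass,occupation
25,Private,Sales
38,?,?
28,Local-gov,?
"""
toy_data = pd.read_csv(io.StringIO(csv_text), skipinitialspace=True, na_values='?')

# number of missing values in each column
missing_per_column = toy_data.isnull().sum()
# number of rows that dropna() would remove
rows_with_missing = int(toy_data.isnull().any(axis=1).sum())

print(missing_per_column)
print('rows with at least one missing value:', rows_with_missing)
```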

#### 4) Getting rid of redundant attributes of the dataset :

The attribute **educational-num** is a numerical encoding of the attribute **education**; we will therefore drop the attribute **education**.

In [4]:
income_data=income_data.drop(labels='education',axis=1)
income_data.head(2)

| | age | workclass | fnlwgt | educational-num | marital-status | occupation | relationship | race | gender | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | Private | 226802 | 7 | Never-married | Machine-op-inspct | Own-child | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 1 | 38 | Private | 89814 | 9 | Married-civ-spouse | Farming-fishing | Husband | White | Male | 0 | 0 | 50 | United-States | <=50K |


#### 5) Removing instances with missing values :

In [5]:
income_data.dropna(inplace=True)
income_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 45222 entries, 0 to 48841
Data columns (total 14 columns):
age                45222 non-null int64
workclass          45222 non-null object
fnlwgt             45222 non-null int64
educational-num    45222 non-null int64
marital-status     45222 non-null object
occupation         45222 non-null object
relationship       45222 non-null object
race               45222 non-null object
gender             45222 non-null object
capital-gain       45222 non-null int64
capital-loss       45222 non-null int64
hours-per-week     45222 non-null int64
native-country     45222 non-null object
income             45222 non-null object
dtypes: int64(6), object(8)
memory usage: 5.2+ MB


#### 6) Data Preprocessing :

#### 6.1) Categorical Features and their indices:

In [6]:
categorical_feature=income_data.select_dtypes(include=object).columns
categorical_index=[]
for feature in categorical_feature:
    categorical_index.append(income_data.columns.get_loc(feature))
feature_index=pd.DataFrame(data={'feature':categorical_feature,'index':categorical_index})
feature_index

| | feature | index |
|---|---|---|
| 0 | workclass | 1 |
| 1 | marital-status | 4 |
| 2 | occupation | 5 |
| 3 | relationship | 6 |
| 4 | race | 7 |
| 5 | gender | 8 |
| 6 | native-country | 12 |
| 7 | income | 13 |


#### 6.2) Label Encoding the categorical attributes :

In [7]:
#fetching the categorical features from among the input features of the dataset:
categorical_attributes=income_data.loc[:,'age':'native-country'].select_dtypes(include=object).columns

#determining the indices of the categorical features:
categorical_indices=[]
for attribute in categorical_attributes:
    categorical_indices.append(income_data.loc[:,'age':'native-country'].columns.get_loc(attribute))

#label encoding the input categorical features:
from sklearn.preprocessing import LabelEncoder
encoder_object=LabelEncoder()
for count in categorical_indices:
    income_data.iloc[:,count]=encoder_object.fit_transform(income_data.iloc[:,count])
print('categorical indices :',categorical_indices)

#label encoding the target feature:
income_data.iloc[:,13]=encoder_object.fit_transform(income_data.iloc[:,13])

categorical indices : [1, 4, 5, 6, 7, 8, 12]


In [8]:
income_data.head()

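| | age | workclass | fnlwgt | educational-num | marital-status | occupation | relationship | race | gender | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | 2 | 226802 | 7 | 4 | 6 | 3 | 2 | 1 | 0 | 0 | 40 | 38 | 0 |
| 1 | 38 | 2 | 89814 | 9 | 2 | 4 | 0 | 4 | 1 | 0 | 0 | 50 | 38 | 0 |
| 2 | 28 | 1 | 336951 | 12 | 2 | 10 | 0 | 4 | 1 | 0 | 0 | 40 | 38 | 1 |
| 3 | 44 | 2 | 160323 | 10 | 2 | 6 | 0 | 2 | 1 | 7688 | 0 | 40 | 38 | 1 |
| 5 | 34 | 2 | 198693 | 6 | 4 | 7 | 1 | 4 | 1 | 0 | 0 | 30 | 38 | 0 |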


#### 6.3) One Hot Encoding the categorical attributes among the input features  :

In [9]:
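from sklearn.preprocessing import OneHotEncoder
#note: the categorical_features argument exists only in scikit-learn versions before 0.22;
#later versions select the columns with sklearn.compose.ColumnTransformer instead
hot_encoder=OneHotEncoder(categorical_features=categorical_indices)
#the encoded columns come first, followed by the untouched numerical columns and the target
income_data=hot_encoder.fit_transform(income_data).toarray()
income_data=pd.DataFrame(data=income_data)
income_data.head()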

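| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 79 | 80 | 81 | 82 | 83 | 84 | 85 | 86 | 87 | 88 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 0.0 | 25.0 | 226802.0 | 7.0 | 0.0 | 0.0 | 40.0 | 0.0 |
| 1 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 1.0 | 0.0 | 0.0 | 38.0 | 89814.0 | 9.0 | 0.0 | 0.0 | 50.0 | 0.0 |
| 2 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 1.0 | 0.0 | 0.0 | 28.0 | 336951.0 | 12.0 | 0.0 | 0.0 | 40.0 | 1.0 |
| 3 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 1.0 | 0.0 | 0.0 | 44.0 | 160323.0 | 10.0 | 7688.0 | 0.0 | 40.0 | 1.0 |
| 4 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 0.0 | 34.0 | 198693.0 | 6.0 | 0.0 | 0.0 | 30.0 | 0.0 |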
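
On scikit-learn 0.22 and later, the same column layout (one hot encoded columns first, remaining columns after them) can be obtained with `ColumnTransformer`. The sketch below is a minimal illustration on a hypothetical four-row dataframe, not a drop-in replacement for the cell above; the toy values and the names `toy_data`, `categorical_columns` and `transformer` are assumptions made for this example.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# hypothetical miniature version of the dataset
toy_data = pd.DataFrame({
    'age': [25, 38, 28, 44],
    'workclass': ['Private', 'Private', 'Local-gov', 'Private'],
    'gender': ['Male', 'Male', 'Female', 'Male'],
    'hours-per-week': [40, 50, 40, 40],
})
categorical_columns = ['workclass', 'gender']

transformer = ColumnTransformer(
    transformers=[('onehot', OneHotEncoder(), categorical_columns)],
    remainder='passthrough',  # keep the numerical columns unchanged
    sparse_threshold=0,       # always return a dense array
)
encoded = transformer.fit_transform(toy_data)

# number of one hot columns created for each categorical feature
splits = [len(c) for c in transformer.named_transformers_['onehot'].categories_]

print('encoded shape:', encoded.shape)
print('columns per categorical feature:', splits)
```

In this setup `OneHotEncoder` accepts string categories directly, so a separate label encoding pass over the input features is not needed, and the fitted encoder's `categories_` attribute gives the number of columns produced per feature.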


#### 6.4) Inspecting the number of columns each categorical feature is split into by one hot encoding:

In [10]:
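#n_values_ holds the number of categories per encoded feature (attribute available only in scikit-learn versions before 0.22)
column_splits=pd.DataFrame(data={'INDICES':categorical_indices,'SPLITS':hot_encoder.n_values_})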
column_splits

| | INDICES | SPLITS |
|---|---|---|
| 0 | 1 | 7 |
| 1 | 4 | 7 |
| 2 | 5 | 14 |
| 3 | 6 | 6 |
| 4 | 7 | 5 |
| 5 | 8 | 2 |
| 6 | 12 | 41 |


#### 7) Checking for class imbalance :

In [11]:
income_data[88].value_counts()

0.0    34014
1.0    11208
Name: 88, dtype: int64

The dataset is clearly imbalanced: roughly three out of every four instances belong to class 0. A classifier trained on such a dataset often performs very well at assigning class labels to instances of the majority class but poorly at assigning class labels to instances of the minority class. One way to address this problem is to train the classifier on a dataframe obtained by sampling an equal number of instances of each class from the original dataset. We use this approach to train our classifier in the next section.
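
The extent of the imbalance can be worked out from the counts printed above. The short sketch below computes the share of each class and the accuracy that a trivial classifier would reach by always predicting the majority class; the variable names are illustrative.

```python
# class counts copied from the value_counts() output above
class_counts = {0.0: 34014, 1.0: 11208}

total = sum(class_counts.values())
proportions = {label: count / total for label, count in class_counts.items()}

# accuracy of a classifier that always predicts the majority class
majority_baseline = max(proportions.values())
# how many majority instances there are per minority instance
imbalance_ratio = max(class_counts.values()) / min(class_counts.values())

for label, share in proportions.items():
    print('class', label, 'share: {:.1%}'.format(share))
print('majority-class baseline accuracy: {:.1%}'.format(majority_baseline))
print('imbalance ratio: {:.2f}'.format(imbalance_ratio))
```

Because the baseline is already about 75%, accuracy on its own says little about how well the minority class is handled, which is why specificity and sensitivity are also reported later.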

#### 8) Sampling an equal number of instances of either class and creating a new dataframe using those instances:

In [12]:
#sampling an equal number of instances of either class from the main dataset
#(class 0.0 is income <=50K, class 1.0 is income >50K)
income_negative=income_data[income_data[88]==0.0].sample(n=3500,replace=False)
income_positive=income_data[income_data[88]==1.0].sample(n=3500,replace=False)

#creating a dataset from those instances
training_data=pd.concat([income_negative,income_positive])
training_data=training_data.reindex(np.random.permutation(training_data.index))

#dropping the rows of training data from the main dataset
testing_data=income_data.drop(training_data.index)

#splitting the training data into a matrix of input features and corresponding vector of target feature.
X_train=training_data.iloc[:,0:88]
Y_train=training_data.iloc[:,88]

#splitting the testing data into a matrix of input features and corresponding vector of target feature.
X_test=testing_data.iloc[:,0:88]
Y_test=testing_data.iloc[:,88]

#### 9) Standardizing the training set and the test set :

In [13]:
from sklearn.preprocessing import StandardScaler
standardizer=StandardScaler()
X_train=standardizer.fit_transform(X_train)
X_test=standardizer.transform(X_test)
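
Note that the scaler is fitted on the training set only and then applied unchanged to the test set, so no information from the test set leaks into the preprocessing. The sketch below illustrates this on randomly generated arrays (an assumption made for this example; the names ending in `_toy` are hypothetical).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train_toy = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
X_test_toy = rng.normal(loc=50.0, scale=10.0, size=(50, 3))

scaler = StandardScaler()
# the mean and standard deviation are learned from the training data only
X_train_scaled = scaler.fit_transform(X_train_toy)
# the test data is scaled with the training statistics
X_test_scaled = scaler.transform(X_test_toy)

# the same transformation written out by hand
X_test_manual = (X_test_toy - scaler.mean_) / scaler.scale_

print('training column means:', X_train_scaled.mean(axis=0).round(6))
print('training column standard deviations:', X_train_scaled.std(axis=0).round(6))
print('test column means:', X_test_scaled.mean(axis=0).round(3))
```

The training columns end up with mean 0 and standard deviation 1, while the test columns are only approximately standardized because they are scaled with the training statistics.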

#### 10) Applying principal component analysis (PCA) to reduce the dimensionality of the data :

In [14]:
from sklearn.decomposition import PCA
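#a fractional n_components keeps just enough components to explain that share of the variance
pca_object=PCA(n_components=0.90)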
X_train=pca_object.fit_transform(X_train)
X_test=pca_object.transform(X_test)
print('no_of_principal_components:',pca_object.n_components_)
print('explained_variance_ratio:',pca_object.explained_variance_ratio_)

no_of_principal_components: 67
explained_variance_ratio: [0.05284585 0.02982293 0.02867989 0.02282749 0.02119262 0.02034241
 0.01888244 0.01655008 0.0157687  0.0147243  0.0144925  0.01404523
 0.01391171 0.0136471  0.01342062 0.01329539 0.01307106 0.01297128
 0.01277399 0.0127089  0.01264268 0.01252103 0.01234434 0.01227586
 0.01221982 0.01218058 0.01205533 0.01186592 0.01178766 0.01172788
 0.01170132 0.01153969 0.01149583 0.01145982 0.01142093 0.01140528
 0.01139713 0.01138834 0.01138689 0.01138421 0.01137879 0.01137702
 0.01137247 0.01136629 0.01136216 0.01134392 0.0113238  0.01131886
 0.01127508 0.01122944 0.01116608 0.01114663 0.01109538 0.01103367
 0.01088749 0.01081871 0.01075847 0.0107004  0.0106433  0.0105832
 0.01043146 0.01028241 0.01021695 0.01015178 0.01008804 0.00995053
 0.00984962]
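
When `n_components` is a fraction between 0 and 1, `PCA` keeps the smallest number of components whose cumulative explained variance exceeds that fraction. The sketch below checks this on synthetic data (an assumption for this example: ten independent columns with increasing spread); `svd_solver='full'` is passed explicitly because a fractional `n_components` relies on a full decomposition.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# ten independent columns whose standard deviations are 1, 2, ..., 10
X_toy = rng.normal(size=(500, 10)) * np.arange(1, 11)

pca_toy = PCA(n_components=0.90, svd_solver='full')
pca_toy.fit(X_toy)

# running total of the variance explained by the retained components
cumulative = np.cumsum(pca_toy.explained_variance_ratio_)

print('components kept:', pca_toy.n_components_)
print('cumulative explained variance:', cumulative.round(3))
```

The last entry of `cumulative` is the first one to pass the 0.90 threshold. Applied to the output above, the same reasoning explains why 67 components are kept: each one contributes only about 1% to 5% of the variance.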


#### 11) Demonstrating cross-validation assisted hyper-parameter tuning on LogisticRegression classifier using GridSearchCV() :

#### 11.1) Tuning the hyper-parameters :

1) Estimator used : LogisticRegression classifier.

2) Parameters tuned :'penalty', 'C', 'solver'.

In [15]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
param_comb1={'penalty':['l1'],'C':[0.15,0.30,0.45,0.60,0.75,0.90,1],'solver':['liblinear','saga']}
param_comb2={'penalty':['l2'],'C':[0.15,0.30,0.45,0.60,0.75,0.90,1],'solver':['newton-cg','sag','lbfgs']}
hyperparams=[param_comb1,param_comb2]
grid_search_object=GridSearchCV(estimator=LogisticRegression(),param_grid=hyperparams,cv=15,scoring='accuracy',n_jobs=-1)
grid_search_object.fit(X_train,Y_train)

#obtaining the best combination of hyper-parameters
print('Best Parameters',grid_search_object.best_params_)

Best Parameters {'C': 0.9, 'penalty': 'l2', 'solver': 'newton-cg'}
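
To get a feel for the cost of this search, the sketch below expands the two grids from the cell above into individual candidates and counts the model fits. The helper `expand_grid` is a hypothetical name introduced for this example; it mimics the enumeration of combinations and is not part of scikit-learn.

```python
from itertools import product

# the two grids, copied from the cell above
param_comb1 = {'penalty': ['l1'], 'C': [0.15, 0.30, 0.45, 0.60, 0.75, 0.90, 1],
               'solver': ['liblinear', 'saga']}
param_comb2 = {'penalty': ['l2'], 'C': [0.15, 0.30, 0.45, 0.60, 0.75, 0.90, 1],
               'solver': ['newton-cg', 'sag', 'lbfgs']}

def expand_grid(grid):
    """Return every combination of values in one grid as a list of dicts."""
    keys = sorted(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]

candidates = expand_grid(param_comb1) + expand_grid(param_comb2)
n_folds = 15  # the cv argument passed to GridSearchCV
total_fits = len(candidates) * n_folds

print('candidates:', len(candidates))
print('fits during the search:', total_fits)
```

Splitting the search into two grids avoids combinations the solvers do not support, such as the `l1` penalty with `newton-cg`, `sag` or `lbfgs`.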


#### 11.2) Performance of the Classifier with  tuned hyperparameters :

In [16]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix,accuracy_score
#note: C=1 and solver='lbfgs' below are not the best parameters printed in 11.1 (C=0.9, solver='newton-cg');
#LogisticRegression(**grid_search_object.best_params_) would use the combination found by the search
logreg_clf=LogisticRegression(C=1,penalty='l2',solver='lbfgs')
logreg_clf.fit(X_train,Y_train)
Y_pred=logreg_clf.predict(X_test)
#rows of the confusion matrix are the true classes, columns are the predicted classes
CM=confusion_matrix(Y_test,Y_pred)
print('CONFUSION_MATRIX:\n',CM)
print('SPECIFICITY:',100*(CM[0,0]/(CM[0,0]+CM[0,1])))
print('SENSITIVITY:',100*(CM[1,1]/(CM[1,0]+CM[1,1])))
print('ACCURACY ON TEST SET:',100*accuracy_score(Y_test,Y_pred))
print('ACCURACY ON TRAINING SET:',100*accuracy_score(Y_train,logreg_clf.predict(X_train)))


CONFUSION_MATRIX:
 [[23604  6910]
 [ 1274  6434]]
SPECIFICITY: 77.35465687880972
SENSITIVITY: 83.47171769590037
ACCURACY ON TEST SET: 78.58824760609073
ACCURACY ON TRAINING SET: 81.27142857142857
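
The specificity, sensitivity and test accuracy reported above follow directly from the four cells of the confusion matrix. The sketch below redoes the arithmetic with the printed values; the names `tn`, `fp`, `fn` and `tp` are illustrative.

```python
# confusion matrix printed above: rows are true classes, columns are predicted classes
tn, fp = 23604, 6910   # true class 0 (income <=50K)
fn, tp = 1274, 6434    # true class 1 (income >50K)

# share of class 0 instances that were labelled correctly
specificity = 100 * tn / (tn + fp)
# share of class 1 instances that were labelled correctly
sensitivity = 100 * tp / (tp + fn)
# share of all test instances that were labelled correctly
accuracy = 100 * (tn + tp) / (tn + fp + fn + tp)

print('specificity: {:.2f}'.format(specificity))
print('sensitivity: {:.2f}'.format(sensitivity))
print('accuracy: {:.2f}'.format(accuracy))
```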


#### 11.3) Performance of the classifier without tuned hyperparameters:

In [17]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix,accuracy_score
logreg_clf=LogisticRegression()
logreg_clf.fit(X_train,Y_train)
Y_pred=logreg_clf.predict(X_test)
CM=confusion_matrix(Y_test,Y_pred)
print('CONFUSION_MATRIX:\n',confusion_matrix(Y_test,Y_pred))
print('SPECIFICITY:',100*(CM[0,0]/(CM[0,0]+CM[0,1])))
print('SENSITIVITY:',100*(CM[1,1]/(CM[1,0]+CM[1,1])))
print('ACCURACY ON TEST SET:',100*accuracy_score(Y_test,Y_pred))
print('ACCURACY ON TRAINING SET:',100*accuracy_score(Y_train,logreg_clf.predict(X_train)))

CONFUSION_MATRIX:
 [[23602  6912]
 [ 1272  6436]]
SPECIFICITY: 77.34810251032313
SENSITIVITY: 83.49766476388169
ACCURACY ON TEST SET: 78.58824760609073
ACCURACY ON TRAINING SET: 81.25714285714287


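We thus observe that tuning the hyperparameters made almost no difference here: the accuracy on the test set is identical for both classifiers (78.59%), and the specificity, sensitivity and training accuracy differ by less than 0.03 percentage points.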