### Objective : 

The objective of this case study is demonstrate a technique called **Voting based Classification**, which basically involves aggregating the results of a relatively handful of predictors each built upon a separate classification algorithm and trained on exactly the same training dataset,to predict the class of an instance. The results could be aggregated in the framework of either *Soft Voting* scheme : one in which, for a particular instance of the test set, the probabilities of prediction of a particular class label are averaged across all the classifiers and the class label that corresponds to the highest value is assigned to the instance or *Hard Voting* scheme : one in which, for an instance of the test set, the class label that has been predicted by the maximum number of predictors is assigned. All the predictors are trained in parallel.


The dataset that will be used for demonstration is popular by the name of **US adult income** dataset. The classification goal, in the context of the dataset, would be to predict whether or not the annual income corresponding to an instance, for a given combination of values of its attributes is greater than 50,000 dollars.

### Data :

#### Data Source : https://www.kaggle.com/wenruliu/adult-income-dataset

#### Attributes of the dataset :

***Input Features*** :

1) age : Continuous

2) workclass : Categorical

3) fnlwgt : Continuous : Number of instances in the original dataset that have exactly the same set of values

4) education : Categorical

5) Education-num : Continuous

6) marital-status : Categorical

7) occupation : Categorical

8) relationship : Categorical

9) race : Categorical

10) sex : Categorical

11) capital-gain : Categorical

12) capital-loss : Categorical

13) Hours-per-week : Categorical

14) country : Categorical

***Target Feature***:

Income : Categorical: lesser than or greater than 50k?

#### 1) Importing the relevant libraries :

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%pylab inline

Populating the interactive namespace from numpy and matplotlib


#### 2) Loading the dataset :

In [2]:
income_data=pd.read_csv('adult.csv',sep=',',skipinitialspace=True,na_values='?')
income_data.head()

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,18,,103497,Some-college,10,Never-married,,Own-child,White,Female,0,0,30,United-States,<=50K


#### 3) Checking for class imbalance:

In [3]:
income_data['income'].value_counts()

<=50K    37155
>50K     11687
Name: income, dtype: int64

#### 4) Checking for missing values :

In [4]:
income_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 15 columns):
age                48842 non-null int64
workclass          46043 non-null object
fnlwgt             48842 non-null int64
education          48842 non-null object
educational-num    48842 non-null int64
marital-status     48842 non-null object
occupation         46033 non-null object
relationship       48842 non-null object
race               48842 non-null object
gender             48842 non-null object
capital-gain       48842 non-null int64
capital-loss       48842 non-null int64
hours-per-week     48842 non-null int64
native-country     47985 non-null object
income             48842 non-null object
dtypes: int64(6), object(9)
memory usage: 5.6+ MB


#### 5) Label Encoding all the features:

Label Encoding is the process of assigning numerical labels to values contained within categorical input features of the
dataframe. Label Encoding is performed inorder to facilitate the application of predictive mathematical models such
as **Logistic Regression**, **Support Vector Machines**, **Naive-Baye's** etc, to those datasets which contain categorical/non-numerical data. Label Encoding is performed in two stages which are as follows:

1) The Categorical attributes have to be fetched from the main dataframe.

2) The values contained within the categorical attributes have been assigned numerical labels.

In [5]:
#getting rid of duplicated columns:
income_data=income_data.drop(labels='education',axis=1)

#getting rid of rows containing missing values:
income_data.dropna(inplace=True)

#fetching categorical indices
categorical_features=income_data.loc[:,'age':'income'].select_dtypes(include=object).columns
categorical_indices=[]
for feature in categorical_features:
    categorical_indices.append(income_data.columns.get_loc(feature))

#label encoding the categorical features:
from sklearn.preprocessing import LabelEncoder
encoder_object=LabelEncoder()
for count in categorical_indices:
    income_data.iloc[:,count]=encoder_object.fit_transform(income_data.iloc[:,count])

#displaying the categorical indices:
categorical_indices

[1, 4, 5, 6, 7, 8, 12, 13]

#### 6) OneHotEncoding the LabelEncoded input features :

In order to facilitate the application of mathamatical models to datasets, merely assigning numerical labels to categorical attributes is simply not enough. One must remember that the assigned numerical labels are not related to each other in an ordinal sense, therefore we use a technique called 'OneHotEncoding' which, what basically does is, the following :

A column representing a categorical attribute is split into multiple columns such that we have new columns equal to the number of all the numerical labels used for encoding the values contained within the column under consideration. Inorder to expand upon what has just been stated, consider the following, the column of the dataframe named 'job' contains 41118 values, these 41118 values have been assigned numerical labels using integers from 0 to 12 i.e 13 integers. We will now split the 'job' column into 13 columns and each of the columns will represent an integer from 0 to 12.

For a particular observation (row index) if the job is encoded with a label '3', it will reflect in the newly created columns in the following way, the column that reprsents label '3' will be assigned 1 whereas rest of the columns will be assigned '0' and so on. This holds true for all the encoded categorical columns.

To sum up 'OneHotEncoding' can be described as the process of assigning a binary sequence of a particular 'length' to each value conatined within a 'LabelEncoded' attribute. The 'length'of the binary sequence is equal to the number of numerical labels used to represent the different values contained within a categorical column.

**CAUTION!!! : WE MUST REFRAIN FROM 'OneHotEncoding' THE TARGET ATTRIBUTE**

In [6]:
from sklearn.preprocessing import OneHotEncoder
hot_encoder=OneHotEncoder(categorical_features=[[1,4,5,6,7,8,12]])
income_data=hot_encoder.fit_transform(income_data).toarray()
income_data=pd.DataFrame(data=income_data)

In [7]:
income_data.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,79,80,81,82,83,84,85,86,87,88
0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,25.0,226802.0,7.0,0.0,0.0,40.0,0.0
1,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,1.0,0.0,0.0,38.0,89814.0,9.0,0.0,0.0,50.0,0.0
2,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,1.0,0.0,0.0,28.0,336951.0,12.0,0.0,0.0,40.0,1.0
3,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,1.0,0.0,0.0,44.0,160323.0,10.0,7688.0,0.0,40.0,1.0
4,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,34.0,198693.0,6.0,0.0,0.0,30.0,0.0


In [8]:
income_data[88].value_counts()

0.0    34014
1.0    11208
Name: 88, dtype: int64

#### 7) Creating a training dataset by drawinng an equal number of instances of the either class :

Since the dataset we have is infested with a high degree of class imabalnce, therefore to avoid the classifier from predicting well on only those instances that belong to the abundant class, we draw an equal number of instances of either class from the main dataset and use those instances to train the classifier. 

Once the instances have been drawn from the main datset, its remaining instances are used to test the perfromance of the classifier.Thus we must remove those instances from the main datset that had been drawn for making the training set.

In [9]:
#obtaining training set by sampling an equal numer of instances of eitehr class
income_data_negative=income_data[income_data[88]==0].sample(n=3000,replace=False)
income_data_positive=income_data[income_data[88]==1].sample(n=3000,replace=False)
training_data=pd.concat([income_data_negative,income_data_positive])
training_data=training_data.reindex(np.random.permutation(training_data.index))

#dropping rows of the training data from the test data
income_data=income_data.drop(training_data.index)

In [10]:
#training data
X_train=training_data.loc[:,0:88]
Y_train=training_data.loc[:,88]

#test data
X_test=income_data.loc[:,0:88]
Y_test=income_data.loc[:,88]

#### 8) Standardizing the training and testing datasets :

Standardization is defined as the process of transform the dataframe in such a way that, the variance 
of each column is equal to 1 and the mean is 0. A column is standardized by replace each value of the column by its Z-score. The Z-Score of a value is defined as the number of standard deviations away an observation is from the mean value.


In [11]:
from sklearn.preprocessing import StandardScaler
standardizer=StandardScaler()
X_train=standardizer.fit_transform(X_train)
X_test=standardizer.transform(X_test)

#### 9) Reducing the dimensionality of the training dataset and test dataset using Principal Component Analysis (PCA) :

Dimensionality reduction can be understood in the following way. Any dataset containing numerical columns can be thought of as a multi dimensional space having a specific number of dimensions. The dimensionality of the space is equal to the number of columns in the dataset.The rows of the datasets, which are also known as observations, can be thought of as vectors pointing in different directions within that multidimensional space. What PCA does is, it detrmines various unit vectors in that multi-dimensional space such that the 'statistical-variance' of projection of obervations is maximum along those vectors. The important thing to keep in mind is, PCA finds determines such unit vectors in a quantity that is much smaller than the original dimensionality of the dataset. That's why PCA is called a 'dimensionality-reduction' method. 

The unit vectors determined as a consequence of application of PCA are called 'Principal Components'. Each 'Principal Component' captures some proportion of variation of the data in the dataset. Onnce the principal components are determined, the observations are projected on to them.

In [12]:
from sklearn.decomposition import PCA
pca_object=PCA(0.90)
X_train=pca_object.fit_transform(X_train)
X_test=pca_object.transform(X_test)
print('NUMBER OF COMPONENTS :',pca_object.n_components_)

NUMBER OF COMPONENTS : 67


We have thus removed the dimensionality of the training and testing data from 89 to 65 while rertaining 90% of the variance.

#### 10) Applying Voting Classifier in the frame work of Hard Voting Scheme :

In *Hard-Voting* class labels are designated to instaces based on, which among all the class labels has been predicted by the maximum number of classifiers or which class labels has recieved the maximum number of votes. 
We will use, LogisticRegression, RandomForestClassifier,LogisticRegression and Support Vector Classifier in the voting ensemble. We will then evaluate the performance of each of these classifiers and compare their individual performance to the performance of their collective ensemble(Voting Classifier).

In [13]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import confusion_matrix,accuracy_score

voting_clf=VotingClassifier(voting='hard',estimators=[('logreg',LogisticRegression()),('tree_clf',DecisionTreeClassifier()),
                                                     ('rfc_clf',RandomForestClassifier()),('svm_clf',SVC(probability=True))])

for clf in [LogisticRegression(),DecisionTreeClassifier(),RandomForestClassifier(),SVC(),voting_clf]:
    clf.fit(X_train,Y_train)
    Y_pred=clf.predict(X_test)
    print('Accuracy on',clf.__class__.__name__,':',100*accuracy_score(Y_test,Y_pred))

Accuracy on LogisticRegression : 93.36086890010708
Accuracy on DecisionTreeClassifier : 92.09627250012747
Accuracy on RandomForestClassifier : 96.02773953393505
Accuracy on SVC : 96.83086023150273
Accuracy on VotingClassifier : 97.46315843149253


  if diff:


#### Conclusion :

From the above results, we can clearly see that the performance of the voting classifier based on an ensemble of different classifiers is better than that of any of the individual classifiers that make the ensemble.