#### Objective of the case study : To demonstrate the effect of sampling data uniformly across the classes.

Classification models often perform poorly on datasets with **class imbalance**. Class imbalance is a condition in which instances of one class greatly outnumber instances of the other class or classes. As a result, a classification model tends to perform very well on instances of the abundant class and poorly on instances of the scarce class. This case study demonstrates that, for binary classification, the problem can be addressed by training the classifier on a dataset created by drawing an equal number of samples from each class. A classifier trained on such a dataset shows a balanced performance, i.e. it classifies instances of either class correctly at a similar rate.
________________________________________________________________________________________________________________________________

#### Approach :

We will train the classifier on a dataset obtained by sampling exactly an equal number of instances of either class. The performance of the classifier will be evaluated on the original dataset, with the sampled instances removed from it.

________________________________________________________________________________________________________________________________

#### Data Set Information :

DATA SOURCE : https://archive.ics.uci.edu/ml/machine-learning-databases/00222/

The dataset is derived from a marketing campaign run by a Portuguese banking institution between 2008 and 2013. By training a classifier on it, we want to build a model that can be used to assess the likelihood of a client subscribing to a term deposit when contacted over the telephone. Clients who are likely to subscribe to the term deposit are assigned 1 by the classifier, and those who are unlikely to subscribe are assigned 0.
________________________________________________________________________________________________________________________________

#### Input Features :

#### Bank Client data:
1 - age (numeric)

2 - job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')

3 - marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)

4 - education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')

5 - default: has credit in default? (categorical: 'no','yes','unknown')

6 - housing: has housing loan? (categorical: 'no','yes','unknown')

7 - loan: has personal loan? (categorical: 'no','yes','unknown')

#### Related with the last contact of the current campaign :

8 - contact: contact communication type (categorical: 'cellular','telephone') 

9 - month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')

10 - day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')

11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.

#### Other attributes/input features :

12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)

13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)

14 - previous: number of contacts performed before this campaign and for this client (numeric)

15 - poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')

#### Social and Economic Context attributes/input features :

16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)

17 - cons.price.idx: consumer price index - monthly indicator (numeric) 

18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric) 

19 - euribor3m: euribor 3 month rate - daily indicator (numeric)

20 - nr.employed: number of employees - quarterly indicator (numeric)

#### Output Feature / Target Feature :

21 - y - has the client subscribed a term deposit? (binary: 'yes','no')



#### 1) Loading the dataset into the notebook:

In [4]:
import pandas as pd
import numpy as np
bank_data=pd.read_csv('bank-additional-full.csv',sep=';')
bank_data.head()

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
2,37,services,married,high.school,no,yes,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
4,56,services,married,high.school,no,no,yes,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no


#### 2) Checking for Class Imbalance :

In [5]:
bank_data['y'].value_counts()

no     36548
yes     4640
Name: y, dtype: int64

There are 36548 instances that belong to the class 'no', whereas just 4640 instances belong to the class 'yes'.
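
To put a number on the imbalance, the two counts can be turned into a ratio and a minority share. The snippet below is a small sketch that hard-codes the counts printed above; the variable names are illustrative.

```python
# Class counts copied from the value_counts() output above.
counts = {"no": 36548, "yes": 4640}

total = sum(counts.values())
imbalance_ratio = counts["no"] / counts["yes"]    # 'no' rows per 'yes' row
minority_share = 100 * counts["yes"] / total      # percentage of 'yes' rows
majority_share = 100 * counts["no"] / total       # percentage of 'no' rows

print(f"total rows: {total}")
print(f"imbalance ratio (no : yes): {imbalance_ratio:.2f} : 1")
print(f"share of 'yes' rows: {minority_share:.2f}%")
print(f"share of 'no' rows: {majority_share:.2f}%")
```

The majority share is also the accuracy of a trivial model that always predicts 'no', which is why overall accuracy alone says little about performance on the scarce class.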

####  3) DEMONSTRATING THE EFFECT OF UNIFORM SAMPLING :

Note: In this section we focus only on the effect of sampling uniformly across the class labels, without delving into the specifics of the code. The code is discussed in the subsequent sections of the case study.

#### 3.1) Creating a balanced dataframe:
In the dataframe we have, instances whose target label is **no** outnumber instances whose target label is **yes** by a wide margin (36548 to 4640). When a model trained on a dataset with such a high degree of **class imbalance** is evaluated on an equal number of instances of either class, the proportion of correct predictions is likely to be much higher for instances of the abundant class. In other words, for a given test observation the model has a 'bias' towards predicting the abundant class label. This 'bias' can be reduced either by training the model on an equal number of samples of each class or by adjusting the prediction threshold for the class label. In this case study we focus on the former approach. The following pieces of code illustrate this.

#### 3.2) Coding a classification model that is trained on a dataset with high class imbalance:

In [6]:
categorical_columns=bank_data.select_dtypes(include=object).columns
categorical_indices=[]
for column in categorical_columns:
    categorical_indices.append(bank_data.columns.get_loc(column))
bank_data.iloc[:,categorical_indices].head()

from sklearn.preprocessing import LabelEncoder
encoder=LabelEncoder()
for column in categorical_indices:
    bank_data.iloc[:,column]=encoder.fit_transform(bank_data.iloc[:,column])


X=bank_data.loc[:,'age':'nr.employed']
Y=bank_data.loc[:,'y']

from sklearn.model_selection import train_test_split
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=7000,random_state=0)

from sklearn.tree import DecisionTreeClassifier as DTC
tree_clf=DTC()
tree_clf.fit(X_train,Y_train)
Y_pred=tree_clf.predict(X_test)

from sklearn.metrics import confusion_matrix
cm=confusion_matrix(Y_test,Y_pred)
print('CONFUSION_MATRIX\n',cm)

print('PERCENTAGE OF CORRECT PREDICTIONS ON THE ABUNDANT CLASS:',100*(cm[0,0]/(cm[0,0]+cm[0,1])))
print('PERCENTAGE OF CORRECT PREDICTIONS ON THE SCARCE CLASS:',100*(cm[1,1]/(cm[1,0]+cm[1,1])))

CONFUSION_MATRIX
 [[5831  396]
 [ 345  428]]
PERCENTAGE OF CORRECT PREDICTIONS ON THE ABUNDANT CLASS: 93.64059739842621
PERCENTAGE OF CORRECT PREDICTIONS ON THE SCARCE CLASS: 55.368693402328596


The output of the cell above shows the effect of training a classifier on a dataset with high class imbalance. The classifier predicts instances of class 'no' correctly 93.64% of the time but performs poorly on instances of class 'yes', with just 55.37% correct. In the upcoming piece of code we examine the effect of training a classifier on a dataset obtained by sampling an equal number of instances of either class.

#### 3.3) Coding a classification model trained on a dataset containing an equal number of instances of either class :

In [7]:
import pandas as pd
import numpy as np
bank_data=pd.read_csv('bank-additional-full.csv',sep=';')

categorical_columns=bank_data.select_dtypes(include=object).columns
categorical_indices=[]
for column in categorical_columns:
    categorical_indices.append(bank_data.columns.get_loc(column))
bank_data.iloc[:,categorical_indices].head()

from sklearn.preprocessing import LabelEncoder
encoder=LabelEncoder()
for column in categorical_indices:
    bank_data.iloc[:,column]=encoder.fit_transform(bank_data.iloc[:,column])
    
data_positive=bank_data[bank_data['y']==1].sample(n=3000,replace=False,random_state=0)
data_negative=bank_data[bank_data['y']==0].sample(n=3000,replace=False,random_state=0)

list1=[data_positive,data_negative]
training_data=pd.concat(list1)
training_data=training_data.reindex(np.random.permutation(training_data.index))
bank_data=bank_data.drop(training_data.index)

X_train=training_data.loc[:,'age':'nr.employed'].values
X_test=bank_data.loc[:,'age':'nr.employed'].values
Y_train=training_data.loc[:,'y'].values
Y_test=bank_data.loc[:,'y'].values

from sklearn.tree import DecisionTreeClassifier
tree_clf=DecisionTreeClassifier()
tree_clf.fit(X_train,Y_train)
Y_pred=tree_clf.predict(X_test)

from sklearn.metrics import confusion_matrix
cm=confusion_matrix(Y_test,Y_pred)
print('CONFUSION_MATRIX\n',cm)

print('PERCENTAGE OF CORRECT PREDICTIONS ON THE ABUNDANT CLASS:',100*(cm[0,0]/(cm[0,0]+cm[0,1])))
print('PERCENTAGE OF CORRECT PREDICTIONS ON THE SCARCE CLASS:',100*(cm[1,1]/(cm[1,0]+cm[1,1])))

CONFUSION_MATRIX
 [[28051  5497]
 [  284  1356]]
PERCENTAGE OF CORRECT PREDICTIONS ON THE ABUNDANT CLASS: 83.61452247525934
PERCENTAGE OF CORRECT PREDICTIONS ON THE SCARCE CLASS: 82.68292682926828


The output above shows that sampling instances uniformly across the class labels leads to a far more balanced performance. The accuracy on instances of the 'abundant' class has dropped from 93.64% to 83.61%, but the accuracy on instances of the 'scarce' class has risen from 55.37% to 82.68%. Overall accuracy does not improve: it falls from about 89.4% (6259 of 7000 test instances) to about 83.6% (29407 of 35188), because the abundant class dominates both test sets.
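
The figures quoted for the two experiments can be recomputed directly from the two confusion matrices. The sketch below uses a hypothetical helper, `per_class_report`, and assumes the matrix layout returned by scikit-learn's `confusion_matrix` (rows are true labels, columns are predicted labels). The balanced accuracy it reports is simply the mean of the two per-class percentages.

```python
def per_class_report(cm):
    """Summarise a 2x2 confusion matrix laid out as [[TN, FP], [FN, TP]].

    Returns per-class recall for class 0 and class 1, overall accuracy
    and balanced accuracy, all as percentages.
    """
    (tn, fp), (fn, tp) = cm
    recall_abundant = 100 * tn / (tn + fp)
    recall_scarce = 100 * tp / (fn + tp)
    accuracy = 100 * (tn + tp) / (tn + fp + fn + tp)
    balanced = (recall_abundant + recall_scarce) / 2
    return recall_abundant, recall_scarce, accuracy, balanced


# Confusion matrices copied from the two outputs above.
cm_imbalanced = [[5831, 396], [345, 428]]
cm_balanced = [[28051, 5497], [284, 1356]]

for name, cm in [("imbalanced training set", cm_imbalanced),
                 ("balanced training set", cm_balanced)]:
    abundant, scarce, accuracy, balanced = per_class_report(cm)
    print(f"{name}: abundant={abundant:.2f}% scarce={scarce:.2f}% "
          f"accuracy={accuracy:.2f}% balanced accuracy={balanced:.2f}%")
```

Balanced accuracy rises from roughly 74.5% to roughly 83.1% even though overall accuracy falls, which is the sense in which uniform sampling helps here.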

#### 4) DELVING INTO THE CODE :

####  4.1) Importing the 'Bank Marketing Dataset' :

In [8]:
import pandas as pd
import numpy as np
bank_data=pd.read_csv('bank-additional-full.csv',sep=';')

#### 4.2) 'LabelEncoding' the dataset :

Label encoding is the process of assigning numerical labels to the values of the categorical input features of the dataframe. It is performed in order to apply predictive models such as **Logistic Regression**, **Support Vector Machines** and **Naive Bayes** to datasets that contain categorical/non-numerical data. Label encoding is performed in two stages:

1) The categorical attributes are fetched from the main dataframe.

2) The values contained within the categorical attributes are assigned numerical labels.
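
As a small illustration of the second stage, the sketch below applies `LabelEncoder` to a handful of made-up 'marital' values. The list is illustrative and not taken from the dataset; the point is that the labels follow the sorted order of the distinct values.

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative values only; the real column has 41188 entries.
marital = ["married", "single", "divorced", "married", "unknown"]

encoder = LabelEncoder()
codes = encoder.fit_transform(marital)

# classes_ holds the distinct values in sorted order; a value's label is its position there.
print(list(encoder.classes_))
print(codes.tolist())
# inverse_transform maps the labels back to the original strings.
print(list(encoder.inverse_transform(codes)))
```

Because the notebook refits the same encoder object for every column, only the mapping of the last encoded column remains available afterwards. Keeping one encoder per column would be needed to decode the other columns later.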

#### 4.2.1) Fetching the categorical attributes:

In [9]:
categorical_columns=bank_data.select_dtypes(include=object).columns
categorical_indices=[]
for column in categorical_columns:
    categorical_indices.append(bank_data.columns.get_loc(column))
bank_data.iloc[:,categorical_indices].head()

Unnamed: 0,job,marital,education,default,housing,loan,contact,month,day_of_week,poutcome,y
0,housemaid,married,basic.4y,no,no,no,telephone,may,mon,nonexistent,no
1,services,married,high.school,unknown,no,no,telephone,may,mon,nonexistent,no
2,services,married,high.school,no,yes,no,telephone,may,mon,nonexistent,no
3,admin.,married,basic.6y,no,no,no,telephone,may,mon,nonexistent,no
4,services,married,high.school,no,no,yes,telephone,may,mon,nonexistent,no


#### 4.2.2) Assigning numerical labels / 'LabelEncoding' the categorical attributes:

In [10]:
from sklearn.preprocessing import LabelEncoder
encoder=LabelEncoder()
for column in categorical_indices:
    bank_data.iloc[:,column]=encoder.fit_transform(bank_data.iloc[:,column])
bank_data.head()

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,3,1,0,0,0,0,1,6,1,...,1,999,0,1,1.1,93.994,-36.4,4.857,5191.0,0
1,57,7,1,3,1,0,0,1,6,1,...,1,999,0,1,1.1,93.994,-36.4,4.857,5191.0,0
2,37,7,1,3,0,2,0,1,6,1,...,1,999,0,1,1.1,93.994,-36.4,4.857,5191.0,0
3,40,0,1,1,0,0,0,1,6,1,...,1,999,0,1,1.1,93.994,-36.4,4.857,5191.0,0
4,56,7,1,3,0,0,2,1,6,1,...,1,999,0,1,1.1,93.994,-36.4,4.857,5191.0,0


#### 4.3) OneHotEncoding the 'LabelEncoded' attributes:

Merely assigning numerical labels to categorical attributes is not enough to apply mathematical models to a dataset. The assigned numerical labels are not related to each other in an ordinal sense, yet a model would treat them as ordered quantities. We therefore use a technique called 'OneHotEncoding', which does the following:

A column representing a categorical attribute is split into multiple columns, one for each numerical label used to encode the values of that column. For example, the 'job' column of the dataframe contains 41188 values, which have been assigned numerical labels from 0 to 11, i.e. 12 integers. We now split the 'job' column into 12 columns, each representing one integer from 0 to 11.

If, for a particular observation (row index), the job is encoded with the label '3', the column that represents label '3' is assigned 1 and the rest of the new columns are assigned 0. The same holds for all the encoded categorical columns.

To sum up, 'OneHotEncoding' can be described as the process of assigning a binary sequence of a particular 'length' to each value contained within a 'LabelEncoded' attribute. The 'length' of the binary sequence is equal to the number of numerical labels used to represent the different values contained within a categorical column.
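
The binary sequences can be sketched with plain NumPy. The example below takes the 'job' labels of the first five rows of the encoded table shown earlier (3, 7, 7, 0, 7) and assumes 12 distinct labels, matching the 12 job categories listed under the input features. It only illustrates the idea and is not the encoder used in the notebook.

```python
import numpy as np

labels = np.array([3, 7, 7, 0, 7])   # label-encoded 'job' values of the first five rows
n_labels = 12                        # assumed number of distinct 'job' labels (0..11)

# Row i of the identity matrix is the binary sequence for label i.
one_hot = np.eye(n_labels, dtype=int)[labels]

print(one_hot.shape)
print(one_hot[0])   # a single 1 in the position of label 3
```

Each row contains exactly one 1, so the sequence identifies the label without implying any order between labels.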

**CAUTION!!! : WE MUST REFRAIN FROM 'OneHotEncoding' THE TARGET ATTRIBUTE**

#### 4.3.1) 'OneHotEncoding' the 'LabelEncoded' features :

**Sequence of steps:**

1) Import **OneHotEncoder** class from preprocessing module.

2) Instantiate an object of the **OneHotEncoder** class and feed it the categorical input indices that are to be hot encoded.

3) Apply **OneHotEncoding** to the input features.

4) Create a table containing the splits rendered to each input categorical feature.


In [12]:
from sklearn.preprocessing import OneHotEncoder
X=bank_data.loc[:,'age':'nr.employed']
hot_encoder=OneHotEncoder(categorical_features=[1,2,3,4,5,6,7,8,9,14])
X=hot_encoder.fit_transform(X).toarray()
Y=bank_data.loc[:,'y'].values
categorical_column_splitting=pd.DataFrame(data={'Categorical_index':[1,2,3,4,5,6,7,8,9,14],
                                               'Number_of_splits':hot_encoder.n_values_})
categorical_column_splitting

Unnamed: 0,Categorical_index,Number_of_splits
0,1,12
1,2,4
2,3,8
3,4,3
4,5,3
5,6,3
6,7,2
7,8,10
8,9,5
9,14,3
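
The `categorical_features` argument and the `n_values_` attribute used in the cell above belong to older scikit-learn releases; as far as I know, both were removed in version 0.22. On newer versions, a roughly equivalent result can be obtained with `ColumnTransformer`. The sketch below runs on a toy frame (its column values and the name `categorical_positions` are illustrative) rather than on the bank data.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for the label-encoded frame: two categorical columns, one numeric.
toy = pd.DataFrame({
    "job": [3, 7, 7, 0],
    "contact": [1, 1, 0, 1],
    "age": [56, 57, 37, 40],
})
categorical_positions = [0, 1]   # positional indices of the categorical columns

transformer = ColumnTransformer(
    transformers=[("onehot", OneHotEncoder(), categorical_positions)],
    remainder="passthrough",     # remaining columns are appended unchanged
    sparse_threshold=0,          # always return a dense array
)
X_toy = transformer.fit_transform(toy)

# Number of new columns created for each categorical column.
fitted = transformer.named_transformers_["onehot"]
splits = [len(categories) for categories in fitted.categories_]

print(splits)
print(X_toy.shape)
print(X_toy[0])
```

As in the original cell, the one-hot columns come first and the untouched numeric columns follow.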


#### 4.3.2) Creating a Dataframe of Hot Encoded features :

In [14]:
dataframe=pd.DataFrame(data=X)
dataframe[63]=Y
dataframe.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,54,55,56,57,58,59,60,61,62,63
0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,261.0,1.0,999.0,0.0,1.1,93.994,-36.4,4.857,5191.0,0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,149.0,1.0,999.0,0.0,1.1,93.994,-36.4,4.857,5191.0,0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,226.0,1.0,999.0,0.0,1.1,93.994,-36.4,4.857,5191.0,0
3,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,151.0,1.0,999.0,0.0,1.1,93.994,-36.4,4.857,5191.0,0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,307.0,1.0,999.0,0.0,1.1,93.994,-36.4,4.857,5191.0,0


#### 4.4) Extracting the training data and testing data from the 'OneHotEncoded' Dataframe:

#### 4.4.1) Sampling the training data :

Training data is the data fed to a predictive model so that it can learn to make predictions on observations it has never seen before. Since the dataset we are dealing with has high 'class-imbalance', we sample an equal number of instances of either class. The code follows this sequence of steps:

1) Choose 3000 samples of class '0' and save them in a dataframe named 'dataframe_negative'.

2) Choose 3000 samples of class '1' and save them in a dataframe named 'dataframe_positive'.

3) Concatenate the two dataframes to form a new dataframe named 'training_data'.

4) Shuffle the rows of 'training_data'.

5) Extract all the rows of the input feature columns (0 to 62) of 'training_data' as 'X_train'.

6) Extract all the rows of the target column (63) of 'training_data' as 'Y_train'.

In [15]:
dataframe_negative=dataframe[dataframe[63]==0].sample(n=3000,replace=False,random_state=0)
dataframe_positive=dataframe[dataframe[63]==1].sample(n=3000,replace=False,random_state=0)

list1=[dataframe_negative,dataframe_positive]
training_data=pd.concat(list1)

training_data=training_data.reindex(np.random.permutation(training_data.index))

X_train=training_data.iloc[:,0:63].values
Y_train=training_data.iloc[:,63].values
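
The same sampling scheme can be written more compactly with `DataFrameGroupBy.sample`, which, to my knowledge, is available from pandas 1.1 onwards. The sketch below uses a toy frame with made-up column names ('feature', 'target') and also shows how the rows that were not sampled form the test set.

```python
import pandas as pd

# Toy frame: 8 rows of class 0 and 4 rows of class 1.
toy = pd.DataFrame({"feature": range(12), "target": [0] * 8 + [1] * 4})

# Draw the same number of rows from each class, then shuffle the result.
train = (toy.groupby("target", group_keys=False)
            .sample(n=3, replace=False, random_state=0)
            .sample(frac=1, random_state=0))

# Every row that was not sampled goes to the test set.
test = toy.drop(train.index)

print(train["target"].value_counts().to_dict())
print(len(test))
```

Passing `random_state` to the shuffle as well makes the row order reproducible, which `np.random.permutation` without a seed does not.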

#### 4.4.2) Sampling the testing data :

Testing data is the data used to evaluate the performance of a predictive model once it has been trained. The steps involved in extracting the testing data are as follows:

1) From the main dataframe, remove all the observations that are in the training set.

2) Extract all the rows of the input feature columns of the remaining dataframe as 'X_test'.

3) Extract all the rows of the target column of the remaining dataframe as 'Y_test'.

In [16]:
dataframe=dataframe.drop(training_data.index)
X_test=dataframe.iloc[:,0:63].values
Y_test=dataframe.iloc[:,63].values

####  4.5) Standardizing the Data :

Standardization is the process of transforming the dataframe so that each column has a mean of 0 and a variance of 1. A column is standardized by replacing each of its values with its Z-score. The Z-score of a value is the number of standard deviations by which it lies above or below the column mean.
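
The Z-score computation can be sketched by hand with NumPy on a single made-up column. `StandardScaler` uses the population standard deviation (`ddof=0`), which is also NumPy's default, and the value named `unseen_value` is a hypothetical test observation.

```python
import numpy as np

# Made-up training column.
column = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean = column.mean()
std = column.std()                    # population standard deviation (ddof=0)
z_scores = (column - mean) / std

print(z_scores)
print(z_scores.mean(), z_scores.std())

# A test value is scaled with the mean and standard deviation of the training column.
unseen_value = 11.0
print((unseen_value - mean) / std)
```

Reusing the training statistics for unseen values corresponds to calling '.fit_transform()' on the training set and only '.transform()' on the testing set.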

Sequence of steps:

1) Import the 'StandardScaler' class from sklearn's 'preprocessing' module.

2) Instantiate an object of the 'StandardScaler' class.

3) Standardize the training set using the '.fit_transform()' method.

4) Standardize the testing set using the '.transform()' method.

In [17]:
from sklearn.preprocessing import StandardScaler
standardizer=StandardScaler()
X_train=standardizer.fit_transform(X_train)
X_test=standardizer.transform(X_test)

#### 4.6) Dimensionality Reduction using Principal Component Analysis (PCA) :

Dimensionality reduction can be understood in the following way. Any dataset containing numerical columns can be thought of as a multidimensional space whose dimensionality equals the number of columns in the dataset. The rows of the dataset, also known as observations, can be thought of as vectors pointing in different directions within that space. PCA determines mutually orthogonal unit vectors in that space such that the statistical variance of the projections of the observations is maximal along those vectors. The important point is that only a subset of these unit vectors, fewer than the original number of columns, is kept, chosen so that most of the variance is retained. That is why PCA is called a 'dimensionality-reduction' method.

The unit vectors determined by PCA are called 'Principal Components'. Each 'Principal Component' captures some proportion of the variation of the data in the dataset. Once the principal components are determined, the observations are projected onto them.

Sequence of Steps:

1) From the decomposition module of sklearn, import PCA.

2) Instantiate an object of PCA class and pass the proportion of total variance to be preserved.

3) Fit the object on the training data.

4) Apply the object to the testing data.

In [18]:
from sklearn.decomposition import PCA
pca_object=PCA(0.95)
X_train=pca_object.fit_transform(X_train)
X_test=pca_object.transform(X_test)
print('NUMBER OF PRINCIPAL COMPONENTS:',pca_object.n_components_)

NUMBER OF PRINCIPAL COMPONENTS: 41
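
To see how a fractional argument such as `PCA(0.95)` translates into a number of components, the sketch below fits PCA on synthetic data (the array `X_toy` and its column scales are made up) and compares the cumulative explained variance with the number of components scikit-learn keeps.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data: 200 rows, 6 columns with very different spreads.
X_toy = rng.normal(size=(200, 6)) * np.array([10.0, 5.0, 2.0, 1.0, 0.5, 0.1])

# Fit all components and accumulate the share of variance each one explains.
full = PCA(svd_solver="full").fit(X_toy)
cumulative = np.cumsum(full.explained_variance_ratio_)

# Smallest number of components whose cumulative share exceeds 95%.
k = int(np.searchsorted(cumulative, 0.95, side="right") + 1)

# Asking for 95% of the variance directly should keep the same number of components.
reduced = PCA(n_components=0.95, svd_solver="full").fit(X_toy)

print(cumulative.round(3))
print(k, reduced.n_components_)
```

In the notebook the same rule keeps 41 of the 63 standardized columns.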


#### 4.7) Fitting Models and making Predictions:

Now we fit various models on our training data and observe the performance of each model on the testing data.

In [19]:
from sklearn.metrics import confusion_matrix,accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier as DTC
from sklearn.neighbors import KNeighborsClassifier as KNC
from sklearn.ensemble import VotingClassifier
voting_clf=VotingClassifier(voting='hard',estimators=[('clf1',LogisticRegression()),('clf2',SVC()),('clf3',DTC()),('clf4',KNC())])

for clf in [LogisticRegression(),DTC(),SVC(),KNC(),voting_clf]:
    clf.fit(X_train,Y_train)
    Y_pred=clf.predict(X_test)
    cm=confusion_matrix(Y_test,Y_pred)
    print('PERFORMANCE ON',clf.__class__.__name__,'CLASSIFIER')
    print('ACCURACY_SCORE: ',100*accuracy_score(Y_test,Y_pred))
    print('CONFUSION_MATRIX : \n',confusion_matrix(Y_test,Y_pred))
    print('PERCENTAGE OF CORRECT PREDICTIONS OF THE ABUNDANT CLASS :',100*(cm[0,0]/(cm[0,0]+cm[0,1])))
    print('PERCENTAGE OF CORRECT PREDICTIONS OF THE SCARCE CLASS:',100*(cm[1,1]/(cm[1,0]+cm[1,1])))
    print('____________________________________________________')

PERFORMANCE ON LogisticRegression CLASSIFIER
ACCURACY_SCORE:  85.95543935432534
CONFUSION_MATRIX : 
 [[28819  4729]
 [  213  1427]]
PERCENTAGE OF CORRECT PREDICTIONS OF THE ABUNDANT CLASS : 85.90377965899606
PERCENTAGE OF CORRECT PREDICTIONS OF THE SCARCE CLASS: 87.01219512195122
____________________________________________________
PERFORMANCE ON DecisionTreeClassifier CLASSIFIER
ACCURACY_SCORE:  76.4521996135046
CONFUSION_MATRIX : 
 [[25666  7882]
 [  404  1236]]
PERCENTAGE OF CORRECT PREDICTIONS OF THE ABUNDANT CLASS : 76.50530583045189
PERCENTAGE OF CORRECT PREDICTIONS OF THE SCARCE CLASS: 75.36585365853658
____________________________________________________
PERFORMANCE ON SVC CLASSIFIER
ACCURACY_SCORE:  83.61089007616232
CONFUSION_MATRIX : 
 [[27938  5610]
 [  157  1483]]
PERCENTAGE OF CORRECT PREDICTIONS OF THE ABUNDANT CLASS : 83.27769166567307
PERCENTAGE OF CORRECT PREDICTIONS OF THE SCARCE CLASS: 90.42682926829269
____________________________________________________
PERFORMANCE ON K



#### 4.8) Outcome :

The results show that a classifier trained on a dataset created by sampling instances uniformly across the classes performs about equally well on instances of either class. A classifier trained on a dataset with high class imbalance, in contrast, does extremely well on instances of the abundant class but performs poorly on instances of the minority class.