# Homework 2 - Classification

In this exercise you will use scikit-learn, a popular machine learning package in Python, to train and tune a classifier. A particularly useful feature is that all classifiers (and linear models) are called using the same API, so it is easy to switch between different models (see the sklearn-intro notebook for examples). In this exercise we will therefore use one classification technique (logistic regression) that is representative of the methods and challenges you will encounter when using any classification method.


## Dataset
We will be using a bank marketing dataset. 
The dataset is associated with the direct marketing campaigns of a banking institution. Your job is to find the best strategies for improving the next marketing campaign: how can the bank make its future campaigns more effective? To answer this, we analyze the bank's last marketing campaign and identify patterns that help us draw conclusions and develop future strategies.

You have to predict whether a customer subscribes to a term deposit or not, using the following attributes: 

1 - age (numeric)<br>
2 - job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')<br>
3 - marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)<br>
4 - education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')<br>
5 - default: has credit in default? (categorical: 'no','yes','unknown')<br>
6 - balance: balance amount (numeric)<br>
7 - housing: has housing loan? (categorical: 'no','yes','unknown')<br>
8 - loan: has personal loan? (categorical: 'no','yes','unknown')<br>
9 - contact: contact communication type (categorical: 'cellular','telephone')<br>
10 - month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')<br>
11 - day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')<br>
12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)<br>
13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)<br>
14 - previous: number of contacts performed before this campaign and for this client (numeric)<br>
15 - poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')<br>

features_ex2.xlsx contains the features. It has 4521 records: the first 3165 observations form the training dataset, the next 678 form the cross validation dataset, and the final 678 form the test dataset.

label_ex2.xlsx contains the label: "yes" or "no". The first 3165 labels belong to the training dataset and the next 678 to the cross validation dataset. Labels for the test dataset are not provided, because in a real-world scenario you will not know the true values for your test set. 

In [1]:
import numpy as np
import pandas as pd
import warnings

In [2]:
warnings.filterwarnings("ignore")

In [3]:
X = pd.read_excel("features_ex2.xlsx")
#X.head(20)

In [4]:
#X.count()

In [5]:
y = pd.read_excel("label_ex2.xlsx")
#y.head(20)
#y.groupby('y').count()

In [6]:
y_num = y.replace('no',0)
y_num = y_num.replace('yes',1)
y_num['y'].value_counts()

0    3405
1     438
Name: y, dtype: int64
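The counts above show that the two classes are far from balanced. As a minimal sketch that uses only the printed counts (the variable names are illustrative and not part of the assignment), the following computes the share of positive labels and the accuracy of a trivial baseline that always predicts "no":

```python
# Class counts copied from the value_counts() output above
# (labeled records only: training + cross validation).
n_no, n_yes = 3405, 438
n_labeled = n_no + n_yes

# Share of positive ("yes") labels among the labeled records.
positive_rate = n_yes / n_labeled

# Accuracy of a trivial baseline that always predicts "no".
baseline_accuracy = n_no / n_labeled

print(f"labeled records:   {n_labeled}")
print(f"positive rate:     {positive_rate:.3f}")
print(f"baseline accuracy: {baseline_accuracy:.3f}")
```

Any accuracy reported later in the notebook is easier to interpret when compared against this baseline.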

In [7]:
categories = ['job','marital','education','default','housing','loan','contact','month','poutcome']
categorical = pd.get_dummies(X[categories])
continuous = X.drop(columns=categories)
X = pd.concat([continuous,categorical],axis=1)
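The encoding step above can be illustrated on a toy frame. This is a sketch with made-up rows, not the homework data; it shows that `pd.get_dummies` creates one indicator column per category level, named `<column>_<level>`, and that the numeric columns are re-attached with `pd.concat`:

```python
import pandas as pd

# Toy frame with one numeric and one categorical column (illustrative values).
toy = pd.DataFrame({
    "age": [30, 33, 35],
    "contact": ["cellular", "telephone", "cellular"],
})

# One indicator column per category level, named <column>_<level>.
dummies = pd.get_dummies(toy[["contact"]])

# Re-attach the numeric column, mirroring the concat step above.
encoded = pd.concat([toy.drop(columns=["contact"]), dummies], axis=1)

print(list(encoded.columns))
```

Encoding the full feature table before it is split means the training, cross validation, and test sets all share the same indicator columns.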

In [8]:
X.head()

Unnamed: 0,age,balance,day,campaign,pdays,previous,job_admin.,job_blue-collar,job_entrepreneur,job_housemaid,...,month_jun,month_mar,month_may,month_nov,month_oct,month_sep,poutcome_failure,poutcome_other,poutcome_success,poutcome_unknown
0,30,1787,19,1,-1,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,1
1,33,4789,11,1,339,4,0,0,0,0,...,0,0,1,0,0,0,1,0,0,0
2,35,1350,16,1,330,1,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
3,30,1476,3,4,-1,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,1
4,59,0,5,1,-1,0,0,1,0,0,...,0,0,1,0,0,0,0,0,0,1


In [9]:
#splitting data into train, cv and test set (70:15:15 ratio)
X_train = X.iloc[0:3165,:].copy()  # .copy() so later column assignments do not write into a slice of X
y_train = y_num.iloc[0:3165,:]
X_cv = X.iloc[3165:3843,:].copy()
y_cv = y_num.iloc[3165:3843,:]
X_test = X.iloc[3843:4521,:].copy()

In [10]:
len(X_test)

678

In [11]:
y_train['y'].value_counts()

0    2799
1     366
Name: y, dtype: int64

In [12]:
print("X_train "+ str(X_train.shape))
print("y_train "+ str(y_train.shape))
print("X_cv "+ str(X_cv.shape))
print("y_cv "+ str(y_cv.shape))
print("X_test "+ str(X_test.shape))

X_train (3165, 50)
y_train (3165, 1)
X_cv (678, 50)
y_cv (678, 1)
X_test (678, 50)


## Standardization

As discussed in the previous exercise, standardization is important when features with different scales are involved. 

Q. Use StandardScaler from sklearn.preprocessing to standardize the continuous features. 


In [13]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

continuous_variables = ['age', 'balance', 'day', 'campaign', 'pdays', 'previous']

# Use the above list to replace the continuous columns in X_train to scaled columns. Use fit_transform method.
X_train[continuous_variables] = scaler.fit_transform(X_train[continuous_variables])

In [14]:
# Similarly, use the above list to replace the continuous columns in X_cv and X_test with scaled columns.
# Use the transform method (not fit_transform), so the statistics learned from X_train are reused.
### WRITE CODE HERE
X_cv[continuous_variables] = scaler.transform(X_cv[continuous_variables])
X_test[continuous_variables] = scaler.transform(X_test[continuous_variables])
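The fit-on-train, transform-everywhere pattern used above can be reproduced by hand. The sketch below uses a few illustrative (age, balance) rows rather than the real splits, and assumes the usual definition z = (x - mean) / std with the population standard deviation (ddof=0), which is what StandardScaler computes:

```python
import numpy as np

# Illustrative (age, balance) rows; not the actual training and cv splits.
train = np.array([[30.0, 1787.0], [33.0, 4789.0], [35.0, 1350.0], [59.0, 0.0]])
cv = np.array([[40.0, 2000.0]])

# "fit": learn per-column mean and standard deviation from the training rows only.
mean = train.mean(axis=0)
std = train.std(axis=0)  # population std (ddof=0)

# "transform": apply the training statistics to any split.
train_scaled = (train - mean) / std
cv_scaled = (cv - mean) / std

print(train_scaled.mean(axis=0))  # close to 0 for every column
print(train_scaled.std(axis=0))   # close to 1 for every column
print(cv_scaled)                  # scaled with training statistics, so not exactly centered
```

Reusing the training statistics for the cv and test sets keeps information from those sets out of the preprocessing step.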

## Classification

As previously mentioned, the scikit-learn classification API makes it easy to train a classifier. 


Q. Use LogisticRegression from sklearn.linear_model to make a logistic regression classifier.

In [15]:
from sklearn.linear_model import LogisticRegression

In [16]:
# First, initialize the classifier with default parameters

# then fit the classifier on training data and labels

### WRITE CODE HERE
#logR = LogisticRegression()
clf =  LogisticRegression().fit(X_train,y_train)

In [17]:
# predict the output for cross validation dataset

### WRITE CODE HERE
y_cv_hat = clf.predict(X_cv)
prob = clf.predict_proba(X_cv)


Implement precision(), recall(), and accuracy() in classification_utils.py (the module imported in the next cell), and use them below.

In [18]:
from classification_utils import accuracy, precision, recall
from sklearn.metrics import confusion_matrix

# Using the predictions to calculate accuracy, precision, recall

### WRITE CODE HERE

acc = accuracy(y_cv,y_cv_hat)
print(acc)

#tn, fp, fn, tp = confusion_matrix(y_cv, y_cv_hat).ravel()
#print(tn, fp, fn, tp)
#(tp+tn)/(tp+tn+fp+fn)

prec = precision(y_cv,y_cv_hat)
print(prec)

rec = recall(y_cv,y_cv_hat)  # named rec so the imported recall() function is not overwritten
print(rec)

0.8908554572271387
0.45
0.125


Q. Accuracy<br>
Ans - 0.8908554572271387

Q. Precision<br>
Ans - 0.45

Q. Recall<br>
Ans - 0.125

Q. Which metric (accuracy, precision, recall) is more appropriate and in what cases? Will there be scenarios where it is better to use precision than accuracy? Explain. <br>
Ans - Accuracy measures the fraction of samples that are predicted correctly. For instance, if the classifier is 89% accurate, it predicts the correct class for 89 out of 100 instances. Accuracy by itself, however, is not a good measure for evaluating a classifier when the classes are imbalanced, as they are here: a classifier that always predicts the majority class would still score a high accuracy. <br>
Precision measures exactness: of the samples predicted positive, how many are actually positive. It is the better choice when the cost of a false positive is high, e.g. spam filtering, where flagging a legitimate email as spam is costly. So yes, precision can be preferable to accuracy in such scenarios. <br> 
Recall, or sensitivity, measures completeness: of the actual positives, how many the model labels as positive. It is the better choice when the cost of a false negative is high, e.g. cancer screening, where a missed case is costly. 
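The three definitions can be written directly in terms of confusion-matrix counts. The sketch below is not the content of the homework's utility module; the `*_from_counts` helpers are hypothetical names. The counts are inferred from the values printed earlier: the cv set holds 438 - 366 = 72 positives, so a recall of 0.125 implies 9 true positives, and a precision of 0.45 then implies 20 predicted positives.

```python
def accuracy_from_counts(tp, fp, fn, tn):
    """Fraction of all samples that are classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)


def precision_from_counts(tp, fp):
    """Of the samples predicted positive, the fraction that is truly positive."""
    return tp / (tp + fp)


def recall_from_counts(tp, fn):
    """Of the truly positive samples, the fraction that is predicted positive."""
    return tp / (tp + fn)


# Counts inferred from the printed metrics and label counts (678 cv records).
tp, fp, fn, tn = 9, 11, 63, 595

print(accuracy_from_counts(tp, fp, fn, tn))
print(precision_from_counts(tp, fp))
print(recall_from_counts(tp, fn))
```

With these counts the model finds only 9 of the 72 subscribers in the cv set, even though its accuracy is close to 0.89.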


Q. Which metric is suitable in this case? <br>
Ans - Accuracy should be used in this case for banking system

### ROC curve

Q. Use roc_curve from sklearn.metrics and use matplotlib.pyplot to plot the ROC curve. Use the cv set to make predictions.

In [19]:
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
# calculate the fpr and tpr for all thresholds of the classification

### WRITE CODE HERE
#y_num = y.replace('no',0)
#y_num = y_num.replace('yes',1)

fpr, tpr, thresholds = roc_curve(y_cv, prob[:,1])
#print(thresholds.shape)
#print(thresholds,fpr,tpr)

auc = roc_auc_score(y_cv, prob[:,1])
print('AUC: %.3f' % auc)


import matplotlib.pyplot as plt
# Plot the ROC curve by giving appropriate names for title and axes. 

### WRITE CODE HERE
plt.plot(fpr, tpr, 'b',label='ROC curve (auc area = %0.2f)' % auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.title('Receiver operating characteristic')
plt.show()

AUC: 0.778


<Figure size 640x480 with 1 Axes>

Q. What is the AUC obtained?<br>
Ans: 0.778
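One way to read the AUC is as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes that quantity by brute force on four toy scores; `auc_by_pairs` is a hypothetical helper written for illustration, not a scikit-learn function:

```python
def auc_by_pairs(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy labels and predicted probabilities for the positive class.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

print(auc_by_pairs(labels, scores))  # 3 of the 4 pairs are ranked correctly -> 0.75
```

The double loop is quadratic in the number of samples, so it is only meant to make the definition concrete.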

In [20]:
#calculation of AUC

## Hyperparameters

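"Model tuning" refers to adjusting a model's hyperparameters, settings chosen before training, so that the fitted model generalizes better. This is separate from "fitting" or "training" the model, which estimates the model's parameters (here, the logistic regression coefficients) from the training data with an algorithm that is specific to each classifier (e.g. logistic regression or random forest). In LogisticRegression, `C` is the inverse of the regularization strength (smaller values mean stronger regularization) and `penalty` selects the norm used for regularization ('l1' or 'l2'). 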





Build a model with hyperparameter 'C' set to 0.1 and penalty set to 'l1'. Make predictions on cross validation set and compute accuracy, precision and recall. 
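For reference, the roles of `C` and `penalty` can be written out as the objective that regularized logistic regression minimizes. This is a sketch that assumes the formulation given in the scikit-learn user guide, with labels recoded as $y_i \in \{-1, +1\}$, weights $w$ and intercept $b$; the exact scaling of the terms can differ between library versions.

```latex
\min_{w,\,b}\; r(w) + C \sum_{i=1}^{n} \log\!\left(1 + \exp\!\left(-y_i\,(x_i^{\top} w + b)\right)\right),
\qquad
r(w) =
\begin{cases}
\lVert w \rVert_1 & \text{if penalty = 'l1'} \\[4pt]
\tfrac{1}{2}\,\lVert w \rVert_2^2 & \text{if penalty = 'l2'}
\end{cases}
```

A smaller `C` gives the data-fit term less weight relative to $r(w)$, so the coefficients are shrunk more strongly; the 'l1' penalty can additionally drive some coefficients to exactly zero.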


In [21]:
### WRITE CODE HERE
clf1 =  LogisticRegression(C=0.1,penalty='l1',solver='liblinear').fit(X_train,y_train)  # 'l1' needs a solver that supports it, e.g. liblinear
y_cv_hat1 = clf1.predict(X_cv)

from classification_utils import accuracy, precision, recall

# Using the predictions to calculate accuracy, precision, recall

accuracy_1 = accuracy(y_cv,y_cv_hat1)
print(accuracy_1)

precision_1 = precision(y_cv,y_cv_hat1)
print(precision_1)

recall_1=recall(y_cv,y_cv_hat1)
print(recall_1)

f1 = (2*precision_1*recall_1)/(precision_1+recall_1)
print(f1)
#tn, fp, fn, tp = confusion_matrix(y_cv, y_cv_hat1).ravel()
#print(tn, fp, fn, tp)

0.8982300884955752
0.6
0.125
0.20689655172413793


Build a model with hyperparameter 'C' set to 0.5 and penalty set to 'l1'. Make predictions on cross validation set and compute accuracy, precision and recall. 


In [22]:
### WRITE CODE HERE
clf2 =  LogisticRegression(C=0.5,penalty='l1',solver='liblinear').fit(X_train,y_train)  # 'l1' needs a solver that supports it, e.g. liblinear
y_cv_hat2 = clf2.predict(X_cv)

from classification_utils import accuracy, precision, recall

# Using the predictions to calculate accuracy, precision, recall

accuracy_2 = accuracy(y_cv,y_cv_hat2)
print(accuracy_2)

precision_2 = precision(y_cv,y_cv_hat2)
print(precision_2)

recall_2=recall(y_cv,y_cv_hat2)
print(recall_2)
f2 = (2*precision_2*recall_2)/(precision_2+recall_2)
print(f2)
#tn, fp, fn, tp = confusion_matrix(y_cv, y_cv_hat2).ravel()


0.8938053097345132
0.5
0.1388888888888889
0.2173913043478261


Build a model with hyperparameter 'C' set to 0.1 and penalty set to 'l2'. Make predictions on cross validation set and compute accuracy, precision and recall. 


In [23]:
### WRITE CODE HERE
clf3 =  LogisticRegression(C=0.1,penalty='l2').fit(X_train,y_train)
y_cv_hat3 = clf3.predict(X_cv)

from classification_utils import accuracy, precision, recall

# Using the predictions to calculate accuracy, precision, recall

accuracy_3 = accuracy(y_cv,y_cv_hat3)
print(accuracy_3)

precision_3 = precision(y_cv,y_cv_hat3)
print(precision_3)

recall_3=recall(y_cv,y_cv_hat3)
print(recall_3)
f3 = (2*precision_3*recall_3)/(precision_3+recall_3)
print(f3)
#tn, fp, fn, tp = confusion_matrix(y_cv, y_cv_hat3).ravel()


0.8982300884955752
0.6
0.125
0.20689655172413793


Build a model with hyperparameter 'C' set to 0.5 and penalty set to 'l2'. Make predictions on cross validation set and compute accuracy, precision and recall. 

In [24]:
### WRITE CODE HERE
clf4 =  LogisticRegression(C=0.5,penalty='l2').fit(X_train,y_train)
y_cv_hat4 = clf4.predict(X_cv)

from classification_utils import accuracy, precision, recall

# Using the predictions to calculate accuracy, precision, recall

accuracy_4 = accuracy(y_cv,y_cv_hat4)
print(accuracy_4)

precision_4 = precision(y_cv,y_cv_hat4)
print(precision_4)

recall_4=recall(y_cv,y_cv_hat4)
print(recall_4)
f4 = (2*precision_4*recall_4)/(precision_4+recall_4)
print(f4)
#tn, fp, fn, tp = confusion_matrix(y_cv, y_cv_hat4).ravel()
#print(tn, fp, fn, tp)

0.8923303834808259
0.47368421052631576
0.125
0.19780219780219782
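The four runs above are easier to compare side by side. The sketch below copies the printed accuracy, precision, and recall values into a dictionary, recomputes F1 as the harmonic mean of precision and recall, and prints one row per configuration; `f1_score_from` is a hypothetical helper name:

```python
# (C, penalty) -> (accuracy, precision, recall), copied from the four outputs above.
results = {
    (0.1, "l1"): (0.8982300884955752, 0.6, 0.125),
    (0.5, "l1"): (0.8938053097345132, 0.5, 0.1388888888888889),
    (0.1, "l2"): (0.8982300884955752, 0.6, 0.125),
    (0.5, "l2"): (0.8923303834808259, 0.47368421052631576, 0.125),
}


def f1_score_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


print(f"{'C':>4} {'penalty':>7} {'acc':>6} {'prec':>6} {'rec':>6} {'F1':>6}")
for (c, penalty), (acc, prec, rec) in results.items():
    f1 = f1_score_from(prec, rec)
    print(f"{c:>4} {penalty:>7} {acc:>6.3f} {prec:>6.3f} {rec:>6.3f} {f1:>6.3f}")

# Configuration with the highest F1; other metrics may rank the runs differently.
best_by_f1 = max(results, key=lambda k: f1_score_from(results[k][1], results[k][2]))
print("highest F1:", best_by_f1)
```

Which configuration counts as best depends on the metric: the two C=0.1 runs tie on accuracy and precision, while the ranking by F1 differs.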


Q. Which of the above models is better? <br>
Ans- 

## Test set

You have worked on the training and cv datasets so far, but the test data does not include the labels. Choose the best hyperparameter values as seen in the previous section and build a model. Use this model to make predictions on the test set. You will submit a csv file containing your predictions, named predictions.csv.


In [25]:
##########################################
### Construct your final logistic regression using the best hyperparameters obtained above(C and penalty) ###
final_model = LogisticRegression(C=0.1,penalty='l2')
final_model.fit(X_train, y_train)
predicted = final_model.predict(X_test)
predicted_df = pd.DataFrame({'y':predicted})

predicted_df = predicted_df.replace(0,'no')
predicted_df = predicted_df.replace(1,'yes')
print(predicted_df)
predicted_df.to_csv('predictions.csv', index=False)  # file name requested in the instructions above

### save into csv with column heading as "y"

      y
0    no
1    no
2    no
3    no
4    no
..   ..
673  no
674  no
675  no
676  no
677  no

[678 rows x 1 columns]


In [26]:
#end 