## Dataset
We will use a bank marketing dataset.
The dataset comes from the direct marketing campaigns of a banking institution. Our goal is to make the bank's next marketing campaign more effective. To do that, we analyze the last campaign the bank ran and look for patterns that can inform future strategies.

We have to predict whether a customer subscribes to a term deposit, using the following attributes:
1 - age (numeric)<br>
2 - job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')<br>
3 - marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)<br>
4 - education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')<br>
5 - default: has credit in default? (categorical: 'no','yes','unknown')<br>
6 - balance: balance amount (numeric)<br>
7 - housing: has housing loan? (categorical: 'no','yes','unknown')<br>
8 - loan: has personal loan? (categorical: 'no','yes','unknown')<br>
9 - contact: contact communication type (categorical: 'cellular','telephone')<br>
10 - month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')<br>
11 - day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')<br>
12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)<br>
13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)<br>
14 - previous: number of contacts performed before this campaign and for this client (numeric)<br>
15 - poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')<br>

features_ex2.xlsx contains the features for 4521 records. The first 3165 observations form the training set, the next 678 the cross-validation set, and the final 678 the test set.

label_ex2.xlsx contains the label: "yes" or "no". The first 3165 labels correspond to the training set and the next 678 to the cross-validation set.

In [1]:
import numpy as np
import pandas as pd

In [2]:
X = pd.read_excel("features_ex2.xlsx")
X.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,campaign,pdays,previous,poutcome
0,30,unemployed,married,primary,no,1787,no,no,cellular,19,oct,1,-1,0,unknown
1,33,services,married,secondary,no,4789,yes,yes,cellular,11,may,1,339,4,failure
2,35,management,single,tertiary,no,1350,yes,no,cellular,16,apr,1,330,1,failure
3,30,management,married,tertiary,no,1476,yes,yes,unknown,3,jun,4,-1,0,unknown
4,59,blue-collar,married,secondary,no,0,yes,no,unknown,5,may,1,-1,0,unknown


In [3]:
y = pd.read_excel("label_ex2.xlsx")
y.head()

Unnamed: 0,y
0,no
1,no
2,no
3,no
4,no


In [4]:
categories = ['job','marital','education','default','housing','loan','contact','month','poutcome']
categorical = pd.get_dummies(X[categories])
continuous = X.drop(columns=categories)
X = pd.concat([continuous,categorical],axis=1)
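
The cell above one-hot encodes the categorical columns. As a minimal sketch of what `pd.get_dummies` does, the toy frame below (illustrative values, not rows from the dataset) shows each category level becoming its own indicator column named `<column>_<level>`:

```python
import pandas as pd

# Toy frame with one numeric and one categorical column (illustrative values only).
toy = pd.DataFrame({
    "age": [30, 33, 35],
    "marital": ["married", "single", "married"],
})

# One indicator column per category level, named "<column>_<level>".
dummies = pd.get_dummies(toy[["marital"]])

# Keep the numeric column and append the indicator columns, as in the cell above.
encoded = pd.concat([toy.drop(columns=["marital"]), dummies], axis=1)
print(encoded.columns.tolist())
```

The same pattern applied to all nine categorical columns is what widens the feature matrix before the split.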

In [5]:
#splitting data into train, cv and test set (70:15:15 ratio)
X_train = X.iloc[0:3165,:]
y_train = y.iloc[0:3165,:]
X_cv = X.iloc[3165:3843,:]
y_cv = y.iloc[3165:3843,:]
X_test = X.iloc[3843:4521,:]

In [6]:
# Change "yes" and "no" to 1 and 0.
# y_train and y_cv are slices of y, so build new frames instead of modifying them
# in place; this avoids pandas' SettingWithCopyWarning.
mapping = {"yes":1, "no":0}
y_cv = y_cv.assign(y=y_cv["y"].map(mapping))
y_train = y_train.assign(y=y_train["y"].map(mapping))


In [7]:
print("X_train "+ str(X_train.shape))
print("y_train "+ str(y_train.shape))
print("X_cv "+ str(X_cv.shape))
print("y_cv "+ str(y_cv.shape))
print("X_test "+ str(X_test.shape))

X_train (3165, 50)
y_train (3165, 1)
X_cv (678, 50)
y_cv (678, 1)
X_test (678, 50)


## Standardization

As discussed in the previous exercise, standardization is important when features with different scales are involved.

Q. Use StandardScaler from sklearn.preprocessing to standardize the continuous features. 


In [8]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

continuous_variables = ['age', 'balance', 'day', 'campaign', 'pdays', 'previous']

# X_train is a slice of X, so take an explicit copy before assigning to it;
# this avoids pandas' SettingWithCopyWarning.
X_train = X_train.copy()
# Fit the scaler on the training data only and overwrite the continuous columns.
X_train[continuous_variables] = scaler.fit_transform(X_train[continuous_variables])


In [9]:
# Restart the row index at 0; drop=True discards the old index instead of keeping it as a column.
X_cv = X_cv.reset_index(drop=True)
X_test = X_test.reset_index(drop=True)

In [10]:
# Similarly, replace the continuous columns in X_cv and X_test with scaled columns.
# Use transform (not fit_transform) so the training-set mean and standard deviation are reused.
X_cv[continuous_variables] = scaler.transform(X_cv[continuous_variables])
X_test[continuous_variables] = scaler.transform(X_test[continuous_variables])
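
To make the fit-on-train, transform-on-the-rest pattern concrete, here is a small sketch on made-up numbers (the arrays are illustrative assumptions, not dataset values). It checks that `StandardScaler.transform` applies the training mean and standard deviation to held-out rows:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative data: two features on very different scales.
train = np.array([[20.0, 100.0], [30.0, 300.0], [40.0, 500.0]])
held_out = np.array([[50.0, 700.0]])

# Learn mean and standard deviation from the training rows only.
scaler = StandardScaler().fit(train)
train_scaled = scaler.transform(train)
held_out_scaled = scaler.transform(held_out)

# Manual z-score using the training statistics.
mean = train.mean(axis=0)
std = train.std(axis=0)  # population std (ddof=0), which StandardScaler uses
manual = (held_out - mean) / std

print(np.allclose(held_out_scaled, manual))
```

Held-out rows are scaled with statistics they did not contribute to, so no information from the cross-validation or test sets leaks into preprocessing.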

## Classification

As previously mentioned, the scikit-learn classification API makes it easy to train a classifier. 


Q. Use LogisticRegression from sklearn.linear_model to make a logistic regression classifier.

In [11]:
from sklearn.linear_model import LogisticRegression

In [12]:
# Initializing the classifier with default parameters and fitting the classifier on training data and labels

logreg = LogisticRegression(solver = 'liblinear')
logreg.fit(X_train, np.ravel(y_train))

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

In [13]:
# predict the output for cross validation dataset
y_new = logreg.predict(X_cv)

In [14]:
#Implementation of accuracy, precision, recall
from classification_utils import accuracy, precision, recall

acc = accuracy(y_cv.values, y_new)
print('Accuracy:', acc)

pre = precision(y_cv.values, y_new)
print('Precision: ', pre)

rec = recall(y_cv.values, y_new)
print('Recall: ', rec)


Accuracy: 0.8908554572271387
Precision:  0.45
Recall:  0.125


Accuracy is the fraction of predictions the model gets right. But accuracy alone isn't always sufficient to differentiate one model from another, especially when one class is much more common than the other. That is when we use precision and recall. When the cost of a false positive is high, precision is a good measure, whereas when the cost of a false negative is high, recall is a good measure for selecting the best model. For example, in email spam detection, a false positive means that a non-spam email (actual negative) has been identified as spam (predicted positive). The email user might lose important emails if the precision of the spam detection model is not high.

We should use precision in this case. The prediction is whether a customer subscribes to a term deposit or not. A false positive here means money spent on a customer who does not actually subscribe. Precision, the share of predicted subscribers who really subscribe, is therefore the measurement we should use.
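
The scores printed above can be reproduced from confusion-matrix counts. The counts below are not printed anywhere in the notebook; they are inferred from the reported accuracy, precision, and recall on the 678 cross-validation rows, so treat them as an assumption:

```python
# Confusion-matrix counts inferred from the reported metrics (an assumption,
# not values printed by the notebook).
tp, fp, fn, tn = 9, 11, 63, 595

total = tp + fp + fn + tn
accuracy = (tp + tn) / total      # share of all predictions that are correct
precision = tp / (tp + fp)        # share of predicted "yes" that are truly "yes"
recall = tp / (tp + fn)           # share of true "yes" that the model finds

# Accuracy of a trivial model that always predicts "no".
baseline = (tn + fp) / total

print(total, accuracy, precision, recall, baseline)
```

Under these counts, always predicting "no" would score slightly higher accuracy than the fitted model, which is why accuracy alone is a weak guide on imbalanced labels.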

### ROC curve

In [15]:
from sklearn.metrics import roc_curve
# calculate the fpr and tpr for all thresholds of the classification

fpr, tpr, thresholds = roc_curve(y_cv.values, logreg.predict_proba(X_cv)[:,1], pos_label =1)

import matplotlib.pyplot as plt
# Plot the ROC curve
plt.title('ROC Curve')
plt.plot(fpr, tpr)                    
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.xlim([0,1])
plt.ylim([0,1])
plt.plot([0, 1], [0, 1], 'y--')
plt.grid(True)
plt.show()

from sklearn.metrics import roc_auc_score
print('AUC:', roc_auc_score(y_cv.values,logreg.predict_proba(X_cv)[:,1]))

<Figure size 640x480 with 1 Axes>

AUC: 0.7778236156949029
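
One way to read the AUC: it is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below checks this on four made-up scores (illustrative values, not model output) by comparing `roc_auc_score` with an explicit pairwise count:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Illustrative labels and scores.
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

# ROC points for every threshold, and the area under that curve.
fpr, tpr, thresholds = roc_curve(y_true, scores)
auc = roc_auc_score(y_true, scores)

# Pairwise view: fraction of (positive, negative) pairs ranked correctly,
# counting ties as half.
pos = scores[y_true == 1]
neg = scores[y_true == 0]
pairwise = np.mean([float(p > n) + 0.5 * float(p == n) for p in pos for n in neg])

print(auc, pairwise)
```

Three of the four positive-negative pairs are ordered correctly here, so both computations give 0.75.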


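## Hyperparameters

"Model tuning" refers to adjusting a model's hyperparameters, settings chosen before training such as 'C' and 'penalty' for logistic regression, so that the model fits the data better. This is separate from "fitting" or "training" the model, which learns the model's parameters from the training data. The outcome of fitting is governed by the amount and quality of your training data and by the fitting algorithm, which is specific to each classifier (e.g. logistic regression or random forest). In scikit-learn's LogisticRegression, 'C' is the inverse of the regularization strength, so smaller values regularize more strongly.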

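
As a rough illustration of how the penalty type changes the fitted model, the sketch below trains two classifiers on synthetic data (generated with `make_classification`; the data and the helper `count_zero_coefficients` are assumptions for illustration, not part of the exercise) and counts how many coefficients end up exactly zero:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data: 20 features, only 3 of which carry signal.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=3, n_redundant=0, random_state=0
)

def count_zero_coefficients(penalty, C):
    """Fit a logistic regression and count coefficients that are exactly zero."""
    model = LogisticRegression(C=C, penalty=penalty, solver="liblinear")
    model.fit(X_demo, y_demo)
    return int(np.sum(model.coef_ == 0))

# Same regularization strength, different penalty type.
n_zero_l1 = count_zero_coefficients("l1", 0.05)
n_zero_l2 = count_zero_coefficients("l2", 0.05)

print("zero coefficients with l1:", n_zero_l1)
print("zero coefficients with l2:", n_zero_l2)
```

An 'l1' penalty tends to drive uninformative coefficients to exactly zero, while 'l2' shrinks them without eliminating them, which is one reason the two penalties can score differently at the same 'C'.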




Case 1: Build a model with hyperparameter 'C' set to 0.1 and penalty set to 'l1'. Make predictions on cross validation set and compute accuracy, precision and recall. 


In [16]:
logreg1 = LogisticRegression(C = 0.1, penalty = 'l1', solver = 'liblinear')
logreg1.fit(X_train, np.ravel(y_train))

y_new1 = logreg1.predict(X_cv)

acc1 = accuracy(y_cv.values, y_new1)
print('Accuracy:', acc1)

pre1 = precision(y_cv.values, y_new1)
print('precision:', pre1)

rec1 = recall(y_cv.values, y_new1)
print('Recall:', rec1)

Accuracy: 0.8982300884955752
precision: 0.6
Recall: 0.125


Case 2: Build a model with hyperparameter 'C' set to 0.5 and penalty set to 'l1'. Make predictions on cross validation set and compute accuracy, precision and recall. 


In [17]:
logreg2 = LogisticRegression(C = 0.5, penalty = 'l1', solver = 'liblinear')
logreg2.fit(X_train, np.ravel(y_train))

y_new2 = logreg2.predict(X_cv)

acc2 = accuracy(y_cv.values, y_new2)
print('Accuracy:', acc2)

pre2 = precision(y_cv.values, y_new2)
print('precision:', pre2)

rec2 = recall(y_cv.values, y_new2)
print('Recall:', rec2)

Accuracy: 0.8938053097345132
precision: 0.5
Recall: 0.1388888888888889


Case 3: Build a model with hyperparameter 'C' set to 0.1 and penalty set to 'l2'. Make predictions on cross validation set and compute accuracy, precision and recall. 


In [18]:
logreg3 = LogisticRegression(C = 0.1, penalty = 'l2', solver = 'liblinear')
logreg3.fit(X_train, np.ravel(y_train))

y_new3 = logreg3.predict(X_cv)

acc3 = accuracy(y_cv.values, y_new3)
print('Accuracy:', acc3)

pre3 = precision(y_cv.values, y_new3)
print('precision:', pre3)

rec3 = recall(y_cv.values, y_new3)
print('Recall:', rec3)

Accuracy: 0.8982300884955752
precision: 0.6
Recall: 0.125


Case 4: Build a model with hyperparameter 'C' set to 0.5 and penalty set to 'l2'. Make predictions on cross validation set and compute accuracy, precision and recall. 

In [19]:
logreg4 = LogisticRegression(C = 0.5, penalty = 'l2', solver = 'liblinear')
logreg4.fit(X_train, np.ravel(y_train))

y_new4 = logreg4.predict(X_cv)

acc4 = accuracy(y_cv.values, y_new4)
print('Accuracy:', acc4)

pre4 = precision(y_cv.values, y_new4)
print('precision:', pre4)

rec4 = recall(y_cv.values, y_new4)
print('Recall:', rec4)

Accuracy: 0.8923303834808259
precision: 0.47368421052631576
Recall: 0.125


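Cases 1 and 3 ('C' set to 0.1 with penalty 'l1' and 'l2' respectively) tie: both reach the highest precision (0.6) with identical accuracy and recall. We go with the third one, with hyperparameter 'C' set to 0.1 and penalty set to 'l2'.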
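
The four cases above repeat the same steps with different settings. A compact way to express that comparison is a loop over the grid, sketched here on synthetic data with scikit-learn's `precision_score` standing in for the `precision` helper (the data, the variable names, and the use of `precision_score` are assumptions for illustration):

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the training and cross-validation sets.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, weights=[0.8, 0.2], random_state=0
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

# Evaluate every (C, penalty) combination on the validation split.
results = {}
for C, penalty in product([0.1, 0.5], ["l1", "l2"]):
    model = LogisticRegression(C=C, penalty=penalty, solver="liblinear")
    model.fit(X_tr, y_tr)
    results[(C, penalty)] = precision_score(
        y_val, model.predict(X_val), zero_division=0
    )

# Pick the combination with the highest validation precision.
best = max(results, key=results.get)
print(results)
print("best (C, penalty):", best)
```

Note that `max` returns the first key among equal scores, so a tie like the one between cases 1 and 3 would need a secondary criterion (for example recall) to be broken deliberately.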

## Test set

In [20]:
final_model = LogisticRegression(C = 0.1, penalty = 'l2', solver = 'liblinear')
final_model.fit(X_train, np.ravel(y_train))
predicted = pd.DataFrame(final_model.predict(X_test), columns = ['y'])

back_map = {1:"yes", 0:"no"}
predicted.replace(back_map, inplace=True)

predicted.to_csv('result.csv', index = False)