# 4.3.6 Challenge Make Your Own Network

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# Import the model.
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import GradientBoostingClassifier

# Import Metrics
from sklearn.metrics import adjusted_rand_score
from sklearn.model_selection import cross_val_score

# Bank Marketing Data Set
__Abstract:__ The data is related to direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict whether the client will subscribe to a term deposit (variable y).

__Source:__ [UCI Machine Learning Repository Bank Marketing Data Set](https://archive.ics.uci.edu/ml/datasets/Bank+Marketing)

__Data Set Information:__ The data is related to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (bank term deposit) would be subscribed ('yes') or not ('no').  

The classification goal is to predict whether the client will subscribe (yes/no) to a term deposit (variable y).

__Attribute Information:__
Input variables:
#### bank client data:
- 1 - age (numeric)
- 2 - job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')
- 3 - marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)
- 4 - education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')
- 5 - default: has credit in default? (categorical: 'no','yes','unknown')
- 6 - housing: has housing loan? (categorical: 'no','yes','unknown')
- 7 - loan: has personal loan? (categorical: 'no','yes','unknown')
#### related with the last contact of the current campaign:
- 8 - contact: contact communication type (categorical: 'cellular','telephone') 
- 9 - month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
- 10 - day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')
- 11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
#### other attributes:
- 12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
- 13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
- 14 - previous: number of contacts performed before this campaign and for this client (numeric)
- 15 - poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')
#### social and economic context attributes
- 16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)
- 17 - cons.price.idx: consumer price index - monthly indicator (numeric) 
- 18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric) 
- 19 - euribor3m: euribor 3 month rate - daily indicator (numeric)
- 20 - nr.employed: number of employees - quarterly indicator (numeric)

Output variable (desired target):
- 21 - y - has the client subscribed a term deposit? (binary: 'yes','no')
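
The notes on attributes 11 (duration) and 13 (pdays) suggest two preprocessing steps for a realistic model. The following is a minimal sketch on a toy frame with illustrative values; the flag column name `previously_contacted` is hypothetical and is not part of the data set.

```python
import pandas as pd

# Toy rows that mimic a few columns of the bank marketing data (values are illustrative).
toy = pd.DataFrame({
    "age": [56, 37, 44],
    "duration": [261, 0, 512],
    "pdays": [999, 999, 6],
    "y": ["no", "no", "yes"],
})

# 999 is a sentinel for "not previously contacted", so expose it as a 0/1 flag
# (the column name `previously_contacted` is a hypothetical choice).
toy["previously_contacted"] = (toy["pdays"] != 999).astype(int)

# Drop the target and the leaky `duration` column from the feature matrix.
features = toy.drop(columns=["y", "duration"])
target = toy["y"]
print(features)
```

Leaving duration out trades benchmark performance for a model that could actually be used before a call is made.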

## Loading, cleaning and creating dummy variables in our dataset

Reading the dataset:

In [2]:
bkm = pd.read_excel('/home/swisswaygo/Mache/bkm.xls')
bkm.head()

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
2,37,services,married,high.school,no,yes,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
4,56,services,married,high.school,no,no,yes,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no


In [3]:
bkm.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41188 entries, 0 to 41187
Data columns (total 21 columns):
age               41188 non-null int64
job               41188 non-null object
marital           41188 non-null object
education         41188 non-null object
default           41188 non-null object
housing           41188 non-null object
loan              41188 non-null object
contact           41188 non-null object
month             41188 non-null object
day_of_week       41188 non-null object
duration          41188 non-null int64
campaign          41188 non-null int64
pdays             41188 non-null int64
previous          41188 non-null int64
poutcome          41188 non-null object
emp.var.rate      41188 non-null float64
cons.price.idx    41188 non-null float64
cons.conf.idx     41188 non-null float64
euribor3m         41188 non-null float64
nr.employed       41188 non-null float64
y                 41188 non-null object
dtypes: float64(5), int64(5), object(11)
memory usa

Eleven of the 21 columns are categorical (object dtype) and the other ten are numeric. Every column has 41188 non-null values, so no data appears to be missing. Let's check the tail of the data set to confirm.

In [4]:
bkm.tail()

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
41183,73,retired,married,professional.course,no,yes,no,cellular,nov,fri,...,1,999,0,nonexistent,-1.1,94.767,-50.8,1.028,4963.6,yes
41184,46,blue-collar,married,professional.course,no,no,no,cellular,nov,fri,...,1,999,0,nonexistent,-1.1,94.767,-50.8,1.028,4963.6,no
41185,56,retired,married,university.degree,no,yes,no,cellular,nov,fri,...,2,999,0,nonexistent,-1.1,94.767,-50.8,1.028,4963.6,no
41186,44,technician,married,professional.course,no,no,no,cellular,nov,fri,...,1,999,0,nonexistent,-1.1,94.767,-50.8,1.028,4963.6,yes
41187,74,retired,married,professional.course,no,yes,no,cellular,nov,fri,...,3,999,1,failure,-1.1,94.767,-50.8,1.028,4963.6,no


The tail shows no null rows either. We still call dropna() as a safeguard; the row count confirms that nothing is removed.

In [5]:
bkm = bkm.dropna()
print(len(bkm))
bkm.head()

41188


Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
2,37,services,married,high.school,no,yes,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
4,56,services,married,high.school,no,no,yes,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no


In [6]:
# Create dummy variables from the job, marital, education, housing and loan variables
bkm_job = pd.get_dummies(bkm['job'])
bkm_marital = pd.get_dummies(bkm['marital'])
bkm_education = pd.get_dummies(bkm['education'])
bkm_housing = pd.get_dummies(bkm['housing'])
bkm_loan = pd.get_dummies(bkm['loan'])

# Join the dummy variables to the main dataframe
bkm = pd.concat([bkm, bkm_job, bkm_marital, bkm_education, bkm_housing, bkm_loan], axis=1)
bkm.head()

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,illiterate,professional.course,university.degree,unknown,no,unknown.1,yes,no.1,unknown.2,yes.1
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,...,0,0,0,0,1,0,0,1,0,0
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,...,0,0,0,0,1,0,0,1,0,0
2,37,services,married,high.school,no,yes,no,telephone,may,mon,...,0,0,0,0,0,0,1,1,0,0
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,...,0,0,0,0,1,0,0,1,0,0
4,56,services,married,high.school,no,no,yes,telephone,may,mon,...,0,0,0,0,1,0,0,0,0,1
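
Because the dummy columns are named only after their category values, names such as 'unknown', 'no' and 'yes' repeat across variables (shown in the output as unknown.1, no.1 and so on). One way to avoid the collision, sketched here on a toy frame rather than on bkm, is to pass the column list to pd.get_dummies so that each dummy is prefixed with its source column.

```python
import pandas as pd

toy = pd.DataFrame({
    "age": [56, 57, 37],
    "housing": ["no", "unknown", "yes"],
    "loan": ["no", "no", "yes"],
})

# Separate calls yield columns named only after the category values,
# so 'no' and 'yes' appear twice after concatenation.
separate = pd.concat(
    [toy, pd.get_dummies(toy["housing"]), pd.get_dummies(toy["loan"])], axis=1
)
print(separate.columns.tolist())

# Passing `columns=` prefixes each dummy with its source column
# and drops the original categorical columns in the same step.
encoded = pd.get_dummies(toy, columns=["housing", "loan"])
print(encoded.columns.tolist())
```

Unique column names make it possible to select or inspect a single dummy later without ambiguity.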


## Building a Model - Default Settings

We will use multi-layer perceptron modeling (MLP) to classify if a client has subscribed to a term deposit.

We will drop non-numerical data:

In [7]:
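# Drop the original categorical (object) columns. Note that default, contact,
# month, day_of_week and poutcome were not dummy-encoded, so their information is lost here.
bkm.drop(['job', 'marital', 'education', 'default', 'housing', 'loan',
          'contact', 'month', 'day_of_week', 'poutcome'], axis=1, inplace=True)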

In [8]:
bkm.shape

(41188, 41)

In [9]:
# Identifying variables
X = bkm.drop('y', axis=1)
Y = bkm.y

In [10]:
# Establish and fit the model with a single hidden layer of 100 units (the default).
mlp = MLPClassifier()
mlp.fit(X, Y)

MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(100,), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=None,
       shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=False)

The following are the ground truth percentages for reference:

In [11]:
Y.value_counts()/len(Y)

no     0.887346
yes    0.112654
Name: y, dtype: float64

We will calculate the adjusted Rand score, which measures how closely the predictions agree with the ground truth after correcting for chance agreement.
- http://scikit-learn.org/stable/modules/generated/sklearn.metrics.adjusted_rand_score.html
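
To make the scale of this score concrete, here is a small sketch on made-up labels (0 stands for 'no', 1 for 'yes'). It illustrates three properties of adjusted_rand_score: a constant majority-class prediction scores 0, a perfect prediction scores 1, and, because the metric was designed for clustering, swapping the label names does not change the score.

```python
from sklearn.metrics import adjusted_rand_score

# Made-up ground truth: 0 stands for 'no', 1 for 'yes'.
y_true = [0, 0, 0, 1, 1]

# Predicting the majority class for everyone is scored at chance level.
ari_majority = adjusted_rand_score(y_true, [0] * len(y_true))

# A perfect prediction gets the maximum score.
ari_perfect = adjusted_rand_score(y_true, y_true)

# The score ignores label names: swapping every label gives the same
# grouping of samples, even though each individual prediction is wrong.
ari_swapped = adjusted_rand_score(y_true, [1, 1, 1, 0, 0])

print(ari_majority, ari_perfect, ari_swapped)
```

The last property is worth keeping in mind when reading the scores below: the metric rewards consistent grouping, not correct class names.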

In [12]:
# 5-fold cross validation
ars = cross_val_score(mlp, X, Y, scoring='adjusted_rand_score', cv=5)
print('Cross Validation Scores: {:.5f}(+/- {:.2f})'.format(ars.mean(), ars.std()*2))

Cross Validation Scores: 0.21621(+/- 0.36)


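The adjusted Rand score is approximately 0.22. Since 0 corresponds to random labeling and 1 to perfect agreement, the model is only modestly better than chance, and the wide spread across folds (+/- 0.36) shows that its performance is unstable and may not generalize well.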
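
Two possible contributors to the spread across folds, offered as assumptions rather than a diagnosis: the inputs mix very different scales (pdays = 999 next to 0/1 dummies), and the rows appear to be ordered in time (the head is all 'may', the tail all 'nov'), so unshuffled folds may see different conditions. The sketch below shows one way to address both, using synthetic data in place of X and Y.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X and Y with roughly an 89% / 11% class balance.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.89, 0.11], random_state=0
)

# Scaling happens inside the pipeline, so each fold is scaled
# using statistics from its own training split only.
model = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(20,), max_iter=300, random_state=0),
)

# Shuffled, stratified folds keep the class ratio similar in every fold.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(
    model, X_demo, y_demo, scoring="adjusted_rand_score", cv=folds
)
print("ARI per fold:", scores.round(3))
```

Whether this narrows the spread on the real data would have to be checked; the sketch only shows the mechanics.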

In [13]:
# Get predicted classes.
full_pred = mlp.predict(X)
pd.crosstab(Y, full_pred) 

col_0,no,yes
y,Unnamed: 1_level_1,Unnamed: 2_level_1
no,36431,117
yes,4299,341


The model predicts "no" for almost every client: only 341 of the 4,640 actual subscribers are identified. The class skew (about 89% "no") therefore dominates the predictions.
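
The crosstab can be summarized with a few standard ratios. This sketch copies the four counts from the table above and computes them by hand; the variable names are illustrative.

```python
# Counts copied from the crosstab above (rows = actual, columns = predicted).
tn, fp = 36431, 117   # actual 'no':  predicted 'no', predicted 'yes'
fn, tp = 4299, 341    # actual 'yes': predicted 'no', predicted 'yes'

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
majority_baseline = (tn + fp) / total   # accuracy of always predicting 'no'
recall_yes = tp / (tp + fn)             # share of actual subscribers found
precision_yes = tp / (tp + fp)          # share of 'yes' predictions that are right

print(f"accuracy:  {accuracy:.4f} (baseline {majority_baseline:.4f})")
print(f"recall:    {recall_yes:.4f}")
print(f"precision: {precision_yes:.4f}")
```

Accuracy (about 89.3%) barely exceeds the always-'no' baseline (about 88.7%), and recall for 'yes' is only about 7%, even though these predictions are made on the training data itself.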

## Model 2 - Logistic Activation
We will keep the default MLP settings, although we will change the activation to logistic.

In [14]:
# Establish and fit the model with logistic activation; all other settings are defaults.
mlp2 = MLPClassifier(activation='logistic')
mlp2.fit(X, Y)

MLPClassifier(activation='logistic', alpha=0.0001, batch_size='auto',
       beta_1=0.9, beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(100,), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=None,
       shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=False)

In [15]:
# 5-fold cross validation
ars2 = cross_val_score(mlp2, X, Y, scoring='adjusted_rand_score', cv=5)
print('Cross Validation Adjusted Rand Scores: {:.5f}(+/- {:.2f})'.format(ars2.mean(), ars2.std()*2))

Cross Validation Adjusted Rand Scores: 0.22758(+/- 0.23)


The adjusted Rand score is slightly higher (0.228 vs. 0.216) and the spread across folds narrowed (+/- 0.23 vs. +/- 0.36), which suggests the model generalizes more consistently.

In [16]:
# Get predicted classes.
full_pred2 = mlp2.predict(X)
pd.crosstab(Y, full_pred2) 

col_0,no,yes
y,Unnamed: 1_level_1,Unnamed: 2_level_1
no,36286,262
yes,3823,817


The results didn't change much, so we will change further parameters to try to optimize the model.

## Model 3 - Playing with Size of Layers
Let's keep the logistic activation and increase the size of the hidden layer from 100 to 1000 units.

In [17]:
# Establish and fit the model with a single hidden layer of 1000 units.
mlp3 = MLPClassifier(activation='logistic', hidden_layer_sizes=(1000))
mlp3.fit(X, Y)

MLPClassifier(activation='logistic', alpha=0.0001, batch_size='auto',
       beta_1=0.9, beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=1000, learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=None,
       shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=False)
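
A side note on the hidden_layer_sizes=(1000) argument: the repr above shows hidden_layer_sizes=1000, so the call was accepted, but in Python parentheses alone do not create a tuple. A quick check in plain Python:

```python
single = (1000)       # parentheses alone: this is just the integer 1000
one_layer = (1000,)   # the trailing comma makes a one-element tuple

print(type(single).__name__, type(one_layer).__name__)
```

Writing (1000,) keeps the argument in the same tuple form as the default (100,) and the two-layer (1000, 1000) used later.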

In [18]:
# 5-fold cross validation
ars3 = cross_val_score(mlp3, X, Y, scoring='adjusted_rand_score', cv=5)
print('Cross Validation Adjusted Rand Scores: {:.5f}(+/- {:.2f})'.format(ars3.mean(), ars3.std()*2))

Cross Validation Adjusted Rand Scores: 0.11631(+/- 0.23)


The adjusted Rand score dropped to about 0.12, roughly half that of the previous model, while the spread across folds stayed the same (+/- 0.23).

In [19]:
# Get predicted classes.
full_pred3 = mlp3.predict(X)
pd.crosstab(Y, full_pred3) 

col_0,no,yes
y,Unnamed: 1_level_1,Unnamed: 2_level_1
no,33130,3418
yes,1362,3278


## Model 4 - Multiple Large Layers
We will now use two hidden layers of 1000 units each.

In [20]:
# Establish and fit the model with two hidden layers of 1000 units each.
mlp4 = MLPClassifier(activation='logistic', hidden_layer_sizes=(1000, 1000))
mlp4.fit(X, Y)

MLPClassifier(activation='logistic', alpha=0.0001, batch_size='auto',
       beta_1=0.9, beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(1000, 1000), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=None,
       shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=False)

In [21]:
# 5-fold cross validation
ars4 = cross_val_score(mlp4, X, Y, scoring='adjusted_rand_score', cv=5)
print('Cross Validation Adjusted Rand Scores: {:.5f}(+/- {:.2f})'.format(ars4.mean(), ars4.std()*2))

Cross Validation Adjusted Rand Scores: 0.20763(+/- 0.27)


The adjusted Rand score recovered to about 0.21, and the spread across folds widened slightly (+/- 0.27). We will try adjusting other hyperparameters.

In [22]:
# Get predicted classes.
full_pred4 = mlp4.predict(X)
pd.crosstab(Y, full_pred4) 

col_0,no,yes
y,Unnamed: 1_level_1,Unnamed: 2_level_1
no,36158,390
yes,3694,946


## Model 5 - Alpha
We will reduce alpha, the L2 regularization strength, from its default of 1e-4 to 1e-6:

In [23]:
# Establish and fit the model with two 1000-unit hidden layers and a lower alpha.
mlp5 = MLPClassifier(activation='logistic', hidden_layer_sizes=(1000, 1000), alpha=1e-6)
mlp5.fit(X, Y)

MLPClassifier(activation='logistic', alpha=1e-06, batch_size='auto',
       beta_1=0.9, beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(1000, 1000), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=None,
       shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=False)

In [24]:
# 5-fold cross validation
ars5 = cross_val_score(mlp5, X, Y, scoring='adjusted_rand_score', cv=5)
print('Cross Validation Adjusted Rand Scores: {:.5f}(+/- {:.2f})'.format(ars5.mean(), ars5.std()*2))

Cross Validation Adjusted Rand Scores: 0.19107(+/- 0.30)


The adjusted Rand score decreased slightly relative to the previous model (0.191 vs. 0.208), and the spread across folds widened to +/- 0.30.

In [25]:
# Get predicted classes.
full_pred5 = mlp5.predict(X)
pd.crosstab(Y, full_pred5) 

col_0,no,yes
y,Unnamed: 1_level_1,Unnamed: 2_level_1
no,36341,207
yes,3928,712


## Model 6 - Smaller Layer, Lower Alpha
We will now use a single smaller layer (the default 100 units) and an even lower alpha (1e-7):

In [26]:
# Establish and fit the model with the default 100-unit layer and alpha=1e-7.
mlp6 = MLPClassifier(activation='logistic', alpha=1e-7)
mlp6.fit(X, Y)

MLPClassifier(activation='logistic', alpha=1e-07, batch_size='auto',
       beta_1=0.9, beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(100,), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=None,
       shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=False)

In [27]:
# 5-fold cross validation
ars6 = cross_val_score(mlp6, X, Y, scoring='adjusted_rand_score', cv=5)
print('Cross Validation Adjusted Rand Scores: {:.5f}(+/- {:.2f})'.format(ars6.mean(), ars6.std()*2))

Cross Validation Adjusted Rand Scores: 0.24971(+/- 0.20)


The adjusted Rand score is the highest so far (about 0.25), and the spread across folds is the narrowest of the MLP models (+/- 0.20).

In [28]:
# Get predicted classes.
full_pred6 = mlp6.predict(X)
pd.crosstab(Y, full_pred6) 

col_0,no,yes
y,Unnamed: 1_level_1,Unnamed: 2_level_1
no,35901,647
yes,3395,1245


We will now compare these results with a gradient boosted classifier trained on the same dataset.

# Gradient Boosted Classifier Model


In [29]:
#instantiating and fitting the model
gbc = GradientBoostingClassifier()
gbc.fit(X, Y)

GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.1, loss='deviance', max_depth=3,
              max_features=None, max_leaf_nodes=None,
              min_impurity_decrease=0.0, min_impurity_split=None,
              min_samples_leaf=1, min_samples_split=2,
              min_weight_fraction_leaf=0.0, n_estimators=100,
              presort='auto', random_state=None, subsample=1.0, verbose=0,
              warm_start=False)

In [30]:
# 5-fold cross validation
ars7 = cross_val_score(gbc, X, Y, scoring='adjusted_rand_score', cv=5)
print('Cross Validation Adjusted Rand Scores: {:.5f}(+/- {:.2f})'.format(ars7.mean(), ars7.std()*2))

Cross Validation Adjusted Rand Scores: -0.00611(+/- 0.08)


The adjusted Rand score is essentially zero (-0.006), which means chance-level agreement. However, the spread across folds (+/- 0.08) is much narrower than for the MLP models.

Overall, the last MLP model did the best job, though it's not great. However, we can see that it did a much better job than the gradient boosted classifier.

I think these results are partly a consequence of not having enough relevant features. For that reason, I created dummy variables for several categorical features, which did improve the results.