# Challenge: Make Your Network

For this challenge you have two options for how to use neural networks. Choose one of the following:

* Use an RBM to perform feature extraction on an image-based dataset that you find or create. If you go this route, present the features you extract and explain why this is a useful feature extraction method in the context you’re operating in. DO NOT USE either the MNIST digit recognition database or the iris data set. Both have been worked on publicly many times, and the code is easily available. (However, that code could be a useful resource to refer to.) _OR_,

* Create a multi-layer perceptron neural network model to predict on a labeled dataset of your choosing. Compare this model to either a boosted tree or a random forest model and describe the relative tradeoffs between complexity and accuracy. Be sure to vary the hyperparameters of your MLP!

Once you've chosen which option you prefer, get to modeling and submit your work below.

# Introduction
For this exercise we will be using the Bank Marketing dataset supplied by the UCI Machine Learning Repository at the link below.
https://archive.ics.uci.edu/ml/datasets/Bank+Marketing

This dataset is derived from a Portuguese banking institution and is related to its direct marketing campaigns, which were based on phone calls. The classification goal is to predict whether or not a customer will subscribe to a bank term deposit ('yes' or 'no'). This is our target variable y.

The dataset consists of 20 features and a total of 41188 entries.
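
Since the goal is a binary yes/no prediction, an accuracy figure is easier to interpret next to the share of the most common class. The sketch below computes that majority-class baseline using only the standard library. The `majority_baseline` helper and the toy labels are illustrative assumptions, not part of the original notebook or the bank data.

```python
from collections import Counter

def majority_baseline(labels):
    """Return the accuracy of always predicting the most common label."""
    counts = Counter(labels)
    # most_common(1) gives [(label, count)] for the most frequent label
    return counts.most_common(1)[0][1] / len(labels)

# Toy labels for illustration only; not drawn from the bank dataset.
toy_y = ['no', 'no', 'no', 'yes']
baseline = majority_baseline(toy_y)
print('Majority-class baseline: {:.2%}'.format(baseline))
```

In the notebook, the same helper could be called as `majority_baseline(y)` once `y` is defined, since iterating a pandas Series yields its values.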

In [5]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
%matplotlib inline

In [6]:
df = pd.read_csv('bank-additional/bank-additional-full.csv', sep=';')
df.head()

|  | age | job | marital | education | default | housing | loan | contact | month | day_of_week | ... | campaign | pdays | previous | poutcome | emp.var.rate | cons.price.idx | cons.conf.idx | euribor3m | nr.employed | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 56 | housemaid | married | basic.4y | no | no | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 1 | 57 | services | married | high.school | unknown | no | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 2 | 37 | services | married | high.school | no | yes | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 3 | 40 | admin. | married | basic.6y | no | no | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 4 | 56 | services | married | high.school | no | no | yes | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |


In [7]:
df.describe()

|  | age | duration | campaign | pdays | previous | emp.var.rate | cons.price.idx | cons.conf.idx | euribor3m | nr.employed |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 41188.0 | 41188.0 | 41188.0 | 41188.0 | 41188.0 | 41188.0 | 41188.0 | 41188.0 | 41188.0 | 41188.0 |
| mean | 40.02406 | 258.28501 | 2.567593 | 962.475454 | 0.172963 | 0.081886 | 93.575664 | -40.5026 | 3.621291 | 5167.035911 |
| std | 10.42125 | 259.279249 | 2.770014 | 186.910907 | 0.494901 | 1.57096 | 0.57884 | 4.628198 | 1.734447 | 72.251528 |
| min | 17.0 | 0.0 | 1.0 | 0.0 | 0.0 | -3.4 | 92.201 | -50.8 | 0.634 | 4963.6 |
| 25% | 32.0 | 102.0 | 1.0 | 999.0 | 0.0 | -1.8 | 93.075 | -42.7 | 1.344 | 5099.1 |
| 50% | 38.0 | 180.0 | 2.0 | 999.0 | 0.0 | 1.1 | 93.749 | -41.8 | 4.857 | 5191.0 |
| 75% | 47.0 | 319.0 | 3.0 | 999.0 | 0.0 | 1.4 | 93.994 | -36.4 | 4.961 | 5228.1 |
| max | 98.0 | 4918.0 | 56.0 | 999.0 | 7.0 | 1.4 | 94.767 | -26.9 | 5.045 | 5228.1 |


In [8]:
df.isnull().sum()

age               0
job               0
marital           0
education         0
default           0
housing           0
loan              0
contact           0
month             0
day_of_week       0
duration          0
campaign          0
pdays             0
previous          0
poutcome          0
emp.var.rate      0
cons.price.idx    0
cons.conf.idx     0
euribor3m         0
nr.employed       0
y                 0
dtype: int64

In [9]:
df.dtypes

age                 int64
job                object
marital            object
education          object
default            object
housing            object
loan               object
contact            object
month              object
day_of_week        object
duration            int64
campaign            int64
pdays               int64
previous            int64
poutcome           object
emp.var.rate      float64
cons.price.idx    float64
cons.conf.idx     float64
euribor3m         float64
nr.employed       float64
y                  object
dtype: object

In [10]:
X = df.drop(['y'], axis=1)
y = df['y']

In [11]:
X = pd.get_dummies(X, sparse=True)
X.head()

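|  | age | duration | campaign | pdays | previous | emp.var.rate | cons.price.idx | cons.conf.idx | euribor3m | nr.employed | ... | month_oct | month_sep | day_of_week_fri | day_of_week_mon | day_of_week_thu | day_of_week_tue | day_of_week_wed | poutcome_failure | poutcome_nonexistent | poutcome_success |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 56 | 261 | 1 | 999 | 0 | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 57 | 149 | 1 | 999 | 0 | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 37 | 226 | 1 | 999 | 0 | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3 | 40 | 151 | 1 | 999 | 0 | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | 56 | 307 | 1 | 999 | 0 | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |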


In [12]:
X.shape

(41188, 63)

## Modeling

In [13]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In [9]:
# Store layer size tuples and alphas in a list
hidden_layers = [(100,), (100,5), (1000,5), (1000,50)]
alphas = [0, 1e-6, 1e-3]

# Iterate through each layer size and alpha
for layer in hidden_layers:
    for alpha in alphas:
        # Establish and fit the model
        mlp = MLPClassifier(hidden_layer_sizes=layer, alpha=alpha)
        mlp.fit(X_train, y_train)
        # Note: this scores the model on the training data, not the held-out test set
        score = mlp.score(X_train, y_train)
        print('Hidden layer size: {}'.format(layer))
        print('Alpha: {}'.format(alpha))
        print('Accuracy: {:.2%}\n'.format(score))

Hidden layer size: (100,)
Alpha: 0
Accuracy: 90.96%

Hidden layer size: (100,)
Alpha: 1e-06
Accuracy: 90.47%

Hidden layer size: (100,)
Alpha: 0.001
Accuracy: 90.80%

Hidden layer size: (100, 5)
Alpha: 0
Accuracy: 90.92%

Hidden layer size: (100, 5)
Alpha: 1e-06
Accuracy: 90.22%

Hidden layer size: (100, 5)
Alpha: 0.001
Accuracy: 88.79%

Hidden layer size: (1000, 5)
Alpha: 0
Accuracy: 88.79%

Hidden layer size: (1000, 5)
Alpha: 1e-06
Accuracy: 90.84%

Hidden layer size: (1000, 5)
Alpha: 0.001
Accuracy: 88.79%

Hidden layer size: (1000, 50)
Alpha: 0
Accuracy: 90.77%

Hidden layer size: (1000, 50)
Alpha: 1e-06
Accuracy: 90.85%

Hidden layer size: (1000, 50)
Alpha: 0.001
Accuracy: 90.79%
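
The loop above reports accuracy on the same data the network was trained on. One way to compare hyperparameter settings on held-out data instead is scikit-learn's `GridSearchCV`, which cross-validates every combination in a grid. The following is a minimal sketch under stated assumptions: it uses a small synthetic dataset (`X_demo`, `y_demo`) and a reduced grid so it runs quickly, and the values shown are illustrative rather than the settings used in this notebook.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data (assumption); swap in X_train, y_train from the notebook.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# Reduced grid for illustration; the notebook's layer sizes and alphas could go here.
param_grid = {
    'hidden_layer_sizes': [(10,), (10, 5)],
    'alpha': [1e-6, 1e-3],
}

search = GridSearchCV(
    MLPClassifier(max_iter=200, random_state=0),
    param_grid,
    cv=3,  # each combination is scored on 3 held-out folds
)

# Small networks may stop before converging; silence that warning for the demo.
with warnings.catch_warnings():
    warnings.simplefilter('ignore', category=ConvergenceWarning)
    search.fit(X_demo, y_demo)

print('Best parameters: {}'.format(search.best_params_))
print('Best mean CV accuracy: {:.2%}'.format(search.best_score_))
```

`search.cv_results_` holds the per-combination scores, which makes it possible to see how accuracy varies across the grid rather than only the best setting.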



In [14]:
# Instantiate and fit a random forest with default parameters
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)

# Calculate accuracy scores
score_train = rfc.score(X_train, y_train)
score_test = rfc.score(X_test, y_test)

# Print results
print('Training Set Accuracy: {:.2%}\n'.format(score_train))
print('Test Set Accuracy: {:.2%}\n'.format(score_test))



Training Set Accuracy: 99.28%

Test Set Accuracy: 90.52%



In [16]:
y_pred = rfc.predict(X_test)

# Run cross validation
scores = cross_val_score(rfc, X_train, y_train, cv=5)

# Print results
print('Cross Validation Scores:\n{}'.format(scores))
print('Average Cross Validation Score:\n{0:.2%}'.format(scores.mean()))

Cross Validation Scores:
[0.90636379 0.9084446  0.90877558 0.90686785 0.91049436]
Average Cross Validation Score:
90.82%


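We can see that the random forest reaches accuracy similar to the MLP classifier (about 90%) in a fraction of the time. Note that the MLP figures above are training-set accuracy, while the random forest figures (90.52% on the test set and 90.82% on average in cross-validation) come from held-out data, so the comparison is not strictly like for like. Several MLP configurations took far longer to fit once I increased the number and size of the hidden layers, yet accuracy barely improved. The gap between the random forest's training accuracy (99.28%) and its test accuracy (90.52%) also suggests that it overfits with default settings. On these results, the random forest is the more practical option for modeling this particular dataset.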
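
The training-time claim can be checked by timing each fit directly and scoring both models on the same held-out split. The sketch below does this on a small synthetic dataset so that it runs in seconds. The `time_fit` helper, the synthetic data, and the chosen hyperparameters are assumptions for illustration, and the timings it prints will not match those of the bank dataset.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

def time_fit(model, X_tr, y_tr, X_te, y_te):
    """Fit the model, returning (seconds spent fitting, test-set accuracy)."""
    start = time.perf_counter()
    model.fit(X_tr, y_tr)
    elapsed = time.perf_counter() - start
    return elapsed, model.score(X_te, y_te)

# Synthetic stand-in data (assumption); replace with the notebook's X and y.
X_demo, y_demo = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

models = [
    ('Random Forest', RandomForestClassifier(random_state=0)),
    ('MLP', MLPClassifier(hidden_layer_sizes=(100, 5), max_iter=300, random_state=0)),
]

results = {}
for name, model in models:
    results[name] = time_fit(model, X_tr, y_tr, X_te, y_te)
    print('{}: fit in {:.2f}s, test accuracy {:.2%}'.format(name, *results[name]))
```

Scoring both models on `X_te` keeps the accuracy comparison on equal footing, which the training-set scores in the MLP loop do not.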