# AI Apprentice Lab 4 Solution
#### Multi-Layered Perceptron Classification

1. The dataset provided is described below 
2. Use basic Python packages for numeric computing
3. Replace string values with appropriate ones
4. Split features and labels
5. Split data for training and testing
6. Train MLPClassifier from sklearn.neural_network
7. Evaluate results

In [1]:
#      IMPORT REQUIRED LIBRARIES
import pandas
import numpy as np
import sklearn

In [2]:
# Load bank-full.csv into a DataFrame (the file is semicolon-delimited)
data = pandas.read_csv("Data/bank-full.csv", delimiter=';' )

## Dataset information


Citation Request:
  This dataset is publicly available for research. The details are described in [Moro et al., 2011].
  Please include this citation if you plan to use this database:

  [Moro et al., 2011] S. Moro, R. Laureano and P. Cortez. Using Data Mining for Bank Direct Marketing: An Application of the CRISP-DM Methodology. 
  In P. Novais et al. (Eds.), Proceedings of the European Simulation and Modelling Conference - ESM'2011, pp. 117-121, Guimarães, Portugal, October, 2011. EUROSIS.

  Available at: [pdf] http://hdl.handle.net/1822/14838
                [bib] http://www3.dsi.uminho.pt/pcortez/bib/2011-esm-1.txt

1. Title: Bank Marketing

2. Sources
   Created by: Paulo Cortez (Univ. Minho) and Sérgio Moro (ISCTE-IUL) @ 2012
   
3. Past Usage:

  The full dataset was described and analyzed in:

  S. Moro, R. Laureano and P. Cortez. Using Data Mining for Bank Direct Marketing: An Application of the CRISP-DM Methodology. 
  In P. Novais et al. (Eds.), Proceedings of the European Simulation and Modelling Conference - ESM'2011, pp. 117-121, Guimarães, 
  Portugal, October, 2011. EUROSIS.

4. Relevant Information:
   ### Dataset features
   The data is related to direct marketing campaigns of a Portuguese banking institution. 
   The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required 
   in order to assess whether the product (bank term deposit) would be subscribed to or not. 

   There are two datasets: 
      1) bank-full.csv with all examples, ordered by date (from May 2008 to November 2010).
      2) bank.csv with 10% of the examples (4521), randomly selected from bank-full.csv.
   The smaller dataset is provided to test more computationally demanding machine learning algorithms (e.g. SVM).

   The classification goal is to predict whether the client will subscribe to a term deposit (variable y).

5. Number of Instances: 45211 for bank-full.csv (4521 for bank.csv)

6. Number of Attributes: 16 + output attribute.

7. Attribute information:

   For more information, read [Moro et al., 2011].

   Input variables:
    ### bank client data:
   1 - age (numeric);
   
   2 - job : type of job (categorical: "admin.","unknown","unemployed","management","housemaid","entrepreneur","student",
                                       "blue-collar","self-employed","retired","technician","services");
                                       
   3 - marital : marital status (categorical: "married","divorced","single"; note: "divorced" means divorced or widowed);
   
   4 - education (categorical: "unknown","secondary","primary","tertiary");
   
   5 - default: has credit in default? (binary: "yes","no");
   
   6 - balance: average yearly balance, in euros (numeric);
   
   7 - housing: has housing loan? (binary: "yes","no");
   
   8 - loan: has personal loan? (binary: "yes","no");
   
    ### related with the last contact of the current campaign:
    
   9 - contact: contact communication type (categorical: "unknown","telephone","cellular") ;
   
  10 - day: last contact day of the month (numeric);
  
  11 - month: last contact month of year (categorical: "jan", "feb", "mar", ..., "nov", "dec");
  
  12 - duration: last contact duration, in seconds (numeric);
  
   ### other attributes:
   
  13 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact);
  
  14 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted);
  
  15 - previous: number of contacts performed before this campaign and for this client (numeric);
  
  16 - poutcome: outcome of the previous marketing campaign (categorical: "unknown","other","failure","success");

  Output variable (desired target):
  
  17 - y - has the client subscribed a term deposit? (binary: "yes","no")

8. Missing Attribute Values: None

#### The dataset contains several string-valued columns. To train our classifier, we need to replace the categorical strings with one-hot encodings and map the binary "yes"/"no" strings to 1 and 0

In [3]:
string_cols = ['job','marital','education','contact','month','poutcome']
for item in string_cols:
    # One-hot encode the column; dtype=int yields 0/1 values rather than booleans
    df = pandas.get_dummies(data[item], prefix=item, dtype=int)
    data = data.drop(item, axis=1)
    for col_name in df.columns:
        # Set the new column in data to have the corresponding df values
        data[col_name] = df[col_name]
binary_cols = ['default','housing', 'loan', 'y']
bin_dict = {'yes':1, 'no':0}
# Replace binary values in data using the provided dictionary
for item in binary_cols:
    data.replace({item:bin_dict},inplace=True)
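As a smaller illustration of the same idea, the sketch below one-hot encodes a toy frame in a single `get_dummies` call and maps a yes/no column to 1/0. The toy frame and its values are hypothetical; only the column names are borrowed from the bank dataset. It is an alternative formulation, not the cell used in this lab.

```python
import pandas as pd

# Hypothetical toy data; column names mimic the bank dataset.
toy = pd.DataFrame({
    "marital": ["married", "single", "divorced", "single"],
    "housing": ["yes", "no", "yes", "no"],
})

# One-hot encode 'marital'; dtype=int gives 0/1 instead of booleans.
encoded = pd.get_dummies(toy, columns=["marital"], prefix=["marital"], dtype=int)

# Map the binary yes/no column to 1/0.
encoded["housing"] = encoded["housing"].map({"yes": 1, "no": 0})

print(encoded)
```

Each row ends up with exactly one of the `marital_*` columns set to 1, which is what makes the encoding "one-hot".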

#### Now we should normalize the numeric columns of our dataset where needed. Let's look at the encoded data first

In [4]:
data

Unnamed: 0,age,default,balance,housing,loan,day,duration,campaign,pdays,previous,...,month_jun,month_mar,month_may,month_nov,month_oct,month_sep,poutcome_failure,poutcome_other,poutcome_success,poutcome_unknown
0,58,0,2143,1,0,5,261,1,-1,0,...,0,0,1,0,0,0,0,0,0,1
1,44,0,29,1,0,5,151,1,-1,0,...,0,0,1,0,0,0,0,0,0,1
2,33,0,2,1,1,5,76,1,-1,0,...,0,0,1,0,0,0,0,0,0,1
3,47,0,1506,1,0,5,92,1,-1,0,...,0,0,1,0,0,0,0,0,0,1
4,33,0,1,0,0,5,198,1,-1,0,...,0,0,1,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45206,51,0,825,0,0,17,977,3,-1,0,...,0,0,0,1,0,0,0,0,0,1
45207,71,0,1729,0,0,17,456,2,-1,0,...,0,0,0,1,0,0,0,0,0,1
45208,72,0,5715,0,0,17,1127,5,184,3,...,0,0,0,1,0,0,0,0,1,0
45209,57,0,668,0,0,17,508,4,-1,0,...,0,0,0,1,0,0,0,0,0,1


In [5]:
# This function performs min-max normalization, in place, on the specified column of a DataFrame
def normalize_column(df, col_name):
    df[col_name] = (df[col_name] - df[col_name].min())/(df[col_name].max()-df[col_name].min())

normalize_cols = ['age','balance','day','duration','campaign','pdays','previous']  
#Normalize the specified columns using the provided function
for item in normalize_cols:
    normalize_column(data,item)
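Min-max normalization maps each value x to (x - min) / (max - min), so the smallest value in a column becomes 0 and the largest becomes 1. Here is a minimal sketch on a toy Series. The helper name `min_max` and the guard for constant columns are additions for illustration, and the ages 18 and 95 are assumed endpoints, not values taken from the lab's output.

```python
import pandas as pd

def min_max(series):
    """Scale a numeric Series to [0, 1]; a constant column maps to all 0.0."""
    lo, hi = series.min(), series.max()
    if hi == lo:
        # Avoid dividing by zero when every value is identical.
        return pd.Series(0.0, index=series.index)
    return (series - lo) / (hi - lo)

# Hypothetical ages; 18 and 95 are assumed minimum and maximum values.
ages = pd.Series([18, 33, 58, 95])
scaled = min_max(ages)
print(scaled.tolist())
```

Unlike `normalize_column`, this version returns a new Series instead of modifying the DataFrame, which can make it easier to test in isolation.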

In [6]:
data

Unnamed: 0,age,default,balance,housing,loan,day,duration,campaign,pdays,previous,...,month_jun,month_mar,month_may,month_nov,month_oct,month_sep,poutcome_failure,poutcome_other,poutcome_success,poutcome_unknown
0,0.519481,0,0.092259,1,0,0.133333,0.053070,0.000000,0.000000,0.000000,...,0,0,1,0,0,0,0,0,0,1
1,0.337662,0,0.073067,1,0,0.133333,0.030704,0.000000,0.000000,0.000000,...,0,0,1,0,0,0,0,0,0,1
2,0.194805,0,0.072822,1,1,0.133333,0.015453,0.000000,0.000000,0.000000,...,0,0,1,0,0,0,0,0,0,1
3,0.376623,0,0.086476,1,0,0.133333,0.018707,0.000000,0.000000,0.000000,...,0,0,1,0,0,0,0,0,0,1
4,0.194805,0,0.072812,0,0,0.133333,0.040260,0.000000,0.000000,0.000000,...,0,0,1,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45206,0.428571,0,0.080293,0,0,0.533333,0.198658,0.032258,0.000000,0.000000,...,0,0,0,1,0,0,0,0,0,1
45207,0.688312,0,0.088501,0,0,0.533333,0.092721,0.016129,0.000000,0.000000,...,0,0,0,1,0,0,0,0,0,1
45208,0.701299,0,0.124689,0,0,0.533333,0.229158,0.064516,0.212156,0.010909,...,0,0,0,1,0,0,0,0,1,0
45209,0.506494,0,0.078868,0,0,0.533333,0.103294,0.048387,0.000000,0.000000,...,0,0,0,1,0,0,0,0,0,1


#### The data is ready for training. Let's separate the features from the labels and split the data into training and testing sets

In [7]:
#Separate the labels and features of the dataset. 
labels = data['y']
features = data.drop('y', axis=1)

In [8]:
#       IMPORT REQUIRED LIBRARIES
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn import metrics

In [9]:
#Separate the testing and training data with test size of 30%
x_train, x_test, y_train, y_test = train_test_split(features, labels, test_size = 0.3, random_state=25)
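To see what a 30% split does, here is a small sketch on synthetic data. It also passes `stratify=y`, an optional `train_test_split` argument that the cell above does not use, which keeps the class proportions the same in both parts. The arrays `X` and `y` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 100 samples, 10% positives.
X = np.arange(100).reshape(-1, 1)
y = np.array([1] * 10 + [0] * 90)

# stratify=y preserves the 10% positive rate in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=25, stratify=y
)

print(len(X_tr), len(X_te), y_tr.mean(), y_te.mean())
```

Fixing `random_state` makes the split reproducible, which is why the lab cell sets it as well.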

#### A multi-layer perceptron (MLP) is a small feed-forward neural network made of fully connected layers of simple neurons. MLPClassifier takes the sizes of the hidden layers, the maximum number of training iterations, an activation function, and a solver (optimizer)

We will use standard choices for the activation function (ReLU) and the optimizer (Adam) in this set of labs
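To make the role of `hidden_layer_sizes` concrete, the sketch below trains a much smaller network on synthetic data and inspects the shapes of its weight matrices. The data, the layer sizes `(8, 4)`, and `max_iter=50` are arbitrary choices for illustration, not the lab's settings.

```python
import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.neural_network import MLPClassifier

# Hypothetical data: 200 samples, 5 features, binary label.
rng = np.random.default_rng(0)
X = rng.random((200, 5))
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

# A short run may not converge; silence that warning for this demo.
warnings.filterwarnings("ignore", category=ConvergenceWarning)

clf = MLPClassifier(hidden_layer_sizes=(8, 4), max_iter=50,
                    activation='relu', solver='adam', random_state=1)
clf.fit(X, y)

# One weight matrix per pair of consecutive layers.
shapes = [w.shape for w in clf.coefs_]
print(shapes)  # [(5, 8), (8, 4), (4, 1)]
```

The first matrix connects the 5 input features to the first hidden layer, and the last one connects the final hidden layer to a single output unit, which is how scikit-learn represents a binary classifier.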

In [10]:
#Define and train an MLPClassifier named MLPclassifier on the given data
MLPclassifier = MLPClassifier(hidden_layer_sizes=(50,200,50), max_iter=300, activation='relu', solver='adam', random_state=1)
MLPclassifier.fit(x_train, y_train)

MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
              beta_2=0.999, early_stopping=False, epsilon=1e-08,
              hidden_layer_sizes=(50, 200, 50), learning_rate='constant',
              learning_rate_init=0.001, max_fun=15000, max_iter=300,
              momentum=0.9, n_iter_no_change=10, nesterovs_momentum=True,
              power_t=0.5, random_state=1, shuffle=True, solver='adam',
              tol=0.0001, validation_fraction=0.1, verbose=False,
              warm_start=False)

#### Now we can evaluate the performance of our neural network for bank marketing prediction

In [11]:
#      TEST PREDICTION ACCURACY
test_pred = MLPclassifier.predict(x_test)
train_pred = MLPclassifier.predict(x_train)
print("Train Accuracy:", metrics.accuracy_score(y_train, train_pred))
print("\n\nTest Accuracy:", metrics.accuracy_score(y_test, test_pred))

Train Accuracy: 0.9841059184124877


Test Accuracy: 0.8866116189914479
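Accuracy alone can hide how well the minority class is predicted when the classes are imbalanced. Whether that applies here is an assumption, since the class frequencies are not reported above. The sketch below uses hand-written toy labels, not the lab's predictions, to show how `confusion_matrix` and `recall_score` from `sklearn.metrics` complement accuracy.

```python
from sklearn import metrics

# Hypothetical labels: 8 negatives, 2 positives; one positive is missed.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 8 + [0, 1]

acc = metrics.accuracy_score(y_true, y_pred)
cm = metrics.confusion_matrix(y_true, y_pred)  # rows: true class, cols: predicted
rec = metrics.recall_score(y_true, y_pred)     # fraction of positives found

print("Accuracy:", acc)
print("Confusion matrix:\n", cm)
print("Recall (positive class):", rec)
```

The same three calls can be applied to `y_test` and `test_pred` from the cell above to get a fuller picture of the classifier's behaviour.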


*Created by Nicholas Stepanov: https://github.com/renowator*