<b>Building a Basic Neural Network</b>

In this notebook, we build a basic artificial neural network (ANN) and train it on the Churn_Modelling data set to demonstrate a classification task.

We will use the Keras library to build the network, so make sure it is installed. Keras is a high-level wrapper around a backend framework such as TensorFlow or Theano, so whichever backend you ask Keras to use must be installed as well.

In [1]:
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

Let's load the data using pandas.

In [2]:
# Importing the dataset
dataset = pd.read_csv('Churn_Modelling.csv')
# Let's look at the first few rows of the data
dataset.head()

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0


Fields such as RowNumber, CustomerId and Surname have no bearing on the target variable, so we leave them out. We keep all the rows but select only the columns at index 3 (CreditScore) through index 12 (EstimatedSalary).

Our target variable is the column at index 13 (Exited).
We initialize the X and y variables to hold the independent variables and the dependent (target) variable, respectively.

In [3]:
X = dataset.iloc[:, 3:13].values
y = dataset.iloc[:, 13].values

Gender and country (Geography) are categorical fields. The network needs numeric inputs, so we encode them as numbers using the sklearn package. First, let's look at the raw values.

In [4]:
# Number of independent variables
print(len(X[0]))
# Let's look at the first ten entries
for i in range(10):
    print(X[i])

10
[619 'France' 'Female' 42 2 0.0 1 1 1 101348.88]
[608 'Spain' 'Female' 41 1 83807.86 1 0 1 112542.58]
[502 'France' 'Female' 42 8 159660.8 3 1 0 113931.57]
[699 'France' 'Female' 39 1 0.0 2 0 0 93826.63]
[850 'Spain' 'Female' 43 2 125510.82 1 1 1 79084.1]
[645 'Spain' 'Male' 44 8 113755.78 2 1 0 149756.71]
[822 'France' 'Male' 50 7 0.0 2 1 1 10062.8]
[376 'Germany' 'Female' 29 4 115046.74 4 1 0 119346.88]
[501 'France' 'Male' 44 4 142051.07 2 0 1 74940.5]
[684 'France' 'Male' 27 2 134603.88 1 1 1 71725.73]


In [5]:
# Encoding categorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
#create the encoder for the first column to be encoded
labelencoder_X_1 = LabelEncoder()
#encode the first column using this encoder
X[:, 1] = labelencoder_X_1.fit_transform(X[:, 1])
#create an encoder for second column and encode it
labelencoder_X_2 = LabelEncoder()
X[:, 2] = labelencoder_X_2.fit_transform(X[:, 2])
# Let's see the distinct categories in column 1 (country) and column 2 (gender)
print(set(X[:,1]),set(X[:,2]))

{0, 1, 2} {0, 1}


So, we have three categories for field 1 (country) and two categories for field 2 (gender).

Because the country column has more than two categories, the integer codes would suggest an ordering that does not exist: France (0) is not "less than" Germany (1), which in turn is not "less than" Spain (2). To avoid misleading the model, we apply one-hot encoding to the country field (field 1), which replaces it with one binary column per category.
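To make the two encodings concrete, here is a minimal NumPy sketch on a hypothetical four-row sample of the country column. The sample values and variable names are made up for illustration, and the code mimics the idea rather than the scikit-learn implementation. Label encoding assigns each category an integer code, and one-hot encoding turns each code into a row with a single 1.

```python
import numpy as np

# Hypothetical sample of the country column (not taken from the data set)
countries = np.array(["France", "Spain", "France", "Germany"])

# Label encoding: np.unique returns the sorted categories and an integer code per row
categories, codes = np.unique(countries, return_inverse=True)

# One-hot encoding: row i of the identity matrix is the one-hot vector for code i
one_hot = np.eye(len(categories), dtype=int)[codes]

print(categories)  # ['France' 'Germany' 'Spain']
print(codes)       # [0 2 0 1]
print(one_hot)
```

Each row of one_hot has exactly one 1, in the position of its category, so no artificial ordering between the countries is implied.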

In [6]:
onehotencoder = OneHotEncoder(categorical_features = [1])
X = onehotencoder.fit_transform(X).toarray()
# Number of independent variables after one-hot encoding
print(len(X[0]))
# Let's look at the first ten entries
for i in range(10):
    print(X[i])

12
[  1.00000000e+00   0.00000000e+00   0.00000000e+00   6.19000000e+02
   0.00000000e+00   4.20000000e+01   2.00000000e+00   0.00000000e+00
   1.00000000e+00   1.00000000e+00   1.00000000e+00   1.01348880e+05]
[  0.00000000e+00   0.00000000e+00   1.00000000e+00   6.08000000e+02
   0.00000000e+00   4.10000000e+01   1.00000000e+00   8.38078600e+04
   1.00000000e+00   0.00000000e+00   1.00000000e+00   1.12542580e+05]
[  1.00000000e+00   0.00000000e+00   0.00000000e+00   5.02000000e+02
   0.00000000e+00   4.20000000e+01   8.00000000e+00   1.59660800e+05
   3.00000000e+00   1.00000000e+00   0.00000000e+00   1.13931570e+05]
[  1.00000000e+00   0.00000000e+00   0.00000000e+00   6.99000000e+02
   0.00000000e+00   3.90000000e+01   1.00000000e+00   0.00000000e+00
   2.00000000e+00   0.00000000e+00   0.00000000e+00   9.38266300e+04]
[  0.00000000e+00   0.00000000e+00   1.00000000e+00   8.50000000e+02
   0.00000000e+00   4.30000000e+01   2.00000000e+00   1.25510820e+05
   1.00000000e+00   1.00000

<b> Dummy variable Trap </b>


We gained two more fields after one-hot encoding the country field. This is because the single country field is now represented by a one-hot vector (we need a three-element vector to represent three different values).

After label encoding and one-hot encoding, our data contains several dummy variables. When including dummy variables in a regression model, however, one should be careful of the dummy variable trap: a scenario in which the independent variables are multicollinear, meaning two or more variables are highly correlated. In simple terms, one variable can be predicted from the others. Here, the three country columns always sum to 1, so any one of them is fully determined by the other two.

So we remove one of the dummy variables to avoid falling into the dummy variable trap.
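The trap can be seen numerically. The sketch below assumes a hypothetical four-row one-hot matrix and a model with an intercept term (a column of ones); neither is part of the notebook's data. Because the three dummy columns add up to the intercept column, the full matrix loses a rank, and dropping one dummy column restores full column rank.

```python
import numpy as np

# Hypothetical one-hot encoded country column (France, Spain, France, Germany)
one_hot = np.array([[1, 0, 0],
                    [0, 0, 1],
                    [1, 0, 0],
                    [0, 1, 0]], dtype=float)
intercept = np.ones((one_hot.shape[0], 1))

# Every row of the dummy columns sums to 1, i.e. to the intercept column
print(one_hot.sum(axis=1))  # [1. 1. 1. 1.]

full = np.hstack([intercept, one_hot])            # 4 columns
reduced = np.hstack([intercept, one_hot[:, 1:]])  # first dummy dropped, 3 columns

print(full.shape[1], np.linalg.matrix_rank(full))        # 4 3
print(reduced.shape[1], np.linalg.matrix_rank(reduced))  # 3 3
```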

In [7]:
X = X[:, 1:]

In [8]:
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

<b> Feature Scaling </b>

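To prevent variables with large ranges (such as Balance or EstimatedSalary) from dominating variables with small ranges (such as Tenure), we perform feature scaling. The scaler is fitted on the training set only and then applied unchanged to the test set.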
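As a rough illustration of what standardization does, the following NumPy sketch rescales two hypothetical columns (credit score and salary; the numbers are invented). It uses the mean and standard deviation of the training rows only and reuses them for the test row, which mirrors the fit/transform split used with StandardScaler.

```python
import numpy as np

# Hypothetical training and test rows: [credit score, salary]
train = np.array([[600.0,  50000.0],
                  [700.0, 100000.0],
                  [800.0, 150000.0]])
test = np.array([[650.0, 75000.0]])

# "fit": the statistics come from the training set only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0)

# "transform": the same statistics are applied to both sets
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled.mean(axis=0))  # approximately [0. 0.]
print(train_scaled.std(axis=0))   # approximately [1. 1.]
print(test_scaled)
```

After scaling, both columns are on a comparable scale even though the raw salary values are hundreds of times larger than the credit scores.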

In [9]:
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

<b> Building Neural Network</b>

Now we are going to create our first basic NN model.

In [10]:
# Importing the Keras libraries and packages
import keras
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout

Using TensorFlow backend.


In [11]:
# Initialising the ANN
classifier = Sequential()

<b> Adding a Layer to the Network</b>

Now we add the first layer to our network. The input_dim parameter is the number of input features (11 in our case, because we have 11 input variables). The units parameter is the number of nodes in the layer being added, which is also the number of values it outputs. Because this is the first layer we add, it is the first hidden layer: it takes the 11 inputs and has 6 nodes, so the next layer will receive 6 inputs.
The kernel_initializer is set to uniform so that the weights of this layer are initialized from a uniform distribution. The activation function for this layer is the Rectified Linear Unit (ReLU).

In [12]:
# Adding the input layer and the first hidden layer
classifier.add(Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu', input_dim = 11))
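What such a layer computes can be sketched in a few lines of NumPy. This is an illustration under assumptions, not the Keras implementation: the weights are drawn from an arbitrary small uniform range, the bias starts at zero, and the input batch is random.

```python
import numpy as np

rng = np.random.default_rng(0)

n_inputs, n_units = 11, 6

# Assumed initialization: small uniform weights, zero bias
W = rng.uniform(-0.05, 0.05, size=(n_inputs, n_units))
b = np.zeros(n_units)

def dense_relu(x, W, b):
    """Affine transform followed by the ReLU activation max(0, z)."""
    return np.maximum(0.0, x @ W + b)

x = rng.normal(size=(4, n_inputs))  # a made-up batch of 4 scaled observations
hidden = dense_relu(x, W, b)

print(hidden.shape)     # (4, 6)
print(W.size + b.size)  # 72 trainable parameters in this layer
```

Each of the 4 observations is mapped from 11 input values to 6 non-negative hidden values, which is what the next layer receives.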

  


In [13]:
# Adding a dropout layer
# rate is the fraction of neurons to be dropped out. Here we start with 10%; we can increase it in steps of 10% if the
# overfitting is not resolved. We need to make sure that rate is not too high (> 0.5), otherwise it will lead to underfitting.
classifier.add(Dropout(rate = 0.1))



<b> Adding another hidden layer</b>

Now let's add another layer to our network. The parameter "units" specifies the number of nodes in this layer. Because a layer has already been added before this one, Keras infers the number of inputs automatically.

In [14]:
# Adding the second hidden layer
classifier.add(Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu'))
# Adding a dropout layer
classifier.add(Dropout(rate = 0.1))



<b> Adding an output layer</b>

Now we add an output layer to our network. The output layer produces a single value, so we set units=1. The weights of this layer are also initialized uniformly. To obtain the probability that the customer leaves (a value between 0 and 1), we use the sigmoid function as the activation function of this layer.

As a note, if our classifier needed to classify the data into three classes, we would set units=3 (one node per class) and set the activation function to softmax.
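The difference between the two output activations can be illustrated with NumPy. The pre-activation values below are arbitrary numbers chosen for the example: the sigmoid maps one value to a single probability, while softmax maps three values to three probabilities that sum to 1.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Subtract the maximum for numerical stability
    e = np.exp(z - np.max(z))
    return e / e.sum()

# Binary case: one output node
p_exit = sigmoid(0.0)
print(p_exit)  # 0.5

# Three-class case: three output nodes, one probability per class
probs = softmax(np.array([2.0, 1.0, 0.1]))
print(probs, probs.sum())
```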

In [15]:
# Adding the output layer
classifier.add(Dense(units = 1, kernel_initializer = 'uniform', activation = 'sigmoid'))

<b> Compiling the network</b>

We need to compile the network before it can be trained. We provide additional parameters when we compile the network.

The optimizer parameter selects the algorithm used to find a good set of weights for the network. "adam" is a very popular stochastic gradient-based optimization algorithm for updating the weights.


Another parameter is the loss function. We select binary cross-entropy as our loss function because we have two possible outcomes. A sigmoid output trained with stochastic gradient descent behaves like a logistic regression model, whose loss is not the sum of squared errors but the logarithmic loss, so we pair the sigmoid output with a logarithmic loss function.

As a note, if we had more than two classes, we would select categorical_crossentropy as our loss function.
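For reference, the logarithmic loss can be written out directly. The function below is a plain NumPy sketch of the binary cross-entropy formula; the function name, the clipping constant and the sample values are assumptions made for this example, not the Keras code.

```python
import numpy as np

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all observations."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip so that log(0) never occurs
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0 - eps)
    losses = -(y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob))
    return float(losses.mean())

y_true = [1, 0, 1, 0]
good = binary_crossentropy(y_true, [0.9, 0.1, 0.8, 0.2])  # confident and right
bad = binary_crossentropy(y_true, [0.1, 0.9, 0.2, 0.8])   # confident and wrong

print(good, bad)  # the wrong predictions give a much larger loss
```

The loss grows quickly as the predicted probability moves away from the true label, which is what pushes the sigmoid output towards the correct class during training.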


The metrics used to measure the performance of the model are specified by the "metrics" parameter.

In [16]:
# Compiling the ANN
classifier.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])

<b> Fit the network to the data</b>

We use a batch size of 10 and 100 epochs for this network.

The batch size indicates how many observations are processed before the weights are updated.

The epochs parameter indicates how many passes over the whole training set we run.
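A quick calculation shows what these two settings imply. The training-set size of 8000 is an inference: the confusion matrix later in this notebook covers 2000 test rows, and the split used test_size = 0.2.

```python
import math

n_train = 8000   # assumed: 80% of 10000 rows
batch_size = 10
epochs = 100

# One weight update per batch
updates_per_epoch = math.ceil(n_train / batch_size)
total_updates = updates_per_epoch * epochs

print(updates_per_epoch)  # 800
print(total_updates)      # 80000
```

A smaller batch size means more frequent (and noisier) weight updates per epoch, while a larger one means fewer updates per pass over the data.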

In [17]:
# Fitting the ANN to the Training set
classifier.fit(X_train, y_train, batch_size = 10, epochs = 100)

Epoch 1/100
...
Epoch 100/100


<keras.callbacks.History at 0xa190828>

<b> Predictions </b>

Now we are going to predict whether a customer is going to leave the bank or not. For this, we use the predict() method of the classifier object.

In [18]:
# Predicting the Test set results
y_pred = classifier.predict(X_test)
# We convert the predicted probabilities to True or False values by comparing them
# with a threshold of 0.5. If the probability is higher, the prediction is True; otherwise it is False.
y_pred = (y_pred > 0.5)

In [19]:
# Let's see the values of the predictions
print(y_pred)

[[False]
 [False]
 [False]
 ..., 
 [False]
 [False]
 [False]]


<b> Making the confusion matrix</b>

We can use the sklearn package to create the confusion matrix.

In [20]:
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

The confusion matrix is often useful for measuring the performance of the model. We can use it to find the true positives, false positives, true negatives and false negatives. These values can be used to compute other metrics such as accuracy, precision, recall and F-score.
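To see how the four counts are laid out, the sketch below builds a 2x2 confusion matrix by hand from made-up labels, using the same convention as scikit-learn: rows are the true classes and columns are the predicted classes.

```python
import numpy as np

# Made-up true labels and predictions for eight observations
y_true = np.array([0, 0, 0, 1, 1, 0, 1, 0])
y_hat = np.array([0, 1, 0, 1, 0, 0, 1, 0])

# toy_cm[t, p] counts observations with true class t and predicted class p
toy_cm = np.zeros((2, 2), dtype=int)
for t, p in zip(y_true, y_hat):
    toy_cm[t, p] += 1

# Flattening row by row gives the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = toy_cm.ravel()
print(toy_cm)          # [[4 1]
                       #  [1 2]]
print(tn, fp, fn, tp)  # 4 1 1 2
```

With class 1 as the positive class, the true positives therefore sit in the bottom-right cell and the true negatives in the top-left cell.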

In [21]:
print(cm)

[[1556   39]
 [ 277  128]]


In [22]:
accuracy = (cm[0][0] + cm[1][1])/(np.sum(cm[0]) + np.sum(cm[1]))
print("accuracy is:",accuracy)

accuracy is: 0.842


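Let's compute the other metrics for the positive class (Exited = 1). In the scikit-learn confusion matrix, rows are the true classes and columns are the predicted classes, so tn = cm[0][0], fp = cm[0][1], fn = cm[1][0] and tp = cm[1][1].

precision = tp/(tp+fp),
recall = tp/(tp+fn),
and
f-score = 2*precision*recall/(precision + recall)

In [23]:
# Unpack the counts row by row: true negatives, false positives, false negatives, true positives
tn, fp, fn, tp = cm.ravel()
prec = tp/(tp + fp)
rec = tp/(tp + fn)
fscore = 2*prec*rec/(prec + rec)

print("precision is:",prec)
print("recall is:",rec)
print("fscore is:",fscore)

precision is: 0.766467065868
recall is: 0.316049382716
fscore is: 0.447552447552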


<b> Homework </b>

Use our ANN model to predict whether the customer with the following information will leave the bank:

Geography: France

Credit Score: 600

Gender: Male

Age: 40 years old

Tenure: 3 years

Balance: $60000

Number of Products: 2

Does this customer have a credit card? Yes

Is this customer an Active Member: Yes

Estimated Salary: $50000

So, should we say goodbye to that customer?

<b> Transforming the data to the required format </b>

Before the given data can be used in our network, we need to transform it into the format the network expects. To recap, we applied label encoding and one-hot encoding to the categorical variables and then scaled all the features, so we need to transform the new data in the same way.
Let's see how we transform this data and feed it to our network to make the prediction:

In [24]:
# After one-hot encoding and dropping the first dummy column, the geography France is represented as 0, 0.
# The gender Male is encoded as 1. Having a credit card is represented by 1,
# and being an active member is denoted by 1. The remaining values are passed in without their units.
# The values must be in the same column order as X, and the same fitted scaler (sc) is applied to them.
# We can look at one of the earlier entries to make sure we keep the proper order.

# We pass a single row as a 2-D array by wrapping the values in double brackets [[]]
new_prediction = classifier.predict(sc.transform(np.array([[0, 0, 600, 1, 40, 3, 60000, 2, 1, 1, 50000]])))
new_prediction = (new_prediction > 0.5)
print(new_prediction)

[[False]]




So, it looks like the given customer is more likely to stay.

<b> K-fold cross validation</b>

Let's use cross validation to make sure that we are getting consistent and reliable results. We use KerasClassifier, a wrapper that exposes a Keras model as a scikit-learn estimator, so that we can measure its performance using cross validation.
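The idea behind 10-fold cross validation can be sketched with NumPy alone. The snippet assumes 8000 training rows (an inference from the 80/20 split) and splits shuffled row indices into ten folds; each fold serves once as the validation set while the other nine are used for training. It only illustrates the splitting, not the model fitting, and scikit-learn's own splitting strategy may differ in its details.

```python
import numpy as np

n_samples, k = 8000, 10  # assumed training-set size, number of folds
rng = np.random.default_rng(0)
indices = rng.permutation(n_samples)
folds = np.array_split(indices, k)

for i, val_idx in enumerate(folds):
    # All folds except fold i form the training part of this round
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    print(i, len(train_idx), len(val_idx))  # 7200 training rows, 800 validation rows
```

Because every row is used for validation exactly once, the ten accuracies give a more reliable picture than a single train/test split.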

In [None]:
# Let's import the required packages
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import cross_val_score

# Let's create a function that builds the same basic classifier as before
def build_classifier():
    classifier = Sequential()

    # Adding the input layer and the first hidden layer
    classifier.add(Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu', input_dim = 11))

    # Adding the second hidden layer
    classifier.add(Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu'))

    # Adding the output layer
    classifier.add(Dense(units = 1, kernel_initializer = 'uniform', activation = 'sigmoid'))

    # Compiling the ANN
    classifier.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])
    return classifier

# KerasClassifier takes a function that builds the basic classifier
classifier = KerasClassifier(build_fn = build_classifier, batch_size = 10, epochs = 100)
# cv denotes the number of folds used in cross validation; n_jobs = -1 means use all the available CPUs for parallel execution
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10, n_jobs = -1)
mean = accuracies.mean()
# std() returns the standard deviation of the fold accuracies
std = accuracies.std()
print(mean)
print(std)

<b> Dropout regularization</b>

Dropout is a useful technique that improves performance by reducing overfitting. Overfitting shows up as high variance in the accuracies across the cross validation folds, or as a model that performs well on the training data but not on the test data.

Dropout means that at every training iteration some nodes are randomly disabled, which prevents the nodes from becoming too dependent on one another. With this technique, the network learns several independent correlations, because each iteration uses a different configuration of neurons.
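A minimal NumPy sketch of the mechanism, assuming the common "inverted dropout" formulation in which the surviving activations are scaled up by 1/(1 - rate) during training so that their expected value is unchanged. The function name and sample values are illustrative.

```python
import numpy as np

def dropout(activations, rate, rng):
    """Zero each activation with probability `rate` and rescale the rest."""
    keep_mask = rng.random(activations.shape) >= rate
    return activations * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)
activations = np.ones((2, 6))  # made-up outputs of a 6-node hidden layer

dropped = dropout(activations, rate=0.1, rng=rng)
print(dropped)  # entries are either 0 or 1/0.9

# With many activations, roughly 10% end up disabled
many = dropout(np.ones(100_000), rate=0.1, rng=rng)
print((many == 0).mean())
```

A fresh random mask is drawn on every call, so each training iteration sees a different subset of active nodes.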

Keras provides an easy way to add dropout regularization. After adding a layer, we can just add the line classifier.add(Dropout(rate = 0.1)) to drop out a fraction rate of the neurons in that layer.


<b> Parameter tuning </b>

We can tune different hyperparameters (e.g., batch size, number of epochs, optimizer, and so on). Hyperparameters are the values that we fix before training and that were kept fixed in our model. Other values, such as the weights of the neurons in each layer, are learned during training, so they are simply called parameters.

Here we try out a set of hyperparameter values and see which combination gives the best result.

Scikit-learn provides a GridSearchCV class which has the required methods for the parameter tuning process.
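Before starting a grid search it helps to count how many models it will train. The sketch below enumerates a grid with the same values as the one used in the next cell (the dictionary keys are illustrative) and multiplies by the 10 cross validation folds.

```python
from itertools import product

grid = {
    "batch_size": [25, 30, 35, 40, 45],
    "epochs": [100, 200, 300],
    "optimizer": ["adam", "rmsprop"],
}

# Every combination of one value per hyperparameter
combinations = [dict(zip(grid, values)) for values in product(*grid.values())]
n_folds = 10

print(len(combinations))            # 30
print(len(combinations) * n_folds)  # 300 cross validation fits
print(combinations[0])              # {'batch_size': 25, 'epochs': 100, 'optimizer': 'adam'}
```

With up to 300 epochs per fit, this search is expensive, so it may be worth starting with a smaller grid.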

In [None]:
# We build a function that prepares the classifier for us.
# It is a copy of build_classifier from above, but it takes the optimizer as a
# parameter so that it can be used for parameter tuning
from sklearn.model_selection import GridSearchCV
def build_classifier_GridSearch(optimizer):
    classifier = Sequential()

    # Adding the input layer and the first hidden layer
    classifier.add(Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu', input_dim = 11))

    # Adding the second hidden layer
    classifier.add(Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu'))

    # Adding the output layer
    classifier.add(Dense(units = 1, kernel_initializer = 'uniform', activation = 'sigmoid'))

    # Compiling the ANN
    classifier.compile(optimizer = optimizer, loss = 'binary_crossentropy', metrics = ['accuracy'])
    return classifier

# KerasClassifier takes a function that builds the basic classifier
classifier = KerasClassifier(build_fn = build_classifier_GridSearch)
parameters = {'batch_size':[25, 30, 35, 40, 45],
             'epochs':[100, 200, 300],
             'optimizer':['adam','rmsprop']}
grid_search = GridSearchCV(estimator = classifier, 
                           param_grid = parameters,
                          scoring = 'accuracy',
                          cv = 10)
grid_search = grid_search.fit(X_train, y_train)
best_parameters = grid_search.best_params_
best_accuracy = grid_search.best_score_

print("best params are:",best_parameters)
print("best accuracy is:",best_accuracy)
