I will be using an ANN (a Multi-Layer Perceptron, to be precise) on the Churn Modelling dataset to predict whether a customer leaves the bank. I try to reach a conclusion by tuning the main hyperparameters.

**Objective:**

Rick works as Head of Southern TD Canada Bank. A large number of customers left the bank because of poor service. Rick had a hard time resolving internal conflicts, streamlining processes and winning customers back. He now wants to make sure everything runs smoothly and to work on customer retention. For this, he wants us to build an application that predicts which customers are likely to leave the bank soon, so that he can act to retain them. We will use machine learning to help Rick make that prediction.

**Importing Libraries and Dataset**

In [None]:
import pandas as pd
import numpy as np

df = pd.read_csv("../input/deep-learning-az-ann/Churn_Modelling.csv")

**Data Visualization:**

Let us look at the first rows of the dataset and at its column data types.

In [None]:
# Visualizing the Dataset
df.head()

In [None]:
# Data columns type
df.info()

We drop the “RowNumber”, “CustomerId” and “Surname” columns: “RowNumber” and “CustomerId” are just identifiers, and “Surname” logically has no impact on leaving the bank.



In [None]:
df=df.drop(['RowNumber','CustomerId','Surname'],axis='columns')
df.describe()

**Data Pre-Processing**

Data pre-processing follows a sequence of steps, starting with the following.

**1. Check out the missing values**

In [None]:
# 1. Check out the missing values 
df.isnull().sum()

# As the output below shows, there are no missing values,
# so no column values need to be imputed or corrected

**2. Feature Dropping**

Highly correlated features are close to linearly dependent and therefore carry almost the same information about the dependent variable. So, when two features are highly correlated, we can drop one of them. I computed the correlation between all numeric features and found no highly correlated pair.
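
To make the "drop one of two highly correlated features" rule concrete, here is a minimal plain-Python sketch. The helper names `pearson` and `highly_correlated_pairs`, the threshold of 0.8 and the toy columns are all assumptions for illustration; the notebook itself only inspects the heatmap.

```python
import math
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def highly_correlated_pairs(columns, threshold=0.8):
    """Return (name_a, name_b, r) for every column pair with |r| above the threshold."""
    pairs = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if abs(r) > threshold:
            pairs.append((name_a, name_b, r))
    return pairs

# Toy columns (made-up values, not taken from the churn dataset)
toy = {
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],  # exactly 2 * a, so perfectly correlated with a
    "c": [5, 1, 4, 2, 3],   # only weakly related to a and b
}
pairs = highly_correlated_pairs(toy)
print(pairs)  # only the ("a", "b") pair is flagged
```

Any pair this returns would be a candidate for dropping one of its two columns; an empty list corresponds to the situation described above, where nothing needs to be dropped.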

In [None]:
# Correlation heatmap
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(10,10))
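# correlate numeric columns only; Geography and Gender are still strings at this point
sns.heatmap(df.select_dtypes(include='number').corr(), cmap='BuGn',annot=True)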

Before we proceed with building our model, it is recommended to split the data into TRAIN, DEV and TEST sets first and only then apply the remaining pre-processing steps to each set.

The reason is this: suppose we have to normalize or standardize our data. If we standardize the complete dataset and then split it, the **mean** and **standard deviation** used for scaling are computed partly from the DEV and TEST rows. The training data then already carries information about the held-out data, so the evaluation results look better than they really are.

So here we divide the dataset into three parts. We validate against the DEV set after each epoch to see whether the model overfits or underfits. Based on the results, we tune the model further and then run it against the TEST data. This helps us arrive at a good model.
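
As a rough check on the resulting proportions, the sketch below mimics an unshuffled two-stage split (hold out 30%, then hold out 20% of the remainder) in plain Python. The helper `ordered_split` and the row count of 10,000 are assumptions for illustration, not part of the notebook.

```python
def ordered_split(rows, holdout_percent):
    """Split without shuffling: the last holdout_percent% of rows are held out."""
    n_holdout = -(-len(rows) * holdout_percent // 100)  # ceiling division
    cut = len(rows) - n_holdout
    return rows[:cut], rows[cut:]

rows = list(range(10_000))             # assumed dataset size
rest, dev = ordered_split(rows, 30)    # first stage: 70% / 30%
train, test = ordered_split(rest, 20)  # second stage: 80% / 20% of the remaining 70%

print(len(train), len(test), len(dev))  # 5600 1400 3000
print(train[0], test[0], dev[0])        # 0 5600 7000
```

Under these assumptions the effective proportions are 56% TRAIN, 14% TEST and 30% DEV, and because nothing is shuffled each set is a contiguous block of rows in the original order.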

In [None]:
#splitting data into Train, DEV, test
from sklearn.model_selection import train_test_split
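y=df.Exited # keep the target in its own Series before dropping it from the features
X=df.drop(['Exited'],axis='columns')
# first split: the last 30% of the rows become the DEV set (shuffle=False keeps the original row order)
X_train, X_Dev, y_train, y_Dev = train_test_split(X,y,test_size=0.3,random_state=0,shuffle=False)
# second split: the last 20% of the remaining rows become the TEST set
X_train, X_test, y_train, y_test = train_test_split(X_train,y_train,test_size=0.2,random_state=0,shuffle=False)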


Let us identify the binary, numerical and categorical columns in the TRAIN, DEV and TEST datasets and split each dataset into smaller data frames holding those columns. This is done only to make the pre-processing easier to organize.

Let us start by dividing the TRAIN dataset.

In [None]:
#[Train] divide train data into categories , numerical and binary

binary_columns=["HasCrCard","IsActiveMember"]
binary_df=pd.DataFrame(X_train[binary_columns])

numerical_columns =["CreditScore","Age","Tenure","Balance","NumOfProducts","EstimatedSalary"]
numerical_df=pd.DataFrame(X_train[numerical_columns])

category_columns=['Geography','Gender']
category_df=pd.DataFrame(X_train[category_columns])


**3. Look for categorical values**

Here we have two categorical columns, “Geography” and “Gender”. Machine learning models deal only with numbers, so let’s convert these strings into numeric values. We can use either of the techniques below:
1. Label Encoding
2. One-Hot Encoding

The limitation of label encoding is that the encoded integers can mislead the model into treating the categories as ordered. In our case neither column has a natural order, so we go for “One-Hot Encoding”.
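
The difference between the two encodings can be seen in a small plain-Python sketch. The helpers `label_encode` and `one_hot_encode` and the example values are assumptions for illustration; the notebook itself uses `pd.get_dummies`.

```python
def label_encode(values):
    """Map each category to an integer code (categories sorted alphabetically)."""
    categories = sorted(set(values))
    codes = {category: i for i, category in enumerate(categories)}
    return [codes[v] for v in values], codes

def one_hot_encode(values):
    """Map each value to a 0/1 indicator vector with one position per category."""
    categories = sorted(set(values))
    rows = [[1 if v == category else 0 for category in categories] for v in values]
    return rows, categories

geography = ["France", "Spain", "Germany", "France"]  # example values

labels, codes = label_encode(geography)
print(codes)   # {'France': 0, 'Germany': 1, 'Spain': 2}
print(labels)  # [0, 2, 1, 0]

rows, categories = one_hot_encode(geography)
print(categories)  # ['France', 'Germany', 'Spain']
print(rows)        # [[1, 0, 0], [0, 0, 1], [0, 1, 0], [1, 0, 0]]
```

With label encoding, Spain (2) looks "larger" than France (0) even though the countries have no order. The one-hot rows avoid that by giving each category its own 0/1 column.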

In [None]:
#[TRAIN] Encode Categorical Data

category_df['Geography'] = category_df['Geography'].astype('category')
category_df['Gender'] = category_df['Gender'].astype('category')
# dtype=int keeps the dummy columns numeric (newer pandas versions default to bool)
category_df_Final = pd.get_dummies(category_df, dtype=int)

**4. Feature Scaling**

A few columns in our dataset have a very different range from the others. For example, Balance and EstimatedSalary are several orders of magnitude larger than Tenure or NumOfProducts.

Since the columns are on different scales, we need to bring them to a common scale. Two techniques can help with scaling:
1. Normalization: Data normalization is the process of rescaling one or more attributes to the range of 0 to 1. This means that the largest value for each attribute is 1 and the smallest value is 0.
2. Standardization: Data standardization is the process of rescaling one or more attributes so that they have a mean value of 0 and a standard deviation of 1.
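
The two definitions can be written out directly. This is a minimal plain-Python sketch; the function names and the example ages are assumptions for illustration, and both functions assume the column is not constant (otherwise they divide by zero).

```python
import statistics

def min_max_normalize(values):
    """Rescale to [0, 1]: (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Rescale to mean 0 and standard deviation 1: (x - mean) / std."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / std for v in values]

ages = [20, 30, 40, 50, 60]  # made-up example column

normalized = min_max_normalize(ages)
standardized = standardize(ages)
print(normalized)                       # [0.0, 0.25, 0.5, 0.75, 1.0]
print([round(z, 3) for z in standardized])
```

Normalization pins the smallest and largest values to 0 and 1, so a single outlier compresses everything else. Standardization has no fixed range but centers each column at 0.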

Generally, standardization is preferred, and that is what we use here. However, we will not standardize every column. At this point of data pre-processing, we have categorical, binary and numerical data. We standardize only the numerical columns and leave the binary columns alone (one-hot encoding also produces binary columns).
NOTE: to avoid data leaks, the scaling statistics are learned from the TRAIN data only, never from the DEV or TEST data.
So, we first calculate the mean and standard deviation of each column of the TRAIN data and then apply the standardization formula with those values to the respective columns of the DEV and TEST data.
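
The leak-free pattern (learn the statistics on TRAIN, reuse them everywhere else) can be sketched in plain Python. The helpers `fit_scaler` and `apply_scaler` and the balance values are assumptions for illustration.

```python
import statistics

def fit_scaler(train_values):
    """Learn the scaling statistics from the training column only."""
    return statistics.mean(train_values), statistics.pstdev(train_values)

def apply_scaler(values, mean, std):
    """Standardize any split with statistics that were learned elsewhere."""
    return [(v - mean) / std for v in values]

train_balance = [0.0, 50_000.0, 100_000.0, 150_000.0]  # made-up values
dev_balance = [25_000.0, 200_000.0]                    # made-up values

mean, std = fit_scaler(train_balance)        # statistics come from TRAIN only
train_scaled = apply_scaler(train_balance, mean, std)
dev_scaled = apply_scaler(dev_balance, mean, std)

print(mean, round(std, 1))
print([round(z, 3) for z in dev_scaled])
```

The DEV values are not guaranteed to have mean 0 or to stay within the range seen in training, and that is expected: the DEV set must look to the model like genuinely unseen data. One detail to keep in mind when mixing libraries: pandas' `std()` defaults to the sample standard deviation (`ddof=1`), while scikit-learn's `StandardScaler` uses the population one (`ddof=0`).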

In [None]:
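#[TRAIN] feature scaling
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# keep the TRAIN statistics so that DEV and TEST can be scaled with them later;
# ddof=0 matches the population standard deviation that StandardScaler uses
numerical_df_train_mean=numerical_df.mean()
numerical_df_train_std=numerical_df.std(axis=0, ddof=0)
# keep the original row index so the later concat lines up with the other columns
numerical_df_scale =pd.DataFrame(scaler.fit_transform(numerical_df),columns=numerical_columns,index=numerical_df.index)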

In [None]:
# [TRAIN] Concatenate Columns
X_train = pd.concat([numerical_df_scale, category_df_Final,binary_df], axis=1)

Repeat the same for the DEV and TEST data, scaling them with the mean and standard deviation computed on the TRAIN data.

In [None]:
#[DEV] divide DEV data into binary, numerical and categorical columns
binary_columns=["HasCrCard","IsActiveMember"]
binary_df=pd.DataFrame(X_Dev[binary_columns])

numerical_columns =["CreditScore","Age","Tenure","Balance","NumOfProducts","EstimatedSalary"]
numerical_df=pd.DataFrame(X_Dev[numerical_columns])

category_columns=['Geography','Gender']
category_df=pd.DataFrame(X_Dev[category_columns])

# [DEV] Encode Categorical Data
category_df['Geography'] = category_df['Geography'].astype('category')
category_df['Gender'] = category_df['Gender'].astype('category')
# dtype=int keeps the dummy columns numeric (newer pandas versions default to bool)
category_df_Final = pd.get_dummies(category_df, dtype=int)

# [DEV] feature scaling with the TRAIN mean and standard deviation (nothing is fitted on DEV);
# the subtraction and division align on column names, so all six columns are scaled at once
numerical_df=(numerical_df-numerical_df_train_mean).div(numerical_df_train_std)

#[DEV] Concatenate Columns
X_Dev = pd.concat([numerical_df, category_df_Final,binary_df], axis=1)

In [None]:
# [TEST] divide TEST data into binary, numerical and categorical columns
binary_columns=["HasCrCard","IsActiveMember"]
binary_df=pd.DataFrame(X_test[binary_columns])

numerical_columns =["CreditScore","Age","Tenure","Balance","NumOfProducts","EstimatedSalary"]
numerical_df=pd.DataFrame(X_test[numerical_columns])

category_columns=['Geography','Gender']
category_df=pd.DataFrame(X_test[category_columns])

# [TEST] Encode Categorical Data
category_df['Geography'] = category_df['Geography'].astype('category')
category_df['Gender'] = category_df['Gender'].astype('category')
# dtype=int keeps the dummy columns numeric (newer pandas versions default to bool)
category_df_Final = pd.get_dummies(category_df, dtype=int)

# [TEST] feature scaling with the TRAIN mean and standard deviation (nothing is fitted on TEST);
# the subtraction and division align on column names, so all six columns are scaled at once
numerical_df=(numerical_df-numerical_df_train_mean).div(numerical_df_train_std)

# [TEST] Concatenate Columns
X_test = pd.concat([numerical_df, category_df_Final,binary_df], axis=1)

In [None]:
# assigning None to variables that are no longer needed, so their memory can be released
df=None
X=None
y=None
binary_columns=None
binary_df=None
category_df=None
category_columns=None
category_df_Final=None
numerical_df=None
numerical_columns=None
numerical_df_train_mean=None
scaler=None
numerical_df_train_std=None
numerical_df_scale=None
null_columns=None

**Building Model**

**Training 1:**

Hidden Layer(s) : 3

Neurons per Hidden Layer(s): 3 ,3 , 2

Activation function for hidden layer (s) : tanh

Optimizer :Adam

Learning Rate : 0.01

Epochs : 100

Batch size : 32

Early stopping : True

Patience / Tolerance : 2

Initial Weights : He uniform, i.e. a uniform distribution within [-limit, limit], where limit is sqrt(6 / fan_in) and fan_in is the number of input units

Initial Bias : Ones
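
To give a feel for the weight initialization listed above, the sketch below computes the limit sqrt(6 / fan_in) and draws weights uniformly from [-limit, limit] in plain Python. The helper names and the seed are assumptions for illustration; `fan_in=13` corresponds to the 13 input features passed to the first layer.

```python
import math
import random

def he_uniform_limit(fan_in):
    """Bound of the uniform distribution: sqrt(6 / fan_in)."""
    return math.sqrt(6.0 / fan_in)

def he_uniform_weights(fan_in, units, seed=0):
    """Draw a fan_in x units weight matrix uniformly from [-limit, limit]."""
    rng = random.Random(seed)  # hypothetical fixed seed, only for reproducibility
    limit = he_uniform_limit(fan_in)
    return [[rng.uniform(-limit, limit) for _ in range(units)] for _ in range(fan_in)]

limit = he_uniform_limit(13)  # first hidden layer: 13 inputs
first_layer = he_uniform_weights(fan_in=13, units=3)

print(round(limit, 2))  # about 0.68
print(len(first_layer), len(first_layer[0]))  # 13 3
```

The more inputs a layer has, the narrower the interval becomes, which keeps the scale of each neuron's summed input roughly comparable across layers of different widths.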


In [None]:
#defining and compiling model
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import cross_val_score
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam
from keras.optimizers import SGD

def deep_model():
    classifier = Sequential()
    classifier.add(Dense(units=3, kernel_initializer='he_uniform',
                bias_initializer='ones', activation='tanh', input_dim=13))
    classifier.add(Dense(units=3, kernel_initializer='he_uniform',
                bias_initializer='ones', activation='tanh'))
    classifier.add(Dense(units=2, kernel_initializer='he_uniform',
                bias_initializer='ones', activation='tanh'))
    #classifier.add(Dense(units=3, kernel_initializer='he_uniform',
                #bias_initializer='ones', activation='tanh'))
    #classifier.add(Dense(units=2, kernel_initializer='he_uniform',
                #bias_initializer='ones', activation='relu'))
    classifier.add(Dense(units=1,  kernel_initializer='he_uniform',
                bias_initializer='ones', activation='sigmoid'))
    classifier.compile(optimizer=Adam(learning_rate=0.01, amsgrad=False), 
    #classifier.compile(optimizer=SGD(learning_rate=0.001, momentum=0.8, nesterov=False), 
    loss='binary_crossentropy', 
    metrics=['accuracy','mae'])
    return classifier

In [None]:
# fitting the data 
from keras.models import load_model
from keras.callbacks import EarlyStopping, ModelCheckpoint
classifier = deep_model()
# Set callback functions to early stop training and save the best model so far
callbacks = [EarlyStopping(monitor='val_loss', patience=2),
             ModelCheckpoint(filepath='best_model.h5', monitor='val_loss', save_best_only=True)]
output=classifier.fit(X_train, y_train, batch_size=32,callbacks=callbacks ,epochs=100,validation_data=(X_Dev,y_Dev),shuffle=False)
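
The effect of `patience=2` can be illustrated with a simplified plain-Python sketch of the stopping rule: training stops once the validation loss has failed to improve on its best value for two consecutive epochs. The function `early_stopping_epoch` and the loss values are assumptions for illustration, not Keras internals.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (epoch at which training stops, best validation loss seen so far)."""
    best = float("inf")
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best
    return len(val_losses), best  # never triggered: all epochs run

# Made-up validation losses for six epochs
val_losses = [0.60, 0.50, 0.45, 0.47, 0.46, 0.44]
stopped_at, best_loss = early_stopping_epoch(val_losses, patience=2)
print(stopped_at, best_loss)  # 5 0.45
```

In this made-up run, training stops after epoch 5 and never reaches the better loss at epoch 6, which is the trade-off of a small patience value. It also shows why saving the best model separately is useful: the weights at the stopping epoch are not the ones from the best epoch.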


In [None]:
#plotting
print(output.history.keys())
import matplotlib.pyplot as plt
# summarize history for accuracy
plt.plot(output.history['accuracy'])
plt.plot(output.history['val_accuracy'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'Validation'], loc='upper left')
plt.show()
# summarize history for loss
plt.plot(output.history['loss'])
plt.plot(output.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'Validation'], loc='upper left')
plt.show()

In [None]:
#Calculating Errors
from sklearn.metrics import confusion_matrix
from sklearn.metrics import mean_absolute_error


#Confusion Matrix Accuracy
y_pred = classifier.predict(X_test)
y_pred = (y_pred > 0.5).astype(int).ravel()
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
print(cm)
# scikit-learn layout: rows are true labels, columns are predicted labels
tn, fp, fn, tp = cm.ravel()
accuracy = (tn+tp)/(tn+fp+fn+tp)
print("Confusion Matrix Accuracy: "+ str(accuracy*100)+"%")

#F1 score for the positive class (Exited = 1)
recall=tp/(tp+fn) if (tp+fn) > 0 else 0.0
precision=tp/(tp+fp) if (tp+fp) > 0 else 0.0
F1=(2*recall*precision)/(precision+recall) if (precision+recall) > 0 else 0.0
print("F1 Score:"+str(F1))

#MAE
mae=mean_absolute_error(y_test, y_pred)
print("MAE:"+str(mae))
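
As a worked example of reading these metrics off a confusion matrix, here is a minimal plain-Python sketch. It assumes scikit-learn's layout (rows are true labels, columns are predictions) and treats "Exited = 1" as the positive class; the function name and the counts are made up for illustration.

```python
def precision_recall_f1(cm):
    """cm is [[tn, fp], [fn, tp]]: rows are true labels, columns are predictions."""
    (tn, fp), (fn, tp) = cm
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

cm = [[90, 10],   # 90 stayers predicted correctly, 10 wrongly flagged as churners
      [30, 20]]   # 30 churners missed, 20 churners caught (made-up counts)

precision, recall, f1 = precision_recall_f1(cm)
accuracy = (cm[0][0] + cm[1][1]) / sum(sum(row) for row in cm)
print(round(accuracy, 3), round(precision, 3), round(recall, 3), round(f1, 3))
```

With these counts the accuracy is about 0.73 while the F1 score is only 0.5, because 30 of the 50 churners are missed. On an imbalanced target like churn, that gap is why F1 is worth reporting next to accuracy.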