## Problem Statement

`Customer churn`, also known as customer attrition, occurs when customers stop doing business with a company.

Imagine the following dataset contains sensitive information on 9,000 customers of a European bank, EBQ. Using an Artificial Neural Network (ANN) trained on this dataset, we will attempt to predict which customers are going to leave next.
Here is a breakdown of the features in the dataset:

* `CustomerId`: a unique identifier for each customer within the dataset. These values are not ordered sequentially and are only used to identify a specific customer, so they are unlikely to have any influence on whether a customer leaves the business.
* `Surname`: a string used to identify the customer in the dataset. Surnames may be distinct across all or most customers, so this feature most likely won't affect the target variable.
* `CreditScore`: a numeric representation of the customer's individual credit score, typically used to indicate eligibility for loans. Current credit scores use a range from 300 to 850, but the FICO auto score range uses 250-900. This feature likely influences the retention rate of customers.
* `Geography`: this feature contains a categorical string representing the name of a country the customer is from originally. 
* `Gender`: this feature contains a categorical string representing the gender of the customer ("Male"/"Female"). 
* `Age`: a numerical integer representation of a customer's age. Intuition suggests that older customers are likely to have higher retention than younger customers.
* `Tenure`: a numerical integer representation. It is assumed that this feature represents the total number of years the customer has been retained. It is likely that customers who have been retained longer will continue to be retained.
* `Balance`: a numerical floating point number (to two decimal places of precision) indicating the customer's current bank balance (assumed total across all accounts). Customers with a greater balance may be less likely to exit the account due to difficulty of transfer. 
* `NumOfProducts`: numeric integer value. It is assumed that this value represents the number of accounts (products) that this customer has open. Further evaluation of this feature would be needed to determine the usefulness of this feature, but at face-value, intuition dictates that a customer with more products is less likely to exit. 
* `HasCrCard`: boolean flag (0 or 1) representing whether the customer has a credit card or not. 
* `IsActiveMember`: boolean flag (0 or 1) representing whether the customer is an active member of the bank. It is assumed this indicates whether the customer has transactions on the regular banking statement. Intuition dictates that inactive members are more likely to exit. 
* `EstimatedSalary`: numerical floating point representation of the customer's predicted salary (to two decimal places). Intuition dictates that customers with different incomes may behave differently with respect to retention rate.
* `Exited`: boolean flag (0 or 1) representing whether the customer has exited their account. This is the target variable for the dataset.

## Data Summary
* The following function reads the dataset from a file, constructs a pandas dataframe, and prints a brief summary of the most relevant data.

In [1]:
import pandas as pd
import keras
from keras import layers
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import confusion_matrix


In [2]:
def summarize_dataset(csv_file):
    data_set = pd.read_csv(csv_file)
    # Count and print the number of rows
    a = len(data_set.index)
    print('total number of rows =  %d' % a)
    # Count and print the number of columns
    b = len(data_set.columns)
    print('total number of columns =  %d' % b)
    # The describe function drops the non-numerical columns, subtract this from the total number
    c = b - len(data_set.describe().columns)
    print('number of columns having non-numeric values = %d' % c)
    missing_data_tuples = []
    # Loop through the columns, if a column contains missing values note that column and sum the number of missing values
    for column in data_set:
        if data_set[column].isna().sum() > 0:
            d = column
            e = data_set[column].isna().sum()
            missing_data_tuples.append((d, e))
    # Sort by the number of missing values, largest first
    missing_data_tuples = sorted(missing_data_tuples, key=lambda x: x[1], reverse=True)
    print('columns with missing values = {0}'.format(missing_data_tuples))
    f1 = 'Male'
    f2 = 'Female'
    # Divide the number of each gender who exited by the total number of each gender
    # Each comparison is parenthesized because & binds more tightly than ==
    g1 = len(data_set[(data_set['Gender'] == f1) & (data_set['Exited'] == 1)]) / len(data_set[data_set['Gender'] == f1])
    g2 = len(data_set[(data_set['Gender'] == f2) & (data_set['Exited'] == 1)]) / len(data_set[data_set['Gender'] == f2])
    # Sort on the numeric rates first, then format them as percentages for display
    gen_exit = sorted([(f1, g1), (f2, g2)], key=lambda x: x[1], reverse=True)
    gen_exit = [(gender, f'{rate:.2%}') for gender, rate in gen_exit]
    print('gender based summary of exited column = {}'.format(gen_exit))
    # Divide the number of those in each age group who exited by the total number of each age group
    h1 = len(data_set[(data_set['Age'] <= 40) & (data_set['Exited'] == 1)]) / len(data_set[data_set['Age'] <= 40])
    h2 = len(data_set[(data_set['Age'] > 40) & (data_set['Exited'] == 1)]) / len(data_set[data_set['Age'] > 40])
    h1 = f'{h1:.2%}'
    h2 = f'{h2:.2%}'
    age_exit = [('below or equal to 40', h1), ('above 40', h2)]
    print('age based summary of exited column = {}'.format(age_exit))
    # Calculate the mean and standard deviation
    i = data_set['CreditScore'].mean()
    j = data_set['CreditScore'].std()
    print('credit score summary = %.2f +/- %.2f' % (i, j))
    return data_set
dataset = summarize_dataset('dataset/datasetX.csv')

total number of rows =  9000
total number of columns =  13
number of columns having non-numeric values = 3
columns with missing values = [('Age', 397), ('CreditScore', 26)]
gender based summary of exited column = [('Female', '24.77%'), ('Male', '16.63%')]
age based summary of exited column = [('below or equal to 40', '10.94%'), ('above 40', '37.63%')]
credit score summary = 650.25 +/- 96.75
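
The gender and age breakdowns can also be computed with a grouped mean, because the mean of a 0/1 column is the fraction of ones. The following is a minimal sketch on a small made-up frame; the rows in `toy` are illustrative and are not taken from the bank dataset.

```python
import pandas as pd

# Toy data for illustration only; these rows are not from the bank dataset
toy = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Female', 'Male', 'Female', 'Male'],
    'Age':    [35, 52, 41, 29, 38, 60],
    'Exited': [0, 1, 1, 0, 0, 1],
})

# The mean of a 0/1 flag is the share of rows where the flag is 1
exit_rate_by_gender = toy.groupby('Gender')['Exited'].mean()

# Bin ages into the same two groups as the summary; rows with a missing Age stay unbinned
age_group = pd.cut(
    toy['Age'],
    bins=[-float('inf'), 40, float('inf')],
    labels=['below or equal to 40', 'above 40'],
)
exit_rate_by_age = toy.groupby(age_group, observed=True)['Exited'].mean()

print(exit_rate_by_gender.sort_values(ascending=False))
print(exit_rate_by_age)
```

Grouping keeps the rates numeric until they are displayed, which makes sorting and further arithmetic straightforward.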


### Preprocessing of the data before training the ANN

In [3]:
# Drop irrelevant data columns
dataset_dropped = dataset.drop(['CustomerId', 'Surname'], axis=1)

In [4]:
# Shuffle the dataset based on the provided seed
dataset_shuffled = dataset_dropped.sample(frac=1, random_state=4321)

In [5]:
# Divide data into x and y values
X = dataset_shuffled.drop(['Exited'], axis=1)
y = dataset_shuffled['Exited']

In [6]:
# Split and shuffle the dataset into appropriate groups
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=4321, test_size=.20, shuffle=True)

In [7]:
# Fit the scikit-learn OneHotEncoder on the training set, then reuse that fitted encoder to one-hot encode the test set
# Drop the first generated column for each encoded category to avoid the dummy variable trap and to fit the 11-variable input the model requires
enc = OneHotEncoder(sparse_output=False, drop='first')
columns_to_one_hot = ['Geography', 'Gender']
encoded_array = enc.fit_transform(X_train.loc[:,columns_to_one_hot])
df_encoded = pd.DataFrame(encoded_array, columns=enc.get_feature_names_out())
X_train_ohe = pd.concat([X_train.reset_index(drop=True), df_encoded.reset_index(drop=True)],axis=1)
X_train_ohe.drop(labels=columns_to_one_hot, axis=1, inplace=True)

# Apply the encoder fitted on the training set to the test set
encoded_array = enc.transform(X_test.loc[:,columns_to_one_hot])
df_encoded_2 = pd.DataFrame(encoded_array, columns=enc.get_feature_names_out())
X_test_ohe = pd.concat([X_test.reset_index(drop=True), df_encoded_2.reset_index(drop=True)], axis=1)
X_test_ohe.drop(labels=columns_to_one_hot,axis=1,inplace=True)

In [8]:
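def normalize_data(df):
    # Standardize each column to zero mean and unit variance, then fill missing
    # values with the median of the scaled column.
    # Note: the statistics come from the frame passed in, so the test set is
    # scaled with its own mean and standard deviation.
    for column in df.columns:
        df[column] = (df[column] - df[column].mean()) / df[column].std()
        # Assign the result back instead of calling fillna(inplace=True) on a column selection
        df[column] = df[column].fillna(df[column].median())
    return df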
X_train_scaled = normalize_data(X_train_ohe)
X_test_scaled = normalize_data(X_test_ohe)
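
A common variation is to learn the scaling statistics from the training set only and reuse them on the test set, so that no test-set information enters preprocessing. The sketch below is an alternative to the cell above, not the method this notebook uses: `fit_scaler` and `apply_scaler` are hypothetical helper names, the data is made up, and missing values are filled before scaling rather than after.

```python
import pandas as pd

def fit_scaler(train_df):
    # Learn per-column statistics from the training data only
    return {'mean': train_df.mean(), 'std': train_df.std(), 'median': train_df.median()}

def apply_scaler(df, stats):
    # Fill gaps with the training median, then standardize with the training mean and std
    filled = df.fillna(stats['median'])
    return (filled - stats['mean']) / stats['std']

# Toy data for illustration only
train = pd.DataFrame({'Age': [20.0, 30.0, None, 40.0], 'Balance': [0.0, 100.0, 200.0, 300.0]})
test = pd.DataFrame({'Age': [30.0, None], 'Balance': [150.0, 450.0]})

stats = fit_scaler(train)
train_scaled = apply_scaler(train, stats)
test_scaled = apply_scaler(test, stats)
print(test_scaled)
```

Because both helpers return new frames, the original `train` and `test` frames are left unchanged.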

### Testing and validation
Here are three different ANN architectures used to make predictions based on the dataset.
After training and testing each model, we will also use it to make a single prediction for two elements in the test set: one where we know the `Exited` value and one where we do not. We will then attempt to draw conclusions based on their features.


## First ANN Architecture
  ![Display Network Architecture](figs/nn-1.png)
  * **Input layer** will have 11 units, matching the dimension of the training set `X_train_scaled` (i.e., number of columns = 11).
  * **First hidden layer** will have 5 neurons, each with the Rectified Linear Unit (`ReLU`) as its activation function.
  * **Second hidden layer** will have 4 neurons, each with `ReLU` as its activation function.
  * **Output layer** will have just 1 neuron, with a `sigmoid` activation function.
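
The size of this network follows directly from the layer widths: each fully connected layer has one weight per input-output pair plus one bias per output unit. A small sketch of that arithmetic, where `dense_param_count` is a hypothetical helper and not part of Keras:

```python
def dense_param_count(layer_sizes):
    # Each dense layer has (inputs * outputs) weights plus one bias per output unit
    return sum(n_in * n_out + n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# Layer sizes of the first architecture: 11 inputs, hidden layers of 5 and 4, 1 output
print(dense_param_count([11, 5, 4, 1]))  # 60 + 24 + 5 = 89
```

This should match the total trainable parameter count that `model.summary()` reports for a stack of `Dense` layers with these sizes.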

In [9]:
"""
known_test_x and unknown_test_x are selected from the scaled dataset using an index to have scaled and normalized data.
known_test_inspect and unknown_test_inspect are the same elements from the dataset selected using the same index but they
relate to the data before it has been normalized so it is easier to draw conclusions via visual inspection
"""
known_test_x = pd.DataFrame(X_test_scaled.loc[100])
known_test_inspect = pd.DataFrame(X_test.iloc[100])
known_test_x =  known_test_x.transpose()
known_test_y = y_test.iloc[100]
unknown_test_x = pd.DataFrame(X_test_scaled.loc[928])
unknown_test_inspect = pd.DataFrame(X_test.iloc[928])
unknown_test_x = unknown_test_x.transpose()

In [10]:
# Create the first model: 11 inputs followed by 5-node and 4-node hidden layers and 1 output node
model = keras.Sequential()
model.add(layers.Dense(5, activation='relu', input_shape=(11,)))
model.add(layers.Dense(4, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X_train_scaled, y_train, epochs=25)
model.save('saved_models/model-ann-11-5-4-1.keras')

Epoch 1/25
...
Epoch 25/25


In [11]:
def eval_model(y_test, y_predict):
    # Binarize the predicted probabilities at a 0.5 threshold without modifying the input array
    y_label = (y_predict > 0.5).astype(int).ravel()
    # Build the confusion matrix
    cm = confusion_matrix(y_test, y_label)
    # Assign the true/false negative/positive
    TN = cm[0,0]
    TP = cm[1,1]
    FN = cm[1,0]
    FP = cm[0,1]
    # Calculate accuracy, precision, recall, and F1 scores
    acc = (TN + TP) / (TN + TP + FN + FP)
    prec = TP / (TP + FP)
    rec = TP / (TP + FN)
    f1 = 2 * (prec * rec) / (prec + rec)
    return acc, prec, rec, f1

In [12]:
def binarize(n):
    # Map a single predicted probability to a class label using a 0.5 threshold
    return 1 if n > 0.5 else 0

In [13]:
"""
Load the model and run it on the test data set
Take the results from the test data set and evaluate them with the eval_model function for specified eval metrics
"""
new_model1 = keras.models.load_model('saved_models/model-ann-11-5-4-1.keras')
results = new_model1.evaluate(X_test_scaled, y_test)
y_predict1 = new_model1.predict(X_test_scaled)
a, p, r, f = eval_model(y_test, y_predict1)
print("Accuracy: {}".format(a))
print("Precision {}".format(p))
print("Recall {}".format(r))
print("F1 {}".format(f))

Accuracy: 0.8438888888888889
Precision 0.7195767195767195
Recall 0.37362637362637363
F1 0.49186256781193494
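
The four printed metrics are consistent with one particular set of confusion-matrix counts. The counts below are inferred from the printed values, assuming the test set holds 1,800 rows (20% of 9,000); the notebook itself does not print the confusion matrix. The sketch recomputes the metrics from those counts and adds the accuracy of a trivial model that always predicts 0.

```python
# Counts inferred from the printed metrics, assuming 1,800 test rows (20% of 9,000);
# the notebook itself does not print the confusion matrix
TN, FP, FN, TP = 1383, 53, 228, 136

total = TN + FP + FN + TP
accuracy = (TN + TP) / total
precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)

# Accuracy of a trivial model that always predicts 0 (customer stays)
baseline_accuracy = (TN + FP) / total

print(accuracy, precision, recall, f1)
print(baseline_accuracy)
```

Under these assumptions the always-0 baseline already reaches roughly 80% accuracy, so recall and F1 are more informative than accuracy for judging how well the model finds customers who exit.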


In [14]:
"""
Use the model to predict the value for a single known test set element and the single unknown test element set
Also binarize the output for easy comparison
"""
predict_known_x = new_model1.predict(known_test_x)
predict_unknown_x = new_model1.predict(unknown_test_x)
predict_known_x = binarize(predict_known_x)
predict_unknown_x = binarize(predict_unknown_x)
print("Predicted value for known data: {} \nKnown value: {}".format(predict_known_x, known_test_y))
if predict_known_x == known_test_y:
    print("The model correctly predicted the 'Exited' value\n")
else:
    print("The model did not correctly predict the 'Exited' value\n")
print("Unknown 'Exited' status predicted value: {}".format(predict_unknown_x))

Predicted value for known data: 0 
Known value: 0
The model correctly predicted the 'Exited' value

Unknown 'Exited' status predicted value: 0


Our model has correctly predicted the output for the known element from the test set.

## Second ANN Architecture

![Display second ANN architecture](figs/nn-2.png)

* Input layer will still have 11 units, matching the dimension of the training set (i.e., number of columns = 11).
* Hidden-layer-1: 8 neurons, with relu activation.
* Hidden-layer-2: 8 neurons, with relu activation.
* Hidden-layer-3: 8 neurons, with relu activation.
* Output-layer: 1 neuron with sigmoid.

In [15]:
# Build the second model with the required specifications
model2 = keras.Sequential()
model2.add(layers.Dense(8, activation='relu', input_shape=(11,)))
model2.add(layers.Dense(8, activation='relu'))
model2.add(layers.Dense(8, activation='relu'))
model2.add(layers.Dense(1, activation='sigmoid'))
model2.compile(optimizer='adam', loss = 'binary_crossentropy', metrics=['accuracy'])
model2.fit(X_train_scaled, y_train, epochs=25)
model2.save('saved_models/model-ann-11-8-8-8-1.keras')

Epoch 1/25
...
Epoch 25/25


In [16]:
"""
Load the model and run it on the test data set
Take the results from the test data set and evaluate them with the eval_model function for specified eval metrics
"""
new_model2 = keras.models.load_model('saved_models/model-ann-11-8-8-8-1.keras')
y_predict2 = new_model2.predict(X_test_scaled)
a, p, r, f = eval_model(y_test, y_predict2)
print("Accuracy: {}".format(a))
print("Precision {}".format(p))
print("Recall {}".format(r))
print("F1 {}".format(f))

Accuracy: 0.8538888888888889
Precision 0.6980392156862745
Recall 0.489010989010989
F1 0.5751211631663974


In [17]:
"""
Use the model to predict the value for a single known test set element and the single unknown test element set
Also binarize the output for easy comparison
"""
predict_known_x = new_model2.predict(known_test_x)
predict_unknown_x = new_model2.predict(unknown_test_x)
predict_known_x = binarize(predict_known_x)
predict_unknown_x = binarize(predict_unknown_x)
print("Predicted value for known data: {} \nKnown value: {}".format(predict_known_x, known_test_y))
if predict_known_x == known_test_y:
    print("The model correctly predicted the 'Exited' value\n")
else:
    print("The model did not correctly predict the 'Exited' value\n")
print("Unknown 'Exited' status predicted value: {}".format(predict_unknown_x))

Predicted value for known data: 0 
Known value: 0
The model correctly predicted the 'Exited' value

Unknown 'Exited' status predicted value: 0


Our model has correctly predicted the output for the known element from the test set.

## Third ANN Architecture

![Display third ANN architecture](figs/nn-3.png)

* Input layer will still have 11 units, matching the dimension of the training set (i.e., number of columns = 11).
* Hidden-layer-1: 8 neurons, with relu activation.
* Hidden-layer-2: 4 neurons, with relu activation.
* Hidden-layer-3: 2 neurons, with relu activation.
* Output-layer: 1 neuron with sigmoid.

In [18]:
# Build the third model with the required specifications
model3 = keras.Sequential()
model3.add(layers.Dense(8, activation='relu', input_shape=(11,)))
model3.add(layers.Dense(4, activation='relu'))
model3.add(layers.Dense(2, activation='relu'))
model3.add(layers.Dense(1, activation='sigmoid'))
model3.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model3.fit(X_train_scaled, y_train, epochs=25)
model3.save('saved_models/model-ann-11-8-4-2-1.keras')

Epoch 1/25
...
Epoch 25/25


In [19]:
"""
Load the model and run it on the test data set
Take the results from the test data set and evaluate them with the eval_model function for specified eval metrics
"""
new_model3 = keras.models.load_model('saved_models/model-ann-11-8-4-2-1.keras')
y_predict3 = new_model3.predict(X_test_scaled)
a, p, r, f = eval_model(y_test, y_predict3)
print("Accuracy: {}".format(a))
print("Precision {}".format(p))
print("Recall {}".format(r))
print("F1 {}".format(f))

Accuracy: 0.8527777777777777
Precision 0.7180616740088106
Recall 0.4478021978021978
F1 0.5516074450084603
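
With all three evaluations printed, the architectures can be compared side by side. The sketch below copies the printed test-set metrics (rounded to four decimals) into a dictionary and ranks the models by F1. Each model was trained once, so small differences between them may be within run-to-run variation.

```python
# Test-set metrics as printed by the three evaluation cells, rounded to four decimals
results = {
    '11-5-4-1':   {'accuracy': 0.8439, 'precision': 0.7196, 'recall': 0.3736, 'f1': 0.4919},
    '11-8-8-8-1': {'accuracy': 0.8539, 'precision': 0.6980, 'recall': 0.4890, 'f1': 0.5751},
    '11-8-4-2-1': {'accuracy': 0.8528, 'precision': 0.7181, 'recall': 0.4478, 'f1': 0.5516},
}

# Rank architectures by F1, which balances precision and recall on the minority class
ranking = sorted(results, key=lambda name: results[name]['f1'], reverse=True)
for name in ranking:
    m = results[name]
    print(f"{name}: accuracy={m['accuracy']:.4f} recall={m['recall']:.4f} f1={m['f1']:.4f}")
```

On these numbers the accuracies differ by about one percentage point, while recall ranges from about 0.37 to 0.49, which is where the architectures separate most clearly.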


In [20]:
"""
Use the model to predict the value for a single known test set element and the single unknown test element set
Also binarize the output for easy comparison
"""
# Use the third model (new_model3) for these predictions
predict_known_x = new_model3.predict(known_test_x)
predict_unknown_x = new_model3.predict(unknown_test_x)
predict_known_x = binarize(predict_known_x)
predict_unknown_x = binarize(predict_unknown_x)
print("Predicted value for known data: {} \nKnown value: {}".format(predict_known_x, known_test_y))
if predict_known_x == known_test_y:
    print("The model correctly predicted the 'Exited' value\n")
else:
    print("The model did not correctly predict the 'Exited' value\n")
print("Unknown 'Exited' status predicted value: {}".format(predict_unknown_x))

Predicted value for known data: 0 
Known value: 0
The model correctly predicted the 'Exited' value

Unknown 'Exited' status predicted value: 0


Our model has correctly predicted the output for the known element from the test set.

In [21]:
print(known_test_inspect)

                      3861
CreditScore          571.0
Geography           France
Gender              Female
Age                   33.0
Tenure                   9
Balance          102017.25
NumOfProducts            2
HasCrCard                0
IsActiveMember           0
EstimatedSalary  128600.49


In [22]:
print(unknown_test_inspect)

                     1080
CreditScore         549.0
Geography           Spain
Gender             Female
Age                  24.0
Tenure                  9
Balance               0.0
NumOfProducts           2
HasCrCard               1
IsActiveMember          1
EstimatedSalary  14406.41


## Conclusions around classification
From these two specific examples, the two data elements have several things in common: both customers are female, both have credit scores in the mid-500s, both hold the same number of products, both are under the age of 40, and both have a tenure of 9.
Without drawing too many conclusions from limited data, it looks like women under 40 with lower credit scores may be more likely to be classified as 0 for `Exited` by the model. The two samples also differ in many ways, especially in their account balances and salaries.
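
The comparison between the two records can be made explicit by listing which fields are exactly equal and which are not. This sketch copies the values printed by the two inspection cells into plain dictionaries; it only checks exact equality, so near matches such as the two credit scores show up as different.

```python
# The two records as printed by the inspection cells above
known = {'CreditScore': 571.0, 'Geography': 'France', 'Gender': 'Female', 'Age': 33.0,
         'Tenure': 9, 'Balance': 102017.25, 'NumOfProducts': 2, 'HasCrCard': 0,
         'IsActiveMember': 0, 'EstimatedSalary': 128600.49}
unknown = {'CreditScore': 549.0, 'Geography': 'Spain', 'Gender': 'Female', 'Age': 24.0,
           'Tenure': 9, 'Balance': 0.0, 'NumOfProducts': 2, 'HasCrCard': 1,
           'IsActiveMember': 1, 'EstimatedSalary': 14406.41}

# Split the feature names by whether the two records agree exactly
shared = [name for name in known if known[name] == unknown[name]]
different = [name for name in known if known[name] != unknown[name]]
print('identical:', shared)
print('different:', different)
```

Only three of the ten fields match exactly, which supports treating any pattern drawn from these two samples as tentative.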