# Churn Modelling with Light GBM and ANN

Jonathan Lices Martín

In this notebook we'll try to understand the basic implementation of an ANN with a few different methods, and compare our results with one of the most popular methods in Kaggle competitions, Light GBM. So we have some interesting objectives ahead, let's do it!

## Understanding the problem and the data

This dataset is designed for predicting whether a bank's client will leave the bank or not, using information like credit score, salary, etc. Since we have to say whether the client is going to leave or stay, we expect a **binary** output from our ANN.

Like a great data scientist would say, the first step is to explore the data.

In [None]:
# Import packages and libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import LabelEncoder, OneHotEncoder

from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import accuracy_score

import lightgbm as lgb

import keras
from keras.models import Sequential 
from keras.layers import Dense 
from keras.layers import Dropout

In [None]:
# Import the data with pandas

data = pd.read_csv("../input/churn-modelling/Churn_Modelling.csv")
data_copy = data.copy() # Just in case

data.head()

Now we've imported the data successfully. It's time to explore it and try to understand the dataset. To start with, let's describe every column.

In [None]:
# Dataset columns

print("The names of the columns are:", data.columns)

In [None]:
# Dataset statistical description

data.describe()  # the parentheses are needed to actually compute the summary

As we can see, some of this information is relevant and the rest we probably won't need at all, so we can delete it. The column names are quite explicit, so we can easily infer what we are seeing in this dataset. Our objective now is to preprocess the data.

## Preprocessing the data

One of the first things we can think about when we are going to do a Machine Learning project is whether the dataset is complete or not; that is to say, do we have **missing values**?

In [None]:
# Missing values

# Count of missing values per column, and the same as a fraction of all rows
total = data.isnull().sum().sort_values(ascending=False)
percentage = (data.isnull().sum()/data.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percentage], axis=1, keys=['Total', 'Percentage'])
missing_data.head(20)

So we have a complete dataset, and now we can delete the non-relevant information. To do this we'll simply drop those columns.

In [None]:
# Removing non-relevant information

not_featured_cols = ["RowNumber", "CustomerId", "Surname"]
data = data.drop(not_featured_cols, axis = 1)

data.head()

## Data Exploration

Can we expect some correlation between the features? Which characteristic is more important? These are some of the questions we have to answer. Let's look at this in more detail with a correlation plot.

In [None]:
# Correlation plot

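# numeric_only=True (pandas >= 1.5) skips the text columns 'Geography' and 'Gender',
# which would otherwise raise an error in recent pandas versions
corr = data.corr(numeric_only=True)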

sns.set()
fig, ax = plt.subplots(figsize = (15,15))
ax = sns.heatmap(corr, annot = True, linewidths = 1.0)
ax.set_title("Correlation Plot")

We don't have strong correlations at all. We can continue our data exploration by studying some information that may be useful.

In [None]:
# Visualizing columns

# countplot returns a matplotlib Axes; naming the column via x= works across seaborn versions
ax = sns.countplot(x="Geography", data=data)
ax.set_title("Geography Counting")

plt.show()

In [None]:
ax = sns.countplot(x="Gender", data=data)
ax.set_title("Gender Counting")

plt.show()

In [None]:
# Display min and max age.

print("The maximum age is:", data["Age"].max())
print("The minimum age is:", data["Age"].min())

As we can see, ages range from 18 to 92, there are more men than women, and half of the clients are from France. So we might guess this is a European bank that originated in France and has some offices in Spain and Germany.

Now, we are prepared to process the data and build our models.

## Building the models

The first thing we have to do is prepare the data so that we can build a model. We have some categorical data here, so let's use OneHotEncoder to handle it. But first, we have to split the data into two groups: the features (X) and the target (y).

In [None]:
# Split the data

X = data.iloc[:, :10].values
y = data.iloc[:, 10].values

In [None]:
# Encoding categorical features

labelencoder_X_1 = LabelEncoder()
X[:, 1] = labelencoder_X_1.fit_transform(X[:, 1]) # 'Geography' 
labelencoder_X_2 = LabelEncoder()
X[:, 2] = labelencoder_X_2.fit_transform(X[:, 2]) # 'Gender'



transformer = ColumnTransformer(
    transformers=[
        ("Churn_Modelling", # Name for the transformation
        OneHotEncoder(categories='auto'), # Transformer we want to apply
        [1] # Columns to transform ('Geography')
        )
    ], remainder='passthrough'
)
X = transformer.fit_transform(X)
X = X[:, 1:] # Drop one dummy column to avoid multicollinearity
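
To make the encoding step more concrete, here is a minimal sketch on a toy column. The four values below are illustrative and not rows taken from the dataset. It assumes that `OneHotEncoder` can encode the strings directly and uses its `drop="first"` option, which should play the same role as the manual `X[:, 1:]` slice.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy 'Geography' column (illustrative values, not actual dataset rows)
geography = np.array([["France"], ["Spain"], ["Germany"], ["France"]])

# drop="first" removes the column of the first category (alphabetical order),
# so 3 countries become 2 dummy columns
encoder = OneHotEncoder(drop="first")
encoded = encoder.fit_transform(geography).toarray()

print(encoder.categories_[0])  # the categories the encoder found
print(encoded)                 # one row per client, columns: Germany, Spain
```

A client from France ends up as a row of zeros, so no information is lost by dropping that column.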

In [None]:
# Last but not least, splitting the data in training and testing groups

X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y, 
                                                    test_size = 0.2, 
                                                    random_state = 42)

## ANN implementation

Now we can build our multi-layer perceptron, or Artificial Neural Network. But what are artificial neural networks?

> *Artificial neural networks (ANNs), usually simply called neural networks (NNs), are computing systems vaguely inspired by the biological neural networks that constitute animal brains. An ANN is based on a collection of connected units or nodes called artificial neurons, which loosely model the neurons in a biological brain. Each connection, like the synapses in a biological brain, can transmit a signal to other neurons. An artificial neuron that receives a signal then processes it and can signal neurons connected to it. [Wikipedia](https://en.wikipedia.org/wiki/Artificial_neural_network)*

Actually, this notebook is not as interested in the theory as in the practice, so let's build our model.

Another interesting method is using Autokeras. You can learn more about this [here](https://towardsdatascience.com/automl-creating-top-performing-neural-networks-without-defining-architectures-c7d3b08cddc).

In [None]:
# Scaling data

sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
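
As a rough illustration of what the scaling cell does, the sketch below standardizes synthetic data by hand with NumPy. The arrays `train` and `test` are made up for the example. The key point is that the mean and standard deviation are computed on the training data only and then reused for the test data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-ins for X_train and X_test (hypothetical data)
train = rng.normal(loc=50.0, scale=10.0, size=(800, 2))
test = rng.normal(loc=50.0, scale=10.0, size=(200, 2))

# "fit": learn the statistics from the training set only
mean = train.mean(axis=0)
std = train.std(axis=0)

# "transform": apply the same statistics to both sets
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print("Train mean:", train_scaled.mean(axis=0).round(3))
print("Train std: ", train_scaled.std(axis=0).round(3))
print("Test mean: ", test_scaled.mean(axis=0).round(3))
```

The test set ends up close to, but not exactly at, zero mean, which is expected because it never influenced the statistics.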

In [None]:
# Building the model

classifier = Sequential()

# First layer
classifier.add(Dense(units = 6, kernel_initializer = "uniform", activation = "relu", input_dim = 11))
classifier.add(Dropout(rate = 0.1))

# Second layer
classifier.add(Dense(units = 6, kernel_initializer = "uniform", activation = "relu"))
classifier.add(Dropout(rate = 0.1))

# Output layer
classifier.add(Dense(units = 1, kernel_initializer = "uniform", activation = "sigmoid"))

# Compiler
classifier.compile(optimizer = "adam", loss = "binary_crossentropy", metrics = ["accuracy"])
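
To get a feeling for what this architecture computes, here is a NumPy sketch of the forward pass of an 11-6-6-1 network. The weights are random placeholders, not the ones Keras would learn, and dropout is left out because it is only active during training.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(42)

# Random placeholder weights with the same shapes as the layers above
W1, b1 = rng.uniform(-0.05, 0.05, size=(11, 6)), np.zeros(6)
W2, b2 = rng.uniform(-0.05, 0.05, size=(6, 6)), np.zeros(6)
W3, b3 = rng.uniform(-0.05, 0.05, size=(6, 1)), np.zeros(1)

def forward(x):
    h1 = relu(x @ W1 + b1)         # first hidden layer, 6 units
    h2 = relu(h1 @ W2 + b2)        # second hidden layer, 6 units
    return sigmoid(h2 @ W3 + b3)   # output: probability of leaving the bank

batch = rng.normal(size=(4, 11))   # 4 hypothetical, already scaled clients
probs = forward(batch)
print(probs)
```

Each row of `probs` is a value between 0 and 1, which is why the predictions are later thresholded at 0.5.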

In [None]:
# LET'S TRAIN!

classifier.fit(X_train, y_train,  batch_size = 10, epochs = 100)

In [None]:
# Evaluating the model

y_pred = classifier.predict(X_test) 
y_pred = (y_pred > 0.5)

cm = confusion_matrix(y_test, y_pred)
cm_df = pd.DataFrame(cm)

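def plot_confusion_matrix(df_confusion, title='Confusion matrix'):
    """Draw an annotated heatmap of a confusion matrix."""
    sns.set()
    ax = plt.subplot()
    # fmt='d' prints the counts as plain integers
    sns.heatmap(df_confusion, annot=True, fmt='d', ax=ax, cmap='coolwarm')
    ax.set_xlabel('Predicted labels')
    ax.set_ylabel('True labels')
    ax.set_title(title)  # use the title argument instead of a hard-coded string

plot_confusion_matrix(cm_df)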

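The confusion matrix shows how many test clients the network classified correctly and where it got confused. Keep in mind that we standardized the inputs first, which matters a lot for a neural network trained with gradient descent.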
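
A confusion matrix can be turned into a few numbers that are easier to compare between models. The helper below is a small sketch. The function name and the example counts are made up for illustration, and it assumes the scikit-learn layout where rows are true labels and columns are predicted labels.

```python
import numpy as np

def summarize_confusion_matrix(cm):
    """Accuracy, precision and recall from a 2x2 confusion matrix."""
    # scikit-learn layout: [[TN, FP], [FN, TP]]
    tn, fp, fn, tp = np.asarray(cm).ravel()
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }

# Made-up counts, NOT the results of this notebook
example_cm = [[90, 10],
              [20, 30]]
metrics = summarize_confusion_matrix(example_cm)
for name, value in metrics.items():
    print("{}: {:.2f}".format(name, value))
```

Calling `summarize_confusion_matrix(cm)` on the matrix computed above would give the ANN's accuracy in the same form as the Light GBM accuracy printed later.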

## Light GBM Implementation

If you are on Kaggle right now, you have probably heard something about Light GBM, but what is it?

> *Light GBM is a fast, distributed, high-performance gradient boosting framework based on decision tree algorithm, used for ranking, classification and many other machine learning tasks. Since it is based on decision tree algorithms, it splits the tree leaf wise with the best fit whereas other boosting algorithms split the tree depth wise or level wise rather than leaf-wise. So when growing on the same leaf in Light GBM, the leaf-wise algorithm can reduce more loss than the level-wise algorithm and hence results in much better accuracy which can rarely be achieved by any of the existing boosting algorithms. Also, it is surprisingly very fast, hence the word ‘Light’. [analyticsvidhya](https://www.analyticsvidhya.com/blog/2017/06/which-algorithm-takes-the-crown-light-gbm-vs-xgboost/)*

Let's try that implementation. Remember we've already scaled the data! Tree-based models don't need scaling, but it does no harm here.

In [None]:
# Building the model specifically for LGBM

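training_data = lgb.Dataset(data = X_train, label = y_train)
# 'num_trees' is an alias of num_boost_round; when both are given LightGBM warns
# and uses the value from params, so the 100 rounds are now set in one place only
params = {'num_leaves': 31, 'objective': 'binary'}
params['metric'] = ['auc', 'binary_logloss']
classifier = lgb.train(params = params,
                       train_set = training_data,
                       num_boost_round = 100)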

In [None]:
# Making predictions with test set

prob_pred = classifier.predict(X_test)
# Turn probabilities into class labels with a 0.5 threshold (vectorized)
y_pred = (prob_pred >= 0.5).astype(int)

In [None]:
# Confusion Matrix

cm = confusion_matrix(y_test, y_pred)
cm_df = pd.DataFrame(cm)
plot_confusion_matrix(cm_df)

In [None]:
# Getting the accuracy

accuracy = accuracy_score(y_test, y_pred) * 100  # signature is (y_true, y_pred)
print("Accuracy: {:.0f} %".format(accuracy))

In [None]:
# K-FOLD CROSS VALIDATION

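# 'num_trees' is an alias of num_boost_round, so the tree count is set only once here
params = {'num_leaves': 31, 'objective': 'binary'}
params['metric'] = ['auc']
cv_results = lgb.cv(params = params,
                    train_set = training_data,
                    num_boost_round = 100,
                    nfold = 10)
# Recent LightGBM releases name the key 'valid auc-mean', older ones 'auc-mean'
auc_key = 'valid auc-mean' if 'valid auc-mean' in cv_results else 'auc-mean'
average_auc = np.mean(cv_results[auc_key])
# Print the AUC itself (the original printed the accuracy variable by mistake)
print("Average AUC: {:.0f} %".format(average_auc * 100))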
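
If you are wondering what k-fold cross validation does under the hood, the sketch below writes the loop by hand. It is only an illustration: it uses synthetic data from `make_classification` and a `LogisticRegression` stand-in instead of Light GBM, so the numbers say nothing about the churn dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Synthetic binary classification data (hypothetical, not the churn data)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)

fold_aucs = []
splitter = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, valid_idx in splitter.split(X_demo, y_demo):
    # Train on k-1 folds, validate on the fold that was held out
    model = LogisticRegression(max_iter=1000)
    model.fit(X_demo[train_idx], y_demo[train_idx])
    valid_prob = model.predict_proba(X_demo[valid_idx])[:, 1]
    fold_aucs.append(roc_auc_score(y_demo[valid_idx], valid_prob))

mean_auc = float(np.mean(fold_aucs))
print("AUC per fold:", np.round(fold_aucs, 3))
print("Mean AUC: {:.3f}".format(mean_auc))
```

Every sample is used for validation exactly once, so the mean score is less dependent on one lucky or unlucky split than a single train/test evaluation.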

As we can see, we've obtained the best accuracy with Light GBM. That is to say, we've built a simpler model (in programming terms) and obtained better results. Maybe an ANN could do better, but we would have had to search for the best hyperparameters for it, and to be honest, in a real job we don't have that much time!

Thanks for reading my notebook. I hope it was useful for you.

Please upvote if you like it!