# 1. Introduction: Business Goal & Problem Definition

This project's goal is to create a model that identifies clients at risk of leaving the company, which is crucial information for the business. The dataset is called "Predicting Churn for Bank Customers" and is available on Kaggle. We'll use the following 10 features to create the model:

* Credit Score
* Geography
* Gender
* Age
* Tenure
* Balance
* Number of products
* Credit card possession
* Is active member
* Salary

# 2. Importing Basic Libraries

In [None]:
import io
import openpyxl
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# 3. Data Collection

In [None]:
#For this exercise, I think it's better to do the analysis by country, in order to have a more accurate analysis per market

# geo = input("Select the country you´d like to analyze (France / Germany / Spain): ")

churn_ds = pd.read_csv("../input/predicting-churn-for-bank-customers/Churn_Modelling.csv", sep=",")
churn_ds = churn_ds.loc[churn_ds["Geography"] == "Germany"].copy()  #.copy() avoids SettingWithCopyWarning when adding columns later

churn_ds

# 4. Data Preliminary Exploration

In [None]:
#Checking a dataset sample

pd.set_option("display.max_rows", 100)
pd.set_option("display.max_columns", 100)
pd.options.display.float_format="{:,.2f}".format
churn_ds.sample(n=10, random_state=0)

In [None]:
#Checking dataset info by feature

churn_ds.info(verbose=True, show_counts=True)  #show_counts replaces the deprecated null_counts argument

In [None]:
#Checking the number of zeros per feature

(churn_ds==0).sum(axis=0).to_excel("zeros_per_feature.xlsx")
(churn_ds==0).sum(axis=0)

In [None]:
#Checking the existence of duplicated rows

churn_ds.duplicated().sum()

In [None]:
#Checking data balancing (for classification)

data_balancing = pd.DataFrame()
data_balancing["Count"] = churn_ds["Exited"].value_counts()
data_balancing["Count%"] = churn_ds["Exited"].value_counts()/churn_ds.shape[0]*100

data_balancing
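
The same balance table can be built in one step with `value_counts(normalize=True)`. The snippet below is a minimal sketch on a hypothetical toy series (`exited` is made up for illustration and stands in for `churn_ds["Exited"]`).

```python
import pandas as pd

# Hypothetical toy target column standing in for churn_ds["Exited"]
exited = pd.Series([0, 0, 0, 1, 0, 1, 0, 0], name="Exited")

# Absolute counts and percentage share of each class
data_balancing = pd.DataFrame({
    "Count": exited.value_counts(),
    "Count%": exited.value_counts(normalize=True) * 100,
})
print(data_balancing)
```

A strongly skewed "Count%" column is a hint that accuracy alone may be misleading and that precision, recall and F1 deserve attention.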

In [None]:
#Checking basic statistical data by feature

churn_ds.describe(include="all")

# 5. Data Preparation

We'll perform the following:

1. Create a column with "Balance" divided by "EstimatedSalary", which could potentially bring relevant information to the model
2. Remove columns that don't add any value to the model: "RowNumber", "CustomerId", "Surname" and "Geography" (constant after filtering to Germany)
3. Convert categorical variables to dummies: "Gender"

* No duplicated rows found
* No outliers found

In [None]:
#1

churn_ds["BalanceSalaryProportion"] = churn_ds["Balance"] / churn_ds["EstimatedSalary"]
churn_ds = churn_ds.reset_index(drop=True)  #reset_index returns a new DataFrame, so it must be assigned back
churn_ds

In [None]:
#2

churn_ds.drop(["RowNumber", "CustomerId", "Surname", "Geography"], axis=1, inplace=True)

In [None]:
#3

churn_ds = pd.concat([churn_ds, pd.get_dummies(churn_ds["Gender"], prefix="Gender")], axis=1)

churn_ds.to_excel("churn_ds_clean.xlsx")

# 6. Data Exploration

In [None]:
#Plotting Categorical Variables

#One bar chart and one pie chart per categorical variable
for col in ["Exited", "Gender", "HasCrCard", "IsActiveMember"]:
    counts = churn_ds[col].value_counts()
    fig, ax = plt.subplots(1, 2)
    counts.plot.bar(color="purple", ax=ax[0])
    counts.plot.pie(autopct='%1.1f%%', shadow=True, textprops={"fontsize": 10}, ax=ax[1])
    fig.suptitle(f"{col} Frequency", fontsize=15)
    plt.xticks(rotation=90)
    plt.yticks(rotation=45)

In [None]:
#Plotting Numerical Variables

#One histogram, boxplot and violin plot per numerical variable
#sns.distplot is deprecated, so sns.histplot with a KDE overlay is used instead
numerical_cols = ["CreditScore", "Age", "Tenure", "Balance", "NumOfProducts",
                  "EstimatedSalary", "BalanceSalaryProportion"]
for col in numerical_cols:
    fig, ax = plt.subplots(1, 3)
    fig.suptitle(f"{col} Distribution", fontsize=15)
    sns.histplot(churn_ds[col], kde=True, ax=ax[0])
    sns.boxplot(x=churn_ds[col], ax=ax[1])
    sns.violinplot(x=churn_ds[col], ax=ax[2])

In [None]:
#Alternatively using Profile Report to see variables statistics and correlations

# from pandas_profiling import ProfileReport
# profile = ProfileReport(churn_ds, title="Customer Index")
# profile.to_file(output_file="Customer_Churn.html")

# 7. Correlations Analysis & Features Selection

In [None]:
#Deleting original categorical columns

churn_ds.drop(["Gender"], axis=1, inplace=True)

#Plotting a Heatmap

fig, ax = plt.subplots(1, figsize=(25,25))
sns.heatmap(churn_ds.corr(), annot=True, fmt=",.2f")
plt.title("Heatmap Correlation", fontsize=20)
plt.tick_params(labelsize=12)
plt.xticks(rotation=90)
plt.yticks(rotation=45)

#Plotting a Pairplot

sns.pairplot(churn_ds)

In [None]:
#Plotting a Feature Importance

from xgboost import XGBClassifier
from matplotlib import pyplot
#Defining Xs and y
X = churn_ds.drop(["Exited"], axis=1)
y = churn_ds["Exited"]
#Defining the model
model = XGBClassifier().fit(X, y)
#Getting importance
importance = model.feature_importances_
#Summarizing feature importance
for i,v in enumerate(importance):
    print("Feature:{0:}, Score:{1:,.4f}".format(X.columns[i], v))
#Plotting feature importance
pd.Series(model.feature_importances_[::-1], index=X.columns[::-1]).plot(kind="barh", figsize=(25,25))

# 8. Data Modelling

In [None]:
#Defining Xs and y

X = churn_ds[["NumOfProducts", "IsActiveMember", "Age", "Gender_Female", "Balance", "HasCrCard", "BalanceSalaryProportion", "EstimatedSalary", "CreditScore", "Tenure"]]
y = churn_ds[["Exited"]]

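#Setting train/test split first, so the scaler never sees the test rows

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

#Scaling all features (scaler fitted on the training rows only)

from sklearn.preprocessing import MinMaxScaler
sc_X = MinMaxScaler().fit(X_train)
X_train = pd.DataFrame(sc_X.transform(X_train))
X_test = pd.DataFrame(sc_X.transform(X_test))
X_scaled = pd.DataFrame(sc_X.transform(X))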
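
As an alternative to scaling by hand, scaling and modelling can be chained in a scikit-learn `Pipeline`, which fits the scaler on the training rows only and reuses it at prediction time. The snippet below is a minimal sketch on synthetic data (`X_demo` and `y_demo` are made up for illustration, and the `stratify` argument is an assumption, not part of the original split).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the churn features and target (illustration only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] > 0).astype(int)

# stratify keeps the class proportions equal in the train and test sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0
)

# The scaler inside the pipeline is fitted on X_tr only
pipe = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)

test_accuracy = pipe.score(X_te, y_te)
print(f"Test accuracy: {test_accuracy:.3f}")
```

Because the pipeline carries its own scaler, new rows can be passed to `pipe.predict` unscaled.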

# 9. Machine Learning Algorithms Implementation & Assessment
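
Every model in this section computes the same four metrics for the train and test sets. A small helper could reduce that repetition. The function below is a hypothetical sketch (the name `score_predictions` and the toy labels are made up for illustration), not the code used in the cells that follow.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def score_predictions(y_true, y_pred):
    """Return accuracy, precision, recall and F1 for binary predictions."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        # zero_division=0 avoids a warning when no positive class is predicted
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred, zero_division=0),
        "f1": f1_score(y_true, y_pred, zero_division=0),
    }

# Toy labels for illustration only
y_true_demo = [0, 1, 1, 0, 1, 0]
y_pred_demo = [0, 1, 0, 0, 1, 1]

scores = score_predictions(y_true_demo, y_pred_demo)
print("Accuracy:{accuracy:,.3f}, Precision:{precision:,.3f}, "
      "Recall:{recall:,.3f}, F1:{f1:,.3f}".format(**scores))
```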

# 9.1 Logistic Regression

In [None]:
#Creating a Logistic Regression model and checking its Metrics

from sklearn import linear_model
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import roc_auc_score, accuracy_score, precision_score, recall_score, f1_score

#Trying different polynomial degrees
degrees = [1, 2, 3, 4, 5]
print("Testing degrees:")
for a in degrees:
    poly = PolynomialFeatures(degree=a)
    X_train_degree = poly.fit_transform(X_train)
    X_test_degree = poly.transform(X_test)  #transform only: the expansion was already fitted on the training set
    model_lr = linear_model.LogisticRegression(max_iter=1000000000).fit(X_train_degree, y_train.values.ravel())
    y_preds_train = model_lr.predict(X_train_degree)
    y_preds_test = model_lr.predict(X_test_degree)
    accuracy_train = accuracy_score(y_train, y_preds_train)
    accuracy_test = accuracy_score(y_test, y_preds_test)
    precision_train = precision_score(y_train, y_preds_train)
    precision_test = precision_score(y_test, y_preds_test)
    recall_train = recall_score(y_train, y_preds_train)
    recall_test = recall_score(y_test, y_preds_test)
    f1_train = f1_score(y_train, y_preds_train)
    f1_test = f1_score(y_test, y_preds_test)
    print("Train: Degree:{0:,.0f}, Accuracy:{1:,.3f}, Precision:{2:,.3f}, Recall:{3:,.3f}, F1:{4:,.3f}".format(a, accuracy_train, precision_train, recall_train, f1_train))
    print("Test : Degree:{0:,.0f}, Accuracy:{1:,.3f}, Precision:{2:,.3f}, Recall:{3:,.3f}, F1:{4:,.3f}".format(a, accuracy_test, precision_test, recall_test, f1_test))
print("")

#Choosing the best polynomial degree
chosen_degree = 4
poly = PolynomialFeatures(degree=chosen_degree)

#Working on X_train & X_test in the polynomial chosen degree
X_train_degree = poly.fit_transform(X_train)
X_test_degree = poly.transform(X_test)  #transform only: the expansion was already fitted on the training set

#Fitting to the model
model_lr = linear_model.LogisticRegression(max_iter=1000000000).fit(X_train_degree, y_train.values.ravel())
print(f"Logistic Regression Intercept: {model_lr.intercept_}")
print(f"Logistic Regression Coefficients: {model_lr.coef_}, \n")

#Getting the predictions & Metrics
y_preds_train = model_lr.predict(X_train_degree)
y_preds_test = model_lr.predict(X_test_degree)
accuracy_train = accuracy_score(y_train, y_preds_train)
accuracy_test = accuracy_score(y_test, y_preds_test)
precision_train = precision_score(y_train, y_preds_train)
precision_test = precision_score(y_test, y_preds_test)
recall_train = recall_score(y_train, y_preds_train)
recall_test = recall_score(y_test, y_preds_test)
f1_train = f1_score(y_train, y_preds_train)
f1_test = f1_score(y_test, y_preds_test)
print("Chosen degree:")
print("Train: Degree:{0:,.0f}, Accuracy:{1:,.3f}, Precision:{2:,.3f}, Recall:{3:,.3f}, F1:{4:,.3f}".format(chosen_degree, accuracy_train, precision_train, recall_train, f1_train))
print("Test : Degree:{0:,.0f}, Accuracy:{1:,.3f}, Precision:{2:,.3f}, Recall:{3:,.3f}, F1:{4:,.3f}".format(chosen_degree, accuracy_test, precision_test, recall_test, f1_test))
# print("\nConfusion matrix:")
# confusion_matrix = pd.crosstab(y_test, y_preds_test, rownames=["Actual"], colnames=["Predicted"])
# print(f"{confusion_matrix}, \n")
# sns.heatmap(confusion_matrix, annot=True, fmt='0f')

#Visualizing y_pred in the dataset
X_degree = poly.transform(X_scaled)
y_preds_all = model_lr.predict(X_degree)
churn_ds["Exited_predicted"] = y_preds_all
churn_ds.to_excel("model_lr.xlsx")
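
The confusion matrix code above is commented out. One possible reason is that `pd.crosstab` expects one-dimensional inputs, while `y_test` is a one-column DataFrame. A minimal sketch using `sklearn.metrics.confusion_matrix` instead, on toy labels made up for illustration:

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Toy labels for illustration only
y_true_demo = [0, 0, 1, 1, 1, 0]
y_pred_demo = [0, 1, 1, 0, 1, 0]

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=[0, 1])
cm_df = pd.DataFrame(
    cm,
    index=["Actual 0", "Actual 1"],
    columns=["Predicted 0", "Predicted 1"],
)
print(cm_df)
```

With the notebook's variables, the call would presumably be `confusion_matrix(y_test.values.ravel(), y_preds_test)`, and `cm_df` can be passed to `sns.heatmap(cm_df, annot=True, fmt="d")`.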

# 9.2 SVM

In [None]:
#Creating a SVM model and checking its Metrics

from sklearn import svm

#Fitting to the model
model_svm = svm.SVC().fit(X_train, y_train.values.ravel())

#Getting the predictions & Metrics
y_preds_train = model_svm.predict(X_train)
y_preds_test = model_svm.predict(X_test)
accuracy_train = accuracy_score(y_train, y_preds_train)
accuracy_test = accuracy_score(y_test, y_preds_test)
precision_train = precision_score(y_train, y_preds_train)
precision_test = precision_score(y_test, y_preds_test)
recall_train = recall_score(y_train, y_preds_train)
recall_test = recall_score(y_test, y_preds_test)
f1_train = f1_score(y_train, y_preds_train)
f1_test = f1_score(y_test, y_preds_test)
print("Train: Accuracy:{0:,.3f}, Precision:{1:,.3f}, Recall:{2:,.3f}, F1:{3:,.3f}".format(accuracy_train, precision_train, recall_train, f1_train))
print("Test : Accuracy:{0:,.3f}, Precision:{1:,.3f}, Recall:{2:,.3f}, F1:{3:,.3f}".format(accuracy_test, precision_test, recall_test, f1_test))
# print("\nConfusion matrix:")
# confusion_matrix = pd.crosstab(y_test, y_preds_test, rownames=["Actual"], colnames=["Predicted"])
# print(f"{confusion_matrix}, \n")
# sns.heatmap(confusion_matrix, annot=True, fmt='0f')

#Visualizing y_pred in the dataset
y_preds_all = model_svm.predict(X_scaled)
churn_ds["Exited_predicted"] = y_preds_all
churn_ds.to_excel("model_svm.xlsx")

# 9.3 Naive Bayes

In [None]:
#Creating a Naive Bayes model and checking its Metrics

from sklearn import naive_bayes

#Fitting to the model
model_nb = naive_bayes.MultinomialNB().fit(X_train, y_train.values.ravel())

#Getting the predictions & Metrics
y_preds_train = model_nb.predict(X_train)
y_preds_test = model_nb.predict(X_test)
accuracy_train = accuracy_score(y_train, y_preds_train)
accuracy_test = accuracy_score(y_test, y_preds_test)
precision_train = precision_score(y_train, y_preds_train)
precision_test = precision_score(y_test, y_preds_test)
recall_train = recall_score(y_train, y_preds_train)
recall_test = recall_score(y_test, y_preds_test)
f1_train = f1_score(y_train, y_preds_train)
f1_test = f1_score(y_test, y_preds_test)
print("Train: Accuracy:{0:,.3f}, Precision:{1:,.3f}, Recall:{2:,.3f}, F1:{3:,.3f}".format(accuracy_train, precision_train, recall_train, f1_train))
print("Test : Accuracy:{0:,.3f}, Precision:{1:,.3f}, Recall:{2:,.3f}, F1:{3:,.3f}".format(accuracy_test, precision_test, recall_test, f1_test))
# print("\nConfusion matrix:")
# confusion_matrix = pd.crosstab(y_test, y_preds_test, rownames=["Actual"], colnames=["Predicted"])
# print(f"{confusion_matrix}, \n")
# sns.heatmap(confusion_matrix, annot=True, fmt='0f')

#Visualizing y_pred in the dataset
y_preds_all = model_nb.predict(X_scaled)
churn_ds["Exited_predicted"] = y_preds_all
churn_ds.to_excel("model_nb.xlsx")

# 9.4 KNN

In [None]:
#Creating a KNN model and checking its Metrics

from sklearn import neighbors

#Trying different neighbors
n_neighbors = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
print("Testing neighbors:")
for a in n_neighbors:
    model_knn = neighbors.KNeighborsClassifier(n_neighbors=a).fit(X_train, y_train.values.ravel())
    y_preds_train = model_knn.predict(X_train)
    y_preds_test = model_knn.predict(X_test)
    accuracy_train = accuracy_score(y_train, y_preds_train)
    accuracy_test = accuracy_score(y_test, y_preds_test)
    precision_train = precision_score(y_train, y_preds_train)
    precision_test = precision_score(y_test, y_preds_test)
    recall_train = recall_score(y_train, y_preds_train)
    recall_test = recall_score(y_test, y_preds_test)
    f1_train = f1_score(y_train, y_preds_train)
    f1_test = f1_score(y_test, y_preds_test)
    print("Train: Neighbors:{0:,.0f}, Accuracy:{1:,.3f}, Precision:{2:,.3f}, Recall:{3:,.3f}, F1:{4:,.3f}".format(a, accuracy_train, precision_train, recall_train, f1_train))
    print("Test : Neighbors:{0:,.0f}, Accuracy:{1:,.3f}, Precision:{2:,.3f}, Recall:{3:,.3f}, F1:{4:,.3f}".format(a, accuracy_test, precision_test, recall_test, f1_test))
print("")

#Choosing the best neighbor
chosen_neighbor = 13
model_knn = neighbors.KNeighborsClassifier(n_neighbors=chosen_neighbor).fit(X_train, y_train.values.ravel())
y_preds_train = model_knn.predict(X_train)
y_preds_test = model_knn.predict(X_test)
accuracy_train = accuracy_score(y_train, y_preds_train)
accuracy_test = accuracy_score(y_test, y_preds_test)
precision_train = precision_score(y_train, y_preds_train)
precision_test = precision_score(y_test, y_preds_test)
recall_train = recall_score(y_train, y_preds_train)
recall_test = recall_score(y_test, y_preds_test)
f1_train = f1_score(y_train, y_preds_train)
f1_test = f1_score(y_test, y_preds_test)
print("Chosen neighbors:")
print("Train: Neighbors:{0:,.0f}, Accuracy:{1:,.3f}, Precision:{2:,.3f}, Recall:{3:,.3f}, F1:{4:,.3f}".format(chosen_neighbor, accuracy_train, precision_train, recall_train, f1_train))
print("Test : Neighbors:{0:,.0f}, Accuracy:{1:,.3f}, Precision:{2:,.3f}, Recall:{3:,.3f}, F1:{4:,.3f}".format(chosen_neighbor, accuracy_test, precision_test, recall_test, f1_test))
# print("\nConfusion matrix:")
# confusion_matrix = pd.crosstab(y_test, y_preds_test, rownames=["Actual"], colnames=["Predicted"])
# print(f"{confusion_matrix}, \n")
# sns.heatmap(confusion_matrix, annot=True, fmt='0f')

#Visualizing y_pred in the dataset
y_preds_all = model_knn.predict(X_scaled)
churn_ds["Exited_predicted"] = y_preds_all
churn_ds.to_excel("model_knn.xlsx")
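
Choosing the number of neighbors by reading test-set metrics risks tuning the model to the test set. A common alternative is cross-validation on the training data only. The snippet below is a minimal sketch with `GridSearchCV` on synthetic data (`X_demo`, `y_demo` and the choice of F1 as the scoring metric are assumptions made for illustration).

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the scaled training data (illustration only)
rng = np.random.default_rng(0)
X_demo = rng.uniform(size=(300, 4))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 1.0).astype(int)

# 5-fold cross-validation over the same 1..20 range of neighbors
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": list(range(1, 21))},
    scoring="f1",
    cv=5,
)
search.fit(X_demo, y_demo)

best_k = search.best_params_["n_neighbors"]
print(f"Best n_neighbors: {best_k}, mean CV F1: {search.best_score_:.3f}")
```

The same pattern applies to `max_depth` for Random Forest and XGBoost, with the test set kept aside for a single final evaluation.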

# 9.5 Random Forest

In [None]:
#Creating a Random Forest model and checking its Metrics

from sklearn import ensemble

#Trying different depths
depths = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
print("Testing depths:")
for a in depths:
    model_rf = ensemble.RandomForestClassifier(max_depth=a, random_state=0).fit(X_train, y_train.values.ravel())
    y_preds_train = model_rf.predict(X_train)
    y_preds_test = model_rf.predict(X_test)
    accuracy_train = accuracy_score(y_train, y_preds_train)
    accuracy_test = accuracy_score(y_test, y_preds_test)
    precision_train = precision_score(y_train, y_preds_train)
    precision_test = precision_score(y_test, y_preds_test)
    recall_train = recall_score(y_train, y_preds_train)
    recall_test = recall_score(y_test, y_preds_test)
    f1_train = f1_score(y_train, y_preds_train)
    f1_test = f1_score(y_test, y_preds_test)
    print("Train: Depth:{0:,.0f}, Accuracy:{1:,.3f}, Precision:{2:,.3f}, Recall:{3:,.3f}, F1:{4:,.3f}".format(a, accuracy_train, precision_train, recall_train, f1_train))
    print("Test : Depth:{0:,.0f}, Accuracy:{1:,.3f}, Precision:{2:,.3f}, Recall:{3:,.3f}, F1:{4:,.3f}".format(a, accuracy_test, precision_test, recall_test, f1_test))
print("")

#Choosing the best depth
chosen_depth = 8
model_rf = ensemble.RandomForestClassifier(max_depth=chosen_depth, random_state=0).fit(X_train, y_train.values.ravel())
y_preds_train = model_rf.predict(X_train)
y_preds_test = model_rf.predict(X_test)
accuracy_train = accuracy_score(y_train, y_preds_train)
accuracy_test = accuracy_score(y_test, y_preds_test)
precision_train = precision_score(y_train, y_preds_train)
precision_test = precision_score(y_test, y_preds_test)
recall_train = recall_score(y_train, y_preds_train)
recall_test = recall_score(y_test, y_preds_test)
f1_train = f1_score(y_train, y_preds_train)
f1_test = f1_score(y_test, y_preds_test)
print("Chosen depth:")
print("Train: Depth:{0:,.0f}, Accuracy:{1:,.3f}, Precision:{2:,.3f}, Recall:{3:,.3f}, F1:{4:,.3f}".format(chosen_depth, accuracy_train, precision_train, recall_train, f1_train))
print("Test : Depth:{0:,.0f}, Accuracy:{1:,.3f}, Precision:{2:,.3f}, Recall:{3:,.3f}, F1:{4:,.3f}".format(chosen_depth, accuracy_test, precision_test, recall_test, f1_test))
# print("\nConfusion matrix:")
# confusion_matrix = pd.crosstab(y_test, y_preds_test, rownames=["Actual"], colnames=["Predicted"])
# print(f"{confusion_matrix}, \n")
# sns.heatmap(confusion_matrix, annot=True, fmt='0f')

#Visualizing y_pred in the dataset
y_preds_all = model_rf.predict(X_scaled)
churn_ds["Exited_predicted"] = y_preds_all
churn_ds.to_excel("model_rf.xlsx")

# 9.6 XGBoost

In [None]:
#Creating a XGBoost model and checking its Metrics

from xgboost import XGBClassifier

#Trying different depths
depths = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
print("Testing depths:")
for a in depths:
    #"Exited" has only two classes, so a binary objective is used instead of a 4-class softmax
    model_xgbc = XGBClassifier(max_depth=a, objective="binary:logistic", random_state=0).fit(X_train, y_train.values.ravel())
    y_preds_train = model_xgbc.predict(X_train)
    y_preds_test = model_xgbc.predict(X_test)
    accuracy_train = accuracy_score(y_train, y_preds_train)
    accuracy_test = accuracy_score(y_test, y_preds_test)
    precision_train = precision_score(y_train, y_preds_train)
    precision_test = precision_score(y_test, y_preds_test)
    recall_train = recall_score(y_train, y_preds_train)
    recall_test = recall_score(y_test, y_preds_test)
    f1_train = f1_score(y_train, y_preds_train)
    f1_test = f1_score(y_test, y_preds_test)
    print("Train: Depth:{0:,.0f}, Accuracy:{1:,.3f}, Precision:{2:,.3f}, Recall:{3:,.3f}, F1:{4:,.3f}".format(a, accuracy_train, precision_train, recall_train, f1_train))
    print("Test : Depth:{0:,.0f}, Accuracy:{1:,.3f}, Precision:{2:,.3f}, Recall:{3:,.3f}, F1:{4:,.3f}".format(a, accuracy_test, precision_test, recall_test, f1_test))
print("")

#Choosing the best depth
chosen_depth = 1
model_xgbc = XGBClassifier(max_depth=chosen_depth, objective="binary:logistic", random_state=0).fit(X_train, y_train.values.ravel())
y_preds_train = model_xgbc.predict(X_train)
y_preds_test = model_xgbc.predict(X_test)
accuracy_train = accuracy_score(y_train, y_preds_train)
accuracy_test = accuracy_score(y_test, y_preds_test)
precision_train = precision_score(y_train, y_preds_train)
precision_test = precision_score(y_test, y_preds_test)
recall_train = recall_score(y_train, y_preds_train)
recall_test = recall_score(y_test, y_preds_test)
f1_train = f1_score(y_train, y_preds_train)
f1_test = f1_score(y_test, y_preds_test)
print("Chosen depth:")
print("Train: Depth:{0:,.0f}, Accuracy:{1:,.3f}, Precision:{2:,.3f}, Recall:{3:,.3f}, F1:{4:,.3f}".format(chosen_depth, accuracy_train, precision_train, recall_train, f1_train))
print("Test : Depth:{0:,.0f}, Accuracy:{1:,.3f}, Precision:{2:,.3f}, Recall:{3:,.3f}, F1:{4:,.3f}".format(chosen_depth, accuracy_test, precision_test, recall_test, f1_test))
# print("\nConfusion matrix:")
# confusion_matrix = pd.crosstab(y_test, y_preds_test, rownames=["Actual"], colnames=["Predicted"])
# print(f"{confusion_matrix}, \n")
# sns.heatmap(confusion_matrix, annot=True, fmt='0f')

#Visualizing y_pred in the dataset
y_preds_all = model_xgbc.predict(X_scaled)
churn_ds["Exited_predicted"] = y_preds_all
churn_ds.to_excel("model_xgbc.xlsx")

# 9.7 Deep Learning

In [None]:
#Creating a Deep Learning model and checking its Metrics

from keras import Sequential
from keras.layers import Dense

#Creating a model
model_dl = Sequential()

#Input and First Hidden Layer
model_dl.add(Dense(units=256, activation="relu", input_dim=X_train.shape[-1]))

#Output Layer
model_dl.add(Dense(units=1, activation="sigmoid",))

#Compiling the neural network
model_dl.compile(optimizer="adam", loss="binary_crossentropy", metrics=["binary_accuracy"])

#Fitting to the model
model_dl.fit(X_train, y_train.values.ravel(), epochs=250)

#Getting the predictions & Metrics
y_preds_train = model_dl.predict(X_train)
y_preds_test = model_dl.predict(X_test)
accuracy_train = accuracy_score(y_train, y_preds_train.round())
accuracy_test = accuracy_score(y_test, y_preds_test.round())
precision_train = precision_score(y_train, y_preds_train.round())
precision_test = precision_score(y_test, y_preds_test.round())
recall_train = recall_score(y_train, y_preds_train.round())
recall_test = recall_score(y_test, y_preds_test.round())
f1_train = f1_score(y_train, y_preds_train.round())
f1_test = f1_score(y_test, y_preds_test.round())
print("Train: Accuracy:{0:,.3f}, Precision:{1:,.3f}, Recall:{2:,.3f}, F1:{3:,.3f}".format(accuracy_train, precision_train, recall_train, f1_train))
print("Test : Accuracy:{0:,.3f}, Precision:{1:,.3f}, Recall:{2:,.3f}, F1:{3:,.3f}".format(accuracy_test, precision_test, recall_test, f1_test))
# print("\nConfusion matrix:")
# from sklearn.metrics import confusion_matrix
# confusion_matrix = confusion_matrix(y_test, y_preds_test)
# print(f"{confusion_matrix}, \n")
# sns.heatmap(confusion_matrix, annot=True, fmt='.0f')

#Visualizing y_pred in the dataset
y_preds_all = model_dl.predict(X_scaled)
churn_ds["Exited_predicted"] = y_preds_all.round().ravel().astype(int)  #the network outputs probabilities, so round them to 0/1 labels
churn_ds.to_excel("model_dl.xlsx")

# 10. Model Deployment

In [None]:
#Entering Xs

# num_prods = int(input("Enter the client´s number of Products: "))
# is_act = str(input("Is the client an active member (Yes/No)? "))
# if is_act == "No":
#     is_act = 0
# else:
#     is_act = 1
# age = float(input("Enter the client´s age: "))
# female = str(input("Enter the client gender (Male/Female): "))
# if female == "Male":
#     female = 0
# else:
#     female = 1
# balance = float(input("Enter the client´s balance: "))
# has_cr_cd = str(input("Does the client have a credit card (Yes/No)? "))
# if has_cr_cd == "No":
#     has_cr_cd = 0
# else:
#     has_cr_cd = 1
# estim_sal = float(input("Enter the client´s estimated salary: "))
# cred_score = int(input("Enter the client´s credit score: "))
# tenure = int(input("Enter the client´s tenure: "))
# bal_sal_prop= balance/estim_sal

#Defining Xs

# X_mod_dep = pd.DataFrame({"NumOfProducts":[num_prods], "IsActiveMember": [is_act], "Age": [age], 
#                           "Gender_Female": [female], "Balance": [balance], "HasCrCard": [has_cr_cd], 
#                           "BalanceSalaryProportion": [bal_sal_prop], "EstimatedSalary": [estim_sal], 
#                           "CreditScore": [cred_score], "Tenure": [tenure]})
#Choosing a specific client for testing:
X_mod_dep = pd.DataFrame({"NumOfProducts":[4], "IsActiveMember": [0], "Age": [29], 
                          "Gender_Female": [1], "Balance": [115046], "HasCrCard": [1], 
                          "BalanceSalaryProportion": [0.9639], "EstimatedSalary": [119346], 
                          "CreditScore": [376], "Tenure": [4]})

#Scaling X_mod_dep with the scaler already fitted in the Data Modelling step (sc_X),
#so the new client is transformed exactly like the data the model was trained on.
#Refitting the scaler here would change the scaling the model saw during training.

X_mod_dep = pd.DataFrame(sc_X.transform(X_mod_dep[X.columns]))

#Predicting results

prediction = model_xgbc.predict(X_mod_dep).round()
if prediction == 0:
    prediction_answer = "No"
else:
    prediction_answer = "Yes"

print("")
print(f"Is this client predicted to be at risk of leaving the bank? {prediction_answer}.")

# 11. Conclusions

In this project we went through the whole process: defining the business objective, collecting data, exploring features and distributions, preparing the data, understanding correlations, selecting relevant features, modelling the data, and comparing 7 different algorithms by their metrics to select the best one for predicting a customer's risk of leaving the bank. This is crucial for the bank, since it lets us start taking measures to retain clients and protect the client portfolio. The chosen model was XGBoost, with around 83% accuracy.