# Introduction 

This notebook does not have many markdown cells explaining what's going on. I will get straight to the point, and if you need more information or explanation about certain things, Google or YouTube it :) 

Here we are working with a dataset of around 10,000 credit card customers. Our goal is to predict which customers are going to churn. 

This is a classification problem. We are also working with labeled data, so it is supervised learning. We are going to work with a Random Forest Classifier. 

NOTE: The code for using logistic regression and a support vector machine is included. It is just commented out because my computer can't fit/train those models in a reasonable amount of time. 

We are going to pre-process the data with make_column_transformer and split it into a train set and a test set using StratifiedKFold. Then we will use GridSearchCV to tune the hyperparameters of a Random Forest Classifier, fitting it on the training set. Finally, we will use the test set to put our model to work, evaluating it with accuracy_score and a confusion matrix. 

Enjoy :) 

In [None]:
import pandas as pd
import os
import numpy as np
np.set_printoptions(threshold=np.inf)
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.preprocessing import normalize

In [None]:
def load_dataset():
    csv_path = os.path.join("../input/credit-card-customers/BankChurners.csv")
    return pd.read_csv(csv_path)

bank_data = load_dataset()

In [None]:
bank_data = bank_data.drop(columns=[
    "Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1",
    "Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2"
])

The Attrition_Flag attribute will be the labels for the data. 

We will use OneHotEncoder() to change the categorical attributes to numerical types. 
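If one-hot encoding is new to you, here is a minimal sketch of what it does to a single categorical column. The `toy` frame and its values are made up for illustration; only the column name comes from the dataset.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data: the column name matches the dataset, the values are made up.
toy = pd.DataFrame({"Gender": ["M", "F", "F", "M"]})

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; toarray() makes it dense for printing.
encoded = encoder.fit_transform(toy).toarray()

print(encoder.categories_)  # categories learned per column, in sorted order
print(encoded)              # one column per category, exactly one 1.0 per row
```

Each category becomes its own 0/1 column, so the model never sees an artificial ordering between categories.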

In [None]:
X = bank_data.drop("Attrition_Flag", axis=1)
Y = bank_data["Attrition_Flag"]

In [None]:
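folds = StratifiedKFold(n_splits=3)

# Each iteration overwrites the previous split, so only the last fold is kept:
# roughly 2/3 of the rows for training and 1/3 for testing, with class ratios preserved.
for train_index, test_index in folds.split(X, Y):
    # split() yields positional indices, so use .iloc rather than .loc
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    Y_train, Y_test = Y.iloc[train_index], Y.iloc[test_index]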
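Since the loop keeps only its last fold, the result is effectively one stratified train/test split. A possible alternative, shown here only as a sketch on made-up labels and not what this notebook uses, is `train_test_split` with `stratify`. The `toy_` names and the 84/16 class mix are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 84 existing customers and 16 attrited customers.
toy_Y = pd.Series(["Existing Customer"] * 84 + ["Attrited Customer"] * 16)
toy_X = pd.DataFrame({"feature": np.arange(len(toy_Y))})

# stratify=toy_Y keeps the class proportions the same in both parts.
toy_X_train, toy_X_test, toy_Y_train, toy_Y_test = train_test_split(
    toy_X, toy_Y, test_size=0.25, stratify=toy_Y, random_state=0
)

print(toy_Y_train.value_counts(normalize=True))
print(toy_Y_test.value_counts(normalize=True))
```

Both parts end up with the same share of attrited customers, which matters when one class is much rarer than the other.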

In [None]:
cat_att = [
    "Gender",
    "Education_Level",
    "Marital_Status",
    "Income_Category",
    "Card_Category"
]

column_trans = make_column_transformer(
    (
        OneHotEncoder(), cat_att
    ),
    remainder="passthrough"
)

In [None]:
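# Fit the encoder on the training data only, then reuse it on the test data
# so both matrices share the same columns and the test set stays unseen.
final_X_train = column_trans.fit_transform(X_train)
final_X_test = column_trans.transform(X_test)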
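One thing to watch with one-hot encoding is which data the encoder is fit on. The sketch below, on made-up values, shows how an encoder refit on the test part can produce a different column layout than the one the model was trained on.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up data: the test part happens to contain no "Platinum" or "Silver" card.
toy_train = pd.DataFrame({"Card_Category": ["Blue", "Silver", "Gold", "Platinum"]})
toy_test = pd.DataFrame({"Card_Category": ["Blue", "Gold"]})

encoder = OneHotEncoder()
train_encoded = encoder.fit_transform(toy_train)

# Reusing the fitted encoder keeps the train column layout.
test_reused = encoder.transform(toy_test)

# Refitting on the test part learns only the categories seen there.
test_refit = OneHotEncoder().fit_transform(toy_test)

print(train_encoded.shape, test_reused.shape, test_refit.shape)
```

The refit matrix has fewer columns, so a model trained on the train layout could not use it, or would silently read the wrong columns if the counts happened to match.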

In [None]:
# Encode the labels as integers: 0 = existing customer, 1 = attrited (churned) customer.
cat_to_num = {"Existing Customer": 0, "Attrited Customer": 1}

# Map the Series directly so the labels stay 1-D, which is the shape scikit-learn expects.
final_Y_train = Y_train.map(cat_to_num).to_numpy()
final_Y_test = Y_test.map(cat_to_num).to_numpy()

We will be tuning the n_estimators, class_weight, max_features, max_depth, and min_samples_split parameters of the Random Forest Classifier. 

Then we use GridSearchCV to, put simply, find the combination of those parameters that gives us the best MEAN cross-validation score on the train set. 

## Random Forest Classifier 

We will first define the hyperparameters that we want to tune. Some of the parameters are commented out. That is just because it takes FOREVER to train the model because of the way GridSearchCV works. Basically, GridSearchCV is an exhaustive search technique: it tries every combination of the different parameter values. So as you can tell, the number of combinations can get super high, and I am not planning on putting my laptop through that lol. 

In [None]:
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]

# "auto" meant the same as "sqrt" for this classifier and was removed in newer scikit-learn releases.
# rfc_max_features = [
#     "sqrt", 
#     "log2"
# ]

# rfc_class_weight = [
#     "balanced",
#     "balanced_subsample"
# ]

rfc_max_depth = [10, 20]

rfc_min_sample_split = [2, 5]


rfc_params = {
    'n_estimators': n_estimators,
    # "class_weight": rfc_class_weight,
    # "max_features": rfc_max_features,
    "max_depth": rfc_max_depth,
    "min_samples_split": rfc_min_sample_split
    }

In [None]:
rfc_gs = GridSearchCV(
    estimator=RandomForestClassifier(),
    param_grid=rfc_params,
    cv=5
)
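To get a feel for how much work this search is, the number of model fits can be counted before running anything. This is a small sketch using scikit-learn's `ParameterGrid`; the `grid` dict simply repeats the active parameters from the cells above so the snippet runs on its own.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Same active parameters as above, rebuilt here so the snippet is self-contained.
grid = {
    "n_estimators": [int(x) for x in np.linspace(start=200, stop=2000, num=10)],
    "max_depth": [10, 20],
    "min_samples_split": [2, 5],
}

n_combinations = len(ParameterGrid(grid))  # 10 * 2 * 2
cv_folds = 5
n_fits = n_combinations * cv_folds         # one fit per combination per fold

print(n_combinations, n_fits)
```

That is 40 combinations and 200 fits, plus one final refit on the whole train set. Every extra parameter list multiplies the count by its length, which is why the commented-out ones get expensive quickly.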

The next four (4) cells contain similar code to prepare the hyperparameters for tuning. The first two (2) cells are for a support vector machine (SVM) model and the two (2) after that are for a logistic regression model. I commented them out because of, again, computation time. But you can simply uncomment them and use them just like I used the Random Forest Classifier. :)

In [None]:
# svm_C = [1, 10, 20]
# kernel = ["rbf", "linear"]

# svm_param = {
#     "C": svm_C,
#     "kernel": kernel
# }

In [None]:
# svm_gs = GridSearchCV(
#     estimator= svm.SVC(),
#     param_grid= svm_param,
#     cv= 5
# )

In [None]:
# lr_C = [1, 10, 20]
# solver = ["newton-cg", "lbfgs", "liblinear", "sag", "saga"]

# lr_param = {
#     "C": lr_C,
#     "solver": solver
# }

In [None]:
# lr_gs = GridSearchCV(
#     estimator= LogisticRegression(),
#     param_grid= lr_param,
#     cv= 5
# )
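A side note on the slow models: SVMs and logistic regression are sensitive to feature scale, and unscaled features are a common reason they converge slowly. Whether that is the cause here is untested. The sketch below, on synthetic stand-in data, shows one way to put scaling inside the grid search with a pipeline. `make_pipeline` and `StandardScaler` are not used elsewhere in this notebook, and all the names in the snippet are made up for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data so the snippet runs without the CSV.
toy_X, toy_y = make_classification(n_samples=300, n_features=10, random_state=0)

# Scaling lives inside the pipeline, so each CV fold is scaled from its own training part.
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Pipeline parameters are addressed as <step name>__<parameter>.
scaled_lr_gs = GridSearchCV(
    estimator=scaled_lr,
    param_grid={"logisticregression__C": [1, 10, 20]},
    cv=5,
)
scaled_lr_gs.fit(toy_X, toy_y)

print(scaled_lr_gs.best_params_)
```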

## Fitting the data 

In [None]:
rfc_gs.fit(final_X_train, final_Y_train)
# svm_gs.fit(final_X_train, final_Y_train)
# lr_gs.fit(final_X_train, final_Y_train)

After fitting the train data, we can see the results of using the different combinations of parameters. 

In [None]:
rfc_gs_results = pd.DataFrame(rfc_gs.cv_results_)
# svm_gs_results = pd.DataFrame(svm_gs.cv_results_)
# lr_gs_results = pd.DataFrame(lr_gs.cv_results_)

In [None]:
rfc_gs_results = rfc_gs_results[[
    "param_n_estimators",
    # "param_class_weight",
    # "param_max_features",
    "param_max_depth",
    "param_min_samples_split",
    "mean_test_score"
]]
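With many combinations, the results table is easier to read when sorted. Here is a small sketch on synthetic data with a tiny grid (all `toy_` names and values are made up) that sorts by the `rank_test_score` column GridSearchCV adds to `cv_results_`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data so the snippet runs without the CSV.
toy_X, toy_y = make_classification(n_samples=200, n_features=8, random_state=0)

toy_gs = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [10, 20], "max_depth": [2, 4]},
    cv=3,
)
toy_gs.fit(toy_X, toy_y)

toy_results = pd.DataFrame(toy_gs.cv_results_)
# rank_test_score == 1 marks the combination with the highest mean_test_score.
top = toy_results.sort_values("rank_test_score")[
    ["param_n_estimators", "param_max_depth", "mean_test_score", "std_test_score"]
]
print(top.head())
```

Looking at `std_test_score` next to the mean helps tell a real difference between combinations apart from fold-to-fold noise.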

In [None]:
# svm_gs_results = svm_gs_results[[
#     "param_kernel",
#     "param_C",
#     "mean_test_score"
# ]]

In [None]:
# lr_gs_results = lr_gs_results[[
#     "param_solver",
#     "param_C",
#     "mean_test_score"
# ]]

## Predicting the values 

Here we are going to predict the labels using the test set.

Then we use accuracy_score and the confusion matrix to evaluate the model's performance. 

In [None]:
rfc_Y_pred = rfc_gs.predict(final_X_test)
# svm_Y_pred = svm_gs.predict(final_X_test)
# lr_Y_pred = lr_gs.predict(final_X_test)

In [None]:
rfc_score = accuracy_score(final_Y_test, rfc_Y_pred)
# svm_score = accuracy_score(final_Y_test, svm_Y_pred)
# lr_score = accuracy_score(final_Y_test, lr_Y_pred)

In [None]:
# Rows are true labels, columns are predicted labels; .ravel() later yields (tn, fp, fn, tp).
rfc_conf_matrix = confusion_matrix(final_Y_test, rfc_Y_pred)
# svm_conf_matrix = confusion_matrix(final_Y_test, svm_Y_pred)
# lr_conf_matrix = confusion_matrix(final_Y_test, lr_Y_pred)

Here are the results of our model. 

We are printing the model name, the model's best parameters (GridSearchCV exposes them through its best_params_ attribute), the model's accuracy score, and the model's confusion matrix. 

The confusion matrix is printed in the form (true negative, false positive, false negative, true positive). 
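To see where that ordering comes from, here is a tiny sketch with made-up labels (0 = existing customer, 1 = attrited customer) showing the 2x2 matrix and what `.ravel()` does to it.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: 0 = existing customer, 1 = attrited customer.
y_true = [0, 0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0, 0]

matrix = confusion_matrix(y_true, y_pred)
print(matrix)
# Rows are true classes, columns are predicted classes:
# [[tn fp]
#  [fn tp]]

# ravel() flattens the matrix row by row.
tn, fp, fn, tp = matrix.ravel()
print(tn, fp, fn, tp)
```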

In [None]:
models = [
    "Random Forest Classifier",
    # "Support Vector Machines",
    # "Logistic Regression"
]

best_params = [
    rfc_gs.best_params_,
    # svm_gs.best_params_,
    # lr_gs.best_params_
]

acc_scores = [
    rfc_score,
    # svm_score,
    # lr_score
]

conf_matrices = [
    rfc_conf_matrix,
    # svm_conf_matrix,
    # lr_conf_matrix
]

# Loop over whichever models are enabled in the lists above.
for name, params, score, matrix in zip(models, best_params, acc_scores, conf_matrices):
    print("Model: {}".format(name))
    print("Best Params: {}".format(params))
    print("Accuracy Score: {}".format(score))
    tn, fp, fn, tp = matrix.ravel()
    print("Confusion Matrix: {}".format((tn, fp, fn, tp)))
    print("\n")

# Conclusion


At the beginning these were my results: 

Model: Random Forest Classifier
Best Params: {'min_samples_leaf': 1, 'n_estimators': 1200}
Accuracy Score: 0.9013333333333333
Confusion Matrix: (2823, 10, 323, 219)


I think we have improved very slightly (accuracy went from 0.9013 to 0.9016). We now have:

Model: Random Forest Classifier
Best Params: {'max_depth': 20, 'min_samples_split': 2, 'n_estimators': 1200}
Accuracy Score: 0.9016296296296297
Confusion Matrix: (2825, 8, 324, 218)
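Accuracy alone can hide a lot on an imbalanced dataset like this one, so here is a quick sketch that derives a few more numbers from the counts reported just above. Nothing new is trained; it is plain arithmetic on that confusion matrix.

```python
# Counts copied from the confusion matrix reported just above.
tn, fp, fn, tp = 2825, 8, 324, 218

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
# Share of attrited customers the model actually caught.
recall = tp / (tp + fn)
# Share of predicted attritions that were real.
precision = tp / (tp + fp)
# Accuracy of always predicting "Existing Customer".
baseline = (tn + fp) / total

print(f"accuracy={accuracy:.4f} recall={recall:.4f} "
      f"precision={precision:.4f} baseline={baseline:.4f}")
```

By these counts the model catches about 40% of the attrited customers, while always guessing "Existing Customer" would already score about 84% accuracy. Options like the commented-out class_weight might be worth trying if catching churners is the priority.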



That's all, hopefully someone found this useful lol
 


Thank you :) 