# Predict Churning Customers
**Kaan Akkartal**


# 1. Library and Data Loading

In [None]:
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib import rcParams

In [None]:
bank_churners = pd.read_csv("../input/credit-card-customers/BankChurners.csv")

In [None]:
bank_churners.head()

In [None]:
df = bank_churners.copy()
df.head()

In [None]:
df = df.iloc[:,:-2]
df.head()

Following the data owner's instructions, the last two columns are dropped.
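
Positional slicing such as `iloc[:, :-2]` removes the wrong columns if the file layout ever changes. A minimal sketch of dropping by name instead, on a toy frame: the `Naive_Bayes_Classifier` prefix and the toy column names are assumptions and should be checked against `df.columns` before use.

```python
import pandas as pd

# Toy frame standing in for the raw data; column names are illustrative.
raw = pd.DataFrame({
    "CLIENTNUM": [1, 2],
    "Customer_Age": [45, 52],
    "Naive_Bayes_Classifier_1": [0.1, 0.9],
    "Naive_Bayes_Classifier_2": [0.9, 0.1],
})

# Drop every column whose name starts with the (assumed) prefix.
to_drop = [c for c in raw.columns if c.startswith("Naive_Bayes_Classifier")]
cleaned = raw.drop(columns=to_drop)
print(list(cleaned.columns))
```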

# 2. Descriptive Data Analysis

In [None]:
df.info()

There are both categorical and numerical variables in the dataset. 

In [None]:
df_features = list(df.dtypes.index)
df_features

In [None]:
df.describe()

# 2.1. Missing Value Analysis

Some categorical features contain the value "Unknown". These entries need to be found and handled before further analysis.

In [None]:
# Plot the value counts of every categorical (object) column
for col in df_features:
    if df[col].dtype == "object":
        df[col].value_counts().plot.barh()
        plt.title(col)
        plt.show()

Education_Level, Marital_Status and Income_Category contain "Unknown" values. These will be replaced with missing values (NaN).

In [None]:
df.loc[df.Marital_Status == "Unknown"].Marital_Status.index

In [None]:
# Select rows with a boolean mask and the column by name instead of by position
df.loc[df.Marital_Status == "Unknown", "Marital_Status"] = np.nan

In [None]:
df.loc[df.Education_Level == "Unknown", "Education_Level"] = np.nan

In [None]:
df.loc[df.Income_Category == "Unknown", "Income_Category"] = np.nan

In [None]:
df.isnull().sum()

The "Unknown" entries are now stored as missing values.
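
The same replacement can be done for all three columns in one step. A small sketch on a toy frame (the rows are made up for illustration), using `DataFrame.replace`:

```python
import numpy as np
import pandas as pd

# Toy rows; only the column names mirror the dataset.
toy = pd.DataFrame({
    "Marital_Status": ["Married", "Unknown", "Single"],
    "Education_Level": ["Unknown", "Graduate", "Graduate"],
    "Income_Category": ["Less than $40K", "Unknown", "Unknown"],
})

cols = ["Marital_Status", "Education_Level", "Income_Category"]
# Turn the "Unknown" placeholder into a real missing value.
toy[cols] = toy[cols].replace("Unknown", np.nan)
missing_counts = toy.isnull().sum()
print(missing_counts)
```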

In [None]:
df.Education_Level.isnull().sum()/len(df)

In [None]:
df.Marital_Status.isnull().sum()/len(df)

In [None]:
df.Income_Category.isnull().sum()/len(df)           

In each of the three features, missing values make up less than 15% of the observations.
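
The three ratios can also be read off in one call, since the mean of a boolean mask is the share of `True` values. A short sketch on made-up data:

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value in four rows.
toy = pd.DataFrame({
    "Education_Level": ["Graduate", np.nan, "College", "Graduate"],
    "Marital_Status": ["Married", "Single", "Married", "Single"],
})

# Share of missing values per column.
missing_share = toy.isnull().mean()
print(missing_share)
```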

In [None]:
import missingno as msno

In [None]:
msno.matrix(df);

In [None]:
msno.heatmap(df);

The heatmap shows no correlation between the missing values of different features, which suggests they are distributed randomly. They can therefore either be removed or replaced with the mode of the feature they belong to.
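
The quantity behind such a heatmap is the correlation between missingness indicators. A minimal sketch that computes it directly with pandas, on a toy frame constructed so that missingness in the two columns is unrelated; a value near zero only says that one gap does not predict the other, not why the values are missing.

```python
import numpy as np
import pandas as pd

# Toy frame: "a" is missing in rows 0-3, "b" in rows 0, 1, 4 and 5.
toy = pd.DataFrame({
    "a": [np.nan, np.nan, np.nan, np.nan, 1.0, 2.0, 3.0, 4.0],
    "b": [np.nan, np.nan, 1.0, 2.0, np.nan, np.nan, 3.0, 4.0],
})

# 1 where a value is missing, 0 otherwise.
nullity = toy.isnull().astype(int)
nullity_corr = nullity.corr()
print(nullity_corr)
```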

In [None]:
df["Marital_Status"] = df["Marital_Status"].fillna(df.Marital_Status.mode()[0])

In [None]:
df["Education_Level"] = df["Education_Level"].fillna(df.Education_Level.mode()[0])

In [None]:
df["Income_Category"] = df["Income_Category"].fillna(df.Income_Category.mode()[0])

In [None]:
df.isnull().sum()

The mode values have been assigned, so the dataset no longer contains any missing values.
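
The three `fillna` cells can be collapsed into one step by passing a Series of per-column modes. A sketch on toy data (values are illustrative only):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Marital_Status": ["Married", np.nan, "Married", "Single"],
    "Income_Category": ["Less than $40K", "$60K - $80K", np.nan, "Less than $40K"],
})

cols = ["Marital_Status", "Income_Category"]
# mode() ignores NaN; the first row holds the most frequent value per column.
modes = toy[cols].mode().iloc[0]
toy[cols] = toy[cols].fillna(modes)
print(toy)
```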

In [None]:
df.head()

# 2.2. Data Visualization

In [None]:
df.info()

The variables are visualized against the target variable "Attrition_Flag".
Numerical and categorical features are listed separately for easier inspection.
The numerical variables are split into two groups to keep the boxplots readable.
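
As an alternative to looping over dtypes, pandas can split the columns by type directly. A minimal sketch with `select_dtypes` on a toy frame:

```python
import pandas as pd

toy = pd.DataFrame({
    "Customer_Age": [45, 52],
    "Credit_Limit": [1200.5, 3400.0],
    "Gender": ["M", "F"],
})

# Numerical columns, and everything else as the categorical group.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()
print(numeric_cols, categorical_cols)
```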

In [None]:
# Collect the names of all non-object (numerical) columns
df_features_numerics = [col for col in df_features if df[col].dtype != "object"]
print(df_features_numerics)

In [None]:
# Skip index 0 (CLIENTNUM, an identifier) and split the rest into two groups
df_features_numerics_1 = df_features_numerics[1:8]
df_features_numerics_2 = df_features_numerics[8:]
print(df_features_numerics_1)
print(df_features_numerics_2)

In [None]:
# One boxplot per numerical feature, split by the target variable
fig, ax = plt.subplots(ncols = len(df_features_numerics_1), figsize = (25,8))
for axis, col in zip(ax, df_features_numerics_1):
    sns.boxplot(x = "Attrition_Flag", y = col, data = df, ax = axis)

In [None]:
fig, ax = plt.subplots(ncols = len(df_features_numerics_2), figsize = (30,10))
for axis, col in zip(ax, df_features_numerics_2):
    sns.boxplot(x = "Attrition_Flag", y = col, data = df, ax = axis)

The "Customer_Age", "Months_on_book" and "Avg_Open_To_Buy" variables do not appear to have a strong effect on "Attrition_Flag".

In [None]:
# Collect the names of all object (categorical) columns
df_features_objects = [col for col in df_features if df[col].dtype == "object"]
print(df_features_objects)

In [None]:
# One count plot per categorical feature, grouped by the target variable
fig, ax = plt.subplots(ncols = len(df_features_objects), figsize = (20,5))
for axis, col in zip(ax, df_features_objects):
    sns.countplot(x = "Attrition_Flag", hue = col, data = df, ax = axis)

There are roughly five times as many existing customers as attrited ones.
The distributions of the categorical variables look similar for existing and attrited customers.
Graduates, married customers, customers earning less than $40K and Blue card holders dominate the dataset.
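
Readings taken from the count plots can be checked numerically. A sketch on a made-up six-row frame: `value_counts(normalize=True)` gives the class shares, and a row-normalized `crosstab` compares a categorical feature across the two classes.

```python
import pandas as pd

# Toy data: five existing customers and one attrited customer.
toy = pd.DataFrame({
    "Attrition_Flag": ["Existing Customer"] * 5 + ["Attrited Customer"],
    "Gender": ["M", "F", "F", "M", "F", "F"],
})

# Share of each class in the target.
class_share = toy["Attrition_Flag"].value_counts(normalize=True)

# Distribution of Gender within each class (each row sums to 1).
gender_by_class = pd.crosstab(toy["Attrition_Flag"], toy["Gender"], normalize="index")
print(class_share)
print(gender_by_class)
```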

# 2.3. Correlation

In [None]:
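rcParams['figure.figsize'] = 16,6.0
# Restrict to numerical columns; correlations are undefined for the object columns
sns.heatmap(df.select_dtypes(include = "number").corr(), vmin = -1, vmax = 1, annot = True);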

**The variables below are highly correlated to each other:**

Total_Revolving_Bal and Avg_Utilization_Ratio

Months_on_book and Customer_Age

Credit_Limit and Avg_Utilization_Ratio

Avg_Open_To_Buy and Avg_Utilization_Ratio 

Months_on_book and Avg_Utilization_Ratio will therefore be excluded from modelling.
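
Instead of reading pairs off the heatmap, they can be listed programmatically. A minimal sketch on synthetic data; the 0.7 cutoff is an arbitrary illustrative threshold, not a value taken from this analysis.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=500)
toy = pd.DataFrame({
    "x": x,
    "x_noisy": x + rng.normal(scale=0.1, size=500),  # strongly tied to x
    "z": rng.normal(size=500),                       # independent of x
})

corr = toy.corr().abs()
# Keep only the upper triangle so that each pair appears once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
high_pairs = pairs[pairs > 0.7]
print(high_pairs)
```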

# 2.4. Data Preparation

In [None]:
df.info()

In [None]:
df_features_objects

In [None]:
df.Attrition_Flag = pd.Categorical(df.Attrition_Flag)  
df.Gender = pd.Categorical(df.Gender)  
df.Education_Level = pd.Categorical(df.Education_Level)  
df.Marital_Status = pd.Categorical(df.Marital_Status)  
df.Income_Category = pd.Categorical(df.Income_Category)  
df.Card_Category = pd.Categorical(df.Card_Category)  

In [None]:
df.info()

The object columns are converted to the category type for modelling.
Features and labels are separated into df_X and df_Y.
CLIENTNUM is an identifier with no effect on the target, and the Avg_Utilization_Ratio and Months_on_book features are correlated with other features, so all three are excluded from modelling. Attrition_Flag is the target variable.

In [None]:
df_X = df.drop(columns = ["Attrition_Flag", "CLIENTNUM", "Avg_Utilization_Ratio", "Months_on_book"])
df_X

In [None]:
df_Y = df.Attrition_Flag
df_Y

In [None]:
df_X = pd.get_dummies(df_X, drop_first = True)
df_X

In [None]:
df_X.info()

In [None]:
df_Y = pd.get_dummies(df_Y, drop_first = True)
df_Y

Existing Customer --> 1

Attrited Customer --> 0 

One-hot encoding is applied to the features. The goal is to predict attrited customers, which are encoded as "0" in the target variable.
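
A small sketch of what `drop_first = True` does, on toy data: the alphabetically first level of each column is dropped, which is why "Attrited Customer" becomes the implicit "0" in the target.

```python
import pandas as pd

toy = pd.DataFrame({
    "Gender": ["M", "F", "M"],
    "Card_Category": ["Blue", "Silver", "Blue"],
})
encoded = pd.get_dummies(toy, drop_first=True)
feature_columns = encoded.columns.tolist()

# Same mechanism on the target: only the "Existing Customer" column remains.
target = pd.Series(["Existing Customer", "Attrited Customer", "Existing Customer"])
target_columns = pd.get_dummies(target, drop_first=True).columns.tolist()
print(feature_columns, target_columns)
```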

In [None]:
from sklearn.model_selection import train_test_split
df_X_train, df_X_test, df_Y_train, df_Y_test = train_test_split(df_X , df_Y, test_size = 0.40)
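
Because the classes are imbalanced and the split above is unseeded, results can vary from run to run. A hedged sketch on synthetic data of a reproducible, stratified split; `stratify` and `random_state = 42` are additions for illustration and are not part of the original cell.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 20 samples of class 0 and 80 of class 1.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 20 + [1] * 80)

# stratify keeps the class ratio identical in both parts;
# random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.40, stratify=y, random_state=42
)
print(y_train.mean(), y_test.mean())
```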

# 3. Machine Learning

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from lightgbm import LGBMClassifier

log_model = LogisticRegression().fit(df_X_train, df_Y_train)
knn_model = KNeighborsClassifier().fit(df_X_train, df_Y_train)
rf_model = RandomForestClassifier().fit(df_X_train, df_Y_train)
lgbm_model = LGBMClassifier().fit(df_X_train, df_Y_train)


In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
log_model_score = accuracy_score(df_Y_test, log_model.predict(df_X_test)) 
knn_model_score = accuracy_score(df_Y_test, knn_model.predict(df_X_test)) 
rf_model_score = accuracy_score(df_Y_test, rf_model.predict(df_X_test)) 
lgbm_model_score = accuracy_score(df_Y_test, lgbm_model.predict(df_X_test)) 

print("Logistic regression accuracy score is: " , log_model_score)
print("KNN accuracy score is: " , knn_model_score)
print("Random forest accuracy score is: " , rf_model_score)
print("LGBM accuracy score is: " , lgbm_model_score)

Logistic Regression, KNN, Random Forest and LightGBM are applied. The tree-based models achieve higher accuracy scores, and LGBM is the best of the four.

In [None]:
print(classification_report(df_Y_test, log_model.predict(df_X_test)))
print(classification_report(df_Y_test, knn_model.predict(df_X_test)))
print(classification_report(df_Y_test, rf_model.predict(df_X_test)))
print(classification_report(df_Y_test, lgbm_model.predict(df_X_test)))

LGBM shows the best performance in terms of both accuracy and recall, so it will be tuned to try to improve this result further.
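
Recall for the attrited class matters here because accuracy alone can look good on imbalanced data. A toy sketch: a model that always predicts "1" reaches 80% accuracy yet finds no attrited customer. `pos_label = 0` tells scikit-learn to compute recall for class "0".

```python
from sklearn.metrics import accuracy_score, recall_score

# Toy labels: eight existing customers (1) and two attrited customers (0).
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
# A useless model that always predicts "existing".
y_pred = [1] * 10

acc = accuracy_score(y_true, y_pred)
recall_attrited = recall_score(y_true, y_pred, pos_label=0)
print(acc, recall_attrited)
```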

**Model Tuning**

In [None]:
lgbm_grid = {"n_estimators": [50, 100, 500, 1000, 2000],
        'learning_rate': [0.5,0.1,0.01,0.02,0.05]}

In [None]:
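from lightgbm import LGBMRegressor
from sklearn.model_selection import GridSearchCV
# Note: a regressor is fit on the 0/1 target here, so the grid search ranks
# candidates by R^2 (the default score for regressors), not by recall.
lgbm = LGBMRegressor()
lgbm_cv_model = GridSearchCV(lgbm, lgbm_grid, cv=10, n_jobs = -1, verbose = 2)
lgbm_cv_model.fit(df_X_train, df_Y_train)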

In [None]:
lgbm_cv_model.best_params_

In [None]:
lgbm_tuned = LGBMRegressor(learning_rate = 0.02, n_estimators = 1000).fit(df_X_train, df_Y_train)

In [None]:
y_pred = lgbm_tuned.predict(df_X_test)

In [None]:
accuracy_score(df_Y_test, y_pred.round())

In [None]:
print(classification_report(df_Y_test, y_pred.round()))

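Attrited Customer is represented by "0".
Note that the tuned model is an LGBMRegressor whose predictions are rounded to 0/1, while the untuned model above is an LGBMClassifier.
The recall for class "0" does not change significantly after hyperparameter tuning.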
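
The cells above tune a regressor and round its output. One alternative is to tune a classifier directly and let the search optimize recall for class "0". The sketch below is not the notebook's method: it uses scikit-learn's `GradientBoostingClassifier` as a stand-in for LightGBM, synthetic data from `make_classification`, and a deliberately small grid, all of which are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import GridSearchCV

# Synthetic imbalanced data: about 20% of samples in class 0.
X, y = make_classification(
    n_samples=300, n_features=8, weights=[0.2, 0.8], random_state=0
)

# Score each candidate by recall on class 0 (the minority class).
recall_class_0 = make_scorer(recall_score, pos_label=0)

grid = {"n_estimators": [20, 50], "learning_rate": [0.05, 0.1]}
search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    grid,
    scoring=recall_class_0,
    cv=3,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

With a classifier, `predict` returns class labels directly, so no rounding step is needed before calling `classification_report`.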

**Feature Importance** 

In [None]:
Importance = pd.DataFrame({"Importance": lgbm_tuned.feature_importances_*100},
                         index = df_X_train.columns)

In [None]:
Importance.sort_values(by = "Importance", 
                       axis = 0, 
                       ascending = True).plot(kind ="barh", color = "r")

plt.xlabel("Feature Importance Levels")

The most important features are Total_Trans_Amt and Total_Trans_Ct. Numerical variables appear to affect customer churn more than categorical ones.
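
Tree-based importances like the ones plotted above reflect how the trees were built, so a model-agnostic cross-check can be useful. A minimal sketch of permutation importance with scikit-learn, on synthetic data where only the first feature drives the label; the random forest is a stand-in model, not the tuned LGBM from this notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = (X[:, 0] > 0).astype(int)  # only feature 0 determines the label

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0
)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

# Shuffle one feature at a time and measure the drop in test accuracy.
result = permutation_importance(model, X_test, y_test, n_repeats=5, random_state=0)
top_feature = int(np.argmax(result.importances_mean))
print(result.importances_mean, top_feature)
```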