- Customer churn prediction is a binary classification problem that can be solved with supervised learning.
- Let's apply the major supervised learning algorithms and compare the results.
   - Logistic Regression
   - KNN
   - SVM
   - Decision Tree
   - Random Forest
   - AdaBoost
   - XGBoost
   - LightGBM
   - CatBoost
   - Neural Network
- The data is imbalanced, so apply rebalancing methods and compare their results with those of baseline models.
   - Baseline models: Imbalanced data
   - Random Oversampling
   - SMOTE
   - Borderline-SMOTE
   - Borderline-SMOTE SVM
   - ADASYN
   - SMOTE-TomekLinks
   - SMOTE-ENN
- Performance measure
   - AUC score
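
For reference, the AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it directly from that definition. The function name `auc_from_scores` and the toy labels and scores are illustrative assumptions, not part of this notebook.

```python
import numpy as np

def auc_from_scores(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # Compare every positive score with every negative score
    diff = pos[:, None] - neg[None, :]
    n_pairs = pos.size * neg.size
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / n_pairs)

# Toy example (hypothetical data): 3 of the 4 positive/negative pairs are ordered correctly
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = auc_from_scores(y_true, scores)
print(auc)
```

Because the measure depends on how examples are ranked, it is usually computed from continuous scores such as predicted probabilities rather than from hard 0/1 labels.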

## Overview of the Resampling Methods

### Oversampling Methods

#### Random Oversampling
- It randomly duplicates minority class examples and adds them to the training set.
- Since it is based on simple duplication, it does not provide any additional information to the model.
- Examples are drawn with replacement, so the same example can be copied many times, which makes overfitting more likely.
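
The idea can be sketched in a few lines of NumPy. This is a minimal illustration, assuming binary labels and a minority class that is smaller than the majority class. The helper `random_oversample` and the toy arrays are hypothetical and are not the `RandomOverSampler` implementation.

```python
import numpy as np

def random_oversample(X, y, minority_label=1, seed=0):
    """Duplicate randomly chosen minority rows (with replacement) until the classes are balanced."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X), np.asarray(y)
    minority_idx = np.flatnonzero(y == minority_label)
    # Number of extra minority rows needed to match the majority class
    n_extra = int((y != minority_label).sum() - minority_idx.size)
    extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)
    return np.vstack([X, X[extra_idx]]), np.concatenate([y, y[extra_idx]])

# Toy data (hypothetical): 4 majority rows, 2 minority rows
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
y = np.array([0, 0, 0, 0, 1, 1])
X_res, y_res = random_oversample(X, y)
print(np.bincount(y_res))  # class counts after resampling
```

Every added row is an exact copy of an existing minority row, which is why no new information reaches the model.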

#### SMOTE (Synthetic Minority Oversampling Technique)
- Instead of simply duplicating existing minority class examples, SMOTE oversamples the minority class by generating synthetic data.
- New data is synthesized based on feature space similarity between existing minority class examples.
- It randomly selects a minority class example and generates new minority class examples by interpolating between it and its k nearest minority class neighbors.
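
The interpolation step can be sketched as follows. This is a simplified illustration under stated assumptions (Euclidean distance, numeric features only, `k` smaller than the number of minority examples). The function `smote_sketch` is hypothetical and is not the imbalanced-learn implementation.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority points and their neighbors."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    # Pairwise Euclidean distances within the minority class
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)            # a point is not its own neighbor
    neighbors = np.argsort(dist, axis=1)[:, :k]
    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_min))        # randomly selected minority point
        nb = neighbors[base, rng.integers(k)]  # one of its k nearest minority neighbors
        gap = rng.random()                     # interpolation factor in [0, 1)
        synthetic[i] = X_min[base] + gap * (X_min[nb] - X_min[base])
    return synthetic

# Toy minority class (hypothetical): corners of a unit square
X_min = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0], [2.0, 2.0]])
X_new = smote_sketch(X_min, n_new=5)
print(X_new)
```

Each synthetic point lies on the line segment between two existing minority points, so it is new data rather than a copy.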

#### Borderline-SMOTE
- An extension of SMOTE
- While SMOTE randomly selects minority datapoints for synthesizing, Borderline-SMOTE selects them along the decision boundary between the classes.
- It deals with the datapoints that are likely to be misclassified.
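
One common way to find such borderline points is to look at the `m` nearest neighbors of each minority point in the whole training set: a point is treated as borderline ("in danger") when at least half, but not all, of those neighbors belong to the majority class. The sketch below follows that rule. The helper `danger_points`, the choice `m=3`, and the toy data are assumptions made for illustration.

```python
import numpy as np

def danger_points(X, y, minority_label=1, m=3):
    """Indices of minority points whose m nearest neighbors are at least half, but not all, majority."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # exclude the point itself
    danger = []
    for i in np.flatnonzero(y == minority_label):
        nn = np.argsort(dist[i])[:m]
        n_majority = int((y[nn] != minority_label).sum())
        # All-majority neighborhoods are treated as noise, mostly-minority ones as safe
        if m / 2 <= n_majority < m:
            danger.append(int(i))
    return danger

# Toy data on a line (hypothetical): index 4 sits next to the majority class,
# index 8 is an isolated minority point surrounded by majority points
X = np.array([[0.0], [1.0], [2.0], [3.0], [3.4], [5.0], [5.5], [6.0], [0.5]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1])
borderline = danger_points(X, y)
print(borderline)
```

Only the points returned here would then be used as seeds for SMOTE-style interpolation.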

#### Borderline-SMOTE SVM
- A variation of Borderline-SMOTE
- It uses SVM to approximately identify the borderline.
- Then it randomly creates synthetic data along the borderline.
- Datapoints far from the borderline are synthesized preferentially.

#### ADASYN (Adaptive Synthetic Sampling)
- A variation of SMOTE
- It oversamples the minority class based on the data density distribution.
- It generates more synthetic data in areas where the minority class is less dense.
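
The density-based allocation can be sketched as follows: for each minority point, take the share of majority examples among its `k` nearest neighbors, normalize these shares so they sum to one, and use them to split the number of synthetic points to create. The helper `adasyn_allocation`, the choice `k=3`, and the toy data are illustrative assumptions. The sketch also assumes that at least one minority point has a majority neighbor.

```python
import numpy as np

def adasyn_allocation(X, y, minority_label=1, k=3):
    """Return minority indices, normalized weights, and synthetic points to create per minority point."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # exclude the point itself
    min_idx = np.flatnonzero(y == minority_label)
    # Share of majority examples among the k nearest neighbors of each minority point
    ratio = np.array([(y[np.argsort(dist[i])[:k]] != minority_label).mean() for i in min_idx])
    weights = ratio / ratio.sum()
    # Total number of synthetic points needed to balance the two classes
    n_to_generate = int((y != minority_label).sum() - min_idx.size)
    allocation = np.rint(weights * n_to_generate).astype(int)
    return min_idx, weights, allocation

# Toy data on a line (hypothetical): 6 majority points, 4 minority points
X = np.array([[-2.0], [-1.0], [0.0], [1.0], [2.0], [3.0], [3.4], [5.0], [5.5], [6.0]])
y = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
min_idx, weights, allocation = adasyn_allocation(X, y)
print(min_idx, weights, allocation)
```

In this toy case only the minority point at 3.4 has majority neighbors, so it receives all of the synthetic points.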

### Hybrid Methods

These methods combine undersampling and oversampling.

#### SMOTE-TomekLinks
- A Tomek link is a pair of examples that are each other's nearest neighbors but belong to different classes.
- By removing one or both examples of each pair, we can make the decision boundary clearer and less noisy.
- SMOTE-TomekLinks oversamples the minority class with SMOTE and then removes the majority class examples that are part of Tomek links.
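
Finding Tomek links amounts to looking for mutual nearest neighbors with different labels. The sketch below shows this on a toy dataset. The helper `tomek_links` and the data are hypothetical, and Euclidean distance is assumed.

```python
import numpy as np

def tomek_links(X, y):
    """Return pairs (i, j), i < j, that are mutual nearest neighbors with different labels."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # exclude the point itself
    nearest = dist.argmin(axis=1)
    return [(i, int(j)) for i, j in enumerate(nearest)
            if i < j and nearest[j] == i and y[i] != y[j]]

# Toy data on a line (hypothetical): the points at 2.8 (majority) and 3.0 (minority) form a link
X = np.array([[0.0], [1.0], [2.8], [3.0], [5.0], [5.5]])
y = np.array([0, 0, 0, 1, 1, 1])
links = tomek_links(X, y)
print(links)

# Keep everything except the majority member of each link
majority_in_links = [i if y[i] == 0 else j for i, j in links]
keep = np.setdiff1d(np.arange(len(y)), majority_in_links)
print(keep)
```

Dropping only the majority member of each link corresponds to the `sampling_strategy='majority'` setting of `TomekLinks`.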

#### SMOTE-ENN (SMOTE Edited Nearest Neighbors)
- ENN is an undersampling method that identifies and removes examples misclassified by KNN (k=3).
- ENN can be applied to all classes or only to examples in the majority class.
- SMOTE-ENN oversamples the minority class with SMOTE and then removes examples with ENN.
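
The editing rule can be sketched as follows, assuming a majority vote over the `k=3` nearest neighbors, Euclidean distance, and non-negative integer labels. The helper `enn_keep_mask` and the toy data are hypothetical. Library implementations may use a stricter rule, such as requiring all neighbors to agree.

```python
import numpy as np

def enn_keep_mask(X, y, k=3):
    """True for points whose label agrees with the majority vote of their k nearest neighbors."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # exclude the point itself
    keep = np.ones(len(y), dtype=bool)
    for i in range(len(y)):
        nn = np.argsort(dist[i])[:k]
        votes = np.bincount(y[nn], minlength=2)
        keep[i] = votes.argmax() == y[i]
    return keep

# Toy data on a line (hypothetical): index 3 is a minority point inside the majority region
X = np.array([[0.0], [1.0], [2.0], [1.4], [5.0], [5.5], [6.0], [6.5]])
y = np.array([0, 0, 0, 1, 1, 1, 1, 1])
keep = enn_keep_mask(X, y)
print(keep)
X_clean, y_clean = X[keep], y[keep]
```

Here the rule is applied to all classes; restricting the removal to majority examples would only require an extra check on `y[i]`.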

# 1. Import Libraries

In [None]:
# Import libraries for data manipulation and preprocessing
import pandas as pd
import numpy as np

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Import libraries for resampling
from imblearn.over_sampling import RandomOverSampler
from imblearn.over_sampling import SMOTE
from imblearn.over_sampling import BorderlineSMOTE
from imblearn.over_sampling import SVMSMOTE
from imblearn.over_sampling import ADASYN
from imblearn.combine import SMOTETomek
from imblearn.under_sampling import TomekLinks
from imblearn.combine import SMOTEENN

In [None]:
# Import libraries for classification
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
from sklearn.neural_network import MLPClassifier

from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score

# Import libraries for visualization and set the display
import seaborn as sns
import matplotlib.pyplot as plt
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
sns.set(color_codes=True)
%matplotlib inline
%config InlineBackend.figure_formats = {'png', 'retina'}

In [None]:
# Suppressing Warnings
import warnings
warnings.filterwarnings('ignore')

# 2. Load and Explore Data

In [None]:
# Load the dataset
df=pd.read_csv('../input/credit-card-customers/BankChurners.csv')

# Show the shape and the first 5 rows
print('Shape of the data', df.shape)
df.head()

In [None]:
# Drop "CLIENTNUM", which is an identifier and not needed for prediction,
# and the last two "Naive_Bayes_Classifier_..." columns
df=df.drop(["CLIENTNUM",
            "Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1",
            "Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2"],
           axis = 1)

In [None]:
# Display descriptive statistics
df.describe()

In [None]:
# Let's see the type of each column
df.info()

- There are no missing values in this dataset.
- Some columns are categorical, so we need to encode them.
  - They are "Attrition_Flag","Gender","Education_Level","Marital_Status","Income_Category","Card_Category"

In [None]:
categories = ['Gender', 'Education_Level', 'Marital_Status', 'Income_Category', 'Card_Category']

for cat in categories:
    cross_tab = pd.crosstab(df[cat], df['Attrition_Flag'], normalize='index')
    cross_tab.plot.bar(stacked=True)
    plt.show()

In [None]:
cols = ["Customer_Age", "Dependent_count", "Months_on_book", "Total_Relationship_Count","Months_Inactive_12_mon",
        "Contacts_Count_12_mon", "Credit_Limit","Total_Revolving_Bal","Avg_Open_To_Buy","Total_Amt_Chng_Q4_Q1",
        "Total_Trans_Amt", "Total_Trans_Ct", "Total_Ct_Chng_Q4_Q1", "Avg_Utilization_Ratio"
        ]

In [None]:
for col in cols:
    # Violin plot of each numerical feature, split by attrition status
    ax = sns.violinplot(data=df, x='Attrition_Flag', y=col)
    ax.set_title(col)
    ax.set_ylabel('')
    plt.show()

In [None]:
sns.pairplot(df, hue='Attrition_Flag')

# 3. Data Preprocessing

## 3.1. Convert categorical variables to numerical

In [None]:
# Show the unique values of categorical variables
print("Attrition_Flag :",df["Attrition_Flag"].unique())
print("Gender         :",df["Gender"].unique())
print("Education_Level:",df["Education_Level"].unique())
print("Marital_Status :",df["Marital_Status"].unique())
print("Income_Category:",df["Income_Category"].unique())
print("Card_Category  :",df["Card_Category"].unique())

In [None]:
# Convert variables with two categories into binary variables
df.loc[df["Attrition_Flag"] == "Existing Customer", "Attrition_Flag"] = 0
df.loc[df["Attrition_Flag"] == "Attrited Customer", "Attrition_Flag"] = 1
df["Attrition_Flag"] = df["Attrition_Flag"].astype(int)

df.loc[df["Gender"] == "F", "Gender"] = 0
df.loc[df["Gender"] == "M", "Gender"] = 1
df["Gender"] = df["Gender"].astype(int)

In [None]:
df.head()

In [None]:
#One hot encoding for Categorical variables
df = pd.get_dummies(df)
df.head()

In [None]:
# Separate the dataset into features and target
X = df.drop(["Attrition_Flag"],axis=1)
y = df["Attrition_Flag"]

y.value_counts()

The data is imbalanced.

In [None]:
# Split data into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

## 3.2. Rebalancing Samples

In [None]:
# Apply rebalancing

# Random Oversampling
over_X_train, over_y_train = RandomOverSampler(sampling_strategy='minority').fit_resample(X_train, y_train)
# SMOTE
smote_X_train, smote_y_train = SMOTE().fit_resample(X_train,y_train)
# Borderline-SMOTE
bdlsmote_X_train, bdlsmote_y_train = BorderlineSMOTE().fit_resample(X_train, y_train)
# Borderline-SMOTE SVM
bdlSVMsmote_X_train, bdlSVMsmote_y_train = SVMSMOTE().fit_resample(X_train, y_train)
# ADASYN
adasyn_X_train, adasyn_y_train = ADASYN().fit_resample(X_train, y_train)
# SMOTE-TomekLinks
smotetomek_X_train, smotetomek_y_train = SMOTETomek(tomek=TomekLinks(sampling_strategy='majority')).fit_resample(X_train, y_train)
# SMOTE-ENN
smoteenn_X_train, smoteenn_y_train = SMOTEENN().fit_resample(X_train, y_train)

In [None]:
# Check the results of rebalancing

# Random Oversampling
print("Random Oversampling\n", over_y_train.value_counts())
# SMOTE
print("SMOTE\n", smote_y_train.value_counts())
# Borderline-SMOTE
print("Borderline-SMOTE\n", bdlsmote_y_train.value_counts())
# Borderline-SMOTE SVM
print("Borderline-SMOTE SVM\n", bdlSVMsmote_y_train.value_counts())
# ADASYN
print("ADASYN\n", adasyn_y_train.value_counts())
# SMOTE-TomekLinks
print("SMOTE-TomekLinks\n", smotetomek_y_train.value_counts())
# SMOTE-ENN
print("SMOTE-ENN\n", smoteenn_y_train.value_counts())

In [None]:
datasets = [X_train, y_train, over_X_train, over_y_train, smote_X_train, smote_y_train,
            bdlsmote_X_train, bdlsmote_y_train, bdlSVMsmote_X_train, bdlSVMsmote_y_train, 
            adasyn_X_train, adasyn_y_train, smotetomek_X_train, smotetomek_y_train, 
            smoteenn_X_train, smoteenn_y_train]

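# fit_resample returns pandas objects when given pandas input, so no conversion is needed;
# print the type and shape of each dataset as a sanity check
for dataset in datasets:
    print(type(dataset).__name__, dataset.shape)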

In [None]:
# Concatenate features and target for each resampled dataset
train_concat = pd.concat([X_train, y_train], axis=1)
over_train_concat = pd.concat([over_X_train, over_y_train], axis=1)
smote_train_concat = pd.concat([smote_X_train, smote_y_train], axis=1)
bdlsmote_train_concat = pd.concat([bdlsmote_X_train, bdlsmote_y_train], axis=1)
bdlSVMsmote_train_concat = pd.concat([bdlSVMsmote_X_train, bdlSVMsmote_y_train], axis=1)
adasyn_train_concat = pd.concat([adasyn_X_train, adasyn_y_train], axis=1)
smotetomek_train_concat = pd.concat([smotetomek_X_train, smotetomek_y_train], axis=1)
smoteenn_train_concat = pd.concat([smoteenn_X_train, smoteenn_y_train], axis=1)

In [None]:
# Visualize resampling results
fig, axes = plt.subplots(4, 2, figsize=(20, 20),squeeze=True)
plt.subplots_adjust(wspace=0.2, hspace=0.4)
fig.suptitle('Resampling Result')

sns.scatterplot(ax=axes[0, 0], data=train_concat, x='Total_Amt_Chng_Q4_Q1', y='Total_Ct_Chng_Q4_Q1', hue='Attrition_Flag')
sns.scatterplot(ax=axes[0, 1], data=over_train_concat, x='Total_Amt_Chng_Q4_Q1', y='Total_Ct_Chng_Q4_Q1', hue='Attrition_Flag', legend=False)
sns.scatterplot(ax=axes[1, 0], data=smote_train_concat, x='Total_Amt_Chng_Q4_Q1', y='Total_Ct_Chng_Q4_Q1', hue='Attrition_Flag',legend=False)
sns.scatterplot(ax=axes[1, 1], data=bdlsmote_train_concat, x='Total_Amt_Chng_Q4_Q1', y='Total_Ct_Chng_Q4_Q1', hue='Attrition_Flag',legend=False)
sns.scatterplot(ax=axes[2, 0], data=bdlSVMsmote_train_concat, x='Total_Amt_Chng_Q4_Q1', y='Total_Ct_Chng_Q4_Q1', hue='Attrition_Flag',legend=False)
sns.scatterplot(ax=axes[2, 1], data=adasyn_train_concat, x='Total_Amt_Chng_Q4_Q1', y='Total_Ct_Chng_Q4_Q1', hue='Attrition_Flag',legend=False)
sns.scatterplot(ax=axes[3, 0], data=smotetomek_train_concat, x='Total_Amt_Chng_Q4_Q1', y='Total_Ct_Chng_Q4_Q1', hue='Attrition_Flag',legend=False)
sns.scatterplot(ax=axes[3, 1], data=smoteenn_train_concat, x='Total_Amt_Chng_Q4_Q1', y='Total_Ct_Chng_Q4_Q1', hue='Attrition_Flag',legend=False)

axes[0, 0].set_title("Imbalanced Data")
axes[0, 1].set_title("Random Oversampling")
axes[1, 0].set_title("SMOTE")
axes[1, 1].set_title("Borderline-SMOTE")
axes[2, 0].set_title("Borderline-SMOTE SVM")
axes[2, 1].set_title("ADASYN")
axes[3, 0].set_title("SMOTE-TomekLinks")
axes[3, 1].set_title("SMOTE-ENN")

In [None]:
# Standardization
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
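
The scaler is fitted on the training set only, and the same training statistics are then applied to the test set, so no information from the test set leaks into preprocessing. A minimal NumPy sketch of the same computation, using small hypothetical arrays (`X_train_demo`, `X_test_demo`) rather than the notebook's data:

```python
import numpy as np

# Hypothetical toy data standing in for the training and test features
X_train_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test_demo = np.array([[2.0, 40.0]])

# Statistics come from the training set only
mean = X_train_demo.mean(axis=0)
std = X_train_demo.std(axis=0)  # population standard deviation (ddof=0), as StandardScaler uses

# The same mean and std are applied to both sets
X_train_scaled = (X_train_demo - mean) / std
X_test_scaled = (X_test_demo - mean) / std

print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
print(X_test_scaled)
```

The scaled training columns have mean 0 and standard deviation 1, while the test values can fall outside that range because they are expressed in training-set units.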

In [None]:
# Apply rebalancing to standardized data

# Random Oversampling
over_X_train, over_y_train = RandomOverSampler(sampling_strategy='minority').fit_resample(X_train, y_train)
# SMOTE
smote_X_train, smote_y_train = SMOTE().fit_resample(X_train,y_train)
# Borderline-SMOTE
bdlsmote_X_train, bdlsmote_y_train = BorderlineSMOTE().fit_resample(X_train, y_train)
# Borderline-SMOTE SVM
bdlSVMsmote_X_train, bdlSVMsmote_y_train = SVMSMOTE().fit_resample(X_train, y_train)
# ADASYN
adasyn_X_train, adasyn_y_train = ADASYN().fit_resample(X_train, y_train)
# SMOTE-TomekLinks
smotetomek_X_train, smotetomek_y_train = SMOTETomek(tomek=TomekLinks(sampling_strategy='majority')).fit_resample(X_train, y_train)
# SMOTE-ENN
smoteenn_X_train, smoteenn_y_train = SMOTEENN().fit_resample(X_train, y_train)

In [None]:
# Check the results of rebalancing

# Random Oversampling
print("Random Oversampling\n", over_y_train.value_counts())
# SMOTE
print("SMOTE\n", smote_y_train.value_counts())
# Borderline-SMOTE
print("Borderline-SMOTE\n", bdlsmote_y_train.value_counts())
# Borderline-SMOTE SVM
print("Borderline-SMOTE SVM\n", bdlSVMsmote_y_train.value_counts())
# ADASYN
print("ADASYN\n", adasyn_y_train.value_counts())
# SMOTE-TomekLinks
print("SMOTE-TomekLinks\n", smotetomek_y_train.value_counts())
# SMOTE-ENN
print("SMOTE-ENN\n", smoteenn_y_train.value_counts())

# 4. Classification

In [None]:
# Create a model dictionary
models = {"Logistic Regression   ": LogisticRegression(),
          "K-Nearest Neighbors   ": KNeighborsClassifier(),
          "Support Vector Machine": SVC(probability=True),
          "Decision Tree         ": DecisionTreeClassifier(),
          "Random Forest         ": RandomForestClassifier(),
          "Ada Boost             ": AdaBoostClassifier(),
          "XGBoost               ": XGBClassifier(),
          "LightGBM              ": LGBMClassifier(),
          "CatBoost              ": CatBoostClassifier(verbose=0),
          "Neural Network        ": MLPClassifier()
         }

In [None]:
# Fit the models on imbalanced data
for name, model in models.items():
    model.fit(X_train, y_train)

# Print AUC score, computed from the predicted probability of the positive class
print("Imbalanced Data: AUC score")
for name, model in models.items():
    print(name + ": {:.3f}".format(roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])))

In [None]:
# Fit the models: Random Oversampling
for name, model in models.items():
    model.fit(over_X_train, over_y_train)

# Print AUC score
print("Random Oversampling: AUC score")
for name, model in models.items():
    print(name + ": {:.3f}".format(roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])))

In [None]:
# Fit the models: SMOTE
for name, model in models.items():
    model.fit(smote_X_train, smote_y_train)

# Print AUC score
print("SMOTE: AUC score")
for name, model in models.items():
    print(name + ": {:.3f}".format(roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])))

In [None]:
# Fit the models: Borderline-SMOTE
for name, model in models.items():
    model.fit(bdlsmote_X_train, bdlsmote_y_train)

# Print AUC score
print("Borderline-SMOTE: AUC score")
for name, model in models.items():
    print(name + ": {:.3f}".format(roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])))

In [None]:
# Fit the models: Borderline-SMOTE SVM
for name, model in models.items():
    model.fit(bdlSVMsmote_X_train, bdlSVMsmote_y_train)

# Print AUC score
print("Borderline-SMOTE SVM: AUC score")
for name, model in models.items():
    print(name + ": {:.3f}".format(roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])))

In [None]:
# Fit the models: ADASYN
for name, model in models.items():
    model.fit(adasyn_X_train, adasyn_y_train)

# Print AUC score
print("ADASYN: AUC score")
for name, model in models.items():
    print(name + ": {:.3f}".format(roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])))

In [None]:
# Fit the models: SMOTE-TomekLinks
for name, model in models.items():
    model.fit(smotetomek_X_train, smotetomek_y_train)

# Print AUC score
print("SMOTE-TomekLinks: AUC score")
for name, model in models.items():
    print(name + ": {:.3f}".format(roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])))

In [None]:
# Fit the models: SMOTE-ENN
for name, model in models.items():
    model.fit(smoteenn_X_train, smoteenn_y_train)

# Print AUC
print("SMOTE-ENN: AUC score")
for name, model in models.items():
    print(name + ": {:.3f}".format(roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])))