# Predict Churning Customers

A bank has a problem: a large number of customers are leaving its credit card services. The bank would like to identify these customers early in order to maliciously trap them in a usurious cycle. The goal of this brief investigation is to develop a model that identifies customers who are about to leave their credit card services. While banks can use this data to retain customers, a regulating body such as the SEC could also integrate it into a larger model in order to identify predatory lending techniques and at-risk consumers.

This investigation will explore several different models and compare their accuracy via a cross-validation metric, along with a brief discussion.

# Preprocessing

The data is largely clean and complete, so not much preprocessing is needed. Some categorical columns have "Unknown" values. These values could be filled or dropped; however, since they don't significantly raise the cardinality of the data and the primary goal of this investigation is to try several models, "Unknown" will be treated as its own category. The data will not be split into test and training sets, since we are evaluating our models using cross validation. The only significant preprocessing is to one-hot encode all the categorical columns.
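As a minimal sketch of what one-hot encoding does with an "Unknown" value, the toy frame below stands in for the real data. The rows and the `Customer_Age` column are made up for illustration; only the treatment of "Unknown" as an ordinary category mirrors the approach described above.

```python
import pandas as pd

# Toy stand-in for the real data; the rows and the Customer_Age column are illustrative only
toy = pd.DataFrame({
    "Education_Level": ["Graduate", "Unknown", "High School", "Graduate"],
    "Customer_Age": [45, 51, 38, 62],
})

# get_dummies expands object columns into one indicator column per category
# and leaves numeric columns untouched; "Unknown" simply gets its own column
encoded = pd.get_dummies(toy)
print(encoded.columns.tolist())
```

Because "Unknown" becomes one more indicator column, no rows are lost and no values have to be imputed.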

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

from sklearn.model_selection import train_test_split

In [None]:
data = pd.read_csv("../input/credit-card-customers/BankChurners.csv")

In [None]:
data = data.drop('Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1', axis=1)
data = data.drop('Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2', axis=1)
data = data.drop('CLIENTNUM', axis = 1)
data.head()

In [None]:
# Attrition_Flag is going to be our target
data['Attrition_Flag'].unique()


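# Remove rows with missing target, separate target from predictors
data.dropna(axis=0, subset=['Attrition_Flag'], inplace=True)
# Encode the string labels as integer codes (categories are sorted alphabetically);
# newer XGBoost versions expect numeric class labels
y = data['Attrition_Flag'].astype('category').cat.codes
data.drop(['Attrition_Flag'], axis=1, inplace=True)

y.head()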

In [None]:
# select low-cardinality categorical columns
categorical_cols = [cname for cname in data.columns if 
                   data[cname].nunique() < 10 and
                   data[cname].dtype == 'object']
# probably redundant with this dataset since it's pretty clean,
# but double checking is good practice

In [None]:
# select numerical columns
num_cols = [cname for cname in data.columns if 
           data[cname].dtype in ['int64', 'float64']]

In [None]:
# Only keep these specific columns
my_cols = categorical_cols + num_cols
X = data[my_cols].copy()

# One hot encode via pandas for speed
X = pd.get_dummies(X)
X.head()

# Build Models - Try a bunch out

The goal of this script is to spray the problem with a bunch of different algorithms and evaluate each algorithm's performance. This is going to take a while to run and is probably not the most efficient method of modelling; however, this is primarily an educational experience for myself. Additionally, some of the algorithms, such as k-NN and the SVMs, are supposedly sensitive to feature transformations. Features should be scaled when using these algorithms, and thus I hypothesize before running them that they will perform poorly. In fact, I would honestly be surprised if they converge to a global minimum. I am going to be lazy and use the vanilla arguments; however, as a learning exercise I think it would be interesting to see if this is actually reflected in the results.
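For comparison, here is a hedged sketch of how scaling could be added for the scale-sensitive models. This is not what the notebook runs. It uses synthetic data from `make_classification` as a stand-in for `X` and `y`, and one feature is deliberately inflated to mimic columns with very different ranges.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X and y (an assumption, not the bank data)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
# Inflate one feature so it dominates Euclidean distances
X_demo[:, 0] *= 10_000

unscaled = KNeighborsClassifier()
# The pipeline refits the scaler on each training fold, so no information
# from the held-out fold leaks into the scaling step
scaled = make_pipeline(StandardScaler(), KNeighborsClassifier())

unscaled_score = cross_val_score(unscaled, X_demo, y_demo, cv=5, scoring="accuracy").mean()
scaled_score = cross_val_score(scaled, X_demo, y_demo, cv=5, scoring="accuracy").mean()
print(f"unscaled: {unscaled_score:.3f}, scaled: {scaled_score:.3f}")
```

Wrapping the scaler and the classifier in one pipeline keeps the comparison fair, since both variants go through the same cross-validation splits.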

Since only 16% of customers have churned, I am implementing cross validation in order to score these models. 
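With 16% churn, a model that always predicts "no churn" already reaches 84% accuracy, so that is the number to beat. The sketch below illustrates this baseline on synthetic labels with the same class ratio. The data is made up, and `balanced_accuracy` is shown only as one possible complementary metric, not something the notebook itself uses.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic labels with 16% positives (churned), mirroring the class ratio above
y_demo = np.array([1] * 160 + [0] * 840)
X_demo = np.zeros((1000, 1))  # features are irrelevant to the dummy model

# Stratified folds keep the 16/84 ratio in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
baseline = DummyClassifier(strategy="most_frequent")

accuracy = cross_val_score(baseline, X_demo, y_demo, cv=cv, scoring="accuracy").mean()
balanced = cross_val_score(baseline, X_demo, y_demo, cv=cv, scoring="balanced_accuracy").mean()
print(f"accuracy: {accuracy:.2f}, balanced accuracy: {balanced:.2f}")
```

When `cross_val_score` is given an integer `cv` and a classifier, scikit-learn already uses stratified folds, so the explicit `StratifiedKFold` here mainly adds shuffling with a fixed seed.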

_*this will probably have to run overnight_

In [None]:
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.svm import NuSVC
from sklearn.svm import LinearSVC
from xgboost import XGBClassifier


from sklearn.model_selection import cross_val_score


def evaluate_model(model_class):
    """Return the mean 5-fold cross-validated accuracy of a default-configured model."""
    my_model = model_class()
    scores = cross_val_score(my_model, X, y,
                             cv=5,
                             scoring='accuracy')
    return scores.mean()

algorithms = [GaussianNB,
              DecisionTreeClassifier,
              RandomForestClassifier,
              KNeighborsClassifier,
              SVC,
              LinearSVC,
              XGBClassifier]

# Key each score by the class name, e.g. 'GaussianNB'
results = {algorithm.__name__: evaluate_model(algorithm)
           for algorithm in algorithms}

results

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(20, 10))


sns.barplot(x=list(results.keys()), y=list(results.values()))

plt.ylabel('Accuracy')
plt.title('Algorithm Performance')


# Conclusion
As predicted, the worst performing algorithms were the ones that are sensitive to feature transformations (k-NN and the SVCs). Additionally, judging by some of the warnings raised, the support vector machines may not have even converged on every iteration. This validates my initial hypothesis, as I didn't scale or normalize any of my features. The nonconvergence is a further demonstration of why you want to normalize features when using these algorithms. Additionally, the vanilla arguments (learning parameters, etc.) should be adjusted for less marginal results.


The next worst performing algorithm was GaussianNB, a Naive Bayes variant, which is not a huge surprise. Naive Bayes relies on the assumption that features are independent of one another, which may not be the case with our dataset. For example, there is a possible correlation between Education_Level and Income_Category. The next best performing algorithm is the decision tree classifier. Decision trees benefit from the fact that they are relatively simple models and require relatively little data preparation. Finally, our top performing algorithms are the two ensemble algorithms - the random forest classifier and the XGBClassifier. The XGBClassifier was able to predict whether or not a customer was going to churn with an accuracy of 93%.
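The suspected dependence between Education_Level and Income_Category was not tested in this notebook. One way it could be checked is a chi-squared test on a contingency table, sketched below. The toy rows and the "High"/"Low" income labels are invented for illustration and are not taken from the dataset.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy data; the rows and the "High"/"Low" labels are hypothetical
toy = pd.DataFrame({
    "Education_Level": ["Graduate"] * 6 + ["High School"] * 6,
    "Income_Category": ["High"] * 5 + ["Low"] * 1 + ["High"] * 1 + ["Low"] * 5,
})

# Count how often each pair of categories occurs together
table = pd.crosstab(toy["Education_Level"], toy["Income_Category"])

# A small p-value suggests the two columns are not independent
chi2, p_value, dof, expected = chi2_contingency(table)
print(table)
print(f"chi2={chi2:.2f}, p={p_value:.3f}, dof={dof}")
```

On the real data, a strong association would be consistent with the Naive Bayes independence assumption being violated, though it would not by itself explain the accuracy gap.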