# Introduction

In this notebook...
* creating some models to analyze customer attrition
* looking at feature importances to see if any intuitive relationships between features can be gleaned (not really)
* My guess is that the useful relationships can't be visualized in two dimensions.
* I can't work on this project more today.

# Importing Libraries and Loading Data

All of my library imports should be in the next cell

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn import metrics


In [None]:
data = pd.read_csv('/kaggle/input/credit-card-customers/BankChurners.csv', index_col = 'CLIENTNUM')
drop_columns = ['Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1',
       'Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2']
data.drop(columns = drop_columns, inplace = True)

# Data Analysis

It looks like this dataset is not missing any values, so no cleaning is necessary.<br><br>
First, I created num_columns and cat_columns, lists of the numeric and categorical feature names. There's probably a more elegant way to do this, but I like for loops, ok?<br><br>
I then created plots to help visualize each feature in isolation, first the numeric and then the categorical. These plots are raw-count histograms, and since the existing customers far outnumber the attrited ones, they dominate every panel. I'll have better plots for comparing attrited and existing customers later in the notebook. This was just to give me an idea of their distributions.


In [None]:
data.info()

In [None]:
num_columns = list(data.columns)
cat_columns = [ 'Education_Level', 'Marital_Status', 'Income_Category', 'Card_Category','Gender']
for col in cat_columns:
    num_columns.remove(col)

num_columns.remove('Attrition_Flag')
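
One possibly more elegant way to build the two lists is to let pandas split the columns by dtype. This is a minimal sketch on a made-up toy frame, not the real dataset; `split_feature_columns` and the toy values are hypothetical, and it assumes every categorical column is stored as text rather than as a number.

```python
import pandas as pd

def split_feature_columns(df, target):
    """Return (numeric, categorical) feature names, excluding the target column."""
    features = df.drop(columns=[target])
    num_cols = list(features.select_dtypes(include="number").columns)
    cat_cols = list(features.select_dtypes(exclude="number").columns)
    return num_cols, cat_cols

# Toy frame with made-up values, only to show the mechanics
toy = pd.DataFrame({
    "Attrition_Flag": ["Existing Customer", "Attrited Customer", "Existing Customer"],
    "Customer_Age": [45, 49, 51],
    "Gender": ["M", "F", "M"],
    "Credit_Limit": [12691.0, 8256.0, 3418.0],
})
toy_num, toy_cat = split_feature_columns(toy, target="Attrition_Flag")
print(toy_num)
print(toy_cat)
```

A numerically coded category (say, a 1 to 5 rating) would land in the numeric list with this approach, so the hand-written list is still the safer choice when that can happen.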

In [None]:
figsize = (15, 12)
rows = 4
columns = 4
fig, axes = plt.subplots(rows, columns, figsize = figsize)

for i, col in enumerate(num_columns):
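    axrow, axcol = divmod(i, columns)  # flat index -> (row, column) in the grid
    ax_active = axes[axrow, axcol]
    # distplot is deprecated in recent seaborn; histplot draws the same count histogram
    sns.histplot(data[data['Attrition_Flag'] == 'Attrited Customer'][col], ax = ax_active, kde = False, color = 'C0')
    sns.histplot(data[data['Attrition_Flag'] == 'Existing Customer'][col], ax = ax_active, kde = False, color = 'C1')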
    ax_active.set_title(col)
    ax_active.set_xlabel(None)
    ax_active.legend(['Attrited', 'Existing'])

plt.subplots_adjust(hspace = .3)
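
Mapping the flat loop index to a grid cell is the fiddly part of these subplot loops. A small sketch of the arithmetic, assuming the grid is filled row by row; `grid_position` is a hypothetical helper name, not something from matplotlib.

```python
def grid_position(i, n_columns):
    """Map a flat loop index to (row, column) in a grid filled row by row."""
    return divmod(i, n_columns)

# A 3 x 4 grid like the one used for ten plots
n_rows, n_columns = 3, 4
positions = [grid_position(i, n_columns) for i in range(6)]
print(positions)
# → [(0, 0), (0, 1), (0, 2), (0, 3), (1, 0), (1, 1)]

# Dividing by the row count instead sends index 3 to row 1, not row 0
wrong_row = 3 // n_rows
print(wrong_row)  # → 1
```

The row is always the index divided by the number of columns; dividing by the number of rows only gives the same answer when the grid is square.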

In [None]:
figsize = (15, 12)
rows = 3
columns = 3
fig, axes = plt.subplots(rows, columns, figsize = figsize)

for i, col in enumerate(cat_columns):
    axrow, axcol = divmod(i, columns)  # flat index -> (row, column) in the grid
    ax_active = axes[axrow, axcol]
    # Fix the hue order so the legend labels set below match the bars
    sns.countplot(data = data, x = col, hue = 'Attrition_Flag',
                  hue_order = ['Attrited Customer', 'Existing Customer'], ax = ax_active)
    ax_active.set_title(col)
    ax_active.set_xlabel(None)
    ax_active.tick_params(axis = 'x', labelrotation = 20)
    ax_active.legend(['Attrited', 'Existing'])

plt.subplots_adjust(hspace = .3)

# Encoding Features

There is nothing terribly interesting or clever done here. Just one-hot encoding of the features in my list of categorical features and label encoding of the target.

I just used one-hot encoding for all of them because, based on the plots above, none of the features has a terribly high cardinality. In the end, I have fewer than 50 features total, which should fit fairly quickly for most models.

I'm also including the cell with the train/test split in this section. Before the split, I applied a min-max scaler so that linear models would be able to work with the data as well.

In [None]:
# One-hot encode the categorical columns; the other columns keep their original order
data_basic = pd.get_dummies(data, columns = cat_columns, dtype = int)

In [None]:
attrition_dict = {'Existing Customer':0, 'Attrited Customer':1}
# Assign the mapped column directly so the target ends up with an integer dtype
data_basic['Attrition_Flag'] = data_basic['Attrition_Flag'].map(attrition_dict)
data_basic.head()


In [None]:
X = data_basic.drop(columns = ['Attrition_Flag']).copy()
y = data_basic['Attrition_Flag']
minmax = preprocessing.MinMaxScaler()

X = minmax.fit_transform(X)
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, stratify = y, test_size = .25)
minmax.scale_
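
Fitting the scaler before the split lets the test rows influence the scaling ranges. A common alternative is to put the scaler in a pipeline so it is fit on the training rows only. The sketch below uses synthetic data from `make_classification` as a stand-in for the real features; the sample size, class weights, and seeds are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic, imbalanced stand-in data (not the credit card dataset)
X_toy, y_toy = make_classification(n_samples=400, n_features=8,
                                   weights=[0.84, 0.16], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, stratify=y_toy,
                                          test_size=0.25, random_state=0)

# The pipeline fits the scaler on the training rows only,
# then reuses those ranges when it scores the test rows
model = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=300))
model.fit(X_tr, y_tr)

scaler = model.named_steps["minmaxscaler"]
accuracy = model.score(X_te, y_te)
print(scaler.data_min_.shape, accuracy)
```

With min-max scaling the leak is usually small, but the pipeline version also makes it harder to forget the scaler when predicting on new rows.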

# Classifiers
I created some classifiers using easy-to-use classifiers from sklearn: Logistic Regression, Random Forest, and Gradient Boosting. The numbers below are for the attrited class, measured against the true labels in `y_test`.
<br>
* Logistic Regression did not do very well, with about .80 precision and .53 recall
* Random Forest had about .97 precision and .78 recall. <br><br><br><br>

Gradient Boosting is a related tree ensemble, so I expect it to land close to Random Forest, but its report needs to be read off the cell below before I quote numbers for it.

The RF classifier is fairly useful in and of itself. It catches about 78% of the customers about to churn, and about 97% of the customers it flags really are churners, so for every 100 customers it says need attention, only about 3 probably will not churn. The cost is that roughly 1 in 5 churners goes unflagged.

What I'm most interested in is the `feature_importances_` attribute of the rf classifier, but more on that later in the notebook.

In [None]:
LogClassifier = LogisticRegression(random_state = 3157, max_iter = 300)
LogClassifier.fit(X_train, y_train)
y_pred_log = LogClassifier.predict(X_test)
target_names = ['Existing Customer', 'Attrited Customer']  # label 0, label 1

# classification_report expects the true labels first, then the predictions
print(metrics.classification_report(y_test, y_pred_log, target_names = target_names))

In [None]:
Classifier_rf = RandomForestClassifier()
Classifier_rf.fit(X_train, y_train)
y_pred_rf = Classifier_rf.predict(X_test)

print(metrics.classification_report(y_test, y_pred_rf, target_names = target_names))

In [None]:
Classifier_xgb = GradientBoostingClassifier(random_state = 2290)
Classifier_xgb.fit(X_train, y_train)
# Predict on the held-out test set and score these predictions, not the random forest's
y_pred_xgb = Classifier_xgb.predict(X_test)
print(metrics.classification_report(y_test, y_pred_xgb, target_names = target_names))
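
Argument order matters for these metrics: scikit-learn expects the true labels first and the predictions second, and swapping them exchanges precision and recall. A tiny sketch with made-up labels shows the effect.

```python
from sklearn.metrics import precision_score, recall_score

# Made-up labels: 4 true churners, the model flags only 2 of them
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

precision = precision_score(y_true, y_pred)  # 2 correct out of 2 flagged
recall = recall_score(y_true, y_pred)        # 2 caught out of 4 churners
print(precision, recall)  # → 1.0 0.5

# Passing the arguments the other way round swaps the two numbers
swapped_precision = precision_score(y_pred, y_true)
swapped_recall = recall_score(y_pred, y_true)
print(swapped_precision, swapped_recall)  # → 0.5 1.0
```

The same applies to `classification_report`, so a report that looks like high recall and modest precision can really be the reverse.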

# Feature Importance
I know that there's a better way to do this (probably PCA), but I'm just making it up as I go along for now. I don't feel like reading through the documentation right this moment. Sorry.<br><br>
First, I'm just looking at the feature importances as they are in the RandomForestClassifier object. In scikit-learn these are impurity-based importances: the total reduction in the split criterion that each feature brings, averaged over the trees and normalized so that the values sum to 1. They should at least give me an idea of the importance of the features relative to each other.<br><br>
It surprised me that none of the categorical features were in the top ten; all were numeric columns in the original dataframe. I should probably aggregate the categorical features for a better picture, but I'm fairly sure that no sum of categorical feature importances would break the top ten. I'm going to focus on the top ten features for now, not because this is a good operating procedure, but because that's how I feel like doing it (for now).
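
As a cross-check on the built-in importances, scikit-learn also offers permutation importance, which measures how much the test score drops when one feature's values are shuffled. This is a sketch on synthetic data, not the notebook's data; the sample size, number of trees, and seeds are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with 6 features
X_toy, y_toy = make_classification(n_samples=300, n_features=6,
                                   n_informative=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, stratify=y_toy,
                                          random_state=0)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Built-in, impurity-based importances (normalized to sum to 1)
impurity_importances = forest.feature_importances_

# Permutation importances, measured on the held-out rows
perm = permutation_importance(forest, X_te, y_te, n_repeats=5, random_state=0)
print(impurity_importances)
print(perm.importances_mean)
```

If the two rankings broadly agree, that is some reassurance that the top ten list is not an artifact of how the built-in values are computed.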

In [None]:
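# Pair each importance with its column name (every column except the target)
feature_names = data_basic.drop(columns = ['Attrition_Flag']).columns
feature_importances = pd.Series(index = feature_names, data = Classifier_rf.feature_importances_)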
feature_importances = feature_importances.sort_values(ascending = False)
ax = feature_importances[:10].plot.bar()
ax.set_title('Relative Feature Importance')

In [None]:
feature_importances
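
Aggregating the one-hot columns back to their original feature could look roughly like the sketch below. It assumes the dummy columns are named `<feature>_<level>`, as `pd.get_dummies` does by default; `aggregate_importances` and the toy importance values are hypothetical.

```python
import pandas as pd

def aggregate_importances(importances, cat_columns):
    """Sum one-hot importances back onto their original categorical feature."""
    def original_name(column):
        for cat in cat_columns:
            if column.startswith(cat + "_"):
                return cat
        return column  # numeric columns keep their own name

    # groupby with a function applies it to each index label
    grouped = importances.groupby(original_name).sum()
    return grouped.sort_values(ascending=False)

# Made-up importances, only to show the mechanics
toy_importances = pd.Series({
    "Total_Trans_Amt": 0.5,
    "Gender_F": 0.125,
    "Gender_M": 0.125,
    "Card_Category_Blue": 0.125,
    "Card_Category_Gold": 0.125,
})
aggregated = aggregate_importances(toy_importances, ["Gender", "Card_Category"])
print(aggregated)
```

The prefix match would misfire if a numeric column name happened to start with a categorical feature's name plus an underscore, so it is worth checking the column names first.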

In [None]:
top_ten = list(feature_importances.index[:10])
data_basic[top_ten]

## Distribution Plots
Again, I'm just looking at the 10 most important features (all numeric) according to my random forest classifier so that I can wrap my head around the data. The first thing I did was plot the distribution of each variable with kde enabled, so each curve is scaled to unit area. Unlike my distribution plot earlier in this notebook, this helps me see differences in the distributions for churning and non-churning customers. The farther down the top ten list we go, the more the churning and non-churning curves look like each other. That makes sense.<br><br>
Next I'm doing 2-D scatter plots for the top features. This should make any simple relationship between two variables and churning easily visible. It's here that I'm glad I didn't go with all features because a ten by ten subplot is bad enough to look at as it is. In fact, since the last three features have curves that look so much like each other, I think that I'll just drop those for the scatter plots to make things more visible. That may bite me later, but I'm not going to worry about it for now.

In [None]:
figsize = (15, 12)
rows = 3
columns = 4
fig, axes = plt.subplots(rows, columns, figsize = figsize)

for i, col in enumerate(top_ten):
    axrow, axcol = divmod(i, columns)  # flat index -> (row, column) in the grid
    ax_active = axes[axrow, axcol]
    # kdeplot replaces the deprecated distplot(hist = False, kde = True)
    sns.kdeplot(x = data_basic[data_basic['Attrition_Flag']==0][col],
                label = 'Existing', ax = ax_active)
    sns.kdeplot(x = data_basic[data_basic['Attrition_Flag']==1][col],
                label = 'Attrited', ax = ax_active)
    ax_active.set_title(col)
    ax_active.set_xlabel(None)
    ax_active.legend()

In [None]:
rows = 7
columns = 7

fig, axes = plt.subplots(rows, columns, figsize = figsize)

# Split once by class; separate names avoid overwriting the target variable y
existing = data_basic[data_basic['Attrition_Flag']==0]
attrited = data_basic[data_basic['Attrition_Flag']==1]

for i, xcol in enumerate(top_ten[:7]):
    for j, ycol in enumerate(top_ten[:7]):
        ax_active = axes[i, j]
        if xcol != ycol:
            # Existing customers are drawn first (blue), attrited second (orange)
            ax_active.scatter(existing[xcol], existing[ycol], s = 1, alpha = .5)
            ax_active.scatter(attrited[xcol], attrited[ycol], s = 1, alpha = .5)
            ax_active.set_xticks([])
            ax_active.set_yticks([])
            ax_active.set_xlabel(xcol, fontsize = 8)
            ax_active.set_ylabel(ycol, fontsize = 8)

I write this shortly after I got that subplotted figure above to work. I'm not super happy about how this turned out. I'm glad that I finally got all of these subplots and ax functions to run, but I don't think that I'm getting any super simple relationships out of this. Probably because there are none in two dimensions.<br><br>
I thought for a second that it might be interesting to add a third axis (requiring me to go and learn lots more code) so that I could visualize the relationships between three variables, but then I remembered that my monitor is two dimensional, so there would be no good way to look at the plots in relation to each other. Down this path lies madness. Anyway, that's what the ML algorithms are for, isn't it? To find patterns in higher dimensions that my monkey brain can't visualize?<br><br>
I think that there's just no easy answer to the question "how important is each feature?". Clearly there is some clustering in higher dimensions that a decision tree can pick up on, since the random forest classifier was able to predict churn so well. I think that I just have to conclude that there is no terribly helpful information in each feature looked at in isolation, or even in pairs.<br><br>
This plot did help me see that there's a pretty interesting relationship in every panel involving Total_Amt_Chng_Q4_Q1. It looks like there's a pretty stark threshold for that feature, with virtually all of the churning customers falling on one side of it. Looking back at the plots of the distributions of each variable in isolation, I can see WHY this is. Both the churned and non-churned customers follow normal-ish curves for this feature, and a good portion of the non-churning curve lies outside the region where it overlaps the churning curve.<br><br>
And then I realize that this makes a lot of sense. Total_Amt_Chng_Q4_Q1 is itself a combination of two variables, the transaction amounts in Q4 and in Q1. Clearly there is some clustering pattern on a plot of the Q1 amount against the Q4 amount, and the financial institution this data comes from has already recognized it. Good on them, but I'm left here thinking that there (probably) remains no decisive clustering in 2 dimensions. When looking at any two of the features I worked with on that plot, the churning customers simply have a lot of overlap with not-churning customers, and no pair can by itself identify the churners.<br><br>
Also, a point of clarification: on those plots, orange dots represent churning customers and blue ones represent not-churning ones. Individual dots are too small to show up clearly, which made the overall legend I had here earlier unhelpful. I don't care to try and work out a way to improve the plots, and I don't see anything promising to use in these plots, so I'm staying out of that rabbit hole, thank you.
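
One way to put a number on a single-feature threshold like the one described above is a depth-1 decision tree (a "stump"), which picks the one cut that best separates the classes. The sketch below uses synthetic data: two overlapping normal-ish groups with made-up means, spreads, and sizes standing in for one feature. None of these numbers come from the dataset.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Synthetic stand-in for one feature (made-up parameters)
existing = rng.normal(loc=0.8, scale=0.2, size=800)
attrited = rng.normal(loc=0.6, scale=0.2, size=200)
feature = np.concatenate([existing, attrited]).reshape(-1, 1)
labels = np.concatenate([np.zeros(800, dtype=int), np.ones(200, dtype=int)])

# A depth-1 tree learns exactly one threshold on the feature;
# class_weight="balanced" keeps the larger class from swamping the split
stump = DecisionTreeClassifier(max_depth=1, class_weight="balanced", random_state=0)
stump.fit(feature, labels)

threshold = stump.tree_.threshold[0]  # cut point chosen at the root node
flagged = stump.predict(feature)
stump_precision = precision_score(labels, flagged)
stump_recall = recall_score(labels, flagged)
print(threshold, stump_precision, stump_recall)
```

Running the same stump on the real column and comparing its precision and recall with the full random forest would show how much of the signal that one threshold carries on its own.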