# Credit Data Analysis
### Author: Alex Correa
### Last Publish Date: 12/6/20
### Submission Version: 2.0

## Problem Statement
Using the provided attrition data from the bank, generate a model that predicts customer attrition.

## Findings Summary
Using Random Forest classification with one-hot encoding for the categorical data, we achieve an accuracy of 96.10% on the held-out portion of a train/test split (80% of the data used for training, 20% reserved for testing).

## Future Iteration Goals
- Further validate accuracy with cross validation
- Further tune model with XGBoost
- Add further detailed commentary on charts and data exploration

## Notes
The cleanliness of this data made the project enjoyable and let me stay focused on exploration and modeling.  My goal in this notebook was to work with a data set different from most I've used while learning, and still apply some of the tools I've picked up from the Kaggle courses.

## Changelog
- Version 1.0: Initial Publish
- Version 1.1: Title modification and changelog inclusion
- Version 2.0: Added charts to view the data in different dimensions

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

## Data Import
We start by importing the data with pandas to build our data frame. We're told that 'CLIENTNUM' is the unique id for customers, so we'll use it as our index (in a real-world setting we'd validate this by checking for duplicates and making sure it isn't a relevant feature).

After the import we look at the first rows and print columns to get a quick peek.
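
The duplicate check mentioned above is not run in this notebook. A minimal sketch of what it could look like follows, using a small made-up frame (the rows and the `raw` name are assumptions for illustration, not the bank data):

```python
import pandas as pd

# Toy stand-in for the raw file; the values are made up for illustration.
raw = pd.DataFrame({
    "CLIENTNUM": [1001, 1002, 1003, 1002],
    "Customer_Age": [45, 49, 51, 49],
})

# Rows whose id appears more than once (keep=False flags every copy).
duplicate_rows = raw[raw["CLIENTNUM"].duplicated(keep=False)]

# True only when every id occurs exactly once.
id_is_unique = raw["CLIENTNUM"].is_unique

print(duplicate_rows)
print("CLIENTNUM is unique:", id_is_unique)
```

If the id column turned out not to be unique, using it as the index would silently keep the duplicate rows, so this kind of check is best run before `index_col` is set.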

In [None]:
holdingData = pd.read_csv("../input/credit-card-customers/BankChurners.csv", index_col='CLIENTNUM')

# Drop the last two columns (precomputed Naive Bayes outputs), as we were encouraged to do
creditData = holdingData.drop(columns=[
    'Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1',
    'Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2',
])

# Quick peek at the first rows
creditData.head()

## First thoughts
After a first look: WOW, that's a lot of features.  We were encouraged to drop the last two columns, which should also help with readability.

## Check for Missing Data
Before we get too crazy, let's see which columns are missing data.  We won't condition the data here; we'll only use the counts to determine what's viable.  Conditioning the full data set now, before the train/test split, could risk data leakage.

In [None]:
missing_values_count = creditData.isnull().sum()
print(missing_values_count)

## Missing Data Findings
Soooo we're missing no data.  The data set was handed to us in this state, so we'll assume someone already did a lot of cleaning work for us.
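
Had there been gaps, the leakage concern above would come down to where the fill values are learned. Here is a small sketch under made-up data (the `toy` frame and its values are assumptions): the imputer is fit on the training rows only and then reused on the test rows.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Made-up numeric column with gaps, for illustration only.
toy = pd.DataFrame({
    "Credit_Limit": [1000.0, np.nan, 3000.0, 5000.0, np.nan, 7000.0, 9000.0, 11000.0]
})

train, test = train_test_split(toy, test_size=0.25, random_state=1)

# Learn the fill value from the training rows only...
imputer = SimpleImputer(strategy="mean")
imputer.fit(train)

# ...then apply that same value to both splits.
train_filled = imputer.transform(train)
test_filled = imputer.transform(test)

print("Fill value learned from training rows:", imputer.statistics_[0])
print("Mean of the training rows:", train["Credit_Limit"].mean())
```

Fitting the imputer on the full frame instead would let the test rows influence the fill value, which is the kind of leak the pipeline later in this notebook is meant to avoid.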

## Data Exploration
The following cells will be used to explore the data.  We'll do this with some graphs and charts.  Seaborn and matplotlib will be our tools of choice for the time being.

In [None]:
creditData.columns

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# The below code for charts hangs a bit too long and includes charts that 
# clearly aren't the most appropriate for their data type.  I'm including
# it as a demonstration of some issues quick solutions like this can cause.

barColumns = ['Attrition_Flag','Customer_Age','Gender','Dependent_count','Education_Level','Marital_Status',
             'Income_Category','Card_Category','Months_on_book','Total_Relationship_Count','Months_Inactive_12_mon']

for i in creditData[barColumns]:
    cat_num = creditData[i].value_counts()
    # print("Graph for %s: total=%d" % (i,len(cat_num)))
    chart = sns.barplot(x=cat_num.index, y=cat_num)
    chart.set_xticklabels(chart.get_xticklabels(),rotation=90)
    chart.set_title("Graph for %s: total=%d" % (i,len(cat_num)))
    plt.show()

In [None]:
# This version of the plotting gives us the rest of the DF data to use for 
# Hues.  Surface level very few features look valuable aside from card_category

for i in creditData[barColumns]:
    cat_num = creditData[i].value_counts()
    chart = sns.countplot(x=i,data=creditData,hue="Attrition_Flag")
    chart.set_xticklabels(chart.get_xticklabels(),rotation=90)
    chart.set_title("Graph for %s: total=%d" % (i,len(cat_num)))
    plt.show()

In [None]:
# Now we need to chart some of our numeric values.  Here we look at the
# continuous columns listed below.
numColumns = ['Credit_Limit', 'Total_Revolving_Bal',
       'Avg_Open_To_Buy', 'Total_Amt_Chng_Q4_Q1', 'Total_Trans_Amt',
       'Total_Trans_Ct', 'Total_Ct_Chng_Q4_Q1', 'Avg_Utilization_Ratio']

# we're using histograms for this. From seaborn 0.11 on we'll be able to do 
# these with a "hue" using displot (but oh well for now)
for i in numColumns:
    chart = sns.distplot(a=creditData[i], kde=False)
    # count this column's distinct values; cat_num was left over from the bar-chart loop
    chart.set_title("Graph for %s: total=%d" % (i, creditData[i].nunique()))
    plt.show()


In [None]:
# Don't know what to chart? Chart EVERYTHING.
sns.pairplot(creditData, hue='Attrition_Flag')

## Quick Chart Thoughts
Ultimately we charted a ton of different things.  Bars for counts on categorical data and histograms for the numerical data.  Finally we use the pairplot to visualize graphs of many features against each other numerically. 

I have some concerns about the credit utilization stats being sources of data leakage. Yes, we can see some discrepancies between attrited and existing customers, but they're related to credit utilization. My immediate questions are:
- Do the credit utilization features include data collected over the 12 months before?
- How could we count utilization if people attrited in February?  Customers who left that early would naturally show low utilization, because they literally couldn't purchase in 10 of the 12 months, so low utilization may be a result of attrition rather than a predictor of it.

I'll wrap this part up with a quick heatmap.

In [None]:
cmap = sns.diverging_palette(220, 10, as_cmap=True)
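# correlate the numeric columns only; pandas 2.0+ raises an error on string columns
sns.heatmap(creditData.select_dtypes(include='number').corr(), vmax=.3, center=0, cmap=cmap,
           square = True, linewidths=.5, cbar_kws={"shrink":.5})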

## Feature Engineering
Here we begin to create our "features".  Honestly, I'm going to do a first run with all of them and create a classifier, but that will end poorly soon enough if we don't put in the effort to convert the categorical data with encoding. Categorical data is definitely not the easiest thing to work with.

In future iterations the feature engineering section will be more scientific: we'll understand the data better and make choices that prevent data leakage.

In [None]:
# Features?
# here we will define smaller feature sets in the future


## Code Overview
Coming from a computer science background, I prefer to leave code blocks sequential so they read straight through.  In the future I'll likely add more context between blocks, but for now it all goes up front.  One thing of note: I did not condition the categorical data before this step.  It could have been done for exploration, but I want to practice building scalable solutions, and conditioning data inside the pipeline, fit on the training data and then applied to the test data, is a small step towards that.

The code executes as follows:
1. We generate our training and testing data from the entire sample.
2. We define numerical and categorical columns by type and use them to generate our working dataframe.
3. We begin building our pipeline that defines the rules for handling data and using the model.
4. We define an imputer (for potential missing data) and define one-hot-encoding for our categorical data.
5. We define the model we'll use which in this case is a random forest classifier.
6. We fit/train the model
7. We make predictions based on the newly trained model
8. We analyze the accuracy of our predictions
9. We map our predictions against reality with a confusion matrix
10. We take a step back and think on how to iterate.
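
Steps 3 and 4 rely on the encoder being fit on the training rows and then reused on the test rows. A minimal sketch of what that means for a category the encoder never saw, using made-up rows (the `train` and `test` frames here are assumptions, not the notebook's splits):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up training and test rows; "Platinum" only shows up at test time.
train = pd.DataFrame({"Card_Category": ["Blue", "Silver", "Blue", "Gold"]})
test = pd.DataFrame({"Card_Category": ["Blue", "Platinum"]})

# handle_unknown="ignore" encodes unseen categories as all zeros
# instead of raising an error at transform time.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

encoded_test = encoder.transform(test).toarray()

print(encoder.categories_[0])
print(encoded_test)
```

The unseen category ends up as a row of zeros, so the model still gets a valid input even though it learned nothing about that value.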

In [None]:
# let's get some train test split going
from sklearn.model_selection import train_test_split

X = creditData.drop("Attrition_Flag",axis=1)
y = creditData.Attrition_Flag

X_train_full, X_test_full, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=1)


In [None]:
# now we do some pipeline work
# "Cardinality" means the number of unique values in a column
# Select categorical columns with relatively low cardinality (convenient but arbitrary)
categorical_cols = [cname for cname in X_train_full.columns if
                    X_train_full[cname].nunique() < 10 and 
                    X_train_full[cname].dtype == "object"]

# Select numerical columns
numerical_cols = [cname for cname in X_train_full.columns if 
                X_train_full[cname].dtype in ['int64', 'float64']]

# Keep selected columns only
my_cols = categorical_cols + numerical_cols
X_train = X_train_full[my_cols].copy()
X_test = X_test_full[my_cols].copy()


In [None]:
# For a future opportunity we could create a new heatmap after 
# converting categorical data to numerical.  Because the encoding is wrapped in 
# a pipeline for train and test we'll skip it.

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay

# Preprocessing for numerical data (strategy='constant' fills numeric gaps with 0 by default)
numerical_transformer = SimpleImputer(strategy='constant')

# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])

# Define model
model = RandomForestClassifier(n_estimators=100, random_state=0)

# Bundle preprocessing and modeling code in a pipeline
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('model', model)
                     ])

# Preprocessing of training data, fit model 
clf.fit(X_train, y_train)

# Preprocessing of test data, get predictions
preds = clf.predict(X_test)

# MAE does not apply to class labels so we use accuracy 
preds_score = accuracy_score(y_test, preds)
print("Random Forest Classifier Success Rate :", "{:.2f}%".format(100*preds_score))

# plot_confusion_matrix was removed in scikit-learn 1.2, so build the display directly
cm = confusion_matrix(y_test, preds, labels=clf.classes_)
ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=clf.classes_).plot()
plt.show()
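
## Cross Validation Sketch
One of the future goals listed at the top is to validate the accuracy with cross validation. The following is only a sketch of how that could look, run on synthetic data from `make_classification` rather than the bank data; the sample size, class weights, fold count and the `*_demo` names are all arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced two-class data standing in for the real features.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.8, 0.2], random_state=0
)

model_demo = RandomForestClassifier(n_estimators=50, random_state=0)

# Stratified folds keep the class ratio roughly equal in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model_demo, X_demo, y_demo, cv=cv, scoring="accuracy")

print("Accuracy per fold:", scores)
print("Mean accuracy: {:.2f}% (std {:.2f})".format(100 * scores.mean(), 100 * scores.std()))
```

In this notebook the same call should accept the `clf` pipeline together with `X[my_cols]` and `y`, so the imputing and encoding would be refit inside every fold.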