# Credit Card Retention Analysis

## Imports

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import plotly.graph_objs as go
from plotly.offline import iplot
sns.set()
pd.options.display.max_columns = 999

In [2]:
data = pd.read_csv('../data/BankChurners_v2.csv')

In [3]:
data = data[['CLIENTNUM', 'Attrition_Flag', 'Customer_Age', 'Gender',
       'Dependent_count', 'Education_Level', 'Marital_Status',
       'Income_Category', 'Card_Category', 'Months_on_book',
       'Total_Relationship_Count', 'Months_Inactive_12_mon',
       'Contacts_Count_12_mon', 'Credit_Limit', 'Total_Revolving_Bal',
       'Avg_Open_To_Buy', 'Total_Amt_Chng_Q4_Q1', 'Total_Trans_Amt',
       'Total_Trans_Ct', 'Total_Ct_Chng_Q4_Q1', 'Avg_Utilization_Ratio']].copy()  # independent copy for the column assignments that follow

In [4]:
data['Education_Level'] = data['Education_Level'].fillna('Unknown')
data['Marital_Status'] = data['Marital_Status'].fillna('Unknown')
data['Income_Category'] = data['Income_Category'].fillna('Unknown')
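
The three `fillna` calls can also be written as a single pass over a list of columns, followed by a check that nothing was left unfilled. This is a minimal sketch on a small made-up frame; `toy`, `cat_cols` and the values are illustrative and not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data, with a few missing values.
toy = pd.DataFrame({
    'Education_Level': ['Graduate', np.nan, 'College'],
    'Marital_Status': [np.nan, 'Married', 'Single'],
    'Income_Category': ['$40K - $60K', np.nan, np.nan],
})

cat_cols = ['Education_Level', 'Marital_Status', 'Income_Category']

# Fill every listed column in one pass.
toy[cat_cols] = toy[cat_cols].fillna('Unknown')

# Count remaining missing values per column; all should be zero.
remaining = toy[cat_cols].isna().sum()
print(remaining)
```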

In [18]:
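# https://towardsdatascience.com/data-preprocessing-with-python-pandas-part-5-binning-c5bd5fd1b950
# Bin edges for age groups; note that the first bin starts at 25, so '20s' covers ages 25-29 only.
bins = [25, 30, 40, 50, 60, 70, 80]
labels = ['20s', '30s', '40s', '50s', '60s', '70s']
# right=False makes each bin left-closed: [25, 30), [30, 40), ..., so an age of 30 lands in '30s'.
data['Customer_Age_bins'] = pd.cut(data['Customer_Age'], bins=bins, labels=labels, include_lowest=True, right=False)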
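
Because `right=False` closes each bin on the left and opens it on the right, an age of exactly 30 falls in `'30s'`, not `'20s'`. A quick check on a few made-up ages (the `ages` values are illustrative only) shows where the edge cases land:

```python
import pandas as pd

bins = [25, 30, 40, 50, 60, 70, 80]
labels = ['20s', '30s', '40s', '50s', '60s', '70s']

# Hypothetical ages chosen to sit on and around the bin edges.
ages = pd.Series([26, 29, 30, 39, 40, 73])

binned = pd.cut(ages, bins=bins, labels=labels, right=False)
print(pd.DataFrame({'age': ages, 'bin': binned}))
```

With these edges, any age below 25 would come back as `NaN`, so it is worth confirming the minimum of `Customer_Age` before relying on the bins.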

***

## Summary Statistics

Typically, we are looking to understand:

    1) how many instances are in the dataset (frequency or counts) 
    2) a measure of central tendency (mean, median, mode)
    3) the spread of the dataset (variance, standard deviation)

The **Mean** is the average of all values in a dataset, while the **Median** is the midpoint of the sorted values (50% of values above and 50% below).

In [6]:
y = list(range(0,110,10))
y

[0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

In [8]:
mean = sum(y)/len(y)
mean

50.0

In [9]:
np.mean(y)

50.0

In [10]:
np.median(y)

50.0

If I add one more, much larger data point to the set, 900, let's see how the two measures change:

In [11]:
y.append(900)

In [14]:
np.mean(y)

120.83333333333333

In [15]:
np.median(y)

55.0

The single outlier more than doubles the mean (from 50 to about 120.83), while the median barely moves (from 50 to 55). This is why the median is the more robust measure of central tendency when a variable contains extreme values.

In pandas, we can use the `.describe()` method to see these metrics for all the numerical variables in the dataset: `count`, `mean`, `std`, `min`, the quartiles (`25%`, `50%`, `75%`) and `max`.

`std` (the standard deviation) helps us understand how spread out the values of a variable are around its mean: the bigger the `std`, the bigger the spread.
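
To make "spread" concrete, here is a small sketch with two made-up lists that share the same mean but differ in spread. One detail worth knowing: `np.std` defaults to the population formula (`ddof=0`), while pandas, and therefore `.describe()`, uses the sample formula (`ddof=1`), so the two can differ slightly unless `ddof` is set explicitly.

```python
import numpy as np
import pandas as pd

# Two hypothetical samples with the same mean (50) but different spread.
tight = [48, 49, 50, 51, 52]
wide = [10, 30, 50, 70, 90]

print(np.mean(tight), np.mean(wide))

# Sample standard deviation (ddof=1), matching what pandas reports.
tight_std = np.std(tight, ddof=1)
wide_std = np.std(wide, ddof=1)
print(tight_std, wide_std)

# pandas uses ddof=1 by default, so this agrees with wide_std above.
print(pd.Series(wide).std())
```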

In [16]:
data.describe()

| | CLIENTNUM | Customer_Age | Dependent_count | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 10127.0 | 10127.0 | 10127.0 | 10127.0 | 10127.0 | 10127.0 | 10127.0 | 10127.0 | 10127.0 | 10127.0 | 10127.0 | 10127.0 | 10127.0 | 10127.0 | 10127.0 |
| mean | 95095.0 | 46.32596 | 2.346203 | 35.928409 | 3.81258 | 2.341167 | 2.455317 | 8631.953698 | 1162.814061 | 7469.139637 | 0.759941 | 4404.086304 | 64.858695 | 0.712222 | 0.274894 |
| std | 2923.557422 | 8.016814 | 1.298908 | 7.986416 | 1.554408 | 1.010622 | 1.106225 | 9088.77665 | 814.987335 | 9090.685324 | 0.219207 | 3397.129254 | 23.47257 | 0.238086 | 0.275691 |
| min | 90032.0 | 26.0 | 0.0 | 13.0 | 1.0 | 0.0 | 0.0 | 1438.3 | 0.0 | 3.0 | 0.0 | 510.0 | 10.0 | 0.0 | 0.0 |
| 25% | 92563.5 | 41.0 | 1.0 | 31.0 | 3.0 | 2.0 | 2.0 | 2555.0 | 359.0 | 1324.5 | 0.631 | 2155.5 | 45.0 | 0.582 | 0.023 |
| 50% | 95095.0 | 46.0 | 2.0 | 36.0 | 4.0 | 2.0 | 2.0 | 4549.0 | 1276.0 | 3474.0 | 0.736 | 3899.0 | 67.0 | 0.702 | 0.176 |
| 75% | 97626.5 | 52.0 | 3.0 | 40.0 | 5.0 | 3.0 | 3.0 | 11067.5 | 1784.0 | 9859.0 | 0.859 | 4741.0 | 81.0 | 0.818 | 0.503 |
| max | 100158.0 | 73.0 | 5.0 | 56.0 | 6.0 | 6.0 | 6.0 | 34516.0 | 2517.0 | 34516.0 | 3.397 | 18484.0 | 139.0 | 3.714 | 0.999 |


Here we can see things like: 

    1) The longest-tenured customer in this dataset has been on the books for 56 months, or a little over four and a half years. (Max of Months_on_book)
    2) The average number of relationships a customer has is ~4. (Mean of 3.81 and median of 4 roughly agree here)
    3) The average credit limit is $8.6K, but the median credit limit is much lower at $4.5K. (This gap signals right skew in this variable)
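
The gap between mean and median in the third point is the usual signature of right skew: a few large values pull the mean up while the median stays put. A minimal sketch on made-up credit limits (the `limits` values are illustrative, not from the dataset), using pandas' `Series.skew()`:

```python
import pandas as pd

# Hypothetical credit limits: mostly modest, with a few very large ones.
limits = pd.Series([1500, 2000, 2500, 3000, 4000, 4500, 6000, 12000, 30000, 34000])

mean_limit = limits.mean()
median_limit = limits.median()
skewness = limits.skew()  # positive values indicate a longer right tail

print(mean_limit, median_limit, skewness)
```

On the real data, the same call on `data['Credit_Limit']` would give the corresponding figure for that column.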

In [24]:
print('The average Total_Relationship_Count is', round(np.mean(data['Total_Relationship_Count']),2), 'and the median is', round(np.median(data['Total_Relationship_Count']),2))

The average Total_Relationship_Count is 3.81 and the median is 4.0


In [23]:
print('The average Credit_Limit is $', round(np.mean(data['Credit_Limit']),2), 'and the median is $',round(np.median(data['Credit_Limit']),2) )

The average Credit_Limit is $ 8631.95 and the median is $ 4549.0
