# Credit Card Churn

By Eric Wilson

The Kaggle user who submitted this data set needs to predict customer churn and reports 62% as their highest accuracy so far. Flagging a customer who will stay as one who will churn (a false positive) is acceptable; the priority is to avoid marking a customer who will churn as one who will stay (a false negative), so recall on the churn class matters most. Let's see what we can do...

### Import libraries and data

In [None]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, recall_score, precision_score
from sklearn.metrics import confusion_matrix,classification_report

In [None]:
df = pd.read_csv('../input/credit-card-customers/BankChurners.csv')
df.head()

First, we need to turn attrition/churn into numeric values, followed by gender, education, income, marital status, and card category. We also need to get rid of the first column and the last two columns, which the code below drops by position.

In [None]:
print(df['Attrition_Flag'].value_counts())
print(df['Gender'].value_counts())
print(df['Education_Level'].value_counts())
print(df['Marital_Status'].value_counts())
print(df['Income_Category'].value_counts())
print(df['Card_Category'].value_counts())

In [None]:
# Assign each mapped column back to df. Calling replace(..., inplace=True) on df['col']
# is chained assignment, which newer pandas versions warn about and which does not
# update df under copy-on-write. Note that map() turns any value missing from the dict into NaN.
df['Attrition_Flag'] = df['Attrition_Flag'].map({'Existing Customer': 0, 'Attrited Customer': 1})
df['Gender'] = df['Gender'].map({'F': 0, 'M': 1})
df['Education_Level'] = df['Education_Level'].map({'Unknown': 0, 'Uneducated': 1, 'High School': 2, 'College': 3,
                                                   'Graduate': 4, 'Post-Graduate': 5, 'Doctorate': 6})
df['Marital_Status'] = df['Marital_Status'].map({'Unknown': 0, 'Single': 1, 'Divorced': 2, 'Married': 3})
df['Income_Category'] = df['Income_Category'].map({'Unknown': 0, 'Less than $40K': 1, '$40K - $60K': 2, '$60K - $80K': 3,
                                                   '$80K - $120K': 4, '$120K +': 5})
df['Card_Category'] = df['Card_Category'].map({'Blue': 0, 'Silver': 1, 'Gold': 2, 'Platinum': 3})
# Drop the first column and the last two columns by position.
df = df.drop(columns=df.columns[[0, 21, 22]])
df.dtypes

In [None]:
df.head()

Now we have nothing but numbers. Let's start trying to build a model.
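Before moving on, it can be worth confirming that an encoding step like this left no category unmapped. The sketch below uses a tiny made-up frame rather than the real data; `toy`, `card_order`, and `unmapped` are hypothetical names introduced only for illustration.

```python
import pandas as pd

# Made-up sample rows; only the column name is borrowed from the dataset.
toy = pd.DataFrame({'Card_Category': ['Blue', 'Silver', 'Gold', 'Blue', 'Platinum']})
card_order = {'Blue': 0, 'Silver': 1, 'Gold': 2, 'Platinum': 3}

# map() returns NaN for any value that is missing from the dict,
# so NaNs in the result point at categories we forgot to encode.
encoded = toy['Card_Category'].map(card_order)
unmapped = toy.loc[encoded.isna(), 'Card_Category'].unique()
assert len(unmapped) == 0, f'unmapped categories: {unmapped}'

# Safe to cast to int once we know there are no NaNs.
toy['Card_Category'] = encoded.astype(int)
print(toy['Card_Category'].tolist())
```

The same check could be repeated per column on the real frame; a failed assertion would list the categories that still need a code.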

### Correlation

Let's start by checking which features correlate most with attrition and which correlate with one another. This gives an idea of which features are likely to be most useful, so we can limit the inputs and reduce the risk of overfitting.

In [None]:
df.corr()

It appears that the fields most correlated with churn are transaction count, count change from Q4 to Q1, revolving balance, 12-month contact count, inactive months, utilization ratio, relationship count, and transaction amount. None of them correlates particularly strongly with churn, but essentially all of the demographic information (gender, income, education level, marital status, dependent count) lacks any real correlation with it.
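Reading a ranking like this off the full correlation matrix is easier if the target column is pulled out and sorted by absolute value. A minimal sketch on synthetic data follows; the column names are borrowed from the dataset, but the values are invented, and `toy` and `target_corr` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the encoded frame; values are made up for illustration.
rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({
    'Total_Trans_Ct': rng.integers(10, 140, n),
    'Months_Inactive_12_mon': rng.integers(0, 7, n),
    'Gender': rng.integers(0, 2, n),
})
# Make churn depend on transaction count so one feature clearly stands out.
toy['Attrition_Flag'] = (toy['Total_Trans_Ct'] + rng.normal(0, 20, n) < 50).astype(int)

# Correlation of every feature with the target, strongest (by magnitude) first.
target_corr = (
    toy.corr()['Attrition_Flag']
       .drop('Attrition_Flag')
       .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(target_corr)
```

Sorting by absolute value keeps strongly negative correlations near the top, which matters here because fewer transactions go with more churn.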

### Model Building

In [None]:
dfm = df[['Attrition_Flag', 'Total_Relationship_Count', 'Months_Inactive_12_mon',
          'Contacts_Count_12_mon', 'Total_Revolving_Bal', 'Total_Trans_Amt', 'Total_Trans_Ct',
          'Total_Ct_Chng_Q4_Q1', 'Avg_Utilization_Ratio']]
dfm.corr()

In [None]:
x = dfm[['Total_Relationship_Count', 'Months_Inactive_12_mon', 'Contacts_Count_12_mon',
        'Total_Revolving_Bal', 'Total_Trans_Amt', 'Total_Trans_Ct', 'Total_Ct_Chng_Q4_Q1',
        'Avg_Utilization_Ratio']]
y = dfm['Attrition_Flag']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=2)

In [None]:
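model = RandomForestClassifier(n_estimators=100, max_depth=13, random_state=2)
# Fit on the training split only; fitting on the full x, y would let the model
# see the test rows and inflate every metric computed on x_test.
model.fit(x_train, y_train)
rfvalue = model.predict(x_test)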

print('Model Accuracy : ', accuracy_score(y_test, rfvalue) *  100)
print('Model Recall : ', recall_score(y_test, rfvalue) *  100)
print('Model Precision : ', precision_score(y_test, rfvalue) *  100)

In [None]:
print(confusion_matrix(y_test, rfvalue))
print(classification_report(y_test, rfvalue))

With this model, we still have roughly two dozen false negatives, which is exactly what we've been asked to avoid. That being said, we're still looking at pretty high accuracy, precision, recall, and F1 scores.
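Since false negatives are the costly error here, one option worth trying is to lower the probability cutoff at which a customer is labelled as churning, trading some precision for recall. The sketch below is not the notebook's method: it uses synthetic, imbalanced data from `make_classification`, and the 0.3 cutoff, `churn_proba`, and `results` are assumptions chosen purely for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (a minority positive class); not the bank dataset.
X, y = make_classification(n_samples=2000, n_features=8, n_informative=5,
                           weights=[0.84, 0.16], random_state=2)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=2, stratify=y)

clf = RandomForestClassifier(n_estimators=100, random_state=2)
clf.fit(X_train, y_train)

# Probability of the positive (churn) class for each test row.
churn_proba = clf.predict_proba(X_test)[:, 1]

results = {}
for threshold in (0.5, 0.3):
    # Label a row as churn whenever its probability reaches the cutoff.
    pred = (churn_proba >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_test, pred).ravel()
    results[threshold] = (fn, fp)
    print(f'threshold={threshold}: false negatives={fn}, false positives={fp}, '
          f'recall={recall_score(y_test, pred):.3f}')
```

Lowering the cutoff can only move rows from "stay" to "churn", so false negatives never increase, while false positives never decrease. Where to set the cutoff depends on how many false alarms are tolerable.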

## Conclusion

By narrowing the inputs down to the features that correlate most strongly with attrition, we're left with a pretty accurate model. I've tried to improve it with larger and smaller train/test splits and different random states, but the combination used in this notebook seemed to work best.
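Comparing individual splits and random states by hand can be noisy, so a steadier way to judge a configuration is cross-validation, scored on recall since missed churners matter most. This is a hedged sketch on synthetic data, not the notebook's own procedure; the 5-fold setup and the name `recall_scores` are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in data; not the bank dataset.
X, y = make_classification(n_samples=1000, n_features=8, n_informative=5,
                           weights=[0.84, 0.16], random_state=2)

clf = RandomForestClassifier(n_estimators=100, max_depth=13, random_state=2)

# Stratified folds keep the churn rate roughly equal in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=2)

# Each fold trains on four fifths of the data and scores recall on the rest.
recall_scores = cross_val_score(clf, X, y, cv=cv, scoring='recall')
print('recall per fold:', recall_scores)
print('mean:', recall_scores.mean(), 'std:', recall_scores.std())
```

The spread across folds gives a sense of how much of the difference between two settings is just split-to-split variation.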

### Addendum

I value feedback, tips, and criticism highly. I'm still fairly new to DS and ML, so if I've made a mistake, I would greatly appreciate hearing about it; the best way to learn is by doing, and it's better to fix an error before it becomes a habit.

Thank you for taking the time to read this notebook!