## Credit Card Churn Classifier
### Below I attempt to predict whether a bank's credit card customer will churn.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
from xgboost import XGBClassifier

In [None]:
df = pd.read_csv('../input/credit-card-customers/BankChurners.csv')

In [None]:
df.shape

In [None]:
df.head()

---
## Remove columns that are not needed
### The first column 'CLIENTNUM' is just an id value and is of no use to this analysis. The final two columns appear to be results from some prior classification analysis that I have no use for. 

In [None]:
# Drop 'CLIENTNUM'
df.drop(columns='CLIENTNUM', inplace=True)

# Drop last two columns
df = df.iloc[:, :-2]

df.shape

In [None]:
df.info()

### There are no null values in the dataset. Prior to moving on, I rename the columns to make them less verbose.

In [None]:
df.rename(columns={'Customer_Age' : 'age',
                   'Gender' : 'sex',
                   'Dependent_count' : 'dependents',
                   'Education_Level' : 'education',
                   'Marital_Status' : 'mar_status',
                   'Income_Category' : 'income',
                   'Card_Category' : 'card_type',
                   'Months_on_book' : 'months_customer',
                   'Total_Relationship_Count' : 'customer_products',
                   'Months_Inactive_12_mon' : 'ttm_inactive',
                   'Contacts_Count_12_mon' : 'ttm_contact',
                   'Credit_Limit' : 'card_limit',
                   'Total_Revolving_Bal' : 'balance',
                   'Avg_Open_To_Buy' : 'available_credit',
                   'Total_Amt_Chng_Q4_Q1' : 'ttm_trans_chng',
                   'Total_Trans_Amt' : 'ttm_trans',
                   'Total_Trans_Ct' : 'ttm_trans_cnt',
                   'Total_Ct_Chng_Q4_Q1' : 'ttm_trans_cnt_chng',
                   'Avg_Utilization_Ratio' : 'util_ratio'}, inplace=True)

In [None]:
df.columns

### Next I went through every single column to ensure that all the values made sense. For the sake of brevity, below I only show the ones that I edited in some way.

---

## 'Attrition_Flag'
### This is the target/label variable. I first check the distribution of existing versus attrited customers, then convert the values like so:
* existing customers : 0
* attrited customers: 1

In [None]:
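print(df['Attrition_Flag'].value_counts())

# Compute class shares from the data instead of hard-coding the counts
shares = df['Attrition_Flag'].value_counts(normalize=True)
print(f"Existing Customer: {shares['Existing Customer']:.2%}")
print(f"Attrited Customer: {shares['Attrited Customer']:.2%}")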

In [None]:
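# Assign the result back; an in-place replace on a selected column is
# deprecated in recent pandas versions and may not modify df
df['Attrition_Flag'] = df['Attrition_Flag'].map({'Existing Customer' : 0,
                                                 'Attrited Customer' : 1})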

### Next I create a 'churn' feature from 'Attrition_Flag', which is added at the end of the dataset, and then drop 'Attrition_Flag'.

In [None]:
df['churn'] = df['Attrition_Flag']
df.drop(columns='Attrition_Flag', inplace=True)

---
## 'Gender' (I renamed this to 'sex')

### I encode this feature like so:
* M becomes 0
* F becomes 1

In [None]:
# Assign the result back rather than replacing in place on a selected column
df['sex'] = df['sex'].map({'M' : 0, 'F' : 1})

---
## Check correlation between features

In [None]:
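# numeric_only=True skips the remaining categorical (object) columns,
# which would otherwise raise an error in pandas 2.x
correlation = np.round(df.corr(numeric_only=True), 2)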

dropSelf = np.zeros_like(correlation)
dropSelf[np.triu_indices_from(dropSelf)] = True

plt.figure(figsize=(12,12))
sns.set_style("white")
sns.heatmap(data=correlation, annot=True, cmap=sns.diverging_palette(240, 10, n=15), mask=dropSelf)

### From the above correlation matrix it is clear that there is high correlation between:
* months_customer & age
* available_credit & card_limit
* util_ratio & balance
* ttm_trans_cnt & ttm_trans

### The only pair I am going to deal with at this point is the almost perfectly correlated available_credit and card_limit. The available_credit feature is simply card_limit - balance, so it is redundant and I opt to remove it.
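The claim that one column is fully determined by two others can be checked directly before dropping it. The snippet below is a minimal sketch on made-up toy rows, not the bank data; the helper name `is_derived` is hypothetical and only illustrates the idea.

```python
import numpy as np
import pandas as pd

# Toy rows invented for illustration (not taken from BankChurners.csv)
toy = pd.DataFrame({
    'card_limit': [10000.0, 5000.0, 2500.0],
    'balance': [1500.0, 0.0, 2500.0],
})
toy['available_credit'] = toy['card_limit'] - toy['balance']

def is_derived(frame, target, minuend, subtrahend):
    """Return True if target equals minuend - subtrahend on every row."""
    expected = frame[minuend] - frame[subtrahend]
    return bool(np.allclose(frame[target], expected))

print(is_derived(toy, 'available_credit', 'card_limit', 'balance'))
```

Running the same check on the real dataframe (with the renamed columns) would confirm whether the relationship holds for every customer or only approximately.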

In [None]:
df.drop(columns='available_credit', inplace=True)
df.shape

---

## Naive Analysis
### For a baseline score I will simply take the data as-is and run it through some models.

In [None]:
# Create Feature and Target Variables
X = pd.get_dummies(df)
X.drop(columns='churn', inplace=True)
y = df['churn']

# Function to run and analyze each model
def run_model(x, y, name):
    # Split Data
    X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=1)
   
    if name == 'logistic':
        model = LogisticRegression(max_iter=1000,
                                   solver='liblinear')
    elif name == 'logistic_bal':
        model = LogisticRegression(max_iter=1000,
                                   solver='liblinear',
                                   class_weight='balanced')     
    elif name == 'rand_forest':
        model = RandomForestClassifier()
    elif name == 'rand_forest_bal':
        model = RandomForestClassifier(class_weight='balanced_subsample')
    elif name == 'xgb':
        model = XGBClassifier()
    elif name == 'xgb_bal':
        # Note, the weight value was calculated using the two
        # churn target variable values:
        # (total non-churn customer) / (total churn)
        model = XGBClassifier(scale_pos_weight=5.22)
    else:
        # Stop here; continuing would fail later because no model was created
        raise ValueError(f'Incorrect model name: {name}')

    # Cross-Validation method 1:  cross_val_predict()
    # Note: if accuracy score is extremely high, may have overfitting
    cv_pred = cross_val_predict(model, X_train, y_train, cv=5)
    print(f'Training Data CV Score Method 1: {np.round(metrics.accuracy_score(y_train, cv_pred),4) * 100}%') 
        
    # Cross-Validation method 2:  cross_val_score()
    kfold = StratifiedKFold(n_splits=5,shuffle=True,random_state=1)
    cv_result = cross_val_score(model, X_train, y_train, cv=kfold, scoring="accuracy")
    print(f'Training Data CV Score Method 2: {np.round(cv_result.mean(),4) * 100}%')

    # Fit model
    model.fit(X_train, y_train)

    # Make predictions
    y_pred = model.predict(X_test)
    print(f'Testing Data Accuracy Score: {np.round(metrics.accuracy_score(y_test, y_pred), 4) * 100}%')

    # Classification Report
    print(f'\n\n{name} Classification Report:')
    print('------------------------------------------------------------')
    print(metrics.classification_report(y_test, y_pred))

    # Confusion Matrix Heat Map
    sns.heatmap(metrics.confusion_matrix(y_test,y_pred), annot=True, fmt=".0f")
    plt.title(f'{name} confusion matrix')
    plt.xlabel('Predicted Values')
    plt.ylabel('Actual Values')
    plt.show()
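The hard-coded `scale_pos_weight=5.22` in `run_model` comes from the ratio described in its comment. A minimal sketch of deriving that ratio from a label vector is shown below; the helper name `scale_pos_weight_from` is hypothetical, and the labels are rebuilt from the full-dataset class counts printed earlier rather than from the training split.

```python
def scale_pos_weight_from(labels):
    """Ratio of negative (0) to positive (1) samples."""
    positives = sum(1 for label in labels if label == 1)
    negatives = sum(1 for label in labels if label == 0)
    if positives == 0:
        raise ValueError('No positive samples; the ratio is undefined.')
    return negatives / positives

# Rebuild the label vector from the class counts shown earlier
# (8500 existing customers, 1627 attrited customers)
labels = [0] * 8500 + [1] * 1627

ratio = scale_pos_weight_from(labels)
print(round(ratio, 2))  # 5.22
```

Computing the ratio from `y_train` inside the function would keep the weight in sync with whichever split is used, at the cost of a slightly different value than the one above.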

---
## Logistic Regression

In [None]:
run_model(X, y, 'logistic')

### On paper, the results for the basic logistic regression look decent, with an overall accuracy of 89%. However, the model is poor at predicting churn: it correctly identifies only about 51% of the churned customers.

### The reason for this is the imbalance in the target variable. Because only 16% of the samples are churned customers, a model can reach a high overall accuracy by favoring the majority class while still missing many of the churned customers.
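To see why overall accuracy can mislead here, consider a trivial baseline that labels every customer as "existing". The sketch below uses only the class counts printed earlier; the function name `baseline_scores` is hypothetical.

```python
# Class counts reported by value_counts() earlier in this notebook
n_existing, n_attrited = 8500, 1627

def baseline_scores(n_negative, n_positive):
    """Accuracy and positive-class recall for a model that always
    predicts the negative (majority) class."""
    true_negatives = n_negative   # every existing customer is labeled correctly
    true_positives = 0            # no churned customer is ever flagged
    accuracy = (true_negatives + true_positives) / (n_negative + n_positive)
    recall = true_positives / n_positive
    return accuracy, recall

accuracy, recall = baseline_scores(n_existing, n_attrited)
print(f'Accuracy: {accuracy:.2%}, churn recall: {recall:.2%}')
```

Such a baseline scores well on accuracy without finding a single churned customer, which is why the per-class numbers in the classification report matter more than the headline accuracy.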

### Thankfully, sklearn provides hyperparameters for dealing with imbalanced data. Below I re-run the logistic regression, this time including the hyperparameter:
* class_weight = 'balanced'
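The scikit-learn documentation describes the 'balanced' mode as weighting each class by `n_samples / (n_classes * np.bincount(y))`. The sketch below reproduces that formula in plain Python using the full-dataset class counts from this notebook; the helper name `balanced_class_weights` is hypothetical, and inside `run_model` the weights would be derived from the training split, so the exact values there would differ slightly.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Per-class weights following n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count)
            for cls, count in counts.items()}

# Full-dataset class counts from this notebook (0 = existing, 1 = attrited)
labels = [0] * 8500 + [1] * 1627

weights = balanced_class_weights(labels)
print({cls: round(weight, 3) for cls, weight in weights.items()})
```

The minority class receives the larger weight, so each misclassified churned customer costs the model more during fitting than a misclassified existing customer.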


In [None]:
# Balanced logistic regression
run_model(X, y, 'logistic_bal')

### Notice that after adding the balanced hyperparameter, the overall accuracy score decreased, yet the share of churned customers predicted correctly increased dramatically. This makes it a more useful model overall, since both churned and non-churned customers are now predicted correctly roughly 85% of the time.

### One negative side effect of using the 'balanced' hyperparameter is that many non-churned customers are now classified as churned.
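One way to keep an eye on this trade-off during cross-validation is to score precision and recall for the churn class alongside accuracy. The sketch below uses scikit-learn's `cross_validate` on synthetic stand-in data generated with `make_classification` (an assumption for illustration, not the bank data), with a class split of roughly 84/16.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in data with roughly an 84/16 class split (not the bank data)
X_demo, y_demo = make_classification(n_samples=1000, n_features=10,
                                     weights=[0.84, 0.16], random_state=1)

model = LogisticRegression(max_iter=1000, solver='liblinear',
                           class_weight='balanced')
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Score accuracy together with precision and recall for the positive class
scores = cross_validate(model, X_demo, y_demo, cv=kfold,
                        scoring=['accuracy', 'precision', 'recall'])

for metric in ('accuracy', 'precision', 'recall'):
    print(f"{metric}: {scores[f'test_{metric}'].mean():.3f}")
```

A drop in precision next to a rise in recall would correspond to the side effect described above: more churned customers are caught, but more existing customers are flagged by mistake.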


### At this point I could try removing features or adjusting other settings to improve the score, but since I was able to get a reasonable baseline, I opt to try some other models to see if that alone improves the scores.

---

## Random Forest

In [None]:
# Unbalanced Random Forest
run_model(X, y, 'rand_forest')

In [None]:
# Random Forest Balanced
run_model(X, y, 'rand_forest_bal')

### The Random Forest results are interesting. First, the balanced model scored lower than the unbalanced one. Second, in both cases only around 70% of the churned customers are predicted correctly. As things stand, this model is an improvement over unbalanced LR, but balanced LR is still better overall.

---

## XGBoost XGBClassifier

In [None]:
# XGBClassifier Unbalanced 
run_model(X, y, 'xgb')

In [None]:
# XGBClassifier Balanced
run_model(X, y, 'xgb_bal')

### The XGBClassifier wins on all levels, and the balanced model works best. The final model correctly predicts about 90% of the customers who churn. With the limited data supplied, minimal model tuning, and no major changes to the data, I am satisfied with these results.

### The scores could potentially be improved through other means. I tried scaling the data, log-transforming right-skewed features, and a few other things, but none of them improved the scores much (if at all).
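The exact preprocessing code for those experiments is not shown in this notebook. As a rough sketch of how a log transform and scaling could be combined with a model, the example below chains `FunctionTransformer(np.log1p)`, `StandardScaler`, and `LogisticRegression` in a scikit-learn pipeline. The data is synthetic and right-skewed by construction, and applying the log transform to every column is an assumption made for brevity.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Synthetic, right-skewed, non-negative features (not the bank data)
rng = np.random.default_rng(1)
X_demo = rng.exponential(scale=1000.0, size=(500, 3))
y_demo = (X_demo[:, 0] > np.median(X_demo[:, 0])).astype(int)

# log1p compresses the long right tail, then the scaler standardizes each
# column; both steps are fit only on the training folds inside cross-validation
pipeline = make_pipeline(
    FunctionTransformer(np.log1p),
    StandardScaler(),
    LogisticRegression(max_iter=1000),
)

cv_scores = cross_val_score(pipeline, X_demo, y_demo, cv=5, scoring='accuracy')
print(f'Mean CV accuracy: {cv_scores.mean():.3f}')
```

Wrapping the preprocessing in a pipeline avoids leaking statistics from the validation folds into the scaler. Tree-based models such as Random Forest and XGBoost are largely insensitive to monotonic feature transforms, which may be one reason these steps changed the scores so little.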