# **Content**
1. Data Review
2. Exploratory Analysis
3. Feature Engineering
4. Prepare Train/Test Data for Modeling
5. Modeling
6. SMOTENC
7. Model Tuning
8. Final Evaluation on the Entire Dataset
9. Recommendations

In [None]:
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.preprocessing import MinMaxScaler

from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier 
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score

from sklearn.metrics import f1_score
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

from imblearn.over_sampling import SMOTENC

from IPython.display import Image

pd.set_option('display.max_columns', None); pd.set_option('display.max_rows', None);

In [None]:
Image(filename="/kaggle/input/predictive-model-process/predictive_modeling.png")

In [None]:
# Import dataset
df = pd.read_csv('/kaggle/input/predicting-churn-for-bank-customers/Churn_Modelling.csv')

# **Quick Data Review**

In [None]:
# Check data types
df.dtypes

In [None]:
# Create data summary
df.describe()

Based on the summary, every numeric variable has 10,000 values, so there are no missing values in those columns.
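`describe()` only summarizes numeric columns, so its count row says nothing about `Surname`, `Geography` or `Gender`. A minimal sketch of an explicit check is shown here, using a small made-up frame (`toy` and its values are hypothetical) in place of `df`:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df, with one missing value injected on purpose
toy = pd.DataFrame({
    'CreditScore': [619, 608, np.nan, 699],
    'Geography': ['France', 'Spain', 'France', 'Germany'],
    'Exited': [1, 0, 1, 0],
})

# Count missing values per column, including the non-numeric ones
missing_per_column = toy.isnull().sum()
print(missing_per_column)
```

On the real data, `df.isnull().sum()` should return zero for every column if the conclusion above also holds for the text columns.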

In [None]:
# Check Churn rate
df["Exited"].value_counts()

The churn rate is ~20%, so the dataset is relatively imbalanced.
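The rate can also be read directly as a proportion rather than derived from raw counts. A small sketch, assuming a made-up label series (`toy_exited` is hypothetical, not the real `Exited` column):

```python
import pandas as pd

# Hypothetical labels: 2 churned clients out of 10
toy_exited = pd.Series([0, 0, 1, 0, 0, 0, 1, 0, 0, 0], name='Exited')

# normalize=True returns class shares instead of raw counts
class_share = toy_exited.value_counts(normalize=True)
churn_rate = class_share.loc[1]
print(f'Churn rate: {churn_rate:.1%}')
```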

In [None]:
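# Check the difference between clients who stayed vs. churned (numeric columns only,
# because text columns such as Surname cannot be averaged)
df.groupby('Exited').mean(numeric_only=True)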

On average, clients who left have a lower credit score, higher age, shorter tenure, larger balance, fewer products, a lower share of credit card holders, a lower share of active members, and a higher estimated salary.

In [None]:
# Check the impact of churned clients (totals of numeric columns only)
df.groupby('Exited').sum(numeric_only=True)

Client attrition results in a loss of $186M in investment balances and 3,005 products.
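The two loss figures come from the `Balance` and `NumOfProducts` totals for `Exited == 1`. A minimal sketch of pulling out just those two numbers, on a made-up frame (`toy` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical stand-in for df
toy = pd.DataFrame({
    'Balance': [0.0, 125000.0, 80000.0, 50000.0],
    'NumOfProducts': [2, 1, 1, 3],
    'Exited': [0, 1, 0, 1],
})

# Restrict to churned clients, then total only the two columns of interest
churn_loss = toy.loc[toy['Exited'] == 1, ['Balance', 'NumOfProducts']].sum()
print(churn_loss)
```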

In [None]:
# Drop RowNumber, CustomerId and Surname, which are not useful for modeling
df_ml = df.drop(['RowNumber','CustomerId','Surname'], axis = 1)

# **Exploratory Analysis and Data Preprocessing**

In [None]:
# Explore categorical/binary data 
fig, axarr = plt.subplots(2, 2, figsize = (20, 12))
sb.countplot(x = 'Geography', hue = 'Exited', data = df_ml, ax = axarr[0][0],palette='OrRd')
sb.countplot(x = 'Gender', hue = 'Exited', data = df_ml, ax = axarr[0][1],palette='OrRd')
sb.countplot(x = 'HasCrCard', hue = 'Exited', data = df_ml, ax = axarr[1][0],palette='OrRd')
sb.countplot(x = 'IsActiveMember', hue = 'Exited', data = df_ml, ax = axarr[1][1],palette='OrRd');

Based on the bar charts:
* The 'Germany', 'Female', 'No Credit Card' and 'Not Active' groups have higher churn rates.
* To understand why the 'Germany' and 'Female' groups have higher churn rates, I would recommend that the bank perform a deeper analysis to see whether clients in those groups have different behaviors/preferences compared to other groups, and then design specific loyalty programs to better engage with clients in those two groups.
* Because clients who have a credit card and are digitally active have lower churn rates, I would recommend that the bank run campaigns to promote its credit card and digital services.
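Count plots show volumes, so the churn rate per group has to be judged by eye. A short sketch of computing the rate directly, assuming a made-up frame (`toy` is hypothetical); the same pattern would apply to `Gender`, `HasCrCard` and `IsActiveMember`:

```python
import pandas as pd

# Hypothetical stand-in for df_ml
toy = pd.DataFrame({
    'Geography': ['Germany', 'Germany', 'France', 'France', 'France',
                  'France', 'Spain', 'Spain', 'Spain'],
    'Exited':    [1, 0, 0, 0, 0, 1, 0, 0, 1],
})

# The mean of a 0/1 flag is the share of 1s, i.e. the churn rate per group
churn_by_geo = (toy.groupby('Geography')['Exited']
                   .mean()
                   .sort_values(ascending=False))
print(churn_by_geo)
```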

In [None]:
# Explore continuous variables
def kdeplot(var):
    facet = sb.FacetGrid(df_ml, hue = 'Exited', aspect = 3,palette='OrRd')
    facet.map(sb.kdeplot, var, fill = True)  # 'fill' replaces the deprecated 'shade' argument
    facet.set(xlim = (0, df_ml[var].max()))
    facet.add_legend();
    
kdeplot('CreditScore')
kdeplot('EstimatedSalary')
kdeplot('Tenure')
kdeplot('Age')
kdeplot('Balance')
kdeplot('NumOfProducts')

Based on the kernel distributions:
* There are no significant differences in the CreditScore, EstimatedSalary and Tenure distributions between 'exited' and 'not exited' clients.
* Older clients have higher churn rates.
* Clients with a higher Balance and a higher NumOfProducts have higher churn rates. This result may indicate that these clients purchased more products than they needed. Once clients realize that, the trust between them and the bank breaks and they leave. So I would recommend that the bank evaluate its sales strategy to balance short-term profit against long-term client lifetime value.
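The age effect read from the density plot can be cross-checked with a simple table of churn rate per age band. This is only a sketch: the frame `toy` and the bin edges are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical stand-in for df_ml
toy = pd.DataFrame({
    'Age':    [25, 32, 38, 45, 52, 58, 63, 70],
    'Exited': [0, 0, 0, 1, 1, 1, 0, 1],
})

# Assumed bin edges; intervals are right-closed: (18, 40], (40, 60], (60, 100]
toy['AgeBand'] = pd.cut(toy['Age'], bins=[18, 40, 60, 100],
                        labels=['18-40', '41-60', '61-100'])

# Churn rate per age band
churn_by_age = toy.groupby('AgeBand', observed=True)['Exited'].mean()
print(churn_by_age)
```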

In [None]:
# Check outliers
fig, axarr = plt.subplots(3, 2, figsize = (20, 12))
sb.boxplot(y = 'CreditScore',x = 'Exited', hue = 'Exited', data = df_ml, ax = axarr[0][0],palette='OrRd')
sb.boxplot(y = 'Age',x = 'Exited', hue = 'Exited', data = df_ml , ax = axarr[0][1],palette='OrRd')
sb.boxplot(y = 'EstimatedSalary',x = 'Exited', hue = 'Exited', data = df_ml, ax = axarr[1][0],palette='OrRd')
sb.boxplot(y = 'Balance',x = 'Exited', hue = 'Exited', data = df_ml, ax = axarr[1][1],palette='OrRd')
sb.boxplot(y = 'Tenure',x = 'Exited', hue = 'Exited', data = df_ml, ax = axarr[2][0],palette='OrRd')
sb.boxplot(y = 'NumOfProducts',x = 'Exited', hue = 'Exited', data = df_ml, ax = axarr[2][1],palette='OrRd');

Although there are points outside the upper and lower whiskers, they are within the normal range based on common sense. For example, age is within [18, 100] and credit score is within [350, 850]. So I won't modify those 'outliers'.
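Boxplot whiskers by default follow the 1.5 × IQR rule, so the same points can be listed explicitly and compared against a plausible range. A minimal sketch on made-up ages (the series `ages` is hypothetical; the [18, 100] range is the one quoted above):

```python
import pandas as pd

# Hypothetical ages, including one value far above the rest
ages = pd.Series([22, 30, 35, 37, 40, 42, 45, 48, 92], name='Age')

# 1.5 * IQR fences, the same rule the boxplot whiskers use by default
q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Points outside the fences, and whether each is still a plausible age
flagged = ages[(ages < lower) | (ages > upper)]
plausible = flagged.between(18, 100)
print(flagged.tolist(), plausible.tolist())
```

A point that is flagged by the fences but still plausible is a candidate to keep, which is the decision taken here.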

In [None]:
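# Check multicollinearity (numeric columns only; Geography and Gender are still text here)

# No strong linear correlation between variables
f, ax = plt.subplots(figsize= [15,10])
sb.heatmap(df_ml.corr(numeric_only=True), annot=True, fmt=".2f", ax=ax, cmap = "Blues" );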

# **Feature Engineering**

My assumption is that credit score, salary, tenure and investment balance are correlated with age, so I want to check the true impact of those four variables on the churn rate after controlling for age.

In [None]:
# Create new features
df_ml['CreditByAge'] = df_ml['CreditScore'] / df_ml['Age'] 
df_ml['SalaryByAge'] = df_ml['EstimatedSalary'] / df_ml['Age'] 
df_ml['TenureByAge'] = df_ml['Tenure'] / df_ml['Age'] 
df_ml['BalanceByAge'] = df_ml['Balance'] / df_ml['Age'] 

In [None]:
kdeplot('CreditByAge')
kdeplot('SalaryByAge')
kdeplot('TenureByAge')
kdeplot('BalanceByAge')

Based on the distributions, I can conclude that:
* If two clients are the same age, the one with the lower credit score, lower salary, shorter tenure and lower balance has the higher churn rate.
* Clients who meet the above criteria are likely falling behind on their financial planning. I would recommend that the bank develop programs to help those clients catch up; by doing so, the bank can deepen its relationship with them.

# **Prepare Train/Test Data for Modeling**

In [None]:
# One-hot encode the categorical attribute
cat_vars = ['Geography']
df_ml = pd.get_dummies(df_ml, columns = cat_vars, prefix = cat_vars)

In [None]:
# Convert Gender to female:1, male:0
df_ml['Gender_Female'] = np.where(df_ml['Gender'] == 'Female', 1, 0)

In [None]:
num_vars = ['CreditByAge','SalaryByAge','TenureByAge','BalanceByAge','NumOfProducts']
bin_vars = ['HasCrCard','IsActiveMember','Geography_France','Geography_Germany','Geography_Spain','Gender_Female']

In [None]:
# Split data into train and test sets
seed = 7
test_size = 0.3

X = df_ml[num_vars + bin_vars]
Y = df_ml['Exited']

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)

In [None]:
# Scale all features to [0, 1]; the scaler is fit on the training set only (binary columns keep their 0/1 values)
scaler = MinMaxScaler()
X_train_transform = scaler.fit_transform(X_train)
X_test_tranform = scaler.transform(X_test)
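The cell above learns each column's minimum and maximum from the training split and reuses them on the test split. A small sketch of what that implies, on made-up arrays (`train` and `test` are hypothetical): binary columns pass through unchanged, and test values outside the training range land outside [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data: column 0 is continuous, column 1 is a binary flag
train = np.array([[10.0, 0.0], [20.0, 1.0], [30.0, 0.0]])
test = np.array([[40.0, 1.0]])

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)  # learns min/max from train only
test_scaled = scaler.transform(test)        # reuses the train min/max

print(train_scaled)
print(test_scaled)
```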

# **Model Training**

In [None]:
def model_train_eval(model, X_train, y_train, X_test, y_test):
    model_fit = model.fit(X_train, y_train)
    
    pred_prob = model_fit.predict_proba(X_test)[:,1]
    pred = model_fit.predict(X_test)
    
    f1 = f1_score(y_test.values, pred, average = 'macro')
    auc = roc_auc_score(y_test.values,pred_prob) 
    
    fpr, tpr, _ = roc_curve(y_test, pred_prob)
    
    return model_fit, f1, auc, fpr, tpr

In [None]:
def base_models(X_train, y_train, X_test, y_test, cv, scoring):
    model_list = []
    model_name_list = []
    f1_list = []
    auc_list = []
    cv_mean_list = []
    cv_std_list = []
    fpr_list = []
    tpr_list = []
    
    model_list.append(('KNN', KNeighborsClassifier(n_neighbors = 7)))
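    # base_estimator was left at its default (None); the argument is omitted because
    # newer scikit-learn versions no longer accept it
    model_list.append(('AdaBoost', AdaBoostClassifier(n_estimators = 200, learning_rate = 0.2)))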
    model_list.append(('GradientBoost', GradientBoostingClassifier(learning_rate = 0.1, n_estimators = 200)))
    model_list.append(('XGBoost', XGBClassifier(booster='gbtree', eta = 0.1, gamma = 0.01, 
                                                objective = 'binary:logistic', eval_metric = 'auc')))
    model_list.append(('Random Forest', RandomForestClassifier(n_estimators=10, criterion='entropy',
                                                               class_weight = 'balanced')))
    
    for name, model in model_list:
        model_fit, f1, auc, fpr, tpr = model_train_eval(model, X_train, y_train, X_test, y_test)
        cv_score = cross_val_score(model_fit, X_train, y_train, cv = cv, scoring = scoring)

        model_name_list.append(name)
        f1_list.append(f1)
        auc_list.append(auc)
        cv_mean_list.append(cv_score.mean())
        cv_std_list.append(cv_score.std())
        fpr_list.append(fpr)
        tpr_list.append(tpr)

    performance = {'CV_Mean': cv_mean_list, 'CV_Std': cv_std_list, 'Test F1': f1_list, 'Test AUC': auc_list}
    pf_metrics = pd.DataFrame(performance, index = model_name_list)
    
    #------ Plot ROC ------#      
    plt.figure(figsize = (12,6), linewidth= 1)
    for i in range(len(model_name_list)):
        plt.plot(fpr_list[i], tpr_list[i], label = model_name_list[i]+': '+ str(round(auc_list[i], 3)))
    plt.plot([0,1], [0,1], 'k--', label = 'Random guessing: 0.5')
    plt.xlabel('False positive rate')
    plt.ylabel('True positive rate')
    plt.title('ROC Curve ')
    plt.legend(loc='best')
    plt.show()
    
    return pf_metrics

In [None]:
base_models(X_train_transform, y_train, X_test_tranform, y_test, 5, 'f1_macro')

# **Oversampling: SMOTENC**

In [None]:
# Create a balanced training dataset 
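# The binary columns follow the numeric ones in X, so their indices run up to the last column
cat_idx = list(range(len(num_vars), len(num_vars) + len(bin_vars)))
sm = SMOTENC(categorical_features = cat_idx, random_state = 101)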
X_train_over, y_train_over = sm.fit_resample(X_train_transform, y_train)

In [None]:
base_models(X_train_over, y_train_over, X_test_tranform, y_test, 5, 'f1_macro')

According to the SMOTENC results, oversampling improves the CV F1 score. However, the CV results are better than the results on the test data, which points to an overfitting issue when using SMOTENC. One likely contributor is that the CV folds are drawn from data that has already been oversampled, so synthetic points that resemble training points also end up in the validation folds.

Because GradientBoost and XGBoost outperform the other models in both situations, I choose those two models for further parameter tuning.

# **Model Tuning**

## **GradientBoost**

In [None]:
#### Tuning based on oversampled data ####
gb_Params = {'learning_rate' : [0.01, 0.05, 0.1, 0.15, 0.2],
             'n_estimators': list(range(50, 300, 50)),
             'min_samples_split': [60, 80, 100, 120],
             'min_samples_leaf':[30, 40, 50, 60],
             'max_depth': [5, 6, 7, 8]}

# Initialization
gbModelOver = GradientBoostingClassifier(random_state = 1234)

randSearchGbOver = RandomizedSearchCV(estimator = gbModelOver, param_distributions = gb_Params, n_iter = 100, 
                                      verbose=1, scoring = 'f1_macro', cv = 3, random_state = 1234)
# Fit model with oversampled data
randSearchGbOver.fit(X_train_over, y_train_over)

# Print best parameters and best score
randSearchGbOver.best_params_, randSearchGbOver.best_score_

In [None]:
#### Tuning based on original data ####
gb_Params = {'learning_rate' : [0.01, 0.05, 0.1, 0.15, 0.2],
             'n_estimators': list(range(50, 300, 50)),
             'min_samples_split': [60, 80, 100, 120],
             'min_samples_leaf':[30, 40, 50, 60],
             'max_depth': [5, 6, 7, 8]}

# Initialization
gbModel = GradientBoostingClassifier(random_state = 1234)

randSearchGB = RandomizedSearchCV(estimator = gbModel, param_distributions = gb_Params, n_iter = 100, verbose=1,
                                   scoring = 'f1_macro', cv = 3, random_state = 1234)
# Fit model
randSearchGB.fit(X_train_transform, y_train)

# Print best parameters and best score
randSearchGB.best_params_, randSearchGB.best_score_

In [None]:
def f1_auc(model, model_name, x_train, y_train, x_test, y_test):
    bestModel = model.best_estimator_.fit(x_train, y_train)
    
    bestPredProb = bestModel.predict_proba(x_test)[:,1]
    bestPred = bestModel.predict(x_test)
    
    bestAUC = roc_auc_score(y_test.values,bestPredProb)
    bestF1 = f1_score(y_test.values, bestPred, average = 'macro')
    
    # Both scores are printed with 3 significant digits
    print('{} - F1: {:.3}, AUC: {:.3}'.format(model_name, bestF1, bestAUC))
    
    return bestPred, bestModel

In [None]:
# Evaluate models on test dataset
bestBGPredOver, bestBGModelOver = f1_auc(randSearchGbOver, 'GradientBoost + SMOTENC', X_train_over, y_train_over, X_test_tranform, y_test)

bestBGPred, bestBGModel = f1_auc(randSearchGB, 'GradientBoost', X_train_transform, y_train, X_test_tranform, y_test)

In [None]:
# Create classification report
print('Classification report for oversampled dataset')
print(classification_report(y_test.values, bestBGPredOver))

print('\nClassification report for original dataset')
print(classification_report(y_test.values, bestBGPred))

In [None]:
def feature_imp(model):
    feature = pd.Series(model.feature_importances_, index = X.columns).sort_values(ascending = False)

    sb.barplot(x = feature, y = feature.index)
    # Importances are plotted on the x-axis and feature names on the y-axis
    plt.xlabel('Importance')
    plt.ylabel('Features')
    plt.show()

In [None]:
feature_imp(bestBGModelOver)

In [None]:
feature_imp(bestBGModel)

## **XGBoost**

In [None]:
#### Tuning based on oversampled data ####
xg_Params = {'eta' : [0.01, 0.05, 0.1, 0.15, 0.2],
             'gamma': [0, 0.01,0.02],
             'reg_alpha': [0, 0.5, 1],
             'reg_lambda': [1, 1.5, 2],
             'subsample': [0.6, 0.8, 1],
             'n_estimators': list(range(50, 500, 50)),
             'max_depth': [5, 6, 7, 8],
             'min_child_weight': [0.5, 1.0, 3.0, 5.0]}

# Initialization
xgModelOver = XGBClassifier(booster='gbtree', objective = 'binary:logistic', eval_metric = 'auc', seed = 123)

randSearchXgOver = RandomizedSearchCV(estimator = xgModelOver, param_distributions = xg_Params, n_iter = 100, 
                                      verbose=1,scoring = 'f1_macro', cv = 3, random_state = 1234)

# Fit model with oversampled data
randSearchXgOver.fit(X_train_over, y_train_over)

# Print best parameters and best score
randSearchXgOver.best_params_, randSearchXgOver.best_score_

In [None]:
#### Tuning based on original data ####
xg_Params = {'eta' : [0.01, 0.05, 0.1, 0.15, 0.2],
             'gamma': [0, 0.01,0.02],
             'reg_alpha': [0, 0.5, 1],
             'reg_lambda': [1, 1.5, 2],
             'subsample': [0.6, 0.8, 1],
             'n_estimators': list(range(50, 500, 50)),
             'max_depth': [5, 6, 7, 8],
             'min_child_weight': [0.5, 1.0, 3.0, 5.0]}

# Initialization
xgModel = XGBClassifier(booster='gbtree', objective = 'binary:logistic', eval_metric = 'auc', seed = 123)

randSearchXG = RandomizedSearchCV(estimator = xgModel, param_distributions = xg_Params, n_iter = 100, verbose=1,
                                   scoring = 'f1_macro', cv = 3, random_state = 1234)

# Fit model
randSearchXG.fit(X_train_transform, y_train)


# Print best parameters and best score
randSearchXG.best_params_, randSearchXG.best_score_

In [None]:
# Evaluate models on test dataset
bestXGPredOver, bestXGModelOver = f1_auc(randSearchXgOver, 'XGBoost + SMOTENC', X_train_over, y_train_over, X_test_tranform, y_test)
bestXGPred, bestXGModel = f1_auc(randSearchXG, 'XGBoost', X_train_transform, y_train, X_test_tranform, y_test)

In [None]:
# Create classification report
print('Classification report for oversampled dataset')
print(classification_report(y_test.values, bestXGPredOver))

print('\nClassification report for original dataset')
print(classification_report(y_test.values, bestXGPred))

In [None]:
feature_imp(bestXGModelOver)

In [None]:
feature_imp(bestXGModel)

Based on the exploratory analysis above, churned clients account for a $186M balance loss and a loss of 3,005 products, so the ability to correctly identify churning clients is the most important property I am looking for. That is why I choose Recall (class = 1) to measure model performance.

After comparing the classification reports of the four models, XGBoost + SMOTENC has the highest Recall score, so I choose this combination as my final model.
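Recall for class 1 is the share of actual churners that the model flags, TP / (TP + FN). A minimal sketch with made-up labels (`y_true_toy` and `y_pred_toy` are hypothetical), so the number can be checked by hand:

```python
from sklearn.metrics import recall_score

# Hypothetical labels: 4 actual churners, 3 of them predicted as churn
y_true_toy = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred_toy = [1, 0, 1, 0, 0, 1, 0, 1]

# pos_label=1 makes the churn class the one being scored
churn_recall = recall_score(y_true_toy, y_pred_toy, pos_label=1)
print(churn_recall)  # 3 true positives / (3 true positives + 1 false negative) = 0.75
```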

# **Final Model Evaluation on Entire Dataset**

In [None]:
# Concatenate X_train_transform and X_test_tranform
X_transform = np.concatenate((X_train_transform,X_test_tranform), axis=0) 
Y = np.concatenate((y_train,y_test), axis=0) 

In [None]:
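# Use the best parameters of the selected model (XGBoost tuned on oversampled data)
# to fit/predict the entire dataset; the scores below are therefore in-sample
finalModel = randSearchXgOver.best_estimator_.fit(X_transform, Y)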

finalPredProb = finalModel.predict_proba(X_transform)[:,1]
finalPred = finalModel.predict(X_transform)

finalAUC = roc_auc_score(Y,finalPredProb)
finalF1 = f1_score(Y, finalPred, average = 'macro')

print('Final model results - F1: {:.3}, AUC: {:.3}'.format(finalF1, finalAUC))

In [None]:
# Check final prediction output 
confusion_matrix(Y, finalPred)
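In scikit-learn's confusion matrix, rows are true classes and columns are predicted classes, so for a binary problem the four cells can be unpacked by name. A small sketch with made-up labels (`y_true_toy` and `y_pred_toy` are hypothetical):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 5 clients stayed, 3 churned
y_true_toy = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred_toy = [0, 1, 0, 0, 1, 0, 1, 0]

# ravel() flattens [[tn, fp], [fn, tp]] row by row
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()
print(f'TN={tn}, FP={fp}, FN={fn}, TP={tp}')
```

Here `fn` counts churners the model missed, which is the cell that Recall (class = 1) penalizes.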

# **Recommendations**

**Data enrichment**
* Historical investment balance and product data: when clients are planning to leave, they start withdrawing money from their accounts and terminating their products/services. This trend is a good indicator of client attrition.
* Client survey scores and comments: survey results can help CWM understand whether its clients are satisfied with the services.
* Client engagement across different footprints: tracking clients' footprints from different sources, such as website, email, mail and social media, can help CWM understand how different client segments interact with different channels and which channels have better client retention rates.

**Other feature engineering methods**
* Try different data transformation methods, such as ratio, log transform, polynomial transformation, etc., to capture non-linear relationships between dependent and independent variables
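As a rough illustration of two of these transformations, the sketch below applies a log transform and a degree-2 polynomial expansion to made-up values (`balance` and `age` are hypothetical arrays, not columns taken from the dataset):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical skewed variable; log1p computes log(1 + x), so zeros are safe
balance = np.array([[0.0], [1000.0], [100000.0]])
log_balance = np.log1p(balance)

# Degree-2 expansion of a single variable yields the columns [x, x^2]
age = np.array([[30.0], [40.0]])
poly = PolynomialFeatures(degree=2, include_bias=False)
age_poly = poly.fit_transform(age)

print(log_balance.ravel())
print(age_poly)
```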

**Other Machine Learning models**
* Try other classification models, such as SVM or neural networks, to find potential opportunities to improve prediction results