# Analyzing Credit Card Customer Churn Behaviour 
**Problem Statement:** A bank manager is concerned that more and more customers are leaving the bank's credit card services. The bank would like to predict which customers are likely to leave, so that it can proactively reach out to them with better services and reverse their decision.

**Data Source and Description:**
- https://leaps.analyttica.com/sample_cases/11
- https://www.kaggle.com/sakshigoyal7/credit-card-customers

The dataset contains records for 10,127 bank customers (rows) and 20 feature columns: 'Attrition_Flag', 'Customer_Age', 'Gender', 'Dependent_count', 'Education_Level', 'Marital_Status', 'Income_Category', 'Card_Category', 'Months_on_book', 'Total_Relationship_Count', 'Months_Inactive_12_mon', 'Contacts_Count_12_mon', 'Credit_Limit', 'Total_Revolving_Bal', 'Avg_Open_To_Buy', 'Total_Amt_Chng_Q4_Q1', 'Total_Trans_Amt', 'Total_Trans_Ct', 'Total_Ct_Chng_Q4_Q1', 'Avg_Utilization_Ratio'. It mixes categorical and numerical features. We select the most relevant of these features and use them to model customer attrition behaviour. Such a prediction system can act as an early warning system for the bank and prompt it to take action to retain customers.

**Columns:**

- **Clientnum**	Num	Client number. Unique identifier for the customer holding the account
- **Attrition_Flag**	char	Internal event (customer activity) variable
- **Customer_Age**	Num	Demographic variable - Customer's Age in Years
- **Gender**	Char	Demographic variable - M=Male, F=Female
- **Dependent_count**	Num	Demographic variable - Number of dependents
- **Education_Level**	Char	Demographic variable - Educational Qualification of the account holder (example: high school, college graduate, etc.)
- **Marital_Status**	Char	Demographic variable - Married, Single, Unknown
- **Income_Category**	Char	Demographic variable - Annual Income Category of the account holder (< 40K, 40K - 60K, 60K - 80K, 80K-120K, > 120K, Unknown)
- **Card_Category**	Char	Product Variable - Type of Card (Blue, Silver, Gold, Platinum)
- **Months_on_book**	Num	Months on book (Time of Relationship)
- **Total_Relationship_Count**	Num	Total no. of products held by the customer
- **Months_Inactive_12_mon**	Num	No. of months inactive in the last 12 months
- **Contacts_Count_12_mon**	Num	No. of Contacts in the last 12 months
- **Credit_Limit**	Num	Credit Limit on the Credit Card
- **Total_Revolving_Bal**	Num	Total Revolving Balance on the Credit Card
- **Avg_Open_To_Buy**	Num	Open to Buy Credit Line (Average of last 12 months)
- **Total_Amt_Chng_Q4_Q1**	Num	Change in Transaction Amount (Q4 over Q1)
- **Total_Trans_Amt**	Num	Total Transaction Amount (Last 12 months)
- **Total_Trans_Ct**	Num	Total Transaction Count (Last 12 months)
- **Total_Ct_Chng_Q4_Q1**	Num	Change in Transaction Count (Q4 over Q1) 
- **Avg_Utilization_Ratio**	Num	Average Card Utilization Ratio


## Index:
1. Exploratory Data Analysis and Visualizations
2. Outlier Management
3. Correlation heatmap and multicollinearity 
4. Feature Selection and further preprocessing 
5. Balancing the data (only 1,627 of the 10,127 customers end up leaving the bank; such an imbalance can adversely affect classification models)
6. Model Creation and Training (Ours is a binary classification problem)
    - Decision Tree Classifier 
    - Random Forest Classifier
    - Gradient Boost Classifier 
    - K Nearest Neighbor Classifier
7. Evaluation Metrics 
8. Insights and Conclusion
9. References

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import Image
from sklearn.preprocessing import LabelEncoder, StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split, GridSearchCV, KFold, cross_val_score
from sklearn.feature_selection import SelectKBest, chi2, f_classif
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn import metrics
from sklearn.metrics  import confusion_matrix, classification_report, accuracy_score, f1_score, roc_auc_score, roc_curve
from statsmodels.stats.outliers_influence import variance_inflation_factor
import imblearn

In [None]:
df = pd.read_csv('../input/credit-card-customers/BankChurners.csv')
df.drop(['Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1', 
        'Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2',
        'CLIENTNUM'], axis = 1, inplace = True)
df.to_csv('bank_churners.csv')
print(df.shape)
df.head()

-----------------------------------------------------------------

# Exploratory Data Analysis
- Duplicate rows 
- Outliers
- Null values 
- Correlations
- Distributions and Visualizations

In [None]:
df.info()

In [None]:
df.isnull().sum() 

# no null values 

In [None]:
df.duplicated().sum() 

# no duplicate rows 

#### The dataset is free from missing values and duplicate rows.

In [None]:
df.columns

In [None]:
df = df.rename(columns = {'Attrition_Flag': 'att', 
                          'Customer_Age': 'age', 'Gender': 'gender', 'Dependent_count': 'dep',
                          'Education_Level': 'edu', 'Marital_Status': 'marital', 'Income_Category': 'income', 
                          'Card_Category': 'card','Months_on_book': 'book_months', 
                          'Total_Relationship_Count': 'rel', 'Months_Inactive_12_mon': 'inactive',
                          'Contacts_Count_12_mon': 'contacts', 'Credit_Limit': 'credit_limit', 
                          'Total_Revolving_Bal': 'rev_bal', 'Avg_Open_To_Buy': 'buy', 
                          'Total_Amt_Chng_Q4_Q1': 'amt_change', 'Total_Trans_Amt': 'trans_amt',
                          'Total_Trans_Ct': 'trans_count', 'Total_Ct_Chng_Q4_Q1': 'trans_count_change', 
                          'Avg_Utilization_Ratio': 'util'})
df.columns

In [None]:
catcols= []
numcols = []
for column in df.columns:
    if df[column].dtype == 'object':
        catcols.append(column)
    elif df[column].dtype in ['int64', 'float64']:
        numcols.append(column)

print(catcols)
print("Number of categorical columns: ", len(catcols))
print()
print(numcols)
print("Number of numerical columns: ", len(numcols))

---------------------------------------------

## Categorical Variables 
- Nominal: Gender, Marital, att   ---- will need one hot encoding
- Ordinal: Education, Income, Card 

In [None]:
c = 1
plt.subplots(figsize=(17, 17))
plt.subplots_adjust(left=0.1,
                    bottom=0.1, 
                    right=0.9, 
                    top=0.5, 
                    wspace=0.4, 
                    hspace = 0.3)
for column in catcols:
    counts = df[column].value_counts()
    print(pd.DataFrame(counts))
    plt.subplot(2, 3, c)
    plt.pie(df[column].value_counts(), labels = counts.index)
    plt.title(f'{column} Class Distribution')
    print("---------")
    c = c+1

plt.show()

In [None]:
for column in catcols:
    print(df.groupby(column).att.value_counts())
    print('------------------------------------------')
    
# we note that we have only 20 records for platinum card owners 

In [None]:
c = 1
plt.subplots(figsize=(15, 15))
plt.subplots_adjust(left = 0.1,
                    bottom = 0.1, 
                    right = 0.9, 
                    top = 0.5, 
                    wspace = 0.4, 
                    hspace = 1.0) # space between rows 
for column in catcols:
    plt.subplot(2, 3, c)
    figure = sns.countplot(x = df[column], hue = df['att'], edgecolor = 'black', palette = "Set2")
    figure.set_xticklabels(labels = figure.get_xticklabels(), rotation=45)
    c = c+1
    plt.title(f'{column} and Attrition Behaviour')
    
plt.show()

In [None]:
# label encoding nominal variables, these will require one hot encoding 
le = {}
for column in ['att', 'gender', 'marital']: 
    le[f'le_{column}'] = LabelEncoder()
    labels = le[f'le_{column}'].fit_transform(df[column])
    df.insert(df.columns.get_loc(column) + 1, f'{column}_l', labels)
print(le) # dictionary: le contains all the fitted encoders
df.head()
# dictionary of already fit label encoders, can be used to transform user input while deploying 

In [None]:
catcols

In [None]:
df.card.value_counts()

In [None]:
# labels for ordinal features (LabelEncoder is not used here because it would not preserve the category order)
edu_order = {'Unknown': 0, 'Uneducated': 1, 'High School': 2, 'College': 3,
             'Graduate': 4, 'Post-Graduate': 5, 'Doctorate': 6}
income_order = {'Unknown': 0, 'Less than $40K': 1, '$40K - $60K': 2, '$60K - $80K': 3,
                '$80K - $120K': 4, '$120K +': 5}
card_order = {'Blue': 0, 'Silver': 1, 'Gold': 2, 'Platinum': 3}

# map each category to its rank; a category missing from a mapping would show up as NaN
edu_l = df['edu'].map(edu_order).tolist()
income_l = df['income'].map(income_order).tolist()
card_l = df['card'].map(card_order).tolist()

In [None]:
df.insert(df.columns.get_loc('edu') + 1, 'edu_l', edu_l)
df.insert(df.columns.get_loc('income') + 1, 'income_l', income_l)
df.insert(df.columns.get_loc('card') + 1, 'card_l', card_l)
df.columns

In [None]:
for column in catcols:
    print(df[column].value_counts())
    print(df[f'{column}_l'].value_counts())
    print("---------------------------")

In [None]:
df.to_csv('df1.csv', index = False)

---------------------------------------------

## Numerical Variables 

In [None]:
# Computing variances 
for column in numcols:
    print(f'{column} variance: ', df[column].var())

In [None]:
df[numcols].describe()

## Detecting and treating outliers

In [None]:
def outlier(column):
    
    # distribution
    sns.histplot(df[column])
    plt.show()
    
    # boxplot
    sns.boxplot(df[column])
    plt.show()
    
    print('max: ', df[column].max())
    print('min: ', df[column].min())

    print('Mean: ', df[column].mean())
    print()
    
    print('IQR STRATEGY ------------')
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1

    r_whisker = Q3 + 1.5*IQR
    l_whisker = Q1 - 1.5 * IQR
    outsIQR = df[(df[column] > r_whisker) | (df[column] < l_whisker)] 
    
    print('Q1: ', Q1)
    print('Q2 (median): ', df[column].median())
    print('Q3: ', Q3)

    print('Left whisker: ', l_whisker)
    print('Right whisker: ', r_whisker)
    
    # Quantile Strategy
    print()
    print('QUANTILE STRATEGY ------------')
    max_val = df[column].quantile(0.95)       
    min_val = df[column].quantile(0.05)           

    outsQ = df[(df[column] < min_val) | (df[column] > max_val)]
    
    print('95th percentile: ', max_val)
    print('5th percentile: ', min_val)

    ###########
    print( )
    print(f'(IQR Strategy) NUMBER OF OUTLIERS DETECTED IN THE {column} COLUMN: {len(outsIQR)}')
    print(f"(Quantile Strategy) NUMBER OF OUTLIERS DETECTED IN {column} COLUMN:", outsQ.shape[0])
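
The capping steps in the sections that follow all repeat one pattern: compute the IQR whiskers, then pull values lying outside them back to the nearest whisker. A minimal vectorized sketch of that pattern is given here. The helper names `iqr_fences` and `winsorize` are hypothetical, and the toy series is made up purely for illustration.

```python
import pandas as pd

def iqr_fences(s: pd.Series, k: float = 1.5):
    """Return the (lower, upper) whisker values Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def winsorize(s: pd.Series, lower=None, upper=None) -> pd.Series:
    """Cap values below `lower` or above `upper` at those bounds."""
    return s.clip(lower=lower, upper=upper)

# Toy data with one obvious outlier (assumption, not taken from the dataset).
toy = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)
low, high = iqr_fences(toy)
capped = winsorize(toy, low, high)

print(low, high)
print(capped.max())
```

In the notebook the same idea could be applied as `df['age'] = winsorize(df['age'], *iqr_fences(df['age']))`, although the cells below use hand-picked thresholds for some columns.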

## Age:

In [None]:
outlier('age')

In [None]:
# dealing with outliers: the ages 70 and 73 lie just above the right whisker (68.5), so we cap them at the whisker
df['age'] = df['age'].clip(upper = 68.5)

In [None]:
sns.boxplot(df.age) # all clean

## dep: number of people dependent on the customer 

In [None]:
outlier('dep')

## book_months: time on books in months 
How long a person has been a customer of the bank in months

In [None]:
outlier('book_months')

In [None]:
df.book_months.std()

In [None]:
print(df[df.book_months < 17.5].book_months.describe()) # outliers on the left
print(df[df.book_months > 53.5].book_months.describe()) # outliers on the right 

# we note that the outliers below the left whisker and above the right whisker differ from their whisker by at most 3.5 months,
# hence we choose to set them to their nearest whisker values (winsorizing)

In [None]:
df['book_months'] = df['book_months'].clip(lower = 17.5, upper = 53.5)
sns.boxplot(df.book_months)

## rel: total relationship count (total number of products held by the customer)

In [None]:
outlier('rel')

## inactive: Number of months the customer was inactive in the last 12 months 

In [None]:
outlier('inactive')

In [None]:
# cap the values at the whiskers (1 and 4)
df['inactive'] = df['inactive'].clip(lower = 1, upper = 4)

## Contacts: number of contacts the customer had with the bank in the last 12 months 

In [None]:
df.contacts.value_counts()

In [None]:
outlier('contacts')

In [None]:
# cap the values at the whiskers (1 and 4)
df['contacts'] = df['contacts'].clip(lower = 1, upper = 4)

## Credit Limit 

In [None]:
outlier('credit_limit')

In [None]:
df[(df.credit_limit < 1438.52) | (df.credit_limit > 34516.0)].credit_limit.describe()

In [None]:
# Replacing low outlier values with the 5th percentile value
df['credit_limit'] = df['credit_limit'].clip(lower = 1438.51)

## Revolving Balance: total revolving balance on the credit card

In credit card terms, a revolving balance is the portion of credit card spending that goes unpaid at the end of a billing cycle. The amount can vary, going up or down depending on the amount borrowed and the amount repaid. If you revolve a balance — that is, not pay it off at the end of the month — the lender will charge you for the privilege of borrowing their money. The amount of the charge for revolving a balance will depend on the size of the balance and the interest rate of the card. When the balance is paid off, the customer is no longer revolving the debt.
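
To make the terms concrete, here is a small worked example for a single hypothetical account. The figures are invented for illustration; the two formulas are the ones this notebook uses for the `buy` and `util` columns.

```python
# Hypothetical account, values chosen only for illustration.
credit_limit = 10_000.0   # total credit line on the card
rev_bal = 2_500.0         # unpaid balance carried over to the next billing cycle

open_to_buy = credit_limit - rev_bal   # credit still available to spend ('buy')
utilization = rev_bal / credit_limit   # share of the limit currently in use ('util')

print(open_to_buy)   # 7500.0
print(utilization)   # 0.25
```

Because both quantities are computed from `credit_limit` and `rev_bal`, strong correlations between these columns are to be expected.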

In [None]:
outlier('rev_bal') # no outliers 

## Buy: open to buy credit line in the last 12 months: credit_limit - rev_bal

In [None]:
outlier('buy')

In [None]:
df[df.buy > 22660.75].buy.describe() # outliers on the right, that is, points greater than the right whisker value

In [None]:
df[df.buy > 27709].buy.describe()

## amt_change: Change in Transaction Amount (Q4 over Q1) 

In [None]:
outlier('amt_change')

In [None]:
df[df.amt_change > 1.2].amt_change.describe()

In [None]:
df[df.amt_change > 1.55].amt_change.describe()

In [None]:
# dropping amt_change values beyond 1.55 
df.drop(index = df[df.amt_change > 1.55].index, inplace = True)
df.shape

In [None]:
df[df.amt_change < 0.289].amt_change.describe() 
# we note that these outlier values are not greatly different from the lower whisker, hence we set them to the lower whisker

In [None]:
# capping the remaining amt_change values: those above the right whisker (1.2, after dropping values beyond 1.55)
# are set to 1.2, and those below the left whisker (0.289) are set to 0.289
df['amt_change'] = df['amt_change'].clip(lower = 0.289, upper = 1.2)

## Trans_amt: total amount in transactions in the last 12 months

In [None]:
outlier('trans_amt')

In [None]:
df[df.trans_amt > 8619.25].trans_amt.describe()

In [None]:
df[df.trans_amt > 15000].trans_amt.describe() # to be dropped 

In [None]:
df.drop(index = df[df.trans_amt > 15000].index, inplace = True)

In [None]:
# capping the remaining trans_amt values above the right whisker (about 8620) at the whisker
df['trans_amt'] = df['trans_amt'].clip(upper = 8620)

## Trans_count: Number of transactions in the last 12 months

In [None]:
outlier('trans_count')

In [None]:
# cap transaction counts above 135
df['trans_count'] = df['trans_count'].clip(upper = 135)

## trans_count_change: change in transaction count Q4 over Q1

In [None]:
outlier('trans_count_change')

In [None]:
df[df.trans_count_change > 1.5].trans_count_change.describe()

In [None]:
df[df.trans_count_change < 0.23].trans_count_change.describe()

In [None]:
df.drop(index = df[df.trans_count_change > 1.5].index, inplace = True)

In [None]:
# cap the remaining values at 0.23 and 1.16
df['trans_count_change'] = df['trans_count_change'].clip(lower = 0.23, upper = 1.16)

## util: average card utilization ratio = rev_bal / credit_limit

In [None]:
outlier('util')

In [None]:
df.to_csv('df2_outs.csv', index = False)

---------------------------------------------

## Correlation Matrix 

In [None]:
plt.figure(figsize = (12, 7))
sns.heatmap(df[numcols].corr(), cmap = 'YlGnBu', annot = True)
plt.show()

## Multicollinearity:

(Multicollinearity is much less of a problem for tree-based models such as decision trees and random forests. For example, if two columns are identical, each split simply uses one of them and ignores the other, so the model still works well.)
- **We note strong correlations between:**
    - book_months and age 
    - buy and credit limit
    - trans_count and trans_amt 
    - util and rev_bal 
    
    
- **We also note the 2 derived features:**
    - 'buy' (open to buy credit line) = credit_limit - rev_bal
    - 'util' (Average card utilization ratio) = rev_bal / credit_limit
    
    
- Variance Inflation Factor (rule of thumb): 
    - 1 = no multicollinearity
    - around 5 = moderate multicollinearity
    - greater than 5 = severe multicollinearity
    
    
- We drop the age column because of its high correlation with book_months. Also, book_months (number of months the customer has been with the bank) is more relevant to our problem as well.
- We drop the credit_limit as well as the rev_bal columns, since they are significantly correlated to the buy and util columns respectively, also, the 2 latter columns have been derived from the 2 former columns.
- We also drop the trans_count column owing to its strong correlation with the trans_amt column 
- **Dropping correlated features is unnecessary for tree-based algorithms; however, removing redundant features can reduce training time.**
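
The variance inflation factor mentioned in the list above is not actually computed in this notebook, although `variance_inflation_factor` is imported from statsmodels. As a rough sketch, the VIF of each column can be read off the diagonal of the inverse correlation matrix, which should agree with the regression-based definition 1 / (1 - R²) when an intercept is included. The helper name `vif_table` and the toy data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def vif_table(frame: pd.DataFrame) -> pd.Series:
    """VIF per column, taken from the diagonal of the inverse correlation matrix."""
    corr = frame.corr().to_numpy()
    vif = np.diag(np.linalg.inv(corr))
    return pd.Series(vif, index=frame.columns, name="VIF").sort_values(ascending=False)

# Toy data: 'c' is almost a linear combination of 'a' and 'b'; 'd' is unrelated.
rng = np.random.default_rng(0)
a = rng.normal(size=1000)
b = rng.normal(size=1000)
toy = pd.DataFrame({
    "a": a,
    "b": b,
    "c": a + b + rng.normal(scale=0.1, size=1000),
    "d": rng.normal(size=1000),
})

vif = vif_table(toy)
print(vif)
```

On the notebook's data this could be called as `vif_table(df[numcols])`, before and after dropping the correlated columns, to check whether the drops bring the values down.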

**Note:** Tree based algorithms are not affected by the variance (scale) of the features and hence do not necessarily require feature scaling

In [None]:
df = pd.read_csv('df2_outs.csv')
print(df.shape)
df.head()

In [None]:
tree_df = df.drop(['att', 'age', 'gender', 'edu', 'marital', 'income', 'card', 'credit_limit', 'rev_bal', 'trans_count'], axis = 1)
tree_df.head()
# 4 strongly correlated columns and original categorical variables deleted 
# strongly correlated columns are removed to avoid over fitting 

In [None]:
X = tree_df.drop('att_l', axis = 1)
y = tree_df['att_l']
X.shape, y.shape

---------------------------------------------

## Feature Selection

**Three benefits of performing feature selection before modeling your data are:**
- Reduces Overfitting: Less redundant data means less opportunity to make decisions based on noise.
- Improves Accuracy: Less misleading data means modeling accuracy improves.
- Reduces Training Time: Less data means that algorithms train faster.

### Selecting the most relevant features using the chi2 test for categorical data and the ANOVA F-test (f_classif()) for continuous data


- **Hypotheses: Ho - variable has no impact, H1 - variable has significant impact**


- **categorical variables**: gender_l, edu_l, marital_l, income_l, card_l
- **numerical/continuous variable**: book_months, dep, inactive, rel, contacts, buy, amt_change, trans_amt, trans_count_change, util

In [None]:
# feature selection using chi2 for categorical variables
chi_sel = chi2(X[['gender_l', 'edu_l', 'marital_l', 'income_l', 'card_l']], y)
chi_sel
chi_stats = chi_sel[0]
p_values = chi_sel[1]
d = {'chi_stats': list(chi_stats), 'p_values': list(p_values)}
cp = pd.DataFrame(data = d, index = ['gender_l', 'edu_l', 'marital_l', 'income_l', 'card_l'])
cp = cp.sort_values(by = 'p_values', ascending = True)
cp

### At the 5% level of significance (95% confidence) we note that 'gender_l' is the only column that has a significant effect on attrition rate
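
One caveat: scikit-learn's `chi2` treats feature values as non-negative counts or frequencies, so when it is applied to integer label codes the statistic depends on how the categories happen to be numbered. A possible cross-check is Pearson's chi-square test of independence on the contingency table of a category against the target, using `scipy.stats.chi2_contingency`. The toy data below is invented for illustration; in the notebook the inputs would be columns such as `df['gender']` and `df['att']`.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy data (assumption): 60 'F' and 40 'M' customers with different attrition rates.
toy = pd.DataFrame({
    "gender": ["F"] * 60 + ["M"] * 40,
    "att": ["left"] * 30 + ["stayed"] * 30 + ["left"] * 5 + ["stayed"] * 35,
})

# Observed counts for every (category, class) cell.
table = pd.crosstab(toy["gender"], toy["att"])

# H0: the two variables are independent.
stat, p_value, dof, expected = chi2_contingency(table)

print(table)
print(f"chi2={stat:.2f}, dof={dof}, p={p_value:.4f}")
```

A p-value below 0.05 would lead to rejecting independence at the 5% level. Unlike the label-code version, this result does not change if the categories are renumbered.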

In [None]:
# feature selection using f_classif for continuous/numerical data 
f_sel = f_classif(X[['book_months', 'dep', 'inactive', 'rel', 'contacts', 'buy', 'amt_change', 'trans_amt', 'trans_count_change', 'util']], y)
f_values = f_sel[0]
p_values = f_sel[1]
d = {'f_values': list(f_values), 'p_values': list(p_values)}
fp = pd.DataFrame(data = d, index = ['book_months', 'dep', 'inactive', 'rel', 'contacts', 'buy', 'amt_change', 'trans_amt', 'trans_count_change', 'util'])
fp = fp.sort_values(by = 'p_values', ascending = True)
fp

### At the 5% level of significance (95% confidence) we note that the columns 'dep', 'book_months' and 'buy' do not have a significant effect on attrition.
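
For intuition, `f_classif` runs a one-way ANOVA F-test for each feature, comparing the feature's values across the target classes. The sketch below uses synthetic data (an assumption, not the notebook's data) to show that it matches `scipy.stats.f_oneway` applied feature by feature.

```python
import numpy as np
from scipy.stats import f_oneway
from sklearn.feature_selection import f_classif

rng = np.random.default_rng(0)
y_toy = np.array([0] * 200 + [1] * 200)

# One feature whose mean differs between the classes, one that is pure noise.
x_informative = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(1.0, 1.0, 200)])
x_noise = rng.normal(0.0, 1.0, 400)
X_toy = np.column_stack([x_informative, x_noise])

f_values, p_values = f_classif(X_toy, y_toy)

# The same F statistic, computed per feature by splitting the samples by class.
f_manual = [f_oneway(col[y_toy == 0], col[y_toy == 1]).statistic for col in X_toy.T]

print(f_values, p_values)
print(f_manual)
```

A large F value with a small p-value indicates that the class means of that feature differ by more than within-class noise would explain.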

In [None]:
# Hence, we remove the variables found above to have a negligible effect on attrition (the dependent variable)
X_dropped = X.drop(['marital_l', 'card_l', 'edu_l', 'income_l', 'dep', 'buy', 'book_months'], axis = 1)
print(X_dropped.shape)
X_dropped.head()

---------------------------------------------

## Dealing with imbalance in our dependent variable using Synthetic Minority Oversampling Technique (SMOTE) -- data augmentation
- Models built using imbalanced datasets can result in poor performance and bias in training owing to imbalance in the distribution of dependent variable values.
- One of the simplest approaches to dealing with such data is to 'oversample the minority class'. In our case, the minority class consists of the customers who choose to leave the bank.
- The SMOTE technique creates new records for the minority class derived from existing records, hence this step does not add any new information to the data.

*SMOTE first selects a minority class instance a at random and finds its k nearest minority class neighbors. The synthetic instance is then created by choosing one of the k nearest neighbors b at random and connecting a and b to form a line segment in the feature space. The synthetic instances are generated as a convex combination of the two chosen instances a and b.* -- Page 47, Imbalanced Learning: Foundations, Algorithms, and Applications, 2013.
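
The interpolation step described in the quotation can be sketched in a few lines of NumPy. This is only an illustration of the idea on made-up points and is not the imbalanced-learn implementation; all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy minority-class points in a 2-D feature space (made up for illustration).
minority = np.array([[1.0, 1.0], [1.5, 1.2], [0.8, 1.6], [3.0, 3.0], [1.2, 0.9]])
k = 2

a_idx = rng.integers(len(minority))          # pick a minority instance a at random
a = minority[a_idx]

dist = np.linalg.norm(minority - a, axis=1)  # distances from a to every minority point
neighbours = np.argsort(dist)[1:k + 1]       # k nearest neighbours, skipping a itself
b = minority[rng.choice(neighbours)]         # pick one neighbour b at random

lam = rng.random()                           # position along the segment, in [0, 1)
synthetic = a + lam * (b - a)                # convex combination of a and b

print(a, b, synthetic)
```

The synthetic point always lies on the line segment between `a` and `b`, which is why SMOTE fills in the region already occupied by the minority class rather than adding new information.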

In [None]:
y.value_counts() # imbalanced target variable

In [None]:
from imblearn.over_sampling import SMOTE

sm = SMOTE(random_state = 2)
X_dropped_sm, y_sm = sm.fit_resample(X_dropped, y.ravel())
  
print('Before oversampling X_dropped shape: ', X_dropped.shape)
print('After OverSampling, the shape of X_dropped: {}'.format(X_dropped_sm.shape))
print('After OverSampling, the shape of y: {} \n'.format(y_sm.shape))
  
print("After OverSampling, counts of label '1': {}".format(sum(y_sm == 1)))
print("After OverSampling, counts of label '0': {}".format(sum(y_sm == 0)))

# new synthetic data records generated for att: 0 = 6443
# updated total number of records: 9691 + 6443 = 16,134  

In [None]:
X_dropped_sm.shape, y_sm.shape

In [None]:
plt.figure(figsize = (14, 5))

plt.subplot(1, 2, 1)
sns.scatterplot(data = X_dropped, x = 'util', y = 'trans_count_change', hue = y)
plt.title('Before Balancing')

plt.subplot(1, 2, 2)
sns.scatterplot(data = X_dropped_sm, x = 'util', y = 'trans_count_change', hue = y_sm)
plt.title('After balancing')
plt.show() # a single show() keeps both panels in one figure

In [None]:
X_dropped_sm.shape

## Train and Test Splits

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_dropped_sm, y_sm, test_size = 0.2, random_state = 1, shuffle = True)

In [None]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

---------------------------------------------

# MODELS: (Tree based algorithms: Decision Tree, Random Forest and Gradient Boosting)

## Decision Tree 1

In [None]:
dt = DecisionTreeClassifier()
scores = cross_val_score(dt, X_train, y_train, cv = 10)
scores 

In [None]:
dt1 = dt.fit(X_train,  y_train)

In [None]:
y_pred_dt1 = dt1.predict(X_test)
print('Roc Auc Score: ', roc_auc_score(y_test, y_pred_dt1))

print('Confusion Matrix: ')
print(confusion_matrix(y_test, y_pred_dt1))

print('Classification Report: ')
print(classification_report(y_test, y_pred_dt1))

print('Accuracy: ', accuracy_score(y_test, y_pred_dt1))
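
Note that `roc_auc_score` is given hard class labels in the cell above, which reduces the ROC curve to a single operating point. Passing the predicted probability of the positive class instead lets the score reflect how well the model ranks customers. A minimal sketch on synthetic data follows; the helper name `evaluate` is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def evaluate(model, X_eval, y_eval):
    """Return accuracy plus ROC AUC computed from labels and from probabilities."""
    labels = model.predict(X_eval)
    scores = model.predict_proba(X_eval)[:, 1]   # probability of the class encoded as 1
    return {
        "accuracy": accuracy_score(y_eval, labels),
        "roc_auc_labels": roc_auc_score(y_eval, labels),
        "roc_auc_proba": roc_auc_score(y_eval, scores),
    }

# Synthetic stand-in for the notebook's train and test sets.
X_toy, y_toy = make_classification(n_samples=1000, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

model = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_tr, y_tr)
report = evaluate(model, X_te, y_te)
print(report)
```

The same helper could be reused for the random forest, gradient boosting and KNN models below, since all of them expose `predict_proba`.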

In [None]:
text_dt1 = tree.export_text(dt1)
print(text_dt1)

In [None]:
fig = plt.figure(figsize=(25,20))
_ = tree.plot_tree(dt1, 
                   feature_names=list(X_dropped_sm.columns),  # a plain list works across scikit-learn versions
                   filled=True)

### Saving and Loading Models 

In [None]:
import pickle
# Save the trained model as a pickle string.
save_dt1 = pickle.dumps(dt1)

# Load the pickled model
load_dt1 = pickle.loads(save_dt1)
  
# Use the loaded pickled model to make predictions
load_dt1.predict(X_test)

result = load_dt1.score(X_test, y_test)
print(result)

--------------------------------------------------------------

## Decision tree 2 with Grid Search CV

In [None]:
dt1.get_params()

In [None]:
param = {"max_features": [1,2,4,6], 
    "max_depth": [1, 3, 5, 9] }

In [None]:
dec = DecisionTreeClassifier()
gs = GridSearchCV(dec, param, cv = 5) # cv: cross validation
gs.fit(X_train, np.ravel(y_train, order = 'C')) 

In [None]:
y_pred = gs.predict(X_test)

print('Roc Auc Score: ', roc_auc_score(y_test, y_pred))

print('Confusion Matrix: ')
print(confusion_matrix(y_test, y_pred))

print('Classification Report: ')
print(classification_report(y_test, y_pred))

print('Accuracy: ', accuracy_score(y_test, y_pred))

In [None]:
def display(results):
    print(f'Best parameters are: {results.best_params_}')
    print("\n")
    mean_score = results.cv_results_['mean_test_score']
    std_score = results.cv_results_['std_test_score']
    params = results.cv_results_['params']
    
    for mean, std, params in zip(mean_score, std_score, params):
        print(f'{round(mean, 3)} + or - {round(std, 3)} for the {params}')
        
display(gs)
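
As an alternative to printing every combination and then retyping the best parameters by hand, the search results can be loaded into a DataFrame and the refitted winner taken from `best_estimator_`. The sketch below uses synthetic data and a smaller grid than the notebook's, both chosen only for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X_train and y_train.
X_toy, y_toy = make_classification(n_samples=500, n_features=8, random_state=0)

search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      {"max_depth": [1, 3, 5], "max_features": [2, 4]},
                      cv=5)
search.fit(X_toy, y_toy)

# One row per parameter combination, best mean CV score first.
results = (pd.DataFrame(search.cv_results_)
           [["params", "mean_test_score", "std_test_score", "rank_test_score"]]
           .sort_values("rank_test_score"))
print(results.head())

# With the default refit=True, the winning model is already refitted on all data passed to fit().
best_tree = search.best_estimator_
print(search.best_params_, best_tree.get_depth())
```

Using `best_estimator_` avoids a mismatch between the parameters reported by the search and the ones copied into a new classifier.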

In [None]:
dtt = DecisionTreeClassifier(max_depth = 9, max_features = 6)
scores = cross_val_score(dtt, X_train, y_train, cv = 10)
scores.mean()

In [None]:
dtt.fit(X_train, y_train)

In [None]:
y_pred_dtt = dtt.predict(X_test)
print('Roc Auc Score: ', roc_auc_score(y_test, y_pred_dtt))

print('Confusion Matrix: ')
print(confusion_matrix(y_test, y_pred_dtt))

print('Classification Report: ')
print(classification_report(y_test, y_pred_dtt))

print('Accuracy: ', accuracy_score(y_test, y_pred_dtt))

----------------------

## Random Forest 1 Basic

In [None]:
rf = RandomForestClassifier()
scores = cross_val_score(rf, X_train, y_train, cv = 10)
scores 

In [None]:
rf.fit(X_train, y_train)

In [None]:
y_pred_rf0 = rf.predict(X_test)

print('Roc Auc Score: ', roc_auc_score(y_test, y_pred_rf0))

print('Confusion Matrix: ')
print(confusion_matrix(y_test, y_pred_rf0))

print('Classification Report: ')
print(classification_report(y_test, y_pred_rf0))

print('Accuracy: ', accuracy_score(y_test, y_pred_rf0))

In [None]:
save_rf0 = pickle.dumps(rf)
# load_ref1 = pickle.loads(save_rf1)

In [None]:
rf.get_params()

## Random Forest 2 with Grid Search CV

In [None]:
RF = RandomForestClassifier()

In [None]:
# Grid search cv
param = {
    "n_estimators": [50, 100, 150, 200], # number of trees in the forest 
    'max_features': ['sqrt', 'log2'], # max features to consider when looking for the best split ('auto' was an alias of 'sqrt' for classifiers and is rejected by recent scikit-learn releases)
    "max_depth": [1, 3, 5, 7, 9], # The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves 
                                  # contain less than min_samples_split samples.
    "criterion": ['gini', 'entropy']
}

gscv = GridSearchCV(RF, param, cv = 5)
gscv.fit(X_train, np.ravel(y_train, order = 'C'))

# display() is already defined in the Decision Tree grid search section above
display(gscv)

In [None]:
print('Best Score: ', gscv.best_score_)
print('Best params: ', gscv.best_params_)

In [None]:
y_pred_gscv = gscv.predict(X_test)

print('Roc Auc Score: ', roc_auc_score(y_test, y_pred_gscv))

print('Confusion Matrix: ')
print(confusion_matrix(y_test, y_pred_gscv))

print('Classification Report: ')
print(classification_report(y_test, y_pred_gscv))

print('Accuracy: ', accuracy_score(y_test, y_pred_gscv))

In [None]:
save_rf1 = pickle.dumps(gscv) # NOTE rf1: gscv
# load_rf2 = pickle.loads(save_rf2)

------------------------------------------

## Gradient Boosting Classifier 1 Basic 

In [None]:
gbc = GradientBoostingClassifier()
scores = cross_val_score(gbc, X_train, y_train, cv = 10)
scores 

In [None]:
gbc0 = gbc.fit(X_train, np.ravel(y_train)) # np.ravel(y_train, order = 'C') 

In [None]:
y_pred_gbc0 = gbc0.predict(X_test)

print('Roc Auc Score: ', roc_auc_score(y_test, y_pred_gbc0))

print('Confusion Matrix: ')
print(confusion_matrix(y_test, y_pred_gbc0))

print('Classification Report: ')
print(classification_report(y_test, y_pred_gbc0))

print('Accuracy: ', accuracy_score(y_test, y_pred_gbc0))

In [None]:
save_gbc0 = pickle.dumps(gbc0)

## Gradient Boosting Classifier 2 Hyper parameter tuning with grid search cv 

In [None]:
gbc.get_params()

In [None]:
# Grid search cv
gbc1 = GradientBoostingClassifier()
param = {
    "n_estimators": [50, 100, 150, 200], # number of boosting stages
    'max_features': ['sqrt', 'log2'], # max features to consider when looking for the best split ('auto' was an alias of 'sqrt' for classifiers and is rejected by recent scikit-learn releases)
    "max_depth": [1, 3, 5, 7, 9], # The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves 
                                  # contain less than min_samples_split samples.
    "learning_rate": [0.01, 0.1, 1, 10, 100]
}

gs_gb = GridSearchCV(gbc1, param, cv = 5)
gs_gb.fit(X_train, np.ravel(y_train, order = 'C'))

# display() is already defined in the Decision Tree grid search section above
display(gs_gb)

# best accuracy: 95.4

In [None]:
gbc1 = GradientBoostingClassifier(learning_rate= 1, max_depth = 9, max_features = 'sqrt', n_estimators = 200) # 'sqrt' is what 'auto' meant for classifiers
gbc1.fit(X_train, np.ravel(y_train)) # np.ravel(y_train, order = 'C') 

In [None]:
y_pred_gbc1 = gbc1.predict(X_test)

print('Roc Auc Score: ', roc_auc_score(y_test, y_pred_gbc1))

print('Confusion Matrix: ')
print(confusion_matrix(y_test, y_pred_gbc1))

print('Classification Report: ')
print(classification_report(y_test, y_pred_gbc1))

print('Accuracy: ', accuracy_score(y_test, y_pred_gbc1))

-------------------------------------

## MODELS: KNN
KNN is distance based, so it needs feature scaling, dummy variables for nominal features and outlier management

#### KNN 1: Basic model without feature scaling and one hot encoding

In [None]:
knn = KNeighborsClassifier()
scores = cross_val_score(knn, X_train, y_train, cv = 10)
scores

In [None]:
knn1 = knn.fit(X_train, y_train)

In [None]:
y_pred_knn1 = knn1.predict(X_test)
print('Roc Auc Score: ', roc_auc_score(y_test, y_pred_knn1))

print('Confusion Matrix: ')
print(confusion_matrix(y_test, y_pred_knn1))

print('Classification Report: ')
print(classification_report(y_test, y_pred_knn1))

print('Accuracy: ', accuracy_score(y_test, y_pred_knn1))

### Dummy variables for nominal features 

In [None]:
X_train_en = pd.concat([X_train, pd.get_dummies(X_train['gender_l'], prefix = 'gender')], axis = 1)
X_train_en.drop('gender_l', axis = 1, inplace = True)
X_train_en

In [None]:
X_test_en = pd.concat([X_test, pd.get_dummies(X_test['gender_l'], prefix = 'gender')], axis = 1)
X_test_en.drop('gender_l', axis = 1, inplace = True)
X_test_en.head()

### Feature Scaling 

In [None]:
X_train_en.head()

In [None]:
X_train.columns

### The data is too large for the Shapiro-Wilk test, so we use D’Agostino’s K^2 test

In [None]:
from scipy.stats import normaltest 
for column in ['rel', 'inactive', 'contacts', 'amt_change', 'trans_amt','trans_count_change', 'util']:
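    stat, p = normaltest(X_train[column])
    print(f'{column}: statistic = {stat:.3f}, p-value = {p:.3g}')

    # Ho: the sample comes from a Gaussian distribution
    # p value < 0.05 (level of significance) => reject Ho, the data is not Gaussian
    # If Gaussian, go for the standard scaler; otherwise choose between the standard and min-max scalers

    if p > 0.05: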
        print('Sample seems Gaussian (fail to reject Ho)')
    else:
        print('Sample does not seem Gaussian (reject Ho)')

In [None]:
sc = StandardScaler()
mm = MinMaxScaler()
for column in ['rel', 'inactive', 'contacts', 'amt_change', 'trans_amt','trans_count_change', 'util']:
    col_sc = sc.fit((np.asarray(X_train_en[column]).reshape(-1, 1)))# note for scaling we need array and we need to reshape
    X_train_en[f'{column}_sc'] = col_sc.transform(X_train_en[column].values.reshape(-1, 1))
    X_test_en[f'{column}_sc'] = col_sc.transform(X_test_en[column].values.reshape(-1, 1))
    
    col_mm = mm.fit((np.asarray(X_train_en[column]).reshape(-1, 1)))
    X_train_en[f'{column}_mm'] = col_mm.transform(X_train_en[column].values.reshape(-1, 1))
    X_test_en[f'{column}_mm'] = col_mm.transform(X_test_en[column].values.reshape(-1, 1))

In [None]:
X_train = X_train_en[['gender_0', 'gender_1', 'rel_sc',
       'inactive_sc', 'contacts_sc', 'amt_change_sc', 'trans_amt_sc',
       'trans_count_change_sc', 'util_sc']]
X_test = X_test_en[['gender_0', 'gender_1', 'rel_sc',
       'inactive_sc', 'contacts_sc', 'amt_change_sc', 'trans_amt_sc',
       'trans_count_change_sc', 'util_sc']]

### KNN 2: includes dummy variables and standard scaler 

In [None]:
knn = KNeighborsClassifier()
scores = cross_val_score(knn, X_train, y_train, cv = 10)
scores 

In [None]:
knn2 = knn.fit(X_train, y_train)

In [None]:
y_pred_knn2 = knn2.predict(X_test)
print('Roc Auc Score: ', roc_auc_score(y_test, y_pred_knn2))

print('Confusion Matrix: ')
print(confusion_matrix(y_test, y_pred_knn2))

print('Classification Report: ')
print(classification_report(y_test, y_pred_knn2))

print('Accuracy: ', accuracy_score(y_test, y_pred_knn2))

### KNN 3: with MinMaxScaler

In [None]:
X_train = X_train_en[['gender_0', 'gender_1', 'rel_mm',
       'inactive_mm', 'contacts_mm', 'amt_change_mm', 'trans_amt_mm',
       'trans_count_change_mm', 'util_mm']]
X_test = X_test_en[['gender_0', 'gender_1', 'rel_mm',
       'inactive_mm', 'contacts_mm', 'amt_change_mm', 'trans_amt_mm',
       'trans_count_change_mm', 'util_mm']]

In [None]:
knn3 = KNeighborsClassifier().fit(X_train, y_train) # fresh estimator, so knn2 is not refitted in place

In [None]:
y_pred_knn3 = knn3.predict(X_test)
print('Roc Auc Score: ', roc_auc_score(y_test, y_pred_knn3))

print('Confusion Matrix: ')
print(confusion_matrix(y_test, y_pred_knn3))

print('Classification Report: ')
print(classification_report(y_test, y_pred_knn3))

print('Accuracy: ', accuracy_score(y_test, y_pred_knn3))

**The KNN model performs equally well with both scalers. One-hot encoding and feature scaling, however, improve the accuracy from approximately 86% to 89%.**
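
One way to compare the scalers without fitting them on the full training set is to put the scaler and the classifier in a pipeline, so that the scaler is re-fitted inside every cross-validation fold. The following is a minimal sketch, assuming synthetic data from `make_classification` as a stand-in for the churn features; the helper `mean_cv_accuracy` and the names `X_demo` and `y_demo` are hypothetical and not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic stand-in for the churn features (assumption: not the bank data).
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)
# Put one feature on a much larger scale, as happens with transaction amounts.
X_demo[:, 0] *= 1000


def mean_cv_accuracy(scaler):
    """Mean 5-fold accuracy of KNN, with the scaler fitted inside each fold."""
    steps = [KNeighborsClassifier()] if scaler is None else [scaler, KNeighborsClassifier()]
    model = make_pipeline(*steps)
    return cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy").mean()


scores_by_scaler = {
    "none": mean_cv_accuracy(None),
    "standard": mean_cv_accuracy(StandardScaler()),
    "minmax": mean_cv_accuracy(MinMaxScaler()),
}
for name, score in scores_by_scaler.items():
    print(f"{name:>8}: {score:.3f}")
```

Because the scaler lives inside the pipeline, each validation fold is transformed with statistics learned from its own training folds only.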

### KNN 4: GridSearch CV to find the best value of K

In [None]:
X_train.shape

In [None]:
# We try the model with a common rule-of-thumb value of k: the square root of the number of training samples (~113)
knn_ = KNeighborsClassifier(n_neighbors = 113)
knn40 = knn_.fit(X_train, y_train)

In [None]:
y_pred_knn40 = knn40.predict(X_test)

In [None]:
result = confusion_matrix(y_test, y_pred_knn40)
print(result)
result2 = accuracy_score(y_test, y_pred_knn40)
print(result2)

In [None]:
print(roc_auc_score(y_test, y_pred_knn40))

In [None]:
knn.get_params()

In [None]:
# grid search to find the best value of k
parameters = {'n_neighbors' : [5, 10, 50, 100]}
grid = GridSearchCV(knn, parameters, cv=10, scoring='accuracy')
kgscv = grid.fit(X_train, y_train)

In [None]:
grid.best_params_

In [None]:
grid.best_score_

In [None]:
knn = KNeighborsClassifier(n_neighbors=5)
print(cross_val_score(knn, X_train, y_train, cv=10, scoring='accuracy').mean())
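
Since recall is the metric of most interest for churn, the same kind of search can be scored on recall and run over a finer range of odd k values. The following is a minimal sketch, assuming synthetic data and a hypothetical `Pipeline` with step names `scale` and `knn`; it does not use the notebook's fitted objects.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the churn features (assumption: not the bank data).
X_demo, y_demo = make_classification(n_samples=600, n_features=8, random_state=0)

pipeline = Pipeline([
    ("scale", StandardScaler()),      # re-fitted on the training part of every fold
    ("knn", KNeighborsClassifier()),
])

# Odd values of k from 1 to 31 avoid tied votes in a binary problem.
param_grid = {"knn__n_neighbors": list(range(1, 32, 2))}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="recall")
search.fit(X_demo, y_demo)

best_k = search.best_params_["knn__n_neighbors"]
best_recall = search.best_score_
print(f"best k = {best_k}, mean CV recall = {best_recall:.3f}")
```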

## Model Evaluation, Comparison and Conclusion

In [None]:
# considering the best version of the models 

comparison = pd.DataFrame()
# comparison.set_index(['Decision Tree', 'Random Forest', 'Gradient Boosting', 'K Nearest Neighbors'])
c = 0

for y_pred in [y_pred_dt1, y_pred_rf0, y_pred_gbc1, y_pred_knn2]:
    
#     y_pred = model.predict(X_test)
    
    roc_auc = roc_auc_score(y_test, y_pred)
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)

    acc = accuracy_score(y_test, y_pred)
    
    comparison.at[c, 'roc_auc'] = roc_auc
    comparison.at[c, 'tn'] = tn
    comparison.at[c, 'fp'] = fp
    comparison.at[c, 'fn'] = fn
    comparison.at[c, 'tp'] = tp
    comparison.at[c, 'precision'] = precision
    comparison.at[c, 'recall'] = recall
    comparison.at[c, 'accuracy'] = acc
    
    c = c+1 

comparison['models'] = ['Decision Tree', 'Random Forest', 'Gradient Boost Classifier', 'KNN']
comparison.set_index('models', inplace = True)

fpr_dt, tpr_dt, thresholds_dt = roc_curve(y_test, y_pred_dt1)
fpr_rf, tpr_rf, thresholds_rf = roc_curve(y_test, y_pred_rf0)
fpr_gbc, tpr_gbc, thresholds_gbc = roc_curve(y_test, y_pred_gbc1)
fpr_knn, tpr_knn, thresholds_knn = roc_curve(y_test, y_pred_knn2)



# each label reports the AUC of the model it belongs to
plt.plot(fpr_dt, tpr_dt, label = 'Decision Tree (AUC = %.3f)' % roc_auc_score(y_test, y_pred_dt1))
plt.plot(fpr_rf, tpr_rf, label = 'Random Forest (AUC = %.3f)' % roc_auc_score(y_test, y_pred_rf0))
plt.plot(fpr_gbc, tpr_gbc, label = 'Gradient Boost Classifier (AUC = %.3f)' % roc_auc_score(y_test, y_pred_gbc1))
plt.plot(fpr_knn, tpr_knn, label = 'KNN (AUC = %.3f)' % roc_auc_score(y_test, y_pred_knn2))
plt.plot([0,1], [0,1], 'r--')
plt.legend(loc = 'lower right')  
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.title('ROC Curve Comparison')


comparison.transpose().loc[['tn', 'fp', 'fn', 'tp']].plot(kind = 'bar')


comparison.transpose().loc[['roc_auc', 'precision', 'recall', 'accuracy']].plot(kind = 'bar')
plt.legend(loc = 'lower right')
plt.show()


comparison

### The ROC (Receiver Operating Characteristic) Curve 
- The curve is obtained by plotting the true positive rate (TPR) on the y-axis against the false positive rate (FPR) on the x-axis.
- It shows how well a binary classifier performs across different decision thresholds and lets us compare the performance of various classification models against each other.
- Hence, the curve is a good way of comparing classifiers and eventually choosing the best one for our binary classification problem of predicting customer attrition behaviour.
- The larger the area under the curve (AUC), the better the model separates the two classes.

![ROC](http://en.wikipedia.org/wiki/Receiver_operating_characteristic#/media/File:Roc-draft-xkcd-style.svg)
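
The threshold behaviour described above only shows up when the curve is built from continuous scores. The following is a minimal sketch, assuming a synthetic dataset from `make_classification` and a default `GradientBoostingClassifier` (neither is the notebook's data or tuned model). It compares an ROC curve built from hard `predict` labels with one built from `predict_proba` scores.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the churn data (assumption: not the bank data).
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

clf = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

hard_labels = clf.predict(X_te)           # 0/1 predictions at the default threshold
scores = clf.predict_proba(X_te)[:, 1]    # estimated probability of the positive class

fpr_hard, tpr_hard, _ = roc_curve(y_te, hard_labels)
fpr_soft, tpr_soft, _ = roc_curve(y_te, scores)

auc_hard = roc_auc_score(y_te, hard_labels)
auc_soft = roc_auc_score(y_te, scores)

print(f"points on curve: hard = {len(fpr_hard)}, scores = {len(fpr_soft)}")
print(f"AUC: hard = {auc_hard:.3f}, scores = {auc_soft:.3f}")
```

With hard labels the curve has a single operating point between (0, 0) and (1, 1), whereas probability scores trace the trade-off across many thresholds.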

## Insights and Conclusions

### Confusion Matrix, Type 1 and Type 2 Errors
- The confusion matrix records the number of true positives, false positives (type 1 errors), true negatives and false negatives (type 2 errors), and hence gives a clear indication of how well the model performs.
- Type 1 errors are often considered the more serious ones. In our case, however, a false negative, that is, incorrectly predicting that a customer will not leave the bank, is the more serious error, because it fails to warn the bank and keeps it from pursuing preventive measures.
- Type 1 errors have their downside too: a false positive means incorrectly predicting that a customer will leave the bank. The bank would then look into the customer and pursue possibly unnecessary preventive measures, which costs time and resources.
- **Precision** = tp / (tp + fp) [the higher the value, the lower the number of false positives]
- **Recall** = tp / (tp + fn) [the higher the value, the lower the number of false negatives]
- **Since we seek to minimise the number of false negatives, we aim for a high recall value. Amongst our models, the random forest appears to have the highest recall value, although its precision is even higher.**
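
The formulas above can be checked on a small hand-made example. The following is a minimal sketch with illustrative labels only, assuming that 1 marks a customer who left and 0 a customer who stayed (the notebook's actual encoding may differ).

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy labels for ten customers (illustrative, not taken from the dataset).
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

precision = tp / (tp + fp)   # 2 / (2 + 1)
recall = tp / (tp + fn)      # 2 / (2 + 2)

print(f"tn={tn}, fp={fp}, fn={fn}, tp={tp}")
print(f"precision={precision:.3f}, recall={recall:.3f}")

# The manual values agree with scikit-learn's helpers.
assert abs(precision - precision_score(y_true_demo, y_pred_demo)) < 1e-12
assert abs(recall - recall_score(y_true_demo, y_pred_demo)) < 1e-12
```

Here two of the four leaving customers are missed, so recall (0.5) is lower than precision (about 0.667), which is the kind of gap the bank would want to close.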

### Accuracy = (tp+tn)/(tp+tn+fp+fn)
- Since we have balanced the dataset, accuracy is a reasonable evaluation metric. The closer the accuracy value is to 1, the better the model performs.
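
To see why the class balance matters for accuracy, consider a classifier that always predicts the majority class. The following is a minimal sketch on made-up labels (a 90/10 split chosen purely for illustration, not the dataset's actual class ratio).

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# Illustrative labels only: 90 customers who stay (0) and 10 who leave (1).
y_true_demo = np.array([0] * 90 + [1] * 10)
# A "model" that always predicts the majority class.
y_pred_demo = np.zeros_like(y_true_demo)

acc = accuracy_score(y_true_demo, y_pred_demo)
rec = recall_score(y_true_demo, y_pred_demo)
bal_acc = balanced_accuracy_score(y_true_demo, y_pred_demo)

print(f"accuracy={acc:.2f}, recall={rec:.2f}, balanced accuracy={bal_acc:.2f}")
```

On imbalanced labels the accuracy is high (0.90) even though no leaving customer is detected (recall 0.00). On a balanced dataset the same trivial strategy would score only about 0.5.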

### Model insights:
- **Tree Based Models:** 
    - Decision trees and random forests can be used for both regression and classification tasks. Since our problem consists of predicting one of two possible outcomes, namely whether the customer leaves or stays, ours is a binary classification problem.
    - A well-known advantage of tree-based algorithms is their robustness to outliers and to differences in feature scale. Hence these algorithms do not necessarily need outlier treatment or feature scaling. **Since our data does have a significant number of outliers, tree-based algorithms let us avoid the time-consuming outlier handling process.**
    - Our data consists of both continuous and categorical features. Trees can, in principle, handle qualitative variables without dummy variables, although scikit-learn's tree implementations expect numerically encoded inputs.
    - **Since random forests and gradient boosting are both ensemble methods that combine multiple decision trees, they reduce overfitting and achieve greater accuracy than a single tree, as is seen in our case.**
    
    
- **K Nearest Neighbour Classifier:**
    - The algorithm computes the distances between an input point and the training points and uses the K closest training points to classify the input point.
    - The algorithm can use different distance measures, such as the Euclidean distance or the Manhattan distance. **However, it can be computationally expensive when dealing with large datasets, as in our case.** The larger the dataset, the greater the time and cost of making predictions, which degrades its performance.
    - **Since the algorithm computes distances between points, feature scaling is necessary and dimensionality reduction helps. In addition, the KNN classifier is highly sensitive to outliers and missing values, which makes these preprocessing steps essential.**
    - The most important step is identifying the best value of K. This requires trial-and-error runs or a hyperparameter tuning technique such as grid search with cross-validation, which further adds to the heavy computation involved.
    - All the above factors may have contributed to its less impressive performance in our use case. 
    - **We learn that KNN might not be the best algorithm for data that is large in size or dimension, or that has a significant number of missing values and/or outliers (as in our case).**
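
The dependence of KNN on distances, and hence on feature scale, can be illustrated with a few hand-picked points. The following is a minimal sketch in plain NumPy. The four "customers" and the query point are hypothetical values for a credit limit and a utilization ratio, and the helpers `nearest` and `min_max` are illustrative names rather than part of the notebook.

```python
import numpy as np

# Hypothetical customers: [credit_limit, avg_utilization_ratio]; values are illustrative.
train = np.array([
    [1000.0, 0.00],    # sets the minimum of both features
    [35000.0, 1.00],   # sets the maximum of both features
    [5200.0, 0.90],    # customer A: similar limit, very different utilization
    [6000.0, 0.12],    # customer B: somewhat different limit, similar utilization
])
query = np.array([5000.0, 0.10])


def nearest(points, q, order=2):
    """Index of the row of `points` closest to `q` (order=2 Euclidean, order=1 Manhattan)."""
    distances = np.linalg.norm(points - q, ord=order, axis=1)
    return int(np.argmin(distances))


low, high = train.min(axis=0), train.max(axis=0)


def min_max(values):
    """Min-max scale `values` with the ranges learned from `train`."""
    return (values - low) / (high - low)


nearest_raw = nearest(train, query)
nearest_scaled = nearest(min_max(train), min_max(query))
nearest_scaled_manhattan = nearest(min_max(train), min_max(query), order=1)

print(nearest_raw, nearest_scaled, nearest_scaled_manhattan)
```

On the raw features the credit limit dominates the distance, so customer A (row 2) is the nearest neighbour. After min-max scaling both features contribute comparably and customer B (row 3) becomes the nearest neighbour under both distance measures.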
    

### Conclusion
Having considered various evaluation metrics and the ROC curves, we conclude that the gradient boosting classifier is the best-performing algorithm among the models explored for studying and predicting customer churn behaviour on our data. This model can help the bank predict which customers might be considering leaving. With this information, the bank can proactively address the issues that lead those customers to consider leaving.

### References

- Feature selection:
    - https://machinelearningmastery.com/feature-selection-machine-learning-python/
    - https://towardsdatascience.com/mistakes-in-applying-univariate-feature-selection-methods-34c43ce8b93d
- Imbalanced datasets and the SMOTE oversampling technique:
    - https://machinelearningmastery.com/what-is-imbalanced-classification/
    - https://en.wikipedia.org/wiki/Oversampling_and_undersampling_in_data_analysis#Oversampling_techniques_for_classification_problems
    - https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/
    - https://www.kdnuggets.com/2017/06/7-techniques-handle-imbalanced-data.html
- Classification algorithms: https://analyticsindiamag.com/7-types-classification-algorithms/
- Stratified k-fold documentation: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html
- Decision tree documentation: https://scikit-learn.org/stable/modules/tree.html
- Revolving balance: https://www.creditcards.com/credit-card-news/glossary/term-revolvingbalance/#:~:text=In%20credit%20card%20terms%2C%20a,borrowed%20and%20the%20amount%20repaid.
- Comparing models: https://www.kaggle.com/klaudiajankowska/binary-classification-multiple-method-comparison?scriptVersionId=35642967
- Normality tests: https://machinelearningmastery.com/a-gentle-introduction-to-normality-tests-in-python/
- Random forest documentation: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html