<h2 style = "font-size:40px; font-family:Garamond ; font-weight : normal; background-color: #0667BB; color :#fed049   ; text-align: center; border-radius: 5px 5px; padding: 5px">Credit Card Users Churn Prediction</h2>

What is Customer Churn?
Customer churn means a customer ending their relationship with a bank or company for any reason. Although some churn is inevitable, a high churn rate makes it harder to reach business goals, so identifying customers who are likely to churn is very important for the business.

<h2 style = "font-family:TimesNewRoman;color:black;font-weight:bold">Table of Contents</h2>

- [Context](#Context) 
- [Data Dictionary](#Data-Dictionary)
- [Problem](#Problem)
- [Libraries](#Libraries)
- [Read and Understand Data](#Read-and-Understand-data)
- [Data Preprocessing](#Data-Preprocessing)
- [Exploratory Data Analysis](#Exploratory-Data-Analysis) 
- [Insights based on EDA](#Insights-based-on-EDA)
- [Missing value Detection and Treatment](#Missing-value-Detection-and-Treatment)   
- [Outlier Detection](#Outlier-Detection)
- [Model Building Logistic Regression](#Model-Building-Logistic-Regression)
- [HyperParameter Tuning](#Hyperparameter-Tuning) 
- [Conclusion](#Conclusion) 
- [Business Recommendations & Insights](#Business-Recommendations-&-Insights)
 



<h3 style = "font-size:40px; font-family:Garamond ; font-weight : normal; background-color: #0667BB; color :#fed049   ; text-align: center; border-radius: 5px 5px; padding: 5px">Context</h3>

Thera Bank recently saw a steep decline in the number of users of its credit card. Credit cards are a good source of income for banks because of the different kinds of fees they charge, such as annual fees, balance transfer fees, cash advance fees, late payment fees, foreign transaction fees, and others. Some fees are charged to every user irrespective of usage, while others are charged under specified circumstances.

Customers leaving the credit card service would mean a loss for the bank, so the bank wants to analyze its customer data, identify the customers who are likely to leave and the reasons for leaving, and improve in those areas.

**Objective**
- Explore and visualize the dataset.
- Build a classification model to predict if the customer is going to churn or not
- Optimize the model using appropriate techniques
- Generate a set of insights and recommendations that will help the bank


<h3 style = "font-size:40px; font-family:Garamond ; font-weight : normal; background-color: #0667BB; color :#fed049   ; text-align: center; border-radius: 5px 5px; padding: 5px">Data Dictionary</h3>

- CLIENTNUM: Client number. Unique identifier for the customer holding the account
- Attrition_Flag: Internal event (customer activity) variable - if the account is closed then "Attrited Customer" else "Existing Customer"
- Customer_Age: Age in Years
- Gender: Gender of the account holder
- Dependent_count: Number of dependents
- Education_Level:  Educational Qualification of the account holder - Graduate, High School, Unknown, Uneducated, College(refers to a college student), Post-Graduate, Doctorate.
- Marital_Status: Marital Status of the account holder
- Income_Category: Annual Income Category of the account holder
- Card_Category: Type of Card
- Months_on_book: Period of relationship with the bank
- Total_Relationship_Count: Total no. of products held by the customer
- Months_Inactive_12_mon: No. of months inactive in the last 12 months
- Contacts_Count_12_mon: No. of Contacts between the customer and bank in the last 12 months
- Credit_Limit: Credit Limit on the Credit Card
- Total_Revolving_Bal: The balance that carries over from one month to the next is the revolving balance
- Avg_Open_To_Buy: Open to Buy refers to the amount left on the credit card to use (Average of last 12 months)
- Total_Trans_Amt: Total Transaction Amount (Last 12 months)
- Total_Trans_Ct: Total Transaction Count (Last 12 months)
- Total_Ct_Chng_Q4_Q1: Ratio of the total transaction count in 4th quarter and the total transaction count in 1st quarter
- Total_Amt_Chng_Q4_Q1: Ratio of the total transaction amount in 4th quarter and the total transaction amount in 1st quarter
- Avg_Utilization_Ratio: Represents how much of the available credit the customer spent


**What Is a Revolving Balance?**

If we don't pay the balance of the revolving credit account in full every month, the unpaid portion carries over to the next month. That's called a revolving balance

**What is the Average Open to buy?**

'Open to Buy' means the amount left on your credit card to use. Now, this column represents the average of this value for the last 12 months.

**What is the Average utilization Ratio?**

The Avg_Utilization_Ratio represents how much of the available credit the customer spent. This is useful for calculating credit scores.

**Relation between Avg_Open_To_Buy, Credit_Limit and Avg_Utilization_Ratio:**

( Avg_Open_To_Buy / Credit_Limit ) + Avg_Utilization_Ratio = 1
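
The relation above can be checked numerically. The sketch below uses a small hypothetical table (the values are made up, not taken from BankChurners.csv) and assumes that open-to-buy equals the credit limit minus the revolving balance, and that utilization equals the revolving balance divided by the credit limit.

```python
import numpy as np
import pandas as pd

# Toy rows with hypothetical values (not taken from BankChurners.csv)
toy = pd.DataFrame({
    "Credit_Limit": [10000.0, 4000.0, 2500.0],
    "Total_Revolving_Bal": [2000.0, 0.0, 1250.0],
})

# Assumption: open-to-buy is the limit minus the revolving balance,
# and utilization is the revolving balance divided by the limit.
toy["Avg_Open_To_Buy"] = toy["Credit_Limit"] - toy["Total_Revolving_Bal"]
toy["Avg_Utilization_Ratio"] = toy["Total_Revolving_Bal"] / toy["Credit_Limit"]

# Left-hand side of the relation; it should be 1 for every row
lhs = toy["Avg_Open_To_Buy"] / toy["Credit_Limit"] + toy["Avg_Utilization_Ratio"]
relation_holds = bool(np.allclose(lhs, 1.0))
print(relation_holds)
```

On the real dataset the same `np.allclose` check would need a tolerance, since the stored ratio is rounded.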


<h2 style = "font-family:TimesNewRoman;color:black;font-weight:bold">Problem</h2> 

- Does income have any effect on attrition?
- Does sex have any relation to attrition?
- What are the signs of attrition?



[Top](#Table-of-Contents)

<h3 style = "font-size:40px; font-family:Garamond ; font-weight : normal; background-color: #0667BB; color :#fed049   ; text-align: center; border-radius: 5px 5px; padding: 5px">Libraries</h3>

In [None]:
!pip install imbalanced-learn

In [None]:
### IMPORT: ------------------------------------
import scipy.stats as stats 
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import warnings

import statsmodels.api as sm
#--Sklearn library--
from sklearn.model_selection import train_test_split,StratifiedKFold, cross_val_score # Sklearn package's randomized data splitting function

from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
    BaggingClassifier,
    StackingClassifier
)

from xgboost import XGBClassifier
from sklearn import tree
from sklearn.linear_model import LogisticRegression

# To impute missing values
from sklearn.impute import KNNImputer

# To tune models and compute metric scores
from sklearn.metrics import classification_report, accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

pd.set_option('display.max_rows', 300)
pd.set_option('display.max_columns', 25)
pd.set_option('display.max_colwidth',200)
# To suppress scientific notation in numerical display
pd.set_option('display.float_format', lambda x: '%.5f' % x)
warnings.filterwarnings('ignore') # To suppress warnings
# Set the background style for the graphs
plt.style.use('ggplot')
# For pandas profiling
from pandas_profiling import ProfileReport
print('Load Libraries-Done')



<h3 style = "font-size:40px; font-family:Garamond ; font-weight : normal; background-color: #0667BB; color :#fed049   ; text-align: center; border-radius: 5px 5px; padding: 5px">Read and Understand data</h3>

In [None]:
# Read the BankChurners CSV file
data_path='../input/bankcurners/BankChurners.csv'

df=pd.read_csv(data_path)


df_credit=df.copy()
print(f'There are {df_credit.shape[0]} rows and {df_credit.shape[1]} columns') # fstring 

In [None]:
# View the first  5 rows of the dataset.
df_credit.head()

In [None]:
# last 5 rows
df_credit.tail()

In [None]:
#Understand the  dataset.
#get the size of dataframe
print ("Rows     : " , df_credit.shape[0])  #get number of rows/observations
print ("Columns  : " , df_credit.shape[1]) #get number of columns
print ("#"*40,"\n","Features : \n\n", df_credit.columns.tolist()) #get name of columns/features
missing_df = pd.DataFrame({
    "Missing": df_credit.isnull().sum(),
    "Missing %": round((df_credit.isnull().sum()/ df_credit.isna().count()*100), 2)
})
display(missing_df.sort_values(by='Missing', ascending=False))

In [None]:
#### Check the data types of the columns for the dataset.
df_credit.info()

**Observations**

- Customer_Age, Months_on_book, Credit_Limit, Total_Revolving_Bal, Avg_Open_To_Buy, Total_Amt_Chng_Q4_Q1, Total_Trans_Amt, Total_Trans_Ct, Total_Ct_Chng_Q4_Q1 and Avg_Utilization_Ratio are continuous variables; the rest are treated as categorical variables.
- Attrition_Flag is the Target variable.
- There are no missing values.

#### Summary of the dataset.

In [None]:
df_credit.describe().T

**Observations**
- Average customer age is ~46 and the maximum customer age is 73.
- Average period of relationship with the bank is ~35 months, with a minimum of 13 and a maximum of 56.
- The maximum number of products held by a customer is 6, and the average is ~4.
- Mean Credit_Limit is 8631 while the median is 4549, which indicates that the data is right skewed with outliers.
- Total_Revolving_Bal has a mean of 1162.
- Avg_Open_To_Buy has a mean of 7469 and a maximum of 34516. This number also appears in Credit_Limit and seems to be some default value. The distribution is right skewed with some outliers on the higher end.
- Total_Amt_Chng_Q4_Q1 has a median of 0.73600 and a mean of 0.75994.
- Total_Trans_Amt has an average of 4404 and a median of 3899, which indicates that the data is right skewed with outliers on the higher end.
- Total_Trans_Ct has an average of 64.8 and a median of 67. Since the mean is below the median, this indicates slight skewness to the left.
- Total_Ct_Chng_Q4_Q1 has an average of 0.71 and a median of 0.702.
- Avg_Utilization_Ratio is right skewed, with an average of 0.27 and a median of 0.176.
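
The observations above read skew direction from the gap between mean and median. This is only a rule of thumb; the sketch below applies it to two made-up series and compares it with the sign of pandas' sample skewness. The helper name `skew_direction` is hypothetical.

```python
import pandas as pd

# Made-up data: one long tail on the high end, one on the low end
right_skewed = pd.Series([1, 2, 2, 3, 3, 3, 4, 50])
left_skewed = pd.Series([1, 47, 48, 48, 49, 49, 49, 50])

def skew_direction(s):
    """Rough skew indicator: compare the mean with the median."""
    if s.mean() > s.median():
        return "right"
    if s.mean() < s.median():
        return "left"
    return "symmetric"

for name, s in [("right_skewed", right_skewed), ("left_skewed", left_skewed)]:
    # Series.skew() is positive for a right tail and negative for a left tail
    print(name, skew_direction(s), round(s.skew(), 2))
```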


[Top](#Table-of-Contents)

<h3 style = "font-size:40px; font-family:Garamond ; font-weight : normal; background-color: #0667BB; color :#fed049   ; text-align: center; border-radius: 5px 5px; padding: 5px">Data Preprocessing</h3>

#### Dropping CLIENTNUM

In [None]:
df_credit.drop(['CLIENTNUM'],axis=1,inplace=True)

In [None]:

cat_cols = ['Attrition_Flag','Gender', 'Education_Level', 'Marital_Status', 'Income_Category', 'Card_Category','Dependent_count','Total_Relationship_Count','Months_Inactive_12_mon','Contacts_Count_12_mon']


In [None]:

for col in cat_cols:
    print(f"Feature: {col}")
    print("-"*40)
    display(pd.DataFrame({"Counts": df_credit[col].value_counts(dropna=False)}).sort_values(by='Counts', ascending=False))

**Observations**
- 1657 customers have attrited.
- Education level, income and marital status have an `Unknown` category. It will have to be treated as a missing value and imputed.
- The Blue card has the maximum number of customers.

In [None]:
## Converting the data type of categorical features to 'category'

cat_cols = ['Attrition_Flag','Gender', 'Education_Level', 'Marital_Status', 'Income_Category', 'Card_Category','Dependent_count','Total_Relationship_Count','Months_Inactive_12_mon','Contacts_Count_12_mon']

df_credit[cat_cols] = df_credit[cat_cols].astype('category')
df_credit.info()

In [None]:
df_credit.describe(include=['category']).T

#### Age

Age can be a vital factor in churn, so we convert ages into bins to explore whether there is any pattern.

In [None]:
df_credit.Customer_Age.describe()

In [None]:
df_credit['Agebin'] = pd.cut(df_credit['Customer_Age'], bins = [25, 35,45,55,65, 75], labels = ['25-35', '36-45', '46-55', '56-65','66-75'])

In [None]:
df_credit.Agebin.value_counts()
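
As a small check of how the binning behaves at the edges, the sketch below applies the same bins and labels to a few illustrative ages. With the default `right=True`, each interval includes its upper edge and excludes its lower edge, so an age of exactly 25 would fall outside the first bin.

```python
import pandas as pd

bins = [25, 35, 45, 55, 65, 75]
labels = ['25-35', '36-45', '46-55', '56-65', '66-75']

# Illustrative ages chosen to sit on the bin edges
ages = pd.Series([26, 35, 36, 45, 46, 73])
binned = pd.cut(ages, bins=bins, labels=labels)
print(binned.tolist())

# The lowest edge is exclusive, so age 25 is not assigned to any bin
edge_case = pd.cut(pd.Series([25]), bins=bins, labels=labels)
print(edge_case.isna().all())
```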

[Top](#Table-of-Contents)

<h3 style = "font-size:40px; font-family:Garamond ; font-weight : normal; background-color: #0667BB; color :#fed049   ; text-align: center; border-radius: 5px 5px; padding: 5px">Exploratory Data Analysis</h3>

In [None]:
def dist_box(data):
    # Plots a combined boxplot and histogram for univariate analysis of a
    # continuous variable: spread, central tendency, dispersion and outliers
    Name=data.name.upper()
    fig,(ax_box,ax_dis)  =plt.subplots(nrows=2,sharex=True,gridspec_kw = {"height_ratios": (.25, .75)},figsize=(8, 5))
    mean=data.mean()
    median=data.median()
    sns.set_theme(style="white")  # white background for the plots
    fig.suptitle("SPREAD OF DATA FOR "+ Name  , fontsize=18, fontweight='bold')
    sns.boxplot(x=data,showmeans=True, orient='h',color="tan",ax=ax_box)
    ax_box.set(xlabel='')
    sns.despine(top=True,right=True,left=True) # to remove side line from graph
    # histplot replaces the deprecated distplot(kde=False)
    sns.histplot(x=data,kde=False,color='red',ax=ax_dis)
    # label the reference lines so the legend maps to the right artists
    ax_dis.axvline(mean, color='r', linestyle='--',linewidth=2,label='Mean')
    ax_dis.axvline(median, color='g', linestyle='-',linewidth=2,label='Median')
    ax_dis.legend()
                    

In [None]:
# Select all quantitative columns for checking the spread
list_col=  df_credit.select_dtypes(include='number').columns.to_list()
for col in list_col:
    dist_box(df_credit[col])

**Observations**
- Customer age is almost normally distributed, with some outliers on the higher end.
- Months on book peaks around ~35-36; most customers have held the credit card for this long. It has many outliers on the lower and higher ends.
- Credit limit is right skewed, with a sudden peak at 35000. As seen before, this is the maximum limit and seems to be some kind of default value. There are a lot of outliers on the higher end. Customers above 25000 need to be investigated further.
- Total revolving balance has a different distribution: many customers have ~0 revolving balance, then it follows an almost normal distribution, and then there is a sudden peak at 2500.
- Average open to buy has the same distribution as credit limit.
- Total amount change has a lot of outliers on the lower and upper ends. Some customers have a Q4-to-Q1 total amount change ratio of about 3.5; these customers need to be investigated further.
- Total transaction amount also has a very different distribution, with clusters of data at 0-2500, 2500-5000, 7500-10000 and 12500-17500. It has a lot of outliers on the higher end.
- Total_Trans_Ct also has 3 modes, with outliers on the higher end.
- Total_Ct_Chng_Q4_Q1 has a normal-like distribution with a lot of outliers on the higher and lower ends.
- Avg_Utilization_Ratio measures how much of the given credit limit the customer is actually using. It ranges from 0 to 1.


In [None]:
# Count plots for all categorical variables, annotated with percentages

plt.figure(figsize=(15,20))

sns.set_theme(style="white")
sns.set_palette('twilight_shifted')
for i, variable in enumerate(cat_cols):
    plt.subplot(9,2,i+1)
    ax=sns.countplot(x=variable, data=df_credit)
    sns.despine(top=True,right=True,left=True) # to remove side line from graph
    for p in ax.patches:
        # share of all rows represented by this bar
        percentage = '{:.1f}%'.format(100 * p.get_height()/len(df_credit[variable]))
        x = p.get_x() + p.get_width() / 2 - 0.05
        y = p.get_y() + p.get_height()
        plt.annotate(percentage, (x, y),ha='center')
    plt.title(variable.upper())
plt.tight_layout()
                                     


**Observations**
- ~16% of credit card customers attrited.
- ~52% of credit card customers are female.
- ~30% of customers are graduates. There are very few postgraduate and doctorate customers.
- ~46% of customers are married. The 7.4% with unknown status need to be imputed.
- ~35% earn less than 40K.
- ~93% have a blue card. Very few customers have a platinum card.
- ~22% have more than 3 bank products.
- ~38% were inactive for 3 months. Customers who were inactive for 4, 5 or 6 months should be investigated further to see whether there is any relationship with attrition.
- ~60% were contacted 2-3 times in 12 months.

In [None]:
sns.set_palette(sns.color_palette("Set2", 8))
plt.figure(figsize=(15,12))
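# restrict to numeric columns; recent pandas versions raise on categorical columns in corr()
sns.heatmap(df_credit.select_dtypes(include='number').corr(),annot=True)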
plt.show()

In [None]:
sns.set_palette(sns.color_palette("Set1", 8))
sns.pairplot(df_credit, hue="Attrition_Flag",corner=True)
plt.show()

**Observations**
- Customer age and months on book are highly correlated.
- Credit limit and average utilization ratio have some negative correlation.
- Total revolving balance and average utilization ratio are positively correlated.
- Average open to buy is negatively correlated with average utilization ratio.
- There is very little correlation between total transaction amount and credit limit.
- As expected, there is a very high correlation between total transaction amount and total transaction count.
- Credit limit and average open to buy are fully correlated, so we can drop one of them.
- It is also logical that Total_Trans_Amt is correlated with Total_Amt_Chng_Q4_Q1 and Total_Ct_Chng_Q4_Q1. These features seem to be derived from Total_Trans_Amt, so we may be able to drop one of these columns.
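
Reading highly correlated pairs off a heatmap can be tedious. The sketch below lists them programmatically on synthetic data; the helper `high_corr_pairs` and the 0.8 threshold are illustrative assumptions, not part of the original analysis.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(frame, threshold=0.8):
    """Return column pairs whose absolute Pearson correlation exceeds threshold."""
    corr = frame.select_dtypes(include="number").corr().abs()
    # Keep only the upper triangle so each pair is reported once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Synthetic data: b is almost a multiple of a, c is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.01, size=200),
    "c": rng.normal(size=200),
})

result = high_corr_pairs(toy)
print(result)
```

Applied to the numeric columns of `df_credit`, a helper like this would give a shortlist of candidate columns to drop.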

In [None]:
### Function to plot distributions and boxplots of customers
def plot(x,target='Attrition_Flag'):
    fig,axs = plt.subplots(2,2,figsize=(12,10))
    axs[0, 0].set_title(f'Distribution of {x} \n of an existing customer',fontsize=12,fontweight='bold')
    # histplot with kde=True replaces the deprecated distplot
    sns.histplot(df_credit[(df_credit[target] == 'Existing Customer')][x],kde=True,stat='density',ax=axs[0,0],color='teal')
    axs[0, 1].set_title(f"Distribution of {x}\n of an attrited customer ",fontsize=12,fontweight='bold')
    sns.histplot(df_credit[(df_credit[target] == 'Attrited Customer')][x],kde=True,stat='density',ax=axs[0,1],color='orange')
    axs[1,0].set_title(f'Boxplot of {x} w.r.t attrited customer',fontsize=12,fontweight='bold')

    # dashed line separating the distribution row from the boxplot row
    line = plt.Line2D((.1,.9),(.5,.5), color='grey', linewidth=1.5,linestyle='--')
    fig.add_artist(line)

    # recent seaborn versions require x and y to be passed as keyword arguments
    sns.boxplot(x=df_credit[target],y=df_credit[x],ax=axs[1,0],palette='gist_rainbow',showmeans=True)
    axs[1,1].set_title(f'Boxplot of {x} w.r.t attrited customer - Without outliers',fontsize=12,fontweight='bold')
    sns.boxplot(x=df_credit[target],y=df_credit[x],ax=axs[1,1],showfliers=False,palette='gist_rainbow',showmeans=True) #turning off outliers from boxplot
    sns.despine(top=True,right=True,left=True) # to remove side line from graph
    plt.tight_layout(pad=4)
    plt.show()

In [None]:
# Select all quantitative columns and plot each one against the target
list_col=df_credit.select_dtypes(include='number').columns.to_list()
for col in list_col:
    plot(col)
   

**Observation**
- There is no visible difference in age, months on book, credit limit or average open to buy between attrited and existing customers; these features do not seem to have any relation with attrition.
- Existing customers seem to have a higher total revolving balance than customers who attrited.
- Customers with a lower transaction amount and a low Q4-to-Q1 change in transaction amount were more likely to attrite.
- Customers with a low number of transactions and a low Q4-to-Q1 change in transaction count attrited.
- On average, customers with lower utilization attrited.


In [None]:
plt.figure(figsize=(10,5)) 
sns.set_palette(sns.color_palette("tab20", 8))

sns.barplot(y='Credit_Limit',x='Income_Category',hue='Attrition_Flag',data=df_credit)
sns.despine(top=True,right=True,left=True) # to remove side line from graph
plt.legend(bbox_to_anchor=(1.00, 1))
plt.title('Income vs credit')

In [None]:
plt.figure(figsize=(10,5)) 
sns.set_palette(sns.color_palette("tab20", 9))
sns.barplot(y='Credit_Limit',x='Education_Level',hue='Attrition_Flag',data=df_credit)
sns.despine(top=True,right=True,left=True) # to remove side line from graph
plt.legend(bbox_to_anchor=(1.00, 1))
plt.title('Credit limit vs Education')


In [None]:
plt.figure(figsize=(10,5)) 
sns.set_palette(sns.color_palette("tab20", 9))
sns.barplot(x='Agebin',y='Credit_Limit',hue='Attrition_Flag',data=df_credit)
sns.despine(top=True,right=True,left=True) # to remove side line from graph
plt.legend(bbox_to_anchor=(1.00, 1))
plt.title('CustomerAge  vs Credit limit')


In [None]:
plt.figure(figsize=(10,5)) 
sns.set_palette(sns.color_palette("tab20", 9))
sns.barplot(x='Agebin',y='Total_Revolving_Bal',hue='Attrition_Flag',data=df_credit)
sns.despine(top=True,right=True,left=True) # to remove side line from graph
plt.legend(bbox_to_anchor=(1.00, 1))
plt.title('CustomerAge  vs Total Revolving Balance')

In [None]:
plt.figure(figsize=(10,5)) 
sns.set_palette(sns.color_palette("tab20", 9))
sns.barplot(x='Agebin',y='Total_Trans_Amt',hue='Attrition_Flag',data=df_credit)
sns.despine(top=True,right=True,left=True) # to remove side line from graph
plt.legend(bbox_to_anchor=(1.00, 1))
plt.title('CustomerAge  vs Total Transaction Amount')

In [None]:

plt.figure(figsize=(10,5)) 
sns.barplot(y='Credit_Limit',x='Gender',hue='Attrition_Flag',data=df_credit) 
sns.despine(top=True,right=True,left=True) # to remove side line from graph
plt.legend(bbox_to_anchor=(1.00, 1))
plt.title('Credit limit  vs Gender')

In [None]:

plt.figure(figsize=(10,5))
sns.barplot(y='Credit_Limit',x='Card_Category',hue='Attrition_Flag',data=df_credit) 
sns.despine(top=True,right=True,left=True) # to remove side line from graph
plt.legend(bbox_to_anchor=(1.00, 1))
plt.title('Credit Limit  vs Card Category')

In [None]:
plt.figure(figsize=(10,5))
sns.barplot(y='Total_Trans_Amt',x='Card_Category',hue='Attrition_Flag',data=df_credit) 
sns.despine(top=True,right=True,left=True) # to remove side line from graph
plt.legend(bbox_to_anchor=(1.00, 1))
plt.title('Total Transaction Amount  vs Card')

In [None]:
plt.figure(figsize=(10,5))
sns.barplot(y='Total_Trans_Ct',x='Card_Category',hue='Attrition_Flag',data=df_credit) 
sns.despine(top=True,right=True,left=True) # to remove side line from graph
plt.legend(bbox_to_anchor=(1.00, 1))
plt.title('Total Transaction Count  vs Card Category')

**Observation**
- Customers in the 36-55 age bins were given higher credit limits.
- Platinum card holders had higher credit limits.
- Customers earning more than 120K had higher credit limits.
- Male customers were given higher credit limits than female customers.

In [None]:
## Function to plot stacked bar chart
def stacked_plot(x):
    sns.set_palette(sns.color_palette("tab20", 8))
    # Raw counts per category, with row and column totals
    tab1 = pd.crosstab(x,df_credit['Attrition_Flag'],margins=True)
    display(tab1)
    # Row-normalized crosstab: share of each attrition class within a category
    tab = pd.crosstab(x,df_credit['Attrition_Flag'],normalize='index')
    tab.plot(kind='bar',stacked=True,figsize=(9,5))
    plt.xticks(rotation=0)
    plt.legend(loc="upper left",title=" ",bbox_to_anchor=(1,1))
    sns.despine(top=True,right=True,left=True) # to remove side line from graph
    plt.show()

In [None]:
if "Agebin" not in cat_cols:  # avoid duplicate entries when the cell is re-run
    cat_cols.append("Agebin")
for variable in cat_cols:
    stacked_plot(df_credit[variable])

**Observations**
- Female customers attrited more than male customers.
- Customers with a doctorate or postgraduate degree attrited most.
- Single customers attrited more.
- Customers who earned more than 120K or less than 40K attrited more.
- Customers with a platinum card attrited more, but there are only 20 samples, so this is inconclusive. Customers with a gold card attrited more than those with blue and silver cards. Analyzing the profile of customers by card type may help identify a pattern here.
- Customers with 3 dependents attrited more.
- Customers holding 1 or 2 bank products attrited more than customers with more bank products.
- Customers who were never inactive attrited most, but we can't be sure about this because there are only 29 samples. Among the rest, customers who were inactive for 4 months attrited most, followed by 3 and 5 months.
- Surprisingly, customers who were contacted most in the last 12 months attrited. Did the bank have information about their likely attrition, which is why it contacted them so often? Or did so much contact from the bank lead to attrition?
- Customers in the 66-75 age range attrited most, but this is inconclusive since there are only 18 samples. Customers in the 36-55 age range attrited more.
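
The comparisons above rest on attrition rates within each category rather than raw counts. As a minimal sketch on made-up rows (not the real dataset), a row-normalized crosstab gives those rates directly:

```python
import pandas as pd

# Made-up rows: 2 of 4 female and 1 of 4 male customers attrited
toy = pd.DataFrame({
    "Gender": ["F", "F", "F", "F", "M", "M", "M", "M"],
    "Attrition_Flag": [
        "Attrited Customer", "Attrited Customer", "Existing Customer", "Existing Customer",
        "Attrited Customer", "Existing Customer", "Existing Customer", "Existing Customer",
    ],
})

# normalize='index' turns each row into shares that sum to 1
rates = pd.crosstab(toy["Gender"], toy["Attrition_Flag"], normalize="index")
attrition_rate = rates["Attrited Customer"].sort_values(ascending=False)
print(attrition_rate)
```

Sorting the attrited share makes it easy to see which level of a feature churns most, while the raw counts still matter for judging whether a level has enough samples.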


In [None]:
#Profile of Attrited Customer with Blue Card 
df_credit[(df_credit['Card_Category']=='Blue') & (df_credit['Attrition_Flag']=='Attrited Customer')].describe(include='all').T

In [None]:
#Profile of Attrited Customer with gold Card 
df_credit[(df_credit['Card_Category']=='Gold') & (df_credit['Attrition_Flag']=='Attrited Customer')].describe(include='all').T

In [None]:
#Profile of Attrited Customer with silver  Card 
df_credit[(df_credit['Card_Category']=='Silver') & (df_credit['Attrition_Flag']=='Attrited Customer')].describe(include='all').T

In [None]:
#Profile of Attrited Customer with platinum Card 
df_credit[(df_credit['Card_Category']=='Platinum') & (df_credit['Attrition_Flag']=='Attrited Customer')].describe(include='all').T

**Profile of customers who attrited most, based on their card type**
- #### Blue Card
    - Most likely female, married, in the 46-55 age group, earning less than 40K, with graduate-level education, 3 dependents and 3 bank products, and inactive for 3 months. Their average utilization ratio was very low.
    
- #### Gold Card
    - Most likely male, single, in the 36-45 age group, earning 60K-80K, with graduate-level education, and inactive for 3 months.
- #### Silver Card
    - Most likely male, single, in the 46-55 age group, earning 80K-120K, with graduate-level education, and inactive for 3 months.
- #### Platinum card
    - Most likely female, single, in the 46-55 age group, earning less than 40K, with graduate-level education, and inactive for 3 months.



<h3 style = "font-size:40px; font-family:Garamond ; font-weight : normal; background-color: #0667BB; color :#fed049   ; text-align: center; border-radius: 5px 5px; padding: 5px">Insights based on EDA</h3>

- ~16% of customers attrited.
- Female customers attrited more than male customers.
- Single customers attrited more.
- Customers who earned more than 120K or less than 40K attrited more.
- Customers with a platinum card attrited more, but there are only 20 samples, so this is inconclusive. Customers with a gold card attrited more than those with blue and silver cards.
- Customers in the 36-55 age range attrited more.
- Customers with a doctorate or postgraduate degree attrited most.
- Surprisingly, attrition has been higher when there is a higher number of contacts with the bank in the last 12 months.


[Top](#Table-of-Contents)

<h3 style = "font-size:40px; font-family:Garamond ; font-weight : normal; background-color: #0667BB; color :#fed049   ; text-align: center; border-radius: 5px 5px; padding: 5px">Outlier Detection</h3>


In [None]:
# 25th and 75th percentiles of the numeric columns
# (numeric_only=True skips the categorical columns, which newer pandas versions may otherwise reject)
Q1 = df_credit.quantile(0.25, numeric_only=True)
Q3 = df_credit.quantile(0.75, numeric_only=True)

IQR = Q3 - Q1                           #Interquartile range (75th percentile - 25th percentile)

lower=Q1-1.5*IQR                        #Lower and upper bounds; values outside these bounds are flagged as outliers
upper=Q3+1.5*IQR

In [None]:
((df_credit.select_dtypes(include=['float64','int64'])<lower) | (df_credit.select_dtypes(include=['float64','int64'])>upper)).sum()/len(df_credit)*100
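
To make the IQR rule concrete, here is a minimal sketch on a single made-up series. The helper `iqr_outlier_share` is hypothetical and mirrors the 1.5 × IQR bounds used in this section.

```python
import pandas as pd

def iqr_outlier_share(s, k=1.5):
    """Percentage of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return ((s < lower) | (s > upper)).mean() * 100

# Made-up values: nine tightly packed points and one extreme value
toy = pd.Series([10, 11, 12, 13, 14, 15, 16, 17, 18, 100])
share = iqr_outlier_share(toy)
print(round(share, 1))
```

Only the extreme value falls outside the bounds here, so one in ten points is flagged.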

In [None]:
numeric_columns = df_credit.select_dtypes('number').columns.to_list()
# outlier detection using boxplot
plt.figure(figsize=(20,30))

for i, variable in enumerate(numeric_columns):
    plt.subplot(4,4,i+1)
    plt.boxplot(df_credit[variable],whis=1.5)
    plt.title(variable)
plt.tight_layout()

plt.show()

In [None]:
print(upper)

In [None]:
df_credit[df_credit['Credit_Limit'] > upper.Credit_Limit].sort_values(by='Credit_Limit',ascending=False ).count()

In [None]:
df_credit[df_credit['Credit_Limit']== 34516.00000].count() # had seen this number during EDA so verifying


508 customers have a credit limit of 34516, which seems to be some default value.

In [None]:
df_credit[df_credit['Total_Trans_Amt'] > upper.Total_Trans_Amt].sort_values(by='Total_Trans_Amt',ascending=False ).head(10)

896 customers have a transaction amount greater than 8619.25000. Given their transaction counts, this data seems to be correct.

In [None]:
df_credit[df_credit['Avg_Open_To_Buy'] > upper.Avg_Open_To_Buy].sort_values(by='Avg_Open_To_Buy',ascending=False ).head(10)

We are not treating outliers here, because we want the algorithms to learn from these values.

<h3 style = "font-size:40px; font-family:Garamond ; font-weight : normal; background-color: #0667BB; color :#fed049   ; text-align: center; border-radius: 5px 5px; padding: 5px">Missing Value Detection & Treatment</h3>


The columns Education_Level, Marital_Status and Income_Category contain `Unknown` values, which can be treated as missing values. We replace `Unknown` with NaN.

In [None]:
df_credit = df_credit.replace({'Unknown': None})


In [None]:
df_credit.isnull().sum()

[Top](#Table-of-Contents)

In [None]:
df_credit.info()

### Missing-Value Treatment

* We will use the KNN imputer to impute missing values. k-Nearest Neighbours (kNN) identifies neighboring points through a measure of distance, and the missing values are estimated from the observed values of those neighbors.
* `KNNImputer`: Each sample's missing values are imputed by looking at the n_neighbors nearest neighbors found in the training set. The default value is n_neighbors=5.
* The KNN imputer replaces a missing value with the average of that feature over the k nearest neighbors that have a value for it.
* Nearest points are found using a Euclidean distance that ignores missing coordinates (`nan_euclidean`, the default metric).
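
The bullets above can be illustrated on a tiny made-up array with one missing value. In this sketch `n_neighbors=2` is chosen only to keep the arithmetic easy to follow: the two rows closest to the incomplete row supply the values that are averaged.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Made-up data: row 1 is missing its second feature
X = np.array([
    [1.0, 2.0],
    [1.1, np.nan],
    [0.9, 2.2],
    [8.0, 9.0],
])

imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)

# Rows 0 and 2 are nearest on the observed feature, so the gap is
# filled with the mean of their second-feature values (2.0 and 2.2)
print(X_filled[1])
```

Because the result is an average, imputing an encoded categorical column this way can produce fractional codes that need rounding before they are decoded.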

In [None]:
# Encode categorical variables as integers
attrition = {'Existing Customer':0, 'Attrited Customer':1}
# cast to int so the target is a plain 0/1 column rather than a category
df_credit['Attrition_Flag']=df_credit['Attrition_Flag'].map(attrition).astype(int)

marital_status = {'Married':1,'Single':2, 'Divorced':3}
df_credit['Marital_Status']=df_credit['Marital_Status'].map(marital_status)


education = {'Uneducated':1,'High School':2, 'Graduate':3, 'College':4, 'Post-Graduate':5, 'Doctorate':6}
df_credit['Education_Level']=df_credit['Education_Level'].map(education)

income = {'Less than $40K':1,'$40K - $60K':2, '$60K - $80K':3, '$80K - $120K':4, '$120K +':5}
df_credit['Income_Category']=df_credit['Income_Category'].map(income)



In [None]:
imputer = KNNImputer(n_neighbors=5)

In [None]:
reqd_col_for_impute = ['Income_Category','Education_Level','Marital_Status']

### Split the dataset

* Since we have a significant imbalance in the distribution of the target classes, we will use stratified sampling to ensure that relative class frequencies are approximately preserved in train and test sets. 
* For that we will use the `stratify` parameter in the train_test_split function.
* We drop Avg_Open_To_Buy because it is highly correlated with Credit_Limit.
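
As a quick illustration of what `stratify` does, the sketch below splits a synthetic target with 16% positives (chosen to resemble the attrition rate seen in the EDA) and compares the positive rate in each part.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced target: 160 positives out of 1000 rows
y = np.array([1] * 160 + [0] * 840)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)

# With stratification both parts keep (approximately) the same class ratio
train_rate, test_rate = y_tr.mean(), y_te.mean()
print(round(train_rate, 3), round(test_rate, 3))
```

Without `stratify`, the two rates can drift apart by chance, which matters most for the minority class.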

In [None]:
# Separating target column
X = df_credit.drop(['Agebin','Attrition_Flag','Avg_Open_To_Buy'],axis=1)
#X = pd.get_dummies(X,drop_first=True)
y = df_credit['Attrition_Flag']

In [None]:
# Splitting the data into train and test sets in 70:30 ratio
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1,stratify=y)
X_train.shape, X_test.shape

In [None]:
#Fit and transform the train data
X_train[reqd_col_for_impute]=imputer.fit_transform(X_train[reqd_col_for_impute])

#Transform the test data 
X_test[reqd_col_for_impute]=imputer.transform(X_test[reqd_col_for_impute])

In [None]:
#Checking that no column has missing values in train or test sets
print(X_train.isnull().sum())
print('-'*30)
print(X_test.isnull().sum())

In [None]:
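## Function to invert the encoding on the imputed train and test sets
def inverse_mapping(mapping,col):
    inv_dict = {v: k for k, v in mapping.items()}
    # KNN averages can be fractional, so round to the nearest code before decoding
    X_train[col] = np.round(X_train[col]).map(inv_dict).astype('category')
    X_test[col] = np.round(X_test[col]).map(inv_dict).astype('category')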

In [None]:

inverse_mapping(education,'Education_Level')
inverse_mapping(marital_status,'Marital_Status')
inverse_mapping(income,'Income_Category')
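
As a small illustration of what `inverse_mapping` does, the sketch below rounds a few imputed codes and maps them back to labels. Only the `education` dictionary comes from the notebook; the `imputed_codes` values are made up for illustration.

```python
# Encoding used earlier in the notebook for Education_Level
education = {'Uneducated': 1, 'High School': 2, 'Graduate': 3,
             'College': 4, 'Post-Graduate': 5, 'Doctorate': 6}

# Reverse lookup: code -> label
inverse_education = {code: label for label, code in education.items()}

# Hypothetical KNN-imputed values (averages of neighbouring codes, so not always integers)
imputed_codes = [2.6, 3.0, 4.4]

# Round to the nearest valid code, then map back to the original label
recovered = [inverse_education[round(value)] for value in imputed_codes]
print(recovered)
```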



### Encoding categorical variables

In [None]:
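X_train=pd.get_dummies(X_train,drop_first=True)
X_test=pd.get_dummies(X_test,drop_first=True)
# Align the test columns with train in case a category level is missing from one split
X_test=X_test.reindex(columns=X_train.columns,fill_value=0)
print(X_train.shape, X_test.shape)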

In [None]:
X_train

In [None]:
# Defining empty lists to collect train and test results for every model
model_name=[]
acc_train = []
acc_test = []
recall_train = []
recall_test = []
precision_train = []
precision_test = []
f1_train = []
f1_test = []

def make_confusion_matrix(y_actual,y_predict,title):
    '''Plot confusion matrix'''
    fig, ax = plt.subplots(1, 1)
    
    cm = confusion_matrix(y_actual, y_predict, labels=[0,1])
    disp = ConfusionMatrixDisplay(confusion_matrix=cm,
                               display_labels=["No","Yes"])
    disp.plot(cmap='Blues',ax=ax)
    
    ax.set_title(title)
    plt.tick_params(axis=u'both', which=u'both',length=0)
    ax.grid(False)  # hide grid lines (passing both `b` and `visible` to plt.grid fails on recent matplotlib)
    plt.show()

In [None]:
def get_metrics_score(model,modelname,X_train_pass,X_test_df_pass,y_train_pass,y_test_pass):
    '''
    Calculate Accuracy, Recall, Precision and F1 score on train and test data,
    append them to the global result lists and plot both confusion matrices.
    model: fitted classifier used to predict
    modelname: label stored in the comparison table
    X_train_pass, X_test_df_pass: independent features of the train and test sets
    y_train_pass, y_test_pass: dependent variable of the train and test sets
    '''
    # defining an empty list to store train and test results
    score_list=[]
    
    pred_train = model.predict(X_train_pass)
    pred_test = model.predict(X_test_df_pass)
    pred_train = np.round(pred_train)
    pred_test = np.round(pred_test)
    train_acc = accuracy_score(y_train_pass,pred_train)
    test_acc = accuracy_score(y_test_pass,pred_test)
    train_recall = recall_score(y_train_pass,pred_train)
    test_recall = recall_score(y_test_pass,pred_test)
    train_precision = precision_score(y_train_pass,pred_train)
    test_precision = precision_score(y_test_pass,pred_test)
    train_f1 = f1_score(y_train_pass,pred_train)
    test_f1 = f1_score(y_test_pass,pred_test)
    score_list.extend((train_acc,test_acc,train_recall,test_recall,train_precision,test_precision,train_f1,test_f1))
    model_name.append(modelname)  
    acc_train.append(score_list[0])
    acc_test.append(score_list[1])
    recall_train.append(score_list[2])
    recall_test.append(score_list[3])
    precision_train.append(score_list[4])
    precision_test.append(score_list[5])
    f1_train.append(score_list[6])
    f1_test.append(score_list[7])
    metric_names = ['Train_Accuracy', 'Test_Accuracy', 'Train_Recall', 'Test_Recall','Train_Precision',
                          'Test_Precision', 'Train_F1-Score', 'Test_F1-Score']
    cols = ['Metric', 'Score']
    records = [(name, score) for name, score in zip(metric_names, score_list)]
    display(pd.DataFrame.from_records(records, columns=cols, index='Metric').T)
    # display confusion matrix
    make_confusion_matrix(y_train_pass,pred_train,"Confusion Matrix for Train")     
    make_confusion_matrix(y_test_pass,pred_test,"Confusion Matrix for Test") 
    return score_list # returning the list with train and test scores

<h1 style = "font-family:TimesNewRoman;color:black;font-weight:bold">Model Building </h1> 

### Model evaluation criterion:

#### Model can make wrong predictions as:
1. Predicting a customer will churn but they do not - Loss of resources
2. Predicting a customer will not churn but they do - Loss of income

#### Which case is more important? 
* Predicting that a customer will not churn when they actually do, i.e. losing a potential source of income for the bank. If such customers are identified in time, the bank can take action to stop them from churning.

#### How to reduce this loss, i.e. the need to reduce False Negatives?
* The bank wants Recall to be maximized: the greater the Recall, the lower the chance of false negatives, i.e. of predicting that a customer will not churn when in reality they do.
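
The link between recall and false negatives can be sketched in a few lines of plain Python. The helper `recall_precision` and the toy labels are made up for illustration; the notebook itself relies on `recall_score` and `precision_score`.

```python
def recall_precision(y_true, y_pred):
    """Compute recall and precision for the positive class (1 = attrited)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # missed churners
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false alarms
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision

# Toy example: 4 customers actually churn, the model misses one and raises one false alarm
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
recall, precision = recall_precision(y_true, y_pred)
print(f"recall={recall:.2f}, precision={precision:.2f}")
```

Every missed churner (false negative) lowers recall, which is why recall is the metric tracked throughout the rest of the notebook.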

# Model Building Logistic Regression

In [None]:
#Initialize model using pipeline
pipe_lr = make_pipeline( StandardScaler(), (LogisticRegression(random_state=1)))

#Fit on train data
pipe_lr.fit(X_train,y_train)

In [None]:
lr_score=get_metrics_score(pipe_lr,'LogisticRegression',X_train,X_test,y_train,y_test)

**Let's evaluate the model performance by using KFold and cross_val_score**

Stratified K-Folds cross-validation provides dataset indices to split data into train/validation sets. The dataset is split into k folds that each preserve the class proportions of the target (shuffled here, since `shuffle=True` is passed; the default is no shuffling). Each fold is then used once as validation while the k - 1 remaining folds form the training set.
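
As a rough sketch of the idea (not scikit-learn's actual implementation), stratified folds can be built by dealing the shuffled indices of each class round-robin into k folds. The function name `stratified_kfold_indices` and the toy labels are assumptions for this example.

```python
import random

def stratified_kfold_indices(labels, n_splits=5, seed=1):
    """Yield (train_idx, val_idx) pairs whose folds keep the class proportions."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_splits)]
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the indices of this class round-robin into the folds
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)
    for k in range(n_splits):
        val_idx = sorted(folds[k])
        train_idx = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        yield train_idx, val_idx

# Toy target: 40 existing (0) and 10 attrited (1) customers
toy_labels = [0] * 40 + [1] * 10
fold_positives = [sum(toy_labels[i] for i in val_idx)
                  for _, val_idx in stratified_kfold_indices(toy_labels)]
print(fold_positives)  # number of attrited customers in each validation fold
```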

In [None]:
#Evaluate the model performance by using KFold and cross_val_score
scoring='recall'
kfold=StratifiedKFold(n_splits=5,shuffle=True,random_state=1)     #Setting number of splits equal to 5
lr_cv_result=cross_val_score(estimator=pipe_lr, X=X_train, y=y_train, scoring=scoring, cv=kfold)

#Plotting boxplots for CV scores of model defined above
plt.boxplot(lr_cv_result)
plt.show()

* Cross-validated recall on the training set ranges from 0.58 to 0.66, with an average recall of 0.61.

**Handling Imbalanced dataset**

This is an imbalanced dataset. A problem with imbalanced classification is that there are too few examples of the minority class for a model to effectively learn the decision boundary.

One approach to addressing imbalanced datasets is to oversample the minority class. The simplest approach duplicates examples from the minority class in the training dataset prior to fitting a model; this balances the class distribution but does not add any new information to the model. Instead, new examples can be synthesized from the existing examples. This is a type of data augmentation for the minority class and is referred to as the **`Synthetic Minority Oversampling Technique`**, or SMOTE for short.
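
The core idea of SMOTE, interpolating between a minority sample and one of its nearest minority neighbours, can be sketched in plain Python. This is a simplified illustration rather than the imbalanced-learn implementation; `smote_sample` and the toy points are made up for the example.

```python
import math
import random

def smote_sample(minority, k=2, n_new=4, seed=1):
    """Create n_new synthetic points on segments between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen base point (excluding itself)
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

# Toy minority-class points with two features
minority_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (3.0, 3.0)]
new_points = smote_sample(minority_points)
print(new_points)
```

Each synthetic point lies between two real minority points, so it adds variation instead of exact duplicates.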

### Over Sampling
Since the dataset is imbalanced, let's try oversampling using SMOTE and see if performance can be improved.

In [None]:
print(f"Before UpSampling, counts of label attrited customer: {sum(y_train==1)}")
print(f"Before UpSampling, counts of label existing customer: {sum(y_train==0)} \n")

sm = SMOTE(sampling_strategy = 1 ,k_neighbors = 5, random_state=1)   #Synthetic Minority Over Sampling Technique
X_train_over, y_train_over = sm.fit_resample(X_train, y_train.ravel())

print(f"After UpSampling, counts of label attrited customer: {sum(y_train_over==1)}")
print(f"After UpSampling, counts of label existing customer: {sum(y_train_over==0)} \n")

print(f'After UpSampling, the shape of train_X: {X_train_over.shape}')
print(f'After UpSampling, the shape of train_y: {y_train_over.shape} \n')

In [None]:
lr_over = LogisticRegression(solver='liblinear')
lr_over.fit(X_train_over, y_train_over)

In [None]:
lr_score_over=get_metrics_score(lr_over,'LogisticRegression with over sampling',X_train_over,X_test,y_train_over,y_test)

The recall on test data is only 0.48 and the model is overfitting: there is a large discrepancy between the train and test scores. Let's try regularization.

**What is Regularization ?**

A regression algorithm (linear or logistic) works by selecting coefficients for each independent variable that minimize a loss function. However, if the coefficients are large, they can lead to overfitting on the training dataset, and such a model will not generalize well to unseen test data. This is where regularization helps. Regularization is the process that shrinks the coefficients towards zero. In simple words, regularization discourages learning a more complex or flexible model, to prevent overfitting.

**Main Regularization Techniques**

**Ridge Regression (L2 Regularization)**

`Ridge regression` adds the “squared magnitude” of the coefficients as a penalty term to the loss function.


**Lasso Regression (L1 Regularization)**

`Lasso` adds the “absolute value of magnitude” of the coefficients as a penalty term to the loss function.

**Elastic Net Regression**

`Elastic net regression` combines the properties of ridge and lasso regression. It works by penalizing the model using both the `l2-norm` and the `l1-norm`.

Elastic Net Formula: Ridge + Lasso
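
The three penalty terms can be computed by hand for a toy coefficient vector. The helper names and the simple `l1_ratio` mix below are assumptions for illustration; libraries may scale the terms differently. In scikit-learn's `LogisticRegression`, the strength of the penalty is controlled through `C`, the inverse of the regularization strength, so a smaller `C` means stronger shrinkage.

```python
def l1_penalty(coefs):
    """Lasso penalty: sum of absolute coefficient values."""
    return sum(abs(c) for c in coefs)

def l2_penalty(coefs):
    """Ridge penalty: sum of squared coefficient values."""
    return sum(c * c for c in coefs)

def elastic_net_penalty(coefs, l1_ratio=0.5):
    """Illustrative elastic net penalty: weighted mix of the L1 and L2 terms."""
    return l1_ratio * l1_penalty(coefs) + (1 - l1_ratio) * l2_penalty(coefs)

# Toy coefficients of a fitted model (made-up values)
coefs = [0.5, -2.0, 0.0, 1.5]
print("L1:", l1_penalty(coefs))
print("L2:", l2_penalty(coefs))
print("Elastic net:", elastic_net_penalty(coefs))
```

The large coefficient (-2.0) dominates the L2 term because it is squared, which is why ridge shrinks large coefficients most strongly.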


### Regularization on Oversampled dataset

In [None]:
# Choose the type of classifier. 
pipe_lr_reg = make_pipeline( StandardScaler(), (LogisticRegression(random_state=1)))

# Grid of parameters to choose from
parameters = {'logisticregression__C': np.arange(0.007,0.5,0.01),
              'logisticregression__solver' : ['liblinear','newton-cg','lbfgs','sag','saga'],
              'logisticregression__penalty': ['l1','l2']
             }

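# Run the randomized search
# Note: some sampled solver/penalty pairs are incompatible (e.g. 'lbfgs' with 'l1'); those fits fail with a warning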
grid_obj = RandomizedSearchCV(pipe_lr_reg, parameters, scoring='recall',n_jobs=-1)
grid_obj = grid_obj.fit(X_train_over, y_train_over)

# Set the clf to the best combination of parameters
pipe_lr_reg = grid_obj.best_estimator_

# Fit the best algorithm to the data. 
pipe_lr_reg.fit(X_train_over, y_train_over)

In [None]:
lr_score_over_reg=get_metrics_score(pipe_lr_reg,'LogisticRegression with Regularization on Over sampling',X_train_over,X_test,y_train_over,y_test)

The recall on test data has improved. Let's see if undersampling can improve the recall further.

### Undersampling 
Let's try undersampling and see if performance is different.
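
Random undersampling keeps all minority examples and randomly drops majority examples until the classes are balanced. The plain-Python sketch below illustrates the idea; `random_undersample` and the toy counts are made up and are not the `RandomUnderSampler` implementation.

```python
import random
from collections import Counter

def random_undersample(labels, seed=1):
    """Return the indices kept after shrinking every class to the minority size."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    n_minority = min(len(indices) for indices in by_class.values())
    kept = []
    for indices in by_class.values():
        # Sample without replacement; the minority class is kept in full
        kept.extend(rng.sample(indices, n_minority))
    return sorted(kept)

# Toy target: 84 existing (0) and 16 attrited (1) customers
toy_labels = [0] * 84 + [1] * 16
kept_idx = random_undersample(toy_labels)
balanced_counts = Counter(toy_labels[i] for i in kept_idx)
print(dict(balanced_counts))
```

The price of balancing this way is that most majority-class rows are discarded, so the model trains on far less data.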

In [None]:
rus = RandomUnderSampler(random_state = 1) # Randomly undersample the majority class
X_train_under, y_train_under = rus.fit_resample(X_train, y_train)
#Undersample to balance classes
print("Before Under Sampling, counts of label 'Attrited': {}".format(sum(y_train==1)))
print("Before Under Sampling, counts of label 'Existing': {} \n".format(sum(y_train==0)))

print("After Under Sampling, counts of label 'Attrited': {}".format(sum(y_train_under==1)))
print("After Under Sampling, counts of label 'Existing': {} \n".format(sum(y_train_under==0)))

print('After Under Sampling, the shape of train_X: {}'.format(X_train_under.shape))
print('After Under Sampling, the shape of train_y: {} \n'.format(y_train_under.shape))
                                          

### Logistic Regression on undersampled data

In [None]:
# Initialize model using pipeline
pipe_lr_under = make_pipeline( StandardScaler(), (LogisticRegression(random_state=1)))

# Training the basic logistic regression model with training set 
pipe_lr_under.fit(X_train_under,y_train_under)

In [None]:
#Evaluate the model performance by using KFold and cross_val_score
scoring='recall'
kfold=StratifiedKFold(n_splits=5,shuffle=True,random_state=1)     #Setting number of splits equal to 5
cv_result_under=cross_val_score(estimator=pipe_lr_under, X=X_train_under, y=y_train_under, scoring=scoring, cv=kfold)

#Plotting boxplots for CV scores of model defined above
plt.boxplot(cv_result_under)
plt.show()

In [None]:
lr_score_under=get_metrics_score(pipe_lr_under,'LogisticRegression with under sampling',X_train_under,X_test,y_train_under,y_test)


### Observation

- The model after undersampling generalizes well on the training and test sets. Our test recall after undersampling was better than our test recall after oversampling. Let's try regularization with all the solvers and different penalties and see.

In [None]:
# Choose the type of classifier. 
pipe_lr_reg_under = make_pipeline( StandardScaler(), (LogisticRegression(random_state=1)))

# Grid of parameters to choose from
parameters = {'logisticregression__C': np.arange(0.007,0.5,0.01),
              'logisticregression__solver' : ['liblinear','newton-cg','lbfgs','sag','saga'],
              'logisticregression__penalty': ['l1','l2']
             }

# Run the randomized search
grid_obj = RandomizedSearchCV(pipe_lr_reg_under, parameters, scoring='recall',n_jobs=-1)
grid_obj = grid_obj.fit(X_train_under, y_train_under)

# Set the clf to the best combination of parameters
pipe_lr_reg_under = grid_obj.best_estimator_

# Fit the best algorithm to the data. 
pipe_lr_reg_under.fit(X_train_under, y_train_under)

In [None]:
lr_score_reg=get_metrics_score(pipe_lr_reg_under,'LogisticRegression with Regularization on Undersampled',X_train_under,X_test,y_train_under,y_test)


<h1 style = "font-family:TimesNewRoman;color:black;font-weight:bold"> Model Performance Evaluation and Improvement-Logistic Regression</h1>

In [None]:
comparison_frame = pd.DataFrame({'Model':model_name,
                                          'Train_Accuracy': acc_train,'Test_Accuracy': acc_test,
                                          'Train_Recall':recall_train,'Test_Recall':recall_test,
                                          'Train_Precision':precision_train,'Test_Precision':precision_test,
                                          'Train_F1':f1_train,
                                          'Test_F1':f1_test  }) 

#Sorting models in decreasing order of test recall
comparison_frame.sort_values(by='Test_Recall',ascending=False)

Logistic Regression with undersampling gives a generalized model and the best recall, 0.857.

<h1 style = "font-family:TimesNewRoman;color:black;font-weight:bold">  Model building Decision Tree ,Bagging and Boosting</h1>

Here I am building different models using KFold and cross_val_score with pipelines, and will tune the best 3 models using GridSearchCV and RandomizedSearchCV.

Stratified K-Folds cross-validation provides dataset indices to split data into train/validation sets. It splits the dataset into k folds (shuffled here, since `shuffle=True` is passed; the default is no shuffling), keeping the distribution of both classes in each fold the same as in the target variable. Each fold is then used once as validation while the k - 1 remaining folds form the training set.

In [None]:
models = []  # Empty list to store all the models

# Appending pipelines for each model into the list
models.append(
    (
        "DTREE",
        Pipeline(
            steps=[
                ("scaler", StandardScaler()),
                ("decision_tree", DecisionTreeClassifier(random_state=1)),
            ]
        ),
    )
)

models.append(
    (
        "RF",
        Pipeline(
            steps=[
                ("scaler", StandardScaler()),
                ("random_forest", RandomForestClassifier(random_state=1)),
            ]
        ),
    )
)

models.append(
    (
        "BG",
        Pipeline(
            steps=[
                ("scaler", StandardScaler()),
                ("bagging", BaggingClassifier(random_state=1)),
            ]
        ),
    )
)
models.append(
    (
        "GBM",
        Pipeline(
            steps=[
                ("scaler", StandardScaler()),
                ("gradient_boosting", GradientBoostingClassifier(random_state=1)),
            ]
        ),
    )
)
models.append(
    (
        "ADB",
        Pipeline(
            steps=[
                ("scaler", StandardScaler()),
                ("adaboost", AdaBoostClassifier(random_state=1)),
            ]
        ),
    )
)
models.append(
    (
        "XGB",
        Pipeline(
            steps=[
                ("scaler", StandardScaler()),
                ("xgboost", XGBClassifier(random_state=1,eval_metric='logloss')),
            ]
        ),
    )
)


results = []  # Empty list to store all model's CV scores
names = []  # Empty list to store name of the models

# loop through all models to get the mean cross validated score
for name, model in models:
    scoring = "recall"
    kfold = StratifiedKFold(
        n_splits=5, shuffle=True, random_state=1
    )  # Setting number of splits equal to 5
    cv_result = cross_val_score(
        estimator=model, X=X_train, y=y_train, scoring=scoring, cv=kfold
    )
    results.append(cv_result)
    names.append(name)
    print("{}: {}".format(name, cv_result.mean() * 100))

In [None]:
# Plotting boxplots for CV scores of all models defined above
fig = plt.figure(figsize=(10, 7))

fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)

plt.boxplot(results)
ax.set_xticklabels(names)

plt.show()

- We can see that XGBoost gives the highest cross-validated recall with just one outlier, followed by Gradient Boost and AdaBoost. The Bagging classifier had a maximum recall of ~84 but a minimum of 74, resulting in a mean of only ~79; therefore I didn't choose the Bagging classifier.
- The three best-performing models are XGBoost, Gradient Boost and AdaBoost.
- We will tune our 3 best models to see if the performance improves after tuning.

# Hyperparameter Tuning

We will use pipelines with StandardScaler and a classifier, and tune each model using GridSearchCV and RandomizedSearchCV. We will also compare the performance and time taken by these two methods - grid search and randomized search.

**Random Search**: Define a search space as a bounded domain of hyperparameter values and randomly sample points in that domain.

**Grid Search**: Define a search space as a grid of hyperparameter values and evaluate every position in the grid.
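
To make the cost difference concrete, the sketch below counts the model fits for a grid shaped like the AdaBoost grid in this notebook (10 values of `n_estimators`, 5 learning rates, 3 tree depths) under 5-fold cross-validation. It only enumerates candidates with the standard library; the `max_depth` key is a simplification standing in for the three base estimators.

```python
import itertools
import random

# Search space shaped like the AdaBoost grid (values only, no models are fitted)
grid = {
    "n_estimators": list(range(10, 110, 10)),
    "learning_rate": [0.1, 0.01, 0.2, 0.05, 1],
    "max_depth": [1, 2, 3],
}

# Grid search evaluates every combination
all_candidates = [dict(zip(grid, values)) for values in itertools.product(*grid.values())]

# Random search evaluates only n_iter sampled combinations
n_iter = 10
rng = random.Random(1)
random_candidates = rng.sample(all_candidates, n_iter)

cv_folds = 5
grid_fits = len(all_candidates) * cv_folds
random_fits = len(random_candidates) * cv_folds
print(f"grid search: {len(all_candidates)} candidates, {grid_fits} fits")
print(f"random search: {len(random_candidates)} candidates, {random_fits} fits")
```

With these settings random search performs a fraction of the fits, which explains the shorter run time, but it can also miss the best combination.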

We can also use the make_pipeline function instead of Pipeline to create a pipeline.

**`make_pipeline`: This is a shorthand for the Pipeline constructor; it does not require and does not permit, naming the estimators. Instead, their names will be set to the lowercase of their types automatically.**
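
The naming rule explains the parameter keys used in the grids, such as `adaboostclassifier__n_estimators`. The sketch below mimics the rule in plain Python with empty stand-in classes; `auto_step_names` is a hypothetical helper, not part of scikit-learn.

```python
# Empty stand-in classes so the example runs without scikit-learn
class StandardScaler:
    pass

class AdaBoostClassifier:
    pass

def auto_step_names(*estimators):
    """Mimic the naming rule: lowercase class name of each estimator."""
    return [type(est).__name__.lower() for est in estimators]

step_names = auto_step_names(StandardScaler(), AdaBoostClassifier())

# A step's hyperparameter is addressed as <step name>__<parameter name>
param_key = f"{step_names[1]}__n_estimators"
print(step_names, param_key)
```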

### Adaboost Using Grid Search

In [None]:
%%time
# Creating pipeline
pipe_ada_grid = make_pipeline(StandardScaler(), AdaBoostClassifier(random_state=1))

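# Parameter grid to pass in GridSearchCV
# Note: scikit-learn 1.2 renamed `base_estimator` to `estimator`; on newer versions use "adaboostclassifier__estimator"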
param_grid = {
    "adaboostclassifier__n_estimators": np.arange(10, 110, 10),
    "adaboostclassifier__learning_rate": [0.1, 0.01, 0.2, 0.05, 1],
    "adaboostclassifier__base_estimator": [
        DecisionTreeClassifier(max_depth=1, random_state=1),
        DecisionTreeClassifier(max_depth=2, random_state=1),
        DecisionTreeClassifier(max_depth=3, random_state=1),
    ],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

# Calling GridSearchCV
pipe_ada_grid = GridSearchCV(estimator=pipe_ada_grid, param_grid=param_grid, scoring=scorer, cv=5,n_jobs = -1)

# Fitting parameters in GridSearchCV on the training data
pipe_ada_grid.fit(X_train, y_train)
                              
print("Best parameters are {} with CV score={}:" .format(pipe_ada_grid.best_params_,pipe_ada_grid.best_score_))

In [None]:
# Creating new pipeline with best parameters
abc_tuned_grid = make_pipeline(
    StandardScaler(),AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=2, 
                                                                              random_state=1),
                                        learning_rate=1, n_estimators=70, random_state=1))

# Fit the model on training data
abc_tuned_grid.fit(X_train, y_train)

In [None]:
abc_tuned_score=get_metrics_score(abc_tuned_grid,' Adaboost with Grid Search',X_train,X_test,y_train,y_test)


In [None]:
feature_names = X_train.columns
importances = abc_tuned_grid[1].feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()

- The test recall has increased by ~6% compared to the cross-validated recall.
- The model generalizes well. Let's see if random search gives a different result.

### Adaboost Using Random Search

In [None]:
%%time

# Creating pipeline
pipe_ada_ran = make_pipeline(StandardScaler(), AdaBoostClassifier(random_state=1))

# Parameter grid to pass in RandomizedSearchCV
param_grid = {
    "adaboostclassifier__n_estimators": np.arange(10, 110, 10),
    "adaboostclassifier__learning_rate": [0.1, 0.01, 0.2, 0.05, 1],
    "adaboostclassifier__base_estimator": [
        DecisionTreeClassifier(max_depth=1, random_state=1),
        DecisionTreeClassifier(max_depth=2, random_state=1),
        DecisionTreeClassifier(max_depth=3, random_state=1),
    ],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

#Calling RandomizedSearchCV
abc_rand_cv = RandomizedSearchCV(estimator=pipe_ada_ran, param_distributions=param_grid, n_iter=10,n_jobs = -1, scoring=scorer, cv=5, random_state=1)

#Fitting parameters in RandomizedSearchCV
abc_rand_cv.fit(X_train,y_train)


print("Best parameters are {} with CV score={}:" .format(abc_rand_cv.best_params_,abc_rand_cv.best_score_))

In [None]:
# Creating new pipeline with best parameters
abc_tuned_rand = make_pipeline(
    StandardScaler(),AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=2,
                                                                              random_state=1),
                                        learning_rate=1, n_estimators=90, random_state=1))

# Fit the model on training data
abc_tuned_rand.fit(X_train, y_train)


In [None]:
abc_rand_tuned_score=get_metrics_score(abc_tuned_rand,' Adaboost with Random Search',X_train,X_test,y_train,y_test)


- Random search took less time, and the recall improved with random search: false negative cases have reduced.
- Grid search took significantly longer than random search.

In [None]:
feature_names = X_train.columns
importances = abc_tuned_rand[1].feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()

### GradientBoosting with Grid Search

In [None]:
%%time
# Creating pipeline
pipe_gb_grid = make_pipeline(StandardScaler(), GradientBoostingClassifier(random_state=1))

# Grid of parameters to choose from
param_grid = {'gradientboostingclassifier__n_estimators':[100,200],
              'gradientboostingclassifier__max_depth':[10,20],
              'gradientboostingclassifier__min_samples_leaf': [10,20],
              'gradientboostingclassifier__min_samples_split': [25,35]
              }
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

# Run the grid search
grid_cv = GridSearchCV(pipe_gb_grid, param_grid, scoring=scorer,cv=5,n_jobs = -1)

# Fitting parameters in GridSeachCV
pipe_gb_grid = grid_cv.fit(X_train, y_train)


print("Best parameters are {} with CV score={}:" .format(pipe_gb_grid.best_params_,grid_cv.best_score_))

In [None]:
# Creating new pipeline with best parameters
gb_tuned_grid = make_pipeline(
    StandardScaler(),GradientBoostingClassifier(max_depth=20,
                                            min_samples_leaf=20,
                                            min_samples_split=25,
                                            n_estimators=200, random_state=1
                                            ))

# Fit the model on training data
gb_tuned_grid.fit(X_train, y_train)

In [None]:
gb_tuned_score=get_metrics_score(gb_tuned_grid,' Gradient boosting with Grid Search',X_train,X_test,y_train,y_test)

In [None]:
feature_names = X_train.columns
importances = gb_tuned_grid[1].feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()

- Grid search gave a better recall than cross-validation.
- The model is overfitting. Let's see how randomized search performs.

### GradientBoosting with Random Search

In [None]:
%%time 
pipe_gb_rand = make_pipeline(StandardScaler(), GradientBoostingClassifier(random_state=1))

param_grid = {'gradientboostingclassifier__n_estimators':[100,200],
              'gradientboostingclassifier__max_depth':[10,20],
              'gradientboostingclassifier__min_samples_leaf': [10,20],
              'gradientboostingclassifier__min_samples_split': [25,35]
              }


scorer = metrics.make_scorer(metrics.recall_score)

#Calling RandomizedSearchCV
pipe_gb_rand = RandomizedSearchCV(estimator=pipe_gb_rand, param_distributions=param_grid,n_jobs = -1, n_iter=10, scoring=scorer, cv=5, random_state=1)

#Fitting parameters in RandomizedSearchCV
pipe_gb_rand.fit(X_train,y_train)

print("Best parameters are {} with CV score={}:" .format(pipe_gb_rand.best_params_,pipe_gb_rand.best_score_))

In [None]:
gb_tuned_rand = make_pipeline(
    StandardScaler(),GradientBoostingClassifier(max_depth=20, min_samples_leaf=20,
                                            min_samples_split=25,
                                            n_estimators=200,
                                            random_state=1))

# Fit the model on training data
gb_tuned_rand.fit(X_train, y_train)

In [None]:
gb_rand_tuned_score=get_metrics_score(gb_tuned_rand,' Gradient boosting with Random Search',X_train,X_test,y_train,y_test)

- Grid search took significantly longer than random search.
- Both models perform the same: they have the same recall and both are overfitting.

In [None]:
feature_names = X_train.columns
importances = gb_tuned_rand[1].feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()

### XGBclassifier with Grid Search

In [None]:
%%time 

#Creating pipeline
pipe_xgboost=make_pipeline(StandardScaler(), XGBClassifier(random_state=1,eval_metric='logloss'))

#Parameter grid to pass in GridSearchCV
param_grid={'xgbclassifier__n_estimators':np.arange(50,300,50),'xgbclassifier__scale_pos_weight':[2,10],
            'xgbclassifier__learning_rate':[0.01,0.1,0.2], 
            'xgbclassifier__subsample':[0.7,1]}


# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

#Calling GridSearchCV
Xgboost_grid_cv = GridSearchCV(estimator=pipe_xgboost, param_grid=param_grid, scoring=scorer, cv=5, n_jobs = -1)

#Fitting parameters in GridSeachCV
Xgboost_grid_cv.fit(X_train,y_train)


print("Best parameters are {} with CV score={}:" .format(Xgboost_grid_cv.best_params_,Xgboost_grid_cv.best_score_))

In [None]:
# Creating new pipeline with best parameters
xgb_tuned_grid = make_pipeline(
    StandardScaler(),
    XGBClassifier(
        random_state=1,
        n_estimators=150,
        scale_pos_weight=10,
        subsample=1,
        learning_rate=0.01,
        eval_metric='logloss',
    ),
)

# Fit the model on training data
xgb_tuned_grid.fit(X_train, y_train)

In [None]:
xgb_tuned_score_grid=get_metrics_score(xgb_tuned_grid,' XGboost with Grid Search',X_train,X_test,y_train,y_test)

- Recall has improved by ~9% after hyperparameter tuning with grid search.

### XGboost using Random Search

In [None]:
%%time 
#Creating pipeline

pipe_xgboost_ran=make_pipeline(StandardScaler(), XGBClassifier(random_state=1,eval_metric='logloss'))

#Parameter grid to pass in random
param_grid={'xgbclassifier__n_estimators':np.arange(50,300,50),'xgbclassifier__scale_pos_weight':[2,10],
            'xgbclassifier__learning_rate':[0.01,0.1,0.2], 
            'xgbclassifier__subsample':[0.7,1]}


# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=pipe_xgboost_ran, param_distributions=param_grid,n_jobs = -1, n_iter=10, scoring=scorer, cv=5, random_state=1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train,y_train)

print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))

In [None]:
# Creating new pipeline with best parameters
xgb_rand = make_pipeline(
    StandardScaler(),
    XGBClassifier(
        random_state=1,
        n_estimators=50,
        scale_pos_weight=10,
        subsample=0.7,
        learning_rate=0.01,
        eval_metric='logloss',
    ),
)

# Fit the model on training data
xgb_rand.fit(X_train, y_train)

In [None]:
randomized_cv_tuned_score=get_metrics_score(xgb_rand,'XG boosting with Random Search',X_train,X_test,y_train,y_test)


- Random search took less time, but its recall was not better than that of grid search.

### Comparing all models

In [None]:
comparison_frame = pd.DataFrame({'Model':model_name,
                                          'Train_Accuracy': acc_train,'Test_Accuracy': acc_test,
                                          'Train_Recall':recall_train,'Test_Recall':recall_test,
                                          'Train_Precision':precision_train,'Test_Precision':precision_test,
                                          'Train_F1':f1_train,
                                          'Test_F1':f1_test  }) 

#Sorting models in decreasing order of test recall
comparison_frame.sort_values(by='Test_Recall',ascending=False)

- Logistic Regression with oversampling performed very poorly on test data. The recall was only 0.48.
- The XGBoost model tuned using grid search gives the best test recall of 0.95, i.e. it correctly identifies the vast majority of customers who will attrite. Precision is very low for this model.
- Random search took less time than grid search, but that doesn't necessarily mean that its performance was better. Random search is faster because not all parameter values are tried out; instead, a fixed number of parameter settings, given by n_iter, is sampled from the specified distributions. Because the hyperparameter space is not searched exhaustively, random search doesn't guarantee finding the best set of hyperparameters. Grid search did slightly better in the case of XGBoost and AdaBoost.
- The performance can vary with the range of hyperparameters selected. With more hyperparameters, grid search will probably take more time.
- Let's see the feature importance from the tuned XGBoost model.

In [None]:

feature_names = X_train.columns
importances = xgb_tuned_grid[1].feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()

Total Transaction Count is the most important feature, followed by Total Revolving Balance and Total Transaction Amount.

[Top](#Table-of-Contents)

## Conclusion
- Randomized search takes less time and tries to choose the best parameters, but that doesn't necessarily mean it will perform well.
- Different hyperparameters can be tried to improve some of the models.
- XGBoost with grid search performed best.
- Total Transaction Count is the most important feature, followed by Total Revolving Balance and Total Transaction Amount.
- A lower transaction count, lower revolving balance and lower transaction amount are an indication that a customer will attrite.

<h2 style = "font-family:TimesNewRoman;color:black;font-weight:bold">Business Recommendations & Insights</h2>

* A lower transaction count on the credit card, a lower revolving balance and a lower transaction amount are an indication that a customer will attrite. A lower transaction count indicates that the customer is not using this credit card; the bank should offer more rewards, cashback or other offers to encourage the customer to use the credit card more.
* As per the EDA, customers who hold more products with the bank are less likely to attrite. The bank can offer more products to such customers, which will help retain them.
* Customers who have been inactive for a month show high chances of attrition. The bank should focus on such customers as well.
* Avg utilization ratio is lower amongst attrited customers.
* As per the EDA, customers in the age range 36-55, customers with a doctorate or postgraduate degree, and female customers attrited more. One of the reasons could be that a competing bank is offering them better deals, leading to less use of this bank's credit card.
* As per the EDA, customers who have had a high number of contacts with the bank in the last 12 months have attrited. The bank should investigate whether unresolved customer issues led to these customers leaving.







[Top](#Table-of-Contents)