Hello 🙌, welcome to my notebook. In this notebook we will explore churn data, then develop and evaluate a model to predict customer churn. Feel free to share any questions or suggestions. Thank you!

# Task & Description

Description:
- A manager at the bank is disturbed that more and more customers are leaving their credit card services. They would really appreciate it if someone could predict which customers are going to churn, so they can proactively approach those customers, provide better services, and turn their decisions around.

- The dataset consists of about 10,000 customers, with their age, salary, marital status, credit card limit, credit card category, etc. There are nearly 18 features. Only 16.07% of customers have churned, so it is a bit difficult to train a model to predict churning customers.

Task:
1. We are expecting a notebook having in-depth Exploratory Data Analysis that can help us visualize where the difference lies between churning and non-churning customers.
2. Improve Performance of predicting churned customers

# Import Data

In [None]:
import pandas as pd
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)

data1 = pd.read_csv('../input/credit-card-customers/BankChurners.csv')

In [None]:
data1.drop(data1.columns[[21,22]], axis=1, inplace=True)

As instructed, we drop the two 'not needed' columns (positions 21 and 22).
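
Dropping by position works only as long as the column order stays the same. The sketch below uses a toy frame with made-up column names (`extra_a`, `extra_b` are assumptions, not the real column names) to show the positional drop next to an equivalent drop by name.

```python
import pandas as pd

# Toy frame; the column names are made up for illustration.
toy = pd.DataFrame({
    "id": [1, 2],
    "age": [40, 51],
    "extra_a": [0.1, 0.9],
    "extra_b": [0.9, 0.1],
})

# Drop the last two columns by position, mirroring data1.columns[[21, 22]].
by_position = toy.drop(toy.columns[[2, 3]], axis=1)

# Equivalent drop by name prefix, which still works if the column order changes.
extra_cols = [c for c in toy.columns if c.startswith("extra_")]
by_name = toy.drop(columns=extra_cols)

print(by_position.columns.tolist())
print(by_name.columns.tolist())
```

Both calls return a new frame, so the original `toy` is left unchanged.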

In [None]:
data1.head()

In [None]:
data1.info()

In [None]:
data1.shape

In [None]:
data1.describe()

# Feature Definition

- CLIENTNUM: Client number. Unique identifier for the customer holding the account
- Attrition_Flag: Internal event (customer activity) variable - if the account is closed then 1 else 0
- Customer_Age: Demographic variable - Customer's Age in Years
- Gender: Demographic variable - M=Male, F=Female
- Dependent_count: Demographic variable - Number of dependents
- Education_Level: Demographic variable - Educational Qualification of the account holder (example: high school, college graduate, etc.)
- Marital_Status: Demographic variable - Married, Single, Divorced, Unknown
- Income_Category: Demographic variable - Annual Income Category of the account holder (< $40K, $40K - 60K, $60K - $80K, $80K-$120K, > $120K, Unknown)
- Card_Category: Product Variable - Type of Card (Blue, Silver, Gold, Platinum)
- Months_on_book: Period of relationship with bank
- Total_Relationship_Count: Total no. of products held by the customer
- Months_Inactive_12_mon: No. of months inactive in the last 12 months
- Contacts_Count_12_mon: No. of Contacts in the last 12 months
- Credit_Limit: Credit Limit on the Credit Card
- Total_Revolving_Bal: Total Revolving Balance on the Credit Card
- Avg_Open_To_Buy: Open to Buy Credit Line (Average of last 12 months)
- Total_Amt_Chng_Q4_Q1: Change in Transaction Amount (Q4 over Q1)
- Total_Trans_Amt: Total Transaction Amount (Last 12 months)
- Total_Ct_Chng_Q4_Q1: Change in Transaction Count (Q4 over Q1)
- Avg_Utilization_Ratio: Average Card Utilization Ratio

In [None]:
'''Nunique Columns'''

def nunique_counts(data):
   for i in data.columns:
       count = data[i].nunique()
       print(i, ": ", count)
nunique_counts(data1)

In [None]:
'''Unique Columns'''

def unique_counts(data):
    # Use the function argument instead of the global data1
    features = data.dtypes[data.dtypes == "object"].index.values.tolist()
    for i in features:
        count = data[i].unique()
        print(i, ": ", count, len(count))
        
unique_counts(data1)

In [None]:
'''Label Encoding With Label'''
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(data1['CLIENTNUM'])

# Map each original client number to its encoded integer (no hard-coded row count)
l = list(range(len(le.classes_)))
dict(zip(list(le.classes_), l))

data1['CLIENTNUM'] = le.transform(data1['CLIENTNUM'])

In [None]:
data1['CLIENTNUM'].unique()

In [None]:
'''Checking Duplicate'''

print('Duplicate entries: {}'.format(data1.duplicated().sum()))
# data1.drop_duplicates(inplace = True)

In [None]:
'''Missing Value Chart'''
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 5))
data1.isnull().mean(axis=0).plot.barh()
plt.title("Ratio of missing values per column")
print(data1.isnull().values.sum()) #total missing values

- We have 10,127 rows and 21 features, which I think is a small dataset for predicting churn. But we will try our best!
- I re-encoded the values of the CLIENTNUM feature as simple sequential integers
- We also don't have missing values here
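
As a small illustration of the re-encoding step, the sketch below runs `LabelEncoder` on a few hypothetical client numbers (the values are invented) and prints the resulting mapping. It relies on the fact that `classes_` stores the sorted unique values, so a value's position in `classes_` is its code.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical client numbers standing in for the CLIENTNUM column.
ids = pd.Series([768805383, 708082083, 818770008, 708082083])

le_demo = LabelEncoder()
codes = le_demo.fit_transform(ids)

# classes_ holds the sorted unique values; the position in classes_ is the code.
mapping = {int(k): i for i, k in enumerate(le_demo.classes_)}
print(mapping)
print(codes.tolist())
```

Repeated client numbers receive the same code, so the encoding keeps the column usable as an identifier.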

# Data Exploration

In [None]:
import plotly.offline as py 
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go 
import plotly.tools as tools
import warnings
from collections import Counter 
from plotly.subplots import make_subplots

fig = make_subplots(rows=1, 
                    cols=2, 
                    specs=[[{'type':'domain'}, {'type':'domain'}]],
                    subplot_titles=('Attrition Flag','Gender'))

# Based on Attrition_Flag

custom_aggregation = {}
custom_aggregation["CLIENTNUM"] = "count"
data2 = data1.groupby("Attrition_Flag").agg(custom_aggregation)
data2.columns = ["Number of Client"]
data2['Client Type'] = data2.index

labels = data2['Client Type'].tolist()
values = data2['Number of Client'].tolist()

fig.add_trace(go.Pie(
                    labels=labels, 
                    values=values, 
                    name="Client Type"),
                    1, 1)


# Based on Gender

custom_aggregation = {}
custom_aggregation["CLIENTNUM"] = "count"
data2 = data1.groupby("Gender").agg(custom_aggregation)
data2.columns = ["Number of Client"]
data2['Gender'] = data2.index

labels = data2['Gender'].tolist()
values = data2['Number of Client'].tolist()

fig.add_trace(go.Pie(
                    labels=labels, 
                    values=values, 
                    name="Gender"),
                    1, 2)

fig.update_traces(textposition='inside')
fig.update_layout(uniformtext_minsize=12, uniformtext_mode='hide')

fig['layout'].update(height=500, width=900, title='Number of Client based on:')
fig.show()

In [None]:
fig = make_subplots(rows=1, 
                    cols=2, 
                    specs=[[{'type':'domain'}, {'type':'domain'}]],
                    subplot_titles=('Educational Level','Marital Status'))

# Based on Education_Level

custom_aggregation = {}
custom_aggregation["CLIENTNUM"] = "count"
data2 = data1.groupby("Education_Level").agg(custom_aggregation)
data2.columns = ["Number of Client"]
data2['Education Level'] = data2.index

labels = data2['Education Level'].tolist()
values = data2['Number of Client'].tolist()

fig.add_trace(go.Pie(
                    labels=labels, 
                    values=values, 
                    name="Education Level"),
                    1, 1)


# Based on Marital_Status

custom_aggregation = {}
custom_aggregation["CLIENTNUM"] = "count"
data2 = data1.groupby("Marital_Status").agg(custom_aggregation)
data2.columns = ["Number of Client"]
data2['Marital Status'] = data2.index

labels = data2['Marital Status'].tolist()
values = data2['Number of Client'].tolist()

fig.add_trace(go.Pie(
                    labels=labels, 
                    values=values, 
                    name="Marital Status"),
                    1, 2)

fig.update_traces(textposition='inside')
fig.update_layout(uniformtext_minsize=12, uniformtext_mode='hide')

fig['layout'].update(height=500, width=900, title='Number of Client based on:')
fig.show()

In [None]:
fig = make_subplots(rows=1, 
                    cols=2, 
                    specs=[[{'type':'domain'}, {'type':'domain'}]],
                    subplot_titles=('Income Category','Card Category'))

# Based on Income_Category

custom_aggregation = {}
custom_aggregation["CLIENTNUM"] = "count"
data2 = data1.groupby("Income_Category").agg(custom_aggregation)
data2.columns = ["Number of Client"]
data2['Income Category'] = data2.index

labels = data2['Income Category'].tolist()
values = data2['Number of Client'].tolist()

fig.add_trace(go.Pie(
                    labels=labels, 
                    values=values, 
                    name="Income Category"),
                    1, 1)


# Based on Card_Category

custom_aggregation = {}
custom_aggregation["CLIENTNUM"] = "count"
data2 = data1.groupby("Card_Category").agg(custom_aggregation)
data2.columns = ["Number of Client"]
data2['Card Category'] = data2.index

labels = data2['Card Category'].tolist()
values = data2['Number of Client'].tolist()

fig.add_trace(go.Pie(
                    labels=labels, 
                    values=values, 
                    name="Card Category"),
                    1, 2)

fig.update_traces(textposition='inside')
fig.update_layout(uniformtext_minsize=12, uniformtext_mode='hide')

fig['layout'].update(height=500, width=900, title='Number of Client based on:')
fig.show()

- These pie charts show the share of each category within each feature
- In this dataset, the largest group in each feature is:

    1. Existing customer: 83.9%
    2. Female: 52.9%
    3. Graduate: 30.9%
    4. Married: 46.3%
    5. Income less than $40K: 35.2%
    6. Blue card category: 93.2%
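
Shares like these can also be read off without a chart. The following is a minimal sketch on a toy column (the values and counts are invented), using `value_counts(normalize=True)` to get fractions and converting them to percentages.

```python
import pandas as pd

# Toy column; the values and counts are invented for illustration.
toy = pd.DataFrame({"Card_Category": ["Blue"] * 8 + ["Silver"] + ["Gold"]})

# normalize=True returns fractions instead of counts; multiply by 100 for percent.
share = toy["Card_Category"].value_counts(normalize=True).mul(100).round(1)
print(share)

# The largest group is the index label with the highest share.
print(share.idxmax())
```

Applied to `data1`, the same call on each categorical column would give the numbers that the pie charts display.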

In [None]:
import plotly.express as px

fig = px.box(data1, x="Attrition_Flag", y="Customer_Age", color="Attrition_Flag", boxmode="overlay")

fig['layout'].update(height=500, width=750, title='Customer Age Based on Attrition Flag')
fig.update_traces(quartilemethod="inclusive")
fig.show()

In [None]:
fig = px.box(data1, x="Attrition_Flag", y="Dependent_count", color="Attrition_Flag", boxmode="overlay")

fig['layout'].update(height=500, width=750, title='Dependent Count Based on Attrition Flag')
fig.update_traces(quartilemethod="inclusive")
fig.show()

- For age, there is no notable difference between attrited and existing customers
- Attrited customers tend to have a higher dependent count than existing customers

In [None]:
fig = make_subplots(rows=1, 
                    cols=1)

data_ex = data1.loc[data1['Attrition_Flag'] == 'Existing Customer']
data_at = data1.loc[data1['Attrition_Flag'] == 'Attrited Customer']

# Take x and y from the same subset so each point pairs values from one customer
x1 = data_ex['Total_Ct_Chng_Q4_Q1'].tolist()
x2 = data_at['Total_Ct_Chng_Q4_Q1'].tolist()
y1 = data_ex['Total_Amt_Chng_Q4_Q1'].tolist()
y2 = data_at['Total_Amt_Chng_Q4_Q1'].tolist()

# marker (not line) sets the colour when mode="markers"
fig.add_trace(go.Scatter(x=x1, y=y1, name='Existing Customer', marker=dict(color='darkgreen'), mode="markers"), 1, 1)
fig.add_trace(go.Scatter(x=x2, y=y2, name='Attrited Customer', marker=dict(color='pink'), mode="markers"), 1, 1)

fig['layout'].update(height=500, 
                     width=900, 
                     title='Scatter Plot Count Change x Total Amount Change',
                     xaxis_title="Count Change",
                     yaxis_title="Amount Change")
fig.show()

In [None]:
fig = make_subplots(rows=1, 
                    cols=1)

data_ex = data1.loc[data1['Attrition_Flag'] == 'Existing Customer']
data_at = data1.loc[data1['Attrition_Flag'] == 'Attrited Customer']

# Take x and y from the same subset so each point pairs values from one customer
x1 = data_ex['Total_Amt_Chng_Q4_Q1'].tolist()
x2 = data_at['Total_Amt_Chng_Q4_Q1'].tolist()
y1 = data_ex['Total_Trans_Amt'].tolist()
y2 = data_at['Total_Trans_Amt'].tolist()

# marker (not line) sets the colour when mode="markers"
fig.add_trace(go.Scatter(x=x1, y=y1, name='Existing Customer', marker=dict(color='darkgreen'), mode="markers"), 1, 1)
fig.add_trace(go.Scatter(x=x2, y=y2, name='Attrited Customer', marker=dict(color='pink'), mode="markers"), 1, 1)

fig['layout'].update(height=500, 
                     width=900, 
                     title='Scatter Plot Amount Change x Total Transaction',
                     xaxis_title="Amount Change",
                     yaxis_title="Total Transaction")
fig.show()

In [None]:
fig = make_subplots(rows=1, 
                    cols=1)

data_ex = data1.loc[data1['Attrition_Flag'] == 'Existing Customer']
data_at = data1.loc[data1['Attrition_Flag'] == 'Attrited Customer']

# Take x and y from the same subset so each point pairs values from one customer
x1 = data_ex['Total_Trans_Amt'].tolist()
x2 = data_at['Total_Trans_Amt'].tolist()
y1 = data_ex['Avg_Open_To_Buy'].tolist()
y2 = data_at['Avg_Open_To_Buy'].tolist()

# marker (not line) sets the colour when mode="markers"
fig.add_trace(go.Scatter(x=x1, y=y1, name='Existing Customer', marker=dict(color='darkgreen'), mode="markers"), 1, 1)
fig.add_trace(go.Scatter(x=x2, y=y2, name='Attrited Customer', marker=dict(color='pink'), mode="markers"), 1, 1)

fig['layout'].update(height=500, 
                     width=900, 
                     title='Scatter Plot Total Transaction x Avg. Open to Buy',
                     xaxis_title="Total Transaction",
                     yaxis_title="Avg. Open to Buy")
fig.show()

In [None]:
fig = make_subplots(rows=1, 
                    cols=1)

data_ex = data1.loc[data1['Attrition_Flag'] == 'Existing Customer']
data_at = data1.loc[data1['Attrition_Flag'] == 'Attrited Customer']

# Take x and y from the same subset so each point pairs values from one customer
x1 = data_ex['Total_Trans_Amt'].tolist()
x2 = data_at['Total_Trans_Amt'].tolist()
y1 = data_ex['Total_Revolving_Bal'].tolist()
y2 = data_at['Total_Revolving_Bal'].tolist()

# marker (not line) sets the colour when mode="markers"
fig.add_trace(go.Scatter(x=x1, y=y1, name='Existing Customer', marker=dict(color='darkgreen'), mode="markers"), 1, 1)
fig.add_trace(go.Scatter(x=x2, y=y2, name='Attrited Customer', marker=dict(color='pink'), mode="markers"), 1, 1)

fig['layout'].update(height=500, 
                     width=900, 
                     title='Scatter Plot Total Transaction x Revolving Balance',
                     xaxis_title="Total Transaction",
                     yaxis_title="Revolving Balance")
fig.show()

- Count change x amount change: attrited customers tend to have a lower amount-change distribution than existing customers
- Amount change x total transaction: attrited customers tend to have a lower total-transaction distribution than existing customers

In [None]:
custom_aggregation = {}
custom_aggregation["Months_on_book"] = "mean"
custom_aggregation["Total_Relationship_Count"] = "mean"
custom_aggregation["Months_Inactive_12_mon"] = "mean"
custom_aggregation["Contacts_Count_12_mon"] = "mean"
custom_aggregation["Credit_Limit"] = "mean"
custom_aggregation["Total_Revolving_Bal"] = "mean"

data2 = data1.groupby("Attrition_Flag").agg(custom_aggregation)
data2['Customer'] = data2.index

d1 = go.Bar(
    x = data2.Customer.value_counts().index.sort_values(),
    y = data2["Months_on_book"],
    name='Months on Book')

d2 = go.Bar(
    x = data2.Customer.value_counts().index.sort_values(),
    y = data2["Total_Relationship_Count"],
    name='Total Relationship')

d3 = go.Bar(
    x = data2.Customer.value_counts().index.sort_values(),
    y = data2["Months_Inactive_12_mon"],
    name='Months Inactive')

d4 = go.Bar(
    x = data2.Customer.value_counts().index.sort_values(),
    y = data2["Contacts_Count_12_mon"],
    name='Contact Count')

d5 = go.Bar(
    x = data2.Customer.value_counts().index.sort_values(),
    y = data2["Credit_Limit"],
    name='Credit Limit')

d6 = go.Bar(
    x = data2.Customer.value_counts().index.sort_values(),
    y = data2["Total_Revolving_Bal"],
    name='Revolving Balance')



data = [d1,d2,d3,d4,d5,d6]

# plotly.tools.make_subplots is deprecated; use make_subplots from plotly.subplots (imported above)
fig = make_subplots(rows=3,
                    cols=2,
                    #specs=[[{}, {}], [{'colspan': 1}, None]],
                    subplot_titles=('Months on Book',
                                    'Total Relationship',
                                    'Months Inactive',
                                    'Contact Count',
                                    'Credit Limit',
                                    'Revolving Balance'))

fig.append_trace(d1, 1, 1)
fig.append_trace(d2, 1, 2)
fig.append_trace(d3, 2, 1)
fig.append_trace(d4, 2, 2)
fig.append_trace(d5, 3, 1)
fig.append_trace(d6, 3, 2)

fig['layout'].update(height=1000, width=900, title='Attrited vs Existing Customer', boxmode='group')
py.iplot(fig, filename='combined-savings')

In [None]:
custom_aggregation = {}
custom_aggregation["Avg_Open_To_Buy"] = "mean"
custom_aggregation["Total_Amt_Chng_Q4_Q1"] = "mean"
custom_aggregation["Total_Trans_Amt"] = "mean"
custom_aggregation["Total_Trans_Ct"] = "mean"
custom_aggregation["Total_Ct_Chng_Q4_Q1"] = "mean"
custom_aggregation["Avg_Utilization_Ratio"] = "mean"

data2 = data1.groupby("Attrition_Flag").agg(custom_aggregation)
data2['Customer'] = data2.index

d1 = go.Bar(
    x = data2.Customer.value_counts().index.sort_values(),
    y = data2["Avg_Open_To_Buy"],
    name='Avg. Open to Buy')

d2 = go.Bar(
    x = data2.Customer.value_counts().index.sort_values(),
    y = data2["Total_Amt_Chng_Q4_Q1"],
    name='Amount Change')

d3 = go.Bar(
    x = data2.Customer.value_counts().index.sort_values(),
    y = data2["Total_Trans_Amt"],
    name='Transaction Amount')

d4 = go.Bar(
    x = data2.Customer.value_counts().index.sort_values(),
    y = data2["Total_Trans_Ct"],
    name='Transaction Count')

d5 = go.Bar(
    x = data2.Customer.value_counts().index.sort_values(),
    y = data2["Total_Ct_Chng_Q4_Q1"],
    name='Count Change')

d6 = go.Bar(
    x = data2.Customer.value_counts().index.sort_values(),
    y = data2["Avg_Utilization_Ratio"],
    name='Avg. Utilization Ratio')


data = [d1,d2,d3,d4,d5,d6]

# plotly.tools.make_subplots is deprecated; use make_subplots from plotly.subplots (imported above)
fig = make_subplots(rows=3,
                    cols=2,
                    #specs=[[{}, {}], [{'colspan': 1}, None]],
                    subplot_titles=('Avg. Open to Buy',
                                    'Amount Change',
                                    'Transaction Amount',
                                    'Transaction Count',
                                    'Count Change',
                                    'Avg. Utilization Ratio'))

fig.append_trace(d1, 1, 1)
fig.append_trace(d2, 1, 2)
fig.append_trace(d3, 2, 1)
fig.append_trace(d4, 2, 2)
fig.append_trace(d5, 3, 1)
fig.append_trace(d6, 3, 2)

fig['layout'].update(height=1000, width=900, title='Attrited vs Existing Customer', boxmode='group')
py.iplot(fig, filename='combined-savings')

- I made an overall chart for each feature that contrasts attrited and existing customers
- For the values, I used the mean
- The features with a notable difference are: Revolving Balance, Transaction Amount, Transaction Count, Count Change and Avg. Utilization Ratio
- Each of these features needs further inspection, because I think they may correlate strongly with our target feature in the modelling step later
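
One way to rank features by how far apart the two groups are is to compare the group means numerically rather than by eye. The sketch below is only an illustration on invented toy numbers; the relative-gap measure is an assumption of this example, not something computed in the notebook.

```python
import pandas as pd

# Toy data; the numbers are invented and only illustrate the calculation.
toy = pd.DataFrame({
    "Attrition_Flag": ["Existing Customer", "Existing Customer",
                       "Attrited Customer", "Attrited Customer"],
    "Total_Trans_Ct": [70, 66, 44, 46],
    "Customer_Age": [46, 47, 46, 47],
})

means = toy.groupby("Attrition_Flag").mean(numeric_only=True)

# Relative gap between the group means, as a fraction of the existing-customer mean.
existing = means.loc["Existing Customer"]
attrited = means.loc["Attrited Customer"]
rel_gap = (attrited - existing) / existing
print(rel_gap.sort_values())
```

Features whose gap is close to zero separate the groups poorly on average, while large negative or positive gaps mark candidates for closer inspection.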

# Pre Processing

In [None]:
data1['Attrition_Flag'].unique()

In [None]:
# Map the target explicitly (avoids chained assignment) so that churn is the positive class
data1['Attrition_Flag'] = data1['Attrition_Flag'].map({'Existing Customer': 0, 'Attrited Customer': 1})

In [None]:
from sklearn.preprocessing import LabelEncoder

le.fit(data1['Gender'])

data1['Gender'] = le.transform(data1['Gender'])

l = [i for i in range(2)]
dict(zip(list(le.classes_), l))

In [None]:
le.fit(data1['Education_Level'])

data1['Education_Level'] = le.transform(data1['Education_Level'])

l = [i for i in range(7)]
dict(zip(list(le.classes_), l))

In [None]:
le.fit(data1['Marital_Status'])

data1['Marital_Status'] = le.transform(data1['Marital_Status'])

l = [i for i in range(4)]
dict(zip(list(le.classes_), l))

In [None]:
le.fit(data1['Income_Category'])

data1['Income_Category'] = le.transform(data1['Income_Category'])

l = [i for i in range(6)]
dict(zip(list(le.classes_), l))

In [None]:
le.fit(data1['Card_Category'])

data1['Card_Category'] = le.transform(data1['Card_Category'])

l = [i for i in range(4)]
dict(zip(list(le.classes_), l))

In [None]:
le.fit(data1['Attrition_Flag'])

data1['Attrition_Flag'] = le.transform(data1['Attrition_Flag'])

l = [i for i in range(2)]
dict(zip(list(le.classes_), l))

In [None]:
data1.groupby('Attrition_Flag').size()

In [None]:
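# Shuffle first so the existing-customer subset below is a random sample
data1 = data1.sample(frac=1, random_state=42)

# Keep every attrited customer (label 1) and an equal number of existing customers (label 0)
attr = data1.loc[data1['Attrition_Flag'] == 1]
exis = data1.loc[data1['Attrition_Flag'] == 0][:len(attr)]

balanced_df = pd.concat([attr, exis])

# Shuffle dataframe rows
data2 = balanced_df.sample(frac=1, random_state=42)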

data2.head()

In [None]:
data2.groupby('Attrition_Flag').size()

In [None]:
data2.info()

- I encoded all object features as numbers using LabelEncoder, but I mapped the target feature manually: attrited is labelled 1 and existing is labelled 0
- This makes churn the positive class, so the precision and recall reported later refer to churned customers
- Note that our data contains little information about churned customers: about 1,627 of them, compared with 8,500 existing customers
- In other words, we are dealing with an imbalanced dataset
- To overcome this problem, we need to resample
- I used random undersampling: all churned customers are kept, together with an equal-sized random subset of existing customers
- After resampling, both classes in the target feature have the same size
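
The resampling idea can be sketched on a toy frame as follows. The data and the column `x` are invented, and `DataFrame.sample` is used here to pick the majority rows at random, which is one possible way to do it.

```python
import pandas as pd

# Toy imbalanced frame: 8 existing (0) and 2 attrited (1) customers.
toy = pd.DataFrame({"Attrition_Flag": [0] * 8 + [1] * 2, "x": range(10)})

minority = toy[toy["Attrition_Flag"] == 1]
# Randomly keep as many majority rows as there are minority rows.
majority = toy[toy["Attrition_Flag"] == 0].sample(n=len(minority), random_state=42)

# Concatenate and shuffle so the classes are not grouped together.
balanced = pd.concat([minority, majority]).sample(frac=1, random_state=42)
print(balanced["Attrition_Flag"].value_counts())
```

The trade-off is that the discarded majority rows are no longer available for training, which matters more on a small dataset like this one.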

# Modelling

In [None]:
fig = make_subplots(rows=1, 
                    cols=1)
cor = data2.corr()
cor_ = cor.index
cor__ = cor.values

fig.add_trace(go.Heatmap(
                    x=cor_,
                    y=cor_,
                    z=cor__,
                    name='Correlation',
                    showscale=False,
                    xgap=0.7,
                    ygap=0.7), 1, 1)


fig['layout'].update(height=600, 
                     width=600, 
                     title='Heat Map',
                     xaxis_title=" ",
                     yaxis_title=" ")
fig.show()

In [None]:
data2.corr()['Attrition_Flag'].sort_values(ascending=False)

In [None]:
data2.corr()['Attrition_Flag'].drop('Attrition_Flag').sort_values(ascending=False)[:9].index.tolist()

In [None]:
from sklearn.model_selection import train_test_split

train_data,test_data = train_test_split(data2,train_size = 0.7,random_state=3)

In [None]:
train_data.shape, test_data.shape

In [None]:
'''Pre Model Selection'''
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.model_selection import KFold

models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('RF', RandomForestClassifier()))
models.append(('XGB', XGBClassifier(use_label_encoder=False)))

results = []
names = []
scoring = 'precision'

features = data2.drop('Attrition_Flag', axis=1).columns.values.tolist()

for name, model in models:
        kfold = KFold(n_splits=10, random_state=None)
        cv_results = cross_val_score(model, train_data[features], train_data['Attrition_Flag'], cv=kfold, scoring=scoring)
        results.append(cv_results)
        names.append(name)
        msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
        print(msg)
        
fig = plt.figure(figsize=(11,6))
fig.suptitle('Algorithm Comparison by Precision')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()

In [None]:
results = []
names = []
scoring = 'recall'

for name, model in models:
        kfold = KFold(n_splits=10, random_state=None)
        cv_results = cross_val_score(model, train_data[features], train_data['Attrition_Flag'], cv=kfold, scoring=scoring)
        results.append(cv_results)
        names.append(name)
        msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
        print(msg)
        
fig = plt.figure(figsize=(11,6))
fig.suptitle('Algorithm Comparison by Recall')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()

In [None]:
evaluation = pd.DataFrame({'Model': [],
                           'Details':[],
                           'Accuracy':[],
                          'CVS':[],
                          'Recall':[],
                          'Precision':[],
                          'F1':[]})

In [None]:
'''RANDOM FOREST CLASSIFIER'''
from sklearn.metrics import accuracy_score 
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import roc_auc_score
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.metrics import f1_score

rfc = RandomForestClassifier(min_samples_leaf = 3, 
                               min_samples_split=7, 
                               n_estimators = 75,
                               max_samples=0.5)

estimator = rfc.fit(train_data[features], train_data['Attrition_Flag'])
predict = rfc.predict(data1[features])                                                                                
acc = (accuracy_score(data1['Attrition_Flag'], predict)*100)
cvs = (cross_val_score(rfc, train_data[features], train_data['Attrition_Flag'], cv=5).mean())*100
cvp = cross_val_predict(rfc, train_data[features], train_data['Attrition_Flag'], cv=5)
recall = recall_score(data1['Attrition_Flag'], predict)*100
precision = precision_score(data1['Attrition_Flag'], predict)*100
f1 = f1_score(data1['Attrition_Flag'], predict)*100     
                                                                                                                                                                                                                                             
r = evaluation.shape[0]
evaluation.loc[r] = ['Random Forest Classifier','Select Feature',acc, cvs,recall, precision, f1]
evaluation.sort_values(by = 'Accuracy', ascending=False)
evaluation

In [None]:
'''DECISION TREE CLASSIFIER'''

dtc = DecisionTreeClassifier()

estimator = dtc.fit(train_data[features], train_data['Attrition_Flag'])
predict = dtc.predict(data1[features])                                                                                
acc = (accuracy_score(data1['Attrition_Flag'], predict)*100)
cvs = (cross_val_score(dtc, train_data[features], train_data['Attrition_Flag'], cv=5).mean())*100
cvp = cross_val_predict(dtc, train_data[features], train_data['Attrition_Flag'], cv=5)
recall = recall_score(data1['Attrition_Flag'], predict)*100
precision = precision_score(data1['Attrition_Flag'], predict)*100
f1 = f1_score(data1['Attrition_Flag'], predict)*100     
                                                                                                                                                                                                                                             
r = evaluation.shape[0]
evaluation.loc[r] = ['Decision Tree Classifier','Select Feature',acc, cvs,recall, precision, f1]
evaluation.sort_values(by = 'Accuracy', ascending=False)
evaluation

In [None]:
'''XGB CLASSIFIER'''

xgb = XGBClassifier(use_label_encoder=False,
                   max_depth = 25)

estimator = xgb.fit(train_data[features], train_data['Attrition_Flag'])
predict = xgb.predict(data1[features])                                                                                
acc = (accuracy_score(data1['Attrition_Flag'], predict)*100)
cvs = (cross_val_score(xgb, train_data[features], train_data['Attrition_Flag'], cv=5).mean())*100
cvp = cross_val_predict(xgb, train_data[features], train_data['Attrition_Flag'], cv=5)
recall = recall_score(data1['Attrition_Flag'], predict)*100
precision = precision_score(data1['Attrition_Flag'], predict)*100
f1 = f1_score(data1['Attrition_Flag'], predict)*100     
                                                                                                                                                                                                                                             
r = evaluation.shape[0]
evaluation.loc[r] = ['XGB Classifier','Select Feature',acc, cvs,recall, precision, f1]
evaluation.sort_values(by = 'Accuracy', ascending=False)
evaluation

In [None]:
import numpy as np
import seaborn as sns

def plot_feature_importance(importance,names,model_type):
    feature_importance = np.array(importance)
    feature_names = np.array(names)

    data={'feature_names':feature_names,'feature_importance':feature_importance}
    fi_df = pd.DataFrame(data)

    fi_df.sort_values(by=['feature_importance'], ascending=False,inplace=True)

    plt.figure(figsize=(10,8))

    sns.barplot(x=fi_df['feature_importance'], y=fi_df['feature_names'])

    plt.title(model_type + 'FEATURE IMPORTANCE')
    plt.xlabel('FEATURE IMPORTANCE')
    plt.ylabel('FEATURE NAMES')

In [None]:
plot_feature_importance(rfc.feature_importances_,features,'RANDOM FOREST CLASSIFIER ')

In [None]:
plot_feature_importance(xgb.feature_importances_,features,'XGB CLASSIFIER ')

In [None]:
plot_feature_importance(dtc.feature_importances_,features,'DTC CLASSIFIER ')

In [None]:
'''ROC Curve'''
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import roc_curve

# Use predicted probabilities of the positive class so the ROC curve covers many thresholds
model_pred = cross_val_predict(xgb, train_data[features], train_data['Attrition_Flag'], cv=5, method='predict_proba')[:, 1]

model_fpr, model_tpr, model_threshold = roc_curve(train_data['Attrition_Flag'], model_pred)

def graph_roc_curve(model_fpr, model_tpr):
    plt.figure(figsize=(10,6))
    plt.title('ROC Curve', fontsize=16)
    plt.plot(model_fpr, model_tpr, 'b-', linewidth=2, label='Model Score: {:.4f}'.format(roc_auc_score(train_data['Attrition_Flag'], model_pred)))
    plt.plot([0, 1], [0, 1], 'r--')
    plt.xlabel('False Positive Rate', fontsize=16)
    plt.ylabel('True Positive Rate', fontsize=16)
    plt.axis([-0.01,1,0,1])
    plt.legend()
    
graph_roc_curve(model_fpr, model_tpr)
plt.show()

In [None]:
'''Learning Curve'''
from sklearn.model_selection import learning_curve

def plot_learning_curve(estimator, X, y, 
                        ylim=None, 
                        cv=None, 
                        n_jobs=1, 
                        train_sizes=np.linspace(.1, 1.0, 5)):
    
    plt.figure(figsize=(17,11))
    train_sizes, train_scores, test_scores = learning_curve(estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std, train_scores_mean + train_scores_std, alpha=0.1, color="#ff9124")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std, test_scores_mean + test_scores_std, alpha=0.1, color="#2492ff")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="#ff9124", label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="#2492ff", label="Cross-validation score")
    plt.title("Model Learning Curve", fontsize=14)
    plt.xlabel('Training size (m)')
    plt.ylabel('Score')
    plt.grid(True)
    plt.legend(loc="best")
    
    return plt

In [None]:
from sklearn.model_selection import ShuffleSplit

cv = ShuffleSplit(n_splits=100, test_size=0.2, random_state=42)

plot_learning_curve(rfc,
                    train_data[features], 
                    train_data['Attrition_Flag'], 
                    (0.87, 1.01), cv=cv, n_jobs=4)

In [None]:
'''Confusion Matrix'''
from sklearn.metrics import confusion_matrix

# `predict` holds the XGB predictions from the XGB Classifier cell
xgb_matrix = confusion_matrix(data1['Attrition_Flag'], predict)

plt.figure(figsize=(15,8))
sns.heatmap(xgb_matrix,annot=True, fmt="d", cbar=False, cmap="Pastel2")
plt.title("XGB Confusion Matrix", weight='bold')
plt.xlabel('Predicted Labels')
plt.ylabel('Actual Labels')

In [None]:
'''MODEL SCORE'''
from sklearn.metrics import classification_report

# `predict` was last assigned by the XGB cell, so label the report accordingly
print("XGB Classifier", classification_report(data1['Attrition_Flag'], predict))

In [None]:
# Use the selected XGB model for the final predictions
data1['Predict'] = xgb.predict(data1[features])

In [None]:
data1[['Attrition_Flag', 'Predict']].head(10)

# Result & Conclusion

- I split the data into train and test sets at 70:30
- After splitting, the shapes of the train and test sets are (2277, 21) and (977, 21)
- For the features, I looked at which ones correlate positively with our target feature, although the models are fitted on all columns except the target
- Before modelling, I compared candidate models by cross-validated precision and recall
- I decided to model with Random Forest Classifier, Decision Tree Classifier and XGBoost Classifier
- For testing I used the full dataset rather than only the test split, to widen the test set; this includes the rows the models were trained on
- After developing the models, XGBoost has the highest recall score, so we select it for predicting churned customers
- I then looked for the most important features in our model, which are:

    1. Total Transaction Count
    2. Total Revolving Balance
    3. Total Relationship Count


- I plotted the ROC curve and the learning curve to evaluate our model
- Both the ROC and learning curve plots appear to show good results for our model
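
As a complementary check to the evaluation described above, the sketch below scores a model only on the held-out 30% split. It runs on synthetic data from `make_classification` (a stand-in, not the churn dataset), and the model settings are illustrative assumptions rather than the tuned values from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the balanced churn data (not the real dataset).
X, y = make_classification(n_samples=600, n_features=10, weights=[0.5, 0.5], random_state=0)

# 70:30 split; stratify keeps the class ratio the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, stratify=y, random_state=3)

clf = RandomForestClassifier(n_estimators=75, random_state=0)
clf.fit(X_train, y_train)

# Score only on rows the model has not seen during training.
y_pred = clf.predict(X_test)
print("recall:", recall_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
```

Scores measured this way are usually lower than scores that include training rows, and they are the better guide to how the model behaves on new customers.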

Evaluation: in the confusion matrix we see that our model predicts:

- 1610 items predicted as Churn, which is correct (***True Positive***)
- 17 items predicted as Not Churn, which is incorrect (***False Negative***)

     ***This gives a 98.95% Recall Score ( TP / ( TP + FN ) )***


- 8053 items predicted as Not Churn, which is correct (***True Negative***)
- 447 items predicted as Churn, which is incorrect (***False Positive***)

     ***This gives a 95.41% Accuracy Score ( ( TP + TN ) / ( TP + FN + TN + FP ) )***
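
The arithmetic behind these scores can be reproduced directly from the four confusion-matrix counts. The short sketch below also derives precision and F1 from the same counts, using the standard formulas.

```python
# Counts read from the confusion matrix reported above.
tp, fn, tn, fp = 1610, 17, 8053, 447

recall = tp / (tp + fn)
precision = tp / (tp + fp)
accuracy = (tp + tn) / (tp + fn + tn + fp)
f1 = 2 * precision * recall / (precision + recall)

print(f"recall={recall:.4f} precision={precision:.4f} "
      f"accuracy={accuracy:.4f} f1={f1:.4f}")
```

Precision is noticeably lower than recall here because the 447 false positives are large relative to the 1610 true positives.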
    
    

That's all. Don't forget to upvote. Thank you! :)