In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Who is gonna churn us?

A bank offering credit card subscriptions has noticed that some customers are cancelling their subscription (churn). The bank wants us to explain how churn can be predicted, and to identify the type of customer who is likely to cancel. With this information, the bank can approach these customers before they churn and offer them new services to encourage them to stay.

How are we gonna do that? With the data the bank provides on the credit card activity of 10127 customers. Of course, the customers' identities have been anonymized.

Based on the data and our analysis, we will extract insights that help predict which customers may churn in the future.

The work is split into different parts:

[Table of contents:](#ToC)

[Importing all the necessary libraries](#0thPart)

[1. Data description and our goal](#1stPart)

[2. Data exploration for first level insights](#2ndPart)

[3. Full features Machine learning model vs Model with first level insights](#3rdPart)

[4. Conclusion](#4thPart)

> Why do we care about churning?
* `reducing customer churn by 5% can increase profits 25–125%`;
* it is estimated that `It costs 5 times more to acquire new customers than it does to keep the current ones`;

These two points underline that it is crucial for a company to keep as many loyal customers as possible, and to detect the possible causes of attrition and the clients prone to churn.

For this reason, *interpretability* of the machine learning results is the main focus of this notebook. An outstanding machine learning pipeline is useless if no business intuition or decision can be based on its results.

**Sources:**
- [What is Customer Churn & How to Reduce It?](https://medium.com/@paldesk/what-is-customer-churn-how-to-reduce-it-402460e5b569)
- [Retain more customers by understanding churn](https://medium.com/data-science-at-microsoft/retain-more-customers-by-understanding-churn-ae31d9b2aa2b)



<a id='0thPart'></a>
## Importing all the necessary libraries:


In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import OrdinalEncoder

<a id='1stPart'></a>
# 1. Data description and our goal

In [None]:
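# First, let's load the dataset and take a look at it:

df = pd.read_csv("../input/credit-card-customers/BankChurners.csv")
cols = list(df.columns)
df_copy = df.copy()  # independent copy of the raw data (plain assignment would only alias df)
# Drop the last two columns, which are not used in this analysis
df = df.drop(columns=cols[-2:])
df.head(30)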

## Dataset description:
<a style="font-size:12px;" href="#ToC">Back to Table of Contents</a>

<span style="font-size:16px;color:Green"><b>INTRODUCTION</b></span>

The purpose of this section is to become acquainted with the dataset. Two main needs have to be taken into account:
1. understand what kind of variables are in the dataset and what they mean. Functional knowledge allows us to create new variables and to get insights from our machine learning output;
2. get insights that could lead to better results, such as which variables are related to the target variable or to each other. This is the stepping stone to a good model, since it allows us to select the right number of variables (avoiding useless or correlated features), in agreement with Occam's razor;  

<span style="font-size:16px;color:Green"><b>RESULTS:</b></span>

In the dataset we can find three main classes of features:
* **demographic features:** *Customer_Age, Gender, Education_Level, Marital_Status, Income_Category*. Their meaning is straightforward;

* **customer-bank relationship features:** 
    - Dependent_count: [number of people who use that specific account](https://www.kaggle.com/sakshigoyal7/credit-card-customers/discussion/201767);
    - Card_Category: is this a Premium or a Basic account?
    - Months_on_book: the duration of the relationship as of now;
    - Total_Relationship_Count: total number of products held by the customer. In other words, the client could hold other products such as a debit card, loans, and so on;
    - Contacts_Count_12_mon: the number of contacts between the customer and the bank in the last 12 months. It could be a key indicator of the satisfaction level of the client: the more contacts, the higher the probability that there is something that causes attrition;
    
    
* **Credit Card utilization features:**
    - Months_Inactive_12_mon: how many months the client has been inactive. However, this variable is not entirely clear to me: the official documentation states that inactivity is recorded over the last 12 months, but the highest value of this variable is 6, and it is not clear whether it counts *consecutive* months of inactivity. In my personal opinion, this variable shows the maximum number of consecutive months of inactivity, and a customer is classified as *churned* after 6 or 7 months of inactivity;
    - Credit_Limit: this is the maximum amount the client is allowed to use;
    - Total_Revolving_Bal: the debt amount. For example the revolving balance value in february is determined by: $$Debt\_February=Debt\_January+CreditUsed\_February-DebtPaid\_February$$
    - Avg_Open_To_Buy: suppose a client has used 500£ and their credit limit is 2500£. The customer is thus *open to buy* 2500-500=2000£. Avg_Open_To_Buy is the average of the Open To Buy value over the last 12 months;
    - Total_Trans_Amt: total transactional amount in the last 12 months;
    - Total_Amt_Chng_Q4_Q1: the ratio between the transactional amount of the fourth quarter and that of the first quarter (Q4 over Q1). Hence, a value smaller than 1 means that the customer spent less in Q4 than in Q1;
    - Total_Trans_Ct, Total_Ct_Chng_Q4_Q1: their meaning is analogous to the last two variables. They differ in the underlying reference variable, which in this case is the number of transactions instead of the amount;
    - Avg_Utilization_Ratio: $$Utilization\_Ratio=\frac{Credit\_Limit-Open\_To\_Buy}{Credit\_Limit}=\frac{Credit\_Used}{Credit\_Limit}$$
    Avg_Utilization_Ratio is the average proportion of the credit used with respect to the credit limit in the last 12 months.
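
To make the credit card utilization formulas concrete, here is a minimal sketch that reuses the 500£ / 2500£ example from the list. The function names and the January/February amounts are made up for illustration; they are not part of the dataset.

```python
def revolving_balance(previous_debt, credit_used, debt_paid):
    """Debt carried into the next month (the revolving balance formula)."""
    return previous_debt + credit_used - debt_paid


def open_to_buy(credit_limit, credit_used):
    """Credit still available to the customer."""
    return credit_limit - credit_used


def utilization_ratio(credit_limit, credit_used):
    """Share of the credit limit that is currently used."""
    return credit_used / credit_limit


# Numbers from the Avg_Open_To_Buy example: 500 used out of a 2500 limit
limit, used = 2500.0, 500.0
otb = open_to_buy(limit, used)
ratio = utilization_ratio(limit, used)

# Hypothetical month: 300 of debt in January, 500 used and 200 paid in February
debt_february = revolving_balance(previous_debt=300.0, credit_used=500.0, debt_paid=200.0)

print(f"open to buy: {otb}, utilization ratio: {ratio}, February debt: {debt_february}")
```

Both forms of the utilization ratio agree: `(limit - otb) / limit` gives the same value as `used / limit`.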
    

There are two more features:
* **CLIENTNUM:** primary key of the dataset. It is not useful for this project;
* **Attrition_Flag:** whether the customer has churned or not. Attrited customers are those with *closed* accounts, so the dataset may contain clients who are *close to churning* but are still labelled as existing customers (this will cause some errors, as we will see below).  
We can observe that the distribution is unbalanced: about 84% of the data belong to the class *Existing Customer*, whereas only 16% of clients are in the churned class. This means that a standard machine learning algorithm may struggle to classify the minority class correctly. For more on this, please refer to [3. Full features Machine learning model vs Model with first level insights](#3rdPart);
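
A small sketch of why the 84%/16% imbalance matters. It uses 100 made-up customers in that proportion and a hypothetical "model" that always predicts the majority class. The accuracy looks decent, but no churner is ever detected.

```python
# 100 toy customers split 84/16, as in the dataset (labels only, no real data)
y_true = ["Existing"] * 84 + ["Attrited"] * 16
# Hypothetical baseline: always predict the majority class
y_pred = ["Existing"] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on the minority class: share of true churners that were caught
caught = sum(t == "Attrited" and p == "Attrited" for t, p in zip(y_true, y_pred))
recall_attrited = caught / y_true.count("Attrited")

print(f"accuracy = {accuracy:.2f}, recall on attrited = {recall_attrited:.2f}")
```

This is why accuracy alone is a poor yardstick here, and why the minority class deserves its own metric.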

### Basic info

In [None]:
# Number of customers in each class
df['Attrition_Flag'].value_counts()

In [None]:
print(f"The dataset has {df.shape[0]} rows and {df.shape[1]} columns")
print("\n\n" + "=" * 20 + "\n")
df.info()  # info() prints its summary itself and returns None, so no display() is needed
print("\n" + "=" * 10 + " Unique & missing values " + "=" * 10)
display(pd.DataFrame({"Uniques": df.nunique().values, "Missing": df.isna().sum(axis = 0).values}, index = df.columns))

<a id='2ndPart'></a>
## First-level analysis: let's look at which features directly impact churn

Instead of looking for correlations between features, I wanted to start by looking at the features that have a direct, first-level impact on churn. 
I used a probabilistic concept to build this logic:


Take two events $A$ and $B$. If they are independent, then $P(A \cap B) = P(A)\,P(B)$ **[1]**, which is equivalent to $P(A \mid B) = P(A)$ **[2]**.

Here $A$ is a feature value (for example, a bin of the customer age histogram) and $B$ is the label: $B$ means existing customer and $\bar{B}$ means attrited customer. By [2], if the feature is independent of the label, its distribution should not vary much whether we plot it for existing customers, $P(A \mid B)$, or for attrited ones, $P(A \mid \bar{B})$.

So what I'm gonna do is plot the histograms of $P(A \cap B)$ and $P(A \cap \bar{B})$. **If the feature is independent of the label**, the only difference we will see between the two histograms is the scale of the y counts, because in the dataset $P(B) \approx 84\%$ and $P(\bar{B}) \approx 16\%$ (see formula [1]).

In summary, when plotting $P(A \cap B)$ and $P(A \cap \bar{B})$ we should see the same shape, up to that scale factor.

For example, take $A$ = male, $\bar{A}$ = female and $B$ = existing, $\bar{B}$ = churn. Under independence, both genders show the same split between the two labels: ~16% of male customers churn and ~84% stay, and likewise for female customers.

Put differently, the normalized distribution of a feature among existing customers should look the same as among attrited customers. 
If the two distributions differ, then churn is **not independent of the feature**, and this feature can be a good predictor of which clients are likely to churn.
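
The independence argument above can be checked numerically. The sketch below builds a synthetic table (not the bank data) with one feature drawn independently of the label and one that depends on it, then compares $P(\text{feature} \mid \text{label})$ across the two classes with `pd.crosstab(..., normalize="columns")`. The column names, the 0.7/0.3 probabilities, and the helper `conditional_gap` are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 20_000

# Synthetic label with roughly the same 16% churn rate as the dataset
churn = rng.random(n) < 0.16

# Feature drawn independently of the label
gender = rng.choice(["M", "F"], size=n)

# Feature that depends on the label: churners are "low" activity more often
p_low = np.where(churn, 0.7, 0.3)
activity = np.where(rng.random(n) < p_low, "low", "high")

toy = pd.DataFrame({
    "label": np.where(churn, "Attrited", "Existing"),
    "gender": gender,
    "activity": activity,
})


def conditional_gap(frame, feature, label="label"):
    """Largest absolute difference between P(feature | label) across classes."""
    # Each column of the table sums to 1, i.e. it is P(feature value | label)
    table = pd.crosstab(frame[feature], frame[label], normalize="columns")
    return float((table.max(axis=1) - table.min(axis=1)).max())


gap_gender = conditional_gap(toy, "gender")
gap_activity = conditional_gap(toy, "activity")
print(f"gender gap: {gap_gender:.3f}, activity gap: {gap_activity:.3f}")
```

A gap close to zero means the two normalized distributions overlap, as expected for an independent feature. A large gap flags a candidate predictor.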

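One way to measure the difference between the two distributions is **Student's t-test**, which tells us whether the difference between the means of the two groups is statistically significant. Since it only compares means, it applies to numeric features and can miss differences in shape (for example, unimodal vs multimodal). 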
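
A minimal sketch of such a test, assuming SciPy is available (it is not imported elsewhere in this notebook). The two samples are synthetic stand-ins for a numeric feature in each class, with made-up means and spreads. I use Welch's variant (`equal_var=False`), which does not assume equal variances in the two groups; that is my choice here, not something the data dictates.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic stand-ins for one numeric feature, split by class (84/16 sizes)
existing = rng.normal(loc=70, scale=20, size=840)
attrited = rng.normal(loc=45, scale=15, size=160)

# Welch's t-test: compares the two means without assuming equal variances
result = stats.ttest_ind(existing, attrited, equal_var=False)
significant = result.pvalue < 0.05

print(f"t = {result.statistic:.2f}, p = {result.pvalue:.3g}, significant: {significant}")
```

On the real data, the two samples would be something like `df.loc[df['Attrition_Flag'] == 'Existing Customer', 'Total_Trans_Ct']` and its 'Attrited Customer' counterpart.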

In [None]:
sns.set_style('ticks')
features = cols[2:-2]  # skip CLIENTNUM, Attrition_Flag and the two dropped columns
fig, axs = plt.subplots(7, 3, figsize=(50, 100), edgecolor='k')
axs = axs.ravel()
for num, i in enumerate(features):
    # Overlay the histogram of feature i for each class on the same axes
    sns.histplot(df.loc[df['Attrition_Flag'] == 'Existing Customer', i],
                 label='Existing', kde=True,
                 line_kws=dict(linewidth=6), ax=axs[num])
    sns.histplot(df.loc[df['Attrition_Flag'] == 'Attrited Customer', i],
                 label='Attrited', kde=True,
                 line_kws=dict(linewidth=6), ax=axs[num], color='orange')

    axs[num].set_xlabel(i, fontsize=40)
    axs[num].set_ylabel('count', fontsize=30)
    axs[num].tick_params(axis='both', labelsize=30)
    axs[num].legend(fontsize=30)  # show which colour belongs to which class

# Hide the axes left empty by the 7x3 grid
for ax in axs[len(features):]:
    ax.set_visible(False)

## Analysis

For "Customer_Age" (fig(1,1)), the class label seems independent of the feature: there is no apparent useful information in it. Why? Because the distributions of customers in the two classes look the same (**the modes seem to match on the x axis**), except for the vertical counts (**following the reasoning explained earlier**). That difference is expected given the initial split of labels between "Existing Customer" and "Attrited Customer", which was 84%/16% respectively. 

So from fig(1,1) to fig(4,3), there is no apparent dependence between the feature and the label.

Following this strategy (looking for modal discrepancies between distributions), we can infer that the features with the most first-level impact on churn are:
* Avg_Utilization_Ratio
* Total_Trans_Ct (pretty clear multimodal vs unimodal distribution)
* Total_Trans_Amt
* Total_Ct_Chng_Q4_Q1
* Total_Amt_Chng_Q4_Q1
* Total_Revolving_Bal

## Conclusion:
The preceding analysis only covers first-level dependencies. By that I mean that higher-order or group dependencies are possible: features that look unimportant on their own can affect churn once they are combined! 
But this data exploration at least helps us understand the first-level effect of the important features. It is better to start with approximations and refine the model afterwards.  
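
To illustrate what a group dependency can look like, here is a deliberately extreme toy case (an XOR pattern, not taken from the bank data). The hypothetical binary features `f1` and `f2` each look useless on their own, yet together they determine churn exactly.

```python
import itertools

import pandas as pd

# churn = f1 XOR f2, repeated so that every combination is equally frequent
rows = [(a, b, a ^ b) for a, b in itertools.product([0, 1], repeat=2)]
toy = pd.DataFrame(rows * 250, columns=["f1", "f2", "churn"])

# First-level view: the churn rate is identical for every value of each feature
rate_by_f1 = toy.groupby("f1")["churn"].mean()
rate_by_f2 = toy.groupby("f2")["churn"].mean()

# Joint view: the pair (f1, f2) separates churners perfectly
rate_joint = toy.groupby(["f1", "f2"])["churn"].mean()

print(rate_by_f1, rate_by_f2, rate_joint, sep="\n\n")
```

A histogram-per-feature inspection like the one used in this notebook would discard both features here, which is exactly the limitation described above.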