<h1 style='background-color: #fed8b1; padding-bottom:4px;' align='center'><b>Bank Churn EDA & Modelling</b></h1>

    We have been given the task of predicting the customers that are likely to churn so that the bank can proactively take action to prevent it from happening.

    In this notebook, I have performed some exploratory data analysis.

    I hope you find my notebook useful and make sure to upvote if you enjoyed it!

<h2 style='background-color: #fed8b1;' align='center'>Loading Libraries</h2>

In [None]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from pandas_profiling import ProfileReport

from scipy.special import boxcox1p
from scipy.stats import boxcox_normmax

<h2 style='background-color: #fed8b1;' align='center'>Data Loading & Basic Analysis</h2>

In [None]:
df = pd.read_csv('../input/credit-card-customers/BankChurners.csv')

In [None]:
df.head()

In [None]:
df.info()

<p>We are off to a great start as there appear to be no missing values!</p>

In [None]:
for c in df.select_dtypes('object').columns:
    print(df[c].value_counts())
    print('--------------------')

    This is a handy trick I like to use: it loops over every text (object-dtype) column and prints its value_counts.

    We see that here we will more than likely need to LabelEncode all these features except Gender, which would be a one-hot encoding.

    However, some libraries, such as CatBoost, can actually deal with text categories by themselves!
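    As a rough sketch of the two encodings mentioned above, here is how they could look with plain pandas. The `toy` frame and its values are made up for illustration, and integer category codes are only one of several ways to label-encode a column.

```python
import pandas as pd

# Toy frame standing in for the notebook's df; the values are illustrative only.
toy = pd.DataFrame({
    'Gender': ['M', 'F', 'F', 'M'],
    'Card_Category': ['Blue', 'Gold', 'Blue', 'Silver'],
})

# Label encoding: each distinct category becomes an integer code
# (codes follow the sorted category order).
toy['Card_Category_enc'] = toy['Card_Category'].astype('category').cat.codes

# One-hot encoding: one indicator column per Gender value.
encoded = pd.get_dummies(toy, columns=['Gender'])
print(encoded)
```

    Note that integer codes imply an order between categories, which tree-based models tolerate better than linear ones.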

In [None]:
df.describe()

    Here we are getting some basic statistics about the dataset to get a feel for what we are going to be working with.

    Immediately we see that our features are on different scales. 

    For instance, the median value for Total_Relationship_Count is 4, but the median value for the Credit_Limit feature is 4549!

    This shows us that we will require feature scaling if we choose to work with Linear models or Neural Networks.
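    A minimal sketch of what that scaling could look like, assuming scikit-learn's StandardScaler is the scaler of choice (the notebook does not commit to one). The `toy` values below are invented and only mimic the difference in scale between the two features.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy values on very different scales; not taken from the real dataset.
toy = pd.DataFrame({
    'Total_Relationship_Count': [1, 3, 4, 5, 6],
    'Credit_Limit': [1500.0, 2500.0, 4549.0, 11000.0, 34000.0],
})

# Standardise each column to zero mean and unit variance.
scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(toy), columns=toy.columns)
print(scaled.round(2))
```

    When modelling, the scaler would normally be fitted on the training split only and then applied to the test split.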

<h2 style='background-color: #fed8b1;' align='center'>Exploratory Data Analysis</h2>

In [None]:
ProfileReport(df,minimal=True)

    This is probably my new favorite tool for EDA. It's really simple but packs a lot of information in one line of code!

    The main reason I enjoy using this library is that it gives me a starting point for my data analysis.

    I have set minimal=True because the full report uses up a substantial amount of CPU, but nevertheless it is a useful tool to have in the toolkit!

In [None]:
fig = px.pie(df,df['Attrition_Flag'],hole=0.5)
fig.update_layout(title='Percentage of Attrited and Existing Customers',title_x=0.5)
fig.show()

In [None]:
# Pass the column as x= (recent seaborn versions no longer accept it positionally)
sns.countplot(x=df['Attrition_Flag'])
plt.title('Count of Attrited and Existing Customers')
plt.show()

In [None]:
df['Attrition_Flag'].value_counts()

<b>The key takeaways from here are:</b>

    1. The dataset is HEAVILY imbalanced.

       This means that we will have to address this when modelling.

       We may have to use either oversampling or undersampling techniques to address this issue.

    2. When splitting the data into a train and test set, or when using cross-validation, it is important to stratify the split
       so that the class distributions are split proportionally to the original dataset's imbalance.

       Specifically, if our data is imbalanced with only 16.1% belonging to the attrited class, we must reflect that in the split.
       Not doing so is a common pitfall that can give a misleading estimate of the model's accuracy when evaluating it.
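    The two takeaways above could be sketched as follows, assuming scikit-learn's train_test_split for the stratified split and simple random oversampling of the training set as one possible way to handle the imbalance. The `toy` frame only mimics the roughly 16% attrited share, and `train_balanced` is a name introduced here for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy target with roughly the imbalance seen above (16% attrited).
toy = pd.DataFrame({
    'feature': range(100),
    'Attrition_Flag': ['Attrited Customer'] * 16 + ['Existing Customer'] * 84,
})

# stratify= keeps the class proportions the same in both splits.
train, test = train_test_split(
    toy, test_size=0.25, stratify=toy['Attrition_Flag'], random_state=42
)
train_share = (train['Attrition_Flag'] == 'Attrited Customer').mean()
test_share = (test['Attrition_Flag'] == 'Attrited Customer').mean()
print(train_share, test_share)

# Random oversampling: resample the minority class (with replacement)
# until it matches the majority class. Applied to the training split only.
minority = train[train['Attrition_Flag'] == 'Attrited Customer']
majority = train[train['Attrition_Flag'] == 'Existing Customer']
minority_up = minority.sample(n=len(majority), replace=True, random_state=42)
train_balanced = pd.concat([majority, minority_up])
print(train_balanced['Attrition_Flag'].value_counts())
```

    The test split is left untouched so that it still reflects the real class balance.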

In [None]:
numeric_features = df.select_dtypes(exclude=['object'])

In [None]:
for feature in numeric_features:
    sns.boxplot(x=df[feature])
    plt.title('Boxplot of ' + str(feature))
    plt.show()

<b>Some key takeaways here</b>:

    1. As the author of the dataset suggested, we should 100% remove the last 2 columns as they will jeopardize the accuracy of the model 
       and follow no clear distribution.

    2. Some features, such as Total_Trans_Amt and Credit_Limit, have a few outliers in them.
       This may hinder our model's performance, so we should most likely deal with the outliers.
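    One common way to deal with such outliers is to clip values that fall outside 1.5 times the interquartile range, which is the same rule the boxplot whiskers use. This is only a sketch of that option on made-up values; the notebook does not say which method it would use.

```python
import pandas as pd

# Toy series with one extreme value; not taken from the real dataset.
toy = pd.Series([1200, 1500, 1800, 2100, 2500, 3000, 34000], name='Credit_Limit')

# Compute the IQR fences.
q1, q3 = toy.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Clip values outside the fences instead of dropping the rows.
clipped = toy.clip(lower=lower, upper=upper)
print(clipped)
```

    Clipping keeps every row, whereas dropping the outliers would shrink an already imbalanced dataset.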

In [None]:
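# numeric_only=True skips the text columns, which would otherwise raise an error in recent pandas versions
df.skew(numeric_only=True)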

    While our features are not overly skewed, a couple of them show left-tailed or right-tailed skewness.
    Let's begin by analysing the Credit_Limit feature.

In [None]:
# histplot with kde=True replaces the deprecated distplot
sns.histplot(df['Credit_Limit'],kde=True)
plt.title('Distribution of the Credit_Limit feature')
plt.show()
print(df['Credit_Limit'].skew())
print(df['Credit_Limit'].kurt())

    So it seems that this feature is right-skewed: most values sit on the left, with a long tail stretching to the right.
    However, note that we could possibly transform this feature to reduce this skewness.
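    Since boxcox1p and boxcox_normmax are already imported at the top of the notebook, one possible transformation is a Box-Cox transform. The sketch below runs on synthetic, right-skewed data rather than the real Credit_Limit column, and the choice of method='mle' is an assumption.

```python
import numpy as np
import pandas as pd
from scipy.special import boxcox1p
from scipy.stats import boxcox_normmax

# Synthetic right-skewed data standing in for Credit_Limit.
rng = np.random.default_rng(0)
toy = pd.Series(rng.lognormal(mean=8.5, sigma=0.9, size=2000), name='Credit_Limit')
values = toy.to_numpy()

# Estimate the Box-Cox lambda on (x + 1), then apply boxcox1p,
# which computes the Box-Cox transform of (1 + x).
lam = boxcox_normmax(values + 1, method='mle')
transformed = pd.Series(boxcox1p(values, lam), name='Credit_Limit_boxcox')

print('skew before:', toy.skew())
print('skew after: ', transformed.skew())
```

    Box-Cox requires strictly positive inputs, which is why the lambda is estimated on the shifted values.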

In [None]:
# histplot with kde=True replaces the deprecated distplot
sns.histplot(df['CLIENTNUM'],bins=40,kde=True)
plt.title('Distribution of the CLIENTNUM feature')
plt.show()
print(df['CLIENTNUM'].skew())
print(df['CLIENTNUM'].kurt())

    This feature is slightly different. It appears to somewhat follow a left-tailed normal distribution,
    but it is clear that there is a smaller distribution on the right, implying a high standard deviation.
    We also observe that this feature's kurtosis is ~ -0.62, meaning our distribution is light-tailed.
    This means that extreme values will occur less often.

In [None]:
sns.barplot(x=df['Attrition_Flag'],y=df['Credit_Limit'])
plt.show()

    We see that Existing customers are likely to have a slightly larger Credit_Limit than Attrited Customers, but the narrow gap implies that this feature is
    not a definitive one.

In [None]:
sns.barplot(x=df['Attrition_Flag'],y=df['Total_Revolving_Bal'])
plt.show()

In [None]:
plt.figure(figsize=(10,10))
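# Bins are (20, 30], (30, 40], (40, 50] and (50, 60]; ages outside 20-60 match no bin and become NaN
age_groups = pd.cut(df['Customer_Age'],[20,30,40,50,60],labels=['20-30','30-40','40-50','50-60'])
sns.barplot(x=age_groups,y=df['Total_Revolving_Bal'],hue=df['Attrition_Flag'])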
plt.show()

    Now we see a difference. Existing customers are more likely to have larger Revolving Balances than Attrited Customers. This is evident across all age groups.

In [None]:
sns.barplot(x=df['Attrition_Flag'],y=df['Total_Trans_Amt'])
plt.show()

In [None]:
sns.barplot(x=age_groups,y=df['Total_Trans_Amt'],hue=df['Attrition_Flag'])
plt.show()

    We can observe that Existing customers are more likely to have a higher transaction amount than attrited customers. However,
    in the youngest age bracket, attrited customers actually have higher transaction amounts than existing customers.

    We can see that the total transaction amount of existing customers across the different Income_Category groups is relatively uniform,
    with the $120K+ category being slightly larger. However, among attrited customers, the $80K-$120K group has the highest transaction amount.
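    The income comparison above could be reproduced with a groupby along these lines. This is a sketch only: the `toy` frame, its category labels and its amounts are invented for illustration and are not results from the dataset.

```python
import pandas as pd

# Toy frame; labels and amounts are illustrative, not real results.
toy = pd.DataFrame({
    'Income_Category': ['Less than $40K', 'Less than $40K',
                        '$80K - $120K', '$80K - $120K',
                        '$120K +', '$120K +'],
    'Attrition_Flag': ['Existing Customer', 'Attrited Customer'] * 3,
    'Total_Trans_Amt': [4500, 3000, 4600, 3400, 4900, 3100],
})

# Mean transaction amount per income group, one column per customer status.
mean_amt = (
    toy.groupby(['Income_Category', 'Attrition_Flag'])['Total_Trans_Amt']
       .mean()
       .unstack('Attrition_Flag')
)
print(mean_amt)
```

    On the real data, the same table could be drawn as a bar plot with Income_Category on the x-axis and Attrition_Flag as the hue.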

In [None]:
df.select_dtypes('object').columns

In [None]:
fig = px.pie(df,df['Card_Category'],hole=0.5)
fig.update_layout(title='Card Category of customers',title_x=0.5)
fig.show()

In [None]:
sns.barplot(x=df['Attrition_Flag'],y=df['Total_Revolving_Bal'],hue=df['Card_Category'])
plt.show()

    From these two visualisations, we can make the following observation:
    
    1. The main card category of all customers (existing and churned) is Blue. However, when we look at the data closely, we see that churned customers are actually more likely to be holding a Gold card rather than a Blue card.
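    A direct way to check a statement like this is to compute the churn rate within each card category, for example with a normalised crosstab. The sketch below uses a tiny invented frame, so the rates it prints say nothing about the real dataset.

```python
import pandas as pd

# Toy frame; the counts are made up for illustration.
toy = pd.DataFrame({
    'Card_Category': ['Blue'] * 6 + ['Gold'] * 4,
    'Attrition_Flag': (['Existing Customer'] * 5 + ['Attrited Customer']
                       + ['Existing Customer'] * 3 + ['Attrited Customer']),
})

# normalize='index' turns counts into shares within each card category,
# so every row sums to 1.
rates = pd.crosstab(toy['Card_Category'], toy['Attrition_Flag'], normalize='index')
print(rates)
```

    Comparing the 'Attrited Customer' column across rows shows which card category churns more often relative to its own size.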

In [None]:
fig = px.pie(df,df['Education_Level'],hole=0.5)
fig.update_layout(title='Education level of customers',title_x=0.5)
fig.show()

In [None]:
plt.figure(figsize=(15,15))
sns.barplot(x=df['Attrition_Flag'],y=df['Total_Revolving_Bal'],hue=df['Education_Level'])
plt.show()

    We see that the primary education level of customers is Graduate, but the education level does not greatly impact the likelihood of a customer churning.

In [None]:
df_corr = df.copy()
df_corr['Attrition_Flag'] = df_corr['Attrition_Flag'].map({'Existing Customer':0,'Attrited Customer':1})

plt.figure(figsize=(20,20))
# numeric_only=True keeps the text columns out of the correlation matrix
sns.heatmap(df_corr.corr(numeric_only=True),annot=True,cmap='plasma')
plt.show()

    We can observe that no features correlate strongly with the target, but that does not mean that the features are useless; our analysis
    showed that some features do have an impact on the likelihood of the customer churning,
    so it is not correct to judge the usefulness of a feature solely on a correlation heatmap.
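    To illustrate why a near-zero correlation does not imply a useless feature, here is a small synthetic sketch. The hypothetical `churn` rule below (churn when the feature is far from its mean in either direction) is an assumption made up for the example and has nothing to do with the real data.

```python
import numpy as np
import pandas as pd

# Synthetic feature, symmetric around zero.
rng = np.random.default_rng(0)
x = rng.normal(size=20000)

# Hypothetical target: churn when the feature is far from its mean
# in either direction, which is a non-linear relationship.
churn = (np.abs(x) > 1.5).astype(int)
toy = pd.DataFrame({'feature': x, 'churn': churn})

# Pearson correlation only measures the linear relationship.
corr = toy['feature'].corr(toy['churn'])

# Yet the groups clearly differ in how extreme the feature is.
spread = toy.groupby('churn')['feature'].agg(lambda s: s.abs().mean())
print('correlation:', corr)
print(spread)
```

    The correlation comes out close to zero even though the feature fully determines the target, which is why plots and group comparisons are worth checking alongside the heatmap.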