## EDA of bank churn dataset

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
df = pd.read_csv('../input/predicting-churn-for-bank-customers/Churn_Modelling.csv')

In [None]:
df.info()

### There are no null values.
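
The non-null counts from `df.info()` can be cross-checked by counting missing values per column. The snippet below is a minimal sketch on a small made-up frame (`toy` and its values are illustrative assumptions, not rows from the dataset):

```python
import pandas as pd

# Hypothetical toy frame standing in for the churn data (values are made up).
toy = pd.DataFrame({
    'CreditScore': [619, 608, 502],
    'Geography': ['France', 'Spain', None],
    'Exited': [1, 0, 1],
})

# Count missing values per column; all zeros would mean no nulls.
null_counts = toy.isnull().sum()
print(null_counts)

# Single boolean answer for the whole frame.
has_nulls = bool(toy.isnull().values.any())
print(has_nulls)
```

On the real data, `df.isnull().sum()` should show zero for every column if the observation above holds.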

In [None]:
df.head()

### Let us drop RowNumber and Surname

In [None]:
df.drop(['RowNumber', 'Surname'], axis=1, inplace=True)

In [None]:
df.to_csv('churn_cleaned.csv', index=False)

In [None]:
df.head()

In [None]:
df = df.sort_values('CustomerId')

In [None]:
df.head()

In [None]:
df['CustomerId'].is_monotonic_increasing

### We will drop the customer ID column before EDA, as this column just represents a unique identifier for each customer in our dataset.

In [None]:
custid = df['CustomerId']
df = df.drop('CustomerId', axis=1)

### Let us see the count of target variable (0 or 1).

In [None]:
df['Exited'].value_counts()

## How to explore the data?

Questions to ask

- What is the average credit score of exited customers and retained customers?
- Which geographies are more common among exited customers?
- Do exited customers have a higher or lower salary compared to retained customers?
- What is the ratio of credit card holders in each category?
- How is credit score distributed for the credit card holders in each category?
- What is the average balance of each category?
- What is the average balance of credit card holders and non-holders overall? How is it distributed across combinations of credit card category and churn category?
- Customers of which geography are more likely to hold a credit card?
- Do credit card holders have a higher or lower salary than non-holders?
- How is average salary distributed across geographies?
- How is the ratio of exited customers distributed by gender?
- How is average salary distributed by gender?
- Is gender a factor in holding a credit card?
- How is salary distributed across gender and churn status?
- Age vs churn: how variable is the age of exited vs retained customers (`std()`)?
- How is 'Exited' distributed by NumOfProducts?
- 'Exited' vs IsActiveMember
- 'Exited' vs EstimatedSalary
- Salary vs gender
- Important: Salary vs balance and Salary vs CreditScore

### Does gender determine churn ratio?

In [None]:
plt.figure(figsize=(6,6))
(df['Gender'].value_counts() * 100 / df['Gender'].value_counts().sum()).plot(kind='bar')

#### About 55% of the customers in the dataset are male. We will check if this ratio is maintained across other features.

In [None]:
plt.rcParams['figure.figsize'] = (6,6)

In [None]:
(df.groupby('Gender')['Exited'].mean()*100).plot(kind='bar')
plt.ylabel('Churn Percentage')

### Overall, females have a higher churn ratio compared to males.

In [None]:
sns.catplot(x = 'Exited', hue= 'Gender', data= df, kind='count')
plt.xticks(range(2), ['NO', 'YES'])
plt.ylabel('Number of Customers');

### Among retained customers, males outnumber females.

In [None]:
df['Gender'].value_counts()

### Does credit score determine churn?

In [None]:
df.groupby('Exited')['CreditScore'].mean()

In [None]:
plt.figure(figsize=(6,8))
sns.boxplot(x='Exited', y='CreditScore', data=df)

- The median credit scores of churned and retained customers are almost the same.
- Exited customers seem to have a lower minimum credit score.

In [None]:
sns.displot(x='CreditScore', hue='Exited', data=df)

- The credit scores of both churned and retained customers follow a roughly normal distribution.
- As seen in the previous plot, some of the lowest credit scores in the dataset belong to churned customers.

- Let us divide the credit score into 3 categories as follows:
1. Poor = Score <= 550
2. Average = 550 < Score <= 700
3. Good = Score > 700

In [None]:
df['Score'] = pd.cut(df['CreditScore'],
                    bins=[0,550,700,900],
                    labels=['Poor', 'Average', 'Good'])
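
To see how `pd.cut` treats scores that fall exactly on a bin edge, here is a small sketch with made-up scores (the `scores` values are illustrative, not taken from the dataset). By default the intervals are closed on the right, which matches the three categories defined above:

```python
import pandas as pd

# Made-up scores chosen to sit on and around the bin edges.
scores = pd.Series([350, 550, 551, 700, 701, 850])

# pd.cut uses right-closed intervals by default:
# (0, 550] -> Poor, (550, 700] -> Average, (700, 900] -> Good.
bands = pd.cut(scores, bins=[0, 550, 700, 900],
               labels=['Poor', 'Average', 'Good'])
print(bands.tolist())
# ['Poor', 'Poor', 'Average', 'Average', 'Good', 'Good']
```

A score of exactly 550 lands in 'Poor' and 700 lands in 'Average', because each upper edge is included in its bin.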

In [None]:
sns.countplot(x=df['Score'])

- Most customers in the dataset have an average score.

In [None]:
sns.countplot(x='Exited', hue='Score', data=df)

- The same trend is maintained in churned as well as retained customers.

### Which geographies are more common in exited customers?

In [None]:
# let us first look at the overall distribution of Geography

plt.figure(figsize = (6,6))
(df['Geography'].value_counts()/df['Geography'].value_counts().sum()*100).plot(kind='bar')
plt.ylabel('Percentage out of total samples')
plt.xlabel('Geography')
plt.title('Geographical share of all customers');

- About 50% of the customers are from France.
- Germany and Spain each have a share of about 25%.

In [None]:
ct = pd.crosstab( df['Exited'], df['Geography'])
ct

- Spain has the lowest count of exited customers. The counts of France and Germany are practically equal.

In [None]:
plt.figure(figsize=(6,6))
# Churned count divided by the total count per geography; a Series plots against its own index
(ct.loc[1] * 100.0 / ct.sum()).plot(kind='bar')
plt.ylabel('Churn percentage within geographical group')
plt.title('Churn by geography')

### Is there a difference in the estimated salaries of exited and retained customers?

In [None]:
df.groupby('Exited')['EstimatedSalary'].mean()

In [None]:
plt.figure(figsize=(6,8))
sns.boxplot(x='Exited', y='EstimatedSalary', data=df)

- No significant difference seems to exist between the estimated salaries of the two groups.

### Customers of what age are more likely to leave?

In [None]:
plt.figure(figsize=(6,6))
sns.boxplot(x='Exited', y='Age', data=df)

#### Let us bin 'Age' into 4 categories

In [None]:
df['Age'].min(), df['Age'].max()

In [None]:
df['AgeCat'] = pd.cut(df['Age'],
                     bins=[17,35,50,65,93],
                     labels=['Young', 'Middle-aged', 'Senior', 'Very-old'])

In [None]:
sns.countplot(x='AgeCat', data=df)

In [None]:
pd.crosstab(df['AgeCat'], df['Exited']).plot(kind='bar', stacked=True, figsize=(6,6))

#### It seems middle-aged and senior customers have a higher tendency to leave.
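
Stacked counts make it hard to compare groups of different sizes, so the tendency is easier to read as a churn percentage within each age category. A minimal sketch on a made-up frame (`toy` and its values are illustrative assumptions, not the dataset):

```python
import pandas as pd

# Hypothetical toy data: age category and churn flag per customer.
toy = pd.DataFrame({
    'AgeCat': ['Young', 'Young', 'Young', 'Young',
               'Middle-aged', 'Middle-aged', 'Senior', 'Senior'],
    'Exited': [0, 0, 0, 1, 0, 1, 1, 1],
})

# normalize='index' scales each row (age category) to sum to 1,
# so column 1 becomes the churn share within that category.
rates = pd.crosstab(toy['AgeCat'], toy['Exited'], normalize='index') * 100
churn_pct = rates[1]
print(churn_pct)
```

On the real data the same call would be `pd.crosstab(df['AgeCat'], df['Exited'], normalize='index')`.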

### Does holding a credit card play a role in churn?

In [None]:
df['HasCrCard'].value_counts()

#### About 70% of all customers in the dataset have credit cards.

In [None]:
df.groupby('Exited')['HasCrCard'].mean()

#### The overall credit card holder ratio of about 70% holds for churned as well as retained customers.

In [None]:
df.groupby('HasCrCard')['Exited'].mean()

In [None]:
plt.figure(figsize=(6,6))
sns.countplot(x='HasCrCard', hue='Exited', data=df)
plt.title('CREDIT CARD HOLDER vs CHURN');

### What is the average credit score of credit card holders and of customers without a credit card?

In [None]:
df.groupby('HasCrCard')['CreditScore'].mean()

#### Both categories of customers have a very similar average credit score.
#### Let's see the distribution.

In [None]:
plt.figure(figsize=(6,6))
sns.boxplot(y='CreditScore',x='HasCrCard', data=df)
plt.title('CREDIT SCORE vs CREDIT CARD HOLDING')

- Having a credit card does not seem to affect credit score.

### Does geography determine credit card adoption?

In [None]:
# catplot is figure-level and creates its own figure, so no plt.figure() call is needed
sns.catplot(x='Exited', hue='HasCrCard', col='Geography', data=df, kind='count')

### What is the churn behaviour by Geography and credit card adoption?

In [None]:
sns.catplot(x='Geography', hue='Exited', col='HasCrCard', data=df, kind='count')

### What is the average balance for each category?

In [None]:
sns.displot(x='Balance', data=df)
plt.title('Distribution of balance');

In [None]:
df['Balance'].describe()

### There seems to be huge variability in the distribution of Balance.

In [None]:
(df['Balance'] == 0).sum()

### 36% of the customers have zero balance. This may be due to inactive/frozen accounts, abandoned accounts, etc.
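
The share of zero balances can be computed directly as the mean of a boolean mask, rather than dividing the count by hand. A minimal sketch with made-up balances (the `balance` values are illustrative, not the dataset):

```python
import pandas as pd

# Hypothetical balances; two of the five are zero.
balance = pd.Series([0.0, 83807.86, 0.0, 125510.82, 113755.78])

# The mean of a boolean mask is the fraction of True values.
zero_share = (balance == 0).mean() * 100
print(round(zero_share, 1))
```

On the real data, `(df['Balance'] == 0).mean() * 100` gives the percentage quoted above.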

In [None]:
df.groupby('Exited')['Balance'].mean()

#### Exited customers seem to have a greater balance on average than retained customers.
#### Let us see the distribution.

In [None]:
plt.figure(figsize=(6,7))
sns.boxplot(x='Exited', y='Balance', data=df)
plt.title('CHURN vs BALANCE')

#### Among customers with non-zero balances, low-balance customers are more common in the retained group.

In [None]:
bal_non_zero = df.loc[df['Balance']>0]

In [None]:
(df['Balance'] < 0).sum()

In [None]:
sns.displot(x='Balance', hue='Exited', data=bal_non_zero)
plt.title('Non-zero balances');

#### A balance between 100000 and 130000 seems to be most common.

In [None]:
bal_non_zero.groupby('Exited')['Balance'].mean()

#### In this case, the balances are similar.

In [None]:
plt.figure(figsize=(6,7))
sns.boxplot(x='Exited', y='Balance', data=bal_non_zero)
plt.title('NON-ZERO BALANCES vs CHURN')

### How does balance vary for credit card holders by retained and exited customers?

In [None]:
plt.figure(figsize=(6,6))
sns.barplot(x='Exited', y='Balance', hue='HasCrCard', data=df)
plt.legend(title = 'Has credit card', loc=0)

### Do balance and credit score have a relationship?

In [None]:
# jointplot is figure-level and creates its own figure; use height to control its size
sns.jointplot(x='Balance', y='CreditScore', data=df, hue='Exited', height=6)

#### No significant relationship seems to be visible between balance and credit score.

### Balance vs estimated salary?

In [None]:
sns.jointplot(x='EstimatedSalary', y='Balance', data=bal_non_zero, hue='Exited')

#### No relationship seems to exist.

### Salary vs credit score?

In [None]:
sns.jointplot(x='CreditScore', y='EstimatedSalary', data=df, hue='Gender')

####  Again, no clear trend is visible. The distribution seems to be completely random.

### How does estimated salary vary for different geographies?

In [None]:
plt.figure(figsize=(6,7))
sns.boxplot(x='Geography', y='EstimatedSalary', data=df)

####  The values are practically the same.

### Do males and females have a different median salary?

In [None]:
plt.figure(figsize=(6,6))
sns.boxplot(x='Gender', y='EstimatedSalary', data=df)

####  Males in the dataset seem to have a slightly lower salary compared to females. This difference does not seem to be significant.

### Credit card adoption by gender?

In [None]:
c = pd.crosstab(df['Gender'], df['HasCrCard'])
c

In [None]:
(df.groupby('Gender')['HasCrCard'].mean() * 100).plot(kind='bar')

#### Males and females have nearly equal credit card adoption ratios.

### Does length of the relationship with the bank play a role in churn?

In [None]:
sns.countplot(x=df['Tenure'])

####  Overall, there is no clear trend in the relationship length of customers with the bank.

In [None]:
sns.countplot(x='Tenure', data=df, hue='Exited')

In [None]:
c = pd.crosstab(df['Exited'], df['Tenure'])

In [None]:
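# Compute the column totals once, before adding rows, so the new rows do not distort them
totals = c.sum()
c.loc['P_1'] = c.loc[1] * 100 / totals
c.loc['P_0'] = c.loc[0] * 100 / totals
c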

In [None]:
c.loc[[0,1]].T.plot(kind='bar', stacked=True, figsize=(7,5))
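
The per-tenure percentages can also be obtained directly from `pd.crosstab` with `normalize='columns'`, which avoids appending percentage rows by hand. A minimal sketch on a made-up frame (`toy` and its values are illustrative assumptions, not the dataset):

```python
import pandas as pd

# Hypothetical toy data: tenure in years and churn flag per customer.
toy = pd.DataFrame({
    'Tenure': [1, 1, 1, 1, 2, 2, 3, 3, 3, 3],
    'Exited': [0, 0, 0, 1, 0, 1, 0, 0, 1, 1],
})

# normalize='columns' scales each tenure column to sum to 1,
# so row 1 holds the churn share for that tenure.
pct = pd.crosstab(toy['Exited'], toy['Tenure'], normalize='columns') * 100
print(pct)
```

Plotting `pct.T` as a stacked bar chart would then show churn share per tenure on a common 0 to 100 scale.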

###  Let us plot numerical columns together.

In [None]:
df.columns

In [None]:
sns.pairplot(df[['CreditScore', 'Age', 'Tenure', 'Balance',
                 'NumOfProducts', 'EstimatedSalary']])

####  Young people are more likely to avail multiple products from the bank compared to older people.
#### Customers with multiple products generally have a higher credit score.

### Do active members leave less often than inactive members?

In [None]:
c = pd.crosstab(df['IsActiveMember'], df['Exited'])
c

In [None]:
c.plot(kind='bar', stacked=True, figsize=(6,7))
plt.title('CHURN vs IS_ACTIVE_MEMBER')
plt.ylabel('Number of customers')

####  As we can see, the inactive members are leaving more. This is as expected in a real business scenario.

### Do exited customers use fewer of the bank's products?

In [None]:
df['NumOfProducts'].value_counts()

In [None]:
df['NumOfProducts'] = df['NumOfProducts'].astype('category')

In [None]:
plt.figure(figsize=(6,6))
sns.countplot(x='NumOfProducts', hue='Exited', data=df)
plt.title('CHURN vs NUMBER OF PRODUCTS TAKEN')

#### Customers who have taken fewer products have a lower churn ratio.
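
A count plot shows absolute numbers, so the churn ratio per product count is worth computing explicitly before drawing conclusions. A minimal sketch on a made-up frame (`toy` and its values are illustrative assumptions and do not reflect the dataset's actual ratios):

```python
import pandas as pd

# Hypothetical toy data: number of products and churn flag per customer.
toy = pd.DataFrame({
    'NumOfProducts': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3],
    'Exited': [0, 0, 0, 1, 0, 0, 0, 0, 1, 1],
})

# The mean of the 0/1 churn flag within each group is the churn ratio.
churn_pct = toy.groupby('NumOfProducts')['Exited'].mean() * 100
print(churn_pct)
```

On the real data, `df.groupby('NumOfProducts')['Exited'].mean() * 100` gives the ratio for each product count.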