In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
pd.options.display.max_rows = None
pd.options.display.max_columns = None

**Loading the dataset**

In [None]:
cp = pd.read_csv("../input/churn-modelling/Churn_Modelling.csv")

In [None]:
cp.head()

In [None]:
cp.info()

**The dataset has 14 columns, and none of them contains null values.**
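
The null check can also be done explicitly rather than read off `info()`. A minimal sketch on a made-up toy frame (the `toy` data and its values are assumptions for illustration, not rows from the churn file):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the churn data; all values are made up.
toy = pd.DataFrame({
    "CreditScore": [600, 610, np.nan, 700],
    "Geography": ["France", "Spain", "France", None],
    "Exited": [1, 0, 1, 0],
})

n_rows, n_cols = toy.shape                 # (rows, columns)
null_counts = toy.isnull().sum()           # missing values per column
cols_with_nulls = null_counts[null_counts > 0].index.tolist()

print(n_rows, n_cols)
print(cols_with_nulls)
```

On the real data, the same two lines applied to `cp` should report an empty list if no column has nulls.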

In [None]:
columns = cp.columns.values.tolist()
print(columns)

## Not all columns affect customer churn

Column Description:

RowNumber - corresponds to the record (row) number and has no effect on the output.

CustomerId - contains arbitrary values and has no effect on a customer leaving the bank. However, it is a unique identifier, so we keep this column for now and drop it before the EDA.

Surname - the surname of a customer has no impact on their decision to leave the bank.

CreditScore - may affect customer churn, since a customer with a higher credit score is expected to be less likely to leave the bank.

Geography - a customer's location can affect their decision to leave the bank.

Gender - lets us analyse whether gender is an important factor in predicting whether a customer leaves the bank.

Age - relevant, since older customers are commonly expected to be less likely to leave their bank than younger ones.

Tenure - number of years that the customer has been a client of the bank. Normally, long-standing customers are more loyal and less likely to leave a bank.

Balance - potentially a good indicator of customer churn, as people with a higher balance in their accounts are expected to be less likely to leave the bank than those with lower balances.

NumOfProducts - number of products that a customer has purchased through the bank.

HasCrCard - whether or not a customer has a credit card. This column is also relevant, since people with a credit card are expected to be less likely to leave the bank.

IsActiveMember - active customers are expected to be less likely to leave the bank.

EstimatedSalary - people with lower salaries are expected to be more likely to leave the bank than those with higher salaries.

Exited - whether or not the customer left the bank.

Based on the above observations, we remove the RowNumber and Surname columns, as they have no impact on the output. The remaining columns may all contribute to customer churn in one way or another.

In [None]:
cp.drop(['RowNumber', 'Surname'], axis=1, inplace=True)

In [None]:
cp.head()

### Univariate Variable Analysis

**Categorical variables:**

Geography
Gender
NumOfProducts
HasCrCard
IsActiveMember
Exited


**Numerical variables:**

CustomerId
CreditScore
Age
Tenure
Balance
EstimatedSalary


In [None]:
def plot_bar(variable):
    
    var=cp[variable]
    varValue= var.value_counts()
    
    #visualize
    plt.figure(figsize=(9,3))
    plt.bar(varValue.index,varValue)
    plt.xticks(varValue.index,varValue.index.values)
    plt.ylabel("Frequency")
    plt.title(variable)
    plt.show()
    print("{}: \n {}".format(variable,varValue))

In [None]:
category1=["Gender","NumOfProducts","HasCrCard","IsActiveMember","Exited"]
for c in category1:
    plot_bar(c)

#### In this dataset, the number of male customers is slightly higher than the number of female customers.

#### The number of customers using 3 or 4 products is very low.

#### The number of customers who have a credit card is almost 2.5 times the number without one; about 70% of all customers hold the bank's credit card.

#### There are slightly more active customers than inactive ones; the two counts are close to each other.

#### A sizeable share of customers left the bank: around 20% of all customers exited, while around 80% stayed.
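
Shares like these are easier to get right when computed than when read off a bar chart. A small sketch, assuming a made-up `exited` column (1 = left, 0 = stayed); `value_counts(normalize=True)` returns proportions instead of counts:

```python
import pandas as pd

# Toy target column; values are made up for illustration.
exited = pd.Series([0, 0, 0, 0, 1, 0, 0, 1, 0, 0], name="Exited")

counts = exited.value_counts()                 # absolute frequencies
shares = exited.value_counts(normalize=True)   # fractions that sum to 1

print(counts)
print(shares)
```

The same call could be added to the `print` at the end of `plot_bar` so each chart comes with its percentages.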


In [None]:
category2=["Geography"]
for c in category2:
    plot_bar(c)

**About 50% of the customers are in France, and around 25% each in Germany and Spain.**

## Numerical Variables

In [None]:
def plot_hist(variable):
    plt.figure(figsize=(9,3))
    plt.hist(cp[variable],bins=50)
    plt.xlabel(variable)
    plt.ylabel("Frequency")
    
    plt.title("{} distribution with histogram".format(variable))
    plt.show()

In [None]:
numerical1=["CustomerId"]
for n in numerical1:
    plot_hist(n)

**CustomerId is an identifier rather than a meaningful numeric variable, so we exclude it from the analysis.**
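
One way to back up the "identifier" judgement is to check whether every value in the column is distinct. A minimal sketch on made-up data; the helper name `looks_like_identifier` is hypothetical:

```python
import pandas as pd

# Toy frame; the IDs and ages are made up.
toy = pd.DataFrame({
    "CustomerId": [1001, 1002, 1003, 1004],
    "Age": [42, 41, 42, 39],
})

def looks_like_identifier(series: pd.Series) -> bool:
    """Return True when every value in the column is distinct."""
    return bool(series.is_unique)

id_like = [col for col in toy.columns if looks_like_identifier(toy[col])]
print(id_like)
```

Uniqueness alone is only a hint (a continuous measurement can also be unique per row), so the column's meaning still has to be considered.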

In [None]:
cp.drop(['CustomerId'], axis=1, inplace=True)

In [None]:
cp.shape

In [None]:
numerical2=["CreditScore","Age","Tenure","Balance","EstimatedSalary"]
for n in numerical2:
    plot_hist(n)

In [None]:
cp.CreditScore.mean()

# CreditScore variable

**Relatively few customers have a credit score below 600.**

**The majority of customers have a credit score above 600.**

**The mean credit score is about 650.**

**As the chart shows, customer counts are low below 600 and rise after that, so the distribution is skewed to the left.**

**There is a spike in the number of customers at one point above 800.**
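
The direction of skew can be quantified instead of judged by eye. A sketch using `Series.skew()` on two made-up samples: a negative value indicates a longer left tail, a positive value a longer right tail.

```python
import pandas as pd

# Made-up samples: one with a long left tail, one with a long right tail.
left_tailed = pd.Series([1, 8, 9, 9, 10, 10, 10])
right_tailed = pd.Series([1, 1, 1, 2, 2, 3, 10])

left_skew = left_tailed.skew()     # negative for a left-skewed sample
right_skew = right_tailed.skew()   # positive for a right-skewed sample

print(left_skew, right_skew)
```

Applied to the notebook's data, this would be `cp.CreditScore.skew()` and `cp.Age.skew()`.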

In [None]:
cp.Age.mean()

# Age variable:

**Few customers are over the age of 45.**

**Even fewer are in the 60 to 80 age group.**

**Very few customers are around 90 years old; these may be outliers.**

**The distribution is skewed to the right: the number of customers falls off sharply above the mean age of 38.92.**
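
Whether the oldest customers count as outliers depends on the rule used. One common convention, shown here as an assumption rather than the notebook's method, is the 1.5 × IQR rule. The ages below are made up:

```python
import pandas as pd

# Made-up ages with one very old customer.
ages = pd.Series([25, 31, 35, 38, 40, 42, 45, 52, 92])

q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside [lower, upper] are flagged as potential outliers.
outliers = ages[(ages < lower) | (ages > upper)]
print(lower, upper, outliers.tolist())
```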

# Tenure variable

**The histogram shows 11 distinct values for the length of the customer relationship.**

In [None]:
print(cp.Balance.max())
print(cp.Balance.min())
print(cp.Balance.mean())
print(cp.Balance.std())

# Balance variable

**A very large number of customers have a balance of 0.**

**Counts rise between 50,000 and 100,000 and then fall again.**

**The standard deviation is high, i.e. balances are spread widely.**

**There is a wide gap between customers' balances (as the max and min values show).**
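
A large block of zero balances inflates the spread on its own, so it can help to report the share of zeros and the statistics of the non-zero balances separately. A sketch on made-up balances:

```python
import pandas as pd

# Made-up balances: half the customers hold nothing.
balance = pd.Series([0.0, 0.0, 80000.0, 0.0, 120000.0, 100000.0, 0.0, 110000.0])

zero_share = (balance == 0).mean()     # fraction of customers with a zero balance
nonzero = balance[balance > 0]

print(zero_share)
print(balance.std(), nonzero.std())    # spread with and without the zeros
print(nonzero.mean())
```

In this toy sample the standard deviation drops sharply once the zeros are set aside; whether the same holds for `cp.Balance` would need to be checked on the real data.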

In [None]:
print(cp.EstimatedSalary.max())
print(cp.EstimatedSalary.min())
print(cp.EstimatedSalary.mean())
print(cp.EstimatedSalary.std())

# EstimatedSalary variable

**Estimated salaries are spread widely: salaries differ between customers, but the number of customers in each salary range is roughly the same.**

# EDA

In [None]:
sns.pairplot(cp, hue = 'Exited')

In [None]:
f,ax=plt.subplots(figsize=(10,10))
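# Geography and Gender are strings, so restrict the correlation to numeric columns
sns.heatmap(cp.select_dtypes(include="number").corr(),annot=True,fmt= ".4f",ax=ax)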
plt.show()

#### The heatmap shows that Age, Balance, and EstimatedSalary correlate positively with Exited, meaning that as their values increase, the probability of Exited being 1 increases.

#### CreditScore, Tenure, NumOfProducts, HasCrCard, and IsActiveMember, in contrast, correlate negatively with Exited, meaning that higher values of these variables go with a lower chance of exiting.
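
The signs and relative sizes are easier to compare when the target's correlations are pulled out of the matrix and sorted. A sketch on a made-up frame (column names follow the dataset, values do not):

```python
import pandas as pd

# Toy frame; values are made up for illustration.
toy = pd.DataFrame({
    "Age": [25, 30, 35, 40, 45, 50, 55, 60],
    "IsActiveMember": [1, 1, 1, 0, 1, 0, 0, 0],
    "Geography": ["France", "Spain", "France", "Germany",
                  "Germany", "France", "Spain", "Germany"],
    "Exited": [0, 0, 0, 0, 1, 0, 1, 1],
})

# Keep numeric columns only; string columns cannot enter a Pearson correlation.
numeric = toy.select_dtypes(include="number")

# Correlation of every numeric feature with the target, most negative first.
target_corr = numeric.corr()["Exited"].drop("Exited").sort_values()
print(target_corr)
```

A correlation close to zero says little either way, so the magnitude matters as much as the sign.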

In [None]:
cp[["Age","Exited"]].groupby(["Exited"],as_index=False).mean().sort_values(by="Age",ascending=False)

#### The average age of those who left is 44.8 while the average age of those who did not leave is 37.4. So the age of the people who leave is generally higher.

In [None]:
cp[["IsActiveMember","Exited"]].groupby(["IsActiveMember"],as_index=False).mean().sort_values(by="IsActiveMember",
                                                                                              ascending=False)

#### 14% of active members and 26% of inactive members left the bank.
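
Because `Exited` is coded 0/1, its group mean is the churn rate. Reporting the group size next to it shows how many customers each rate rests on. A sketch with made-up data, using named aggregation:

```python
import pandas as pd

# Toy data; values are made up.
toy = pd.DataFrame({
    "IsActiveMember": [1, 1, 1, 1, 0, 0, 0, 0, 0, 1],
    "Exited":         [0, 0, 0, 1, 1, 0, 1, 0, 0, 0],
})

# Mean of a 0/1 column = share of ones; count = group size.
summary = toy.groupby("IsActiveMember")["Exited"].agg(
    churn_rate="mean",
    customers="count",
)
print(summary)
```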

In [None]:
cp[["NumOfProducts","Exited"]].groupby(["NumOfProducts"],as_index=False).mean().sort_values(by="NumOfProducts",ascending=False)

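#### The mean of Exited in each row is the churn rate for that product count. Customers with 1 or 2 products account for most churned customers in absolute terms because those groups are by far the largest, so compare the rates rather than the counts.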

In [None]:
sns.countplot(x="Geography",hue="Exited",data=cp)

#### In France, Spain, and Germany, the number of customers who have not left is much higher than the number who have exited; the biggest gap between exited and not exited is in France.
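
A count plot compares absolute numbers, which depend on how many customers each country has. Row-normalising a crosstab gives the churn rate per country instead. A sketch on made-up data, where the country with the larger count gap is not the one with the higher rate:

```python
import pandas as pd

# Toy data; values are made up.
toy = pd.DataFrame({
    "Geography": ["France"] * 6 + ["Germany"] * 4,
    "Exited":    [0, 0, 0, 0, 0, 1,   0, 0, 1, 1],
})

counts = pd.crosstab(toy["Geography"], toy["Exited"])                    # what countplot shows
rates = pd.crosstab(toy["Geography"], toy["Exited"], normalize="index")  # each row sums to 1

print(counts)
print(rates)
```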

In [None]:
sns.countplot(x="Gender",hue="Exited",data=cp)

#### Female customers are more likely to churn than male customers

In [None]:
sns.countplot(x="IsActiveMember",hue="Exited",data=cp)

#### Inactive customers are more likely to churn

In [None]:
plt.scatter(cp.EstimatedSalary,cp.CreditScore,alpha=0.1,color="r")
plt.xlabel("EstimatedSalary")
plt.ylabel("CreditScore")
plt.show()

#### People with high salaries rarely have a low credit score. The points are densest between credit scores of 500 and 700.

In [None]:
plt.scatter(cp.Age,cp.EstimatedSalary,alpha=0.1,color="g")
plt.xlabel("Age")
plt.ylabel("EstimatedSalary")
plt.show()

#### The density of people between the ages of 26 and 42 is high. A few older people, around the age of 80, also have a high salary.

In [None]:
plt.scatter(cp.Balance,cp.Age,alpha=0.1,color="b")
plt.xlabel("Balance")
plt.ylabel("Age")
plt.show()

#### For people aged 26 to 42, balances lie mostly between 60,000 and 170,000.

In [None]:
plt.figure(figsize=(6,8))
sns.boxplot(x='Exited', y='EstimatedSalary', hue = "Exited", data=cp)

#### No difference in estimated salary between exited and non-exited customers is visible.

In [None]:
plt.figure(figsize=(6,6))
sns.boxplot(x='Exited', y='Age',hue = "Exited", data=cp)

#### More churn is visible in the 40 to 50 age group.

In [None]:
cp['Age'].min(), cp['Age'].max()

In [None]:
# include_lowest=True keeps customers aged exactly 18 in the first bin
cp['AgeGrp'] = pd.cut(cp['Age'],
                     bins=[18,30,50,65,99], labels=['18-30','31-50', '51-65','Above65'],
                     include_lowest=True)
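
`pd.cut` builds right-closed intervals by default, so a value equal to the first bin edge falls outside every bin and becomes `NaN` unless `include_lowest=True` is passed. A short sketch on made-up ages showing the difference:

```python
import pandas as pd

# Made-up ages, including one exactly on the lowest bin edge.
ages = pd.Series([18, 19, 30, 31, 50, 66])
bins = [18, 30, 50, 65, 99]
labels = ["18-30", "31-50", "51-65", "Above65"]

default_cut = pd.cut(ages, bins=bins, labels=labels)                         # (18, 30], (30, 50], ...
inclusive_cut = pd.cut(ages, bins=bins, labels=labels, include_lowest=True)  # first bin includes 18

print(default_cut.tolist())
print(inclusive_cut.tolist())
```

Comparing `cp['AgeGrp'].isna().sum()` with zero is a quick way to confirm that no customer was dropped by the binning.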

In [None]:
cp['AgeGrp'].value_counts()

In [None]:
sns.countplot(x='AgeGrp', data=cp)

In [None]:
pd.crosstab(cp['AgeGrp'], cp['Exited']).plot(kind='bar', stacked=True, figsize=(6,6))

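#### The stacked bars show counts: the largest number of exits comes from the 31 to 50 age group, which is also the largest group. Comparing exit rates across groups requires the share of exits within each group.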
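
Counts and rates can point to different age groups. The sketch below, on made-up data, computes both per group; `observed=True` simply leaves out age bins that have no customers in the toy sample.

```python
import pandas as pd

# Toy data; ages and outcomes are made up.
toy = pd.DataFrame({
    "Age":    [25, 28, 35, 38, 40, 42, 45, 48, 55, 60],
    "Exited": [0,  0,  0,  1,  0,  1,  1,  0,  1,  1],
})
toy["AgeGrp"] = pd.cut(
    toy["Age"],
    bins=[18, 30, 50, 65, 99],
    labels=["18-30", "31-50", "51-65", "Above65"],
    include_lowest=True,
)

# exits = number who left, customers = group size, rate = share who left.
by_group = toy.groupby("AgeGrp", observed=True)["Exited"].agg(
    exits="sum",
    customers="count",
    rate="mean",
)
print(by_group)
```

In this toy sample the 31-50 group has the most exits while the 51-65 group has the highest exit rate, which is why the two measures should be reported separately.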

In [None]:
plt.figure(figsize=(6,7))
sns.boxplot(x='Exited', y='Balance',hue ="Exited", data=cp)
plt.title('CHURN vs BALANCE')

#### Customers who maintain a low balance are also likely to churn.

In [None]:
plt.figure(figsize=(6,6))
sns.barplot(x='Exited', y='Balance', hue='HasCrCard', data=cp)
plt.legend(title = 'Has credit card', loc=0)

#### The graph above shows how the mean balance varies between credit card holders and non-holders, for exited and retained customers.

In [None]:
plt.figure(figsize=(6,6))
sns.countplot(x='NumOfProducts', hue='Exited', data=cp)
plt.title('CHURN vs NUMBER OF PRODUCTS TAKEN')

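#### In absolute counts, most churned customers hold a low number of products. Since this plot shows counts rather than ratios, use the NumOfProducts groupby table above to compare churn ratios.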

In [None]:
pd.crosstab(cp['IsActiveMember'], cp['Exited']).plot(kind='bar', stacked=True, figsize=(6,6))

#### Inactive customers are more likely to churn