# Data Mining

#### We assume the data is a snapshot taken at a single date and time, since it contains no time dimension.
#### Our main concern is to find the factors that lead to customer churn. A few questions should be asked before starting the Exploratory Data Analysis; I will answer them at the end.
Questions:
1. Which age group leaves the bank the most?
2. Within a given age group, did men or women churn more?
3. How does a member's activity with the bank affect churn?
4. Which country lost the most customers?
5. Do the ranges or values of any other variables differ in a way that affects churn?

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

In [None]:
file_1 = pd.read_csv('../input/churn-prediction-of-bank-customers/Churn_Modelling.csv')

In [None]:
df = pd.DataFrame(file_1)

In [None]:
df.shape

In [None]:
df.set_index('CustomerId', inplace=True)
df.head()

In [None]:
df.describe()

In [None]:
# Split the data into two DataFrames: retained (Exited == 0) and exited (Exited == 1) customers.
df_e0 = df.loc[df.Exited == 0, :]
df_e1 = df.loc[df.Exited == 1, :]

In [None]:
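plt.figure(figsize=(14,12))
# Age distribution of retained customers (histplot replaces the deprecated distplot)
plt.subplot(2,2,1)
sns.histplot(x=df_e0['Age'], kde=True, stat='density')
plt.ylabel('PDF')
plt.title('Still with Bank')
# Age distribution of customers who left
plt.subplot(2,2,2)
sns.histplot(x=df_e1['Age'], kde=True, stat='density')
plt.ylabel('PDF')
plt.title('Left the Bank')
# Age by churn status, then age by gender split by churn status
plt.subplot(2,2,3)
sns.boxplot(x= df.Exited, y= df.Age, hue=df.Exited)
plt.subplot(2,2,4)
sns.boxplot(x= df.Gender, y= df.Age, hue=df.Exited)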

### Inference
The ages of customers who left the bank are close to normally distributed, with a peak between 40 and 50.

In [None]:
df.Age.median(), df_e1.Age.median()
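
The medians summarise each group with a single number. To see which age group churns the most, one option is to bin the ages and compute the churn rate per bin. The following is a minimal sketch on hypothetical toy data; the bin edges, the labels and the `toy` frame are assumptions for illustration, and the same steps should carry over to `df`.

```python
import pandas as pd

# Hypothetical toy data standing in for the notebook's `df`
toy = pd.DataFrame({
    "Age":    [25, 34, 38, 42, 45, 47, 52, 58, 63, 71],
    "Exited": [0,  0,  0,  1,  1,  0,  1,  1,  0,  0],
})

# Assumed bin edges; right=False makes every bin [lower, upper)
bins = [18, 30, 40, 50, 60, 100]
labels = ["18-29", "30-39", "40-49", "50-59", "60+"]
toy["AgeGroup"] = pd.cut(toy["Age"], bins=bins, labels=labels, right=False)

# The mean of a 0/1 column is the share of churners in each group
age_churn = (
    toy.groupby("AgeGroup", observed=False)["Exited"]
       .agg(customers="count", churn_rate="mean")
)
print(age_churn)
```

Reporting the customer count next to the rate helps to spot bins that are too small for the rate to be trusted.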

In [None]:
# Pass the column as x=; recent seaborn versions do not accept it positionally
sns.countplot(x=df.Gender, hue=df.Exited)

### Inference
More women than men have left the bank. The percentage of women who left is also much higher than the percentage of men who left.

In [None]:
perc_f_e1 = np.sum(df_e1.Gender == 'Female')*100 / np.sum(df.Gender == 'Female')
perc_m_e1 = np.sum(df_e1.Gender == 'Male')*100 / np.sum(df.Gender == 'Male')
perc_f_e1, perc_m_e1

In [None]:
# Pass the column as x=; recent seaborn versions do not accept it positionally
sns.countplot(x=df.IsActiveMember, hue=df.Exited)

### Inference
More inactive members than active members have left the bank.

In [None]:
perc_a0_e1 = np.sum(df_e1.IsActiveMember == 0)*100 / np.sum(df.IsActiveMember == 0)
perc_a1_e1 = np.sum(df_e1.IsActiveMember == 1)*100 / np.sum(df.IsActiveMember == 1)
perc_a0_e1, perc_a1_e1

In [None]:
# Pass the column as x=; recent seaborn versions do not accept it positionally
sns.countplot(x=df.Geography, hue=df.Exited)

### Inference
It seems that most of the customers who left are from Germany. Put differently, Germany has the highest percentage of customers who left.

In [None]:
perc_gf_e1 = np.sum(df_e1.Geography == 'France')*100 / np.sum(df.Geography == 'France')
perc_gs_e1 = np.sum(df_e1.Geography == 'Spain')*100 / np.sum(df.Geography == 'Spain')
perc_gg_e1 = np.sum(df_e1.Geography == 'Germany')*100 / np.sum(df.Geography == 'Germany')
perc_gf_e1, perc_gs_e1, perc_gg_e1
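
The percentage calculations for Gender, IsActiveMember and Geography all follow the same pattern: the number of churners in a category divided by the number of customers in that category. As a sketch, this can be written once with `groupby`. The helper name `churn_rate_by` and the toy data are hypothetical and only illustrate the idea.

```python
import pandas as pd

def churn_rate_by(frame, column, target="Exited"):
    """Return the percentage of churned customers for each level of `column`.

    Hypothetical helper: the mean of a 0/1 target is the churn share.
    """
    return frame.groupby(column)[target].mean().mul(100).round(2)

# Hypothetical toy data with the same column names as the notebook's `df`
toy = pd.DataFrame({
    "Geography": ["France", "France", "Spain", "Germany", "Germany", "Germany"],
    "Gender":    ["Female", "Male", "Female", "Male", "Female", "Male"],
    "Exited":    [0, 1, 0, 1, 1, 0],
})

geo_rates = churn_rate_by(toy, "Geography")
gender_rates = churn_rate_by(toy, "Gender")
print(geo_rates)
print(gender_rates)
```

On the real data, `churn_rate_by(df, 'Geography')` would be expected to reproduce the three percentages computed by hand above.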

In [None]:
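# numeric_only=True skips text columns such as Surname, which newer pandas refuses to average
df.groupby(['Geography','Gender']).mean(numeric_only=True)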

In [None]:
plt.figure(figsize=(14,12))
plt.subplot(2,2,1)
sns.boxplot(x= df.Exited, y= df.Balance, hue=df.Exited)
plt.subplot(2,2,2)
sns.boxplot(x= df.Gender, y= df.Balance, hue=df.Exited)
plt.subplot(2,2,3)
sns.boxplot(x= df.Geography, y= df.Balance, hue=df.Exited)
plt.subplot(2,2,4)
sns.boxplot(x= df.Geography, y= df.Balance, hue=df.Gender)

### Inference
Many of the women who have not exited have a balance of zero.
All of the zero-balance customers are in France and Spain, while customers in Germany have a non-zero balance whether they left or are still with the bank.
Zero balances are most common among women in Spain.
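
Box plots only hint at how many customers sit at exactly zero, so a small table can make the reading above easier to check. The sketch below computes the percentage of zero-balance customers per Geography and Gender on hypothetical toy data; the `ZeroBalance` column is an assumption introduced for illustration.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the notebook's `df`
toy = pd.DataFrame({
    "Geography": ["France", "France", "Spain", "Spain", "Germany", "Germany"],
    "Gender":    ["Female", "Male", "Female", "Female", "Male", "Female"],
    "Balance":   [0.0, 1200.5, 0.0, 830.0, 5400.0, 9100.0],
})

# Flag zero balances as 0/1 so that the group mean is the zero-balance share
toy["ZeroBalance"] = toy["Balance"].eq(0).astype(int)

zero_share = (
    toy.groupby(["Geography", "Gender"])["ZeroBalance"]
       .mean()
       .mul(100)
       .unstack("Gender")   # rows: Geography, columns: Gender
)
print(zero_share)
```

Combinations that do not occur in the data (here, men in Spain) show up as `NaN` rather than as 0.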

In [None]:
plt.figure(figsize=(14,12))
plt.subplot(2,2,1)
sns.boxplot(x= df.Exited, y= df.Tenure, hue=df.Exited)
plt.subplot(2,2,2)
sns.boxplot(x= df.Gender, y= df.Tenure, hue=df.Exited)
plt.subplot(2,2,3)
sns.boxplot(x= df.Geography, y= df.Tenure, hue=df.Exited)
plt.subplot(2,2,4)
sns.boxplot(x= df.Geography, y= df.Tenure, hue=df.Gender)

In [None]:
plt.figure(figsize=(14,12))
plt.subplot(2,2,1)
sns.boxplot(x= df.Exited, y= df.NumOfProducts, hue=df.Exited)
plt.subplot(2,2,2)
sns.boxplot(x= df.Gender, y= df.NumOfProducts, hue=df.Exited)
plt.subplot(2,2,3)
sns.boxplot(x= df.Geography, y= df.NumOfProducts, hue=df.Exited)
plt.subplot(2,2,4)
sns.boxplot(x= df.Geography, y= df.NumOfProducts, hue=df.Gender)

In [None]:
plt.figure(figsize=(14,12))
plt.subplot(2,2,1)
sns.boxplot(x= df.Exited, y= df.EstimatedSalary, hue=df.Exited)
plt.subplot(2,2,2)
sns.boxplot(x= df.Gender, y= df.EstimatedSalary, hue=df.Exited)
plt.subplot(2,2,3)
sns.boxplot(x= df.Geography, y= df.EstimatedSalary, hue=df.Exited)
plt.subplot(2,2,4)
sns.boxplot(x= df.Geography, y= df.EstimatedSalary, hue=df.Gender)

### Inference
Customers who left the bank seem to have a slightly higher salary than those who stayed. The difference is small, however, so it could be a coincidence or sampling error.

### We have seen how the categorical and numerical variables relate to Exited.
### Now we will look at the relationships between the variables themselves.

In [None]:
df.head()

In [None]:
plt.figure(figsize=(14,12))
plt.subplot(2,2,1)
sns.regplot(x=df.Age, y=df.CreditScore)
plt.subplot(2,2,2)
sns.regplot(x=df.Age, y=df.Balance)
plt.subplot(2,2,3)
sns.regplot(x=df.Age, y=df.EstimatedSalary)
plt.subplot(2,2,4)
sns.boxplot(y=df.Age, x=df.HasCrCard, hue=df.HasCrCard)

In [None]:
df[df.Balance == 0].Exited.value_counts()

## We can now conclude our findings.

1. The median age of customers the bank retained is 37, with most values between 30 and 40. The median age of customers who exited is 45, with most values between 40 and 50, and their ages are approximately normally distributed. To improve on this, the bank could offer services aimed at older customers.
2. Women left the bank more than men did: about 25% of women churned, compared with about 16% of men. This suggests the bank may not be serving its female customers well, and it could consider services tailored to them.
3. Inactive members churned more, at 27%, while only 14% of active members churned. That active members churn at all suggests some customers were dissatisfied with a new policy, or that another bank offered a scheme with much better services.
4. Most of the churn happened in Germany, where almost 32% of customers left the bank. Customers in Germany also have the highest salary and balance, which suggests the bank is not catering to its wealthier customers with premium services.
5. In Spain, around half of the female customers have a balance of zero.

## Hypotheses for further investigation

1. Null: The ages of customers who left the bank and of those who stayed are similar. Alternative: They are not similar.
2. Null: The credit scores of customers who left the bank and of those who stayed are similar. Alternative: They are not similar.
3. Null: The balances of customers who left the bank and of those who stayed are similar. Alternative: They are not similar.
4. Null: The estimated salaries of customers who left the bank and of those who stayed are similar. Alternative: They are not similar.
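
One possible way to examine these hypotheses is a permutation test on the difference in means, which needs only NumPy and makes no normality assumption. This is a sketch, not the method used in this notebook: the function `permutation_test`, its parameters, and the synthetic ages are all assumptions chosen for illustration.

```python
import numpy as np

def permutation_test(a, b, n_resamples=5000, seed=0):
    """Two-sided permutation test for a difference in means (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = a.mean() - b.mean()
    pooled = np.concatenate([a, b])
    extreme = 0
    for _ in range(n_resamples):
        # Under the null, group labels are exchangeable, so reshuffle them
        shuffled = rng.permutation(pooled)
        diff = shuffled[:a.size].mean() - shuffled[a.size:].mean()
        if abs(diff) >= abs(observed):
            extreme += 1
    # The +1 terms keep the estimated p-value strictly above zero
    p_value = (extreme + 1) / (n_resamples + 1)
    return observed, p_value

# Synthetic ages, NOT the bank data: two groups with different means
rng = np.random.default_rng(42)
left_age = rng.normal(loc=45, scale=8, size=200)
stayed_age = rng.normal(loc=37, scale=8, size=200)

observed, p_value = permutation_test(left_age, stayed_age)
print(observed, p_value)
```

With the notebook's data, the call would look like `permutation_test(df_e1.Age, df_e0.Age)`, and the same function could be reused for CreditScore, Balance and EstimatedSalary. A small p-value would be evidence against the corresponding null hypothesis.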