# Exploratory Data Analysis - Bank Customer Attrition

## Objectives

Explore the Bank Customer Attrition dataset to investigate and carry out the following:

* Identify Outliers
* Explore feature engineering
* Descriptive Analysis
* Statistical Analysis

## Inputs

* data/inputs/cleaned_bank_data.csv

## Outputs

* INSERT CSV NAME HERE

___________________

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [12]:
# Load cleaned dataset
df = pd.read_csv('../data/inputs/cleaned_bank_data.csv')

# Display first row of DataFrame
df.head(1)

Unnamed: 0.1,Unnamed: 0,RowNumber,CustomerId,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited,Complain,Satisfaction Score,Card Type,Point Earned
0,0,1,15598695,619,France,Female,42,2,0.0,1,1,1,101348.88,1,1,2,DIAMOND,464


_____________________

## Statistical Analysis

Explore the basic statistics from the cleaned dataset.

In [21]:
# Display summary statistics for the categorical (object) columns
categorical_stats = df.describe(include='object')
display(categorical_stats)

Unnamed: 0,Geography,Gender,Card Type
count,10000,10000,10000
unique,3,2,4
top,France,Male,DIAMOND
freq,5014,5457,2507


**Observations:**

* **Geography** - The column has three unique values, so the dataset covers customers in 3 countries. France is the most common, accounting for 5014 of the 10000 entries (50.14%).

* **Gender** - There are more male customers than female customers: 5457 of the 10000 entries (54.57%) are male.

* **Card Type** - There are 4 different card types. DIAMOND is the most common, held by 2507 of the 10000 customers (25.07%), which is only slightly above an even quarter share.
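
The counts in the observations above can also be expressed as percentage shares, which makes the columns easier to compare. The following is a minimal sketch on a small, made-up frame rather than the real dataset; the `toy` frame and the `category_shares` helper are hypothetical names introduced only for illustration.

```python
import pandas as pd

# Toy stand-in for the cleaned dataset; rows are illustrative, not real records
toy = pd.DataFrame({
    "Geography": ["France", "France", "Spain", "Germany"],
    "Gender": ["Male", "Female", "Male", "Male"],
    "Card Type": ["DIAMOND", "GOLD", "GOLD", "DIAMOND"],
})


def category_shares(frame, columns):
    """Return each category's share of rows (as a percentage) per column."""
    return {
        col: frame[col].value_counts(normalize=True).mul(100).round(2)
        for col in columns
    }


shares = category_shares(toy, ["Geography", "Gender", "Card Type"])
for col, pct in shares.items():
    print(f"{col}:\n{pct}\n")
```

In the notebook, passing `df` instead of `toy` should give the shares for the full 10000 rows, assuming the column names match those shown in the output above.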

Investigate the mean, median, standard deviation and range of the numeric columns.

In [23]:
# Perform basic statistical analysis
summary_stats = df.describe()
styled_summary_stats = summary_stats.style.background_gradient(cmap='Blues')
display(styled_summary_stats)

Unnamed: 0.1,Unnamed: 0,RowNumber,CustomerId,CreditScore,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited,Complain,Satisfaction Score,Point Earned
count,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0
mean,4999.5,5000.5,15690940.5694,650.5288,38.9218,5.0128,76485.889288,1.5302,0.7055,0.5151,100090.239881,0.2038,0.2044,3.0138,606.5151
std,2886.89568,2886.89568,71936.186123,96.653299,10.487806,2.892174,62397.405202,0.581654,0.45584,0.499797,57510.492818,0.402842,0.403283,1.405919,225.924839
min,0.0,1.0,15565701.0,350.0,18.0,0.0,0.0,1.0,0.0,0.0,11.58,0.0,0.0,1.0,119.0
25%,2499.75,2500.75,15628528.25,584.0,32.0,3.0,0.0,1.0,0.0,0.0,51002.11,0.0,0.0,2.0,410.0
50%,4999.5,5000.5,15690738.0,652.0,37.0,5.0,97198.54,1.0,1.0,1.0,100193.915,0.0,0.0,3.0,605.0
75%,7499.25,7500.25,15753233.75,718.0,44.0,7.0,127644.24,2.0,1.0,1.0,149388.2475,0.0,0.0,4.0,801.0
max,9999.0,10000.0,15815690.0,850.0,92.0,10.0,250898.09,4.0,1.0,1.0,199992.48,1.0,1.0,5.0,1000.0


**Analysis Summary**:

This summary is based on the numeric columns relevant to the 3 hypotheses for the project.

**CreditScore:**

* **Mean** = 650.53

* **Median** = 652

* **Standard Deviation** = 96.65. Scores typically fall about 97 points either side of the mean, which indicates considerable variation in credit scores across customers.

* **Range** = 350 - 850. Credit scores in the dataset span 500 points, from a minimum of 350 to a maximum of 850.


**NumOfProducts:**

* **Mean** = 1.53

* **Median** = 1.00

* **Standard Deviation** = 0.58. This value is low, which suggests there isn't much variability within this column (as expected, as there are only 4 unique values).

* **Range** = 1 - 4. The range indicates that customers hold between 1 and 4 products.


**Point Earned:**

* **Mean** = 606.52

* **Median** = 605

* **Standard Deviation** = 225.92. This value is high relative to the mean, which suggests substantial variability in the number of points customers have earned.

* **Range** = 119 - 1000. This reflects the wide spread of points earned across the dataset population.
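
The figures in this summary can be computed directly instead of being read off the `describe()` table, which avoids transcription and rounding slips. The sketch below is one possible way to do it; it uses the five rows shown by `df.head()` in this notebook as a stand-in, so its numbers will not match the full-dataset values. The names `toy`, `cols` and `summary` are illustrative.

```python
import pandas as pd

# Stand-in data: the three hypothesis columns from the first five rows
# displayed in this notebook (not the full 10000-row dataset)
toy = pd.DataFrame({
    "CreditScore": [619, 608, 502, 699, 850],
    "NumOfProducts": [1, 1, 3, 2, 1],
    "Point Earned": [464, 456, 377, 350, 425],
})

cols = ["CreditScore", "NumOfProducts", "Point Earned"]

# One row per column, one statistic per field
summary = toy[cols].agg(["mean", "median", "std", "min", "max"]).T

# Range expressed as a single number (max minus min)
summary["range"] = summary["max"] - summary["min"]

print(summary.round(2))
```

Replacing `toy` with `df` in the notebook should reproduce the mean, median, standard deviation and range for the full dataset, assuming the column names are unchanged.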

## Explore Feature Engineering

Feature engineering is explored in this notebook to identify the best strategies to implement in the ETL pipeline.

Create a new column to group continuous age values for improved interpretability, enhanced analysis and simplified modelling.

Group age into the following bins:

* 18-29
* 30-39
* 40-49
* 50-59
* 60-69
* 70 +

In [15]:
# Group age into bins
# Define the bin edges and labels for age groups
bins = [18, 30, 40, 50, 60, 70, np.inf]
labels = ['18-29', '30-39', '40-49', '50-59', '60-69', '70+']

# Create a new column 'AgeGroup' in the DataFrame (Copilot assistance)
# right=False makes each bin left-closed, e.g. [18, 30) covers ages 18-29
df['AgeGroup'] = pd.cut(df['Age'], bins=bins, labels=labels, right=False)


In [20]:
# Display the first few rows of the DataFrame to verify the new column
df.head()

Unnamed: 0.1,Unnamed: 0,RowNumber,CustomerId,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited,Complain,Satisfaction Score,Card Type,Point Earned,AgeGroup
0,0,1,15598695,619,France,Female,42,2,0.0,1,1,1,101348.88,1,1,2,DIAMOND,464,40-49
1,1,2,15649354,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0,1,3,DIAMOND,456,40-49
2,2,3,15737556,502,France,Female,42,8,159660.8,3,1,0,113931.57,1,1,3,DIAMOND,377,40-49
3,3,4,15671610,699,France,Female,39,1,0.0,2,0,0,93826.63,0,0,5,GOLD,350,30-39
4,4,5,15625092,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0,0,5,GOLD,425,40-49
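
The first five rows only cover ages 39 to 43, so they do not exercise the bin edges. As a rough sanity check, the sketch below applies the same `bins`, `labels` and `right=False` settings to a handful of hand-picked ages. The test ages are hypothetical, and 17 is included deliberately to show that values below the first edge come back as `NaN`; the summary statistics above report a minimum age of 18, so this should not occur in the real data.

```python
import numpy as np
import pandas as pd

# Same bin edges and labels as the AgeGroup cell in this notebook
bins = [18, 30, 40, 50, 60, 70, np.inf]
labels = ["18-29", "30-39", "40-49", "50-59", "60-69", "70+"]

# Hypothetical boundary ages chosen to sit on or next to each edge
ages = pd.Series([18, 29, 30, 39, 40, 69, 70, 92, 17])

# right=False makes bins left-closed: [18, 30), [30, 40), ..., [70, inf)
groups = pd.cut(ages, bins=bins, labels=labels, right=False)

check = pd.DataFrame({"Age": ages, "AgeGroup": groups})
print(check)

# Count ages that fell outside every bin (only the age of 17 here)
unbinned = int(groups.isna().sum())
print(f"Unbinned ages: {unbinned}")
```

Running `df['AgeGroup'].isna().sum()` in the notebook is a similar check on the real column; a non-zero result would suggest ages outside the defined bins.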


_______________

## Outlier Detection