# Hypothesis Testing

## Objectives

* Test three hypotheses about customer attrition using a T-Test, a Pearson correlation and a Mann-Whitney U Test.

## Inputs

* data/inputs/cleaned_bank_data.csv

## Outputs

* data/inputs/cleaned_bank_data.csv (No changes to dataset)

_________________

The cleaned bank customer attrition dataset will be loaded into this notebook for hypothesis testing.

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
# Load the cleaned dataset
df = pd.read_csv('../data/inputs/cleaned_bank_data.csv')
df.head() # Display the first few rows of the dataset

Unnamed: 0.1,Unnamed: 0,RowNumber,CustomerId,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited,Complain,SatisfactionScore,CardType,PointEarned,AgeGroup
0,0,1,15598695,619,France,Female,42,2,0.0,1,1,1,101348.88,1,1,2,DIAMOND,464,40-49
1,1,2,15649354,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0,1,3,DIAMOND,456,40-49
2,2,3,15737556,502,France,Female,42,8,159660.8,3,1,0,113931.57,1,1,3,DIAMOND,377,40-49
3,3,4,15671610,699,France,Female,39,1,0.0,2,0,0,93826.63,0,0,5,GOLD,350,30-39
4,4,5,15625092,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0,0,5,GOLD,425,40-49


In [5]:
# Drop the leftover index columns ('Unnamed: 0.1' and 'Unnamed: 0') if they exist
df.drop(columns=['Unnamed: 0.1', 'Unnamed: 0'], inplace=True, errors='ignore')
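
Columns named `Unnamed: 0` and `Unnamed: 0.1`, as seen in the `df.head()` output, usually appear when a DataFrame is written to CSV with its index and then reloaded. A minimal sketch of removing every such column by name pattern follows. The CSV text and the names `demo` and `demo_clean` are made up for illustration and stand in for the real file.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real CSV: two leftover index columns plus data.
csv_text = (
    "Unnamed: 0.1,Unnamed: 0,CreditScore,Exited\n"
    "0,0,619,1\n"
    "1,1,608,0\n"
)
demo = pd.read_csv(io.StringIO(csv_text))

# Keep only the columns whose name does not start with "Unnamed".
demo_clean = demo.loc[:, ~demo.columns.str.startswith("Unnamed")]
print(list(demo_clean.columns))
```

Writing the cleaned file with `df.to_csv(path, index=False)` in the cleaning notebook would stop these columns from being created in the first place.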

## **Hypothesis 1:** Using a T-Test

Customers who earn fewer reward points are more likely to attrite.

Reward points often reflect transaction activity and customer engagement. If customers earning fewer points are more likely to leave, it suggests that low engagement is a key driver of attrition. This insight could help the bank identify disengaged customers early and design targeted retention strategies.
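
Before running the test, it can help to look at the group summaries that the test compares. The sketch below assumes a small invented DataFrame (`toy`) that reuses the notebook's column names `Exited` and `PointEarned`. The values are not taken from the real dataset.

```python
import pandas as pd

# Invented toy data using the notebook's column names; not the real dataset.
toy = pd.DataFrame({
    "Exited": [1, 1, 1, 0, 0, 0, 0],
    "PointEarned": [400, 520, 610, 450, 500, 640, 700],
})

# Per-group size, mean, median and standard deviation of reward points.
summary = toy.groupby("Exited")["PointEarned"].agg(["count", "mean", "median", "std"])
print(summary)
```

On the real data the same call would be `df.groupby('Exited')['PointEarned'].agg([...])`.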




In [7]:
# CoPilot Assistance
# Import necessary libraries for hypothesis testing
from scipy.stats import ttest_ind

# Split the dataset into two groups: customers who exited and those who did not
attrited_customers = df[df['Exited'] == 1]['PointEarned']  # Customers who left the bank
retained_customers = df[df['Exited'] == 0]['PointEarned']  # Customers who stayed

# Perform an independent T-Test to compare the mean reward points between the two groups
t_stat, p_value = ttest_ind(attrited_customers, retained_customers, equal_var=False)

# Display the results
print("T-Statistic:", t_stat)
print("P-Value:", p_value)

# Interpret the results
alpha = 0.05  # Significance level
if p_value < alpha:
    print("Reject the null hypothesis: Mean reward points differ significantly between attrited and retained customers.")
else:
    print("Fail to reject the null hypothesis: No significant difference in reward points between attrited and retained customers.")

T-Statistic: -0.46372006220720496
P-Value: 0.6428801754239084
Fail to reject the null hypothesis: No significant difference in reward points between attrited and retained customers.


* The T-Test statistic is -0.4637, which means the difference in mean reward points between attrited and retained customers is less than half a standard error.

* The P-Value is 0.6429, which is greater than the significance level of 0.05. This indicates there is no statistically significant difference, and any variation in values may be down to random chance.

* So we fail to reject the null hypothesis: the data do not support the claim that customers who earn fewer reward points are more likely to attrite.
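
The hypothesis is directional (fewer points among attrited customers), while the test in this section is two-sided. The sketch below shows how a one-sided Welch T-Test and an effect size could be computed. It assumes SciPy 1.6 or later, where `ttest_ind` accepts `alternative`. The helper `cohens_d` and the synthetic samples are illustrative assumptions, not part of the notebook's analysis.

```python
import numpy as np
from scipy.stats import ttest_ind

def cohens_d(a, b):
    """Standardised mean difference using the pooled standard deviation."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    pooled_var = (
        (len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)
    ) / (len(a) + len(b) - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Synthetic reward points; the means, spread and sizes are invented.
rng = np.random.default_rng(0)
attrited_demo = rng.normal(loc=600, scale=225, size=500)
retained_demo = rng.normal(loc=605, scale=225, size=2000)

# alternative="less" tests whether the first group's mean is lower than the second's.
t_one_sided, p_one_sided = ttest_ind(
    attrited_demo, retained_demo, equal_var=False, alternative="less"
)
d = cohens_d(attrited_demo, retained_demo)
print(f"one-sided p = {p_one_sided:.3f}, Cohen's d = {d:.3f}")
```

Reporting an effect size alongside the P-Value shows how large the difference is, not only whether it is detectable.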

## **Hypothesis 2:** Using a Pearson Correlation 

Credit card customers with lower credit scores are more likely to attrite.

Lower credit scores may indicate financial instability or limited access to banking services. These customers might leave voluntarily due to dissatisfaction or be targeted for closure by the bank. Understanding this relationship can guide credit policy and customer support strategies.
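
The hypothesis is worded for credit card customers, whereas the correlation in this section is computed on all rows. One way to match the wording is to keep only rows where `HasCrCard == 1` before testing. The sketch below uses an invented DataFrame (`toy`) with the notebook's column names, so its numbers say nothing about the real data.

```python
import pandas as pd
from scipy.stats import pearsonr

# Invented toy data using the notebook's column names; not the real dataset.
toy = pd.DataFrame({
    "HasCrCard":   [1, 1, 1, 1, 0, 0, 1, 1],
    "CreditScore": [520, 610, 700, 480, 650, 590, 730, 560],
    "Exited":      [1, 0, 0, 1, 0, 1, 0, 1],
})

# Restrict the sample to customers who hold a credit card.
card_holders = toy[toy["HasCrCard"] == 1]

r, p = pearsonr(card_holders["CreditScore"], card_holders["Exited"])
print(f"n = {len(card_holders)}, r = {r:.3f}, p = {p:.3f}")
```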

In [8]:
# CoPilot Assistance
# Import necessary libraries
from scipy.stats import pearsonr

# Perform a correlation analysis between CreditScore and Exited
correlation, p_value = pearsonr(df['CreditScore'], df['Exited'])

# Display the results
print("Correlation Coefficient:", correlation)
print("P-Value:", p_value)

# Interpret the results
alpha = 0.05  # Significance level
if p_value < alpha:
    print("Reject the null hypothesis: There is a significant correlation between credit scores and attrition.")
else:
    print("Fail to reject the null hypothesis: No significant correlation between credit scores and attrition.")

Correlation Coefficient: -0.026771448055269292
P-Value: 0.007422037242741072
Reject the null hypothesis: There is a significant correlation between credit scores and attrition.


* The correlation coefficient is -0.0268, which indicates a very weak negative correlation between Credit Score and Exited.

* The P-Value is 0.0074, which is less than the significance level of 0.05, suggesting the correlation is statistically significant.

* Reject the null hypothesis, as there is a statistically significant, although very weak, negative correlation between the two variables. This suggests a relationship exists, but credit score alone is unlikely to be a strong predictor of attrition.
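
To put the size of this correlation in context, the squared coefficient gives the share of variance in `Exited` that is linearly associated with `CreditScore`, which here is roughly 0.07%. Because `Exited` is a 0/1 variable, the Pearson coefficient is also the point-biserial correlation. The sketch below checks both points. The synthetic arrays `score_demo` and `exited_demo` are invented for the demonstration.

```python
import numpy as np
from scipy.stats import pearsonr, pointbiserialr

# Squared coefficient, using the value reported in the output above.
r_reported = -0.026771448055269292
r_squared = r_reported ** 2
print(f"r^2 = {r_squared:.6f}")

# Synthetic check that Pearson and point-biserial agree for a 0/1 variable.
rng = np.random.default_rng(1)
score_demo = rng.integers(350, 851, size=1000)   # invented credit scores
exited_demo = rng.integers(0, 2, size=1000)      # invented 0/1 attrition flag

r_pearson, _ = pearsonr(score_demo, exited_demo)
r_pb, _ = pointbiserialr(exited_demo, score_demo)
print(np.isclose(r_pearson, r_pb))
```

With a large sample, even a coefficient this small can reach statistical significance, which is why the effect size is worth reading alongside the P-Value.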

## **Hypothesis 3:** Using a Mann-Whitney U Test

Customers with shorter tenure at the bank are more likely to attrite.

If newer customers are more prone to leaving, it may signal issues with onboarding, early experience, or unmet expectations. This insight is crucial for improving customer retention strategies during the first few months of engagement.

In [9]:
# CoPilot Assistance
# Import necessary libraries for hypothesis testing
from scipy.stats import mannwhitneyu

# Split the dataset into two groups: customers who exited and those who did not
attrited_customers = df[df['Exited'] == 1]['Tenure']  # Customers who left the bank
retained_customers = df[df['Exited'] == 0]['Tenure']  # Customers who stayed

# Perform the Mann-Whitney U Test to compare the tenure between the two groups
u_stat, p_value = mannwhitneyu(attrited_customers, retained_customers, alternative='two-sided')

# Display the results
print("U-Statistic:", u_stat)
print("P-Value:", p_value)

# Interpret the results
alpha = 0.05  # Significance level
if p_value < alpha:
    print("Reject the null hypothesis: Tenure differs significantly between attrited and retained customers.")
else:
    print("Fail to reject the null hypothesis: No significant difference in tenure between attrited and retained customers.")

U-Statistic: 7955556.0
P-Value: 0.1729583374544681
Fail to reject the null hypothesis: No significant difference in tenure between attrited and retained customers.


* The U-Statistic is 7955556.0. It is only meaningful relative to the group sizes: the closer it is to half the product of the two group sizes, the more the tenure distributions of the two groups overlap.

* The P-Value is 0.1730, which is greater than the significance level of 0.05. This indicates the observed difference in Tenure between attrited and retained customers is not statistically significant.

* Fail to reject the null hypothesis, as the difference is not statistically significant.
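
The U-Statistic can be rescaled into an effect size that does not depend on the group sizes. The sketch below assumes SciPy 1.7 or later, where `mannwhitneyu` returns the statistic for the first sample. The helper `u_effect_sizes` and the synthetic tenure values are illustrative assumptions.

```python
import numpy as np
from scipy.stats import mannwhitneyu

def u_effect_sizes(u_stat, n1, n2):
    """Return (common-language effect size, rank-biserial correlation).

    u_stat is assumed to be the U statistic of the first sample.
    """
    cles = u_stat / (n1 * n2)      # P(X > Y) + 0.5 * P(X == Y)
    rank_biserial = 2 * cles - 1   # 0 = complete overlap, +/-1 = complete separation
    return cles, rank_biserial

# Synthetic tenure values between 0 and 10; group sizes are invented.
rng = np.random.default_rng(2)
attrited_demo = rng.integers(0, 11, size=400)
retained_demo = rng.integers(0, 11, size=1600)

u_demo, p_demo = mannwhitneyu(attrited_demo, retained_demo, alternative="two-sided")
cles, rb = u_effect_sizes(u_demo, len(attrited_demo), len(retained_demo))
print(f"U = {u_demo}, CLES = {cles:.3f}, rank-biserial = {rb:.3f}")
```

On the real data the call would be `u_effect_sizes(u_stat, len(attrited_customers), len(retained_customers))`. A common-language effect size near 0.5 means a randomly chosen attrited customer is about as likely to have a longer tenure as a shorter one.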