In [36]:
## Null Hypothesis

The null hypothesis (H0) states that none of the features has an effect on the likelihood of customers exiting the bank. Stated formally:

- **H0:** The variations in the features (such as age, balance, geography, gender, etc.) do not lead to significant changes in the customer exit rates.

This hypothesis suggests that any observed differences in exit rates across different customer segments are due to randomness or other factors not captured by the features in our dataset.
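
One way such a null hypothesis could be checked for a single categorical feature is a chi-square test of independence. The sketch below is illustrative only: it runs on synthetic data (the `demo` DataFrame is a hypothetical stand-in, not `bank_data`), and the 0.05 significance level is an assumption rather than something fixed by this analysis.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical stand-in for bank_data: synthetic rows, not the real dataset.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Geography": rng.choice(["France", "Germany", "Spain"], size=1000),
    "Exited": rng.integers(0, 2, size=1000),
})

# Contingency table: counts of each geography versus exit status.
table = pd.crosstab(demo["Geography"], demo["Exited"])

# Chi-square test of independence: H0 says Geography and Exited are unrelated.
chi2, p_value, dof, expected = chi2_contingency(table)

alpha = 0.05  # assumed significance level
print(f"chi2={chi2:.3f}, dof={dof}, p={p_value:.3f}")
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```

The same pattern would apply to `Gender` or any other categorical column; numeric features such as age or balance would need a different test.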


SyntaxError: invalid syntax (2383782876.py, line 3)

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from ydata_profiling import ProfileReport
%matplotlib inline

file = 'cleaned-A.csv'
bank_data = pd.read_csv(file)

In [None]:
bank_data.head()

In [None]:
# Drop the leftover index column; errors='ignore' keeps the cell safe to re-run
bank_data.drop(columns='Unnamed: 0', inplace=True, errors='ignore')

In [None]:
bank_data.head()

In [None]:
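# Encode Gender as 0/1; skip the mapping if the column is already numeric (safe on re-run)
if not pd.api.types.is_numeric_dtype(bank_data['Gender']):
    bank_data['Gender'] = bank_data['Gender'].map({'Male': 1, 'Female': 0})
# One-hot encode Geography as integer columns
if 'Geography' in bank_data.columns:
    bank_data = pd.get_dummies(bank_data, columns=['Geography'], drop_first=True, dtype=int)

# Any remaining non-numeric columns are left out of the correlation matrix below.

# Now, compute the correlation matrix over the numeric columns
corr_matrix_encoded = bank_data.corr(numeric_only=True)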

# Set up the matplotlib figure
plt.figure(figsize=(12, 8))

# Generate a heatmap
sns.heatmap(corr_matrix_encoded, annot=True, fmt=".2f", cmap='coolwarm', cbar=True)

# Set titles and labels for better readability
plt.title('Heatmap of Bank Data Correlation Matrix')
plt.xlabel('Features')
plt.ylabel('Features')

# Display the heatmap
plt.show()

In [None]:
# Analysis of Customer Churn by Age

## Key Findings

Our statistical analysis revealed a **moderate positive correlation** between customer age and the likelihood of churn, with a **correlation coefficient** of approximately **0.34**. This indicates that older customers tend to exit more than younger ones. The pattern points to a segment of the customer base that is more likely to churn and warrants further investigation into the reasons behind it.

This can be seen in the plots that follow.

### For Older Customers

- **Understanding Needs**: It’s crucial to delve deeper into understanding the specific needs and preferences of older customers. Are there services or aspects of the customer experience that are not meeting their expectations?

- **Customization**: Tailor services and communication to address the unique needs of older customers. This could involve offering products that appeal more to their lifestyle or providing more accessible customer support.

- **Engagement**: Increase engagement with older customers through targeted communication strategies that resonate with their preferences and habits.

### For Younger Customers

- **Retention Strategies**: Although the correlation suggests younger customers are less likely to churn, retention strategies that keep them satisfied with the bank's services are still worth implementing.

- **Innovation and Technology**: Leverage technology and innovative services that appeal to younger customers. This demographic is often more tech-savvy and might appreciate digital-first approaches, mobile services, and online communities.
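
As a side note on the correlation coefficient quoted under Key Findings: with a 0/1 target such as `Exited`, the Pearson coefficient is the point-biserial correlation, and it is unaffected by standardizing the age column. The sketch below illustrates this on synthetic data; the `demo` DataFrame and the logistic relationship used to generate it are assumptions made for illustration, not properties of the bank dataset.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

# Hypothetical synthetic data: exit probability rises with age.
rng = np.random.default_rng(42)
age = rng.integers(18, 80, size=2000)
exit_prob = 1 / (1 + np.exp(-(age - 50) / 10))
demo = pd.DataFrame({"Age": age, "Exited": rng.binomial(1, exit_prob)})

# Pearson correlation with a binary target (point-biserial correlation).
r, p_value = pearsonr(demo["Age"], demo["Exited"])

# Standardizing Age to z-scores leaves the coefficient unchanged.
age_std = (demo["Age"] - demo["Age"].mean()) / demo["Age"].std()
r_std, _ = pearsonr(age_std, demo["Exited"])

print(f"r={r:.4f}, r_standardized={r_std:.4f}, p={p_value:.3g}")
```

Reporting the p-value alongside the coefficient helps judge whether a correlation of this size could plausibly arise by chance.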



In [37]:
# Select only the numeric columns for correlation computation
numeric_columns = bank_data.select_dtypes(include=[np.number])

# Calculate the correlation matrix for numeric columns only
corr_matrix = numeric_columns.corr()

# Look specifically at the correlation between age and 'Exited'.
# 'Age_Standardized' is not in this DataFrame (it raised a KeyError), so fall back to 'Age';
# Pearson correlation is unchanged by standardization, so the value is the same.
age_column = 'Age_Standardized' if 'Age_Standardized' in corr_matrix.columns else 'Age'
# Assuming 'Exited' is a numeric column (0 or 1 for example in case of classification)
age_exited_correlation = corr_matrix.loc[age_column, 'Exited']

# Output the correlation value
print(f"Correlation between {age_column} and churn:", age_exited_correlation)

In [None]:
# Bins are left-closed: [18, 30), [30, 40), ...; the last bin is open-ended so ages 80+ fall into '70+'
bins = [18, 30, 40, 50, 60, 70, np.inf]
labels = ['18-29', '30-39', '40-49', '50-59', '60-69', '70+']
bank_data['AgeGroup'] = pd.cut(bank_data['Age'], bins=bins, labels=labels, right=False)

# bank_data now includes the 'AgeGroup' column

# Calculate counts of entries in each AgeGroup
age_group_counts = bank_data['AgeGroup'].value_counts().sort_index()

# Plotting
plt.figure(figsize=(10, 6))  # Set the figure size (optional)
age_group_counts.plot(kind='bar')
plt.title('Distribution of Age Groups')
plt.xlabel('Age Group')
plt.ylabel('Count')
plt.xticks(rotation=45)  # Rotate labels to make them readable
plt.show()

In [None]:
# Create the box plot for Age distribution by Exited status
plt.figure(figsize=(10, 6))
sns.boxplot(x='Exited', y='Age', data=bank_data)

# Set titles and labels
plt.title('Age Distribution by Exited Status')
plt.xlabel('Exited Status (0 = No, 1 = Yes)')
plt.ylabel('Age')

# Display the plot
plt.show()

In [None]:
The box plot shows that customers who exited skew older than those who stayed: older customers are exiting more than young and middle-aged patrons.
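
The box plot compares age distributions; a churn rate per age group makes the same comparison directly. The following is a minimal sketch on synthetic data, reusing the age bins from the earlier cell. The `demo` DataFrame and its linear age-to-churn relationship are hypothetical and only serve to make the example runnable.

```python
import numpy as np
import pandas as pd

# Hypothetical synthetic data: exit probability grows linearly with age.
rng = np.random.default_rng(7)
age = rng.integers(18, 90, size=3000)
exit_prob = np.clip((age - 18) / 100, 0, 1)
demo = pd.DataFrame({"Age": age, "Exited": rng.binomial(1, exit_prob)})

# Same left-closed bins as above, with an open-ended last bin.
bins = [18, 30, 40, 50, 60, 70, np.inf]
labels = ["18-29", "30-39", "40-49", "50-59", "60-69", "70+"]
demo["AgeGroup"] = pd.cut(demo["Age"], bins=bins, labels=labels, right=False)

# Churn rate (mean of the 0/1 flag) and group size per age group.
summary = (
    demo.groupby("AgeGroup", observed=False)["Exited"]
    .agg(churn_rate="mean", customers="size")
)
print(summary)
```

Showing the group size next to the rate guards against over-reading a high churn rate in a group with very few customers.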

In [None]:
## Correlation Analysis: NumOfProducts and Exited

Our data analysis revealed a weak negative correlation between `NumOfProducts` and `Exited`, with a correlation coefficient of -0.21. This suggests that customers who hold more of the bank's products are somewhat less likely to exit. The negative sign indicates an inverse relationship: as the number of products increases, the likelihood of customer exit decreases.

### Implications

This finding may indicate that holding multiple products is a significant factor in customer retention. It implies that efforts to encourage customers to use a wider range of products may reduce the churn rate.

**Key Insight:**
- Engaging customers with more products appears to be beneficial in terms of retention. Banks should consider strategies that promote the use of multiple products to their customers.

In [None]:
bank_data.replace([np.inf, -np.inf], np.nan, inplace=True)

# Calculate churn rate by number of products
churn_rate_by_product = bank_data.groupby('NumOfProducts')['Exited'].mean().reset_index()

# Calculate the number of customers for each number of products
customer_count_by_product = bank_data.groupby('NumOfProducts')['Exited'].size().reset_index(name='CustomerCount')

# Merge the two DataFrames
merged_data = pd.merge(churn_rate_by_product, customer_count_by_product, on='NumOfProducts')

# Create a figure with a secondary y-axis
fig, ax1 = plt.subplots(figsize=(10, 6))

# Bar plot for customer counts
sns.barplot(x='NumOfProducts', y='CustomerCount', data=merged_data, ax=ax1, alpha=0.5, color='gray')

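# Line plot for churn rate on a secondary axis.
# barplot places its bars at categorical positions 0..n-1, so the line is plotted at those positions to stay aligned.
ax2 = ax1.twinx()
ax2.plot(range(len(merged_data)), merged_data['Exited'], marker='o', color='blue', linewidth=2)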

# Labels and title
ax1.set_xlabel('Number of Products')
ax1.set_ylabel('Customer Count', color='gray')
ax2.set_ylabel('Churn Rate', color='blue')
plt.title('Customer Churn Rate and Count by Number of Products')

# Show the plot
plt.show()


In [None]:
This shows much more churn among customers with fewer products on their account.

In [None]:
bank_data.columns

In [None]:
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import seaborn as sns

# Step 2: Preprocess the data
# Select numeric features, excluding 'Exited' for PCA
numeric_features = bank_data.select_dtypes(include=[np.number]).drop(columns=['Exited'], errors='ignore')

# Standardize the features
scaler = StandardScaler()
scaled_features = scaler.fit_transform(numeric_features)

# Step 3: Apply PCA
pca = PCA(n_components=2)  # Reduce to 2 components for visualization
pca_result = pca.fit_transform(scaled_features)

# Step 4: Create the pca_df DataFrame
pca_df = pd.DataFrame(data=pca_result, columns=['Principal Component 1', 'Principal Component 2'])

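# Adding the 'Exited' column for color-coding in the plot
# (to_numpy() assigns by position, avoiding misalignment if bank_data's index is not 0..n-1)
pca_df['Exited'] = bank_data['Exited'].to_numpy()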

# Now pca_df is ready and contains the PCA results along with the 'Exited' status

In [38]:
# Plotting the PCA results, colored by 'Exited' status
plt.figure(figsize=(10, 8))
sns.scatterplot(x='Principal Component 1', y='Principal Component 2', hue='Exited', data=pca_df, palette='coolwarm', alpha=0.6)
plt.title('PCA of Bank Dataset by Churn Status')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend(title='Churn Status', loc='best')
plt.show()

NameError: name 'pca_df' is not defined

<Figure size 1000x800 with 0 Axes>

In [None]:
These observations can lead to hypothesis formation. For example, if the second principal component represents higher balances and older age, a hypothesis might be that older customers with higher balances are churning because they don't like the savings rates or are moving to competitors offering better terms.
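
Whether a principal component actually "represents" particular features can be read from its loadings, the weights each original feature receives in that component. The sketch below shows one way to inspect them; it uses synthetic data with three hypothetical columns, so the printed values say nothing about the bank dataset itself.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the numeric features; synthetic values only.
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "Age": rng.normal(40, 10, size=500),
    "Balance": rng.normal(75000, 30000, size=500),
    "CreditScore": rng.normal(650, 90, size=500),
})

# Standardize, then fit a two-component PCA as in the cell above.
scaled = StandardScaler().fit_transform(demo)
pca = PCA(n_components=2).fit(scaled)

# Rows are original features, columns are components; large absolute
# values mark the features that dominate a component.
loadings = pd.DataFrame(
    pca.components_.T,
    index=demo.columns,
    columns=["PC1", "PC2"],
)
print(loadings.round(3))
print("Explained variance ratio:", pca.explained_variance_ratio_.round(3))
```

Applied to the fitted `pca` object from the notebook, the same table would show which features load most heavily on each component before any hypothesis is built on it.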


In [None]:
profile = ProfileReport(bank_data, title="Bank Churn Profile")

# Save the report to .html
profile.to_file("churn_report.html")