<a href="https://colab.research.google.com/github/thvarsha00/Bank-Customer-Churn-Prediction/blob/main/Bank_Customer_Churn_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **PROJECT TITLE:** Bank Customer Churn Prediction Analysis with Machine Learning

# This project explores a customer dataset to predict churn and uncover key factors driving customer attrition.
# Using a combination of Exploratory Data Analysis (EDA), feature engineering, clustering, and predictive modeling, we examine variables such as age, tenure, balance, credit score, number of products, and active-member status as candidate predictors of churn.
# The analysis highlights actionable insights for customer retention and provides a foundation for building robust predictive models.

# **Problem Statement**

# The objective of this project is to analyze the key factors influencing customer churn and build predictive models to identify customers likely to leave.

# The workflow includes:

# **Exploratory Data Analysis (EDA)**: Univariate, bivariate, and multivariate analysis to explore customer behavior patterns.

# **Feature Engineering**: Encoding categorical variables, scaling numerical features, and optional dimensionality reduction.

# **Clustering & Segmentation**: Using PCA and DBSCAN to identify customer segments.

# **Predictive Modeling**: Logistic Regression (Lasso & ElasticNet), Random Forest, and XGBoost with hyperparameter tuning.

# **Model Validation**: Evaluating models via confusion matrices, precision-recall curves, ROC AUC, and F1 score.

# **Interpretation & Explainability**: Feature importance, coefficients, and SHAP analysis to identify top churn drivers.

# **Model export**: Saving trained models (e.g., model.save('churn_model.h5') for neural networks) for future use and real-time predictions.


# **DATASET**
# **Name:** Customer Churn Dataset
# **Description:**
# The dataset contains customer demographic and account data, including credit score, country, gender, age, tenure, balance, number of products, credit card ownership, active-member status, and estimated salary.

# **Target variable**: churn → 1 = churned, 0 = retained

# **1.Import Required Libraries**

In [None]:
# Data manipulation
import pandas as pd
import numpy as np

# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Preprocessing
from sklearn.preprocessing import StandardScaler, LabelEncoder

# Dimensionality reduction & clustering
from sklearn.decomposition import PCA
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

# Model selection & evaluation
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression  # fitted later with L1 and ElasticNet penalties
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score

# Deep learning
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense


# **2.Load Data**

In [None]:
from google.colab import files
import pandas as pd
uploaded = files.upload()

# **3.Data Inspection**

In [None]:

# Load the uploaded CSV into a DataFrame
df = pd.read_csv(list(uploaded.keys())[0])


# Check column info

In [None]:

df.info()

# Summary statistics

In [None]:

df.describe()

# First 5 rows

In [None]:

df.head()


# Columns Overview

In [None]:
print(df.columns)
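
# Before encoding anything, it can help to check for missing values, duplicated rows, and class balance in the target. The sketch below runs those three checks on a small hypothetical frame (`demo`) that only stands in for the uploaded `df`; the values are made up for illustration.

```python
import pandas as pd

# Tiny hypothetical stand-in for the uploaded DataFrame `df`
demo = pd.DataFrame({
    "credit_score": [619, 608, 502, 699, 850, 619],
    "country": ["France", "Spain", "France", None, "Spain", "France"],
    "age": [42, 41, 42, 39, 43, 42],
    "churn": [1, 0, 1, 0, 0, 1],
})

# Number of missing values in each column
missing = demo.isna().sum()

# Number of rows that exactly repeat an earlier row
n_duplicates = int(demo.duplicated().sum())

# Share of each class in the target (useful for spotting imbalance)
class_share = demo["churn"].value_counts(normalize=True)

print(missing)
print("Duplicate rows:", n_duplicates)
print(class_share)
```

# On the real data, the same three lines applied to `df` would show whether imputation, de-duplication, or class weighting is worth considering.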


#  **4.Feature Engineering**

# **Encoding categorical variables**

In [None]:
from sklearn.preprocessing import LabelEncoder
from IPython.display import display


# Encode gender

In [None]:

le_gender = LabelEncoder()
df['gender'] = le_gender.fit_transform(df['gender'])
print("Gender mapping:")
display(dict(zip(le_gender.classes_, le_gender.transform(le_gender.classes_))))

# Encode country

In [None]:

le_country = LabelEncoder()
df['country'] = le_country.fit_transform(df['country'])
print("Country mapping:")
display(dict(zip(le_country.classes_, le_country.transform(le_country.classes_))))
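
# Label encoding assigns country an arbitrary numeric order, which linear models may read as a ranking. One common alternative is one-hot encoding. The sketch below uses a hypothetical `demo` frame with made-up rows, not the notebook's `df`.

```python
import pandas as pd

# Hypothetical rows standing in for the raw categorical columns
demo = pd.DataFrame({
    "country": ["France", "Spain", "Germany", "France"],
    "gender": ["Female", "Male", "Male", "Female"],
})

# One indicator column per country; drop_first removes the redundant baseline column
encoded = pd.get_dummies(demo, columns=["country"], drop_first=True, dtype=int)

# gender has two levels here, so a plain 0/1 mapping loses no information
encoded["gender"] = encoded["gender"].map({"Female": 0, "Male": 1})

print(encoded)
```

# Tree-based models such as Random Forest and XGBoost are usually less sensitive to this choice than logistic regression.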

#  Separate features and target

In [None]:
# Features and target
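# 'customer_id' is an assumed identifier column name; errors='ignore' skips it if absent
X = df.drop(columns=['churn', 'customer_id'], errors='ignore')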
y = df['churn']

# Show shapes
print("X shape:", X.shape)
print("y shape:", y.shape)

# Scaling numerical features

In [None]:
from sklearn.preprocessing import StandardScaler
import pandas as pd

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Convert back to DataFrame for easier viewing
X_scaled_df = pd.DataFrame(X_scaled, columns=X.columns)


# Optional: show summary statistics to confirm scaling
print("\nScaled feature statistics:")
display(X_scaled_df.describe())
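
# Fitting the scaler on the full dataset lets test-set statistics influence the training features. A minimal sketch of one way to avoid that, assuming synthetic data (`X_demo`, `y_demo`) in place of the real features, is to put the scaler inside a scikit-learn pipeline so it is fitted on the training split only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 200 rows, 3 unscaled features, binary target
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_demo = (X_demo[:, 0] + rng.normal(scale=5.0, size=200) > 50).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The pipeline fits the scaler on X_tr only, then reuses those statistics on X_te
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)

scaler_demo = pipe.named_steps["standardscaler"]
print("Scaler means (training split only):", scaler_demo.mean_)
print("Test accuracy:", pipe.score(X_te, y_te))
```

# A pipeline like this can also be passed directly to a cross-validated search, so that scaling is refitted inside every fold.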


#  **5.Univariate Analysis**

# Numerical Features

In [None]:
import plotly.express as px


num_cols = ['age', 'tenure', 'balance', 'credit_score', 'estimated_salary']

for col in num_cols:
    fig = px.histogram(df, x=col, nbins=20, marginal="box",  # adds a mini box plot
                       title=f'Distribution of {col}',
                       template="plotly_white")
    fig.show()




# Categorical Features

In [None]:

cat_cols = ['gender', 'country', 'credit_card', 'active_member', 'products_number']

for col in cat_cols:
    fig = px.histogram(df, x=col, color='churn', barmode='group',
                       title=f'Counts of {col} by Churn',
                       template="plotly_white")
    fig.show()

# **6.Bivariate Analysis (Feature vs Target)**

# Numerical columns

In [None]:

num_cols = ['age', 'tenure', 'balance', 'credit_score', 'estimated_salary']

for col in num_cols:
    fig = px.box(df, x='churn', y=col, color='churn',
                 title=f'{col} vs Churn', template='plotly_white')
    fig.show()


# Categorical columns

In [None]:

cat_cols = ['gender', 'country', 'credit_card', 'active_member', 'products_number']

for col in cat_cols:
    fig = px.histogram(df, x=col, color='churn', barmode='group',
                       title=f'{col} vs Churn', template='plotly_white')
    fig.show()


# **7.Multivariate Analysis (Multiple Features Together)**

# Correlation Heatmap for Numerical Features

In [None]:
import plotly.figure_factory as ff
import numpy as np

num_cols = ['age', 'tenure', 'balance', 'credit_score', 'estimated_salary', 'churn']
corr_matrix = df[num_cols].corr().round(2)

fig = ff.create_annotated_heatmap(
    z=corr_matrix.values,
    x=corr_matrix.columns.tolist(),
    y=corr_matrix.index.tolist(),
    colorscale='Viridis',
    showscale=True,
    reversescale=True
)
fig.update_layout(title='Correlation Heatmap', template='plotly_white')
fig.show()


# Pairwise Scatter Plots (Numerical Features Colored by Churn)

In [None]:
num_cols = ['age', 'tenure', 'balance', 'credit_score', 'estimated_salary']

fig = px.scatter_matrix(df,
                        dimensions=num_cols,
                        color='churn',
                        title='Pairwise Scatter Plot of Numerical Features',
                        template='plotly_white',
                        symbol='churn')
fig.update_traces(diagonal_visible=False)  # optional, hides the diagonal panels (each feature plotted against itself)
fig.show()


# Categorical vs Categorical vs Target (Churn)

In [None]:
# Example: country vs gender vs churn
fig = px.histogram(df, x='country', color='churn', barmode='group',
                   facet_col='gender', title='Country vs Gender vs Churn',
                   template='plotly_white')
fig.show()


# Heatmap for Two Features vs Churn

In [None]:
# Pivot table: average churn by country and credit_card
heatmap_data = df.pivot_table(values='churn', index='country', columns='credit_card', aggfunc='mean')

fig = px.imshow(heatmap_data,
                text_auto=True,
                color_continuous_scale='Viridis',
                title='Average Churn by Country and Credit Card Ownership')
fig.show()


# 3D Scatter Plots (Three Numerical Features vs Churn)

In [None]:
fig = px.scatter_3d(df, x='age', y='balance', z='tenure', color='churn',
                    symbol='churn', title='3D Scatter: Age vs Balance vs Tenure',
                    template='plotly_white')
fig.show()


# Parallel Coordinates Plot

In [None]:
num_cols = ['age', 'tenure', 'balance', 'credit_score', 'estimated_salary']

fig = px.parallel_coordinates(df, dimensions=num_cols, color='churn',
                              color_continuous_scale=px.colors.diverging.Tealrose,
                              title='Parallel Coordinates: Multiple Features vs Churn')
fig.show()


# **8.Clustering with PCA + DBSCAN**

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt


# Apply PCA for Dimensionality Reduction

In [None]:
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Explained variance ratio
print(f"Explained variance ratio: {pca.explained_variance_ratio_}")
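
# Two components are convenient for plotting, but they may retain only part of the variance. A rough way to see how many components a given variance target would need is to inspect the cumulative explained variance. The sketch below uses random data (`X_demo`) in place of `X_scaled`, and the 90% target is an arbitrary assumption.

```python
import numpy as np
from sklearn.decomposition import PCA

# Random stand-in for the scaled feature matrix
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 6))

# Fit PCA with all components and accumulate the explained variance ratios
pca_full = PCA().fit(X_demo)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches the target
target = 0.90
n_keep = int(np.searchsorted(cumulative, target) + 1)

print("Cumulative explained variance:", np.round(cumulative, 3))
print(f"Components needed for {target:.0%} variance:", n_keep)
```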


# Apply DBSCAN for Clustering

In [None]:
dbscan = DBSCAN(eps=0.5, min_samples=5)
labels = dbscan.fit_predict(X_pca)



#  Evaluate clustering

In [None]:

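# Ignore noise points (-1); the silhouette score needs at least two clusters
mask = labels != -1
n_clusters = len(set(labels[mask]))
if n_clusters >= 2:
    score = silhouette_score(X_pca[mask], labels[mask])
    print(f"Silhouette Score (excluding noise): {score:.3f}")
else:
    print(f"Silhouette score undefined: only {n_clusters} cluster(s) found.")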

In [None]:
# Inspect cluster sizes
import numpy as np
unique, counts = np.unique(labels, return_counts=True)
print("Cluster distribution:", dict(zip(unique, counts)))
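
# DBSCAN results depend heavily on `eps`. A commonly used heuristic is to look at each point's distance to its k-th nearest neighbour (with k equal to `min_samples`) and pick `eps` near the upper end of that curve. The sketch below applies the idea to two synthetic blobs; taking the 95th percentile is an assumption made here for simplicity, not a rule.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

# Two well-separated synthetic blobs standing in for the PCA-reduced data
rng = np.random.default_rng(1)
blob_a = rng.normal(loc=(0, 0), scale=0.3, size=(100, 2))
blob_b = rng.normal(loc=(5, 5), scale=0.3, size=(100, 2))
points = np.vstack([blob_a, blob_b])

min_samples = 5

# Querying the fitted data returns each point itself as its first neighbour,
# which matches DBSCAN counting the point itself towards min_samples
nn = NearestNeighbors(n_neighbors=min_samples).fit(points)
distances, _ = nn.kneighbors(points)
k_dist = np.sort(distances[:, -1])

# Crude choice: a high percentile of the sorted k-distance curve
eps_guess = float(np.percentile(k_dist, 95))

labels_demo = DBSCAN(eps=eps_guess, min_samples=min_samples).fit_predict(points)
n_found = len(set(labels_demo) - {-1})

print(f"Suggested eps: {eps_guess:.3f}")
print("Clusters found (excluding noise):", n_found)
```

# Plotting `k_dist` and looking for the elbow by eye is the more usual way to apply this heuristic.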

# Visualizing Clusters and Churn Patterns


In [None]:

plt.figure(figsize=(8,6))
plt.scatter(X_pca[:,0], X_pca[:,1], c=labels, cmap='viridis', s=50)
plt.title('DBSCAN Clusters (PCA Reduced)')
plt.xlabel('PCA1')
plt.ylabel('PCA2')
plt.show()

In [None]:
plt.figure(figsize=(8,6))
plt.scatter(X_pca[:,0], X_pca[:,1], c=y, cmap='coolwarm', s=50)
plt.title('PCA Components Colored by Churn')
plt.xlabel('PCA1')
plt.ylabel('PCA2')
plt.colorbar(label='Churn (0=No, 1=Yes)')
plt.show()


In [None]:
!pip install plotly -q
import plotly.express as px




# Create a DataFrame with PCA components, cluster labels, and churn

In [None]:

pca_df = pd.DataFrame(X_pca, columns=['PCA1', 'PCA2'])
pca_df['Cluster'] = labels
pca_df['Churn'] = y.values

fig = px.scatter(
    pca_df, x='PCA1', y='PCA2',
    color='Churn',
    symbol='Cluster',
    hover_data=['PCA1','PCA2','Cluster','Churn'],
    title='Interactive PCA Scatter Plot by Churn and Cluster'
)
fig.show()

# Compute churn rate per cluster

In [None]:

churn_per_cluster = pca_df.groupby('Cluster')['Churn'].mean().reset_index()

fig = px.bar(
    churn_per_cluster,
    x='Cluster',
    y='Churn',
    color='Churn',
    title='Interactive Churn Rate per Cluster',
    labels={'Churn':'Churn Rate'}
)
fig.show()


# Correlation matrix

In [None]:

# Correlations across all (now numeric) columns, computed once and reused for the plot
corr_matrix = df.corr().round(2)

fig = px.imshow(corr_matrix, text_auto=True, aspect='auto', zmin=-1, zmax=1,
                color_continuous_scale='RdBu_r', title='Correlation Heatmap')
fig.show()


In [None]:
for col in ['credit_score','age','balance','products_number','estimated_salary']:
    fig = px.box(df, x='churn', y=col, color='churn', title=f'Interactive Boxplot: {col} vs Churn')
    fig.show()


In [None]:
import matplotlib.pyplot as plt
import pandas as pd

# Convert X_pca to a DataFrame for easier handling
X_pca_df = pd.DataFrame(X_pca, columns=['PC1', 'PC2'])
X_pca_df['churn'] = y.values  # Add target for coloring

# Plot
plt.figure(figsize=(8,6))
colors = {0: 'blue', 1: 'red'}  # 0 = retained, 1 = churned
plt.scatter(X_pca_df['PC1'], X_pca_df['PC2'], c=X_pca_df['churn'].map(colors), alpha=0.6)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('2D PCA of Features')
plt.legend(handles=[plt.Line2D([0], [0], marker='o', color='w', label='Retained', markerfacecolor='blue', markersize=8),
                    plt.Line2D([0], [0], marker='o', color='w', label='Churned', markerfacecolor='red', markersize=8)])
plt.grid(True)
plt.show()


# **9.Predictive Modeling**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from xgboost import XGBClassifier
import joblib

# Train-test split


In [None]:

X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42, stratify=y
)
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)


# Model 1 (Logistic Regression) with Lasso (L1) & ElasticNet

# Lasso (L1)

In [None]:


lasso = LogisticRegression(penalty='l1', solver='saga', max_iter=5000, random_state=42)
param_grid_lasso = {'C': np.logspace(-3, 3, 7)}
lasso_cv = RandomizedSearchCV(lasso, param_grid_lasso, cv=5, scoring='roc_auc', n_iter=7, random_state=42)
lasso_cv.fit(X_train, y_train)
y_pred_lasso = lasso_cv.predict(X_test)
y_proba_lasso = lasso_cv.predict_proba(X_test)[:,1]

print("Best parameters:", lasso_cv.best_params_)
print(classification_report(y_test, y_pred_lasso))
print("ROC AUC:", roc_auc_score(y_test, y_proba_lasso))




#  ElasticNet

In [None]:

elastic = LogisticRegression(penalty='elasticnet', solver='saga', max_iter=5000, random_state=42)
param_grid_elastic = {'C': np.logspace(-3,3,5), 'l1_ratio':[0.2,0.5,0.8]}
elastic_cv = RandomizedSearchCV(elastic, param_grid_elastic, cv=5, scoring='roc_auc', n_iter=5, random_state=42)
elastic_cv.fit(X_train, y_train)
y_pred_elastic = elastic_cv.predict(X_test)
y_proba_elastic = elastic_cv.predict_proba(X_test)[:,1]

print("\nElasticNet Best parameters:", elastic_cv.best_params_)
print(classification_report(y_test, y_pred_elastic))
print("ROC AUC:", roc_auc_score(y_test, y_proba_elastic))
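
# The workflow lists coefficients as one route to interpretation. A minimal sketch of how L1-penalised coefficients could be inspected is shown below. It uses synthetic data in which only `age` drives the target; the column names and the value `C=0.05` are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic features; only `age` influences the synthetic target
rng = np.random.default_rng(7)
n = 500
demo = pd.DataFrame({
    "age": rng.normal(40, 10, n),
    "balance": rng.normal(70000, 30000, n),
    "noise": rng.normal(0, 1, n),
})
logit = 0.15 * (demo["age"] - 40)
target = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

# Standardise so coefficient magnitudes are comparable across features
X_std = StandardScaler().fit_transform(demo)

clf = LogisticRegression(penalty="l1", solver="saga", C=0.05,
                         max_iter=5000, random_state=42)
clf.fit(X_std, target)

# Coefficients sorted by absolute size; L1 tends to shrink weak ones towards zero
coefs = pd.Series(clf.coef_[0], index=demo.columns)
coefs = coefs.sort_values(key=np.abs, ascending=False)
print(coefs)
```

# With the notebook's fitted search object, the analogous array would be `lasso_cv.best_estimator_.coef_[0]`, indexed by `X.columns`.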

# Model 2 (Random Forest)

In [None]:

rf = RandomForestClassifier(random_state=42)
param_grid_rf = {
    'n_estimators':[100,200,300],
    'max_depth':[None,5,10,15],
    'min_samples_split':[2,5,10],
    'min_samples_leaf':[1,2,4]
}
rf_cv = RandomizedSearchCV(rf, param_grid_rf, cv=5, scoring='roc_auc', n_iter=10, random_state=42)
rf_cv.fit(X_train, y_train)
y_pred_rf = rf_cv.predict(X_test)
y_proba_rf = rf_cv.predict_proba(X_test)[:,1]

print("Best parameters:", rf_cv.best_params_)
print(classification_report(y_test, y_pred_rf))
print("ROC AUC:", roc_auc_score(y_test, y_proba_rf))


# Feature importance

In [None]:

feat_importances = pd.Series(rf_cv.best_estimator_.feature_importances_, index=X.columns)
feat_importances.sort_values(ascending=False).plot(kind='bar', figsize=(10,5))
plt.title("Random Forest Feature Importance")
plt.show()
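
# Impurity-based importances are computed on the training data and can favour features with many distinct values. Permutation importance on held-out data is a frequently used cross-check. The sketch below uses synthetic data where only the first column matters; it is not the notebook's tuned forest.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: the target depends on column 0 only
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(400, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Shuffle one column at a time on the held-out split and measure the ROC AUC drop
result = permutation_importance(
    forest, X_te, y_te, scoring="roc_auc", n_repeats=5, random_state=0
)
ranking = np.argsort(result.importances_mean)[::-1]

print("Mean importance per column:", np.round(result.importances_mean, 3))
print("Columns ranked by importance:", ranking)
```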

# Model 3 (XGBoost)

In [None]:

print("\n--- XGBoost ---")
# use_label_encoder is deprecated in recent XGBoost releases and is no longer needed
xgb = XGBClassifier(eval_metric='logloss', random_state=42)
param_grid_xgb = {
    'n_estimators':[100,200,300],
    'max_depth':[3,5,7],
    'learning_rate':[0.01,0.1,0.2],
    'subsample':[0.7,0.8,1],
    'colsample_bytree':[0.7,0.8,1]
}
xgb_cv = RandomizedSearchCV(xgb, param_grid_xgb, cv=5, scoring='roc_auc', n_iter=10, random_state=42)
xgb_cv.fit(X_train, y_train)
y_pred_xgb = xgb_cv.predict(X_test)
y_proba_xgb = xgb_cv.predict_proba(X_test)[:,1]

print("Best parameters:", xgb_cv.best_params_)
print(classification_report(y_test, y_pred_xgb))
print("ROC AUC:", roc_auc_score(y_test, y_proba_xgb))

# Feature importance

In [None]:

from xgboost import plot_importance
plot_importance(xgb_cv.best_estimator_, height=0.5, max_num_features=10)
plt.show()

# **10.Collect metrics for each model**

In [None]:
# Store results in a dictionary
results = {
    'Model': ['Logistic Lasso', 'Logistic ElasticNet', 'Random Forest', 'XGBoost'],
    'ROC AUC': [
        roc_auc_score(y_test, y_proba_lasso),
        roc_auc_score(y_test, y_proba_elastic),
        roc_auc_score(y_test, y_proba_rf),
        roc_auc_score(y_test, y_proba_xgb)
    ],
    'Accuracy': [
        (y_pred_lasso == y_test).mean(),
        (y_pred_elastic == y_test).mean(),
        (y_pred_rf == y_test).mean(),
        (y_pred_xgb == y_test).mean()
    ],
    'F1 Score': [
        classification_report(y_test, y_pred_lasso, output_dict=True)['1']['f1-score'],
        classification_report(y_test, y_pred_elastic, output_dict=True)['1']['f1-score'],
        classification_report(y_test, y_pred_rf, output_dict=True)['1']['f1-score'],
        classification_report(y_test, y_pred_xgb, output_dict=True)['1']['f1-score']
    ]
}

results_df = pd.DataFrame(results)
display(results_df)


# Visualize ROC AUC comparison

In [None]:
import matplotlib.pyplot as plt

plt.figure(figsize=(8,5))
plt.bar(results['Model'], results['ROC AUC'], color=['skyblue','lightgreen','salmon','orange'])
plt.ylabel('ROC AUC')
plt.title('Model Comparison - ROC AUC')
plt.ylim(0,1)
plt.show()


# Plot ROC curves together

In [None]:
from sklearn.metrics import roc_curve, auc

plt.figure(figsize=(8,6))

models = {
    'Logistic Lasso': y_proba_lasso,
    'Logistic ElasticNet': y_proba_elastic,
    'Random Forest': y_proba_rf,
    'XGBoost': y_proba_xgb
}

for name, y_proba in models.items():
    fpr, tpr, _ = roc_curve(y_test, y_proba)
    plt.plot(fpr, tpr, label=f'{name} (AUC={auc(fpr, tpr):.3f})')

plt.plot([0,1],[0,1],'k--')  # random classifier
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curves - Model Comparison')
plt.legend()
plt.grid(True)
plt.show()


# SHAP Explanations (Global + Individual)

In [None]:
import shap

# ---- For XGBoost ----
explainer = shap.Explainer(xgb_cv.best_estimator_)
shap_values = explainer(X_test)

# Global feature importance
shap.summary_plot(shap_values, X_test, feature_names=X.columns, plot_type="bar")

# Detailed visualization for individual predictions
shap.initjs()
shap.force_plot(explainer.expected_value, shap_values[0].values, X_test[0], feature_names=X.columns)


# **11.Model Validation**

# Confusion Matrix Analysis

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Predict on test set
y_pred_prob = xgb_cv.best_estimator_.predict_proba(X_test)[:,1]
y_pred_class = (y_pred_prob >= 0.5).astype(int)  # default threshold 0.5

cm = confusion_matrix(y_test, y_pred_class)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['Retained','Churned'])
disp.plot(cmap='Blues')
plt.title("Confusion Matrix")
plt.show()
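
# Precision, recall, and F1 for the churn class follow directly from the four cells of the confusion matrix. The worked example below uses ten made-up labels and predictions, not the notebook's results, and cross-checks the hand-computed F1 against scikit-learn.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score

# Made-up labels and predictions (1 = churned, 0 = retained)
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_hat = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 0])

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision_churn = tp / (tp + fp)  # of predicted churners, how many really churned
recall_churn = tp / (tp + fn)     # of actual churners, how many were caught
f1_churn = 2 * precision_churn * recall_churn / (precision_churn + recall_churn)

print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
print(f"Precision={precision_churn:.3f}, Recall={recall_churn:.3f}, F1={f1_churn:.3f}")
print("Matches sklearn f1_score:", np.isclose(f1_churn, f1_score(y_true, y_hat)))
```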


# Precision-Recall Curve

In [None]:
from sklearn.metrics import precision_recall_curve, average_precision_score

precision, recall, thresholds = precision_recall_curve(y_test, y_pred_prob)
avg_precision = average_precision_score(y_test, y_pred_prob)

plt.figure(figsize=(8,5))
plt.plot(recall, precision, marker='.', label=f'XGBoost (AP={avg_precision:.3f})')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.legend()
plt.grid(True)
plt.show()
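
# The confusion matrix above uses the default 0.5 threshold. The arrays returned by `precision_recall_curve` can also be used to pick a different cut-off, for example the one that maximises F1. The sketch below does this on made-up scores; whether F1 is the right criterion depends on the relative cost of missed churners and false alarms.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Made-up labels and predicted churn probabilities
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
scores = np.array([0.05, 0.10, 0.20, 0.30, 0.45, 0.60, 0.35, 0.55, 0.70, 0.90])

prec, rec, thr = precision_recall_curve(y_true, scores)

# The final precision/recall pair has no matching threshold, so drop it
p, r = prec[:-1], rec[:-1]

# F1 at every candidate threshold (clip guards against division by zero)
f1_values = 2 * p * r / np.clip(p + r, 1e-12, None)

best = int(np.argmax(f1_values))
best_threshold = float(thr[best])

print(f"Best threshold: {best_threshold:.2f}")
print(f"F1 at that threshold: {f1_values[best]:.3f}")
```

# To avoid an optimistic estimate, the threshold would normally be chosen on a validation split rather than on the test set.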


# Save Your Trained Keras Model

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Example: a simple neural network model
model = Sequential([
    Dense(32, activation='relu', input_shape=(X_scaled.shape[1],)),
    Dense(16, activation='relu'),
    Dense(1, activation='sigmoid')  # output layer for binary classification
])

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train on the training split only, so the test set stays unseen
model.fit(X_train, y_train, epochs=50, batch_size=32, validation_split=0.2)

# Save the trained model as .h5
model.save('churn_model.h5')
print("Keras model saved as 'churn_model.h5'")
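
# The scikit-learn and XGBoost models can be persisted as well. A minimal sketch using `joblib` follows. It saves a small pipeline fitted on synthetic data to a temporary directory and reloads it; the file name `churn_pipeline.joblib` is hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data and a small fitted pipeline
rng = np.random.default_rng(5)
X_demo = rng.normal(size=(100, 4))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_demo, y_demo)

# Save to a temporary directory, then load the model back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "churn_pipeline.joblib")  # hypothetical file name
    joblib.dump(pipe, path)
    restored = joblib.load(path)

# The restored pipeline should reproduce the original predictions
same_predictions = bool(
    np.array_equal(pipe.predict(X_demo), restored.predict(X_demo))
)
print("Restored model reproduces predictions:", same_predictions)
```

# Saving the scaler together with the model, as the pipeline does here, means new customer records can be scored without repeating the preprocessing by hand.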


## **Conclusion**

# In this project, we built models to predict customer churn. Random Forest and XGBoost performed best, with ROC AUC of about 0.86 and reasonable F1 scores on the churn class. Feature importance and SHAP analysis highlighted the main drivers of churn. A simple neural network was also trained and saved for later use. Overall, these models can help businesses identify high-risk customers early, act to retain them, and reduce potential revenue loss.