<a href="https://colab.research.google.com/github/shanojpillai/GenerativeAI_100Days/blob/main/BankCustomerChurnPrediction_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Bank Customer Churn Prediction Project

# Project Overview
* **Project Title**: Bank Customer Churn Prediction for ABC Multistate Bank
* **Objective**: Predict customer churn by identifying customers likely to leave the bank based on historical data.
* **Target Variable**: `churn` (1 if the customer left the bank, 0 otherwise)
* **Dataset Summary**: Contains features such as customer demographics, account balance, and banking behavior for 10,000 customers.




## Document Structure
* **Data Exploration**: Analyze the dataset to understand its structure and check for missing values.
* **Data Preprocessing**: Prepare the dataset for modeling, including encoding categorical variables and scaling.
* **Model Building**: Build predictive models, starting with logistic regression as a baseline.
* **Model Evaluation**: Assess model performance with accuracy, precision, recall, and F1-score.
* **Interpretability and Insights**: Use techniques like SHAP for feature importance and interpretation.
* **Future Improvements**: Suggestions for model enhancements and further research.

# **Step 1: Data Exploration**
The goal of data exploration is to understand the dataset’s structure, identify any missing values, and uncover patterns that could inform model building.

In [2]:
import pandas as pd

# Load the dataset
data = pd.read_csv('/content/Bank Customer Churn Prediction.csv')

# Display the first few rows of the dataset
data.head()

# Check for missing values
print("Missing Values in Each Column:")
print(data.isnull().sum())

# Summary statistics
data.describe()


Missing Values in Each Column:
customer_id         0
credit_score        0
country             0
gender              0
age                 0
tenure              0
balance             0
products_number     0
credit_card         0
active_member       0
estimated_salary    0
churn               0
dtype: int64


| | customer_id | credit_score | age | tenure | balance | products_number | credit_card | active_member | estimated_salary | churn |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 |
| mean | 15690940.0 | 650.5288 | 38.9218 | 5.0128 | 76485.889288 | 1.5302 | 0.7055 | 0.5151 | 100090.239881 | 0.2037 |
| std | 71936.19 | 96.653299 | 10.487806 | 2.892174 | 62397.405202 | 0.581654 | 0.45584 | 0.499797 | 57510.492818 | 0.402769 |
| min | 15565700.0 | 350.0 | 18.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 11.58 | 0.0 |
| 25% | 15628530.0 | 584.0 | 32.0 | 3.0 | 0.0 | 1.0 | 0.0 | 0.0 | 51002.11 | 0.0 |
| 50% | 15690740.0 | 652.0 | 37.0 | 5.0 | 97198.54 | 1.0 | 1.0 | 1.0 | 100193.915 | 0.0 |
| 75% | 15753230.0 | 718.0 | 44.0 | 7.0 | 127644.24 | 2.0 | 1.0 | 1.0 | 149388.2475 | 0.0 |
| max | 15815690.0 | 850.0 | 92.0 | 10.0 | 250898.09 | 4.0 | 1.0 | 1.0 | 199992.48 | 1.0 |
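
The `churn` mean of 0.2037 in the summary statistics implies that roughly one customer in five left the bank, so the two classes are imbalanced. A minimal sketch of checking the class balance directly is shown here; the `toy` frame is a made-up stand-in for the real CSV and its values are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for the real dataset (assumed values): 8 retained, 2 churned.
toy = pd.DataFrame({"churn": [0, 0, 0, 0, 1, 0, 0, 1, 0, 0]})

# Share of each class, indexed by class label.
# On the real data the equivalent call would be
# data["churn"].value_counts(normalize=True).
class_share = toy["churn"].value_counts(normalize=True).sort_index()
print(class_share)
```

A skewed split like this is worth keeping in mind later, because accuracy alone can look good while most churners are missed.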


# **Step 2: Data Preprocessing**
## 2.1 Drop Unnecessary Columns
Since `customer_id` is an identifier and does not contribute to predicting churn, we can drop it.

In [3]:
# Drop the 'customer_id' column
data = data.drop(['customer_id'], axis=1)


## 2.2 Encoding Categorical Variables
The dataset contains categorical columns (country and gender) that need to be converted into numeric format for machine learning algorithms. We use one-hot encoding for these columns.

In [4]:
# One-hot encode 'country' and 'gender' columns
data = pd.get_dummies(data, columns=['country', 'gender'], drop_first=True)
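
To make the effect of `drop_first=True` concrete, here is a small sketch on a toy frame. The category values (`France`, `Female`, and so on) are assumptions chosen for illustration; only the resulting column names are confirmed by the feature list printed later in this notebook.

```python
import pandas as pd

# Toy frame with assumed category values.
toy = pd.DataFrame({
    "country": ["France", "Germany", "Spain", "France"],
    "gender": ["Female", "Male", "Male", "Female"],
})

# drop_first=True drops the first category (in sorted order) of each column,
# so "France" and "Female" become the implicit baseline.
encoded = pd.get_dummies(toy, columns=["country", "gender"], drop_first=True)
print(encoded.columns.tolist())
```

Dropping one level per column avoids perfectly collinear dummy columns, which matters for linear models such as logistic regression.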


## 2.3 Splitting Features and Target
Separate the features (X) from the target (y).

In [5]:
# Define X (input features) and y (target variable)
X = data.drop('churn', axis=1)
y = data['churn']


## 2.4 Train-Test Split
We split the data into training and testing sets, using 80% for training and 20% for testing.

In [6]:
from sklearn.model_selection import train_test_split

# Split the data (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
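
Because only about 20% of customers churn, one optional variation is a stratified split, which keeps the churn rate the same in both sets. The notebook itself does not stratify; the sketch below uses synthetic data (`X_demo`, `y_demo` are hypothetical names) to show what the `stratify` argument does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1,000 rows with a 20% positive rate.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 200 + [0] * 800)

# stratify=y_demo preserves the class proportions in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
print(y_tr.mean(), y_te.mean())
```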


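## 2.5 Feature Scaling
Standardizing the features to zero mean and unit variance puts them on a comparable scale, so large-valued columns such as `balance` do not dominate small-valued ones such as `tenure`. The scaler is fit on the training set only and then applied to the test set, so no test-set information leaks into training.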

In [7]:
from sklearn.preprocessing import StandardScaler

# Initialize the scaler
scaler = StandardScaler()

# Fit and transform the training set; transform the testing set
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
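
The fit-on-train, transform-on-test pattern can be checked on a tiny example. The arrays below are made up for illustration; the point is that the test row is scaled with the training mean and standard deviation, not its own.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: two features on very different scales.
train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
test = np.array([[4.0, 400.0]])

scaler_demo = StandardScaler()
train_scaled = scaler_demo.fit_transform(train)  # learns mean and std from train only
test_scaled = scaler_demo.transform(test)        # reuses the training statistics

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled)
```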


# Step 3: Model Building
We start with Logistic Regression, a commonly used model for binary classification, then explore more complex models like Random Forest if needed.

# Model 1: Logistic Regression
Description: Logistic Regression is a linear model used for binary classification problems. It models the probability of a binary outcome by fitting a logistic curve. Logistic Regression is often chosen as a baseline model due to its simplicity and interpretability.

Why it’s Used Here: It works well when the log-odds of churn are approximately linear in the features, and its coefficients are easy to interpret. Logistic Regression also outputs probabilities, so the decision threshold can be adjusted to trade precision against recall.

In [8]:
from sklearn.linear_model import LogisticRegression

# Initialize and fit the logistic regression model
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)
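
The threshold adjustment mentioned above can be sketched as follows. This is an illustration on synthetic data from `make_classification`, not the notebook's churn data, and the 0.3 threshold is an arbitrary assumption; `clf`, `proba`, and `recalls` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

# Synthetic, imbalanced stand-in for the churn data (about 20% positives).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=8, weights=[0.8, 0.2], random_state=42
)
clf = LogisticRegression(max_iter=1000, random_state=42).fit(X_demo, y_demo)

# Predicted probability of the positive class for each row.
proba = clf.predict_proba(X_demo)[:, 1]

recalls = {}
for threshold in (0.5, 0.3):
    # Label a row positive when its probability reaches the threshold.
    pred = (proba >= threshold).astype(int)
    recalls[threshold] = recall_score(y_demo, pred)
    print(threshold, precision_score(y_demo, pred), recalls[threshold])
```

Lowering the threshold can only add positive predictions, so recall never decreases; the cost is usually lower precision.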


# Model 2: Random Forest (Optional for Improvement)
* **Description**: Random Forest is an ensemble learning method that combines multiple decision trees to improve performance. It uses bootstrapping (sampling with replacement) and random feature selection to build a set of diverse trees, whose predictions are averaged to produce the final result. This model is known for handling non-linear relationships and high-dimensional data effectively.

* **Why it’s Used Here**: Random Forest is less prone to overfitting than a single decision tree and can capture non-linear relationships and feature interactions that a linear model misses. It is especially useful if the logistic regression model does not perform well.

In [9]:
from sklearn.ensemble import RandomForestClassifier

# Initialize and fit Random Forest with class weights
rf_model = RandomForestClassifier(random_state=42, class_weight='balanced')
rf_model.fit(X_train, y_train)
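
The `class_weight='balanced'` setting reweights each class by `n_samples / (n_classes * class_count)`, so the minority class counts for more during training. A small sketch of that arithmetic with scikit-learn's `compute_class_weight`, using a made-up label vector with a 20% positive rate:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up labels: 80 retained customers (0) and 20 churned (1).
y_demo = np.array([0] * 80 + [1] * 20)

# balanced weight = n_samples / (n_classes * count of that class)
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_demo
)
print(dict(zip([0, 1], weights)))  # {0: 0.625, 1: 2.5}
```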


# Step 4: Model Evaluation
Evaluate the logistic regression baseline on the held-out test set using accuracy, precision, recall, and F1 score.

In [11]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Predict on test data
y_pred = model.predict(X_test)

# Calculate evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print("Logistic Regression Model Performance:")
print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1 Score: {f1}")


Logistic Regression Model Performance:
Accuracy: 0.811
Precision: 0.5524475524475524
Recall: 0.2010178117048346
F1 Score: 0.2947761194029851
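
Accuracy of 0.811 looks reasonable, but a recall of about 0.20 means the model catches only one churner in five. The sketch below recomputes the four metrics from confusion-matrix counts. These counts are not printed in the notebook; they are inferred from the reported precision, recall, and accuracy under the assumption of a 2,000-row test set (20% of 10,000 rows).

```python
# Inferred counts (assumption: 2,000 test rows), not notebook output.
tp, fp, fn, tn = 79, 64, 314, 1543

accuracy = (tp + tn) / (tp + fp + fn + tn)   # share of all predictions that are correct
precision = tp / (tp + fp)                   # share of predicted churners who churned
recall = tp / (tp + fn)                      # share of actual churners who were caught
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```

Under these counts, 314 of the 393 customers who actually churned are predicted to stay, which is why accuracy alone is a misleading summary here.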


# Step 5: Interpretability and Insights
Understanding which features most influence churn is crucial. Here, we use SHAP (SHapley Additive exPlanations) to interpret feature importance, providing insights into how each feature contributes to churn predictions.

**SHAP for Interpretability**

* **Description**: SHAP values provide a consistent approach to measure each feature’s contribution to a prediction. They assign each feature an importance score for individual predictions, allowing for both global and local interpretability.

* **Why it’s Used Here**: SHAP helps understand which customer attributes are most associated with churn, aiding in targeted retention strategies.
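
To give a feel for what a SHAP value is, the sketch below uses the closed form that applies to a linear model when features are treated as independent: each feature's contribution is its coefficient times the feature's deviation from the background mean. This is an illustration on synthetic data in log-odds space, not the tree-based explainer needed for the Random Forest; all names (`clf`, `phi`, `base_value`) are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data standing in for the churn features.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Per-sample, per-feature contributions: w_i * (x_i - mean(x_i)).
background_mean = X_demo.mean(axis=0)
phi = clf.coef_[0] * (X_demo - background_mean)   # shape (n_samples, n_features)

# Local view: base value plus contributions reproduces each prediction (log-odds).
base_value = clf.decision_function(X_demo).mean()
reconstructed = base_value + phi.sum(axis=1)

# Global view: average absolute contribution per feature.
global_importance = np.abs(phi).mean(axis=0)
print(global_importance)
```

The same two readings carry over to tree models: per-row contributions explain a single customer's prediction, and their mean absolute value ranks features overall.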

In [5]:
import pandas as pd

# Load the dataset
data = pd.read_csv('/content/Bank Customer Churn Prediction.csv')


In [6]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Drop 'customer_id' as it is not useful for prediction
data = data.drop(['customer_id'], axis=1)

# One-hot encode categorical columns
data = pd.get_dummies(data, columns=['country', 'gender'], drop_first=True)

# Separate features (X) and target variable (y)
X = data.drop('churn', axis=1)
y = data['churn']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale features for consistency
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Convert scaled arrays back to DataFrame for SHAP
X_train = pd.DataFrame(X_train, columns=X.columns)
X_test = pd.DataFrame(X_test, columns=X.columns)


In [7]:
from sklearn.ensemble import RandomForestClassifier

# Train Random Forest model
rf_model = RandomForestClassifier(random_state=42, n_estimators=50, class_weight='balanced')
rf_model.fit(X_train, y_train)


In [16]:
# Check the shape of shap_values
print(f"shap_values shape (for class 1): {shap_values[1].shape}")
# Check the shape of X_sample
print(f"X_sample shape: {X_sample.shape}")


shap_values shape (for class 1): (11, 2)
X_sample shape: (100, 11)
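
The `(11, 2)` shape suggests that `shap_values` is a single array laid out as `(n_samples, n_features, n_classes)`, which is what recent SHAP releases return for tree classifiers, rather than a list with one array per class. Under that assumption, `shap_values[1]` selects the second sample, not class 1. The cell that created `shap_values` and `X_sample` is not shown, so the sketch below reproduces the shapes with a dummy NumPy array instead.

```python
import numpy as np

# Dummy array with the assumed layout (n_samples, n_features, n_classes).
dummy_shap = np.zeros((100, 11, 2))

second_sample = dummy_shap[1]     # indexes the sample axis
class_one = dummy_shap[:, :, 1]   # indexes the class axis

print(second_sample.shape)  # (11, 2)
print(class_one.shape)      # (100, 11)
```

Only the second form matches the `(100, 11)` shape of `X_sample`, which is what a per-class SHAP plot expects.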


In [17]:
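# Select the class-1 (churn) SHAP values so they line up with X_sample.
# shap_values[1] has shape (11, 2) above, so shap_values appears to be one
# (n_samples, n_features, n_classes) array and [1] picks a sample, not a class.
if isinstance(shap_values, list):
    shap_values_adjusted = shap_values[1]        # older SHAP: one array per class
else:
    shap_values_adjusted = shap_values[:, :, 1]  # newer SHAP: class is the last axis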


In [18]:
# Compare feature names
print("Model Features:", rf_model.feature_names_in_)
print("X_sample Features:", X_sample.columns.tolist())


Model Features: ['credit_score' 'age' 'tenure' 'balance' 'products_number' 'credit_card'
 'active_member' 'estimated_salary' 'country_Germany' 'country_Spain'
 'gender_Male']
X_sample Features: ['credit_score', 'age', 'tenure', 'balance', 'products_number', 'credit_card', 'active_member', 'estimated_salary', 'country_Germany', 'country_Spain', 'gender_Male']


In [19]:
# Recreate X_sample with correct feature names
X_sample = pd.DataFrame(X_sample, columns=rf_model.feature_names_in_)
