# 🏦 Bank Customer Segmentation


## 📚 Table of Contents
### 1️⃣ Data Preparation & Exploration
- [1.1 📥 Import Libraries & Load Data](#import-libraries)

### 2️⃣ Data Cleaning & Preprocessing
- [2.1 🧹 Clean Missing Values](#clean-missing)
- [2.2 🔄 Create RFM Features](#rfm)
- [2.3 📏 Handle Outliers](#handle-outliers)

### 3️⃣ Exploratory Data Analysis
- [3.1 📊 Exploratory Data Analysis (EDA)](#exploratory-data-analysis)

### 4️⃣ Feature Engineering
- [4.1 🔍 Scaling & Feature Engineering](#scaling-encoding)

### 5️⃣ Model Development
- [5.1 🔄 Implement K-Means](#kmean)
- [5.2 📊 Implement PCA](#pca)

### 6️⃣ Conclusions
- [6.1 💡 Key Insights](#key-insights)
- [6.2 📊 Model Explainability using SHAP](#SHAP)


---
## Goal of this Notebook🎯
- Perform Clustering / Segmentation on the dataset and identify popular customer groups along with their definitions/rules
- Perform Location-wise analysis to identify regional trends in India
- Perform transaction-related analysis to identify interesting trends that a bank can use to improve / optimize its user experience
- Customer Recency, Frequency, Monetary analysis
- Network analysis or Graph analysis of customer data.

<a id="import-libraries"></a>
## 1.1 📥 Import Libraries & Load Data


In [None]:
import re
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
plt.style.use("fivethirtyeight")
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=DeprecationWarning)
warnings.simplefilter(action='ignore', category=RuntimeWarning)

import scipy.cluster.hierarchy as sch
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans 
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

from kneed import KneeLocator
from yellowbrick.cluster import KElbowVisualizer


In [None]:
df = pd.read_csv(r"data/bank_transactions.csv")
df.head()

In [None]:
df.info()

<a id="clean-missing"></a>
## **2.1 🧹 Clean Missing Values**

In [None]:
n_customers = df['CustomerID'].nunique()
print(f"# unique customers: {n_customers}")
print(60*'-')
n_locations = df['CustLocation'].nunique()
print(f"# unique locations: {n_locations}")
print(60*'-')
df['CustGender'].value_counts()

In [None]:
df.isna().sum()

In [None]:
df.dropna(inplace=True)
df.isna().sum()

In [None]:
# Convert datetime data
df['TransactionDate'] = pd.to_datetime(df['TransactionDate'])
df['CustomerDOB'] = pd.to_datetime(df['CustomerDOB'])

In [None]:
# Approximate age at transaction time: difference of calendar years only
df['Age'] = df['TransactionDate'].dt.year - df['CustomerDOB'].dt.year

In [None]:
df.drop(df[df['CustGender']=='T'].index,inplace=True)
num_col = df.select_dtypes(include=np.number)
cat_col = df.select_dtypes(exclude=np.number)

In [None]:
plt.style.use("fivethirtyeight")
plt.figure(figsize=(30,40))
for index,column in enumerate(num_col):
    plt.subplot(12,6,index+1)
    sns.boxplot(data=num_col,x=column)
    
plt.tight_layout(pad = 1.0)

In [None]:
# Negative ages are set to 0, on the assumption that they belong to accounts (e.g. a Life Saver youth savings account) opened on behalf of a child or grandchild.
df.loc[df['Age'] < 0, 'Age'] = 0
# remove outliers
df = df[df['Age'] <= 100]
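
A different possible source of negative ages, separate from the explanation in the comment above, is the parsing of two-digit birth years: with `%y`, the values 00-68 are read as 2000-2068, so a customer born in 1957 can end up with a birth date in 2057. The sketch below is only an illustration on toy rows; the `d/m/yy` layout and the sample values are assumptions, not taken from the dataset.

```python
import pandas as pd

# Toy rows; the d/m/yy layout is an assumption about the raw file.
toy = pd.DataFrame({
    "CustomerDOB": ["10/1/94", "26/11/57", "4/4/73"],
    "TransactionDate": ["2/8/16", "2/8/16", "2/8/16"],
})
dob = pd.to_datetime(toy["CustomerDOB"], format="%d/%m/%y")
txn = pd.to_datetime(toy["TransactionDate"], format="%d/%m/%y")

# "57" was parsed as 2057: any birth date after the transaction is shifted back a century.
in_future = dob > txn
dob = dob.where(~in_future, dob - pd.DateOffset(years=100))

toy["Age"] = txn.dt.year - dob.dt.year
print(toy)
```

Whether such a correction is appropriate depends on how the dates are actually stored, so it is worth inspecting the affected rows before applying it.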

In [None]:
plt.figure(figsize=(6,3))
sns.boxplot(data=df, x='Age')
plt.show()

<a id="rfm"></a>
## **2.2 🔄 Create RFM Features**

### Next, we build the RFM DataFrame for the customer RFM analysis

In [None]:
df['TransactionDate1']=df['TransactionDate'] # for first transaction
df['TransactionDate2']=df['TransactionDate'] # for last transaction

# Creating MRF table
MRF_df = df.groupby("CustomerID").agg({
                                        "TransactionID" : "count",
                                        "CustGender" : "first",
                                        "CustLocation":"first",
                                        "CustAccountBalance"  : "median",
                                        "TransactionAmount (INR)" : "mean",
                                        "Age" : "median",
                                        "TransactionDate2":"max",
                                        "TransactionDate1":"min",
                                        "TransactionDate":"median"
                        })

MRF_df = MRF_df.reset_index()
MRF_df.info()

In [None]:
MRF_df.head()

In [None]:
MRF_df['CustomerID'].nunique()

In [None]:
MRF_df['TransactionID'].value_counts()

---
- ### The maximum number of transactions per customer is 6
- ### Our MRF DataFrame has no duplicated customer IDs (which will help us avoid data leakage later)
---

### RFM is an abbreviation for Recency, Frequency, and Monetary, a framework widely used in the banking sector to understand customer behavior.

- #### Recency: How recently the customer was active or made a transaction.

- #### Frequency: How often the customer transacts in a given period of time.

- #### Monetary: The amount the customer spends in that period.
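
To make the three definitions concrete, here is a minimal sketch on a hypothetical toy table. It uses the conventional reading of recency, days since a customer's last transaction measured against a snapshot date; the snapshot date (one day after the latest transaction) and the toy values are assumptions. Note that the notebook's own `Recency` column is computed differently, as the span between a customer's first and last transaction.

```python
import pandas as pd

# Hypothetical toy transactions; column names mirror the notebook's dataset.
tx = pd.DataFrame({
    "CustomerID": ["C1", "C1", "C2", "C3", "C3", "C3"],
    "TransactionDate": pd.to_datetime(
        ["2016-08-01", "2016-08-20", "2016-09-05",
         "2016-08-10", "2016-09-01", "2016-09-15"]),
    "TransactionAmount (INR)": [100.0, 300.0, 50.0, 20.0, 40.0, 60.0],
})

# Assumption: recency is measured against the day after the last recorded transaction.
snapshot = tx["TransactionDate"].max() + pd.Timedelta(days=1)

rfm = tx.groupby("CustomerID").agg(
    Recency=("TransactionDate", lambda d: (snapshot - d.max()).days),  # days since last activity
    Frequency=("TransactionDate", "count"),                            # number of transactions
    Monetary=("TransactionAmount (INR)", "mean"),                      # average amount, as in the notebook
)
print(rfm)
```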

In [None]:
MRF_df.rename(columns={"TransactionID":"Frequency"},inplace=True)
# Span between each customer's first and last transaction (a Timedelta).
# Note: this measures the length of activity, not the time since the last transaction.
MRF_df['Recency']=MRF_df['TransactionDate2']-MRF_df['TransactionDate1']
MRF_df['Recency'].value_counts()

In [None]:
# Extract the number of whole days directly from the Timedelta
MRF_df['Recency']=MRF_df['Recency'].dt.days
MRF_df['Recency'].value_counts()

In [None]:
MRF_df.rename(columns={"TransactionAmount (INR)":"Monetary"},inplace=True)

In [None]:
# 0 days means all of the customer's transactions happened on the same day, so we replace it with 1
MRF_df.loc[MRF_df['Recency'] == 0, 'Recency'] = 1
MRF_df['Recency'].value_counts()

<a id="handle-outliers"></a>
## **2.3 📏 Handle Outliers**

In [None]:
df.drop(columns=["TransactionDate1","TransactionDate2"],inplace=True)
MRF_df.drop(columns=["TransactionDate1","TransactionDate2"],inplace=True)

In [None]:
num_col = MRF_df.select_dtypes(include=np.number)
cat_col = MRF_df.select_dtypes(exclude=np.number)

In [None]:
num_col = num_col.drop(columns=["Frequency"])
num_col.head()

In [None]:
plt.style.use("fivethirtyeight")
plt.figure(figsize=(30,40))
for index,column in enumerate(num_col):
    plt.subplot(12,6,index+1)
    sns.boxplot(data=num_col,x=column)
    
plt.tight_layout(pad = 1.0)

In [None]:
# Calculate the outliers for each feature using the IQR rule
lower_list=[]
upper_list=[]
num_list=[]
perc_list=[]
cols=['CustAccountBalance','Monetary', 'Age', 'Recency']
for i in cols:
    Q1 = MRF_df[i].quantile(0.25)
    Q3 = MRF_df[i].quantile(0.75)
    IQR = Q3 - Q1

    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR

    # calculate number of outliers
    num=MRF_df[(MRF_df[i] < lower) | (MRF_df[i] > upper)].shape[0]
    
    # calculate percentage of outliers
    perc = (num / MRF_df.shape[0]) * 100
    lower_list.append(lower)
    upper_list.append(upper)
    num_list.append(num)
    perc_list.append(round(perc,2))

dic={'lower': lower_list, 'upper': upper_list, 'outliers': num_list, 'Perc%':perc_list }

In [None]:
outliers_df=pd.DataFrame(dic,index=['CustAccountBalance','Monetary', 'Age', 'Recency'])
outliers_df

### These values look like outliers under the IQR rule, but they are genuine customer values, so we keep them
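
As a small worked example of the IQR rule used in the loop above, the sketch below computes the two fences for a hypothetical set of balances. The helper name `iqr_bounds` and the numbers are illustrative only; the point is that a single large but legitimate value falls outside the fences, which is why a flagged value is not automatically an error.

```python
import numpy as np

def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR for a 1-D array."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

balances = np.array([10, 12, 13, 15, 18, 20, 22, 25, 400])  # hypothetical values
lower, upper = iqr_bounds(balances)

# Values outside the fences are flagged, whether or not they are data errors.
outliers = balances[(balances < lower) | (balances > upper)]
print(lower, upper, outliers)
```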

---
<a id="exploratory-data-analysis"></a>
## **3.1 📊 Exploratory Data Analysis (EDA)**


In [None]:
MRF_df.head()

In [None]:
MRF_df.describe()

In [None]:
plt.figure(figsize=(7, 5))
correlation = MRF_df.select_dtypes(include=[float, int]).corr()

sns.heatmap(
    correlation,
    vmin=-1,          # allow full correlation range
    vmax=1,
    cmap='BuPu',
    annot=True,
    fmt='.1f',
    linecolor='white',
    cbar=True
)
plt.show()


In [None]:
plt.style.use("fivethirtyeight")
fig,ax=plt.subplots(ncols=3,nrows=1,figsize=(15,5))
palette_color = sns.color_palette('rocket')
ax[0].hist(x=MRF_df['Age'],color='purple')
ax[0].set_title("Distribution of Customer Age")
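gender_counts=MRF_df['CustGender'].value_counts()
# Take the labels from the counts' own index so they cannot be swapped
ax[1].pie(gender_counts,autopct='%1.f%%',colors=palette_color,labels=gender_counts.index.map({'M':'Male','F':'Female'}))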
ax[1].set_title("Customer Gender")
ax[2].hist(x=MRF_df['Frequency'],color='purple')
ax[2].set_title("Frequency Distribution")
plt.tight_layout()

In [None]:
diff = MRF_df['Recency'].max()-MRF_df['Recency'].min()
print("Difference between maximum and minimum recency:", diff)

In [None]:
MRF_df.info()

In [None]:
MRF_df=MRF_df.sort_values(by='TransactionDate')
MRF = MRF_df.copy() 
MRF = MRF.drop(columns=['CustGender','CustLocation', 'CustomerID'])
monthly=MRF.groupby([pd.Grouper(key='TransactionDate',freq='M')]).mean()

monthly

In [None]:
plt.style.use("fivethirtyeight")
fig,ax=plt.subplots(ncols=2,nrows=1,figsize=(15,5))
ax[0].plot(monthly.index,monthly['CustAccountBalance'],color='purple',marker='o',label='Customer Account Balance')
ax[0].set_title('Customer Account Balance Over Time')
ax[0].legend()  # plt.legend() alone would only label the last axes
ax[1].plot(monthly.index,monthly['Monetary'],color='purple',marker='o',label='Monetary')
ax[1].set_title("Monetary Over Time")
ax[1].legend()

---
<a id="scaling-encoding"></a>
## **4.1 🔍 Scaling & Feature Engineering**


In [None]:
MRF_df=MRF_df.reset_index(drop=True) # drop=True means the old index is not added as a column
MRF_df.head()


In [None]:
MRF_df.drop(columns=['CustLocation','TransactionDate'],inplace=True)

In [None]:
MRF_df['CustGender']=MRF_df['CustGender'].map({'M':1,'F':0})


In [None]:
MRF_df.drop(columns=['CustomerID'],inplace=True)

In [None]:
scaler = StandardScaler()
df_scaled=scaler.fit_transform(MRF_df)
df_scaled=pd.DataFrame(df_scaled,columns=MRF_df.columns) # convert the scaled NumPy array back to DataFrame
df_scaled.head()

In [None]:
df_scaled.info()

### The full dataset is too large to work with comfortably, so we draw a random sample of 200,000 rows from the scaled DataFrame.


In [None]:
df_scaled=df_scaled.sample(n=200000,random_state=42).reset_index(drop=True)
df_scaled.head()

---
<a id="kmean"></a>
## **5.1 🔄 Implement K-Means**


In [None]:
# Kmeans algorithm settings
kmeans_set = {"init":"random", "max_iter":300, "random_state":42}

# Silhouette score and inertia
inertias = []
silhouette_coefficients = []

for k in range(3,7):
    kmeans = KMeans(n_clusters=k, **kmeans_set)    # unpack the settings dictionary
    kmeans.fit(df_scaled)
    inertias.append(kmeans.inertia_)
    score=silhouette_score(df_scaled, kmeans.labels_)
    silhouette_coefficients.append(score)

In [None]:
fig,ax=plt.subplots(1,2,figsize=(20,5))
kl = KneeLocator(range(3,7), inertias, curve='convex', direction="decreasing") # KneeLocator is used to find the "elbow" point in the inertia plot
ax[0].plot(range(3,7), inertias ,marker='o',label='Elbow Method')
ax[0].set_xticks(range(3,7))
ax[0].set_xlabel("Number of Clusters", labelpad=20)
ax[0].set_ylabel("Inertia", labelpad=20)
ax[0].set_title("Elbow Method")
ax[0].axvline(x=kl.elbow, color='black', label='Elbow', ls='--', linewidth=3)

ax[1].plot(range(3,7), silhouette_coefficients,marker='o',label='silhouette coefficients')
ax[1].set_xticks(range(3,7))
ax[1].set_title("silhouette values")
ax[1].set_xlabel("Number of clusters")
ax[1].set_ylabel("silhouette_coefficients")
plt.tight_layout()

In [None]:
plt.figure(figsize=(30,10))
z = sch.linkage(df_scaled.iloc[:500,:], 'ward')
sch.dendrogram(z)
plt.show()

### From the elbow method, the optimal K lies between 4 and 5 clusters, so we narrow the candidate K values to the range [4, 6].
### From the silhouette method, the best K within that range is **6 clusters**.
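
The silhouette-based choice can be sketched on synthetic data as follows. The blobs stand in for `df_scaled` and are purely illustrative. Because the silhouette coefficient compares every point with every other point, the sketch uses the `sample_size` argument of `silhouette_score` to score a random subset; this is one option for large samples, not what the cells above do.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for df_scaled: three well-separated blobs.
X, _ = make_blobs(n_samples=600, centers=[[-6, -6], [0, 6], [6, -6]],
                  cluster_std=0.5, random_state=42)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    # sample_size scores a random subset instead of all pairwise distances
    scores[k] = silhouette_score(X, labels, sample_size=300, random_state=42)

# The K with the highest silhouette coefficient is the preferred candidate.
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

On real data the scores are usually much closer together than on clean blobs, so it helps to read them alongside the elbow plot rather than on their own.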

In [None]:
kmeans = KMeans(n_clusters=6 , **kmeans_set) # **kmeans_set is used to unpack the dictionary
kmeans.fit(df_scaled)
df_scaled['Label']=kmeans.labels_
df_scaled.head()

In [None]:
df_scaled['Label'].value_counts()

In [None]:
attributes = df_scaled.columns[:6]
colors=['purple','yellow','red','blue','magenta','orange']
cluster=['Cluster 1','Cluster 2','Cluster 3','Cluster 4','Cluster 5','Cluster 6']

fig = go.Figure()
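for index in range(0,6):
    # .mean() gives one average per feature; np.mean on a DataFrame may collapse to a single scalar in recent pandas
    fig.add_trace(go.Scatterpolar(r=df_scaled.loc[df_scaled.Label == index, attributes].mean(),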
                                  theta=attributes,
                                  fill='toself',
                                  fillcolor=colors[index],
                                  opacity=0.4,
                                  name=cluster[index]))
    
fig.update_layout(polar=dict(radialaxis=dict(visible=True)),showlegend=True,
                  title="Radar plot - Cluster attributes (Normalized Values)")
fig.show()

### **Visualize the Clusters using PCA**
#### Reduce the high-dimensional data to 2 principal components and then plot.

In [None]:
features = df_scaled.drop('Label', axis=1)

pca = PCA(n_components=2) 
principal_components = pca.fit_transform(features)

# Create a DataFrame for the principal components
pca_df = pd.DataFrame(data = principal_components,
                      columns = ['principal_component_1', 'principal_component_2'])

# Add the cluster labels to the PCA DataFrame
pca_df['Label'] = df_scaled['Label']

plt.figure(figsize=(10, 7))
sns.scatterplot(x='principal_component_1', y='principal_component_2', hue='Label',
                data=pca_df, palette='viridis', s=50, alpha=0.8)
plt.title('K-Means Clusters in PCA Reduced Space (2 Components)')
plt.xlabel(f'Principal Component 1 ({pca.explained_variance_ratio_[0]*100:.2f}% Variance)')
plt.ylabel(f'Principal Component 2 ({pca.explained_variance_ratio_[1]*100:.2f}% Variance)')
plt.legend(title='Cluster')
plt.grid(True, linestyle='--', alpha=0.6)
plt.show()

# If you want to see explained variance ratio
print(f"Explained variance by PC1: {pca.explained_variance_ratio_[0]:.2f}")
print(f"Explained variance by PC2: {pca.explained_variance_ratio_[1]:.2f}")
print(f"Total explained variance by 2 PCs: {pca.explained_variance_ratio_.sum():.2f}")

---
<a id="pca"></a>
## **5.2 📊 Implement PCA**

In [None]:
# Check how many components are needed to explain 90% of the variance
plt.style.use("fivethirtyeight")
pca = PCA().fit(df_scaled.iloc[:,:-1])
cum_var = np.cumsum(pca.explained_variance_ratio_)  # e.g. cumsum([4,2,3,1,6]) ==> [4,6,9,10,16]
n_comp = np.arange(1, len(cum_var) + 1)  # component counts start at 1, not 0
plt.plot(n_comp, cum_var, color='purple', marker='o')
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance')
plt.xticks(n_comp)
plt.show()
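
The 90% threshold can also be read off numerically instead of from the plot. The sketch below uses synthetic data (six features driven by two latent factors, an assumption made only for illustration) and shows two equivalent routes: locating the first point where the cumulative ratio reaches 0.90, and passing a fraction as `n_components` so that scikit-learn picks the number of components itself.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data: 6 features driven by 2 latent factors plus small noise.
latent = rng.normal(size=(500, 2))
X = latent @ rng.normal(size=(2, 6)) + 0.05 * rng.normal(size=(500, 6))

cumulative = np.cumsum(PCA().fit(X).explained_variance_ratio_)
# First index where the cumulative ratio reaches 0.90, converted to a component count
n_needed = int(np.argmax(cumulative >= 0.90)) + 1

# With a fractional n_components, scikit-learn keeps just enough components
pca_90 = PCA(n_components=0.90).fit(X)
print(n_needed, pca_90.n_components_)
```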

In [None]:
pca = PCA(n_components=4)
pca_data = pca.fit_transform(df_scaled.iloc[:,:-1])

inertia = []
silhouette = []


for k in range(2,11):
    km = KMeans(init='k-means++', n_clusters=k, random_state=42)
    km.fit(pca_data)
    inertia.append(km.inertia_)
    P_score=silhouette_score(pca_data, km.labels_)
    silhouette.append(P_score)

In [None]:
plt.style.use("fivethirtyeight")
fig,ax=plt.subplots(1,2,figsize=(20,5))
k_values = range(2,11)  # must match the K range of the loop that filled inertia and silhouette
Pkl = KneeLocator(k_values, inertia, curve='convex', direction="decreasing")
# draw Elbow
ax[0].plot(k_values, inertia,color='purple',marker='o',label='Elbow Method')
ax[0].set_xticks(k_values)
ax[0].set_xlabel("Number of Clusters", labelpad=20)
ax[0].set_ylabel("Inertia", labelpad=20)
ax[0].set_title("Elbow Method")
ax[0].axvline(x=Pkl.elbow, color='black', label='Elbow', ls='--', linewidth=3)
# draw silhouette
ax[1].plot(k_values, silhouette,color='purple',marker='o',label='silhouette coefficients')
ax[1].set_xticks(k_values)
ax[1].set_title("silhouette values")
ax[1].set_xlabel("Number of clusters")
ax[1].set_ylabel("silhouette_coefficients")
plt.tight_layout()

In [None]:
PCA_kmeans = KMeans(n_clusters=7 , **kmeans_set)    
PCA_kmeans.fit(pca_data)
# create a dataframe for the pca data
pca_df = pd.DataFrame(pca_data, columns=['PC1', 'PC2', 'PC3', 'PC4'])
pca_df['label'] = PCA_kmeans.labels_
pca_df.head()

In [None]:
attributes = pca_df.columns[:4]
colors=['purple','yellow','red','blue','magenta','orange','green']

fig = go.Figure()
# Plot every cluster found by PCA_kmeans, not only the first four
for index in sorted(pca_df['label'].unique()):
    # .mean() gives one average per principal component
    fig.add_trace(go.Scatterpolar(r=pca_df.loc[pca_df.label == index, attributes].mean(),
                                  theta=attributes,
                                  fill='toself',
                                  fillcolor=colors[index % len(colors)],
                                  opacity=0.4,
                                  name=f'Cluster {index + 1}'))
    
fig.update_layout(polar=dict(radialaxis=dict(visible=True)),showlegend=True,
                  title="Radar plot - Cluster means on the principal components")
fig.show()