
<b><h2><center>Unsupervised Learning Project - Bank Customer Segmentation</center></h2></b>
<center><img src="https://copyassignment.com/wp-content/uploads/2021/07/1_iejTpHhx-u_R73XQD0GFkg.jpeg" width="800" height="400"></center>

### **Description**
### **Context**

Most banks have a large customer base - with different characteristics in terms of age, income, values, lifestyle, and more. Customer segmentation is the process of dividing a customer dataset into specific groups based on shared traits.

According to a report from Ernst & Young, “A more granular understanding of consumers is no longer a nice-to-have item, but a strategic and competitive imperative for banking providers. Customer understanding should be a living, breathing part of everyday business, with insights underpinning the full range of banking operations.”

### **Objective**
To identify distinct segments among the bank's existing customers, based on their spending patterns and past interactions with the bank, using clustering algorithms, and to provide recommendations on how the bank can better market to and serve these customers.

### **Data Description**
This dataset consists of more than 1 million transactions by over 800K customers of a bank in India. It contains information such as customer age (as DOB), location, gender, account balance at the time of the transaction, transaction details, and transaction amount.

### **Data Dictionary:**
     1. TransactionID: Transaction ID for every transaction
     2. CustomerID: Unique ID for each customer
     3. CustomerDOB: Customer Date of Birth
     4. CustGender: Gender of the customer
     5. CustLocation: Location of the customer
     6. CustAccountBalance: Account Balance at the time of the transaction
     7. TransactionDate: Date of the transaction
     8. TransactionTime: Time of the transaction
     9. TransactionAmount (INR): Transaction amount in Indian Rupees

## **Fetching Data from the Kaggle website**

In [9]:
from google.colab import files
files.upload()

KeyboardInterrupt: 

In [10]:
# install the kaggle API client
!pip install -q kaggle

In [11]:
# The Kaggle API client expects this file to be in ~/.kaggle, so move it there.
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/

# This permissions change avoids a warning on Kaggle tool startup.
!chmod 600 ~/.kaggle/kaggle.json

cp: cannot stat 'kaggle.json': No such file or directory
chmod: cannot access '/root/.kaggle/kaggle.json': No such file or directory


In [12]:
! mkdir bank_customer_segmentation
%cd bank_customer_segmentation

/content/bank_customer_segmentation/bank_customer_segmentation


In [13]:
!kaggle datasets download -d shivamb/bank-customer-segmentation


Traceback (most recent call last):
  File "/usr/local/bin/kaggle", line 5, in <module>
    from kaggle.cli import main
  File "/usr/local/lib/python3.10/dist-packages/kaggle/__init__.py", line 23, in <module>
    api.authenticate()
  File "/usr/local/lib/python3.10/dist-packages/kaggle/api/kaggle_api_extended.py", line 403, in authenticate
    raise IOError('Could not find {}. Make sure it\'s located in'
OSError: Could not find kaggle.json. Make sure it's located in /root/.kaggle. Or use the environment method.


In [14]:
# Unzip the downloaded archive into the bank_customer_segmentation folder
!unzip bank-customer-segmentation.zip -d bank_customer_segmentation
# !rm content/bank_customer_segmentation/bank-customer-segmentation.zip

# !unzip bank_customer_segmentation.zip
# !unzip /content/bank_customer_segmentation/bank-customer-segmentation.zip -d bank_customer_segmentation

unzip:  cannot find or open bank-customer-segmentation.zip, bank-customer-segmentation.zip.zip or bank-customer-segmentation.zip.ZIP.


## **Exploratory Data Analysis**

**Importing necessary libraries**

In [7]:
import datetime as dt
# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd

# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns

import tensorflow as tf

# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)

# to scale the data using z-score
from sklearn.preprocessing import StandardScaler

# to compute distances
from scipy.spatial.distance import pdist
from scipy.spatial.distance import cdist

# to perform hierarchical clustering, compute cophenetic correlation, and create dendrograms
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage, cophenet

# to perform k-means clustering and compute silhouette scores
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# to visualize the elbow curve and silhouette scores
from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer

# to suppress warnings
import warnings

warnings.filterwarnings("ignore")

In [8]:
# loading the dataset
data = pd.read_csv('/content/bank_customer_segmentation/bank-customer-segmentation.zip')

FileNotFoundError: [Errno 2] No such file or directory: '/content/bank_customer_segmentation/bank-customer-segmentation.zip'

In [None]:
# Checking the shape of the dataset
print('The dataset has',data.shape[0], 'rows and', data.shape[1],'columns')

In [None]:
# check for first 5 rows of the data
data.head()

In [None]:
# copying the data to another variable to avoid any changes to original data
df = data.copy()

In [None]:
# let's look at the structure of the data
df.info()

In [None]:
# Summarize each column: dtype, number of unique values, and number of nulls
def check(df):
    rows = []
    for col in df.columns:
        dtypes = df[col].dtypes
        nunique = df[col].nunique()
        sum_null = df[col].isnull().sum()
        rows.append([col, dtypes, nunique, sum_null])
    df_check = pd.DataFrame(rows, columns=['column', 'dtypes', 'nunique', 'sum_null'])
    return df_check
check(df)

**Observations:**
- This dataset has 3 numerical columns and 6 categorical columns.
- The CustomerDOB and TransactionDate columns should be converted to a datetime type.
- We can drop the rows with missing values, since very little data is missing.
- We can calculate the customer's age by subtracting CustomerDOB from TransactionDate.
- We can delete the TransactionTime column, as we don't need the time of the transaction.



In [None]:
# delete missing data
df.dropna(inplace=True)

In [None]:
#check for duplication
df.duplicated().sum()

In [None]:
# convert type of columns TransactionDate,CustomerDOB from string to datetime
df['TransactionDate'] = pd.to_datetime(df['TransactionDate'])
df['CustomerDOB'] = pd.to_datetime(df['CustomerDOB'])

In [None]:
df['TransactionMonth'] = df.TransactionDate.dt.month
df['TransactionMonthName'] = df.TransactionDate.dt.month_name()
df['TransactionDay'] = df.TransactionDate.dt.day
df['TransactionDayName'] = df.TransactionDate.dt.day_name()

In [None]:
# let's check the data types of TransactionDate and CustomerDOB
df.dtypes

#### Performing Recency, Frequency and Monetary analysis

In [None]:
# Calculate the recency, frequency and monetary value for each customer
snapshot_date = df['TransactionDate'].max() + dt.timedelta(days=1)
df_rfm = df.groupby('CustomerID').agg({
    'TransactionDate': lambda x: (snapshot_date - x.max()).days,
    'TransactionID': 'count',
    "TransactionAmount (INR)": 'sum'
})

In [None]:
df_rfm.rename(columns={
    'TransactionDate': 'recency',
    'TransactionID': 'frequency',
    "TransactionAmount (INR)": 'monetary_value',
}, inplace = True)
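
To make the three quantities concrete, here is a minimal sketch of the same recency, frequency, and monetary aggregation on a tiny hand-made table. The three transactions and the name `tx` are hypothetical; only the column names and the snapshot-date logic follow the cells above.

```python
import datetime as dt
import pandas as pd

# Hypothetical toy transactions (not taken from the real dataset)
tx = pd.DataFrame({
    "CustomerID": ["C1", "C1", "C2"],
    "TransactionID": ["T1", "T2", "T3"],
    "TransactionDate": pd.to_datetime(["2016-08-01", "2016-08-20", "2016-08-10"]),
    "TransactionAmount (INR)": [100.0, 250.0, 40.0],
})

# Snapshot date is one day after the latest transaction, as in the notebook
snapshot_date = tx["TransactionDate"].max() + dt.timedelta(days=1)

# One row per customer: days since last transaction, transaction count, total spend
rfm = tx.groupby("CustomerID").agg(
    recency=("TransactionDate", lambda x: (snapshot_date - x.max()).days),
    frequency=("TransactionID", "count"),
    monetary_value=("TransactionAmount (INR)", "sum"),
)
print(rfm)
```

Customer C1 transacted most recently, so its recency is the smallest; lower recency means a more recently active customer.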

In [None]:
# Create bins and labels for each RFM score
r_bins = [0, 30, 60, 90, df_rfm['recency'].max()]
f_bins = [0, 1, 2, 3, 4, 5, df_rfm['frequency'].max()]
m_bins = [0, 100, 500, 1000, df_rfm['monetary_value'].max()]
r_labels = [4, 3, 2, 1]
f_labels = [1, 2, 3, 4, 5, 6]
m_labels = [1, 2, 3, 4]
df_rfm['r_score'] = pd.cut(df_rfm['recency'], bins=r_bins, labels=r_labels, include_lowest=True)
df_rfm['f_score'] = pd.cut(df_rfm['frequency'], bins=f_bins, labels=f_labels, include_lowest=True)
df_rfm['m_score'] = pd.cut(df_rfm['monetary_value'], bins=m_bins, labels=m_labels, include_lowest=True)

In [None]:
# Combine the RFM scores to create a single RFM score
df_rfm['RFM'] = df_rfm['r_score'].astype(str) + df_rfm['f_score'].astype(str) + df_rfm['m_score'].astype(str)

In [None]:
# Assign customer segments based on the RFM score
def get_segment(x):
    if x in ['111', '112', '113', '114', '115', '116', '117']:
        return 'High-value'
    elif x in ['121', '122', '123', '124', '125', '126', '127']:
        return 'Mid-value'
    else:
        return 'Low-value'
df_rfm['segment'] = df_rfm['RFM'].apply(get_segment)
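
With the bins defined above, an `r_score` of 4 is the most recent bucket (0 to 30 days), and higher `f_score` and `m_score` values mean more transactions and more spend. A code such as `'111'` therefore describes the least recent, least frequent, lowest-spend customers. As a hedged alternative to matching individual codes, the sketch below sums the three scores and assigns segments by threshold. The thresholds (`high=11`, `mid=7`), the helper name `segment_from_total`, and the sample scores are assumptions for illustration, not values from this project.

```python
import pandas as pd

# Hypothetical scores on the same scales as above (r: 1-4, f: 1-6, m: 1-4)
scores = pd.DataFrame(
    {"r_score": [4, 1, 2], "f_score": [6, 1, 2], "m_score": [4, 1, 3]},
    index=["C1", "C2", "C3"],
)

def segment_from_total(total, high=11, mid=7):
    """Map a summed RFM score to a segment; the thresholds are illustrative only."""
    if total >= high:
        return "High-value"
    if total >= mid:
        return "Mid-value"
    return "Low-value"

# Sum the three scores (cast to int in case they are categorical) and label each customer
scores["rfm_total"] = scores[["r_score", "f_score", "m_score"]].astype(int).sum(axis=1)
scores["segment"] = scores["rfm_total"].apply(segment_from_total)
print(scores)
```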

In [None]:
#Combine the segment column with the main dataframe
segment = df_rfm['segment']

# Merge the dataframes based on CustomerID
df = pd.merge(df, segment, on='CustomerID')

**Calculating the age of the customer**

In [None]:
# let's calculate the customer age using Transaction date and Customer Date of Birth
df['CustomerAge'] =df['TransactionDate'].dt.year - df['CustomerDOB'].dt.year

In [None]:
df.head()

- Negative ages appear in the CustomerAge column.

**Analysing the CustomerAge column**

In [None]:
df[df['CustomerAge'] < 0]

- We can drop the rows with negative ages.
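
One possible cause of negative ages, assuming the CustomerDOB strings use two-digit years, is that years 00 to 68 are interpreted as 2000 to 2068 when parsed with `%y`. The sketch below shows that effect and one way to repair it instead of dropping the rows. The sample strings, the day/month/year format, and the reference date are assumptions and should be checked against the real data.

```python
import pandas as pd

# Hypothetical DOB strings with two-digit years (day/month/year format assumed)
dob = pd.Series(["10/1/94", "4/4/57", "26/11/96"])
parsed = pd.to_datetime(dob, format="%d/%m/%y")

# "%y" maps 69-99 to 1969-1999 and 00-68 to 2000-2068, so "57" becomes 2057.
# Any birth date after the reference date is shifted back by 100 years.
reference = pd.Timestamp("2016-12-31")  # assumed to be on or after the last transaction
fixed = parsed.where(parsed <= reference, parsed - pd.DateOffset(years=100))

print(pd.DataFrame({"raw": dob, "parsed": parsed, "fixed": fixed}))
```

Whether to repair or drop such rows is a judgment call; repairing keeps more data but relies on the two-digit-year assumption holding.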

In [None]:
df['TransactionDate'].max()

In [None]:
df['TransactionDate'].min()

- We have transactions of one year in our dataset which starts from January 2016 and ends in December 2016.

In [None]:
df['CustomerAge'].describe()

In [None]:
df['CustomerAge'].max()

- We can drop the rows where the customer's age is greater than 100 or less than 0.

In [None]:
# Filter rows where CustomerAge is greater than 100 or less than 0
filtered_df_1 = df[df['CustomerAge'] > 100]
filtered_df_2 = df[df['CustomerAge'] < 0]

# Combine the indices of filtered DataFrames
indices_to_drop = filtered_df_1.index.union(filtered_df_2.index)

# Drop the resulting rows from the original DataFrame
df.drop(indices_to_drop, inplace=True)


**Checking the CustomerAge column after dropping ages less than zero and greater than 100**

In [None]:
df['CustomerAge'].describe()

**Dropping the TransactionTime column**

In [None]:
# deleting the TransactionTime column (left commented out; the column is dropped in the Data Preparation step)
# df.drop(columns=['TransactionTime'], inplace=True)

In [None]:
df.columns

In [None]:
# let's check the count of gender
df.CustGender.value_counts(normalize=True)

In [None]:
df.CustGender.value_counts()

- The majority of customers in this dataset are male.
- The dataset has 72% male customers and 27% female customers.

In [None]:
df['CustAccountBalance'].max()

In [None]:
df.groupby(by='CustLocation')['CustAccountBalance'].sum().sort_values(ascending=False)

In [None]:
df[df['CustAccountBalance'] == 0]

- There are 1614 records with a zero account balance.

In [None]:
df[df['TransactionAmount (INR)'] > 1000000]

- There are only 2 transactions above 1 million INR.

In [None]:
df['CustLocation'].value_counts()

In [None]:
cities = df['CustLocation'].unique().tolist()
print(cities)
print(len(cities))

- Metro cities such as Mumbai, Bangalore, and New Delhi have the most transactions.

In [None]:
df.groupby(by='CustLocation')['CustAccountBalance'].max()

### Location Data Analysis


In [None]:
indian_cities = pd.read_csv('/content/indian_cities.csv')

In [None]:
indian_cities.head()

In [None]:
# def clean_text(text):
#     stripped_text = ''.join(c for c in text if c.isalpha())
#     lowercase_text = stripped_text.lower()
#     return lowercase_text
import unicodedata

def clean_text(text):
    # normalize the text using NFKD method
    normalized_text = unicodedata.normalize('NFKD', text)

    # remove any non-alphabetic characters
    stripped_text = ''.join(c for c in normalized_text if c.isalpha())

    # convert the text to lowercase
    lowercase_text = stripped_text.lower()

    return lowercase_text
indian_cities['city'] = indian_cities['city'].apply(clean_text)
indian_cities['admin_name'] =  indian_cities['admin_name'].apply(clean_text)

In [None]:
states_map = dict(zip(indian_cities['city'], indian_cities['admin_name']))

In [None]:
!pip install transformers

In [None]:
from transformers import BertTokenizer

import re
# Instantiate a BertTokenizer object
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Define a function to remove numeric tokens from a string
def remove_numeric_tokens(text):
    # Tokenize the text using the BertTokenizer
    tokens = tokenizer.tokenize(text)
    # Remove numeric tokens
    tokens = [token for token in tokens if not token.isnumeric()]
    # Convert the list of tokens back to a string
    text = tokenizer.convert_tokens_to_string(tokens)
    return text

# Apply the remove_numeric_tokens function to the CustLocation column
df['CustLocation'] = df['CustLocation'].apply(remove_numeric_tokens)

# Print the updated DataFrame
print(df.head())
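
If downloading a BERT tokenizer is not desirable, a lighter alternative is a plain regular expression. This is a different technique from the cell above, not an equivalent: it strips every run of digits, including digits embedded in a word, whereas the tokenizer approach removes only fully numeric tokens. The function name `remove_digits` and the sample locations are hypothetical.

```python
import re
import pandas as pd

def remove_digits(text):
    """Remove digit runs, collapse repeated whitespace, and lower-case the result."""
    no_digits = re.sub(r"\d+", " ", text)
    return re.sub(r"\s+", " ", no_digits).strip().lower()

# Hypothetical location strings
locations = pd.Series(["MUMBAI 400001", "NEW DELHI", "SECTOR 62 NOIDA"])
cleaned = locations.apply(remove_digits)
print(cleaned.tolist())
```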


In [None]:
# define a function to classify addresses based on city name
# def classify_address(address):
#     for city in states_map.keys():
#         if city in address:
#             return states_map[city]
#     return 'Unknown'

def classify_address(address):
    for city, state in states_map.items():
        if city in address:
            return state
        # check if any words in address match a city with a slightly different name
        for word in address.split():
            if word in city.split():
                return state
    return 'Unknown'

# classify the given addresses
df['State/Union'] = df['CustLocation'].apply(classify_address)
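
Looping over every city for every row can be slow on a million transactions. The sketch below shows one possible speed-up: classify each distinct location once and broadcast the result with `Series.map`. It uses exact dictionary lookups on single words and on adjacent word pairs joined without a space (to match names such as `newdelhi` produced by `clean_text`), which is stricter than the substring test above. The small `states_map` and the sample locations are hypothetical.

```python
import pandas as pd

# Hypothetical city -> state map in the cleaned (lowercase, letters-only) form
states_map = {"mumbai": "maharashtra", "newdelhi": "delhi", "noida": "uttarpradesh"}

def classify_address(address, mapping):
    """Return the state for the first word or joined word pair found in the mapping."""
    words = address.split()
    # Candidates: each word, then each adjacent pair joined without a space
    candidates = words + ["".join(pair) for pair in zip(words, words[1:])]
    for candidate in candidates:
        if candidate in mapping:
            return mapping[candidate]
    return "Unknown"

locations = pd.Series(["mumbai", "new delhi", "sector noida", "mumbai", "atlantis"])

# Classify each distinct location once, then map the result back onto every row
lookup = {loc: classify_address(loc, states_map) for loc in locations.unique()}
states = locations.map(lookup)
print(states.tolist())
```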


In [None]:
df['State/Union'].unique()

In [None]:
unknown_cities_mask = df['State/Union'] == 'Unknown'
unknown_cities_df = df.loc[unknown_cities_mask]

In [None]:
unknown_cities_index = df.loc[unknown_cities_mask].index
df.drop(index=unknown_cities_index, inplace=True)

In [None]:
df

## **Univariate Analysis**

In [None]:
# assigning the numerical columns and categorical columns
num_col = df.select_dtypes(include=np.number)
cat_col = df.select_dtypes(exclude=np.number)

In [None]:
# plotting boxplot for numerical columns
plt.style.use("fivethirtyeight")
plt.figure(figsize=(30,30))
for index,column in enumerate(num_col):
    plt.subplot(7,4,index+1)
    sns.boxplot(data=num_col,x=column)

plt.tight_layout(pad = 1.0)

**Observations**:
- CustAccountBalance and TransactionAmount (INR) show outliers, but we do not treat them as outliers here, since there is no upper limit on account balance or transaction amount.
- CustomerAge has outliers on the left side, which correspond to very young customers (children).

In [None]:
# plotting countplot for CustGender
sns.countplot(data = df, x ='CustGender')

- We already know that we have majority of male customers when compared to female customers.

## **Multivariate Analysis**

In [None]:
# plotting boxplot for CustGender and CustomerAge
sns.boxplot(data=df, x='CustGender', y='CustomerAge') ;

- Most male and female customers are between 22 and 32 years old.
- Both female and male customers include children's accounts, which appear as outliers at the bottom.

In [None]:
# plotting boxplot for CustGender and CustAccountBalance
sns.boxplot(data=df, x='CustGender', y='CustAccountBalance') ;

- Both female and male customers show outliers in account balance, which indicates that many customers hold a high account balance.

In [None]:
df['TransactionDate'].dt.month.value_counts()

**Creating a new TransactionMonth column, as we have one year of transaction data**

In [None]:
# extracting the month from Transaction date column
df['TransactionMonth'] = df['TransactionDate'].dt.month

In [None]:
df.head()

In [None]:
df.columns

In [None]:
# plotting line plot for TransactionMonth and TransactionAmount
sns.lineplot(data = df , x = 'TransactionMonth' , y = 'TransactionAmount (INR)');

- This lineplot shows that transaction amounts were highest in April, followed by June.

In [None]:
# Plotting lineplot for TransactionMonth and TransactionAmount, split by gender
sns.lineplot(data = df, x='TransactionMonth', y='TransactionAmount (INR)', ci=False, hue='CustGender');

- The average transaction amount of female customers is higher than that of male customers.

In [None]:
# plotting lineplot for TransactionMonth and CustAccountBalance
sns.lineplot(data = df , x = 'TransactionMonth' , y = 'CustAccountBalance');

- Customer account balances are highest in August, followed by November.

In [None]:
# plotting lineplot for TransactionMonth and CustAccountBalance, split by gender
sns.lineplot(data = df, x='TransactionMonth', y='CustAccountBalance', ci=False, hue='CustGender');

- Female customers tend to have higher account balances than male customers.

In [None]:
# Plotting scatterplot for CustomerAge and TransactionAmount
sns.scatterplot(data=df, x='CustomerAge', y='TransactionAmount (INR)');

- A positive relationship is visible: transaction amount tends to increase as age increases.

In [None]:
# plotting scatterplot for CustomerAge and CustAccountBalance
sns.scatterplot(data=df, x='CustomerAge', y='CustAccountBalance');

- As in the previous plot, we can see a positive relationship: account balance tends to increase as the customer's age increases.

In [None]:
# plt.figure(figsize=(40,7))
# sns.pairplot(data = df ,hue='CustGender');

In [None]:
df.info()

In [None]:
sns.heatmap(data=df[['CustomerAge','TransactionMonth','TransactionAmount (INR)','CustAccountBalance']].corr(), annot=True);

- No strong correlation is seen between the variables.
- Customer age has some correlation with account balance and transaction amount.

## **Insights from EDA**
- There are 1048567 observations and 9 columns.
- Four columns have missing values: CustomerDOB, CustGender, CustLocation, and CustAccountBalance.
- Rows with missing values are dropped, as they are few relative to the size of the dataset.
- Customer age is calculated from the transaction date and the customer's date of birth.
- The majority of customers are male: 72% male versus 27% female.
- Most male and female customers are between 22 and 32 years old.
- Transaction amounts were highest in April, followed by June.
- The transaction amount of female customers is higher than that of male customers.
- Account balances are highest in August, followed by November.
- Female customers tend to have higher account balances than male customers.
- No strong correlation is seen between the variables.

## **Data Preparation**


- We can delete unnecessary columns:
  - TransactionID is unique for every transaction, so we can delete the column.
  - All transactions are from 2016 and we have already extracted the transaction month, so we can delete the TransactionDate column.
  - We calculated the customer's age from the date of birth and the transaction date, so we can delete the CustomerDOB column.
  - The TransactionTime column is not needed, so it is deleted as well.

- Few values are missing relative to the 1 million observations, so we dropped the rows with missing values.
- Outliers were found in age, account balance, and transaction amount. We are not going to treat the outliers in this data.

In [None]:
df.columns

In [None]:
del_col = ['CustomerID', 'TransactionID', 'CustomerDOB', 'TransactionDate', 'TransactionTime', 'CustLocation']
df.drop(columns=del_col, inplace=True)

In [None]:
df.columns

In [None]:
df.head()

In [None]:
# reset_index returns a new DataFrame, so assign the result back
df = df.reset_index(drop=True)

In [None]:
# df.drop(columns='CustomerID', inplace=True)


In [None]:
df.to_excel('Clean_bank_transaction.xlsx', index=False)

## **Data Modelling**


In [None]:
import pandas as pd

In [None]:
df = pd.read_excel('/content/bank_customer_segmentation/Clean_bank_transaction.xlsx')

In [None]:
df.head()

In [None]:
df.describe()

In [None]:
# df.drop('CustLocation', inplace=True, axis=1)

In [None]:
check(df)

In [None]:
from sklearn.preprocessing import LabelEncoder

# Create an instance of the LabelEncoder
label_encoder = LabelEncoder()

# Fit the encoder to the CustGender column in the DataFrame
label_encoder.fit(df['CustGender'])

# Transform the CustGender column using the fitted encoder
df['CustGender'] = label_encoder.transform(df['CustGender'])

# Fit the encoder to the segment column in the DataFrame
label_encoder.fit(df['segment'])

# Transform the segment column using the fitted encoder
df['segment'] = label_encoder.transform(df['segment'])
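
Because the same encoder object is refitted for each column, the integer codes are easy to lose track of. The sketch below shows one way to record the label-to-code mapping right after fitting. The sample values are hypothetical; `classes_` is the attribute where `LabelEncoder` stores the sorted labels.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of gender values
genders = ["M", "F", "M", "F"]

encoder = LabelEncoder().fit(genders)

# classes_ holds the sorted labels; their positions are the integer codes
mapping = {
    str(label): int(code)
    for label, code in zip(encoder.classes_, encoder.transform(encoder.classes_))
}
print(mapping)
```

Keeping such a dictionary per column makes it possible to translate cluster summaries back into readable labels later.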

In [None]:
df.head()

In [None]:
# Perform one-hot encoding for the 'State/Union' column
ohe_df = pd.get_dummies(df['State/Union'], prefix='State')

# Add the one-hot encoded columns to the original DataFrame
df = pd.concat([df, ohe_df], axis=1)

In [None]:
df.drop('State/Union', axis=1, inplace=True)

In [None]:
df

In [None]:
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score

scaler = StandardScaler()
df_std = scaler.fit_transform(df)

In [None]:
type(df_std)

In [None]:
import tensorflow as tf
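# Setting up TPU strategy (requires a TPU runtime). Note that the scikit-learn and SciPy
# computations in the cells below still run on the CPU.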
tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
tf.config.experimental_connect_to_cluster(tpu)
tf.tpu.experimental.initialize_tpu_system(tpu)
strategy = tf.distribute.TPUStrategy(tpu)

In [None]:
import time
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

# Define the range of cluster numbers to test
n_clusters = range(2, 8)

# Initialize a list to store the Calinski-Harabasz scores for each cluster number
scores = []

# Loop through each cluster number and compute the corresponding Calinski-Harabasz score
for n in n_clusters:
    with strategy.scope():
        print(f'Running k-means with {n} clusters...')
        start_time = time.time()
        kmeans = KMeans(n_clusters=n, random_state=42)
        kmeans.fit(df)
        score = calinski_harabasz_score(df, kmeans.labels_)
        scores.append(score)
        end_time = time.time()
        print(f'Time spent: {end_time - start_time} seconds')
        print()

# Choose the optimal number of clusters based on the Calinski-Harabasz score
optimal_n_clusters = n_clusters[np.argmax(scores)]
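
K-means relies on Euclidean distance, so features on large scales (such as account balance) can dominate the clustering when the data is not standardized. The loop above fits on `df`, while a standardized copy (`df_std`) was computed earlier. As a hedged illustration of the same selection procedure on standardized input, the sketch below uses synthetic blobs; the data, the blob centers, and the names `X`, `X_std`, and `best_k` are assumptions for demonstration only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Three well-separated synthetic blobs (hypothetical data)
centers = np.array([[0, 0], [5, 5], [10, 0]])
X = np.vstack([rng.normal(c, 0.5, size=(100, 2)) for c in centers])
X[:, 1] *= 1000  # put the second feature on a much larger scale

# Standardize so both features contribute comparably to the distances
X_std = StandardScaler().fit_transform(X)

# Score each candidate k with the Calinski-Harabasz index (higher is better)
scores = {}
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_std)
    scores[k] = calinski_harabasz_score(X_std, km.labels_)

best_k = max(scores, key=scores.get)
print(best_k, scores)
```

If this approach is adopted for the real data, the same standardized matrix would need to be used for the final fit and for `predict`, so that the chosen `k` and the final labels come from the same feature space.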


In [None]:
print(scores)

In [None]:
# Plot the Calinski-Harabasz score as a function of the number of clusters
plt.plot(n_clusters, scores, 'bx-')
plt.xlabel('Number of Clusters')
plt.ylabel('Calinski-Harabasz Score')
plt.title('Elbow Plot')
plt.show()

In [None]:
print(f'Optimal number of clusters: {optimal_n_clusters}')


In [None]:
# Fit the data to the optimal number of clusters
with strategy.scope():
    kmeans = KMeans(n_clusters=optimal_n_clusters, random_state=42)
    kmeans.fit(df)

# Assign each data point to a cluster
labels = kmeans.predict(df)

In [None]:
print(labels)

In [None]:
fig, ax = plt.subplots(figsize=(8, 6))
scatter = ax.scatter(df.values[:, 0], df.values[:, 1], c=labels, cmap='viridis')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title(f'Clusters with k={optimal_n_clusters}')
legend1 = ax.legend(*scatter.legend_elements(),
                    loc="upper right", title="Clusters")
ax.add_artist(legend1)
plt.show()
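
The scatter plot above uses only the first two columns of the feature matrix, which may hide most of the cluster structure. One common alternative, shown here as a sketch rather than the method used in this notebook, is to project the standardized features onto two principal components and plot those instead. The random matrix `X` is a hypothetical stand-in for the encoded data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical stand-in for the encoded feature matrix (200 rows, 6 features)
X = rng.normal(size=(200, 6))

# Standardize, then keep the two directions with the largest variance
pca = PCA(n_components=2)
X_2d = pca.fit_transform(StandardScaler().fit_transform(X))

print(X_2d.shape, pca.explained_variance_ratio_)
```

The two columns of `X_2d` can then be passed to `ax.scatter` in place of `df.values[:, 0]` and `df.values[:, 1]`, with `c=labels` unchanged.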

In [None]:
# Show the data in the different clusters
# for i in range(optimal_n_clusters):
#     print(f'Cluster {i+1} has {len(df[labels == i])} data points:')
#     print(df[labels == i])
#     print()

In [None]:
def plot_col_cluster(column, df=df):
  # For each column, plot its values with one line per cluster found in `labels`
  for col in column:
    plt.figure(figsize=(20,10))
    for cluster in np.unique(labels):
      plt.plot(df[col][labels==cluster], label=f'Cluster {cluster + 1}')
    plt.title(col)
    plt.legend()
    plt.show()

In [None]:
columns = df.columns.tolist()

In [None]:
# plot_col_cluster(columns)

## **Hierarchical Clustering**

In [None]:
# import pandas as pd
# import numpy as np
# import matplotlib.pyplot as plt
# from scipy.cluster.hierarchy import dendrogram, linkage

# # Create a DataFrame with some data
# # df = pd.DataFrame({
# #     'A': [1, 2, 3, 5, 7, 8, 9, 10],
# #     'B': [10, 20, 30, 40, 50, 60, 70, 80]
# # })

# # Compute the linkage matrix
# Z = linkage(df, 'ward')

# # Plot the dendrogram
# plt.figure(figsize=(10, 5))
# dendrogram(Z, leaf_rotation=90., leaf_font_size=8.)
# plt.xlabel('Samples')
# plt.ylabel('Distance')
# plt.show()


In [None]:
with strategy.scope():
  # Normalize the data
  df_norm = (df - df.mean()) / df.std()

  # Compute the linkage matrix
  Z = linkage(df_norm[:5000], 'ward')

  # Plot the dendrogram
  plt.figure(figsize=(10, 5))
  dendrogram(Z, leaf_rotation=90., leaf_font_size=8.)
  plt.xlabel('Samples')
  plt.ylabel('Distance')
  plt.show()
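
The imports at the top of the notebook include `cophenet` and `pdist`, which can be used to compare linkage methods before committing to Ward. The sketch below computes the cophenetic correlation for several methods on random data; a value closer to 1 means the dendrogram preserves the original pairwise distances more faithfully. The data and the list of methods are assumptions for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)

# Hypothetical stand-in for a sample of the normalized data
X = rng.normal(size=(50, 3))

# Condensed matrix of pairwise Euclidean distances
dists = pdist(X)

# Cophenetic correlation for each linkage method
coph = {}
for method in ["single", "complete", "average", "ward"]:
    Z = linkage(X, method=method)
    c, _ = cophenet(Z, dists)
    coph[method] = c

print(coph)
```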

In [None]:
# Extract cluster labels
from scipy.cluster.hierarchy import fcluster
max_dist = 10
cluster_labels = fcluster(Z, max_dist, criterion='distance')

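# Convert cluster labels to a DataFrame aligned with the 5,000 rows used for the linkage
sample = df_norm[:5000]
cluster_labels = pd.DataFrame(cluster_labels, columns=['ClusterLabel'], index=sample.index)

# Concatenate cluster labels with the sampled, normalized data
df_clustered = pd.concat([sample, cluster_labels], axis=1)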
with strategy.scope():
  # Visualize the distribution of data points within each cluster
  sns.set_style('whitegrid')
  g = sns.PairGrid(df_clustered[:5000], hue='ClusterLabel', palette='Set2')
  g.map_diag(plt.hist)
  g.map_offdiag(plt.scatter, alpha=0.7)
  g.add_legend()
  plt.show()
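
Choosing a distance cut-off such as `max_dist = 10` requires reading the dendrogram. As an alternative sketch, `fcluster` also accepts `criterion='maxclust'`, which asks for at most a given number of flat clusters. The synthetic blobs and the choice of three clusters below are assumptions for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(7)

# Three well-separated synthetic blobs of 30 points each (hypothetical data)
centers = np.array([[0, 0], [10, 10], [20, 0]])
X = np.vstack([rng.normal(c, 0.5, size=(30, 2)) for c in centers])

# Ward linkage, then cut the tree into at most 3 flat clusters
Z = linkage(X, method="ward")
labels = fcluster(Z, t=3, criterion="maxclust")

# fcluster labels start at 1, so drop the unused 0 bin
sizes = np.bincount(labels)[1:]
print(sizes)
```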