<a href="https://colab.research.google.com/github/seremmartin64-ops/ML/blob/main/Clustering_Customer_Segmentation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Unsupervised Machine Learning
# Definition: It works with data that has no labelled responses and is mainly used to find patterns, structures and relationships within the data
# ETL (Extract, Transform, Load)
# Extract: SQL, Web Scraping, Kaggle, JSON...
# Transform:

# Types of Unsupervised ML
# a) Clustering: Grouping data points by their similar features - KMeans (groups by average values, so it needs numerical data)
# b) Association: Finding data points that are linked to one another (Apriori Algorithm)


# CLUSTERING
# Customer Segmentation: Grouping customers based on their similar purchasing behaviour.

In [None]:
# STEP1: Read the Mall Customers Data

import kagglehub
import pandas as pd
import os

path = kagglehub.dataset_download("vjchoudhary7/customer-segmentation-tutorial-in-python")
print(path)
print(os.listdir(path))


file_path = os.path.join(path, "Mall_Customers.csv")
data = pd.read_csv(file_path)
data.head(10)



Using Colab cache for faster access to the 'customer-segmentation-tutorial-in-python' dataset.
/kaggle/input/customer-segmentation-tutorial-in-python
['Mall_Customers.csv']


Unnamed: 0,CustomerID,Gender,Age,Annual Income (k$),Spending Score (1-100)
0,1,Male,19,15,39
1,2,Male,21,15,81
2,3,Female,20,16,6
3,4,Female,23,16,77
4,5,Female,31,17,40
5,6,Female,22,17,76
6,7,Female,35,18,6
7,8,Female,23,18,94
8,9,Male,64,19,3
9,10,Female,30,19,72


In [None]:
# STEP2: Data Transformation
# Check for empty/missing records and handle them. There are two options:
# a) Remove the empty records
# b) Replace the empty records

data.isnull().sum() # Detect
data.dropna(inplace=True) # Remove
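
Option (b), replacing the empty records, is listed in the comments but not shown in the cell. Below is a minimal sketch of both options on a small made-up table. The `toy` DataFrame and its values are hypothetical, not the Mall Customers data, and filling with the column mean is only one possible replacement strategy.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing values (not the real dataset)
toy = pd.DataFrame({
    "Age": [19, 21, np.nan, 23],
    "Annual Income (k$)": [15, np.nan, 16, 16],
})

# Detect: number of missing values per column
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# a) Remove: drop every row that contains a missing value
removed = toy.dropna()

# b) Replace: fill each missing value with the mean of its column
replaced = toy.fillna(toy.mean())

print(len(removed), len(replaced))
```

Removing rows is simpler but shrinks the dataset; replacing keeps every row at the cost of introducing estimated values.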

In [None]:
numbers = [1, 2, 3, 4, 5]
print(numbers[2:5])

[3, 4, 5]


In [None]:
# STEP3: Unsupervised Learning
# Create a variable X that stores the features ONLY; there is no target Y
# Features: Age, Annual Income, Spending Score (columns 2 to 4 of the array)

array = data.values
array.shape

X = array[:, 2:5]
X.shape

(200, 3)
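
Slicing by position depends on the column order staying the same. Here is a sketch of the same selection by column name, assuming the column names shown in the `data.head(10)` output. The two-row `sample` DataFrame is only a stand-in for `data`.

```python
import pandas as pd

# Stand-in for `data`: the first two rows shown in the head() output
sample = pd.DataFrame({
    "CustomerID": [1, 2],
    "Gender": ["Male", "Male"],
    "Age": [19, 21],
    "Annual Income (k$)": [15, 15],
    "Spending Score (1-100)": [39, 81],
})

feature_columns = ["Age", "Annual Income (k$)", "Spending Score (1-100)"]

# Select the features by name and convert them to a float NumPy array
X_named = sample[feature_columns].to_numpy(dtype=float)

# The positional slice used in the notebook picks the same columns
X_sliced = sample.values[:, 2:5].astype(float)

print(X_named.shape)
```

Selecting by name also gives a purely numeric array, whereas `data.values` on a table with a text column such as `Gender` produces an array of generic Python objects.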

### The Elbow Method for Choosing Optimal Clusters

The **Elbow Method** is a heuristic used to choose the number of clusters for the **K-Means clustering algorithm**.  
It involves plotting the **inertia** as a function of the number of clusters and selecting the point where the rate of decrease changes sharply, known as the **"elbow"** point.

Inertia is the **Within-Cluster Sum of Squares (WCSS)**: the **sum of squared distances** between the data points and their assigned cluster centers. It measures the variation that the clusters leave unexplained, so lower values mean tighter clusters.

#### **Intuition Behind the Elbow Method**
- As the number of clusters increases, the **WCSS decreases**, since data points are assigned to clusters that better fit them.
- However, after a certain number of clusters, the **reduction in WCSS becomes marginal**.
- The point at which this reduction starts to level off forms an **elbow shape** in the plot.
- This **"elbow" point** is taken as a reasonable choice for the **number of clusters**, balancing model simplicity against how well the clusters fit the data.
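
To make the WCSS definition concrete, here is a small sketch that computes it by hand with NumPy. The points, labels, and centers are made-up values for illustration. This is the quantity that the fitted scikit-learn model exposes as `inertia_`.

```python
import numpy as np

# Hypothetical 2-D points, already assigned to two clusters
points = np.array([[1.0, 1.0], [1.0, 3.0], [8.0, 8.0], [10.0, 8.0]])
labels = np.array([0, 0, 1, 1])

# Each center is the mean of the points assigned to it
centers = np.array([points[labels == c].mean(axis=0) for c in range(2)])

# WCSS: squared distance from every point to its own cluster center, summed
wcss = ((points - centers[labels]) ** 2).sum()
print(wcss)
```

Each of the four points lies exactly one unit from its center here, so the sum of squared distances is 4.0.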


In [None]:
# The optimal number of clusters is where the inertia stops dropping sharply, normally at the bend (elbow) of the curve.
# OOP (encapsulation, inheritance, abstraction, polymorphism)

from sklearn.cluster import KMeans
inertias = []
for k in range(2, 15):
  model = KMeans(n_clusters=k, random_state=42)
  model.fit(X)
  inertias.append(model.inertia_)

print(inertias)

[221087.1962719298, 158744.97108013942, 104366.151455562, 97211.84353980474, 68275.94428646985, 51448.36126259325, 44640.028048530425, 42081.855308685335, 38378.73890793209, 36521.06627366099, 35243.34881334352, 32308.587172476648, 29711.159791524264]
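
One way to read the elbow from the numbers rather than from a plot is to look at how much each extra cluster reduces the inertia. The sketch below uses the values printed above, rounded to whole numbers. Deciding when the drop is small enough is still a judgment call.

```python
# Inertias printed above for k = 2..14, rounded to whole numbers
inertias_rounded = [221087, 158745, 104366, 97212, 68276, 51448, 44640,
                    42082, 38379, 36521, 35243, 32309, 29711]
k_values = list(range(2, 15))

# Relative drop in inertia when moving from k-1 clusters to k clusters
drops = [(prev - curr) / prev
         for prev, curr in zip(inertias_rounded, inertias_rounded[1:])]

for k, drop in zip(k_values[1:], drops):
    print(f"k={k}: inertia falls by {drop:.1%}")
```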


In [None]:
# # PLOT THE INERTIAS AS AN ELBOW
# import matplotlib.pyplot as plt
# plt.plot(range(2, 15), inertias, marker='o')
# plt.title('Elbow Method')
# plt.xlabel('Number of Clusters')
# plt.ylabel('Inertia')
# plt.show()


import plotly.express as px
import pandas as pd

# Assuming you already have a list of inertias
# and corresponding k values
k_values = list(range(2, 15))
df = pd.DataFrame({'No. of Clusters (k)': k_values, 'Inertia': inertias})

# Create an interactive line plot
fig = px.line(
    df,
    x='No. of Clusters (k)',
    y='Inertia',
    title='Elbow Method (Interactive)',
    markers=True,
)

# Customize appearance
fig.update_traces(marker=dict(size=6))
fig.update_layout(
    xaxis_title='Number of Clusters (k)',
    yaxis_title='Inertia',
    template='plotly_white'
)

# Show interactive plot
fig.show()


In [None]:
# STEP4: Clustering using KMeans
# Import the clustering algorithm and fit it to the X data
# Here we use the KMeans algorithm: it assigns each data point to the cluster with the nearest mean (K is the number of clusters)
# The KMeans algorithm only works with numerical data because the clusters are based on the mean
# AGE
# Gen X: 45
# Gen Y: 30
# Gen Z: 20

# INCOME:
# Low Income : 10000
# Middle Income : 60000...

# Elbow Method: A mechanism for determining the optimal number of clusters


from sklearn.cluster import KMeans
model = KMeans(n_clusters=8, random_state=42)
model.fit(X)
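
The idea of clustering by the nearest mean can be sketched in a few lines of NumPy. This is a simplified illustration of the assign-then-update loop, not scikit-learn's implementation. The function name `simple_kmeans`, the toy points, and the choice to start from the first `k` points are all assumptions made for the example.

```python
import numpy as np

def simple_kmeans(points, k, n_iter=10):
    """Toy K-Means: start from the first k points, then alternate two steps."""
    centers = points[:k].copy()
    for _ in range(n_iter):
        # Assignment step: each point joins the cluster with the nearest center
        distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: each center moves to the mean of its members
        for c in range(k):
            members = points[labels == c]
            if len(members) > 0:  # keep the old center if a cluster is empty
                centers[c] = members.mean(axis=0)
    return labels, centers

# Hypothetical toy data: two obvious groups (age, income in k$)
toy_points = np.array([[20.0, 15.0], [45.0, 80.0], [22.0, 17.0], [47.0, 78.0]])
toy_labels, toy_centers = simple_kmeans(toy_points, k=2)
print(toy_labels)
print(toy_centers)
```

Because every step takes a mean, the inputs have to be numbers, which is why text columns such as `Gender` are left out of `X`.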

In [None]:
# STEP5: Generate the Cluster Means (Centroids)
# In K-Means Clustering, a cluster center (also called a centroid) is the mean position of all the data points that belong to a particular cluster.
centroids = model.cluster_centers_
centroids

array([[ 56.34090909,  53.70454545,  49.38636364],
       [ 33.        , 114.71428571,  78.42857143],
       [ 32.625     ,  80.375     ,  82.9375    ],
       [ 41.96      ,  79.64      ,  15.4       ],
       [ 25.27272727,  25.72727273,  79.36363636],
       [ 44.31818182,  25.77272727,  20.27272727],
       [ 27.        ,  56.65789474,  49.13157895],
       [ 41.        , 109.7       ,  22.        ]])

In [None]:
# STEP6: Store the Centroids in a DataFrame
# A DataFrame represents data as a table (rows and columns), like an Excel sheet or a relational database table (SQL)
# The columns are labelled

centroid_dataframe = pd.DataFrame(centroids, columns=["Customer Age", "Annual Income", "Spending Score"])
centroid_dataframe

Unnamed: 0,Customer Age,Annual Income,Spending Score
0,56.340909,53.704545,49.386364
1,33.0,114.714286,78.428571
2,32.625,80.375,82.9375
3,41.96,79.64,15.4
4,25.272727,25.727273,79.363636
5,44.318182,25.772727,20.272727
6,27.0,56.657895,49.131579
7,41.0,109.7,22.0


In [None]:
# STEP7: Assign Members to their Corresponding Clusters
# Start by generating the cluster Labels
# Create a New Column(Cluster) to store the cluster labels

data["Cluster"] = model.labels_
data

Unnamed: 0,CustomerID,Gender,Age,Annual Income (k$),Spending Score (1-100),Cluster
0,1,Male,19,15,39,5
1,2,Male,21,15,81,4
2,3,Female,20,16,6,5
3,4,Female,23,16,77,4
4,5,Female,31,17,40,5
...,...,...,...,...,...,...
195,196,Female,35,120,79,1
196,197,Female,45,126,28,7
197,198,Male,32,126,74,1
198,199,Male,32,137,18,7
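
Once each row carries a cluster label, a `groupby` can summarise the segments. The sketch below uses six rows copied from the output above as a stand-in for `data`. On the full dataset, per-cluster means computed this way should closely match the centroids from STEP5, since a centroid is the mean of its members.

```python
import pandas as pd

# Stand-in for `data`: six rows copied from the output above
labelled = pd.DataFrame({
    "Age": [19, 21, 20, 23, 35, 32],
    "Annual Income (k$)": [15, 15, 16, 16, 120, 126],
    "Spending Score (1-100)": [39, 81, 6, 77, 79, 74],
    "Cluster": [5, 4, 5, 4, 1, 1],
})

# How many customers fall in each cluster
cluster_sizes = labelled["Cluster"].value_counts().sort_index()
print(cluster_sizes)

# Mean of every feature per cluster
cluster_means = labelled.groupby("Cluster").mean()
print(cluster_means)
```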


In [None]:
# Generating Members of the Same Cluster
# Cluster 3 Members

condition = data["Cluster"] == 3
cluster_3 = data[condition]
cluster_3

Unnamed: 0,CustomerID,Gender,Age,Annual Income (k$),Spending Score (1-100),Cluster
126,127,Male,43,71,35,3
128,129,Male,59,71,11,3
130,131,Male,47,71,9,3
134,135,Male,20,73,5,3
136,137,Female,44,73,7,3
138,139,Male,19,74,10,3
140,141,Female,57,75,5,3
144,145,Male,25,77,12,3
146,147,Male,48,77,36,3
148,149,Female,34,78,22,3


In [None]:
# Cluster 4 Members
condition = data["Cluster"] == 4
cluster_4 = data[condition]
cluster_4

Unnamed: 0,CustomerID,Gender,Age,Annual Income (k$),Spending Score (1-100),Cluster
1,2,Male,21,15,81,4
3,4,Female,23,16,77,4
5,6,Female,22,17,76,4
7,8,Female,23,18,94,4
9,10,Female,30,19,72,4
11,12,Female,35,19,99,4
13,14,Female,24,20,77,4
15,16,Male,22,20,79,4
17,18,Male,20,21,66,4
19,20,Female,35,23,98,4
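
The two cells above repeat the same filter with a different cluster number. A small helper avoids the repetition. The function name `members_of` and the stand-in rows are assumptions for this sketch.

```python
import pandas as pd

def members_of(frame, cluster_id):
    """Return the rows of `frame` whose Cluster column equals `cluster_id`."""
    return frame[frame["Cluster"] == cluster_id]

# Stand-in for `data`: four rows copied from the outputs above
labelled = pd.DataFrame({
    "CustomerID": [2, 4, 127, 129],
    "Age": [21, 23, 43, 59],
    "Cluster": [4, 4, 3, 3],
})

# Print the members of every cluster that appears in the data
for cluster_id in sorted(labelled["Cluster"].unique()):
    print(f"Cluster {cluster_id}")
    print(members_of(labelled, cluster_id))
```

With the real `data` DataFrame, `members_of(data, 3)` would return the same rows as the `cluster_3` cell above.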
