# Spending Pattern Analysis with K-Means (Clustering)

**Objective:**
Segment customers by spending behavior using **K-Means** on `Income_$` and `SpendingScore`. Compare several values of **k** with the Elbow check, fit a final model, and assess it with the Silhouette Score and Davies-Bouldin Index (DBI).

In [1]:
# --------------------------------
# 0) Imports
# --------------------------------
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score, davies_bouldin_score

RANDOM_STATE = 42  # For reproducibility

## 1. Load Dataset

We use the `spending_l9_dataset.csv` dataset.

In [2]:
df = pd.read_csv("spending_l9_dataset.csv")
print(df.head())

   CustomerID  Age  Income_$  SpendingScore  VisitsPerMonth  OnlinePurchases  \
0           1   28        33             78              14                9   
1           2   21        25             87               8               23   
2           3   23        24             88              13               10   
3           4   24        25             73              16               11   
4           5   20        23             88              17               16   

   Gender Region  
0  Female   East  
1    Male  North  
2    Male  South  
3  Female   West  
4    Male   West  


## 2. Prepare Features

We select `Income_$` and `SpendingScore`, fill any missing values with the column median, and apply `StandardScaler` so that both features contribute on the same scale to the Euclidean distances K-Means relies on.

In [3]:
FEATURES = ["Income_$", "SpendingScore"]
X = df[FEATURES].copy()

# Fill missing numeric values with median (if any)
for col in FEATURES:
    if X[col].isna().any():
        X[col] = X[col].fillna(X[col].median())

# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print("Scaled shape:", X_scaled.shape)

Scaled shape: (200, 2)
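
As a rough illustration of what the scaling step does, the sketch below standardizes a small made-up array by hand with NumPy. The toy values and the names `X_toy`, `X_toy_scaled`, and `total_ss` are assumptions for illustration, not rows or variables from the dataset. `StandardScaler` divides by the population standard deviation (`ddof=0`), which is also the default of `np.std`.

```python
import numpy as np

# Hypothetical toy data: the two columns stand in for Income_$ and SpendingScore.
X_toy = np.array([[20.0, 90.0],
                  [30.0, 80.0],
                  [60.0, 50.0],
                  [100.0, 20.0]])

# z-score: subtract each column's mean, divide by its population std (ddof=0).
mean = X_toy.mean(axis=0)
std = X_toy.std(axis=0)
X_toy_scaled = (X_toy - mean) / std

print(X_toy_scaled.mean(axis=0).round(6))  # column means after scaling
print(X_toy_scaled.std(axis=0).round(6))   # column spreads after scaling

# Total squared distance to the origin after scaling equals n_rows * n_cols.
total_ss = float((X_toy_scaled ** 2).sum())
print(total_ss)
```

The last identity is a handy check on the real data: with 200 rows and 2 standardized features, a single cluster centered at the origin has an SSE of 200 × 2 = 400.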


## 3. Elbow Check (SSE)

We compute the Sum of Squared Errors (SSE, exposed by scikit-learn as `inertia_`) for $k$ from 1 to 10 and look for the "elbow": the value of $k$ after which adding clusters yields only small reductions in SSE.

In [4]:
print("=== ELBOW METHOD (SSE per k) ===")
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init="auto", random_state=RANDOM_STATE)
    km.fit(X_scaled)
    print(f"k={k} → SSE={km.inertia_:.2f}")

=== ELBOW METHOD (SSE per k) ===
k=1 → SSE=400.00
k=2 → SSE=199.70
k=3 → SSE=79.37
k=4 → SSE=21.37
k=5 → SSE=19.09
k=6 → SSE=15.65
k=7 → SSE=14.48
k=8 → SSE=13.81
k=9 → SSE=12.94
k=10 → SSE=11.52
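
Reading the elbow off a list of numbers is easier with the relative drop between consecutive values of k. The sketch below recomputes those drops from the SSE values printed above. The 30% threshold and the names `drops` and `elbow_k` are assumptions chosen for illustration, not a standard rule.

```python
# SSE values copied from the elbow output above (k = 1..10).
sse = [400.00, 199.70, 79.37, 21.37, 19.09, 15.65, 14.48, 13.81, 12.94, 11.52]

# Relative drop in SSE when going from k-1 to k clusters.
drops = {}
for k in range(2, len(sse) + 1):
    prev, curr = sse[k - 2], sse[k - 1]
    drops[k] = (prev - curr) / prev
    print(f"k={k - 1} -> {k}: SSE drops by {drops[k]:.1%}")

# Heuristic (an assumption): the elbow is the last k whose relative drop
# still exceeds the chosen threshold.
THRESHOLD = 0.30
elbow_k = max(k for k, d in drops.items() if d > THRESHOLD)
print("Elbow candidate:", elbow_k)
```

A different threshold can move the candidate, so the result is best treated as a hint to confirm against the Silhouette Score and DBI.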


## 4. Model Training (Pick K)

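The SSE falls steeply through k=4 (from 79.37 to 21.37) and only marginally afterwards (from 21.37 to 19.09 at k=5), so the elbow sits at **k=4**. The final model below is nevertheless fit with **K=5**, one cluster past the elbow, which gives a slightly finer segmentation than the SSE curve alone calls for.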

In [5]:
K = 5
kmeans = KMeans(n_clusters=K, n_init="auto", random_state=RANDOM_STATE)
labels = kmeans.fit_predict(X_scaled)

# Add the predicted cluster back to the dataframe
df["Cluster"] = labels.astype(int)

## 5. Evaluate Clustering

We compute the **Silhouette Score** (ranges from -1 to 1; values closer to 1 indicate compact, well-separated clusters) and the **Davies–Bouldin Index** (non-negative; lower is better).

In [6]:
sil = silhouette_score(X_scaled, labels)
dbi = davies_bouldin_score(X_scaled, labels)
print("=== METRICS ===")
print(f"Silhouette Score : {sil:.3f}")
print(f"Davies–Bouldin   : {dbi:.3f}")

=== METRICS ===
Silhouette Score : 0.642
Davies–Bouldin   : 0.571
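
The two metrics can also be compared across several values of k rather than computed for a single model. The sketch below does this on synthetic data, since the notebook's dataset is not reproduced here. The blob centers, the range of k, and the names `X_demo`, `scores`, and `best_k` are all assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption): 200 points around 4 known centers.
demo_centers = [[-5, -5], [-5, 5], [5, -5], [5, 5]]
X_demo, _ = make_blobs(n_samples=200, centers=demo_centers,
                       cluster_std=0.6, random_state=42)
X_demo_scaled = StandardScaler().fit_transform(X_demo)

scores = {}
for k in range(2, 8):  # both metrics require at least 2 clusters
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels_k = km.fit_predict(X_demo_scaled)
    scores[k] = (silhouette_score(X_demo_scaled, labels_k),
                 davies_bouldin_score(X_demo_scaled, labels_k))
    print(f"k={k}  silhouette={scores[k][0]:.3f}  DBI={scores[k][1]:.3f}")

# Pick the k with the highest silhouette score.
best_k = max(scores, key=lambda k: scores[k][0])
print("Highest silhouette at k =", best_k)
```

Running the same loop on `X_scaled` would show whether the Silhouette Score and DBI favor the same k as the elbow check.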


## 6. Cluster Centers (Original Units)

Since our model was trained on scaled data, we inverse-transform the cluster centers to see their actual values in `Income_$` and `SpendingScore`.

In [7]:
centers_scaled = kmeans.cluster_centers_
centers_original = scaler.inverse_transform(centers_scaled)

centers_df = pd.DataFrame(centers_original, columns=FEATURES)
centers_df.index.name = "Cluster"

print("=== CLUSTER CENTERS (Original Units) ===")
print(centers_df.round(2))

=== CLUSTER CENTERS (Original Units) ===
         Income_$  SpendingScore
Cluster                         
0           56.32          53.58
1           28.92          19.60
2           25.33          78.04
3           99.16          79.24
4           22.74          89.04
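
For intuition, the inverse transform simply undoes the z-score: multiply by each feature's standard deviation and add its mean back. A minimal sketch with made-up numbers follows. In the notebook, the fitted `scaler.scale_` and `scaler.mean_` attributes hold the corresponding values, while `X_toy` and `center_scaled` here are hypothetical.

```python
import numpy as np

# Hypothetical toy data standing in for the two features.
X_toy = np.array([[20.0, 90.0],
                  [30.0, 80.0],
                  [60.0, 50.0],
                  [100.0, 20.0]])
mean = X_toy.mean(axis=0)
scale = X_toy.std(axis=0)  # population std, matching StandardScaler

# A made-up cluster center expressed in scaled (z-score) units.
center_scaled = np.array([1.0, -0.5])

# Undo the scaling: multiply by the std, then add the mean back.
center_original = center_scaled * scale + mean
print(center_original)

# Round trip: re-scaling the recovered center gives back the scaled coordinates.
round_trip_ok = np.allclose((center_original - mean) / scale, center_scaled)
print(round_trip_ok)
```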


## 7. Sanity Check

We print three fixed customer rows (indices 15, 62, and 114) to inspect their feature values and assigned cluster.

In [8]:
sample_idx = [15, 62, 114]
sanity = df.loc[sample_idx, FEATURES + ["Cluster"]]
print("\n=== SANITY CHECK ===")
print(sanity)


=== SANITY CHECK ===
     Income_$  SpendingScore  Cluster
15         19             86        4
62         68             51        0
114        43             14        1
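
A stricter check is to confirm that each point is labeled with its nearest center, which is how K-Means assigns clusters. The sketch below does this with NumPy on made-up scaled coordinates. The arrays `points` and `centers` are hypothetical; with the real model, `X_scaled` and `kmeans.cluster_centers_` would take their place.

```python
import numpy as np

# Hypothetical scaled points and centers (made-up values for illustration).
points = np.array([[-1.2, 1.1],
                   [0.3, 0.1],
                   [-0.4, -1.5]])
centers = np.array([[0.2, 0.0],
                    [-0.5, -1.4],
                    [-1.1, 1.2]])

# Pairwise squared Euclidean distances, shape (n_points, n_centers).
diff = points[:, None, :] - centers[None, :, :]
sq_dist = (diff ** 2).sum(axis=2)

# Each point is assigned to the center with the smallest distance.
nearest = sq_dist.argmin(axis=1)
print(nearest)
```

The distances must be computed in scaled space, because that is where the model was fit.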


## 8. Save Output

We export the final labeled data to a CSV file.

In [9]:
OUT_PATH = "spending_labeled_clusters.csv"
df.to_csv(OUT_PATH, index=False)
print(f"\nSaved labeled clusters to: {OUT_PATH}")


Saved labeled clusters to: spending_labeled_clusters.csv
