<a href="https://colab.research.google.com/github/yulita231/Learning-Journey/blob/main/Copy_of_Day_5_Clustering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Clustering on e-commerce data**

We have a dataset from the University of California, Irvine that covers online retail (e-commerce) transactions (https://archive.ics.uci.edu/ml/datasets/online+retail).

Here we study which products generate the highest revenue.

The goal is to support business decisions about which products are worth promoting or improving.

In [None]:
import pandas as pd
import datetime as dt
import numpy as np
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go

import joblib
import os

from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import silhouette_score, silhouette_samples


# Import the Google Colab helper for file uploads
# purpose: lets the user upload a file from their computer straight into Colab
from google.colab import files

pd.options.mode.chained_assignment = None  # default='warn'

## **Import Data**

In [None]:
# Open the file upload dialog in Google Colab
# purpose: let the user pick a CSV/Excel file from the local machine
uploaded = files.upload()

# Take the first file name from the upload result
# files.upload() returns a dictionary {filename: filecontent},
# so this line picks just the key (the file name)
filename = next(iter(uploaded))

# Print the name of the uploaded file
# purpose: confirm the right file is in place before reading it with pandas
print(filename)

Saving clean_ecommerce_ready.csv to clean_ecommerce_ready.csv
clean_ecommerce_ready.csv


In [None]:
df = pd.read_csv(filename, sep = ";")

df.drop('Unnamed: 0', axis=1, inplace=True)

In [None]:
df.shape

(17139, 12)

In [None]:
df.describe(include = 'all')

Unnamed: 0,CustomerID,StockCode,Description,Country,UnitPrice,InvoiceDate,InvoiceNo,Quantity,total_purchase,revenue,gap_time,recency
count,17139.0,17139,17139,17139,17139.0,17139,17139.0,17139.0,17139.0,17139.0,17139,17139.0
unique,,2598,2667,1,,8218,,,,,8218,
top,,85123A,WHITE HANGING HEART T-LIGHT HOLDER,United Kingdom,,2011-12-05 17:17:00,,,,,5 days 06:42:59,
freq,,103,103,17139,,30,,,,,30,
mean,15541.432347,,,,2.884036,,560610.782834,11.989673,11.964525,20.523361,,152.618297
std,1590.240893,,,,4.930906,,13104.432832,47.299645,47.281585,67.579785,,112.731506
min,12747.0,,,,0.04,,536365.0,1.0,1.0,0.19,,1.0
25%,14194.0,,,,1.25,,549235.0,2.0,2.0,4.2,,50.0
50%,15502.0,,,,1.95,,561889.0,5.0,5.0,10.5,,132.0
75%,16931.0,,,,3.75,,572220.5,12.0,12.0,18.0,,247.0


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17139 entries, 0 to 17138
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   CustomerID      17139 non-null  int64  
 1   StockCode       17139 non-null  object 
 2   Description     17139 non-null  object 
 3   Country         17139 non-null  object 
 4   UnitPrice       17139 non-null  float64
 5   InvoiceDate     17139 non-null  object 
 6   InvoiceNo       17139 non-null  int64  
 7   Quantity        17139 non-null  int64  
 8   total_purchase  17139 non-null  float64
 9   revenue         17139 non-null  float64
 10  gap_time        17139 non-null  object 
 11  recency         17139 non-null  int64  
dtypes: float64(3), int64(4), object(5)
memory usage: 1.6+ MB


In [None]:
df['CustomerID'] = df['CustomerID'].astype(str)

In [None]:
df.isna().sum()

Unnamed: 0,0
CustomerID,0
StockCode,0
Description,0
Country,0
UnitPrice,0
InvoiceDate,0
InvoiceNo,0
Quantity,0
total_purchase,0
revenue,0


## **Data Exploration**

InvoiceNo: Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. If this code starts with letter 'c', it indicates a cancellation.

StockCode: Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product.

Description: Product (item) name. Nominal.

Quantity: The quantities of each product (item) per transaction. Numeric.

InvoiceDate: Invoice date and time. Numeric, the day and time when each transaction was generated.

UnitPrice: Unit price. Numeric, Product price per unit in sterling.

CustomerID: Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer.

Country: Country name. Nominal, the name of the country where each customer resides.

Source: https://archive.ics.uci.edu/ml/datasets/online+retail

**Products**

We count the number of unique products, transactions, and customers in the cleaned dataset.

In [None]:
clean_data = df

pd.DataFrame([{'products': len(clean_data['StockCode'].value_counts()),
               'transactions': len(clean_data['InvoiceNo'].value_counts()),
               'customers': len(clean_data['CustomerID'].value_counts()),
              }], columns = ['products', 'transactions', 'customers'], index = ['quantity'])

Unnamed: 0,products,transactions,customers
quantity,2598,8484,2949


In [None]:
# 1) Group the data by StockCode and Description (product code and name).
#    For each product, sum the total Quantity and revenue.
#    Then sort the result by revenue, largest first.
product_group = (clean_data.groupby(['StockCode', 'Description'], as_index=False)[['Quantity', 'revenue']]
                            .agg('sum')
                            .sort_values('revenue', ascending=False))

# 2) Show the top 20 products by revenue.
print(product_group.head(20))

# 3) Build a bar chart with Plotly Express:
#    - x = Description (product name)
#    - y = revenue
#    - custom_data carries StockCode and Quantity (shown in the tooltip).
fig = px.bar(
    product_group.head(20),
    x='Description',
    y='revenue',
    custom_data=['StockCode', 'Quantity']
)

# 4) Set the hover tooltip:
#    - %{x} shows Description (product name)
#    - %{customdata[0]} shows StockCode
#    - %{customdata[1]} shows Quantity
#    - %{y} shows the revenue value
fig.update_traces(
    hovertemplate='Description: %{x}<br>'
                  'StockCode: %{customdata[0]}<br>'
                  'Quantity: %{customdata[1]}<br>'
                  'revenue: %{y}'
)

# 5) Show the interactive chart.
fig.show()


     StockCode                         Description  Quantity  revenue
2550    85123A  WHITE HANGING HEART T-LIGHT HOLDER      2277  6490.93
1022     22423            REGENCY CAKESTAND 3 TIER       522  5640.00
2532     85066         CREAM SWEETHEART MINI CHEST       307  3790.46
258      21137            BLACK RECORD COVER FRAME      1095  3726.09
2262     79321                       CHILLI LIGHTS       780  3725.13
2540    85099B             JUMBO BAG RED RETROSPOT      1721  3190.13
2187     48187                 DOORMAT NEW ENGLAND       612  3151.80
1267     22693  GROW A FLYTRAP OR SUNFLOWER IN TIN      3216  3051.36
2268     82484   WOOD BLACK BOARD ANT WHITE FINISH       544  2947.38
1812     23284       DOORMAT KEEP CALM AND COME IN       462  2824.60
2180     48111               DOORMAT 3 SMILEY CATS       520  2438.50
2432     84879       ASSORTED COLOUR BIRD ORNAMENT      1509  2435.01
1276     22702            BLACK AND WHITE CAT BOWL       987  2214.30
2165     47566      

Let's look at the customers as well.

In [None]:
# 1) Aggregate per customer-product combination:
#    - group by CustomerID, StockCode, Description
#    - sum Quantity and revenue for each combination
cust_prod = (clean_data
             .groupby(['CustomerID', 'StockCode', 'Description'], as_index=False)[['Quantity', 'revenue']]
             .sum())

# 2) Sort all combinations by revenue in descending order,
#    so the first row for each CustomerID is the product with the highest revenue.
cust_prod_sorted = cust_prod.sort_values('revenue', ascending=False)

# 3) Keep the top row per CustomerID (each customer's top product),
#    then sort again by revenue (descending) and reset the index.
cust_top = (cust_prod_sorted
            .drop_duplicates(subset='CustomerID', keep='first')
            .sort_values('revenue', ascending=False)
            .reset_index(drop=True))

# 4) Take the 20 customers with the highest top-product revenue.
top20 = cust_top.head(20)

# 5) Build the bar chart:
#    - x = CustomerID
#    - y = revenue
#    - custom_data passes column names so the tooltip can show StockCode, Description, Quantity
fig = px.bar(
    top20,
    x='CustomerID',
    y='revenue',
    custom_data=['StockCode', 'Description', 'Quantity']
)

# 6) Set the hover tooltip template:
#    - %{x} = CustomerID
#    - %{customdata[0]} = StockCode
#    - %{customdata[1]} = Description
#    - %{customdata[2]} = Quantity
#    - %{y} = revenue value (safer than %{value})
fig.update_traces(
    hovertemplate='CustomerID: %{x}<br>'
                  'StockCode: %{customdata[0]}<br>'
                  'Description: %{customdata[1]}<br>'
                  'Quantity: %{customdata[2]}<br>'
                  'revenue: %{y}'
)

# 7) Show the interactive chart.
fig.show()

In [None]:
# 1) Group the data by CustomerID.
#    Sum the total Quantity and revenue for each customer.
#    The result is a new DataFrame with one row per CustomerID.
#    Then sort the customers by revenue in descending order.
agg = (clean_data.groupby('CustomerID', as_index=False)[['Quantity', 'revenue']]
                 .agg('sum')
                 .sort_values('revenue', ascending=False))

# 2) Take the 20 customers with the highest revenue.
top20 = agg.head(20)

# 3) Build an interactive bar chart with Plotly Express:
#    - the X axis shows CustomerID
#    - the Y axis shows revenue
#    - custom_data includes the Quantity column so it can appear in the tooltip
fig = px.bar(
    top20,
    x='CustomerID',
    y='revenue',
    custom_data=['Quantity']  # pass the column name, which is tidier
)

# 4) Change the tooltip template (the info shown on hover):
#    - %{x} = CustomerID
#    - %{customdata[0]} = Quantity
#    - %{y} = revenue (Y-axis value)
fig.update_traces(
    hovertemplate='CustomerID: %{x}<br>'
                  'Quantity: %{customdata[0]}<br>'
                  'revenue: %{y}'
)

# 5) Show the chart.
fig.show()

## **Feature Engineering**

We will build segmentations based on product, country of origin, and customer. The approach is a simple RFM (Recency, Frequency, Monetary) segmentation.

Recency = the time span between a purchase and the most recent time. Because the data is fairly old, we set the reference "current time" to one day after (D+1) the last purchase in the data.

Frequency = the number of purchases, taken from the Quantity column.

Monetary = the revenue generated, taken from the revenue column.
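
As a minimal sketch of the three definitions above, the following computes RFM per product on a tiny made-up table. The rows and the names `toy`, `now`, and `toy_rfm` are illustrative assumptions, not taken from the dataset; only the column names mirror the ones used in this notebook. For simplicity the reference time is just the last date plus one day.

```python
import pandas as pd

# Hypothetical toy transactions (values are made up for illustration)
toy = pd.DataFrame({
    "StockCode": ["A", "A", "B"],
    "InvoiceDate": pd.to_datetime(["2011-12-01", "2011-12-08", "2011-11-30"]),
    "Quantity": [2, 3, 10],
    "revenue": [5.0, 7.5, 12.0],
})

# Reference time: one day after the last purchase in the toy data
now = toy["InvoiceDate"].max() + pd.Timedelta(days=1)

# Recency in whole days for every transaction
toy["recency"] = (now - toy["InvoiceDate"]).dt.days

# One row per product: most recent purchase, total quantity, total revenue
toy_rfm = toy.groupby("StockCode", as_index=False).agg(
    recency=("recency", "min"),
    frequency=("Quantity", "sum"),
    monetary=("revenue", "sum"),
)
print(toy_rfm)
```

Taking the minimum recency per product means a product counts as "recent" if it was bought at least once close to the reference time.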

**Feature Engineering - RFM**

We create a recency column that measures the time span between each purchase and the reference time, so we first look up the latest purchase date in clean_data.

In [None]:
clean_data['InvoiceDate'].describe()

Unnamed: 0,InvoiceDate
count,17139
unique,8218
top,2011-12-05 17:17:00
freq,30


In [None]:
# 1) Make sure InvoiceDate has a datetime dtype
clean_data['InvoiceDate'] = pd.to_datetime(clean_data['InvoiceDate'])

# 2) Take the latest date, add 1 day, and set the time to 23:59:59
recent_time = (
    clean_data['InvoiceDate'].max()
    + pd.Timedelta(days=1)
).replace(hour=23, minute=59, second=59)

print(recent_time)

2011-12-10 23:59:59


We now have the D+1 reference time after the last purchase; next we compute the time span in days for each purchase.

In [None]:
clean_data['gap_time'] = (recent_time - clean_data['InvoiceDate'])
clean_data['recency'] = clean_data['gap_time'].dt.days
print(clean_data[clean_data['InvoiceDate'] == '2011-12-09 12:50:00'])
clean_data.head()

Empty DataFrame
Columns: [CustomerID, StockCode, Description, Country, UnitPrice, InvoiceDate, InvoiceNo, Quantity, total_purchase, revenue, gap_time, recency]
Index: []


Unnamed: 0,CustomerID,StockCode,Description,Country,UnitPrice,InvoiceDate,InvoiceNo,Quantity,total_purchase,revenue,gap_time,recency
0,13744,21670,BLUE SPOT CERAMIC DRAWER KNOB,United Kingdom,1.25,2011-02-20 14:08:00,544461,12,12.0,15.0,293 days 09:51:59,293
1,14081,22066,LOVE HEART TRINKET POT,United Kingdom,1.45,2011-02-10 16:17:00,543631,6,6.0,8.7,303 days 07:42:59,303
2,15311,22752,SET 7 BABUSHKA NESTING BOXES,United Kingdom,8.5,2011-06-07 13:56:00,555855,1,1.0,8.5,186 days 10:03:59,186
3,15006,72817,SET OF 2 CHRISTMAS DECOUPAGE CANDLE,United Kingdom,0.79,2011-10-19 15:08:00,571909,2,2.0,1.58,52 days 08:51:59,52
4,16360,22383,LUNCH BAG SUKI DESIGN,United Kingdom,1.65,2011-11-20 11:56:00,577485,2,2.0,3.3,20 days 12:03:59,20


All RFM columns are now available, so we are ready for the next step.

**Feature Engineering - Scaling**

K-means is an unsupervised algorithm that groups data points with similar attribute values. It uses Euclidean distance to measure how far each point is from the cluster centers, and each center (centroid) is the mean of the points assigned to it.

Keep in mind that because K-means relies on Euclidean distance and on the mean as the cluster center (centroid), it works best when the features are on comparable scales and roughly symmetric (close to normal), since means are sensitive to heavy skew and outliers. When that does not hold, we need to rescale or transform the data, or choose another method such as k-medians.

https://towardsdatascience.com/understanding-k-means-clustering-in-machine-learning-6a6e67336aa1

https://medium.com/analytics-vidhya/why-is-scaling-required-in-knn-and-k-means-8129e4d88ed7

https://mull-over-things.com/is-scaling-required-for-k-means-clustering/
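
To make the description concrete, here is a minimal NumPy sketch of a single K-means iteration: assign each point to its nearest centroid by Euclidean distance, then recompute each centroid as the mean of its points. The points and starting centroids are made up, and this is only an illustration of the idea, not how `sklearn.cluster.KMeans` is implemented internally.

```python
import numpy as np

# Hypothetical 2-D points and two starting centroids (made up for illustration)
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]])
centroids = np.array([[1.0, 1.0], [9.0, 9.0]])

# Assignment step: Euclidean distance from every point to every centroid
dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
labels = dists.argmin(axis=1)

# Update step: each centroid becomes the mean of the points assigned to it
new_centroids = np.array(
    [points[labels == k].mean(axis=0) for k in range(len(centroids))]
)

print(labels)
print(new_centroids)
```

The full algorithm repeats these two steps until the assignments stop changing. Because the update step is a plain mean, a single extreme point can pull a centroid far from the bulk of its cluster.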


RFM features have different units (time, count, and money), so the normalization method must be chosen carefully for data that varies widely and is highly skewed.

https://towardsdatascience.com/normalization-vs-standardization-quantitative-analysis-a91e8a79cebf

https://www.statology.org/standardization-vs-normalization/

https://www.geeksforgeeks.org/normalization-vs-standardization/
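
The effect of mixed units on Euclidean distance can be sketched with three made-up products. All numbers below, including the assumed feature ranges `lo` and `hi`, are hypothetical and chosen only to show how the nearest neighbour can change once the features are rescaled.

```python
import numpy as np

# Three hypothetical products as (recency in days, revenue); all numbers are made up
p = np.array([10.0, 1000.0])
q = np.array([300.0, 1050.0])   # very different recency, similar revenue
r = np.array([12.0, 1300.0])    # similar recency, somewhat different revenue

# Assumed feature ranges used for min-max scaling (illustrative only)
lo = np.array([1.0, 0.0])
hi = np.array([374.0, 6500.0])

def min_max(v):
    """Rescale each feature linearly to [0, 1] using the assumed ranges."""
    return (v - lo) / (hi - lo)

# Raw Euclidean distances: with mixed units, q looks slightly closer to p than r does
raw_pq = np.linalg.norm(p - q)
raw_pr = np.linalg.norm(p - r)

# Distances after scaling: r is now clearly the closer neighbour of p
scaled_pq = np.linalg.norm(min_max(p) - min_max(q))
scaled_pr = np.linalg.norm(min_max(p) - min_max(r))

print(f"raw:    d(p,q)={raw_pq:.1f}  d(p,r)={raw_pr:.1f}")
print(f"scaled: d(p,q)={scaled_pq:.3f}  d(p,r)={scaled_pr:.3f}")
```

Without scaling, the feature with the larger numeric range carries most of the weight in the distance, regardless of whether it matters more for the business question.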

In [None]:
# 1) Create subplots with 3 rows and 1 column
fig = make_subplots(
    rows=3, cols=1,
    subplot_titles=(
        "Histogram of recency",
        "Histogram of frequency/quantity",
        "Histogram of monetary/revenue"
    )
)

# 2) Recency histogram (red, with legend entry "Recency")
hist1 = px.histogram(clean_data, x="recency", color_discrete_sequence=['red'])
trace1 = hist1.data[0]
trace1.name = "Recency"
trace1.showlegend = True
fig.add_trace(trace1, row=1, col=1)

# 3) Quantity histogram (gold, with legend entry "Quantity")
hist2 = px.histogram(clean_data, x="Quantity", color_discrete_sequence=['gold'])
trace2 = hist2.data[0]
trace2.name = "Quantity"
trace2.showlegend = True
fig.add_trace(trace2, row=2, col=1)

# 4) Revenue histogram (green, with legend entry "Revenue")
hist3 = px.histogram(clean_data, x="revenue", color_discrete_sequence=['green'])
trace3 = hist3.data[0]
trace3.name = "Revenue"
trace3.showlegend = True
fig.add_trace(trace3, row=3, col=1)

# 5) Configure the layout: enable the legend
fig.update_layout(
    height=900,
    width=1500,
    showlegend=True,
    legend=dict(
        orientation="h",  # horizontal legend
        yanchor="bottom",
        y=-0.15,
        xanchor="center",
        x=0.5
    )
)

# 6) Show the plot
fig.show()


**Product RFM**

In [None]:
product_recency = clean_data.groupby(by = 'StockCode', as_index = False)[['recency']].min()
product_fm = clean_data.groupby(by = 'StockCode', as_index = False)[['Quantity', 'revenue']].sum()

product_rfm = pd.merge(product_recency, product_fm, how = 'inner', left_on = 'StockCode', right_on = 'StockCode')
product_rfm

Unnamed: 0,StockCode,recency,Quantity,revenue
0,10080,26,12,4.68
1,10120,6,8,1.68
2,10123C,281,3,1.95
3,10124A,268,4,1.68
4,10124G,373,5,2.10
...,...,...,...,...
2593,90214C,207,1,1.25
2594,90214G,3,1,0.29
2595,90214I,2,1,0.29
2596,90214K,184,3,3.75


**Product RFM - Feature Engineering - Scaling Recency, Quantity, & Revenue**

In [None]:
# 1) Normalize Recency to [0,1] (min-max scaling)
a, b = 0, 1
x, y = product_rfm.recency.min(), product_rfm.recency.max()
product_rfm['recency_norm'] = (product_rfm.recency - x) / (y - x) * (b - a) + a

# 2) Normalize Quantity to [0,1]
x, y = product_rfm.Quantity.min(), product_rfm.Quantity.max()
product_rfm['quantity_norm'] = (product_rfm.Quantity - x) / (y - x) * (b - a) + a

# 3) Normalize Revenue to [0,1]
x, y = product_rfm.revenue.min(), product_rfm.revenue.max()
product_rfm['revenue_norm'] = (product_rfm.revenue - x) / (y - x) * (b - a) + a

**Show the scaling results**

In [None]:
# 1) Create a subplot grid with 3 rows × 1 column
fig = make_subplots(
    rows=3, cols=1,
    subplot_titles=("Histogram of Recency (normalized)",
                    "Histogram of Quantity (normalized)",
                    "Histogram of Revenue (normalized)")
)

# 2) Recency histogram (red, with legend entry "Recency")
hist1 = px.histogram(product_rfm, x="recency_norm", color_discrete_sequence=['red'])
trace1 = hist1.data[0]
trace1.name = "Recency"
trace1.showlegend = True
fig.add_trace(trace1, row=1, col=1)

# 3) Quantity histogram (gold, with legend entry "Quantity")
hist2 = px.histogram(product_rfm, x="quantity_norm", color_discrete_sequence=['gold'])
trace2 = hist2.data[0]
trace2.name = "Quantity"
trace2.showlegend = True
fig.add_trace(trace2, row=2, col=1)

# 4) Revenue histogram (green, with legend entry "Revenue")
hist3 = px.histogram(product_rfm, x="revenue_norm", color_discrete_sequence=['green'])
trace3 = hist3.data[0]
trace3.name = "Revenue"
trace3.showlegend = True
fig.add_trace(trace3, row=3, col=1)

# 5) Configure the layout: enable the legend
fig.update_layout(
    height=900,
    width=1500,
    showlegend=True,
    legend=dict(
        orientation="h",  # horizontal legend
        yanchor="bottom",
        y=-0.15,
        xanchor="center",
        x=0.5
    )
)

# 6) Show the interactive chart
fig.show()

**Product RFM - Feature Engineering - Log Scaling**

Even after min-max scaling the distributions remain heavily skewed, because min-max scaling is a linear transformation and does not change the shape of a distribution. We therefore try logarithmic scaling, which is meant to reduce high skewness. Put simply, we apply log(x) to every value.

https://www.forbes.com/sites/naomirobbins/2012/01/19/when-should-i-use-logarithmic-scales-in-my-charts-and-graphs/?sh=153b409c5e67
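
The difference between the two kinds of scaling can be sketched on synthetic data. The log-normal sample below is a made-up stand-in for a right-skewed column such as Quantity or revenue, and `min_max` is a hypothetical helper written for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed sample (log-normal), standing in for Quantity or revenue
rng = np.random.default_rng(0)
sample = pd.Series(rng.lognormal(mean=2.0, sigma=1.0, size=5000))

def min_max(s):
    """Rescale a Series linearly to [0, 1]."""
    return (s - s.min()) / (s.max() - s.min())

skew_raw = sample.skew()
skew_minmax = min_max(sample).skew()   # linear rescaling leaves skewness unchanged
skew_log = np.log(sample).skew()       # log transform pulls in the long right tail

print(f"raw={skew_raw:.2f}  min-max={skew_minmax:.2f}  log={skew_log:.2f}")
```

Note that `np.log` is only defined for strictly positive values, so a column containing zeros would need a different transform (for example `np.log1p`).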

In [None]:
# 1) Log-transform Recency, then normalize to [0,1]
#    (np.log needs strictly positive values; recency, Quantity and revenue are all > 0 here)
product_rfm['recency_log'] = np.log(product_rfm['recency'])
a, b = 0, 1
x, y = product_rfm.recency_log.min(), product_rfm.recency_log.max()
product_rfm['recency_lognorm'] = (product_rfm.recency_log - x) / (y - x) * (b - a) + a

# 2) Log-transform Quantity, then normalize to [0,1]
product_rfm['quantity_log'] = np.log(product_rfm['Quantity'])
x, y = product_rfm.quantity_log.min(), product_rfm.quantity_log.max()
product_rfm['quantity_lognorm'] = (product_rfm.quantity_log - x) / (y - x) * (b - a) + a

# 3) Log-transform Revenue, then normalize to [0,1]
product_rfm['revenue_log'] = np.log(product_rfm['revenue'])
x, y = product_rfm.revenue_log.min(), product_rfm.revenue_log.max()
product_rfm['revenue_lognorm'] = (product_rfm.revenue_log - x) / (y - x) * (b - a) + a

In [None]:
# 1) Create a subplot grid with 3 rows × 1 column
fig = make_subplots(
    rows=3, cols=1,
    subplot_titles=("Histogram of Recency (log-normalized)",
                    "Histogram of Quantity (log-normalized)",
                    "Histogram of Revenue (log-normalized)")
)

# 2) Recency histogram (red, with legend entry "Recency")
hist1 = px.histogram(product_rfm, x="recency_lognorm", color_discrete_sequence=['red'])
trace1 = hist1.data[0]
trace1.name = "Recency"
trace1.showlegend = True
fig.add_trace(trace1, row=1, col=1)

# 3) Quantity histogram (gold, with legend entry "Quantity")
hist2 = px.histogram(product_rfm, x="quantity_lognorm", color_discrete_sequence=['gold'])
trace2 = hist2.data[0]
trace2.name = "Quantity"
trace2.showlegend = True
fig.add_trace(trace2, row=2, col=1)

# 4) Revenue histogram (green, with legend entry "Revenue")
hist3 = px.histogram(product_rfm, x="revenue_lognorm", color_discrete_sequence=['green'])
trace3 = hist3.data[0]
trace3.name = "Revenue"
trace3.showlegend = True
fig.add_trace(trace3, row=3, col=1)

# 5) Configure the layout: enable the legend
fig.update_layout(
    height=900,
    width=1500,
    showlegend=True,
    legend=dict(
        orientation="h",  # horizontal legend
        yanchor="bottom",
        y=-0.15,
        xanchor="center",
        x=0.5
    )
)

# 6) Show the interactive chart
fig.show()

After the log transform followed by normalization, the distributions change substantially and two of the columns are close to normal, so we are ready for the next stage.

## Clustering

**Clustering Product**

Using the log-normalized data, we now try to group products based on RFM.

In [None]:
# =======================================================
# CLUSTERING PIPELINE (KMeans + DBSCAN): one structured loop
# =======================================================

import numpy as np
import pandas as pd

# 1) Prepare the features (assumed to be log-normalized as above)
X = product_rfm[['recency_lognorm', 'quantity_lognorm', 'revenue_lognorm']]

In [None]:
# 2) MODEL DEFINITIONS: collect every candidate in one dict (as in the classification example)
models = {}

# 2a) KMeans with a varying number of clusters (2..10)
for k in range(2, 11):
    models[f"KMeans_k={k}"] = KMeans(n_clusters=k, random_state=9)

# 2b) DBSCAN over a grid of eps × min_samples
eps_values = [0.1, 0.2, 0.3, 0.5, 0.7, 1.0]
min_samples_values = [3, 5, 10]
for eps in eps_values:
    for ms in min_samples_values:
        models[f"DBSCAN_eps={eps}_min={ms}"] = DBSCAN(eps=eps, min_samples=ms)

In [None]:
# 3) TRAIN + EVALUATE: a single loop over all models
records = []  # will become the rows of the results table

for name, mdl in models.items():
    # ---- TRAIN ----
    # KMeans and DBSCAN both provide .fit_predict(X), which returns the labels directly;
    # the fallback to .fit(X) + labels_ covers estimators that lack fit_predict
    labels = getattr(mdl, "fit_predict", None)
    if callable(labels):
        labels = mdl.fit_predict(X)
    else:
        mdl.fit(X)
        labels = mdl.labels_

    # ---- GENERAL EVALUATION ----
    # Count clusters: for DBSCAN, label -1 means noise and is not counted as a cluster
    if isinstance(mdl, DBSCAN):
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    else:
        n_clusters = getattr(mdl, "n_clusters", None)  # KMeans has this attribute

    # Silhouette is only valid with more than one cluster
    # (for DBSCAN, noise points labeled -1 are included here as if they formed a cluster)
    silh_avg = silhouette_score(X, labels) if (n_clusters is not None and n_clusters > 1) else np.nan

    # ---- KMeans-SPECIFIC METRIC ----
    # inertia_ only exists on KMeans (the SSE used for the elbow method)
    inertia = getattr(mdl, "inertia_", np.nan)

    # ---- STORE THE RESULT ----
    records.append({
        "Model": name,                          # model name plus its parameters
        "Type": type(mdl).__name__,             # model type (KMeans or DBSCAN)
        "Clusters": n_clusters,                 # number of clusters (DBSCAN noise not counted)
        "Silhouette_Score": silh_avg,           # cluster separation quality (-1..1; higher is better)
        "Inertia": inertia                      # only meaningful for KMeans; NaN for DBSCAN
    })
    print(f"✔ Selesai: {name} | clusters={n_clusters} | silhouette={silh_avg:.3f} | inertia={inertia if not np.isnan(inertia) else '—'}")

✔ Selesai: KMeans_k=2 | clusters=2 | silhouette=0.403 | inertia=165.46243556085278
✔ Selesai: KMeans_k=3 | clusters=3 | silhouette=0.305 | inertia=131.99967400388826
✔ Selesai: KMeans_k=4 | clusters=4 | silhouette=0.321 | inertia=99.32123689130377
✔ Selesai: KMeans_k=5 | clusters=5 | silhouette=0.308 | inertia=82.17133462437076
✔ Selesai: KMeans_k=6 | clusters=6 | silhouette=0.289 | inertia=73.64683504094725
✔ Selesai: KMeans_k=7 | clusters=7 | silhouette=0.303 | inertia=62.79634970183645
✔ Selesai: KMeans_k=8 | clusters=8 | silhouette=0.294 | inertia=56.82973127676564
✔ Selesai: KMeans_k=9 | clusters=9 | silhouette=0.277 | inertia=52.94834904066928
✔ Selesai: KMeans_k=10 | clusters=10 | silhouette=0.275 | inertia=48.5008244095901
✔ Selesai: DBSCAN_eps=0.1_min=3 | clusters=5 | silhouette=-0.004 | inertia=—
✔ Selesai: DBSCAN_eps=0.1_min=5 | clusters=4 | silhouette=0.160 | inertia=—
✔ Selesai: DBSCAN_eps=0.1_min=10 | clusters=1 | silhouette=nan | inertia=—
✔ Selesai: DBSCAN_eps=0.2_min=3

In [None]:
# 4) SUMMARIZE THE RESULTS IN A TABLE
results_df = pd.DataFrame(records)

# 5) SORT: silhouette first (descending), then inertia (ascending) for KMeans; NaN goes last
results_df = results_df.sort_values(
    by=["Silhouette_Score", "Inertia"],
    ascending=[False, True],
    na_position="last"
).reset_index(drop=True)

In [None]:
# 6) SHOW THE SUMMARY
results_df = results_df.round({"Silhouette_Score": 3, "Inertia": 0})
results_df

Unnamed: 0,Model,Type,Clusters,Silhouette_Score,Inertia
0,KMeans_k=2,KMeans,2,0.403,165.0
1,KMeans_k=4,KMeans,4,0.321,99.0
2,KMeans_k=5,KMeans,5,0.308,82.0
3,KMeans_k=3,KMeans,3,0.305,132.0
4,KMeans_k=7,KMeans,7,0.303,63.0
5,KMeans_k=8,KMeans,8,0.294,57.0
6,KMeans_k=6,KMeans,6,0.289,74.0
7,KMeans_k=9,KMeans,9,0.277,53.0
8,KMeans_k=10,KMeans,10,0.275,49.0
9,DBSCAN_eps=0.1_min=5,DBSCAN,4,0.16,


In [None]:
# keep only the KMeans rows, sorted by x (Clusters) in ascending order
km = (results_df
      .query("Type == 'KMeans'")
      .sort_values('Clusters')
      .reset_index(drop=True))

# elbow plot
fig = px.line(km, x='Clusters', y='Inertia', markers=True,
              title='Elbow Plot (KMeans)',
              labels={'Clusters':'k (number of clusters)', 'Inertia':'SSE / Inertia'})
fig.update_layout(xaxis=dict(dtick=1))
fig.show()

In [None]:
# --- 1) Prepare the KMeans data ---
km = (results_df
      .query("Type == 'KMeans'")
      .dropna(subset=['Silhouette_Score'])
      .sort_values('Clusters')
      .reset_index(drop=True))

# --- 2) Prepare the DBSCAN data ---
db = results_df.query("Type == 'DBSCAN'").copy()
# if the eps/min_samples columns are missing, extract them from the Model column
if 'eps' not in db.columns or 'min_samples' not in db.columns:
    m = db['Model'].str.extract(r'eps=([0-9.]+).*?min=(\d+)', expand=True)
    db['eps'] = m[0].astype(float)
    db['min_samples'] = m[1].astype(int)
db = db.dropna(subset=['Silhouette_Score'])
db = db.sort_values(['min_samples', 'eps']).reset_index(drop=True)

# --- 3) Create 1x2 subplots ---
fig = make_subplots(
    rows=1, cols=2,
    subplot_titles=(
        "KMeans — Silhouette vs k",
        "DBSCAN — Silhouette vs eps (per min_samples)"
    )
)

# --- 4) KMeans: line plot (k vs silhouette) ---
if not km.empty:
    fig.add_trace(
        go.Scatter(
            x=km['Clusters'], y=km['Silhouette_Score'],
            mode='lines+markers', name='KMeans',
            hovertemplate='k=%{x}<br>silhouette=%{y:.3f}<extra></extra>'
        ),
        row=1, col=1
    )

# --- 5) DBSCAN: line per min_samples (eps vs silhouette) ---
for ms, grp in db.groupby('min_samples', sort=True):
    fig.add_trace(
        go.Scatter(
            x=grp['eps'], y=grp['Silhouette_Score'],
            mode='lines+markers', name=f'DBSCAN (min_samples={ms})',
            hovertemplate='eps=%{x}<br>silhouette=%{y:.3f}<extra></extra>'
        ),
        row=1, col=2
    )

# --- 6) Layout ---
fig.update_xaxes(title_text="k (Clusters)", dtick=1, row=1, col=1)
fig.update_yaxes(title_text="Silhouette Score", row=1, col=1)

fig.update_xaxes(title_text="eps", row=1, col=2)
fig.update_yaxes(title_text="Silhouette Score", row=1, col=2)

fig.update_layout(
    title="Silhouette Score — KMeans & DBSCAN",
    height=450, width=1500,
    legend=dict(orientation='h', yanchor='bottom', y=1.1, x=0.5, xanchor='center')
)

fig.show()

In [None]:
results_df_kmeans = results_df[results_df['Type'] == 'KMeans'].sort_values(
    "Inertia",
    ascending=[False],
    na_position="last"
).reset_index(drop=True)

results_df_kmeans['diff_inertia'] = results_df_kmeans['Inertia'].shift(1) - results_df_kmeans['Inertia']
results_df_kmeans['diff_inertia_2'] = results_df_kmeans['diff_inertia'].shift(1) - results_df_kmeans['diff_inertia']

results_df_kmeans

Unnamed: 0,Model,Type,Clusters,Silhouette_Score,Inertia,diff_inertia,diff_inertia_2
0,KMeans_k=2,KMeans,2,0.403,165.0,,
1,KMeans_k=3,KMeans,3,0.305,132.0,33.0,
2,KMeans_k=4,KMeans,4,0.321,99.0,33.0,0.0
3,KMeans_k=5,KMeans,5,0.308,82.0,17.0,16.0
4,KMeans_k=6,KMeans,6,0.289,74.0,8.0,9.0
5,KMeans_k=7,KMeans,7,0.303,63.0,11.0,-3.0
6,KMeans_k=8,KMeans,8,0.294,57.0,6.0,5.0
7,KMeans_k=9,KMeans,9,0.277,53.0,4.0,2.0
8,KMeans_k=10,KMeans,10,0.275,49.0,4.0,0.0


Based on the elbow plot and the inertia/SSE (sum of squared errors, i.e. the sum of squared distances from each point to its nearest centroid), 5 clusters is the best choice: the drop in inertia from 4 to 5 clusters is still substantial (about 17), while the drop from 5 to 6 is much smaller (about 8). The silhouette score, however, favors 2 clusters, even though the points within each cluster are then still too far apart (inertia is highest at k=2).
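
The elbow reasoning can be checked with a few lines of arithmetic. This sketch copies the rounded inertia values for k = 2..10 from the results table above and prints how much inertia drops with each extra cluster; the variable names are illustrative.

```python
import numpy as np

# Inertia values for k = 2..10, copied (rounded) from the results table above
k_values = np.arange(2, 11)
inertia = np.array([165.0, 132.0, 99.0, 82.0, 74.0, 63.0, 57.0, 53.0, 49.0])

# Drop in inertia when going from k-1 to k clusters
drop = -np.diff(inertia)

for k, d in zip(k_values[1:], drop):
    print(f"k={k - 1} -> k={k}: inertia drops by {d:.0f}")
```

The drops stay large up to k=4, roughly halve at k=5, and are small afterwards, which is the "elbow" pattern. Reading an elbow is a judgment call, so it is worth cross-checking with the silhouette score.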

In [None]:
# Data
X = product_rfm[['recency_lognorm','quantity_lognorm','revenue_lognorm']].values

# Fit the KMeans model with the chosen k
k_best = 5
km = KMeans(n_clusters=k_best, random_state=9).fit(X)
labels = km.labels_

# Compute the silhouette scores
silh_avg = silhouette_score(X, labels)
silh_vals = silhouette_samples(X, labels)

# Create the figure
fig = go.Figure()
y_pos = 0

# Loop over each cluster
for c in np.unique(labels):
    c_vals = silh_vals[labels == c]
    c_vals.sort()
    size = len(c_vals)
    y_range = list(range(y_pos, y_pos + size))

    fig.add_trace(go.Bar(
        x=c_vals,
        y=y_range,
        orientation='h',
        name=f"Cluster {c}",
        marker=dict(line=dict(width=0)),
        showlegend=True
    ))

    y_pos += size + 10  # leave a gap between clusters

# Average silhouette line
fig.add_shape(
    type="line",
    x0=silh_avg, x1=silh_avg,
    y0=0, y1=y_pos,
    line=dict(color="red", dash="dash"),
)

# Layout
fig.update_layout(
    title=f"Silhouette Plot — KMeans (k={k_best})",
    xaxis_title="Silhouette Coefficient",
    yaxis_title="Samples (clustered, ordered)",
    barmode="overlay",
    height=600,
    width=900
)

fig.show()

The silhouette plot also shows that with 5 clusters only a few samples have negative values, so we conclude that 5 clusters is the best choice. The cluster labels are added to the dataset in the Predict section; first we look at the same plot for DBSCAN.

In [None]:
# Data
X = product_rfm[['recency_lognorm','quantity_lognorm','revenue_lognorm']].values

# Fit model DBSCAN
eps_val, min_s = 0.1, 5
db = DBSCAN(eps=eps_val, min_samples=min_s).fit(X)
labels = db.labels_

# Mask out the noise points (-1), since noise is not counted as a cluster
mask = labels != -1
X_masked = X[mask]
labels_masked = labels[mask]

# If there is more than one cluster, compute the silhouette
if len(set(labels_masked)) > 1:
    silh_avg = silhouette_score(X_masked, labels_masked)
    silh_vals = silhouette_samples(X_masked, labels_masked)

    # Create the figure
    fig = go.Figure()
    y_pos = 0

    # Loop over each cluster (noise excluded)
    for c in np.unique(labels_masked):
        c_vals = silh_vals[labels_masked == c]
        c_vals.sort()
        size = len(c_vals)
        y_range = list(range(y_pos, y_pos + size))

        fig.add_trace(go.Bar(
            x=c_vals,
            y=y_range,
            orientation='h',
            name=f"Cluster {c}",
            marker=dict(line=dict(width=0)),
            showlegend=True
        ))

        y_pos += size + 10  # gap between clusters

    # Dashed line at the average silhouette score
    fig.add_shape(
        type="line",
        x0=silh_avg, x1=silh_avg,
        y0=0, y1=y_pos,
        line=dict(color="red", dash="dash"),
    )

    # Layout
    fig.update_layout(
        title=f"Silhouette Plot — DBSCAN (eps={eps_val}, min_samples={min_s})",
        xaxis_title="Silhouette Coefficient",
        yaxis_title="Samples (clustered, ordered)",
        barmode="overlay",
        height=600,
        width=900
    )

    fig.show()
else:
    print(f"DBSCAN (eps={eps_val}, min_samples={min_s}) produced fewer than 2 clusters, so the silhouette score cannot be computed.")

The low silhouette score for DBSCAN suggests that DBSCAN is not a good fit for this data. The negative silhouette values reinforce that conclusion.
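Because the DBSCAN silhouette is computed only after dropping noise points, it can also help to check how many clusters were found and what fraction of points were labeled as noise. The helper below is a minimal sketch on toy labels; `summarize_dbscan_labels` and `demo_labels` are hypothetical names, not part of the notebook.

```python
import numpy as np

def summarize_dbscan_labels(labels):
    """Return (number of clusters, share of noise points) for DBSCAN labels."""
    labels = np.asarray(labels)
    # DBSCAN marks noise points with the label -1
    noise_share = float(np.mean(labels == -1))
    n_clusters = len(set(labels.tolist()) - {-1})
    return n_clusters, noise_share

# Toy labels: two clusters (0 and 1) plus two noise points
demo_labels = np.array([0, 0, -1, 1, 1, -1, 0, 0])
n_clusters, noise_share = summarize_dbscan_labels(demo_labels)
print(n_clusters, noise_share)
```

In the notebook this could be called as `summarize_dbscan_labels(db.labels_)`; a large noise share would mean the silhouette plot describes only part of the data.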

## Predict

In [None]:
# --- Fit KMeans ---
model_kmeans = KMeans(n_clusters=5, random_state=9)
product_rfm['cluster_kmeans'] = model_kmeans.fit_predict(
    product_rfm[['recency_lognorm', 'quantity_lognorm', 'revenue_lognorm']]
).astype(int)

# --- Fit DBSCAN ---
model_dbscan = DBSCAN(eps=0.1, min_samples=5)
product_rfm['cluster_dbscan'] = model_dbscan.fit_predict(
    product_rfm[['recency_lognorm', 'quantity_lognorm', 'revenue_lognorm']]
).astype(int)

# --- Inspect the result ---
product_rfm.head()

Unnamed: 0,StockCode,recency,Quantity,revenue,recency_norm,quantity_norm,revenue_norm,recency_log,recency_lognorm,quantity_log,quantity_lognorm,revenue_log,revenue_lognorm,cluster_kmeans,cluster_dbscan
0,10080,26,12,4.68,0.067024,0.001832,0.000692,3.258097,0.549959,2.484907,0.28561,1.543298,0.306932,2,0
1,10120,6,8,1.68,0.013405,0.001166,0.00023,1.791759,0.302445,2.079442,0.239007,0.518794,0.208789,2,-1
2,10123C,281,3,1.95,0.75067,0.000333,0.000271,5.638355,0.951741,1.098612,0.126272,0.667829,0.223066,1,0
3,10124A,268,4,1.68,0.715818,0.0005,0.00023,5.590987,0.943745,1.386294,0.159338,0.518794,0.208789,1,0
4,10124G,373,5,2.1,0.997319,0.000666,0.000294,5.921578,0.999548,1.609438,0.184985,0.741937,0.230165,1,0


We also add the centroid values of each cluster.

In [None]:
centroid = model_kmeans.cluster_centers_.tolist()
keys = [0,1,2,3,4]

centroid_df = pd.DataFrame.from_dict(dict(zip(keys, centroid))).T

cols = {'index': 'cluster',
        0: 'recency_centroid',
        1: 'quantity_centroid',
        2: 'revenue_centroid'}

centroid_df.reset_index(inplace = True)
centroid_df.rename(columns = cols, inplace = True)

centroid_df

Unnamed: 0,cluster,recency_centroid,quantity_centroid,revenue_centroid
0,0,0.539606,0.518567,0.622273
1,1,0.860662,0.097058,0.314633
2,2,0.44827,0.248559,0.430904
3,3,0.817949,0.362692,0.502741
4,4,0.233548,0.553203,0.677411


In [None]:
product_rfm = pd.merge(product_rfm, centroid_df, how = 'inner', left_on = 'cluster_kmeans', right_on = 'cluster')
product_rfm

Unnamed: 0,StockCode,recency,Quantity,revenue,recency_norm,quantity_norm,revenue_norm,recency_log,recency_lognorm,quantity_log,quantity_lognorm,revenue_log,revenue_lognorm,cluster_kmeans,cluster_dbscan,cluster,recency_centroid,quantity_centroid,revenue_centroid
0,10080,26,12,4.68,0.067024,0.001832,0.000692,3.258097,0.549959,2.484907,0.285610,1.543298,0.306932,2,0,2,0.448270,0.248559,0.430904
1,10120,6,8,1.68,0.013405,0.001166,0.000230,1.791759,0.302445,2.079442,0.239007,0.518794,0.208789,2,-1,2,0.448270,0.248559,0.430904
2,10123C,281,3,1.95,0.750670,0.000333,0.000271,5.638355,0.951741,1.098612,0.126272,0.667829,0.223066,1,0,1,0.860662,0.097058,0.314633
3,10124A,268,4,1.68,0.715818,0.000500,0.000230,5.590987,0.943745,1.386294,0.159338,0.518794,0.208789,1,0,1,0.860662,0.097058,0.314633
4,10124G,373,5,2.10,0.997319,0.000666,0.000294,5.921578,0.999548,1.609438,0.184985,0.741937,0.230165,1,0,1,0.860662,0.097058,0.314633
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2593,90214C,207,1,1.25,0.552279,0.000000,0.000163,5.332719,0.900150,0.000000,0.000000,0.223144,0.180467,1,0,1,0.860662,0.097058,0.314633
2594,90214G,3,1,0.29,0.005362,0.000000,0.000015,1.098612,0.185443,0.000000,0.000000,-1.237874,0.040508,2,-1,2,0.448270,0.248559,0.430904
2595,90214I,2,1,0.29,0.002681,0.000000,0.000015,0.693147,0.117002,0.000000,0.000000,-1.237874,0.040508,2,-1,2,0.448270,0.248559,0.430904
2596,90214K,184,3,3.75,0.490617,0.000333,0.000548,5.214936,0.880268,1.098612,0.126272,1.321756,0.285709,1,0,1,0.860662,0.097058,0.314633


In [None]:
# product_rfm['cluster'].value_counts()
product_rfm['recency_centroid'].unique()[0]

0.7566952116518411

We present the results of the analysis as cluster plots.

In [None]:
px.scatter_3d(data_frame = product_rfm, x='recency', y='Quantity', z='revenue', color = 'cluster_kmeans',
             title='3D Scatter plot for KMeans Clusters')

In [None]:
px.scatter_3d(data_frame = product_rfm, x='recency', y='Quantity', z='revenue', color = 'cluster_dbscan',
             title='3D Scatter plot for DBSCAN Clusters')

The clusters are hard to read when plotted on the unscaled columns, so we also plot them on the scaled features.

In [None]:
px.scatter_3d(data_frame = product_rfm, x='recency_lognorm', y='quantity_lognorm', z='revenue_lognorm', color = 'cluster_kmeans',
             title='3D Scatter plot for KMeans Clusters')

In [None]:
px.scatter_3d(data_frame = product_rfm, x='recency_lognorm', y='quantity_lognorm', z='revenue_lognorm', color = 'cluster_dbscan',
             title='3D Scatter plot for DBSCAN Clusters')

Next, we summarize the original RFM variables of the products by the clusters we created.

In [None]:
# product_rfm.groupby('cluster', as_index = False).agg({'recency': ['mean'], 'Quantity': ['mean'], 'revenue': ['mean', 'count']})

rfm_group = product_rfm.groupby('cluster_kmeans', as_index = False).agg(recency_min = ('recency', 'min'), recency_mean = ('recency', 'mean'), recency_max = ('recency', 'max'),
                                                                 quantity_min = ('Quantity', 'min'), quantity_mean = ('Quantity', 'mean'), quantity_max = ('Quantity', 'max'),
                                                                 revenue_min = ('revenue', 'min'), revenue_mean = ('revenue', 'mean'), revenue_max = ('revenue', 'max'),
                                                                 count_rfm = ('revenue', 'count'))

rfm_group = rfm_group.sort_values(by = ['recency_min', 'recency_mean', 'recency_max', 'quantity_min', 'quantity_mean', 'quantity_max', 'revenue_min', 'revenue_mean', 'revenue_max'], ascending = [True, True, True, False, False, False, False, False, False])
rfm_group

Unnamed: 0,cluster_kmeans,recency_min,recency_mean,recency_max,quantity_min,quantity_mean,quantity_max,revenue_min,revenue_mean,revenue_max,count_rfm
4,4,1,4.675174,12,11,214.482599,2277,17.77,431.972691,6490.93,431
2,2,1,18.697628,51,1,12.660079,55,0.19,27.345534,340.0,506
0,0,9,28.302326,146,15,148.075134,6005,13.68,199.921234,3051.36,559
1,1,23,197.706546,374,1,3.002257,12,0.29,8.432799,50.85,443
3,3,40,151.312595,367,2,34.201821,246,2.88,54.994112,613.92,659


## KMeans Customer Segmentation Insights

Based on the KMeans clustering (k=5) of the RFM data, we obtain 5 customer segments:

---

### 🔹 Cluster 4: **Champions (High Value)**
- **Recency mean:** 4.7 days (very recent, the lowest)  
- **Quantity mean:** 214 (highest of all clusters)  
- **Revenue mean:** 431.97 (highest)  
➡️ **Best customers**: they buy often, in large quantities, and recently.  
⚡ *Strategy:* retain them with a loyalty program, exclusive promos, or early access to products.

---

### 🔹 Cluster 0: **Potential Loyalists**
- **Recency mean:** 28.3 days (moderate)  
- **Quantity mean:** 148 (high)  
- **Revenue mean:** 199.9 (fairly high)  
➡️ **High-value customers**, but not as intense as Champions.  
⚡ *Strategy:* encourage them to move up to Champions with special promos or a membership.

---

### 🔹 Cluster 2: **New Customers**
- **Recency mean:** 18.7 days (fairly recent)  
- **Quantity mean:** 12.7 (low)  
- **Revenue mean:** 27.3 (low)  
➡️ **New customers**: they have already purchased, but the value is small.  
⚡ *Strategy:* educate them about the products and offer introductory promos so they transact more often.

---

### 🔹 Cluster 3: **Hibernating**
- **Recency mean:** 151 days (long ago)  
- **Quantity mean:** 34 (low)  
- **Revenue mean:** 55.0 (low)  
➡️ **Long-inactive customers** with small purchases.  
⚡ *Strategy:* try a re-activation campaign, email marketing, or a comeback promo.

---

### 🔹 Cluster 1: **Lost Customers**
- **Recency mean:** 197.7 days (very long ago, the highest)  
- **Quantity mean:** 3.0 (very low)  
- **Revenue mean:** 8.4 (the lowest)  
➡️ **Lost customers**: they buy very rarely and have low value.  
⚡ *Strategy:* retarget them, or leave them as a sunset segment.

---

### 📝 Summary
1. **Cluster 4 → Champions (High Value, High Frequency, Recent)**  
2. **Cluster 0 → Potential Loyalists (Medium Recency, High Value)**  
3. **Cluster 2 → New Customers (Recent, Low Value)**  
4. **Cluster 3 → Hibernating (Old Recency, Low Value)**  
5. **Cluster 1 → Lost Customers (Very Old Recency, Very Low Value)**
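
The summary can be attached to the data as a readable segment column. The snippet below is a minimal sketch on a toy frame; `segment_names` and `demo` are illustrative names, and the mapping only holds for the cluster ids of this particular fit (a different `random_state` could number the clusters differently).

```python
import pandas as pd

# Segment names taken from the summary above (cluster id -> label)
segment_names = {
    4: "Champions",
    0: "Potential Loyalists",
    2: "New Customers",
    3: "Hibernating",
    1: "Lost Customers",
}

# Toy frame standing in for product_rfm (StockCode values from the tables above)
demo = pd.DataFrame({
    "StockCode": ["10080", "10123C", "10124A"],
    "cluster_kmeans": [2, 1, 1],
})

# Map each cluster id to its segment name
demo["segment"] = demo["cluster_kmeans"].map(segment_names)
print(demo)
```

On the real data the same `map` call would be applied to `product_rfm['cluster_kmeans']`.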

In [None]:
product_rfm = pd.merge(product_rfm, rfm_group, how = 'inner', left_on = 'cluster_kmeans', right_on = 'cluster_kmeans')
product_rfm

Unnamed: 0,StockCode,recency,Quantity,revenue,recency_norm,quantity_norm,revenue_norm,recency_log,recency_lognorm,quantity_log,...,recency_min,recency_mean,recency_max,quantity_min,quantity_mean,quantity_max,revenue_min,revenue_mean,revenue_max,count_rfm
0,10080,26,12,4.68,0.067024,0.001832,0.000692,3.258097,0.549959,2.484907,...,1,18.697628,51,1,12.660079,55,0.19,27.345534,340.00,506
1,10120,6,8,1.68,0.013405,0.001166,0.000230,1.791759,0.302445,2.079442,...,1,18.697628,51,1,12.660079,55,0.19,27.345534,340.00,506
2,10123C,281,3,1.95,0.750670,0.000333,0.000271,5.638355,0.951741,1.098612,...,23,197.706546,374,1,3.002257,12,0.29,8.432799,50.85,443
3,10124A,268,4,1.68,0.715818,0.000500,0.000230,5.590987,0.943745,1.386294,...,23,197.706546,374,1,3.002257,12,0.29,8.432799,50.85,443
4,10124G,373,5,2.10,0.997319,0.000666,0.000294,5.921578,0.999548,1.609438,...,23,197.706546,374,1,3.002257,12,0.29,8.432799,50.85,443
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2593,90214C,207,1,1.25,0.552279,0.000000,0.000163,5.332719,0.900150,0.000000,...,23,197.706546,374,1,3.002257,12,0.29,8.432799,50.85,443
2594,90214G,3,1,0.29,0.005362,0.000000,0.000015,1.098612,0.185443,0.000000,...,1,18.697628,51,1,12.660079,55,0.19,27.345534,340.00,506
2595,90214I,2,1,0.29,0.002681,0.000000,0.000015,0.693147,0.117002,0.000000,...,1,18.697628,51,1,12.660079,55,0.19,27.345534,340.00,506
2596,90214K,184,3,3.75,0.490617,0.000333,0.000548,5.214936,0.880268,1.098612,...,23,197.706546,374,1,3.002257,12,0.29,8.432799,50.85,443


## **Save Model & Data**

In [None]:
file_path = "/content/drive/MyDrive/Colab Notebooks/model"        # target folder
os.makedirs(file_path, exist_ok=True)             # create the folder if it does not exist

# save the model
joblib.dump(model_kmeans, os.path.join(file_path, "kmeans_model.pkl"))
print("✔ KMeans model saved")

✔ KMeans model saved


In [None]:
from google.colab import drive
drive.mount('drive')

product_rfm.to_csv('product_rfm_result.csv', sep=';', encoding='utf-8')

!cp product_rfm_result.csv "drive/My Drive/Dataset"

Drive already mounted at drive; to attempt to forcibly remount, call drive.mount("drive", force_remount=True).


## Load Model

In [None]:
# model path
file_path = "/content/drive/MyDrive/Colab Notebooks/model"
model_file = os.path.join(file_path, "kmeans_model.pkl")

# 1) Load the KMeans model
loaded_kmeans = joblib.load(model_file)
print("✔ KMeans model loaded")

✔ KMeans model loaded


## Predict New Data

In [None]:
# ===== 2) New data (raw values) =====
new_data = pd.DataFrame({
    "recency_lognorm": [15, 120],
    "quantity_lognorm": [50, 5],
    "revenue_lognorm": [300.0, 20.0]
})

In [None]:
# ===== 3) Manual transformation =====
# Note: this applies log1p only. The training features (*_lognorm) were
# log-transformed and then min-max scaled, so the scales do not match exactly.
X_log_new = np.log1p(new_data)                 # log1p of the raw values

# ===== 4) Predict the cluster =====
pred_clusters = loaded_kmeans.predict(X_log_new)
print("Predicted clusters:", pred_clusters)

Predicted clusters: [0 3]
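
For the prediction to be meaningful, new data has to go through the same transformation as the training features. Judging from the tables above, the `*_lognorm` columns look like a natural log followed by min-max scaling of the logged values. The sketch below works under that assumption: the bounds in `RAW_MIN` and `RAW_MAX` are read from the per-cluster summary table, the centroids are copied from the centroid table, and `to_lognorm` and `nearest_centroid` are hypothetical helpers, not the notebook's own code.

```python
import numpy as np

# Raw-value bounds read from the per-cluster summary table (assumed to be the
# ones used for scaling), in the order: recency, Quantity, revenue
RAW_MIN = np.array([1.0, 1.0, 0.19])
RAW_MAX = np.array([374.0, 6005.0, 6490.93])

# KMeans centroids copied from the centroid table (rows = clusters 0..4)
CENTROIDS = np.array([
    [0.539606, 0.518567, 0.622273],
    [0.860662, 0.097058, 0.314633],
    [0.448270, 0.248559, 0.430904],
    [0.817949, 0.362692, 0.502741],
    [0.233548, 0.553203, 0.677411],
])

def to_lognorm(raw):
    """Natural log followed by min-max scaling of the logged values."""
    raw = np.asarray(raw, dtype=float)
    log_min, log_max = np.log(RAW_MIN), np.log(RAW_MAX)
    return (np.log(raw) - log_min) / (log_max - log_min)

def nearest_centroid(scaled):
    """Index of the closest centroid (Euclidean distance) for each row."""
    dists = np.linalg.norm(scaled[:, None, :] - CENTROIDS[None, :, :], axis=2)
    return dists.argmin(axis=1)

# Two products from the tables above: (recency, Quantity, revenue)
known = np.array([[26, 12, 4.68], [281, 3, 1.95]])
scaled_known = to_lognorm(known)
print(scaled_known.round(6))
print(nearest_centroid(scaled_known))
```

The two rows correspond to StockCode 10080 and 10123C, so the scaled values can be compared against the `recency_lognorm`, `quantity_lognorm`, and `revenue_lognorm` columns shown earlier. Assigning each row to its nearest centroid mirrors what `KMeans.predict` does, so `loaded_kmeans.predict` could be used instead once the inputs are scaled this way.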


## **Remarks**

We succeeded in building product clusters from the dataset and in giving stakeholders data-driven action recommendations for future business strategy.

In the next session we will cover customer profiling, with the goal of giving customers personalized marketing campaigns.