## ***Problem Formulation: explain the problem to be solved.***

How can wholesale customers be grouped by purchasing behavior so that the company can understand its customer segments and design more effective business strategies?

The customer data consists of annual spending across six product categories (Fresh, Milk, Grocery, Frozen, Detergents_Paper, and Delicassen). It is analyzed to find similar purchasing patterns, which are then grouped into customer segments.

**Group customers by purchasing pattern to support more focused marketing strategies.**

## ***Data Exploration and Preparation (including data splitting): apply every exploration and preparation technique you consider necessary.***

In [None]:
import csv
import math
import random

### ***Reading the data***

In [None]:
# 1. Read the CSV without any library
filename = "Wholesale customers data.csv"

# Lists that hold the header and the data rows
header = []
data = []

with open(filename, 'r') as f:
    lines = f.readlines()
    # First line = header
    header = lines[0].strip().split(",")
    # Remaining lines = data
    for line in lines[1:]:
        row = line.strip().split(",")
        # Convert every column to int
        row = [int(x) for x in row]
        data.append(row)

# 2. print_table helper (no library)
def print_table(header, data):
    col_widths = [len(str(h)) for h in header]
    for row in data:
        for i, val in enumerate(row):
            col_widths[i] = max(col_widths[i], len(str(val)))

    def format_row(row):
        return " | ".join(str(val).rjust(col_widths[i]) for i, val in enumerate(row))

    print(format_row(header))
    print("-" * (sum(col_widths) + 3 * (len(header) - 1)))
    for row in data:
        print(format_row(row))

# 3. Print the table (all rows are printed; the output shown below is cut off)
print("Data Wholesale Customers:\n")
print_table(header, data)


Data Wholesale Customers:

Channel | Region |  Fresh |  Milk | Grocery | Frozen | Detergents_Paper | Delicassen
------------------------------------------------------------------------------------
      2 |      3 |  12669 |  9656 |    7561 |    214 |             2674 |       1338
      2 |      3 |   7057 |  9810 |    9568 |   1762 |             3293 |       1776
      2 |      3 |   6353 |  8808 |    7684 |   2405 |             3516 |       7844
      1 |      3 |  13265 |  1196 |    4221 |   6404 |              507 |       1788
      2 |      3 |  22615 |  5410 |    7198 |   3915 |             1777 |       5185
      2 |      3 |   9413 |  8259 |    5126 |    666 |             1795 |       1451
      2 |      3 |  12126 |  3199 |    6975 |    480 |             3140 |        545
      2 |      3 |   7579 |  4956 |    9426 |   1669 |             3321 |       2566
      1 |      3 |   5963 |  3648 |    6192 |    425 |             1716 |        750
      2 |      3 |   6006 | 11093 |   

In [None]:
# Helper that prints a neatly aligned table
def print_table(header, data, limit=5):
    # Compute the maximum width of each column
    col_widths = [len(h) for h in header]
    for row in data[:limit]:
        for i, val in enumerate(row):
            col_widths[i] = max(col_widths[i], len(str(val)))

    # Format a single row
    def format_row(row):
        return " | ".join(str(val).rjust(col_widths[i]) for i, val in enumerate(row))

    # Print the header
    print(format_row(header))
    print("-" * (sum(col_widths) + 3 * (len(header) - 1)))

    # Print the data rows
    for row in data[:limit]:
        print(format_row(row))

# Show the first 5 rows
print("First 5 rows:\n")
print_table(header, data, limit=5)


First 5 rows:

Channel | Region | Fresh | Milk | Grocery | Frozen | Detergents_Paper | Delicassen
----------------------------------------------------------------------------------
      2 |      3 | 12669 | 9656 |    7561 |    214 |             2674 |       1338
      2 |      3 |  7057 | 9810 |    9568 |   1762 |             3293 |       1776
      2 |      3 |  6353 | 8808 |    7684 |   2405 |             3516 |       7844
      1 |      3 | 13265 | 1196 |    4221 |   6404 |              507 |       1788
      2 |      3 | 22615 | 5410 |    7198 |   3915 |             1777 |       5185


### ***Displaying the data size (rows and columns)***

In [None]:
# Show the number of rows and columns in the data
jumlah_baris = len(data)
jumlah_kolom = len(header)

print("Data size:")
print(f"- Rows    : {jumlah_baris}")
print(f"- Columns : {jumlah_kolom}")


Data size:
- Rows    : 440
- Columns : 8


### ***Checking for Missing Values***

In [None]:
# Check for missing values.
# Note: every field was already converted with int() while loading, so an
# empty field would have raised ValueError there; this pass is a sanity check.
print("Missing values per column:")

missing_count = [0] * len(header)

for row in data:
    for i, val in enumerate(row):
        if val == '' or val is None:
            missing_count[i] += 1

# Show the result
ada_kosong = False
for i, count in enumerate(missing_count):
    if count > 0:
        ada_kosong = True
        print(f"- {header[i]}: {count} missing")

if not ada_kosong:
    print("0")


Missing values per column:
0
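
Because the loader converts every field with `int()`, a blank cell would stop the notebook before the check above ever runs. The following is a minimal sketch of a check on the raw strings instead, assuming the standard `csv` module is acceptable here. The helper `count_missing` and the sample text are hypothetical and not part of the original notebook.

```python
import csv
import io

def count_missing(csv_text):
    """Count empty fields per column in raw CSV text (hypothetical helper)."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    missing = {name: 0 for name in header}
    for row in reader:
        for name, value in zip(header, row):
            # Inspect the raw string before any int() conversion
            if value.strip() == "":
                missing[name] += 1
    return missing

# Made-up sample with one blank Milk value
sample = "Channel,Region,Fresh,Milk\n2,3,12669,9656\n1,3,13265,\n"
print(count_missing(sample))
```

Running the same idea on the real file would mean passing the open file object to `csv.reader` in place of the `io.StringIO` wrapper.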


### ***Descriptive Statistics***

In [None]:
# Population standard deviation (no library)
def std_dev(values):
    mean_val = sum(values) / len(values)
    variance = sum((x - mean_val) ** 2 for x in values) / len(values)
    return variance ** 0.5

# Collect the statistics
statistics = {
    "Min": [],
    "Max": [],
    "Mean": [],
    "Std Dev": []
}

for i in range(len(header)):
    col_values = [float(row[i]) for row in data]
    min_val = min(col_values)
    max_val = max(col_values)
    mean_val = sum(col_values) / len(col_values)
    std_val = std_dev(col_values)

    statistics["Min"].append(min_val)
    statistics["Max"].append(max_val)
    statistics["Mean"].append(round(mean_val, 2))
    statistics["Std Dev"].append(round(std_val, 2))

# Print the header
print(f"{'Statistics':15}|", end='')
for col in header:
    print(f"{col:>10} |", end='')
print("\n" + "-" * (15 + 13 * len(header)))

# Print each statistics row
for stat_name in statistics:
    print(f"{stat_name:15}|", end='')
    for val in statistics[stat_name]:
        print(f"{val:10.2f} |", end='')
    print()


Statistics     |   Channel |    Region |     Fresh |      Milk |   Grocery |    Frozen |Detergents_Paper |Delicassen |
-----------------------------------------------------------------------------------------------------------------------
Min            |      1.00 |      1.00 |      3.00 |     55.00 |      3.00 |     25.00 |      3.00 |      3.00 |
Max            |      2.00 |      3.00 | 112151.00 |  73498.00 |  92780.00 |  60869.00 |  40827.00 |  47943.00 |
Mean           |      1.32 |      2.54 |  12000.30 |   5796.27 |   7951.28 |   3071.93 |   2881.49 |   1524.87 |
Std Dev        |      0.47 |      0.77 |  12632.95 |   7371.99 |   9492.36 |   4849.15 |   4762.43 |   2816.90 |


### ***Data Normalization***

Channel and Region are categorical (nominal) columns, so they are not normalized.

Only the continuous numeric columns are normalized:

- Fresh
- Milk
- Grocery
- Frozen
- Detergents_Paper
- Delicassen

In [None]:
numerik_index = list(range(2, len(header)))
means = []
stds = []

# Mean and population standard deviation of each numeric column
for j in numerik_index:
    col = [row[j] for row in data]
    mean = sum(col) / len(col)
    std = (sum((x - mean) ** 2 for x in col) / len(col)) ** 0.5
    means.append(mean)
    stds.append(std)


normalized_data = []

for row in data:
    new_row = row[:2]  # Keep Channel and Region unchanged (they stay in the row, so they still enter the distance)
    for i, j in enumerate(numerik_index):
        value = row[j]
        z = (value - means[i]) / stds[i]
        new_row.append(z)
    normalized_data.append(new_row)

# Show the first 5 normalized rows
print("\n📊 First 5 Normalized Rows (Z-Score)\n")
print(f"{'Row':<6} | {'Channel':<7} | {'Region':<6} | {'Fresh':<9} | {'Milk':<9} | {'Grocery':<9} | {'Frozen':<9} | {'Detergents_Paper':<17} | {'Delicassen'}")
print("-" * 110)

for i, row in enumerate(normalized_data[:5]):
    print(f"{i+1:<6} | {row[0]:<7} | {row[1]:<6} | {row[2]:<9.4f} | {row[3]:<9.4f} | {row[4]:<9.4f} | {row[5]:<9.4f} | {row[6]:<17.4f} | {row[7]:.4f}")



📊 First 5 Normalized Rows (Z-Score)

Row    | Channel | Region | Fresh     | Milk      | Grocery   | Frozen    | Detergents_Paper  | Delicassen
--------------------------------------------------------------------------------------------------------------
1      | 2       | 3      | 0.0529    | 0.5236    | -0.0411   | -0.5894   | -0.0436           | -0.0663
2      | 2       | 3      | -0.3913   | 0.5445    | 0.1703    | -0.2701   | 0.0864            | 0.0892
3      | 2       | 3      | -0.4470   | 0.4085    | -0.0282   | -0.1375   | 0.1332            | 2.2433
4      | 1       | 3      | 0.1001    | -0.6240   | -0.3930   | 0.6871    | -0.4986           | 0.0934
5      | 2       | 3      | 0.8402    | -0.0524   | -0.0794   | 0.1739    | -0.2319           | 1.2993


*No data splitting is performed because clustering is an unsupervised method that needs no labels for training or evaluation. The full dataset is used so that the clusters capture customer purchasing patterns as accurately as possible.*

--- Why was Z-score normalization chosen? ---

In the Wholesale Customers dataset, the numeric columns (Fresh, Milk, Grocery, Frozen, Detergents_Paper, Delicassen) have very different scales. For example, from the descriptive statistics above:

- Fresh reaches 112,151 with a mean of about 12,000,
- Detergents_Paper reaches 40,827 with a mean of about 2,881,
- Delicassen has a mean of only about 1,525.

When clustering uses Euclidean distance, features with large values dominate the distance, while small-scale features have almost no influence.

Example:
Without normalization, the distance between two customers is determined mostly by the large-scale columns such as Fresh and Grocery.
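
To make the scale effect concrete, the sketch below splits the squared Euclidean distance between two customers into per-feature shares, before and after dividing by the standard deviation. The two rows are rows 1 and 4 of the table shown earlier, and the standard deviations are copied from the descriptive statistics. The helper `squared_shares` is a hypothetical name introduced only for this illustration. The column means are not needed because they cancel out in the difference of two z-scores.

```python
# Rows 1 and 4 of the table shown earlier (spending columns only)
features = ["Fresh", "Milk", "Grocery", "Frozen", "Detergents_Paper", "Delicassen"]
a = [12669, 9656, 7561, 214, 2674, 1338]
b = [13265, 1196, 4221, 6404, 507, 1788]
# Population standard deviations from the descriptive statistics table
stds = [12632.95, 7371.99, 9492.36, 4849.15, 4762.43, 2816.90]

def squared_shares(p, q, scale=None):
    """Share of the squared Euclidean distance contributed by each feature."""
    scale = scale or [1.0] * len(p)
    sq = [((x - y) / s) ** 2 for x, y, s in zip(p, q, scale)]
    total = sum(sq)
    return [v / total for v in sq]

raw = squared_shares(a, b)           # unscaled values
scaled = squared_shares(a, b, stds)  # equivalent to z-score differences
for name, r, z in zip(features, raw, scaled):
    print(f"{name:<17} raw share = {r:6.1%}   z-score share = {z:6.1%}")
```

For this pair, most of the raw squared distance comes from the columns with the largest absolute differences, and scaling by the standard deviation spreads the contributions more evenly.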



## ***Modeling: build a model using the preprocessed data from 2.b, and run the training process to obtain the best result.***

### ***Manual K-Means***

In [None]:
def k_means_manual(data, k=3, max_iter=100, tol=1e-4):
    # Manual K-Means implementation (no clustering library).
    # Relies on euclidean(), defined in the next cell, being available at call time.
    centroids = random.sample(data, k)  # Random initial centroids (no fixed seed, so runs differ)
    for _ in range(max_iter):
        labels = []
        clusters = [[] for _ in range(k)]  # One empty list per cluster

        # Assign each point to its nearest centroid
        for point in data:
            distances = [euclidean(point, centroid) for centroid in centroids]
            cluster_idx = distances.index(min(distances))
            labels.append(cluster_idx)
            clusters[cluster_idx].append(point)

        # Recompute each centroid as the mean of its cluster
        new_centroids = []
        for cluster in clusters:
            if cluster:
                new_centroid = [sum(dim) / len(cluster) for dim in zip(*cluster)]
            else:
                # If a cluster is empty, pick a random data point as its new centroid
                new_centroid = random.choice(data)
            new_centroids.append(new_centroid)

        # Total centroid shift
        shift = sum(euclidean(a, b) for a, b in zip(centroids, new_centroids))
        if shift < tol:
            break  # Stop once the centroids barely move

        centroids = new_centroids

    return labels, centroids

### ***Euclidean distance function***

In [None]:
def euclidean(p1, p2):  # Euclidean distance between two points (vectors) of the same dimension
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p1, p2)))

--- Why Euclidean? ---

Euclidean distance was chosen because it matches the K-Means objective, which minimizes the squared distance of each point to its centroid. In addition, the six spending features are continuous numeric values, so Euclidean distance is an effective measure of similarity in purchasing behavior between wholesale customers.

**“The model is K-Means clustering, implemented manually in Python without external libraries. Euclidean distance is used as the distance metric, in line with the numeric nature of the data.”**
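
The link between K-Means and squared Euclidean distance can be illustrated with the within-cluster sum of squares (WCSS), the quantity that the centroid update reduces. This is a small sketch on made-up 2-D points. The function `wcss` is a hypothetical helper, not part of the notebook.

```python
def wcss(points, labels, centroids):
    """Within-cluster sum of squared Euclidean distances."""
    total = 0.0
    for point, label in zip(points, labels):
        total += sum((a - b) ** 2 for a, b in zip(point, centroids[label]))
    return total

# Made-up 2-D points in two obvious groups
points = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
labels = [0, 0, 1, 1]
mean_centroids = [[0.0, 1.0], [10.0, 1.0]]      # cluster means
shifted_centroids = [[0.0, 0.5], [10.0, 1.0]]   # first centroid moved off the mean

print(wcss(points, labels, mean_centroids))     # 4.0
print(wcss(points, labels, shifted_centroids))  # 4.5
```

Moving a centroid away from the cluster mean raises the WCSS, which is why the update step sets each centroid to the mean of its cluster.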

### ***silhouette_score function***

In [None]:
def silhouette_score_manual(data, labels):
    # Computes the mean Silhouette Score of a clustering result manually.
    n = len(data)
    scores = []

    for i in range(n):
        point = data[i]
        own_cluster = labels[i]

        # Mean distance to the other points in the same cluster (a)
        # Note: a singleton cluster gives a = 0 here and therefore a value of 1;
        # the usual convention assigns 0 to such points.
        same_cluster = [data[j] for j in range(n) if labels[j] == own_cluster and j != i]
        a = sum(euclidean(point, p) for p in same_cluster) / len(same_cluster) if same_cluster else 0

        # Smallest mean distance to any other cluster (b)
        b = float('inf')
        for cluster_id in set(labels):
            if cluster_id == own_cluster:
                continue
            other = [data[j] for j in range(n) if labels[j] == cluster_id]
            dist = sum(euclidean(point, p) for p in other) / len(other)
            b = min(b, dist)

        # Silhouette value for this point
        score = (b - a) / max(a, b) if max(a, b) > 0 else 0
        scores.append(score)

    return sum(scores) / len(scores)


## ***Evaluation: choose an evaluation method***

### ***Manual Silhouette Score***


--- Why use the Silhouette Score for evaluation? ---

The Silhouette Score was chosen because it is an internal metric: it assesses the quality of a clustering without needing true labels. It balances the distance within a cluster against the distance to the nearest other cluster, which helps confirm that the clusters are compact and clearly separated. It also makes it easier to choose the number of clusters.
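
As a worked illustration of the formula s = (b - a) / max(a, b), the sketch below computes the silhouette value of a single point in a made-up 1-D example with two clusters. The helper `silhouette_point` and the sample values are hypothetical and used only for this illustration.

```python
def silhouette_point(a, b):
    """Silhouette value s = (b - a) / max(a, b) for one point."""
    return (b - a) / max(a, b) if max(a, b) > 0 else 0.0

# Made-up 1-D example: cluster A = {0, 1}, cluster B = {10, 11}
point = 0.0
own_others = [1.0]            # other members of the point's own cluster
nearest_other = [10.0, 11.0]  # members of the nearest other cluster

a = sum(abs(point - p) for p in own_others) / len(own_others)        # 1.0
b = sum(abs(point - p) for p in nearest_other) / len(nearest_other)  # 10.5
print(round(silhouette_point(a, b), 4))  # 0.9048
```

A value close to 1 means the point sits much nearer to its own cluster than to any other; values near 0 indicate a point on the border between clusters.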

## ***Experiments***

“Several experiments were run with different numbers of clusters (K = 2 to 5). The Silhouette Score was used to assess the results. In Experiment 1 the best value was obtained at K = 2, with a Silhouette Score of 0.4951 in the K search and 0.4257 in the final run.”

### ***Experiment 1 (Z-Score Normalization)***

#### ***Determining the best number of clusters (manual Silhouette Score)***

In [None]:
def cari_k_optimal(data, k_range=(2,5)):
    scores = {}
    for k in range(k_range[0], k_range[1]+1):
        labels, _ = k_means_manual(data, k)
        score = silhouette_score_manual(data, labels)
        scores[k] = score
    return scores

silhouette_scores = cari_k_optimal(normalized_data, k_range=(2,5))
print("\nSilhouette Scores per K:")
for k, score in silhouette_scores.items():
    print(f"  K = {k} → {score:.4f}")


Silhouette Scores per K:
  K = 2 → 0.4951
  K = 3 → 0.4093
  K = 4 → 0.2399
  K = 5 → 0.2552


#### ***Choosing the best K***

In [None]:
best_k = max(silhouette_scores, key=silhouette_scores.get)
print(f"\nBest number of clusters: {best_k}")


Best number of clusters: 2


#### ***Final clustering with the best K***

In [None]:
best_k = max(silhouette_scores, key=silhouette_scores.get)
print(f"\nBest number of clusters based on Silhouette Score: {best_k}")


Best number of clusters based on Silhouette Score: 2


K was chosen as follows:

- Try K = 2 through K = 5.
- Compute the manual Silhouette Score for each K.
- Pick the K with the highest Silhouette Score.

In this run, K = 2 gave the highest Silhouette Score (0.4951); every other K scored lower. Because the centroids are initialized randomly without a fixed seed, the exact scores change from run to run.
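
Since `k_means_manual` draws its starting centroids with `random.sample`, fixing a seed is one way to make the scores repeatable. The sketch below shows the idea on made-up data. The helper `pick_initial_centroids` is hypothetical, and the seed value 42 is arbitrary and not taken from the original notebook.

```python
import random

points = [[float(i), float(i % 3)] for i in range(20)]  # made-up data

def pick_initial_centroids(data, k, seed=None):
    """Pick k starting centroids; a fixed seed makes the choice repeatable."""
    rng = random.Random(seed)  # private generator, leaves the global state alone
    return rng.sample(data, k)

first = pick_initial_centroids(points, 2, seed=42)
second = pick_initial_centroids(points, 2, seed=42)
print(first == second)  # True: same seed, same starting centroids
```

In the notebook itself, the simplest equivalent would be to call `random.seed(...)` once before the K search, so that repeated runs of the same cells produce the same clusters.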

In [None]:
# Run the final clustering
cluster_labels, final_centroids = k_means_manual(normalized_data, k=best_k)

#### ***Number of data points per cluster***

In [None]:
# Count the data points per cluster manually (no library)
cluster_counts = {}
for label in cluster_labels:
    if label in cluster_counts:
        cluster_counts[label] += 1
    else:
        cluster_counts[label] = 1

print("\nData points per cluster:")
# Sort by cluster ID
sorted_cluster_counts = sorted(cluster_counts.items())

for cluster_id, count in sorted_cluster_counts:
    # Show cluster IDs starting from 1
    print(f"  Cluster {cluster_id + 1}: {count} rows")


Data points per cluster:
  Cluster 1: 76 rows
  Cluster 2: 364 rows


In [None]:
print("\nCentroid of each cluster:")
for idx, centroid in enumerate(final_centroids):
    formatted = [f"{val:.2f}" for val in centroid]
    print(f"  Cluster {idx + 1}: {formatted}")


Centroid of each cluster:
  Cluster 1: ['1.92', '2.49', '-0.24', '1.35', '1.62', '-0.13', '1.57', '0.50']
  Cluster 2: ['1.20', '2.55', '0.05', '-0.28', '-0.34', '0.03', '-0.33', '-0.10']


In [None]:
# Compute and print the final Silhouette Score
final_score = silhouette_score_manual(normalized_data, cluster_labels)
print(f"\nFinal Silhouette Score: {final_score:.4f}")


Final Silhouette Score: 0.4257


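Cluster 1 (76 customers) = large customers, with above-average Milk, Grocery, and Detergents_Paper spending → wholesale discounts, priority service.

Cluster 2 (364 customers) = medium customers, with spending close to or slightly below average → loyalty promotions, upselling.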
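
Z-score centroids are easier to interpret once they are converted back to the original spending units. The sketch below inverts the z-score with x = z * std + mean, using the means and standard deviations from the descriptive statistics table and the spending part of the Cluster 1 centroid printed above. The helper `to_original_units` is a hypothetical name. Because the centroid values are rounded to two decimals, the results are approximate.

```python
# Population means and standard deviations from the descriptive statistics table
features = ["Fresh", "Milk", "Grocery", "Frozen", "Detergents_Paper", "Delicassen"]
means = [12000.30, 5796.27, 7951.28, 3071.93, 2881.49, 1524.87]
stds = [12632.95, 7371.99, 9492.36, 4849.15, 4762.43, 2816.90]

def to_original_units(z_values):
    """Invert the z-score: x = z * std + mean."""
    return [z * s + m for z, m, s in zip(z_values, means, stds)]

# Spending part of the Cluster 1 centroid printed above (Channel and Region dropped)
cluster_1_z = [-0.24, 1.35, 1.62, -0.13, 1.57, 0.50]
for name, value in zip(features, to_original_units(cluster_1_z)):
    print(f"{name:<17} {value:12.2f}")
```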

### ***Experiment 2 (Min-Max Normalization)***

#### ***Manual Min-Max normalization***

In [None]:
min_values = []
max_values = []

for j in numerik_index:
    col = [row[j] for row in data]
    min_val = min(col)
    max_val = max(col)
    min_values.append(min_val)
    max_values.append(max_val)

minmax_normalized_data = []
for row in data:
    new_row = row[:2]  # Keep Channel and Region unchanged
    for i, j in enumerate(numerik_index):
        value = row[j]
        min_val = min_values[i]
        max_val = max_values[i]
        norm = (value - min_val) / (max_val - min_val)
        new_row.append(norm)
    minmax_normalized_data.append(new_row)

# Check the first 5 Min-Max rows
print("\n First 5 Min-Max Normalized Rows (0-1)\n")
for i, row in enumerate(minmax_normalized_data[:5]):
    print(f"{i+1}: {row}")



 First 5 Min-Max Normalized Rows (0-1)

1: [2, 3, 0.11294004351392803, 0.13072723064144984, 0.08146415598693642, 0.0031063046479521397, 0.06542719968645895, 0.027847309136420525]
2: [2, 3, 0.06289902628669258, 0.13282409487629862, 0.10309667266671696, 0.028548418907369668, 0.08058984910836763, 0.036983729662077594]
3: [2, 3, 0.05662160716196455, 0.11918086134825646, 0.08278991560408291, 0.039116428900138056, 0.08605232216343328, 0.16355861493533583]
4: [1, 3, 0.1182544494774762, 0.015535857740016068, 0.045463854187999184, 0.10484189073696668, 0.012345679012345678, 0.03723404255319149]
5: [2, 3, 0.20162642222777044, 0.07291368816633308, 0.07755154833633336, 0.0639339951350996, 0.043454830491867526, 0.10809345014601586]
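
The Min-Max step divides by `max_val - min_val`, which would raise `ZeroDivisionError` for a constant column. That does not happen in this dataset, but a guarded version is sketched below for a single column. The helper `min_max_column` is hypothetical, and mapping a constant column to zeros is an assumption rather than something the notebook specifies.

```python
def min_max_column(values):
    """Scale a column to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # Assumption: treat a constant column as carrying no information
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]

print(min_max_column([3, 12669, 112151]))  # smallest value -> 0.0, largest -> 1.0
print(min_max_column([5, 5, 5]))           # [0.0, 0.0, 0.0]
```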


#### ***Finding the optimal K with Min-Max normalization***

In [None]:
silhouette_scores_minmax = cari_k_optimal(minmax_normalized_data, k_range=(2,5))
print("\nSilhouette Scores Min-Max:")
for k, score in silhouette_scores_minmax.items():
    print(f"  K = {k} → {score:.4f}")

best_k_minmax = max(silhouette_scores_minmax, key=silhouette_scores_minmax.get)
print(f"\nBest number of clusters (Min-Max): {best_k_minmax}")



Silhouette Scores Min-Max:
  K = 2 → 0.6129
  K = 3 → 0.6840
  K = 4 → 0.6936
  K = 5 → 0.5669

Best number of clusters (Min-Max): 4


#### ***Running K-Means with the best K on the Min-Max data***

In [None]:
labels_minmax, centroids_minmax = k_means_manual(minmax_normalized_data, k=best_k_minmax)


#### ***Counting the data points per cluster***

In [None]:
cluster_counts_minmax = {}
for label in labels_minmax:
    if label in cluster_counts_minmax:
        cluster_counts_minmax[label] += 1
    else:
        cluster_counts_minmax[label] = 1

print("\nData points per cluster (Min-Max):")
sorted_counts_minmax = sorted(cluster_counts_minmax.items())
for cluster_id, count in sorted_counts_minmax:
    print(f"  Cluster {cluster_id + 1}: {count} rows")



Data points per cluster (Min-Max):
  Cluster 1: 59 rows
  Cluster 2: 37 rows
  Cluster 3: 316 rows
  Cluster 4: 28 rows


#### ***Showing the centroid of each cluster***

In [None]:
# Show the centroid of each cluster
print("\nCentroid of each cluster (Min-Max):")
for idx, centroid in enumerate(centroids_minmax):
    formatted = [f"{val:.4f}" for val in centroid]
    print(f"  Cluster {idx + 1}: {formatted}")



Centroid of each cluster (Min-Max):
  Cluster 1: ['1.0000', '1.0000', '0.1150', '0.0519', '0.0434', '0.0510', '0.0232', '0.0249']
  Cluster 2: ['2.0000', '1.5135', '0.0559', '0.1349', '0.1872', '0.0333', '0.2037', '0.0322']
  Cluster 3: ['1.3323', '3.0000', '0.1117', '0.0806', '0.0851', '0.0480', '0.0689', '0.0337']
  Cluster 4: ['1.0000', '2.0000', '0.1039', '0.0306', '0.0473', '0.0940', '0.0118', '0.0230']


#### ***Computing the final Min-Max Silhouette Score***

In [None]:
# Compute the final Min-Max Silhouette Score
final_score_minmax = silhouette_score_manual(minmax_normalized_data, labels_minmax)
print(f"\nFinal Silhouette Score (Min-Max): {final_score_minmax:.4f}")


Final Silhouette Score (Min-Max): 0.5569


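Cluster 1 (59 customers): Channel 1, Region 1.
Cluster 2 (37 customers): Channel 2, Regions 1 and 2, with the highest Grocery and Detergents_Paper spending.
Cluster 3 (316 customers): Region 3, both channels.
Cluster 4 (28 customers): Channel 1, Region 2, with the highest Frozen spending.

The four clusters follow Channel and Region closely, most likely because those unscaled columns (ranges 1 to 2 and 1 to 3) outweigh the spending features, which Min-Max compresses into the 0 to 1 range.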

### ***Experiment 3 (Wider K Range, 2 to 7, with Z-Score Normalization)***

In [None]:
silhouette_scores_extended = cari_k_optimal(normalized_data, k_range=(2,7))
print("\nSilhouette Scores K=2 to K=7:")
for k, score in silhouette_scores_extended.items():
    print(f"  K = {k} → {score:.4f}")



Silhouette Scores K=2 to K=7:
  K = 2 → 0.4257
  K = 3 → 0.1714
  K = 4 → 0.2851
  K = 5 → 0.2668
  K = 6 → 0.2813
  K = 7 → 0.2711


#### ***Choosing the best K in the range 2 to 7***

In [None]:
# Choose the best K in the range 2 to 7
best_k_extended = max(silhouette_scores_extended, key=silhouette_scores_extended.get)
print(f"\nBest number of clusters from K=2 to K=7: {best_k_extended}")



Best number of clusters from K=2 to K=7: 2


#### ***Running the final clustering with the best extended K***

In [None]:
# Run the final clustering with the best extended K
labels_extended, centroids_extended = k_means_manual(normalized_data, k=best_k_extended)


#### ***Computing the final extended Silhouette Score***

In [None]:
# Compute the final extended Silhouette Score
final_score_extended = silhouette_score_manual(normalized_data, labels_extended)
print(f"Final Silhouette Score (Extended K): {final_score_extended:.4f}")


Final Silhouette Score (Extended K): 0.4257


#### ***Counting the data points per cluster***

In [None]:
# Count the data points per cluster
cluster_counts_extended = {}
for label in labels_extended:
    if label in cluster_counts_extended:
        cluster_counts_extended[label] += 1
    else:
        cluster_counts_extended[label] = 1

print("\nData points per cluster (Extended K):")
sorted_counts_extended = sorted(cluster_counts_extended.items())
for cluster_id, count in sorted_counts_extended:
    print(f"  Cluster {cluster_id + 1}: {count} rows")

print("\nCentroid of each cluster (Extended K):")
for idx, centroid in enumerate(centroids_extended):
    formatted = [f"{val:.4f}" for val in centroid]
    print(f"  Cluster {idx + 1}: {formatted}")


Data points per cluster (Extended K):
  Cluster 1: 364 rows
  Cluster 2: 76 rows

Centroid of each cluster (Extended K):
  Cluster 1: ['1.1978', '2.5549', '0.0505', '-0.2818', '-0.3384', '0.0272', '-0.3279', '-0.1044']
  Cluster 2: ['1.9211', '2.4868', '-0.2420', '1.3498', '1.6208', '-0.1301', '1.5704', '0.5001']


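Cluster 1 (364 customers): medium and general customers, with spending close to or slightly below average.

Cluster 2 (76 customers): large customers with a specific pattern, namely high Milk, Grocery, and Detergents_Paper spending.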

### ***Experiment 4: Multiple centroid initializations (random starts)***

In [None]:
# Run manual K-Means 5 times and keep the run with the highest Silhouette Score
best_score = -1
best_labels = None
best_centroids = None

for run in range(5):
    labels, centroids = k_means_manual(normalized_data, k=best_k)
    score = silhouette_score_manual(normalized_data, labels)
    print(f"Run {run+1}: Silhouette = {score:.4f}")
    if score > best_score:
        best_score = score
        best_labels = labels
        best_centroids = centroids

print(f"\nBest Silhouette after multiple runs: {best_score:.4f}")


Run 1: Silhouette = 0.4257
Run 2: Silhouette = 0.4257
Run 3: Silhouette = 0.5897
Run 4: Silhouette = 0.4257
Run 5: Silhouette = 0.5041

Best Silhouette after multiple runs: 0.5897


#### *Counting the data points per cluster for the best initialization*

In [None]:
cluster_counts_multiinit = {}
for label in best_labels:
    if label in cluster_counts_multiinit:
        cluster_counts_multiinit[label] += 1
    else:
        cluster_counts_multiinit[label] = 1

print("\nData points per cluster (Multi-Initialization):")
sorted_counts_multiinit = sorted(cluster_counts_multiinit.items())
for cluster_id, count in sorted_counts_multiinit:
    print(f"  Cluster {cluster_id + 1}: {count} rows")

print("\nCentroid of each cluster (Multi-Initialization):")
for idx, centroid in enumerate(best_centroids):
    formatted = [f"{val:.4f}" for val in centroid]
    print(f"  Cluster {idx + 1}: {formatted}")



Data points per cluster (Multi-Initialization):
  Cluster 1: 409 rows
  Cluster 2: 31 rows

Centroid of each cluster (Multi-Initialization):
  Cluster 1: ['1.2787', '2.5452', '-0.0202', '-0.1856', '-0.1938', '-0.0401', '-0.1917', '-0.0850']
  Cluster 2: ['1.9032', '2.5161', '0.2671', '2.4486', '2.5563', '0.5291', '2.5289', '1.1220']


Cluster 1 (409 customers): general customers, the majority.

Cluster 2 (31 customers): large customers with very high purchases.


## ***Conclusion***

--Best experiment: Experiment 2--

- Uses Min-Max normalization,

- K = 4,

- Silhouette Score = 0.6936 in the K search (the highest of all experiments) and 0.5569 in the final run.

Reasons:

- It gives the best-separated segmentation according to the Silhouette Score.
- It yields four distinct clusters.
- It is practical for marketing and business strategy, because each segment can be targeted with a different approach (discounts, loyalty programs, upselling).
