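# Problem Definition

Label-encode the gender column (性別).

Segment the customers into 3 clusters and interpret the characteristics of each cluster.

Compute the Calinski-Harabasz score for k = 2 to 15 and find the best value of k.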

# Data Collection

In [None]:
!wget -O car_models.csv https://raw.githubusercontent.com/imchihchao/aop113b/main/materials/04-customer.csv

--2025-05-27 07:56:50--  https://raw.githubusercontent.com/imchihchao/aop113b/main/materials/04-customer.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2839 (2.8K) [text/plain]
Saving to: ‘car_models.csv’


2025-05-27 07:56:51 (32.4 MB/s) - ‘car_models.csv’ saved [2839/2839]



In [None]:
import pandas as pd

# Load the customer data directly from the course repository
url = 'https://raw.githubusercontent.com/imchihchao/aop113b/main/materials/04-customer.csv'
df = pd.read_csv(url)
df.head()

,性別,年齡,收入（千）,消費指數（1~100）
0,女,74,38,81
1,女,51,71,91
2,女,30,65,10
3,女,88,49,17
4,女,55,48,70


# Data Preprocessing


Data cleaning

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   性別           200 non-null    object
 1   年齡           200 non-null    int64 
 2   收入（千）        200 non-null    int64 
 3   消費指數（1~100）  200 non-null    int64 
dtypes: int64(3), object(1)
memory usage: 6.4+ KB
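
The `df.info()` output above reports 200 non-null values in every column, so nothing needs to be imputed here. As a minimal sketch of how missing values and duplicated rows could be checked explicitly, the toy frame below reuses the notebook's column names with made-up values (the data and the variable names `toy`, `missing`, and `n_dup` are assumptions for illustration only):

```python
import pandas as pd

# Toy frame with the notebook's column names; the values are invented
toy = pd.DataFrame({
    '性別': ['女', '男', None, '男'],
    '年齡': [74, 51, 30, 51],
    '收入（千）': [38, 71, 65, 71],
    '消費指數（1~100）': [81, 91, 10, 91],
})

# Number of missing values in each column
missing = toy.isna().sum()
print(missing)

# Number of rows that exactly repeat an earlier row
n_dup = int(toy.duplicated().sum())
print(n_dup)
```

On the real data, the same two calls on `df` would confirm what `df.info()` already suggests.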


Exploratory analysis

In [None]:
df_cor = df.drop(columns='性別').corr()
df_cor

,年齡,收入（千）,消費指數（1~100）
年齡,1.0,0.031519,-0.127454
收入（千）,0.031519,1.0,0.031476
消費指數（1~100）,-0.127454,0.031476,1.0


Data splitting (not needed for clustering; this cell only checks the gender categories)

In [None]:
df['性別'].unique()

array(['女', '男'], dtype=object)

Category encoding

In [None]:
from sklearn.preprocessing import OrdinalEncoder

# Fix the category order explicitly so that 女 -> 0.0 and 男 -> 1.0
encoder = OrdinalEncoder(categories=[['女', '男']])
df[['性別']] = encoder.fit_transform(df[['性別']])
df

,性別,年齡,收入（千）,消費指數（1~100）
0,0.0,74,38,81
1,0.0,51,71,91
2,0.0,30,65,10
3,0.0,88,49,17
4,0.0,55,48,70
...,...,...,...,...
195,1.0,86,84,82
196,1.0,59,52,30
197,0.0,63,29,61
198,1.0,67,80,9
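
The problem statement asks for label encoding, while the cell above uses `OrdinalEncoder`. In scikit-learn, `LabelEncoder` is documented for target labels and `OrdinalEncoder` for feature columns, so the choice fits a feature such as 性別. The sketch below, on a made-up four-row column (the `toy` frame is an assumption), shows how the mapping could be inspected and reversed:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Invented sample with the same two categories as the notebook
toy = pd.DataFrame({'性別': ['女', '男', '男', '女']})

# Same configuration as in the notebook: 女 -> 0.0, 男 -> 1.0
encoder = OrdinalEncoder(categories=[['女', '男']])
codes = encoder.fit_transform(toy[['性別']])

# The fitted category order, one array per encoded column
print(encoder.categories_)
print(codes.ravel())

# Map the numeric codes back to the original strings
decoded = encoder.inverse_transform(codes)
print(decoded.ravel())
```

Keeping the fitted `encoder` around makes it possible to translate cluster summaries back into the original category names later.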


# Model Training

In [None]:
from sklearn.cluster import KMeans

# Partition the customers into 3 clusters using all four columns
kmeans = KMeans(n_clusters=3)
kmeans.fit(df)

In [None]:
# Attach each customer's cluster label to the DataFrame
df['cluster'] = kmeans.labels_
df

,性別,年齡,收入（千）,消費指數（1~100）,cluster
0,0.0,74,38,81,0
1,0.0,51,71,91,2
2,0.0,30,65,10,1
3,0.0,88,49,17,1
4,0.0,55,48,70,0
...,...,...,...,...,...
195,1.0,86,84,82,2
196,1.0,59,52,30,1
197,0.0,63,29,61,0
198,1.0,67,80,9,1
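
K-means relies on Euclidean distance, so a 0/1 column such as 性別 carries far less weight than columns that range up to about 100. The notebook clusters the raw values. One common alternative, sketched here on synthetic data, is to standardize the features first and to fix `random_state` so the labels are reproducible. The synthetic matrix, `StandardScaler`, `n_init=10`, and `random_state=0` are all assumptions added for illustration, not part of the original workflow:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: one 0/1 column plus three columns on a roughly 0-100 scale
X = np.column_stack([
    rng.integers(0, 2, size=200),
    rng.integers(18, 90, size=200),
    rng.integers(20, 100, size=200),
    rng.integers(1, 101, size=200),
]).astype(float)

# Rescale every column to zero mean and unit variance
X_scaled = StandardScaler().fit_transform(X)

# A fixed random_state makes repeated runs return the same labels
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_scaled)

# Cluster sizes
print(np.bincount(labels))
```

Whether scaling is appropriate depends on how much influence each feature should have on the segmentation, so this is a design choice rather than a correction.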


Model evaluation

In [None]:
df.groupby('cluster').mean()

cluster,性別,年齡,收入（千）,消費指數（1~100）
0,0.520548,52.150685,40.479452,70.726027
1,0.592105,57.842105,60.381579,19.934211
2,0.607843,46.254902,82.764706,75.176471
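
One possible reading of the means above: cluster 0 has the lowest average income (about 40) but a high spending index (about 71); cluster 1 has mid-range income (about 60), the lowest spending index (about 20), and the highest average age (about 58); cluster 2 has the highest income (about 83), the highest spending index (about 75), and the lowest average age (about 46). The 性別 mean stays between 0.52 and 0.61 in every cluster, so gender hardly separates the groups. Cluster numbers are arbitrary and may change between runs. The sketch below rebuilds the table from the printed values and lists, for each feature, which cluster is highest and lowest (the names `means` and `summary` are introduced here for illustration):

```python
import pandas as pd

# Cluster means copied from the groupby output above
means = pd.DataFrame(
    {
        '性別': [0.520548, 0.592105, 0.607843],
        '年齡': [52.150685, 57.842105, 46.254902],
        '收入（千）': [40.479452, 60.381579, 82.764706],
        '消費指數（1~100）': [70.726027, 19.934211, 75.176471],
    },
    index=pd.Index([0, 1, 2], name='cluster'),
)

# For each feature, the cluster with the highest and the lowest mean
summary = pd.DataFrame({'highest': means.idxmax(), 'lowest': means.idxmin()})
print(summary)
```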


In [None]:
from sklearn.metrics import calinski_harabasz_score
score = calinski_harabasz_score(df.drop(columns='cluster'), kmeans.labels_)
score

96.58085896888944
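
The Calinski-Harabasz score is the ratio of between-cluster dispersion to within-cluster dispersion, each divided by its degrees of freedom (k - 1 and n - k), so higher values indicate more compact, better-separated clusters. As a sketch of what the library call computes, the hypothetical helper `ch_score` below re-implements that ratio with NumPy and compares it with `calinski_harabasz_score` on a small made-up dataset:

```python
import numpy as np
from sklearn.metrics import calinski_harabasz_score

def ch_score(X, labels):
    """Between-cluster over within-cluster dispersion, scaled by degrees of freedom."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n = X.shape[0]
    clusters = np.unique(labels)
    k = len(clusters)
    overall_mean = X.mean(axis=0)
    between, within = 0.0, 0.0
    for c in clusters:
        members = X[labels == c]
        centroid = members.mean(axis=0)
        # Size-weighted squared distance from the centroid to the overall mean
        between += len(members) * np.sum((centroid - overall_mean) ** 2)
        # Squared distances from the members to their own centroid
        within += np.sum((members - centroid) ** 2)
    return (between / (k - 1)) / (within / (n - k))

# Made-up points forming three well-separated pairs
X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0], [9.0, 1.0], [8.5, 1.5]])
labels = np.array([0, 0, 1, 1, 2, 2])

print(ch_score(X, labels), calinski_harabasz_score(X, labels))
```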

Model tuning

In [None]:
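# Features only; df itself now also holds the 'cluster' column
df_nocluster = df.drop(columns='cluster')

# Score k = 2..15, fitting and scoring on the same feature matrix
for i in range(2, 16):
  kmeans = KMeans(n_clusters=i)
  kmeans.fit(df_nocluster)
  score = calinski_harabasz_score(df_nocluster, kmeans.labels_)
  print(f'k={i} score={score}')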

k=2 score=111.68453199285265
k=3 score=95.36742753172948
k=4 score=91.0309757105179
k=5 score=80.47989136078031
k=6 score=80.74469965276678
k=7 score=77.74481329049037
k=8 score=91.3190489728096
k=9 score=98.22069449953958
k=10 score=80.3728274632256
k=11 score=93.5381601404456
k=12 score=85.64850216200993
k=13 score=87.08298163748574
k=14 score=86.08599927222349
k=15 score=81.23514397162538
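
Since a higher Calinski-Harabasz score is better, the printed scores point to k = 2 (about 111.68) as the best value in this run. `KMeans` was not given a `random_state`, so the numbers can shift between runs (k = 3 scored 96.58 in the earlier cell and 95.37 here), and the ranking should be treated as indicative. A minimal sketch of picking the best k programmatically, using the scores copied from the output above (the names `scores` and `best_k` are introduced for illustration):

```python
# Scores copied from the loop output above, keyed by k
scores = {
    2: 111.68453199285265, 3: 95.36742753172948, 4: 91.0309757105179,
    5: 80.47989136078031, 6: 80.74469965276678, 7: 77.74481329049037,
    8: 91.3190489728096, 9: 98.22069449953958, 10: 80.3728274632256,
    11: 93.5381601404456, 12: 85.64850216200993, 13: 87.08298163748574,
    14: 86.08599927222349, 15: 81.23514397162538,
}

# The k whose score is largest
best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```

In the notebook itself, the same dictionary could be filled inside the loop instead of printing each score.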
