This Jupyter Notebook (6-2-2.ipynb) demonstrates four ways to apply one-hot encoding to categorical features during machine learning preprocessing:

Pandas (pd.get_dummies)

Scikit-learn (sklearn.preprocessing.OneHotEncoder)

Feature-engine (feature_engine.encoding.OneHotEncoder)

Category Encoders (category_encoders.one_hot.OneHotEncoder)

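Each approach has trade-offs. Scikit-learn, Feature-engine, and Category Encoders all follow the same pattern (fit on the training set, then transform both the training and test sets), which is the widely used practice: it helps avoid data leakage and keeps the encoded columns consistent across the two sets.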

In [158]:
# Using pandas
import pandas as pd
from sklearn.model_selection import train_test_split

In [159]:
# Download the dataset
!wget https://raw.githubusercontent.com/taipeihugo/Feature-Engineering/main/credit_approval_uci.csv -q -O credit_approval_uci.csv

In [160]:
data = pd.read_csv("credit_approval_uci.csv")
data

Unnamed: 0,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,A11,A12,A13,A14,A15,target
0,b,30.83,0.000,u,g,w,v,1.25,t,t,1,f,g,202.0,0,1
1,a,58.67,4.460,u,g,q,h,3.04,t,t,6,f,g,43.0,560,1
2,a,24.50,0.500,u,g,q,h,1.50,t,f,0,f,g,280.0,824,1
3,b,27.83,1.540,u,g,w,v,3.75,t,t,5,t,g,100.0,3,1
4,b,20.17,5.625,u,g,w,v,1.71,t,f,0,f,s,120.0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
685,b,21.08,10.085,y,p,e,h,1.25,f,f,0,f,g,260.0,0,0
686,a,22.67,0.750,u,g,c,v,2.00,f,t,2,t,g,200.0,394,0
687,a,25.25,13.500,y,p,ff,ff,2.00,f,t,1,t,g,200.0,1,0
688,b,17.92,0.205,u,g,aa,v,0.04,f,f,0,f,g,280.0,750,0


In [161]:
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=["target"], axis=1),  # predictors
    data["target"],  # target
    test_size=0.3,  # percentage of observations in test set
    random_state=0,  # seed to ensure reproducibility
)
X_train.shape, X_test.shape

((483, 15), (207, 15))

In [162]:
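# Inspect the values that appear in A4.
# List the unique values of column "A4" in the training set (X_train)
# to see which categories this feature contains.
# The 'Missing' entry in the output shows that missing values are represented by a dedicated string.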

X_train["A4"].unique()

array(['u', 'y', 'Missing', 'l'], dtype=object)

# Method 1: Using Pandas

In [163]:
# One-hot encode A4 with pandas' pd.get_dummies.
# drop_first=False: the default (spelled out here); k categories become k dummy-variable columns.
# dtype='int': store the new columns as integers (0 or 1).
# .head() shows the first 5 encoded rows.

dummies = pd.get_dummies(
    X_train["A4"],
    drop_first=False,
    dtype='int'
)
dummies.head()

Unnamed: 0,Missing,l,u,y
596,0,0,1,0
303,0,0,1,0
204,0,0,0,1
351,0,0,0,1
118,0,0,1,0


In [164]:
# let's one hot encode A4 into k-1 variables

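# Call pd.get_dummies again, this time with drop_first=True.
# drop_first=True yields k-1 dummy variables by dropping the first of the k categories
# (in sorted order; here that is 'Missing').
# This avoids the dummy variable trap, i.e. perfect multicollinearity among the new columns,
# which matters for some linear models.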

dummies = pd.get_dummies(
    X_train["A4"],
    drop_first=True,
    dtype='int'
)
dummies.head()

Unnamed: 0,l,u,y
596,0,1,0
303,0,1,0
204,0,0,1
351,0,0,1
118,0,1,0
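
To make the dummy variable trap concrete, the sketch below uses a small made-up Series (the values are hypothetical, not rows from the credit approval data). It checks that the k dummy columns always sum to 1 per row, and compares the rank of a design matrix (intercept plus dummies) with its number of columns for the k and k-1 variants.

```python
import numpy as np
import pandas as pd

# Toy column (hypothetical values, not taken from the dataset)
s = pd.Series(["u", "y", "l", "u", "y"], name="A4")

full = pd.get_dummies(s, drop_first=False, dtype="int")    # k columns
reduced = pd.get_dummies(s, drop_first=True, dtype="int")  # k-1 columns

# With k dummies every row sums to 1, so the dummies are perfectly
# collinear with an intercept column of ones.
row_sums = full.sum(axis=1)

# Design matrix = intercept + dummies; compare rank with column count.
X_full = np.column_stack([np.ones(len(s)), full.to_numpy()])
X_reduced = np.column_stack([np.ones(len(s)), reduced.to_numpy()])
rank_full = np.linalg.matrix_rank(X_full)
rank_reduced = np.linalg.matrix_rank(X_reduced)

print(X_full.shape[1], rank_full)        # more columns than rank: redundant
print(X_reduced.shape[1], rank_reduced)  # columns == rank: no redundancy
```

Tree-based models are generally indifferent to this redundancy, so whether to drop a category depends on the model that consumes the features.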


In [165]:
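# Now let's encode all categorical variables simultaneously
# into k-1: train set

# Apply pd.get_dummies to the whole training DataFrame (X_train).
# pandas detects the 'object' and 'category' columns and one-hot encodes them;
# numeric columns (A2, A3, ...) are left untouched.
# drop_first=True applies to every encoded column.
# The result is stored in X_train_enc and displayed (Jupyter truncates the middle rows).
#
# [Note] Convenient, but not recommended in practice:
# if the test set (X_test) does not contain exactly the same categories as the training set
# (an extra or a missing category), calling get_dummies on each set separately
# produces columns that do not line up.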

X_train_enc = pd.get_dummies(
    X_train,
    drop_first=True,
    dtype='int'
)
X_train_enc

Unnamed: 0,A2,A3,A8,A11,A14,A15,A1_a,A1_b,A4_l,A4_u,...,A7_j,A7_n,A7_o,A7_v,A7_z,A9_t,A10_t,A12_t,A13_p,A13_s
596,46.08,3.000,2.375,8,396.0,4159,1,0,0,1,...,0,0,0,1,0,1,1,1,0,0
303,15.92,2.875,0.085,0,120.0,0,1,0,0,1,...,0,0,0,1,0,0,0,0,0,0
204,36.33,2.125,0.085,1,50.0,1187,0,1,0,0,...,0,0,0,1,0,1,1,0,0,0
351,22.17,0.585,0.000,0,100.0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
118,57.83,7.040,14.000,6,360.0,1332,0,1,0,1,...,0,0,0,1,0,1,1,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
359,36.75,4.710,0.000,0,160.0,0,1,0,0,1,...,0,0,0,0,0,0,0,0,0,0
192,41.75,0.960,2.500,0,510.0,600,0,1,0,1,...,0,0,0,1,0,1,0,0,0,0
629,19.58,0.665,1.665,0,220.0,5,1,0,0,1,...,0,0,0,1,0,0,0,0,0,0
559,22.83,2.290,2.290,7,140.0,2384,1,0,0,1,...,0,0,0,0,0,1,1,1,0,0


In [166]:
# and now in the test set.
# Run the same get_dummies call on the test set (X_test).
# This is where the column-mismatch risk noted in the previous cell comes in.

X_test_enc = pd.get_dummies(X_test, drop_first=True, dtype='int')

X_test_enc

Unnamed: 0,A2,A3,A8,A11,A14,A15,A1_a,A1_b,A4_l,A4_u,...,A7_j,A7_n,A7_o,A7_v,A7_z,A9_t,A10_t,A12_t,A13_p,A13_s
14,45.83,10.50,5.000,7,0.0,0,1,0,0,1,...,0,0,0,1,0,1,1,1,0,0
586,64.08,20.00,17.500,9,0.0,1000,0,1,0,1,...,0,0,0,0,0,1,1,1,0,0
140,31.25,3.75,0.625,9,181.0,0,1,0,0,1,...,0,0,0,0,0,1,1,1,0,0
492,39.25,9.50,6.500,14,240.0,4607,0,1,0,1,...,0,0,0,1,0,1,1,0,0,0
350,26.17,2.00,0.000,0,276.0,1,1,0,0,1,...,1,0,0,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
188,28.67,1.04,2.500,5,300.0,1430,1,0,0,1,...,0,0,0,1,0,1,1,1,0,0
380,43.17,5.00,2.250,0,141.0,0,0,1,0,1,...,0,0,0,0,0,0,0,1,0,0
369,21.42,0.75,0.750,0,132.0,2,0,1,0,0,...,0,1,0,0,0,0,0,1,0,0
362,26.83,0.54,0.000,0,100.0,0,0,1,0,1,...,0,0,0,0,0,0,0,0,0,0


In [167]:
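# [Note] The logic in this cell is wrong and redundant.
# X_test_enc from the previous cell is already a complete DataFrame (numeric columns + encoded columns).
#
# What the code below actually does:
# 1. pd.concat: joins the original X_test and the already-encoded X_test_enc side by side,
#    so every numeric column (A2, A3, ...) appears twice.
# 2. .drop: removes all 'O' (object) dtype columns, i.e. the original categorical columns.
# 3. Displays the result with its duplicated columns (48 columns, versus 42 in the training set).
# In a normal pandas workflow the previous two cells already finish the job, so this cell should be skipped.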

# Add one-hot encoded variables to the original dataset.
X_test_enc = pd.concat([X_test, X_test_enc], axis=1)
# Drop the categorical variables
X_test_enc.drop(
    labels=X_test_enc.select_dtypes(include="O").columns,
    axis=1,
    inplace=True,
)

# Show data
X_test_enc

Unnamed: 0,A2,A3,A8,A11,A14,A15,A2.1,A3.1,A8.1,A11.1,...,A7_j,A7_n,A7_o,A7_v,A7_z,A9_t,A10_t,A12_t,A13_p,A13_s
14,45.83,10.50,5.000,7,0.0,0,45.83,10.50,5.000,7,...,0,0,0,1,0,1,1,1,0,0
586,64.08,20.00,17.500,9,0.0,1000,64.08,20.00,17.500,9,...,0,0,0,0,0,1,1,1,0,0
140,31.25,3.75,0.625,9,181.0,0,31.25,3.75,0.625,9,...,0,0,0,0,0,1,1,1,0,0
492,39.25,9.50,6.500,14,240.0,4607,39.25,9.50,6.500,14,...,0,0,0,1,0,1,1,0,0,0
350,26.17,2.00,0.000,0,276.0,1,26.17,2.00,0.000,0,...,1,0,0,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
188,28.67,1.04,2.500,5,300.0,1430,28.67,1.04,2.500,5,...,0,0,0,1,0,1,1,1,0,0
380,43.17,5.00,2.250,0,141.0,0,43.17,5.00,2.250,0,...,0,0,0,0,0,0,0,1,0,0
369,21.42,0.75,0.750,0,132.0,2,21.42,0.75,0.750,0,...,0,1,0,0,0,0,0,1,0,0
362,26.83,0.54,0.000,0,100.0,0,26.83,0.54,0.000,0,...,0,0,0,0,0,0,0,0,0,0
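
If pd.get_dummies has to be used on both sets anyway, one common workaround is to reindex the encoded test frame onto the training columns. The sketch below is a minimal illustration on made-up frames (the `train` and `test` data and their values are assumptions, not the notebook's data); it uses `drop_first=False` so that both frames keep every category they contain.

```python
import pandas as pd

# Toy frames (hypothetical): the test set lacks "l" and contains an unseen "z".
train = pd.DataFrame({"A2": [30.8, 58.7, 24.5], "A4": ["u", "y", "l"]})
test = pd.DataFrame({"A2": [27.8, 20.2], "A4": ["u", "z"]})

train_enc = pd.get_dummies(train, drop_first=False, dtype="int")
test_enc = pd.get_dummies(test, drop_first=False, dtype="int")

# Encoded separately, the two frames disagree on their columns.
print(list(train_enc.columns))
print(list(test_enc.columns))

# Force the test frame onto the training columns: dummies missing from the
# test set are filled with 0, and columns never seen in training (A4_z)
# are discarded.
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(test_aligned)
```

Note that with `drop_first=True` applied to each set separately, the two frames could even drop different baseline categories, which reindexing cannot repair. This is one reason the fit/transform encoders in the following sections are usually preferred.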


# Method 2: Using Scikit-learn (recommended)

In [168]:
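# Using scikit-learn

# Import the OneHotEncoder class from sklearn.preprocessing.
# This is the more standard and robust approach: the encoder learns the categories from the training set and applies them consistently to the test set.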

from sklearn.preprocessing import OneHotEncoder

In [169]:
# we create and train the encoder

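# Instantiate the OneHotEncoder.
# drop="first": same role as drop_first=True in pd.get_dummies; yields k-1 dummy variables to avoid multicollinearity.
# sparse_output=False: return a regular (dense) NumPy array.
# (With the default True it returns a sparse matrix, which saves memory when there are many categories;
# it is set to False here so the result is easy to inspect and convert back to a DataFrame.)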

encoder = OneHotEncoder(
    drop="first",  # to return k-1
    sparse_output=False,
)

In [170]:
# Make a list with the categorical variables

# Select every column of X_train whose dtype is 'O' (object), i.e. the categorical variables.
# .columns gives the column names and .to_list() turns them into a list, stored in vars_categorical.

vars_categorical = X_train.select_dtypes(include="O").columns.to_list()

vars_categorical

['A1', 'A4', 'A5', 'A6', 'A7', 'A9', 'A10', 'A12', 'A13']

In [171]:
# fit the encoder to the train set:
# it will learn the categories to encode.

# Fit the encoder.
# This is the core of the scikit-learn workflow: fit only on the categorical variables (`vars_categorical`) of the training set (`X_train`).
# The encoder learns and stores every unique category seen in those columns.

encoder.fit(X_train[vars_categorical])

In [172]:
# Inspect the categories the encoder learned.
# The result is a list with one array per column of vars_categorical, in the same order, holding that column's unique categories.

encoder.categories_

[array(['Missing', 'a', 'b'], dtype=object),
 array(['Missing', 'l', 'u', 'y'], dtype=object),
 array(['Missing', 'g', 'gg', 'p'], dtype=object),
 array(['Missing', 'aa', 'c', 'cc', 'd', 'e', 'ff', 'i', 'j', 'k', 'm',
        'q', 'r', 'w', 'x'], dtype=object),
 array(['Missing', 'bb', 'dd', 'ff', 'h', 'j', 'n', 'o', 'v', 'z'],
       dtype=object),
 array(['f', 't'], dtype=object),
 array(['f', 't'], dtype=object),
 array(['f', 't'], dtype=object),
 array(['g', 'p', 's'], dtype=object)]

In [173]:
# Encode variables in the train and test sets

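# Use the fitted encoder to transform both the training set and the test set.
# Key points:
# 1. fit is done on the training set only.
# 2. transform is applied to both the training set and the test set.
# This guarantees the test set is encoded by exactly the same rules as the training set.
# Categories that appear only in the test set are handled according to handle_unknown:
# the default "error" raises an error, while "ignore" encodes them as all zeros.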

X_train_enc = encoder.transform(X_train[vars_categorical])
X_test_enc = encoder.transform(X_test[vars_categorical])

In [174]:
# Scikit-learn returns a Numpy array

# Display the encoded test set (X_test_enc).
# With sparse_output=False, scikit-learn's transform returns a NumPy array,
# which has no column names and no index.

X_test_enc

array([[1., 0., 0., ..., 1., 0., 0.],
       [0., 1., 0., ..., 1., 0., 0.],
       [1., 0., 0., ..., 1., 0., 0.],
       ...,
       [0., 1., 0., ..., 1., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.]])

In [175]:
# Obtain the binary variable names

# Call .get_feature_names_out() to get the names of the encoded features (columns).
# Names follow the pattern "<original column>_<category>".

encoder.get_feature_names_out()

array(['A1_a', 'A1_b', 'A4_l', 'A4_u', 'A4_y', 'A5_g', 'A5_gg', 'A5_p',
       'A6_aa', 'A6_c', 'A6_cc', 'A6_d', 'A6_e', 'A6_ff', 'A6_i', 'A6_j',
       'A6_k', 'A6_m', 'A6_q', 'A6_r', 'A6_w', 'A6_x', 'A7_bb', 'A7_dd',
       'A7_ff', 'A7_h', 'A7_j', 'A7_n', 'A7_o', 'A7_v', 'A7_z', 'A9_t',
       'A10_t', 'A12_t', 'A13_p', 'A13_s'], dtype=object)

In [176]:
# Wrap the NumPy array in a DataFrame and label its columns with the names from get_feature_names_out().
# The new DataFrame gets a default 0, 1, 2, ... index, which no longer matches X_test; the next cell restores it.

# Transform the array to a pandas dataframe
X_test_enc = pd.DataFrame(X_test_enc)

# Add the variable names
X_test_enc.columns = encoder.get_feature_names_out()

# Show dataset
X_test_enc

Unnamed: 0,A1_a,A1_b,A4_l,A4_u,A4_y,A5_g,A5_gg,A5_p,A6_aa,A6_c,...,A7_j,A7_n,A7_o,A7_v,A7_z,A9_t,A10_t,A12_t,A13_p,A13_s
0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0
1,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0
2,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0
3,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0
4,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
202,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0
203,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
204,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
205,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [177]:
# Replace index in transformed dataset by
# the index in the original dataset.

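# The X_test_enc DataFrame built in the previous cell has the default 0, 1, 2, ... index.
# Replace it with the index of the original X_test.
# This matters because pd.concat aligns on index labels when the numeric columns and the encoded columns are joined later.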

X_test_enc.index = X_test.index

X_test_enc

Unnamed: 0,A1_a,A1_b,A4_l,A4_u,A4_y,A5_g,A5_gg,A5_p,A6_aa,A6_c,...,A7_j,A7_n,A7_o,A7_v,A7_z,A9_t,A10_t,A12_t,A13_p,A13_s
14,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0
586,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0
140,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0
492,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0
350,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
188,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0
380,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
369,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
362,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
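
The reason the index has to be restored can be seen on a tiny example. The sketch below uses made-up stand-ins for the numeric and encoded frames (`X_num` and `X_enc` are hypothetical names) and shows what pd.concat does when the two indexes differ.

```python
import pandas as pd

# Toy stand-ins (hypothetical): the original rows keep their shuffled index.
X_num = pd.DataFrame({"A2": [45.83, 64.08, 31.25]}, index=[14, 586, 140])
# An encoded array wrapped in a DataFrame gets a fresh 0..n-1 index.
X_enc = pd.DataFrame({"A1_a": [1.0, 0.0, 1.0]})

# pd.concat(axis=1) aligns on index labels, not on row position,
# so mismatched labels yield extra rows padded with NaN.
misaligned = pd.concat([X_num, X_enc], axis=1)

# After copying the original index, rows line up one-to-one.
X_enc.index = X_num.index
aligned = pd.concat([X_num, X_enc], axis=1)

print(misaligned.shape, aligned.shape)
```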


In [178]:
# Add the one-hot encoded variables to the
# original dataset.

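# Assemble the result of the scikit-learn workflow:
# join the original X_test (numeric columns plus the original categorical columns) with X_test_enc (the encoded columns only)
# side by side (axis=1), aligned on the index.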

X_test_enc = pd.concat([X_test, X_test_enc], axis=1)
X_test_enc

Unnamed: 0,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,...,A7_j,A7_n,A7_o,A7_v,A7_z,A9_t,A10_t,A12_t,A13_p,A13_s
14,a,45.83,10.50,u,g,q,v,5.000,t,t,...,0.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0
586,b,64.08,20.00,u,g,x,h,17.500,t,t,...,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0
140,a,31.25,3.75,u,g,cc,h,0.625,t,t,...,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0
492,b,39.25,9.50,u,g,m,v,6.500,t,t,...,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0
350,a,26.17,2.00,u,g,j,j,0.000,f,f,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
188,a,28.67,1.04,u,g,c,v,2.500,t,t,...,0.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0
380,b,43.17,5.00,u,g,i,bb,2.250,f,f,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
369,b,21.42,0.75,y,p,r,n,0.750,f,f,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
362,b,26.83,0.54,u,g,k,ff,0.000,f,f,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [179]:
# Drop the categorical variables
# Remove the original categorical columns (their names are stored in vars_categorical) from the combined DataFrame.

X_test_enc.drop(labels=vars_categorical, axis=1, inplace=True)
X_test_enc

Unnamed: 0,A2,A3,A8,A11,A14,A15,A1_a,A1_b,A4_l,A4_u,...,A7_j,A7_n,A7_o,A7_v,A7_z,A9_t,A10_t,A12_t,A13_p,A13_s
14,45.83,10.50,5.000,7,0.0,0,1.0,0.0,0.0,1.0,...,0.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0
586,64.08,20.00,17.500,9,0.0,1000,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0
140,31.25,3.75,0.625,9,181.0,0,1.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0
492,39.25,9.50,6.500,14,240.0,4607,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0
350,26.17,2.00,0.000,0,276.0,1,1.0,0.0,0.0,1.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
188,28.67,1.04,2.500,5,300.0,1430,1.0,0.0,0.0,1.0,...,0.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0
380,43.17,5.00,2.250,0,141.0,0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
369,21.42,0.75,0.750,0,132.0,2,0.0,1.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
362,26.83,0.54,0.000,0,100.0,0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [180]:
# Show data
# X_test_enc now holds only the original numeric columns and the encoded columns: a complete test set ready for modelling.

X_test_enc

Unnamed: 0,A2,A3,A8,A11,A14,A15,A1_a,A1_b,A4_l,A4_u,...,A7_j,A7_n,A7_o,A7_v,A7_z,A9_t,A10_t,A12_t,A13_p,A13_s
14,45.83,10.50,5.000,7,0.0,0,1.0,0.0,0.0,1.0,...,0.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0
586,64.08,20.00,17.500,9,0.0,1000,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0
140,31.25,3.75,0.625,9,181.0,0,1.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0
492,39.25,9.50,6.500,14,240.0,4607,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0
350,26.17,2.00,0.000,0,276.0,1,1.0,0.0,0.0,1.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
188,28.67,1.04,2.500,5,300.0,1430,1.0,0.0,0.0,1.0,...,0.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0
380,43.17,5.00,2.250,0,141.0,0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
369,21.42,0.75,0.750,0,132.0,2,0.0,1.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
362,26.83,0.54,0.000,0,100.0,0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


# Method 3: Using Feature-engine (recommended)

In [181]:
!pip install feature_engine -q

In [182]:
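# Using Feature-engine

# Import OneHotEncoder from feature_engine.encoding
# (this shadows the scikit-learn class of the same name imported earlier).
# Feature-engine is a package built for feature engineering.
# Its API is compatible with scikit-learn (both provide .fit() and .transform()),
# and it is more convenient here because it takes a DataFrame and returns a DataFrame.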

from feature_engine.encoding import OneHotEncoder

In [183]:
# let's create an encoder to return k-1 binary variables

# Instantiate Feature-engine's OneHotEncoder.
# drop_last=True: like drop="first" in scikit-learn it yields k-1 dummy variables,
# except that it drops the last category instead of the first.

ohe_enc = OneHotEncoder(drop_last=True)

In [184]:
# fit the encoder to the train set: it will learn the variables and
# categories to encode

# Fit the Feature-engine encoder.
# Note that .fit() accepts the whole X_train DataFrame.
# It detects the 'O' (object) dtype columns automatically and learns categories only for those.
# (Numeric columns are ignored during fit and passed through unchanged by transform.)

ohe_enc.fit(X_train)

In [185]:

# we can see which variables the encoder will encode

# variables_ lists the columns that the encoder detected and will encode.

ohe_enc.variables_

['A1', 'A4', 'A5', 'A6', 'A7', 'A9', 'A10', 'A12', 'A13']

In [186]:
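# The categories that will be encoded:
# encoder_dict_ maps each categorical variable (key) to the list of categories that get a dummy column (value).
# With drop_last=True one category per variable is left out (for example 'l' for A4).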

ohe_enc.encoder_dict_

{'A1': ['a', 'b'],
 'A4': ['u', 'y', 'Missing'],
 'A5': ['g', 'p', 'Missing'],
 'A6': ['c',
  'q',
  'w',
  'ff',
  'm',
  'i',
  'e',
  'cc',
  'x',
  'd',
  'k',
  'j',
  'Missing',
  'aa'],
 'A7': ['v', 'ff', 'h', 'dd', 'z', 'bb', 'j', 'Missing', 'n'],
 'A9': ['t'],
 'A10': ['t'],
 'A12': ['t'],
 'A13': ['g', 's']}
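
Judging from the output above, the retained categories for A4 follow their order of first appearance in the training data (`'u', 'y', 'Missing', 'l'`), and the last one (`'l'`) is left out. The pandas-only sketch below mimics that observed behavior on a made-up column; it is an illustration of the k-1 logic, not Feature-engine's actual implementation.

```python
import pandas as pd

# Toy column (hypothetical) with the same order of first appearance
# as X_train["A4"].unique() in the notebook.
s = pd.Series(["u", "y", "Missing", "l", "u"], name="A4")

seen = list(s.unique())  # categories in order of first appearance
kept = seen[:-1]         # leave out the last one -> k-1 categories

# One 0/1 column per kept category; rows holding the dropped
# category ("l") are all zeros.
dummies = pd.DataFrame(
    {f"{s.name}_{cat}": (s == cat).astype(int) for cat in kept},
    index=s.index,
)

print(kept)
print(dummies)
```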

In [187]:
# let's transform train and test set

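# Transform the training and test sets.
# This is where Feature-engine is most convenient:
# .transform() returns a complete DataFrame
# that keeps the numeric columns, adds the encoded columns,
# and drops the original categorical columns.
# (No need for the manual steps of the scikit-learn workflow: building the DataFrame, restoring the index, concatenating, and dropping columns.)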

X_train_enc = ohe_enc.transform(X_train)
X_test_enc = ohe_enc.transform(X_test)

In [188]:
# let's inspect the encoded train set

# Display the training set after the Feature-engine transform (Jupyter truncates the middle rows).
# The result is a complete DataFrame with 42 columns (6 numeric + 36 encoded).

X_train_enc

Unnamed: 0,A2,A3,A8,A11,A14,A15,A1_a,A1_b,A4_u,A4_y,...,A7_z,A7_bb,A7_j,A7_Missing,A7_n,A9_t,A10_t,A12_t,A13_g,A13_s
596,46.08,3.000,2.375,8,396.0,4159,1,0,1,0,...,0,0,0,0,0,1,1,1,1,0
303,15.92,2.875,0.085,0,120.0,0,1,0,1,0,...,0,0,0,0,0,0,0,0,1,0
204,36.33,2.125,0.085,1,50.0,1187,0,1,0,1,...,0,0,0,0,0,1,1,0,1,0
351,22.17,0.585,0.000,0,100.0,0,0,1,0,1,...,0,0,0,0,0,0,0,0,1,0
118,57.83,7.040,14.000,6,360.0,1332,0,1,1,0,...,0,0,0,0,0,1,1,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
359,36.75,4.710,0.000,0,160.0,0,1,0,1,0,...,0,0,0,0,0,0,0,0,1,0
192,41.75,0.960,2.500,0,510.0,600,0,1,1,0,...,0,0,0,0,0,1,0,0,1,0
629,19.58,0.665,1.665,0,220.0,5,1,0,1,0,...,0,0,0,0,0,0,0,0,1,0
559,22.83,2.290,2.290,7,140.0,2384,1,0,1,0,...,0,0,0,0,0,1,1,1,1,0


In [189]:
# let's inspect the encoded test set

# Display the test set after the Feature-engine transform (Jupyter truncates the middle rows).
# It also has 42 columns, matching the training set.

X_test_enc

Unnamed: 0,A2,A3,A8,A11,A14,A15,A1_a,A1_b,A4_u,A4_y,...,A7_z,A7_bb,A7_j,A7_Missing,A7_n,A9_t,A10_t,A12_t,A13_g,A13_s
14,45.83,10.50,5.000,7,0.0,0,1,0,1,0,...,0,0,0,0,0,1,1,1,1,0
586,64.08,20.00,17.500,9,0.0,1000,0,1,1,0,...,0,0,0,0,0,1,1,1,1,0
140,31.25,3.75,0.625,9,181.0,0,1,0,1,0,...,0,0,0,0,0,1,1,1,1,0
492,39.25,9.50,6.500,14,240.0,4607,0,1,1,0,...,0,0,0,0,0,1,1,0,1,0
350,26.17,2.00,0.000,0,276.0,1,1,0,1,0,...,0,0,1,0,0,0,0,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
188,28.67,1.04,2.500,5,300.0,1430,1,0,1,0,...,0,0,0,0,0,1,1,1,1,0
380,43.17,5.00,2.250,0,141.0,0,0,1,1,0,...,0,1,0,0,0,0,0,1,1,0
369,21.42,0.75,0.750,0,132.0,2,0,1,0,1,...,0,0,0,0,1,0,0,1,1,0
362,26.83,0.54,0.000,0,100.0,0,0,1,1,0,...,0,0,0,0,0,0,0,0,1,0


In [190]:
# The name of the variables in the transformed data

# Get the names of all features (columns) in the transformed DataFrame:
# the retained numeric columns followed by the new one-hot columns.

ohe_enc.get_feature_names_out()

['A2',
 'A3',
 'A8',
 'A11',
 'A14',
 'A15',
 'A1_a',
 'A1_b',
 'A4_u',
 'A4_y',
 'A4_Missing',
 'A5_g',
 'A5_p',
 'A5_Missing',
 'A6_c',
 'A6_q',
 'A6_w',
 'A6_ff',
 'A6_m',
 'A6_i',
 'A6_e',
 'A6_cc',
 'A6_x',
 'A6_d',
 'A6_k',
 'A6_j',
 'A6_Missing',
 'A6_aa',
 'A7_v',
 'A7_ff',
 'A7_h',
 'A7_dd',
 'A7_z',
 'A7_bb',
 'A7_j',
 'A7_Missing',
 'A7_n',
 'A9_t',
 'A10_t',
 'A12_t',
 'A13_g',
 'A13_s']

# Method 4: Using Category Encoders

In [191]:
# Using Category Encoders
!pip install category_encoders -q

In [192]:
# Import the OneHotEncoder class from the category_encoders library.
# (Note: this shadows the same-named classes imported earlier from scikit-learn and Feature-engine.)

from category_encoders.one_hot import OneHotEncoder

In [193]:
# let's create the encoder
# (here it returns k binary variables per feature, not k-1)

# Instantiate category_encoders' OneHotEncoder.
# use_cat_names=True: name the new columns "<column>_<category>"
#   (with False they get numeric suffixes such as 1, 2, 3, ..., which are hard to read).
#
# [Correction] The original comment, "Category Encoders always returns k-1 dummies", does not hold here.
# As the feature names and the transformed test set further down show,
# this encoder produces k dummy variables for k categories,
# and the 'Missing' label is encoded as a category of its own, like any other.

ohe_enc = OneHotEncoder(use_cat_names=True)

In [194]:
# fit the encoder to the train set: it will learn the variables and
# categories to encode

# Fit the category_encoders encoder.
# Like Feature-engine, it can be fitted on the whole X_train DataFrame;
# it finds the categorical columns (object or category dtype) on its own.

ohe_enc.fit(X_train)

In [195]:
# The variables that will be encoded

# Inspect the list of columns the encoder detected and will process (stored in the .cols attribute).

ohe_enc.cols

['A1', 'A4', 'A5', 'A6', 'A7', 'A9', 'A10', 'A12', 'A13']

In [196]:
# The names of the new variables

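# Get all feature names after the transformation.
# (.get_feature_names() is deprecated and emits a FutureWarning; .get_feature_names_out() is the replacement.)
# The output contains the numeric columns plus a column for each of the k categories (both A9_t and A9_f appear, for example).

ohe_enc.get_feature_names_out()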

array(['A1_a', 'A1_b', 'A1_Missing', 'A2', 'A3', 'A4_u', 'A4_y',
       'A4_Missing', 'A4_l', 'A5_g', 'A5_p', 'A5_Missing', 'A5_gg',
       'A6_c', 'A6_q', 'A6_w', 'A6_ff', 'A6_m', 'A6_i', 'A6_e', 'A6_cc',
       'A6_x', 'A6_d', 'A6_k', 'A6_j', 'A6_Missing', 'A6_aa', 'A6_r',
       'A7_v', 'A7_ff', 'A7_h', 'A7_dd', 'A7_z', 'A7_bb', 'A7_j',
       'A7_Missing', 'A7_n', 'A7_o', 'A8', 'A9_t', 'A9_f', 'A10_t',
       'A10_f', 'A11', 'A12_t', 'A12_f', 'A13_g', 'A13_s', 'A13_p', 'A14',
       'A15'], dtype=object)

In [197]:
# let's transform train and test set

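# Transform the training and test sets.
# As with Feature-engine, category_encoders' .transform() returns
# a complete DataFrame that keeps the numeric columns and replaces the original categorical columns with their encoded versions.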

X_train_enc = ohe_enc.transform(X_train)
X_test_enc = ohe_enc.transform(X_test)

In [198]:
# let's inspect the encoded test set

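# Display the test set after the category_encoders transform (Jupyter truncates the middle rows).
# Its column count (51) differs from scikit-learn (42) and Feature-engine (42)
# because it creates k columns for k categories, so no category (including 'Missing') is dropped.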

X_test_enc

Unnamed: 0,A1_a,A1_b,A1_Missing,A2,A3,A4_u,A4_y,A4_Missing,A4_l,A5_g,...,A10_t,A10_f,A11,A12_t,A12_f,A13_g,A13_s,A13_p,A14,A15
14,1,0,0,45.83,10.50,1,0,0,0,1,...,1,0,7,1,0,1,0,0,0.0,0
586,0,1,0,64.08,20.00,1,0,0,0,1,...,1,0,9,1,0,1,0,0,0.0,1000
140,1,0,0,31.25,3.75,1,0,0,0,1,...,1,0,9,1,0,1,0,0,181.0,0
492,0,1,0,39.25,9.50,1,0,0,0,1,...,1,0,14,0,1,1,0,0,240.0,4607
350,1,0,0,26.17,2.00,1,0,0,0,1,...,0,1,0,1,0,1,0,0,276.0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
188,1,0,0,28.67,1.04,1,0,0,0,1,...,1,0,5,1,0,1,0,0,300.0,1430
380,0,1,0,43.17,5.00,1,0,0,0,1,...,0,1,0,1,0,1,0,0,141.0,0
369,0,1,0,21.42,0.75,0,1,0,0,0,...,0,1,0,1,0,1,0,0,132.0,2
362,0,1,0,26.83,0.54,1,0,0,0,1,...,0,1,0,0,1,1,0,0,100.0,0
