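# Feature Engineering for Categorical Variables - Example Code

This example uses the dataset from the Spaceship Titanic competition on Kaggle: https://www.kaggle.com/competitions/spaceship-titanic

This notebook walks through the following topics: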
1. Label Encoding
2. One-Hot Encoding 
3. Ordinal Encoding
4. Frequency Encoding
5. Feature Combination

In [1]:
import numpy as np
import pandas as pd

## Loading the Data

In [2]:
# Load the data
raw_data = pd.read_excel("train.xlsx")
raw_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8693 entries, 0 to 8692
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   PassengerId   8693 non-null   object 
 1   HomePlanet    8492 non-null   object 
 2   CryoSleep     8476 non-null   float64
 3   Cabin         8494 non-null   object 
 4   Destination   8511 non-null   object 
 5   Age           8514 non-null   float64
 6   VIP           8490 non-null   float64
 7   RoomService   8512 non-null   float64
 8   FoodCourt     8510 non-null   float64
 9   ShoppingMall  8485 non-null   float64
 10  Spa           8510 non-null   float64
 11  VRDeck        8505 non-null   float64
 12  Name          8493 non-null   object 
 13  Transported   8693 non-null   bool   
dtypes: bool(1), float64(8), object(5)
memory usage: 891.5+ KB


In [3]:
# To keep the example simple, drop rows with missing values first
raw_data = raw_data.dropna()

## Splitting the Raw Data into Training and Test Sets

In [4]:
from sklearn.model_selection import train_test_split

trainData, testData = train_test_split(raw_data, test_size = 0.25, random_state = 214)

## Label Encoding

Two ways to perform Label Encoding are covered here:
1. The LabelEncoder class from the sklearn package
2. A Python dictionary

Example: apply Label Encoding to Destination
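
Before running the encoder on the full dataset, it may help to see the mapping it learns on a tiny input. The sketch below uses a hypothetical toy Series (`toy_destination`, not part of the dataset) and reads the learned mapping from `classes_`, which LabelEncoder stores in sorted order.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy column standing in for trainData["Destination"]
toy_destination = pd.Series(
    ["TRAPPIST-1e", "55 Cancri e", "PSO J318.5-22", "TRAPPIST-1e"]
)

encoder = LabelEncoder()
encoded = encoder.fit_transform(toy_destination)

# classes_ holds the categories in sorted order; position i is encoded as i
mapping = {label: code for code, label in enumerate(encoder.classes_)}
# inverse_transform recovers the original labels from the integer codes
decoded = encoder.inverse_transform(encoded)

print(mapping)           # {'55 Cancri e': 0, 'PSO J318.5-22': 1, 'TRAPPIST-1e': 2}
print(encoded.tolist())  # [2, 0, 1, 2]
print(list(decoded))
```

Because the codes follow sorted order, they carry no meaning beyond identifying the category.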

In [5]:
# Method 1: the LabelEncoder class from sklearn
from sklearn.preprocessing import LabelEncoder

labelencoding = LabelEncoder() # create the Label Encoding object
train_labelencoding_Destination = labelencoding.fit_transform(trainData["Destination"]) # fit the LabelEncoder rules on the training data, then transform the training data
test_labelencoding_Destination = labelencoding.transform(testData["Destination"]) # reuse the fitted rules to transform the test data

In [6]:
for before_labelencoding, after_labelencoding in zip(trainData["Destination"], train_labelencoding_Destination):
    print("Before Label Encoding: {}, After Label Encoding: {}".format(before_labelencoding, after_labelencoding))

Before Label Encoding: 55 Cancri e, After Label Encoding: 0
Before Label Encoding: TRAPPIST-1e, After Label Encoding: 2
Before Label Encoding: TRAPPIST-1e, After Label Encoding: 2
Before Label Encoding: TRAPPIST-1e, After Label Encoding: 2
Before Label Encoding: TRAPPIST-1e, After Label Encoding: 2
Before Label Encoding: 55 Cancri e, After Label Encoding: 0
Before Label Encoding: PSO J318.5-22, After Label Encoding: 1
Before Label Encoding: TRAPPIST-1e, After Label Encoding: 2
Before Label Encoding: TRAPPIST-1e, After Label Encoding: 2
Before Label Encoding: 55 Cancri e, After Label Encoding: 0
Before Label Encoding: TRAPPIST-1e, After Label Encoding: 2
Before Label Encoding: TRAPPIST-1e, After Label Encoding: 2
Before Label Encoding: 55 Cancri e, After Label Encoding: 0
Before Label Encoding: TRAPPIST-1e, After Label Encoding: 2
Before Label Encoding: TRAPPIST-1e, After Label Encoding: 2
Before Label Encoding: TRAPPIST-1e, After Label Encoding: 2
[output truncated]

In [7]:
# Method 2: a Python dictionary
labelencoding_dict = {
    content: index for index, content in enumerate(trainData["Destination"].unique())
}

train_labelencoding_Destination = trainData["Destination"].apply(lambda x: labelencoding_dict[x])
test_labelencoding_Destination = testData["Destination"].apply(lambda x: labelencoding_dict[x])

In [8]:
for before_labelencoding, after_labelencoding in zip(trainData["Destination"], train_labelencoding_Destination):
    print("Before Label Encoding: {}, After Label Encoding: {}".format(before_labelencoding, after_labelencoding))

Before Label Encoding: 55 Cancri e, After Label Encoding: 0
Before Label Encoding: TRAPPIST-1e, After Label Encoding: 1
Before Label Encoding: TRAPPIST-1e, After Label Encoding: 1
Before Label Encoding: TRAPPIST-1e, After Label Encoding: 1
Before Label Encoding: TRAPPIST-1e, After Label Encoding: 1
Before Label Encoding: 55 Cancri e, After Label Encoding: 0
Before Label Encoding: PSO J318.5-22, After Label Encoding: 2
Before Label Encoding: TRAPPIST-1e, After Label Encoding: 1
Before Label Encoding: TRAPPIST-1e, After Label Encoding: 1
Before Label Encoding: 55 Cancri e, After Label Encoding: 0
Before Label Encoding: TRAPPIST-1e, After Label Encoding: 1
Before Label Encoding: TRAPPIST-1e, After Label Encoding: 1
Before Label Encoding: 55 Cancri e, After Label Encoding: 0
Before Label Encoding: TRAPPIST-1e, After Label Encoding: 1
Before Label Encoding: TRAPPIST-1e, After Label Encoding: 1
Before Label Encoding: TRAPPIST-1e, After Label Encoding: 1
[output truncated]
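
The dictionary method numbers categories in order of first appearance, which is why its codes differ from the sorted codes that LabelEncoder produces. It also raises a `KeyError` if the test data contains a category that never appeared in training. A minimal sketch of one way to guard against that, using toy data and a sentinel of `-1` (both are assumptions for illustration, not part of the original notebook):

```python
import pandas as pd

# Hypothetical toy data; the second test value never appears in training
train_toy = pd.Series(["TRAPPIST-1e", "55 Cancri e", "TRAPPIST-1e"])
test_toy = pd.Series(["55 Cancri e", "PSO J318.5-22"])

# Same construction as the dictionary method: codes follow first appearance
toy_dict = {content: index for index, content in enumerate(train_toy.unique())}

# Series.map yields NaN for keys missing from the dict; replace those with -1
test_encoded = test_toy.map(toy_dict).fillna(-1).astype(int)

print(toy_dict)               # {'TRAPPIST-1e': 0, '55 Cancri e': 1}
print(test_encoded.tolist())  # [1, -1]
```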

## One-Hot Encoding

Two ways to perform One-Hot Encoding are covered here:
1. The OneHotEncoder class from the sklearn package
2. A Python dictionary

Example: apply One-Hot Encoding to the Destination variable

In [9]:
# Method 1: the OneHotEncoder class from sklearn
from sklearn.preprocessing import OneHotEncoder

one_hot_encoding = OneHotEncoder() # create the One-Hot Encoding object
train_onehotencoding_Destination = one_hot_encoding.fit_transform(trainData["Destination"].values.reshape((-1, 1))).toarray()
test_onehotencoding_Destination = one_hot_encoding.transform(testData["Destination"].values.reshape((-1, 1))).toarray()

> Note: OneHotEncoder expects 2-D input, so the column must be reshaped to two dimensions before it is transformed.
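
As a small illustration of the 2-D requirement, the sketch below uses a hypothetical toy DataFrame. Selecting the column with double brackets is an alternative to `reshape((-1, 1))` that also gives a 2-D input. The `handle_unknown="ignore"` option is not used in this notebook; it is shown only as one possible way to handle categories that are missing from the training data.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy frame standing in for trainData
toy = pd.DataFrame({"Destination": ["TRAPPIST-1e", "55 Cancri e", "PSO J318.5-22"]})

print(toy["Destination"].values.shape)  # (3,)   1-D, not accepted by the encoder
print(toy[["Destination"]].shape)       # (3, 1) 2-D, accepted

# handle_unknown="ignore" encodes unseen categories as an all-zero row
encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(toy[["Destination"]]).toarray()

print(encoder.categories_[0].tolist())  # columns follow sorted category order
print(encoded.tolist())

# "Earth" is a made-up value that was not present when fitting
unseen = pd.DataFrame({"Destination": ["Earth"]})
unseen_encoded = encoder.transform(unseen).toarray()
print(unseen_encoded.tolist())          # [[0.0, 0.0, 0.0]]
```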

In [10]:
for before_onehotencoding, after_onehotencoding in zip(trainData["Destination"], train_onehotencoding_Destination.tolist()):
    print("Before One-Hot Encoding: {}, After One-Hot Encoding: {}".format(before_onehotencoding, after_onehotencoding))

Before One-Hot Encoding: 55 Cancri e, After One-Hot Encoding: [1.0, 0.0, 0.0]
Before One-Hot Encoding: TRAPPIST-1e, After One-Hot Encoding: [0.0, 0.0, 1.0]
Before One-Hot Encoding: TRAPPIST-1e, After One-Hot Encoding: [0.0, 0.0, 1.0]
Before One-Hot Encoding: TRAPPIST-1e, After One-Hot Encoding: [0.0, 0.0, 1.0]
Before One-Hot Encoding: TRAPPIST-1e, After One-Hot Encoding: [0.0, 0.0, 1.0]
Before One-Hot Encoding: 55 Cancri e, After One-Hot Encoding: [1.0, 0.0, 0.0]
Before One-Hot Encoding: PSO J318.5-22, After One-Hot Encoding: [0.0, 1.0, 0.0]
Before One-Hot Encoding: TRAPPIST-1e, After One-Hot Encoding: [0.0, 0.0, 1.0]
Before One-Hot Encoding: TRAPPIST-1e, After One-Hot Encoding: [0.0, 0.0, 1.0]
Before One-Hot Encoding: 55 Cancri e, After One-Hot Encoding: [1.0, 0.0, 0.0]
Before One-Hot Encoding: TRAPPIST-1e, After One-Hot Encoding: [0.0, 0.0, 1.0]
Before One-Hot Encoding: TRAPPIST-1e, After One-Hot Encoding: [0.0, 0.0, 1.0]
[output truncated]

In [11]:
# Method 2: a Python dictionary
destination_classes = trainData["Destination"].unique()
onehotencoding_dict = {
    content: [1 if index == onehotencoding_index else 0 for onehotencoding_index in range(len(destination_classes))]
    for index, content in enumerate(destination_classes)
}

train_onehotencoding_Destination = trainData["Destination"].apply(lambda x: onehotencoding_dict[x])
test_onehotencoding_Destination = testData["Destination"].apply(lambda x: onehotencoding_dict[x])

In [12]:
for before_onehotencoding, after_onehotencoding in zip(trainData["Destination"], train_onehotencoding_Destination):
    print("Before One-Hot Encoding: {}, After One-Hot Encoding: {}".format(before_onehotencoding, after_onehotencoding))

Before One-Hot Encoding: 55 Cancri e, After One-Hot Encoding: [1, 0, 0]
Before One-Hot Encoding: TRAPPIST-1e, After One-Hot Encoding: [0, 1, 0]
Before One-Hot Encoding: TRAPPIST-1e, After One-Hot Encoding: [0, 1, 0]
Before One-Hot Encoding: TRAPPIST-1e, After One-Hot Encoding: [0, 1, 0]
Before One-Hot Encoding: TRAPPIST-1e, After One-Hot Encoding: [0, 1, 0]
Before One-Hot Encoding: 55 Cancri e, After One-Hot Encoding: [1, 0, 0]
Before One-Hot Encoding: PSO J318.5-22, After One-Hot Encoding: [0, 0, 1]
Before One-Hot Encoding: TRAPPIST-1e, After One-Hot Encoding: [0, 1, 0]
Before One-Hot Encoding: TRAPPIST-1e, After One-Hot Encoding: [0, 1, 0]
Before One-Hot Encoding: 55 Cancri e, After One-Hot Encoding: [1, 0, 0]
Before One-Hot Encoding: TRAPPIST-1e, After One-Hot Encoding: [0, 1, 0]
Before One-Hot Encoding: TRAPPIST-1e, After One-Hot Encoding: [0, 1, 0]
Before One-Hot Encoding: 55 Cancri e, After One-Hot Encoding: [1, 0, 0]
[output truncated]

## Ordinal Encoding

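Three ways to perform Ordinal Encoding are covered here:
1. The LabelEncoder class from the sklearn package, with a custom category order (less recommended)
2. The OrdinalEncoder class from the sklearn package, with a custom category order (recommended)
3. A Python dictionary

Example: the Cabin variable contains the deck, the cabin number, and the side of the ship (P for Port or S for Starboard). The decks are treated as ranked in alphabetical order, from "A" (the lowest) to "T" (the highest), so the deck is split out into its own variable and Ordinal Encoding is applied to it.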
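
When the intended ranking does not match alphabetical order, OrdinalEncoder accepts the order explicitly through its `categories` parameter. The sketch below uses hypothetical toy cabins and an assumed ranking list (`deck_order`); replace the list with the real ranking if it differs.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy cabins in the deck/num/side format
toy = pd.DataFrame({"Cabin": ["G/12/S", "B/3/P", "T/0/P", "A/1/S"]})
toy["Deck"] = toy["Cabin"].str.split("/").str[0]

# Assumed ranking, lowest to highest (illustrative only)
deck_order = ["A", "B", "C", "D", "E", "F", "G", "T"]

# categories takes one list per input column; the position in the list is the code
encoder = OrdinalEncoder(categories=[deck_order])
deck_codes = encoder.fit_transform(toy[["Deck"]]).ravel()

print(deck_codes.tolist())  # [6.0, 1.0, 7.0, 0.0]
```

Passing the order explicitly keeps the encoding stable even when some decks are absent from the data used for fitting.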

In [13]:
# Extract the deck from Cabin
raw_data["Deck"] = raw_data["Cabin"].str.split("/").str[0]
trainData["Deck"] = trainData["Cabin"].str.split("/").str[0]
testData["Deck"] = testData["Cabin"].str.split("/").str[0]

In [14]:
# Method 1: the LabelEncoder class from sklearn, with a custom category order
from sklearn.preprocessing import LabelEncoder
labelencoding = LabelEncoder()
labelencoding.fit(np.sort(raw_data["Deck"].unique()))

train_labelencoding_Deck = labelencoding.transform(trainData["Deck"])
test_labelencoding_Deck = labelencoding.transform(testData["Deck"])

In [15]:
for before_ordinal_encoding, after_ordinal_encoding in zip(trainData["Deck"], train_labelencoding_Deck):
    print("Before Ordinal Encoding: {}, After Ordinal Encoding: {}".format(before_ordinal_encoding, after_ordinal_encoding))

Before Ordinal Encoding: G, After Ordinal Encoding: 6
Before Ordinal Encoding: G, After Ordinal Encoding: 6
Before Ordinal Encoding: F, After Ordinal Encoding: 5
Before Ordinal Encoding: F, After Ordinal Encoding: 5
Before Ordinal Encoding: G, After Ordinal Encoding: 6
Before Ordinal Encoding: E, After Ordinal Encoding: 4
Before Ordinal Encoding: G, After Ordinal Encoding: 6
Before Ordinal Encoding: E, After Ordinal Encoding: 4
Before Ordinal Encoding: G, After Ordinal Encoding: 6
Before Ordinal Encoding: G, After Ordinal Encoding: 6
Before Ordinal Encoding: C, After Ordinal Encoding: 2
Before Ordinal Encoding: B, After Ordinal Encoding: 1
Before Ordinal Encoding: C, After Ordinal Encoding: 2
Before Ordinal Encoding: E, After Ordinal Encoding: 4
Before Ordinal Encoding: G, After Ordinal Encoding: 6
Before Ordinal Encoding: F, After Ordinal Encoding: 5
Before Ordinal Encoding: F, After Ordinal Encoding: 5
Before Ordinal Encoding: G, After Ordinal Encoding: 6
[output truncated]

In [16]:
# Method 2: the OrdinalEncoder class from sklearn
from sklearn.preprocessing import OrdinalEncoder
ordinalencoding = OrdinalEncoder()
ordinalencoding.fit( raw_data["Deck"].values.reshape((-1, 1)) )

train_ordinalencoding_Deck = ordinalencoding.transform(trainData["Deck"].values.reshape((-1, 1)) )
test_ordinalencoding_Deck = ordinalencoding.transform(testData["Deck"].values.reshape((-1, 1)) )

In [17]:
for before_ordinalencoding, after_ordinalencoding in zip(trainData["Deck"], train_ordinalencoding_Deck):
    print("Before Ordinal Encoding: {}, After Ordinal Encoding: {}".format(before_ordinalencoding, after_ordinalencoding))

Before Ordinal Encoding: G, After Ordinal Encoding: [6.]
Before Ordinal Encoding: G, After Ordinal Encoding: [6.]
Before Ordinal Encoding: F, After Ordinal Encoding: [5.]
Before Ordinal Encoding: F, After Ordinal Encoding: [5.]
Before Ordinal Encoding: G, After Ordinal Encoding: [6.]
Before Ordinal Encoding: E, After Ordinal Encoding: [4.]
Before Ordinal Encoding: G, After Ordinal Encoding: [6.]
Before Ordinal Encoding: E, After Ordinal Encoding: [4.]
Before Ordinal Encoding: G, After Ordinal Encoding: [6.]
Before Ordinal Encoding: G, After Ordinal Encoding: [6.]
Before Ordinal Encoding: C, After Ordinal Encoding: [2.]
Before Ordinal Encoding: B, After Ordinal Encoding: [1.]
Before Ordinal Encoding: C, After Ordinal Encoding: [2.]
Before Ordinal Encoding: E, After Ordinal Encoding: [4.]
Before Ordinal Encoding: G, After Ordinal Encoding: [6.]
Before Ordinal Encoding: F, After Ordinal Encoding: [5.]
Before Ordinal Encoding: F, After Ordinal Encoding: [5.]
[output truncated]

In [18]:
# Method 3: a Python dictionary
ordinalencoding_dict = {
    content: index for index, content in enumerate(np.sort(raw_data["Deck"].unique()))
}

train_ordinalencoding_Deck = trainData["Deck"].apply(lambda x: ordinalencoding_dict[x])
test_ordinalencoding_Deck = testData["Deck"].apply(lambda x: ordinalencoding_dict[x])

In [19]:
for before_ordinalencoding, after_ordinalencoding in zip(trainData["Deck"], train_ordinalencoding_Deck):
    print("Before Ordinal Encoding: {}, After Ordinal Encoding: {}".format(before_ordinalencoding, after_ordinalencoding))

Before Ordinal Encoding: G, After Ordinal Encoding: 6
Before Ordinal Encoding: G, After Ordinal Encoding: 6
Before Ordinal Encoding: F, After Ordinal Encoding: 5
Before Ordinal Encoding: F, After Ordinal Encoding: 5
Before Ordinal Encoding: G, After Ordinal Encoding: 6
Before Ordinal Encoding: E, After Ordinal Encoding: 4
Before Ordinal Encoding: G, After Ordinal Encoding: 6
Before Ordinal Encoding: E, After Ordinal Encoding: 4
Before Ordinal Encoding: G, After Ordinal Encoding: 6
Before Ordinal Encoding: G, After Ordinal Encoding: 6
Before Ordinal Encoding: C, After Ordinal Encoding: 2
Before Ordinal Encoding: B, After Ordinal Encoding: 1
Before Ordinal Encoding: C, After Ordinal Encoding: 2
Before Ordinal Encoding: E, After Ordinal Encoding: 4
Before Ordinal Encoding: G, After Ordinal Encoding: 6
Before Ordinal Encoding: F, After Ordinal Encoding: 5
Before Ordinal Encoding: F, After Ordinal Encoding: 5
Before Ordinal Encoding: G, After Ordinal Encoding: 6
[output truncated]

## Frequency Encoding

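Frequency Encoding is implemented here mainly with a Python dictionary, in the following steps:
1. Count the occurrences of each category
2. Convert the counts into a dictionary
3. Transform the categories

Example: apply Frequency Encoding to the Destination variable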
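
The three steps can be tried on a small input first. The sketch below uses hypothetical toy Series and `Series.map` instead of `apply`. It also shows a relative-frequency variant (`normalize=True`), which is an optional addition not used in this notebook.

```python
import pandas as pd

# Hypothetical toy data; "PSO J318.5-22" appears only in the test set
train_toy = pd.Series(["TRAPPIST-1e", "TRAPPIST-1e", "55 Cancri e", "TRAPPIST-1e"])
test_toy = pd.Series(["55 Cancri e", "PSO J318.5-22"])

# Steps 1 and 2: count each category on the training data only
counts = train_toy.value_counts().to_dict()

# Step 3: map counts onto the data; unseen categories become 0
test_counts = test_toy.map(counts).fillna(0).astype(int)

# Optional variant: shares of the training set instead of raw counts
shares = train_toy.value_counts(normalize=True).to_dict()
test_shares = test_toy.map(shares).fillna(0.0)

print(counts)                # {'TRAPPIST-1e': 3, '55 Cancri e': 1}
print(test_counts.tolist())  # [1, 0]
print(test_shares.tolist())  # [0.25, 0.0]
```

Counting on the training data alone keeps information from the test set out of the encoding.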

In [20]:
# Steps 1 and 2: count each category and store the counts as a dictionary
frequency_dict = trainData["Destination"].value_counts().to_dict()

# Step 3: transform the categories (categories absent from the training data are encoded as 0)
train_frequencyencoding_Destination = trainData["Destination"].apply(lambda x: frequency_dict.get(x, 0))
test_frequencyencoding_Destination = testData["Destination"].apply(lambda x: frequency_dict.get(x, 0))

In [21]:
for before_frequency_encoding, after_frequency_encoding in zip(trainData["Destination"], train_frequencyencoding_Destination):
    print("Before Frequency Encoding: {}, After Frequency Encoding: {}".format(before_frequency_encoding, after_frequency_encoding))

Before Frequency Encoding: 55 Cancri e, After Frequency Encoding: 1088
Before Frequency Encoding: TRAPPIST-1e, After Frequency Encoding: 3393
Before Frequency Encoding: TRAPPIST-1e, After Frequency Encoding: 3393
Before Frequency Encoding: TRAPPIST-1e, After Frequency Encoding: 3393
Before Frequency Encoding: TRAPPIST-1e, After Frequency Encoding: 3393
Before Frequency Encoding: 55 Cancri e, After Frequency Encoding: 1088
Before Frequency Encoding: PSO J318.5-22, After Frequency Encoding: 473
Before Frequency Encoding: TRAPPIST-1e, After Frequency Encoding: 3393
Before Frequency Encoding: TRAPPIST-1e, After Frequency Encoding: 3393
Before Frequency Encoding: 55 Cancri e, After Frequency Encoding: 1088
Before Frequency Encoding: TRAPPIST-1e, After Frequency Encoding: 3393
Before Frequency Encoding: TRAPPIST-1e, After Frequency Encoding: 3393
Before Frequency Encoding: 55 Cancri e, After Frequency Encoding: 1088
Before Frequency Encoding: TRAPPIST-1e, After Frequency Encoding: 3393
[output truncated]

## Feature Combination

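This section combines categorical features using a Python dictionary, in the following steps:
1. Build the combination rules as a dictionary
2. Transform the categorical variables

Example: combine the two categorical variables Destination and VIP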
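
For comparison, the same combined labels can be produced without a lookup dictionary by concatenating the two columns as strings. This is an alternative sketch on a hypothetical toy DataFrame, not the dictionary method this section describes.

```python
import pandas as pd

# Hypothetical toy frame; VIP is stored as float, as in the loaded data
toy = pd.DataFrame({
    "Destination": ["55 Cancri e", "TRAPPIST-1e", "TRAPPIST-1e"],
    "VIP": [0.0, 1.0, 0.0],
})

# Cast both columns to str and join them with an underscore
toy["Destination_VIP"] = toy["Destination"].astype(str) + "_" + toy["VIP"].astype(str)

print(toy["Destination_VIP"].tolist())
# ['55 Cancri e_0.0', 'TRAPPIST-1e_1.0', 'TRAPPIST-1e_0.0']
```

The combined column is still categorical, so it usually needs one of the encodings above before it can be fed to a model.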

In [29]:
# Step 1: collect the categories of Destination and VIP
one_categorical_feature_class = trainData["Destination"].unique()
two_categorical_feature_class = trainData["VIP"].unique()

# Step 2: build a nested dict keyed by the two categorical variables
combination_dict = {
    one_class: {
        two_class: "{}_{}".format(one_class, two_class) for two_class in two_categorical_feature_class
        } for one_class in one_categorical_feature_class
} 

# Step 3: create the new feature in the DataFrame
trainData["Destination_VIP"] = trainData.apply(lambda x: combination_dict[x["Destination"]][x["VIP"]], axis = 1)
print(trainData["Destination_VIP"])

6923    55 Cancri e_0.0
3087    TRAPPIST-1e_0.0
6749    TRAPPIST-1e_0.0
3930    TRAPPIST-1e_0.0
3333    TRAPPIST-1e_0.0
             ...       
2043    55 Cancri e_0.0
159     TRAPPIST-1e_0.0
2185    TRAPPIST-1e_0.0
2326    55 Cancri e_0.0
8239    TRAPPIST-1e_0.0
Name: Destination_VIP, Length: 4954, dtype: object
