# Ch 4. Model Fitting, Evaluation, and Hyperparameter Tuning
[4-1. Pipelining the Workflow](#sec4_1)  
***

<a id='sec4_1'></a>
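## 4-1. Pipelining the Workflow
#### Goal: predict whether a Pokémon has a second type from its numeric features
#### Add a `hasType2` column that flags whether a second type is present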

In [1]:
import pandas as pd

df = pd.read_csv('Pokemon_894_12.csv')
df['hasType2'] = df['Type2'].notnull().astype(int)
print('Number of dual-type Pokémon:', df['hasType2'].sum())
print('Number of single-type Pokémon:', df.shape[0]-df['hasType2'].sum())
df.tail(3)

Number of dual-type Pokémon: 473
Number of single-type Pokémon: 421


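| | Number | Name | Type1 | Type2 | HP | Attack | Defense | SpecialAtk | SpecialDef | Speed | Generation | Legendary | hasType2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 891 | 805 | 壘磊石 | Rock | Steel | 61 | 131 | 211 | 53 | 101 | 13 | 7 | False | 1 |
| 892 | 806 | 砰頭小丑 | Fire | Ghost | 53 | 127 | 53 | 151 | 79 | 107 | 7 | False | 1 |
| 893 | 807 | 捷拉奧拉 | Electric | NaN | 88 | 112 | 75 | 102 | 80 | 143 | 7 | False | 0 |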
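The flag column is built by testing `Type2` for missing values and casting the boolean result to integers. The sketch below reproduces that step on a hypothetical three-row frame (the frame `toy` and its names `A`, `B`, `C` are made up for illustration and are not part of the dataset).

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame standing in for the Pokémon CSV.
toy = pd.DataFrame({
    'Name': ['A', 'B', 'C'],
    'Type1': ['Rock', 'Fire', 'Electric'],
    'Type2': ['Steel', 'Ghost', np.nan],
})

# notnull() is True where Type2 is present; astype(int) maps True/False to 1/0.
toy['hasType2'] = toy['Type2'].notnull().astype(int)

n_dual = int(toy['hasType2'].sum())   # rows with a second type
n_single = len(toy) - n_dual          # rows without one
print(toy['hasType2'].tolist())  # [1, 1, 0]
print(n_dual, n_single)          # 2 1
```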


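#### Apply the following transformation steps, then fit a logistic regression model to predict whether a second type is present.
1. Split the data into training and test sets, with 20% held out for testing.
2. Keep the two numeric features with the highest ANOVA F-values against the target.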

In [2]:
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, f_classif

X, y = df.loc[:, 'HP':'Speed'], df['hasType2']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Keep the two features with the highest ANOVA F-values
select = SelectKBest(f_classif, k=2).fit(X_train, y_train)
print(select.get_support())
print('Selected features:', X.columns[select.get_support()])
X_train_new = select.transform(X_train)
X_train_new.shape

[False False  True  True False False]
Selected features: Index(['Defense', 'SpecialAtk'], dtype='object')


(715, 2)
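To see what `SelectKBest` ranks, it can help to look at the F statistics directly. The following is a minimal sketch on synthetic data (the arrays `X_demo` and `y_demo` are assumptions, not the Pokémon features): it checks that the selector's `scores_` equal the output of `f_classif` and that the kept columns are the two with the largest scores.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data (an assumption): 200 samples, 4 features, binary target.
rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=200)
X_demo = rng.normal(size=(200, 4))
# Shift features 1 and 3 by class so that only they carry signal.
X_demo[:, 1] += 3.0 * y_demo
X_demo[:, 3] += 2.0 * y_demo

# f_classif returns one F statistic and one p-value per feature.
f_values, p_values = f_classif(X_demo, y_demo)

selector = SelectKBest(f_classif, k=2).fit(X_demo, y_demo)
# scores_ holds the same F statistics; the selector keeps the k largest.
top2 = np.sort(np.argsort(selector.scores_)[-2:])
print(np.round(f_values, 2))
print(selector.get_support(indices=True), top2)
```

Because the selector is fitted on training data only, the same column mask is later applied unchanged to the test set.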

3. Standardize the selected features.

In [3]:
from sklearn.preprocessing import StandardScaler

scale = StandardScaler().fit(X_train_new)
X_train_std = scale.transform(X_train_new)

X_test_new = select.transform(X_test)
X_test_std = scale.transform(X_test_new)
X_test_std[:3, :]

array([[0.86655534, 1.47194431],
       [1.03103292, 2.85717958],
       [1.03103292, 0.67158616]])
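The cell above fits the scaler on the training features and reuses it on the test features. A small sketch with synthetic arrays (`train_demo` and `test_demo` are assumed stand-ins for the selected columns) shows that the transform is just `(x - mean) / std` computed with training statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins (an assumption) for the selected train/test features.
rng = np.random.default_rng(0)
train_demo = rng.normal(loc=70.0, scale=30.0, size=(100, 2))
test_demo = rng.normal(loc=70.0, scale=30.0, size=(20, 2))

scaler = StandardScaler().fit(train_demo)   # statistics come from training data only
train_std = scaler.transform(train_demo)
test_std = scaler.transform(test_demo)      # reuses the training mean and std

# The transform is (x - mean_) / scale_, where scale_ is the population
# standard deviation (ddof=0) of each training column.
manual = (test_demo - train_demo.mean(axis=0)) / train_demo.std(axis=0)

print(np.allclose(test_std, manual))              # True
print(np.allclose(train_std.mean(axis=0), 0.0))   # True
print(np.allclose(train_std.std(axis=0), 1.0))    # True
```

The test columns will generally not have mean 0 and standard deviation 1 after scaling, which is expected: they are measured against the training distribution.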

4. Fit a logistic regression model, evaluate it on the test set, and report the accuracy.

In [4]:
from sklearn.linear_model import LogisticRegression

logit = LogisticRegression(penalty='l2')
logit.fit(X_train_std, y_train)
logit.score(X_test_std, y_test)

0.6424581005586593
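For context, the accuracy printed above can be compared with the rate a classifier would reach by always predicting the more frequent class. The sketch below computes that rate from the class counts printed by the first cell; it is measured on the full dataset, so the rate on the held-out test split may differ slightly.

```python
# Class counts as printed by the first cell of this section.
n_dual, n_single = 473, 421
n_total = n_dual + n_single

# Accuracy of always predicting the more frequent class, over the full dataset.
majority_rate = max(n_dual, n_single) / n_total
print(n_total)                   # 894
print(round(majority_rate, 3))   # 0.529
```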

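After the feature selection and scaling steps above, followed by logistic regression, the model reaches a test-set accuracy of about 0.64.  
Next, a pipeline combines selection, transformation, and modeling into a single end-to-end workflow.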

#### Building the pipeline

In [5]:
from sklearn.pipeline import Pipeline

# Chain feature selection, scaling, and the classifier into a single estimator
pipe_lr = Pipeline([('selK', SelectKBest(f_classif, k=2)),
                    ('sc', StandardScaler()),
                    ('clf', LogisticRegression(penalty='l2'))
                    ])
pipe_lr.fit(X_train, y_train)
pipe_lr.score(X_test, y_test)

0.6424581005586593
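The pipeline score matches the manual result because `fit` and `score` run the same chain of `fit`/`transform` calls in order. The sketch below checks this on synthetic data from `make_classification` (an assumption, used here only because the CSV is not bundled with this example) and shows that fitted steps remain reachable by name.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption) in place of the Pokémon features.
X_demo, y_demo = make_classification(n_samples=300, n_features=6,
                                     n_informative=3, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Manual route: fit each step on the training data, then reuse it on the test data.
sel = SelectKBest(f_classif, k=2).fit(Xtr, ytr)
sc = StandardScaler().fit(sel.transform(Xtr))
lr = LogisticRegression().fit(sc.transform(sel.transform(Xtr)), ytr)
manual_acc = lr.score(sc.transform(sel.transform(Xte)), yte)

# Pipeline route: fit() and score() run the same chain of calls.
pipe = Pipeline([('selK', SelectKBest(f_classif, k=2)),
                 ('sc', StandardScaler()),
                 ('clf', LogisticRegression())])
pipe_acc = pipe.fit(Xtr, ytr).score(Xte, yte)

print(np.isclose(manual_acc, pipe_acc))
# Fitted steps stay reachable by name, e.g. to see which columns were kept.
print(pipe.named_steps['selK'].get_support())
```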

In [6]:
from sklearn import set_config
set_config(display='diagram')
pipe_lr

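#### Pipelining: applying different processing per feature type and combining the transformation steps
1. Add a one-hot encoding of the categorical feature `Generation`.
2. From the encoded columns, keep the three with the highest chi-square statistic against the target.
3. Combine them with the numeric features selected earlier and fit a logistic regression model.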

In [7]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_selection import chi2
from sklearn.compose import ColumnTransformer

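# Numeric features: listed explicitly so the cell does not depend on an earlier definition of X
num_features = ['HP', 'Attack', 'Defense', 'SpecialAtk', 'SpecialDef', 'Speed']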
num_transform = Pipeline([('selK', SelectKBest(f_classif, k=2)), 
                          ('sc', StandardScaler())
                         ])
# Categorical features
cat_features = ['Generation']
cat_transform = Pipeline([('onehot', OneHotEncoder()), 
                          ('selK', SelectKBest(chi2, k=3))
                         ])
# Combine the two preprocessing branches
pre = ColumnTransformer( 
    transformers=[('num', num_transform, num_features), 
                  ('cat', cat_transform, cat_features)
                 ])
# Full pipeline: preprocessing followed by the classifier
clf = Pipeline(steps=[('preprocessor', pre),
                      ('clf', LogisticRegression(penalty='l2'))])

X, y = df.loc[:, 'HP':'Generation'], df['hasType2']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf.fit(X_train, y_train)
clf.score(X_test, y_test)

0.6256983240223464

In [8]:
clf
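The same structure can be tried on a small synthetic frame to confirm how many columns reach the classifier and how nested parameters are addressed. In this sketch the frame `demo`, its columns `a`, `b`, `c`, `gen`, and the target are all hypothetical; only the pipeline layout mirrors the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import SelectKBest, chi2, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic frame (an assumption): three numeric columns and one categorical column.
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    'a': rng.normal(size=n),
    'b': rng.normal(size=n),
    'c': rng.normal(size=n),
    'gen': rng.integers(1, 8, size=n),   # categories 1..7
})
target = (demo['a'] + 0.5 * demo['b'] + rng.normal(scale=0.5, size=n) > 0).astype(int)

pre_demo = ColumnTransformer([
    ('num', Pipeline([('selK', SelectKBest(f_classif, k=2)),
                      ('sc', StandardScaler())]), ['a', 'b', 'c']),
    ('cat', Pipeline([('onehot', OneHotEncoder()),
                      ('selK', SelectKBest(chi2, k=3))]), ['gen']),
])
model = Pipeline([('preprocessor', pre_demo), ('clf', LogisticRegression())])
model.fit(demo, target)

# 2 scaled numeric columns + 3 selected one-hot columns = 5 model inputs.
n_inputs = model.named_steps['clf'].coef_.shape[1]
print(n_inputs)  # 5

# Nested parameters are addressed as <step>__<sub-step>__<parameter>.
model.set_params(preprocessor__num__selK__k=3)
model.fit(demo, target)
print(model.named_steps['clf'].coef_.shape[1])  # 6
```

The double-underscore names are the same ones a parameter search would use, which is one practical reason to keep every preprocessing step inside the pipeline.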