# Imputing Features
Metric and non-metric features call for different imputation methods:

- metric: usually the mean or the median
- non-metric: the most frequent category (the mode)
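
A minimal sketch of both rules in plain pandas, on a small made-up frame (the column names `age` and `color` are illustrative and not part of the dataset used below):

```python
import pandas as pd

# Hypothetical toy data: one metric column and one non-metric column.
df = pd.DataFrame({
    "age": [20.0, None, 40.0, 100.0],
    "color": ["red", "blue", None, "red"],
})

# Metric: fill with the mean or the median of the observed values.
age_mean_filled = df["age"].fillna(df["age"].mean())
age_median_filled = df["age"].fillna(df["age"].median())

# Non-metric: fill with the most frequent category (the mode).
color_filled = df["color"].fillna(df["color"].mode()[0])
```

The median is less affected by the outlying value 100 than the mean, which is one reason to prefer it for skewed columns.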

> Custom imputers: by inheriting from TransformerMixin, you can define a separate imputation method for each column and plug it into a machine learning pipeline.

In [1]:
import pandas as pd

In [2]:
# Build a small example dataset by hand
X = pd.DataFrame({'city':['tokyo', None, 'london', 'seattle', 'san francisco', 'tokyo'], 
                  'boolean':['yes', 'no', None, 'no', 'no', 'yes'], 
                  'ordinal_column':['somewhat like', 'like', 'somewhat like', 'like', 'somewhat like', 'dislike'], 
                  'quantitative_column':[1, 11, -.5, 10, None, 20]})

X.head()

Unnamed: 0,city,boolean,ordinal_column,quantitative_column
0,tokyo,yes,somewhat like,1.0
1,,no,like,11.0
2,london,,somewhat like,-0.5
3,seattle,no,like,10.0
4,san francisco,no,somewhat like,


## Custom imputers

In [3]:
from sklearn.base import TransformerMixin

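#### Custom categorical imputer

Inherit from TransformerMixin and implement transform so that missing values in the selected columns are filled with each column's most frequent category.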

In [4]:
class CustomCategoryImputer(TransformerMixin):
    def __init__(self, cols=None):
        self.cols = cols
        
    def transform(self, df):
        X = df.copy()
        for col in self.cols:
            # Assign the result back; fillna(inplace=True) on a column selection may not update X
            X[col] = X[col].fillna(X[col].value_counts().index[0])
            
        return X
    
    def fit(self, *_):
        return self
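
The imputer above recomputes the most frequent category from whatever frame is passed to `transform`. A hedged variant, not the author's class, learns the fill values in `fit` so that new data is filled with the categories seen during training. The class name `FittedCategoryImputer` and the attribute `fill_values_` are illustrative; `BaseEstimator` is added only so the class supports `get_params` and `clone`.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class FittedCategoryImputer(BaseEstimator, TransformerMixin):
    """Illustrative variant: learn the mode of each column in fit()."""

    def __init__(self, cols=None):
        self.cols = cols

    def fit(self, df, y=None):
        # value_counts() ignores missing values and sorts by frequency.
        self.fill_values_ = {
            col: df[col].value_counts().index[0] for col in self.cols
        }
        return self

    def transform(self, df):
        X = df.copy()
        for col, value in self.fill_values_.items():
            X[col] = X[col].fillna(value)
        return X


# Hypothetical training and new data.
train = pd.DataFrame({"city": ["tokyo", "tokyo", "london", None]})
new = pd.DataFrame({"city": ["paris", None]})

cat_imputer = FittedCategoryImputer(cols=["city"]).fit(train)
filled = cat_imputer.transform(new)  # the missing city is filled from train
```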

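#### Custom quantitative imputer

Inherit from TransformerMixin and implement transform so that missing values are filled with the mean (the default strategy here), using scikit-learn's SimpleImputer.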

In [5]:
from sklearn.impute import SimpleImputer

class CustomQuantitativeImputer(TransformerMixin):
    def __init__(self, cols=None, strategy='mean'):
        self.cols = cols
        self.strategy = strategy
        
    def transform(self, df):
        X = df.copy()
        impute = SimpleImputer(strategy = self.strategy)
        
        for col in self.cols:
            X[col] = impute.fit_transform(X[[col]])
            
        return X
    
    def fit(self, *_):
        return self

#### Running the custom imputers by hand

In [6]:
cci = CustomCategoryImputer(cols=['city','boolean'])
cci.fit_transform(X)

Unnamed: 0,city,boolean,ordinal_column,quantitative_column
0,tokyo,yes,somewhat like,1.0
1,tokyo,no,like,11.0
2,london,no,somewhat like,-0.5
3,seattle,no,like,10.0
4,san francisco,no,somewhat like,
5,tokyo,yes,dislike,20.0


In [7]:
cqi = CustomQuantitativeImputer(cols=['quantitative_column'],strategy='mean')
cqi.fit_transform(X)

Unnamed: 0,city,boolean,ordinal_column,quantitative_column
0,tokyo,yes,somewhat like,1.0
1,,no,like,11.0
2,london,,somewhat like,-0.5
3,seattle,no,like,10.0
4,san francisco,no,somewhat like,8.3
5,tokyo,yes,dislike,20.0
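
The filled value can be checked by hand: the mean of the five observed values 1, 11, -0.5, 10 and 20 is 41.5 / 5 = 8.3. A short check, which also shows what `strategy='median'` would fill instead:

```python
import pandas as pd

values = pd.Series([1, 11, -0.5, 10, None, 20])

# mean() and median() skip missing values by default.
mean_fill = values.mean()      # (1 + 11 - 0.5 + 10 + 20) / 5
median_fill = values.median()  # middle of -0.5, 1, 10, 11, 20
```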


#### Putting the custom imputers into a pipeline

In [8]:
from sklearn.pipeline import Pipeline

cci = CustomCategoryImputer(cols=['city','boolean'])
cqi = CustomQuantitativeImputer(cols=['quantitative_column'],strategy='mean')

imputer = Pipeline([('quant', cqi),('category',cci)])
imputer.fit_transform(X)

Unnamed: 0,city,boolean,ordinal_column,quantitative_column
0,tokyo,yes,somewhat like,1.0
1,tokyo,no,like,11.0
2,london,no,somewhat like,-0.5
3,seattle,no,like,10.0
4,san francisco,no,somewhat like,8.3
5,tokyo,yes,dislike,20.0


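# Encoding variables

- non-metric: most machine learning models only accept numeric input, so non-metric variables must be converted to numeric form before they can be used
- metric: a metric variable can be converted to an ordinal one, which is sometimes more meaningful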

## Nominal encoding

Convert nominal-scale variables into dummy variables.

In [9]:
# Use pandas' get_dummies
pd.get_dummies(X, columns=['city','boolean'], prefix_sep='__')
# prefix_sep: separator between the original column name and the category

Unnamed: 0,ordinal_column,quantitative_column,city__london,city__san francisco,city__seattle,city__tokyo,boolean__no,boolean__yes
0,somewhat like,1.0,0,0,0,1,0,1
1,like,11.0,0,0,0,0,1,0
2,somewhat like,-0.5,1,0,0,0,0,0
3,like,10.0,0,0,1,0,1,0
4,somewhat like,,0,1,0,0,1,0
5,dislike,20.0,0,0,0,1,0,1
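
In the output above, row 1 has no city and row 2 has no boolean value, so all of their dummy columns are 0. If missingness should be kept as its own category, `get_dummies` accepts `dummy_na=True`. A small sketch on a made-up column; `dtype=int` is passed because recent pandas versions return boolean dummies by default:

```python
import pandas as pd

# Hypothetical single-column frame with one missing value.
cities = pd.DataFrame({"city": ["tokyo", None, "london"]})

# Default: the row with a missing city gets 0 in every dummy column.
plain = pd.get_dummies(cities, columns=["city"], prefix_sep="__", dtype=int)

# dummy_na=True adds one extra indicator column for missing values.
with_na = pd.get_dummies(
    cities, columns=["city"], prefix_sep="__", dummy_na=True, dtype=int
)
```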


### Custom dummifier

As in the previous steps, a custom dummifier can be defined so that it fits into a machine learning pipeline.

In [10]:
class CustomDummifier(TransformerMixin):
    def __init__(self, cols=None):
        self.cols = cols
        
    def transform(self, X):
        return pd.get_dummies(X, columns=self.cols)
    
    def fit(self, *_):
        return self

In [11]:
cd = CustomDummifier(cols=['boolean','city'])

cd.fit_transform(X)

Unnamed: 0,ordinal_column,quantitative_column,boolean_no,boolean_yes,city_london,city_san francisco,city_seattle,city_tokyo
0,somewhat like,1.0,0,1,0,0,0,1
1,like,11.0,1,0,0,0,0,0
2,somewhat like,-0.5,0,0,1,0,0,0
3,like,10.0,1,0,0,0,1,0
4,somewhat like,,1,0,0,1,0,0
5,dislike,20.0,0,1,0,0,0,1
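
Because `get_dummies` derives its columns from the categories present in the frame it receives, the transformer above can produce different columns for training data and new data. One possible way around this, sketched under the assumption that unseen categories should simply be dropped, is to record the columns in `fit` and reindex in `transform`. `ConsistentDummifier` and `columns_` are hypothetical names.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class ConsistentDummifier(BaseEstimator, TransformerMixin):
    """Illustrative variant: keep the dummy columns seen during fit()."""

    def __init__(self, cols=None):
        self.cols = cols

    def fit(self, X, y=None):
        # Remember the column layout produced by the training data.
        self.columns_ = pd.get_dummies(X, columns=self.cols, dtype=int).columns
        return self

    def transform(self, X):
        dummies = pd.get_dummies(X, columns=self.cols, dtype=int)
        # Missing columns are added as 0; unseen categories are dropped.
        return dummies.reindex(columns=self.columns_, fill_value=0)


# Hypothetical training and new data; "paris" is unseen during fit.
train = pd.DataFrame({"city": ["tokyo", "london", "seattle"]})
new = pd.DataFrame({"city": ["tokyo", "paris"]})

dummifier = ConsistentDummifier(cols=["city"]).fit(train)
encoded = dummifier.transform(new)
```

The unseen city ends up as a row of zeros, the same encoding a missing value receives, which may or may not be acceptable for a given model.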


## Ordinal encoding

On an ordinal scale the order itself carries meaning, so the numeric codes must follow the original order. This approach is called label encoding.

In [12]:
# First inspect the values in the ordinal column
print(X['ordinal_column'])

0    somewhat like
1             like
2    somewhat like
3             like
4    somewhat like
5          dislike
Name: ordinal_column, dtype: object


In [13]:
# A list's index positions give the ordering
ordering = ['dislike','somewhat like','like']
print(ordering.index('somewhat like'))

1


In [14]:
# Map every value in the ordinal column to its position with a lambda
print(X['ordinal_column'].map(lambda x: ordering.index(x)))

0    1
1    2
2    1
3    2
4    1
5    0
Name: ordinal_column, dtype: int64
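
`list.index` raises a `ValueError` for any value that is not in `ordering`, including missing values. As an alternative sketch (not the method used in this notebook), an ordered `pd.Categorical` yields the same codes and marks anything outside the listed categories as -1:

```python
import pandas as pd

levels = ["dislike", "somewhat like", "like"]
# Made-up answers, including one missing value.
answers = pd.Series(["somewhat like", "like", "dislike", None])

# Codes follow the position of each value in `levels`; missing values get -1.
codes = pd.Categorical(answers, categories=levels, ordered=True).codes
```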


### Custom label encoder
Wrapping the mapping in a custom class lets it join a pipeline.

In [15]:
class CustomEncoder(TransformerMixin):
    def __init__(self, col, ordering=None):
        self.ordering = ordering
        self.col = col
        
    def transform(self, df):
        X = df.copy()
        X[self.col] = X[self.col].map(lambda x: self.ordering.index(x))
        return X
    
    def fit(self, *_):
        return self

In [16]:
ce = CustomEncoder(col='ordinal_column', ordering=['dislike','somewhat like','like'])
ce.fit_transform(X)

Unnamed: 0,city,boolean,ordinal_column,quantitative_column
0,tokyo,yes,1,1.0
1,,no,2,11.0
2,london,,1,-0.5
3,seattle,no,2,10.0
4,san francisco,no,1,
5,tokyo,yes,0,20.0
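
scikit-learn also ships an `OrdinalEncoder` whose `categories` parameter takes an explicit order per column. The sketch below is an alternative to the custom class, shown on the same six values as a plain list of rows; note that it returns floats by default.

```python
from sklearn.preprocessing import OrdinalEncoder

levels = ["dislike", "somewhat like", "like"]
answers = [["somewhat like"], ["like"], ["somewhat like"],
           ["like"], ["somewhat like"], ["dislike"]]

# One list of categories per input column, in the desired order.
encoder = OrdinalEncoder(categories=[levels])
encoded = encoder.fit_transform(answers)  # 2D float array, one column
```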


# Binning continuous variables

Converting a continuous value into a categorical one can be useful. For example, given an age, the age range may be more informative.

In [17]:
# pd.cut performs the binning
pd.cut(X['quantitative_column'], bins=3)
# bins: number of bins; 3 splits the value range into three equal-width intervals

0     (-0.52, 6.333]
1    (6.333, 13.167]
2     (-0.52, 6.333]
3    (6.333, 13.167]
4                NaN
5     (13.167, 20.0]
Name: quantitative_column, dtype: category
Categories (3, interval[float64]): [(-0.52, 6.333] < (6.333, 13.167] < (13.167, 20.0]]
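
With an integer `bins`, `pd.cut` builds equal-width intervals over the observed range, here from -0.5 to 20, so each bin is 20.5 / 3 ≈ 6.833 wide. The edges can be inspected with `retbins=True`; the lowest edge is nudged slightly below the minimum so that -0.5 itself falls inside the first bin.

```python
import pandas as pd

values = pd.Series([1, 11, -0.5, 10, None, 20])

# retbins=True also returns the array of bin edges.
binned, edges = pd.cut(values, bins=3, retbins=True)

# Width of each equal-width bin: (max - min) / number of bins.
width = (values.max() - values.min()) / 3
```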

In [18]:
# labels=False returns the integer bin index instead of the interval label
pd.cut(X['quantitative_column'], bins=3, labels=False)

0    0.0
1    1.0
2    0.0
3    1.0
4    NaN
5    2.0
Name: quantitative_column, dtype: float64
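
`bins` can also be a list of explicit edges, which suits the age-range example mentioned earlier. The ages, edges, and label names below are made up for illustration:

```python
import pandas as pd

ages = pd.Series([5, 18, 30, 70])

# Intervals are right-closed by default: (0, 18], (18, 65], (65, 120].
age_range = pd.cut(
    ages, bins=[0, 18, 65, 120], labels=["child", "adult", "senior"]
)
```

Because the intervals are closed on the right, an age of exactly 18 lands in the first bin.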

### Custom binning transformer

In [19]:
class CustomCutter(TransformerMixin):
    def __init__(self, col, bins, labels=False):
        self.col = col
        self.bins = bins
        self.labels = labels
        
    def transform(self, df):
        X = df.copy()
        X[self.col] = pd.cut(X[self.col], bins=self.bins, labels=self.labels)
        return X
    
    def fit(self, *_):
        return self

In [20]:
cc = CustomCutter(col='quantitative_column', bins=3)
cc.fit_transform(X)

Unnamed: 0,city,boolean,ordinal_column,quantitative_column
0,tokyo,yes,somewhat like,0.0
1,,no,like,1.0
2,london,,somewhat like,0.0
3,seattle,no,like,1.0
4,san francisco,no,somewhat like,
5,tokyo,yes,dislike,2.0


## Building the encoding pipeline

In [21]:
from sklearn.pipeline import Pipeline

pipe = Pipeline([("imputer", imputer), ("dummify", cd), ("encode", ce), ("cut", cc)])
# Order: imputer > dummy variables > encode ordinal column > bin quantitative column

In [22]:
# Before transformation
X.head(6)

Unnamed: 0,city,boolean,ordinal_column,quantitative_column
0,tokyo,yes,somewhat like,1.0
1,,no,like,11.0
2,london,,somewhat like,-0.5
3,seattle,no,like,10.0
4,san francisco,no,somewhat like,
5,tokyo,yes,dislike,20.0


In [23]:
# Fit the pipeline before transforming
pipe.fit(X)

Pipeline(steps=[('imputer',
                 Pipeline(steps=[('quant',
                                  <__main__.CustomQuantitativeImputer object at 0x000001E610C534F0>),
                                 ('category',
                                  <__main__.CustomCategoryImputer object at 0x000001E610C530A0>)])),
                ('dummify',
                 <__main__.CustomDummifier object at 0x000001E611066670>),
                ('encode',
                 <__main__.CustomEncoder object at 0x000001E61107CC70>),
                ('cut', <__main__.CustomCutter object at 0x000001E61107C8E0>)])

In [24]:
pipe.transform(X)

Unnamed: 0,ordinal_column,quantitative_column,boolean_no,boolean_yes,city_london,city_san francisco,city_seattle,city_tokyo
0,1,0,0,1,0,0,0,1
1,2,1,1,0,0,0,0,1
2,1,0,1,0,1,0,0,0
3,2,1,1,0,0,0,1,0
4,1,1,1,0,0,1,0,0
5,0,2,0,1,0,0,0,1
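
As a cross-check, the same four steps can be written in plain pandas without the custom classes. This is only a sketch for verifying the table above; the variable names `raw`, `out`, and `levels` are illustrative, and `dtype=int` is passed so the dummies print as 0/1 on recent pandas versions.

```python
import pandas as pd

raw = pd.DataFrame({
    "city": ["tokyo", None, "london", "seattle", "san francisco", "tokyo"],
    "boolean": ["yes", "no", None, "no", "no", "yes"],
    "ordinal_column": ["somewhat like", "like", "somewhat like",
                       "like", "somewhat like", "dislike"],
    "quantitative_column": [1, 11, -0.5, 10, None, 20],
})
out = raw.copy()

# 1. Impute: mean for the quantitative column, mode for the categorical ones.
out["quantitative_column"] = out["quantitative_column"].fillna(
    out["quantitative_column"].mean()
)
for col in ["city", "boolean"]:
    out[col] = out[col].fillna(out[col].value_counts().index[0])

# 2. Dummify the nominal columns.
out = pd.get_dummies(out, columns=["boolean", "city"], dtype=int)

# 3. Encode the ordinal column by position in the ordered list.
levels = ["dislike", "somewhat like", "like"]
out["ordinal_column"] = out["ordinal_column"].map(lambda v: levels.index(v))

# 4. Bin the quantitative column into three equal-width bins.
out["quantitative_column"] = pd.cut(
    out["quantitative_column"], bins=3, labels=False
)
```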
