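# Data Preprocessing (Label Encoding, One-Hot Encoding)
Both encodings convert categorical or text data into numbers so that a program can interpret the data and compute with it.
> Label encoding: maps each category to an integer; no new columns are added

> One hot encoding: adds one column per category and uses 0/1 to mark whether a row belongs to it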

![](images/Encoder.PNG)
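
The two definitions above can be tried out with nothing more than NumPy. This is only a sketch of the idea, not how any library implements it; the names `categories`, `mapping`, `labels` and `one_hot` are chosen here for illustration, and sorted order is an assumption made for the numbering.

```python
import numpy as np

values = ["A", "B", "AB", "O", "B"]

# Label encoding: map each distinct category to an integer.
# The column count stays the same: one value in, one integer out.
categories = sorted(set(values))
mapping = {cat: i for i, cat in enumerate(categories)}
labels = np.array([mapping[v] for v in values])

# One-hot encoding: one column per category, 1 where the row belongs to it.
one_hot = np.eye(len(categories), dtype=int)[labels]

print(labels)
print(one_hot)
```

Label encoding returns one integer per row, while one-hot encoding returns as many columns as there are categories.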


## Encoding Categorical features (or label)
![](images/Encoding.PNG)


In [26]:
import pandas as pd
import numpy as np
# It is good practice to import every required package up front

In [27]:
# Build the sample data and display it
df = pd.DataFrame({'blood':['A','B','AB','O','B'], 
                   'Y':['high','low','high','mid','mid'],
                   'Z':[np.nan,np.nan,-1196,72,83]});
df

Unnamed: 0,blood,Y,Z
0,A,high,
1,B,low,
2,AB,high,-1196.0
3,O,mid,72.0
4,B,mid,83.0


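# Method 1: sklearn - label encoder + onehot encoder
>OneHotEncoder expects a 2-D array, so a 1-D array has to be reshaped with reshape(-1,1)<br>
>Older scikit-learn versions (before 0.20) required numeric input for OneHotEncoder, so text data had to be converted to numbers with LabelEncoder first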

In [28]:
from sklearn.preprocessing import LabelEncoder

In [29]:
# Assign the LabelEncoder tool to a variable named encoder
encoder = LabelEncoder()

In [30]:
# Transform the "blood" column.
encoded_Y = encoder.fit_transform(df["blood"])
print(encoded_Y)

[0 2 1 3 2]
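
The output may look surprising at first, since B receives 2 rather than 1. `LabelEncoder` numbers the categories in sorted order, and "AB" sorts before "B". The sketch below reproduces the same codes with `np.unique`; it illustrates the ordering only and is not the library's internal code.

```python
import numpy as np

blood = np.array(["A", "B", "AB", "O", "B"])

# np.unique returns the sorted distinct values and, with return_inverse=True,
# the index of each original element within that sorted array.
classes, codes = np.unique(blood, return_inverse=True)

print(classes)  # sorted categories
print(codes)    # integer code of each row
```

With scikit-learn, the fitted encoder exposes the same sorted list as `encoder.classes_`.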


In [31]:
# Write the transformed data back into the DataFrame.
df["blood"] = encoded_Y
df

Unnamed: 0,blood,Y,Z
0,0,high,
1,2,low,
2,1,high,-1196.0
3,3,mid,72.0
4,2,mid,83.0


In [32]:
from sklearn.preprocessing import OneHotEncoder
onehot = OneHotEncoder()
print(type(df["blood"]))

<class 'pandas.core.series.Series'>


In [33]:
# Convert to a NumPy array
d = np.array(df["blood"])
d

array([0, 2, 1, 3, 2])

In [34]:
# Dimensionality of the data: one-dimensional
d.shape

(5,)
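
A shape of `(5,)` means a flat sequence of five values, whereas OneHotEncoder wants rows and columns. A small sketch of what `reshape(-1, 1)` does to such an array (the name `d_2d` is illustrative):

```python
import numpy as np

d = np.array([0, 2, 1, 3, 2])

# -1 lets NumPy infer the number of rows; 1 fixes a single column.
d_2d = d.reshape(-1, 1)

print(d.shape)     # 1-D: five values
print(d_2d.shape)  # 2-D: five rows, one column
```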

In [35]:
# d.reshape(-1,1) adds a dimension (1-D to 2-D),
# which is the input shape that one-hot encoding accepts
onehot_df = onehot.fit_transform(d.reshape(-1,1))
onehot_df

<5x4 sparse matrix of type '<class 'numpy.float64'>'
	with 5 stored elements in Compressed Sparse Row format>

In [36]:
type(onehot_df)

scipy.sparse.csr.csr_matrix

In [37]:
onehot_df = onehot.fit_transform(d.reshape(-1,1)).toarray()
type(onehot_df)

numpy.ndarray

In [38]:
# This stage of the conversion is complete
onehot_df

array([[1., 0., 0., 0.],
       [0., 0., 1., 0.],
       [0., 1., 0., 0.],
       [0., 0., 0., 1.],
       [0., 0., 1., 0.]])

In [16]:
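# Reflection: after the conversion above, putting the result back into the DataFrame or restoring the original values can be cumbersome, so other approaches are explored below.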
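
Restoring the original categories from a one-hot array is manageable as long as the category order is kept. A minimal sketch, assuming the `classes` array below lists the categories in the same order that was used for encoding (with scikit-learn this is what `encoder.classes_` holds); `onehot_arr`, `labels` and `restored` are illustrative names.

```python
import numpy as np

classes = np.array(["A", "AB", "B", "O"])  # assumed: same order as used for encoding
onehot_arr = np.array([[1., 0., 0., 0.],
                       [0., 0., 1., 0.],
                       [0., 1., 0., 0.],
                       [0., 0., 0., 1.],
                       [0., 0., 1., 0.]])

# The position of the 1 in each row is the integer label ...
labels = onehot_arr.argmax(axis=1)
# ... and indexing the category array with it gives back the strings.
restored = classes[labels]

print(labels)
print(restored)
```

The scikit-learn encoders also provide an `inverse_transform` method for the same purpose.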

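## One hot encoding
One-hot encoding splits a categorical column into several columns, one per category. Each row holds 1 in the column that matches its category and 0 in all the others.

When encoding a specific column, older versions of scikit-learn (before 0.20) <b>could not one-hot encode strings directly; the strings first had to be replaced by numbers with label encoding.</b> Current versions of OneHotEncoder accept string columns as they are.

> categorical_features = [0]: one-hot encode the column at index 0 of data (a parameter of the old OneHotEncoder API; newer scikit-learn versions select the column with ColumnTransformer instead, as in the code below)

> data_le: the data after label encoding (note: OneHotEncoder expects a 2-D array, whereas label encoding produces a 1-D array)


OneHotEncoder returns a SciPy CSR sparse matrix (scipy.sparse.csr_matrix); call .toarray() to turn it into a NumPy array.
In the result, column 0 stands for A, column 1 for AB, column 2 for B, and column 3 for O, because the categories are numbered in sorted order.
One-hot encoding is not limited to strings; it can also encode numeric columns, in which case the data does not need to be label-encoded first.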
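
As a hedged illustration of encoding strings without a label-encoding step, the sketch below assumes scikit-learn 0.20 or later; `df_demo` and `encoded` are names made up for this example.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df_demo = pd.DataFrame({"blood": ["A", "B", "AB", "O", "B"]})

encoder = OneHotEncoder()
# Double brackets keep the input 2-D (a one-column DataFrame).
encoded = encoder.fit_transform(df_demo[["blood"]]).toarray()

print(encoder.categories_)  # column order of the encoded array
print(encoded)
```

`encoder.categories_` records which category each output column stands for, which helps when reading the result.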

```python
# OneHotEncoder no longer takes the categorical_features argument;
# ColumnTransformer selects the column(s) to encode instead
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# one-hot encode column 0 (the first column) and pass the rest through unchanged
columnTransformer = ColumnTransformer([('encoder',
                                        OneHotEncoder(),
                                        [0])],
                                      remainder='passthrough')

# `data` is the DataFrame (or 2-D array) to encode; the result is cast to strings
data = np.array(columnTransformer.fit_transform(data), dtype = str)
```

In [39]:
# importing one hot encoder from sklearn 
# There are changes in OneHotEncoder class 
from sklearn.preprocessing import OneHotEncoder 
from sklearn.compose import ColumnTransformer 


# =============================================================

# Example with remainder='passthrough' removed (do not run)

In [21]:
columnTransformer = ColumnTransformer([('encoder', 
                                        OneHotEncoder(), 
                                        [0])])

In [23]:
data = np.array(columnTransformer.fit_transform(data), dtype = str) 
data

array([['0.0', '1.0'],
       ['1.0', '0.0'],
       ['1.0', '0.0'],
       ['1.0', '0.0'],
       ['1.0', '0.0']], dtype='<U32')

# =============================================================

In [40]:
columnTransformer = ColumnTransformer([('encoder', 
                                        OneHotEncoder(), 
                                        [0])], 
                                      remainder='passthrough')

In [41]:
data = np.array(columnTransformer.fit_transform(df), dtype = str) 
data

array([['1.0', '0.0', '0.0', '0.0', 'high', 'nan'],
       ['0.0', '0.0', '1.0', '0.0', 'low', 'nan'],
       ['0.0', '1.0', '0.0', '0.0', 'high', '-1196.0'],
       ['0.0', '0.0', '0.0', '1.0', 'mid', '72.0'],
       ['0.0', '0.0', '1.0', '0.0', 'mid', '83.0']], dtype='<U7')

In [42]:
# Put the data back into a DataFrame
data_le = pd.DataFrame(data)
data_le

Unnamed: 0,0,1,2,3,4,5
0,1.0,0.0,0.0,0.0,high,
1,0.0,0.0,1.0,0.0,low,
2,0.0,1.0,0.0,0.0,high,-1196.0
3,0.0,0.0,0.0,1.0,mid,72.0
4,0.0,0.0,1.0,0.0,mid,83.0


# Original handout code below (do not run)

In [None]:
# importing one hot encoder from sklearn 
# There are changes in OneHotEncoder class 
from sklearn.preprocessing import OneHotEncoder 
from sklearn.compose import ColumnTransformer 

# creating one hot encoder object with categorical feature 0 
# indicating the first column 
columnTransformer = ColumnTransformer([('encoder', 
                                        OneHotEncoder(), 
                                        [0])], 
                                      remainder='passthrough') 
data = np.array(columnTransformer.fit_transform(data), dtype = str) 
data

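# Method 2: Keras - label encoder + to_categorical
>to_categorical expects integers, so text data has to be converted to numbers with LabelEncoder first

Note: on the first attempt, installing keras conflicted with the environment set up for earlier courses, so the packages were reinstalled and adjusted. The record is kept in this assignment as a note for future reference.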

In [46]:
pip install keras

Collecting keras
  Downloading keras-2.6.0-py2.py3-none-any.whl (1.3 MB)
Installing collected packages: keras
Successfully installed keras-2.6.0
Note: you may need to restart the kernel to use updated packages.


In [1]:
pip install tensorflow

Collecting tensorflow
  Using cached tensorflow-2.6.0-cp38-cp38-win_amd64.whl (423.2 MB)
Collecting astunparse~=1.6.3
  Using cached astunparse-1.6.3-py2.py3-none-any.whl (12 kB)
Collecting gast==0.4.0
  Using cached gast-0.4.0-py3-none-any.whl (9.8 kB)
Collecting tensorboard~=2.6
  Using cached tensorboard-2.6.0-py3-none-any.whl (5.6 MB)
Collecting flatbuffers~=1.12.0
  Using cached flatbuffers-1.12-py2.py3-none-any.whl (15 kB)
Collecting google-pasta~=0.2
  Using cached google_pasta-0.2.0-py3-none-any.whl (57 kB)
Collecting h5py~=3.1.0
  Using cached h5py-3.1.0-cp38-cp38-win_amd64.whl (2.7 MB)
Processing c:\users\user\appdata\local\pip\cache\wheels\f1\60\77\22b9b5887bd47801796a856f47650d9789c74dc3161a26d608\clang-5.0-py3-none-any.whl
Processing c:\users\user\appdata\local\pip\cache\wheels\5f\fd\9e\b6cf5890494cb8ef0b5eaff72e5d55a70fb56316007d6dfe73\wrapt-1.12.1-py3-none-any.whl
Collecting keras-preprocessing~=1.1.2
  Using cached Keras_Preprocessing-1.1.2-py2.py3-none-any.whl (42 kB)


In [2]:
from sklearn.preprocessing import LabelEncoder
from keras.utils import np_utils


In [4]:
import pandas as pd
import numpy as np

In [5]:
df = pd.DataFrame({'blood':['A','B','AB','O','B'], 
                   'Y':['high','low','high','mid','mid'],
                   'Z':[np.nan,np.nan,-1196,72,83]});

In [6]:
# label encoder
# Assign the LabelEncoder tool to a variable named encoder
encoder = LabelEncoder()
# Transform the "blood" column.
encoded_Y = encoder.fit_transform(df["blood"])
print(encoded_Y)
# Write the transformed data back into the DataFrame.
df["blood"] = encoded_Y
df

[0 2 1 3 2]


Unnamed: 0,blood,Y,Z
0,0,high,
1,2,low,
2,1,high,-1196.0
3,3,mid,72.0
4,2,mid,83.0


In [7]:
# convert integers to one hot encoding
keras_onehot = np_utils.to_categorical(encoded_Y)
keras_onehot

array([[1., 0., 0., 0.],
       [0., 0., 1., 0.],
       [0., 1., 0., 0.],
       [0., 0., 0., 1.],
       [0., 0., 1., 0.]], dtype=float32)
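
The same array can be produced in plain NumPy, which may be handy when Keras is not installed. This is a sketch of an equivalent computation, not Keras's own implementation; `num_classes` and `numpy_onehot` are illustrative names.

```python
import numpy as np

encoded_Y = np.array([0, 2, 1, 3, 2])

# The number of classes is taken as the largest label plus one.
num_classes = encoded_Y.max() + 1
# Row i of the identity matrix is the one-hot vector for label i.
numpy_onehot = np.eye(num_classes, dtype="float32")[encoded_Y]

print(numpy_onehot)
```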

In [None]:
from sklearn.preprocessing import LabelEncoder
from keras.utils import np_utils

df = pd.DataFrame({'blood':['A','B','AB','O','B'], 
                   'Y':['high','low','high','mid','mid'],
                   'Z':[np.nan,np.nan,-1196,72,83]});

# label encoder 
encoder = LabelEncoder()

# convert integers to one hot encoding




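## Method 3: pd.get_dummies
![](images/Encoding_pd.PNG)

`pd.get_dummies(df)`

>By default get_dummies encodes string columns directly and leaves numeric columns unchanged; a numeric column is encoded only when it is listed in columns<br>
>If columns is not specified, every string (object or category) column is converted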

In [8]:
df = pd.DataFrame({'blood':['A','B','AB','O','B'], 
                   'Y':['high','low','high','mid','mid'],
                   'Z':[np.nan,np.nan,-1196,72,83]})

In [9]:
df1 = pd.get_dummies(df)
print(df1)

        Z  blood_A  blood_AB  blood_B  blood_O  Y_high  Y_low  Y_mid
0     NaN        1         0        0        0       1      0      0
1     NaN        0         0        1        0       0      1      0
2 -1196.0        0         1        0        0       1      0      0
3    72.0        0         0        0        1       0      0      1
4    83.0        0         0        1        0       0      0      1


In [10]:
df2 = pd.get_dummies(df.blood)
print(df2)

   A  AB  B  O
0  1   0  0  0
1  0   0  1  0
2  0   1  0  0
3  0   0  0  1
4  0   0  1  0
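
The `columns` argument controls which columns are expanded, including numeric ones. In the sketch below, `grade` is a made-up numeric column added for illustration. Passing `dtype=int` asks for 0/1 output like the tables above, since pandas 2.0 and later return True/False by default.

```python
import pandas as pd

df_demo = pd.DataFrame({"blood": ["A", "B", "AB", "O", "B"],
                        "grade": [1, 2, 1, 3, 3]})  # "grade" is a hypothetical numeric column

# Default call: only the string column is expanded; "grade" is left unchanged.
default_dummies = pd.get_dummies(df_demo, dtype=int)

# Naming the columns explicitly also expands the numeric one.
all_dummies = pd.get_dummies(df_demo, columns=["blood", "grade"], dtype=int)

print(default_dummies.columns.tolist())
print(all_dummies.columns.tolist())
```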


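## Exercise 1: sklearn - label encoder + onehot encoder
In the data below, the country column contains only strings. Most models are built on mathematical operations, and strings cannot be fed into them directly.<br>
We therefore label-encode it first: import the LabelEncoder class from the sklearn library, run fit and transform on the first column, and replace the column with the result.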

In [None]:
import numpy as np
import pandas as pd
country=['Taiwan','Australia','Ireland','Australia','Ireland','Taiwan']
age=[25,30,45,35,22,36]
salary=[20000,32000,59000,60000,43000,52000]
dic={'Country':country,'Age':age,'Salary':salary}
data=pd.DataFrame(dic)
data

In [1]:
import numpy as np
import pandas as pd

In [2]:
df000 = pd.DataFrame({"country":['Taiwan','Australia','Ireland','Australia','Ireland','Taiwan'], 
                   "age":[25,30,45,35,22,36],
                   "salary":[20000,32000,59000,60000,43000,52000]});
df000

Unnamed: 0,country,age,salary
0,Taiwan,25,20000
1,Australia,30,32000
2,Ireland,45,59000
3,Australia,35,60000
4,Ireland,22,43000
5,Taiwan,36,52000


In [3]:
from sklearn.preprocessing import LabelEncoder

In [4]:
# Assign the LabelEncoder tool to a variable named encoder000
encoder000 = LabelEncoder()

In [6]:
# Transform the "country" column.
encoded000_Y = encoder000.fit_transform(df000["country"])
print(encoded000_Y)

[2 0 1 0 1 2]


In [7]:
# Write the transformed data back into the DataFrame.
df000["country"] = encoded000_Y
df000

Unnamed: 0,country,age,salary
0,2,25,20000
1,0,30,32000
2,1,45,59000
3,0,35,60000
4,1,22,43000
5,2,36,52000


In [8]:
from sklearn.preprocessing import OneHotEncoder
onehot = OneHotEncoder()
print(type(df000["country"]))

<class 'pandas.core.series.Series'>


In [9]:
# Convert to a NumPy array
d000 = np.array(df000["country"])
d000

array([2, 0, 1, 0, 1, 2])

In [11]:
onehot000_df = onehot.fit_transform(d000.reshape(-1,1))
onehot000_df

<6x3 sparse matrix of type '<class 'numpy.float64'>'
	with 6 stored elements in Compressed Sparse Row format>

In [12]:
onehot000_df = onehot.fit_transform(d000.reshape(-1,1)).toarray()
type(onehot000_df)

numpy.ndarray

In [13]:
onehot000_df

array([[0., 0., 1.],
       [1., 0., 0.],
       [0., 1., 0.],
       [1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

In [14]:
from sklearn.preprocessing import OneHotEncoder 
from sklearn.compose import ColumnTransformer 

In [15]:
columnTransformer = ColumnTransformer([('encoder000', 
                                        OneHotEncoder(), 
                                        [0])], 
                                      remainder='passthrough')

In [16]:
data000 = np.array(columnTransformer.fit_transform(df000), dtype = str) 
data000

array([['0.0', '0.0', '1.0', '25.0', '20000.0'],
       ['1.0', '0.0', '0.0', '30.0', '32000.0'],
       ['0.0', '1.0', '0.0', '45.0', '59000.0'],
       ['1.0', '0.0', '0.0', '35.0', '60000.0'],
       ['0.0', '1.0', '0.0', '22.0', '43000.0'],
       ['0.0', '0.0', '1.0', '36.0', '52000.0']], dtype='<U32')

In [17]:
data000_le = pd.DataFrame(data000)
data000_le

Unnamed: 0,0,1,2,3,4
0,0.0,0.0,1.0,25.0,20000.0
1,1.0,0.0,0.0,30.0,32000.0
2,0.0,1.0,0.0,45.0,59000.0
3,1.0,0.0,0.0,35.0,60000.0
4,0.0,1.0,0.0,22.0,43000.0
5,0.0,0.0,1.0,36.0,52000.0
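
The DataFrame above has anonymous columns 0 to 4 and, because of `dtype = str`, every value in it is a string. One possible way to keep readable column names and numeric types is sketched below with NumPy and pandas only. The category order in `categories` is assumed to be the sorted order used by the encoder, and `onehot_named` and `result` are illustrative names.

```python
import numpy as np
import pandas as pd

df000 = pd.DataFrame({"country": ["Taiwan", "Australia", "Ireland",
                                  "Australia", "Ireland", "Taiwan"],
                      "age": [25, 30, 45, 35, 22, 36],
                      "salary": [20000, 32000, 59000, 60000, 43000, 52000]})

categories = ["Australia", "Ireland", "Taiwan"]  # assumed: sorted order used for encoding
labels = np.array([2, 0, 1, 0, 1, 2])            # label-encoded country column
onehot = np.eye(len(categories), dtype=int)[labels]

# Name each one-hot column after the category it stands for.
onehot_named = pd.DataFrame(onehot, columns=[f"country_{c}" for c in categories])

# Join with the untouched numeric columns; their integer dtypes are preserved.
result = pd.concat([onehot_named, df000[["age", "salary"]]], axis=1)

print(result)
print(result.dtypes)
```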


## Exercise 2: Keras - label encoder + to_categorical

In [None]:
from sklearn.preprocessing import LabelEncoder
from keras.utils import np_utils

country=['Taiwan','Australia','Ireland','Australia','Ireland','Taiwan']
age=[25,30,45,35,22,36]
salary=[20000,32000,59000,60000,43000,52000]
dic={'Country':country,'Age':age,'Salary':salary}
data=pd.DataFrame(dic)
data

In [18]:
from sklearn.preprocessing import LabelEncoder
from keras.utils import np_utils

In [19]:
df000 = pd.DataFrame({"country":['Taiwan','Australia','Ireland','Australia','Ireland','Taiwan'], 
                   "age":[25,30,45,35,22,36],
                   "salary":[20000,32000,59000,60000,43000,52000]});
df000

Unnamed: 0,country,age,salary
0,Taiwan,25,20000
1,Australia,30,32000
2,Ireland,45,59000
3,Australia,35,60000
4,Ireland,22,43000
5,Taiwan,36,52000


In [20]:
# label encoder
# Assign the LabelEncoder tool to a variable named encoder000
encoder000 = LabelEncoder()
# Transform the "country" column.
encoded000_Y = encoder000.fit_transform(df000["country"])
print(encoded000_Y)
# Write the transformed data back into the DataFrame.
df000["country"] = encoded000_Y
df000

[2 0 1 0 1 2]


Unnamed: 0,country,age,salary
0,2,25,20000
1,0,30,32000
2,1,45,59000
3,0,35,60000
4,1,22,43000
5,2,36,52000


In [21]:
# convert integers to one hot encoding
keras_onehot = np_utils.to_categorical(encoded000_Y)
keras_onehot

array([[0., 0., 1.],
       [1., 0., 0.],
       [0., 1., 0.],
       [1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]], dtype=float32)

## Exercise 3: pandas.get_dummies
> get_dummies: by default only string columns are converted to a one-hot representation; if columns is not specified, all of them are converted.

In [22]:
df000 = pd.DataFrame({"country":['Taiwan','Australia','Ireland','Australia','Ireland','Taiwan'], 
                   "age":[25,30,45,35,22,36],
                   "salary":[20000,32000,59000,60000,43000,52000]});
df000

Unnamed: 0,country,age,salary
0,Taiwan,25,20000
1,Australia,30,32000
2,Ireland,45,59000
3,Australia,35,60000
4,Ireland,22,43000
5,Taiwan,36,52000


In [23]:
df001 = pd.get_dummies(df000)
print(df001)

   age  salary  country_Australia  country_Ireland  country_Taiwan
0   25   20000                  0                0               1
1   30   32000                  1                0               0
2   45   59000                  0                1               0
3   35   60000                  1                0               0
4   22   43000                  0                1               0
5   36   52000                  0                0               1


In [24]:
df002 = pd.get_dummies(df000.country)
print(df002)

   Australia  Ireland  Taiwan
0          0        0       1
1          1        0       0
2          0        1       0
3          1        0       0
4          0        1       0
5          0        0       1


In [None]:
country=['Taiwan','Australia','Ireland','Australia','Ireland','Taiwan']
age=[25,30,45,35,22,36]
salary=[20000,32000,59000,60000,43000,52000]
dic={'Country':country,'Age':age,'Salary':salary}
data=pd.DataFrame(dic)
data