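# Data Preprocessing (Label Encoding, One-Hot Encoding)
Both encodings convert categorical or text data into numbers so that a program can process and compute on it.
> Label encoding: maps each category to an integer; no new columns are added.

> One-hot encoding: adds one column per category and uses 0/1 to indicate whether a row belongs to that category.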

![](images/Encoder.PNG)
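
To make the difference concrete, here is a minimal sketch that builds both encodings by hand with pandas, without any encoder class. The variable names (`mapping`, `label_encoded`, `one_hot`) are only illustrative, and the categories are sorted alphabetically to match what the library encoders later in this notebook do.

```python
import pandas as pd

blood = pd.Series(['A', 'B', 'AB', 'O', 'B'])

# Label encoding: one integer per category, still a single column.
categories = sorted(blood.unique())            # ['A', 'AB', 'B', 'O']
mapping = {cat: i for i, cat in enumerate(categories)}
label_encoded = blood.map(mapping)

# One-hot encoding: one 0/1 column per category.
one_hot = pd.DataFrame({cat: (blood == cat).astype(int) for cat in categories})

print(label_encoded.tolist())   # [0, 2, 1, 3, 2]
print(one_hot)
```

Each row of `one_hot` contains exactly one 1, in the column of the category that row belongs to.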


## Encoding Categorical features (or label)
![](images/Encoding.PNG)


In [1]:
import pandas as pd
import numpy as np


In [2]:
df = pd.DataFrame({'blood':['A','B','AB','O','B'], 
                   'Y':['high','low','high','mid','mid'],
                   'Z':[np.nan,np.nan,-1196,72,83]});
df

Unnamed: 0,blood,Y,Z
0,A,high,
1,B,low,
2,AB,high,-1196.0
3,O,mid,72.0
4,B,mid,83.0


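# Method 1: sklearn - LabelEncoder + OneHotEncoder
>OneHotEncoder expects a 2D array, so a 1D column must first be reshaped with reshape(-1,1).<br>
>scikit-learn versions before 0.20 required numeric input for OneHotEncoder, so text data was first converted to numbers with LabelEncoder; current versions also accept strings directly.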
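
The 2D requirement can be checked with a small sketch. It assumes a reasonably recent scikit-learn (0.20 or later), where OneHotEncoder accepts string categories; the flag `rejected_1d` is only an illustrative name.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

blood = np.array(['A', 'B', 'AB', 'O', 'B'])   # shape (5,): a 1D array

encoder = OneHotEncoder()

# A 1D array is rejected: scikit-learn raises ValueError for it.
try:
    encoder.fit_transform(blood)
    rejected_1d = False
except ValueError:
    rejected_1d = True

# reshape(-1, 1) turns the data into a single column with 5 rows.
blood_2d = blood.reshape(-1, 1)
encoded = encoder.fit_transform(blood_2d).toarray()

print(rejected_1d)
print(blood_2d.shape)
print(encoded.shape)
```

The `-1` tells NumPy to infer the number of rows, so the same call works for a column of any length.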

In [6]:
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoder_y = encoder.fit_transform(df['blood'])
df['blood'] = encoder_y
print(encoder_y)
df

[0 2 1 3 2]


Unnamed: 0,blood,Y,Z
0,0,high,
1,2,low,
2,1,high,-1196.0
3,3,mid,72.0
4,2,mid,83.0
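
To see which integer stands for which blood type, the fitted encoder can be inspected. The sketch below uses `classes_` and `inverse_transform` from LabelEncoder; the `mapping` dictionary is only built for display.

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
codes = encoder.fit_transform(['A', 'B', 'AB', 'O', 'B'])

# classes_ holds the sorted categories; a category's position is its code.
mapping = {str(label): int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform maps the integer codes back to the original labels.
restored = encoder.inverse_transform(codes)
print(list(restored))
```

Because the classes are sorted, AB receives code 1 and B receives code 2, which explains the `[0 2 1 3 2]` output above.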


In [12]:
from sklearn.preprocessing import OneHotEncoder

onehot = OneHotEncoder()



d = np.array(df['blood'])
print(type(df['blood']))

d.shape

onehot_df= onehot.fit_transform(d.reshape(-1,1)).toarray()

onehot_df


<class 'pandas.core.series.Series'>


array([[1., 0., 0., 0.],
       [0., 0., 1., 0.],
       [0., 1., 0., 0.],
       [0., 0., 0., 1.],
       [0., 0., 1., 0.]])

## One hot encoding
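One-hot encoding splits a categorical column into several columns, one per category. Each value is replaced by 1 or 0: a row gets 1 in the column of the category it belongs to and 0 in all the others.

When encoding a specific column, <b>scikit-learn versions before 0.20 could not one-hot encode strings directly, so the strings had to be replaced by numbers with label encoding before one-hot encoding.</b> Current versions accept strings; the label-encoded data is used here to continue the example above.

> categorical_features = [0]: in older scikit-learn versions, the column index (0) of the data to one-hot encode. This parameter has since been removed from OneHotEncoder; the code below passes the same index list `[0]` to ColumnTransformer instead.

> data_le: the data that has been through label encoding (in the code below, the DataFrame built from the ColumnTransformer output). Note: OneHotEncoder requires a 2D array as input, whereas LabelEncoder returns a 1D array.


By default OneHotEncoder returns a SciPy sparse matrix (`scipy.sparse.csr_matrix`); call `.toarray()` to convert it to a regular array.
In the result, column 0 represents A, column 1 represents AB, column 2 represents B, and column 3 represents O, following the integer codes assigned by the label encoder.
One-hot encoding can also be applied to numeric data, in which case the data does not need to be label encoded first.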
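
As a small sketch of the last two points, the example below one-hot encodes a numeric column directly and converts the sparse result to a dense array. The `sizes` data is hypothetical and chosen only for illustration.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

# Hypothetical numeric column: no label encoding is needed beforehand.
sizes = np.array([10, 20, 10, 30]).reshape(-1, 1)

encoder = OneHotEncoder()
encoded = encoder.fit_transform(sizes)

print(sparse.issparse(encoded))   # the default output is a sparse matrix
dense = encoded.toarray()         # convert to a regular NumPy array
print(dense)

# categories_[0][j] is the category represented by output column j.
print(encoder.categories_[0])
```

Reading `categories_` is a safer way to find out which column stands for which category than guessing from the output.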

```python
# importing one hot encoder from sklearn 
# There are changes in OneHotEncoder class 
from sklearn.preprocessing import OneHotEncoder 
from sklearn.compose import ColumnTransformer 
   
# creating one hot encoder object with categorical feature 0 
# indicating the first column 
columnTransformer = ColumnTransformer([('encoder', 
                                        OneHotEncoder(), 
                                        [0])], 
                                      remainder='passthrough') 
  
data = np.array(columnTransformer.fit_transform(data), dtype = str) 
```

In [27]:
# importing one hot encoder from sklearn 
# There are changes in OneHotEncoder class 
from sklearn.preprocessing import OneHotEncoder 
from sklearn.compose import ColumnTransformer 

# creating one hot encoder object with categorical feature 0 
# indicating the first column 
columnTransformer = ColumnTransformer([('encoder', 
                                        OneHotEncoder(), 
                                        [0])],
                                     remainder='passthrough')

data = np.array(columnTransformer.fit_transform(df), dtype =str)

data

data_le = pd.DataFrame(data)
data_le

Unnamed: 0,0,1,2,3,4,5
0,1.0,0.0,0.0,0.0,high,
1,0.0,0.0,1.0,0.0,low,
2,0.0,1.0,0.0,0.0,high,-1196.0
3,0.0,0.0,0.0,1.0,mid,72.0
4,0.0,0.0,1.0,0.0,mid,83.0


# Method 2: Keras - LabelEncoder + to_categorical
>to_categorical requires integer input, so text data must first be converted to numbers with LabelEncoder.

In [30]:
conda update conda


In [33]:
pip install tensorflow

Note: you may need to restart the kernel to use updated packages.


In [35]:
from sklearn.preprocessing import LabelEncoder
# to_categorical lives in tensorflow.keras.utils (np_utils is not available in recent Keras releases)
from tensorflow.keras.utils import to_categorical

df = pd.DataFrame({'blood':['A','B','AB','O','B'], 
                   'Y':['high','low','high','mid','mid'],
                   'Z':[np.nan,np.nan,-1196,72,83]})

# label encoder: strings -> integer codes
encoder = LabelEncoder()
encoder_y = encoder.fit_transform(df['blood'])
df['blood'] = encoder_y
print(encoder_y)

# convert integers to one hot encoding
keras_onehot = to_categorical(encoder_y)

keras_onehot

[0 2 1 3 2]


array([[1., 0., 0., 0.],
       [0., 0., 1., 0.],
       [0., 1., 0., 0.],
       [0., 0., 0., 1.],
       [0., 0., 1., 0.]], dtype=float32)
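
For integer labels like these, the result of to_categorical can be reproduced with plain NumPy by indexing an identity matrix. This is only an illustrative sketch of the idea, not how Keras implements it.

```python
import numpy as np

labels = np.array([0, 2, 1, 3, 2])     # integer codes from the label encoder
num_classes = labels.max() + 1          # 4 classes: 0..3

# Row i of the identity matrix is the one-hot vector for class i.
one_hot = np.eye(num_classes, dtype=np.float32)[labels]
print(one_hot)
```

This also shows why the input has to be integers: each label is used as a row index.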

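## Method 3: pd.get_dummies
![](images/Encoding_pd.PNG)
pd.get_dummies(df)
>get_dummies encodes string columns directly; numeric columns are left unchanged unless they are listed in `columns`.<br>
>If `columns` is not specified, every string (object or category dtype) column is converted.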

In [38]:
df = pd.DataFrame({'blood':['A','B','AB','O','B'], 
                   'Y':['high','low','high','mid','mid'],
                   'Z':[np.nan,np.nan,-1196,72,83]})


df1 = pd.get_dummies(df)
df1

df2 = pd.get_dummies(df.blood)
df2

Unnamed: 0,A,AB,B,O
0,1,0,0,0
1,0,0,1,0
2,0,1,0,0
3,0,0,0,1
4,0,0,1,0
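
The effect of the `columns` argument can be seen in the following sketch. The `grade` column is a hypothetical numeric category added for illustration. `dtype=int` is passed because recent pandas versions, as far as I know, return boolean columns by default rather than 0/1.

```python
import pandas as pd

# Hypothetical frame with one string column and one numeric category code.
df = pd.DataFrame({'blood': ['A', 'B', 'AB'], 'grade': [1, 2, 1]})

# Default: only the string column is expanded; 'grade' is left as it is.
default = pd.get_dummies(df, dtype=int)
print(default.columns.tolist())

# Listing the columns explicitly encodes the numeric column as well.
forced = pd.get_dummies(df, columns=['blood', 'grade'], dtype=int)
print(forced.columns.tolist())
```

The new column names are built as `<original column>_<category>`, which keeps the origin of each dummy column visible.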


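## Exercise 1: sklearn - LabelEncoder + OneHotEncoder
In the data below, the Country column contains only strings. Most models are built on mathematical operations, and strings cannot be fed into them directly.<br>
The column therefore has to be encoded first. scikit-learn versions before 0.20 required label encoding it with the LabelEncoder class before one-hot encoding; the code below applies OneHotEncoder to the first column (index 0) through ColumnTransformer, which accepts strings directly.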

In [46]:
import numpy as np
import pandas as pd
country=['Taiwan','Australia','Ireland','Australia','Ireland','Taiwan']
age=[25,30,45,35,22,36]
salary=[20000,32000,59000,60000,43000,52000]
dic={'Country':country,'Age':age,'Salary':salary}
data=pd.DataFrame(dic)
data


# importing one hot encoder from sklearn 
# There are changes in OneHotEncoder class 
from sklearn.preprocessing import OneHotEncoder 
from sklearn.compose import ColumnTransformer 

# creating one hot encoder object with categorical feature 0 
# indicating the first column 
columnTransformer = ColumnTransformer([('encoder', 
                                        OneHotEncoder(), 
                                        [0])],
                                     remainder='passthrough')

data = np.array(columnTransformer.fit_transform(data), dtype =str)

data

data_le = pd.DataFrame(data)
data_le

Unnamed: 0,0,1,2,3,4
0,0.0,0.0,1.0,25.0,20000.0
1,1.0,0.0,0.0,30.0,32000.0
2,0.0,1.0,0.0,45.0,59000.0
3,1.0,0.0,0.0,35.0,60000.0
4,0.0,1.0,0.0,22.0,43000.0
5,0.0,0.0,1.0,36.0,52000.0
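
The result above has anonymous column numbers, and converting with `dtype=str` turns every value into a string. A possible variant, sketched below, keeps the values numeric and rebuilds readable column names from the fitted encoder. The `Country_` prefix and the names `transformer` and `labelled` are my own choices, not part of the exercise.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

data = pd.DataFrame({'Country': ['Taiwan', 'Australia', 'Ireland',
                                 'Australia', 'Ireland', 'Taiwan'],
                     'Age': [25, 30, 45, 35, 22, 36],
                     'Salary': [20000, 32000, 59000, 60000, 43000, 52000]})

# sparse_threshold=0 asks ColumnTransformer to always return a dense array.
transformer = ColumnTransformer([('encoder', OneHotEncoder(), [0])],
                                remainder='passthrough',
                                sparse_threshold=0)
encoded = transformer.fit_transform(data)

# Categories learned for the encoded column, in output-column order.
countries = transformer.named_transformers_['encoder'].categories_[0]
columns = ['Country_' + c for c in countries] + ['Age', 'Salary']

labelled = pd.DataFrame(encoded, columns=columns)
print(labelled)
```

The one-hot columns come first and the passthrough columns (Age, Salary) follow in their original order, which matches the layout of the output above.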


## Exercise 2: Keras - LabelEncoder + to_categorical

In [51]:
from sklearn.preprocessing import LabelEncoder
# to_categorical lives in tensorflow.keras.utils (np_utils is not available in recent Keras releases)
from tensorflow.keras.utils import to_categorical

country=['Taiwan','Australia','Ireland','Australia','Ireland','Taiwan']
age=[25,30,45,35,22,36]
salary=[20000,32000,59000,60000,43000,52000]
dic={'Country':country,'Age':age,'Salary':salary}
data1=pd.DataFrame(dic)

# label encoder: strings -> integer codes
encoder = LabelEncoder()
encoder_c = encoder.fit_transform(data1['Country'])
# write the integer codes back into the DataFrame
data1['Country'] = encoder_c
print(encoder_c)

# convert integers to one hot encoding
keras_onehot = to_categorical(encoder_c)

keras_onehot

[2 0 1 0 1 2]


array([[0., 0., 1.],
       [1., 0., 0.],
       [0., 1., 0.],
       [1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]], dtype=float32)
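
Going the other way, a one-hot matrix can be turned back into the original country names with `argmax` followed by `inverse_transform`. In this sketch `np.eye` stands in for to_categorical so that the example runs without Keras; that substitution is my own assumption for illustration.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

country = ['Taiwan', 'Australia', 'Ireland', 'Australia', 'Ireland', 'Taiwan']

encoder = LabelEncoder()
codes = encoder.fit_transform(country)            # [2 0 1 0 1 2]
one_hot = np.eye(len(encoder.classes_))[codes]    # NumPy stand-in for to_categorical

# argmax finds the position of the 1 in each row, i.e. the integer label.
recovered_codes = one_hot.argmax(axis=1)
recovered = encoder.inverse_transform(recovered_codes)
print(list(recovered))
```

This only works if the same fitted `encoder` is kept around, since it holds the mapping between integers and country names.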

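## Exercise 3: pandas.get_dummies
> get_dummies: by default only string (object or category dtype) columns are one-hot encoded; if `columns` is not specified, all such columns are converted.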

In [52]:
country=['Taiwan','Australia','Ireland','Australia','Ireland','Taiwan']
age=[25,30,45,35,22,36]
salary=[20000,32000,59000,60000,43000,52000]
dic={'Country':country,'Age':age,'Salary':salary}
data1=pd.DataFrame(dic)
data1





data2 = pd.get_dummies(data1)
data2



Unnamed: 0,Age,Salary,Country_Australia,Country_Ireland,Country_Taiwan
0,25,20000,0,0,1
1,30,32000,1,0,0
2,45,59000,0,1,0
3,35,60000,1,0,0
4,22,43000,0,1,0
5,36,52000,0,0,1
