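# One-Hot Encoding
- One-hot encoding represents words (or categories) as vectors in which any two distinct words are orthogonal.
- Because the vectors are orthogonal, no word influences another: the encoding implies no ordering and no closeness between categories.
- This property is both the strength and the weakness of one-hot encoding: since the vectors of different words are orthogonal, they cannot express any relationship between words.
- The encoding only records whether a word appears in a sentence; it cannot measure how important the word is.
- One-hot encoding produces a high-dimensional sparse matrix, which wastes computation and storage.
- This example shows how to represent countries with one-hot encoding.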
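
The orthogonality and sparsity described above can be checked numerically. The following is a minimal sketch, assuming the three countries used later in this notebook; the variable names (`words`, `one_hot`, `gram`, `sparsity`) are illustrative only.

```python
import numpy as np

# Three categories; row i of the identity matrix is the one-hot vector of words[i]
words = ["Australia", "Ireland", "Taiwan"]
one_hot = np.eye(len(words), dtype=int)

# Gram matrix of pairwise dot products: 1 on the diagonal (a word with itself),
# 0 everywhere else (distinct words are orthogonal)
gram = one_hot @ one_hot.T
print(gram)

# Fraction of entries that are zero; it approaches 1 as the vocabulary grows
sparsity = 1 - one_hot.sum() / one_hot.size
print(f"fraction of zeros: {sparsity:.2f}")
```

With only three categories two thirds of the entries are already zero; with a vocabulary of size N the fraction is (N - 1) / N.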

In [None]:
import numpy as np      # country, age, and salary of each person
import pandas as pd
country=['Taiwan','Australia','Ireland','Australia','Ireland','Taiwan']
age=[25,30,45,35,22,36]
salary=[20000,32000,59000,60000,43000,52000]
dic={'Country':country,'Age':age,'Salary':salary}
data=pd.DataFrame(dic)
data

Unnamed: 0,Country,Age,Salary
0,Taiwan,25,20000
1,Australia,30,32000
2,Ireland,45,59000
3,Australia,35,60000
4,Ireland,22,43000
5,Taiwan,36,52000


## Label encoding
Use `LabelEncoder` to label-encode the Country column. The classes are sorted alphabetically and numbered from 0:<br>
Australia: 0<br>
Ireland: 1<br>
Taiwan: 2<br>
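
The mapping listed above does not need to be worked out by hand. As a small sketch, assuming the same `country` list as in this notebook, the fitted encoder exposes it through its `classes_` attribute, and `inverse_transform` recovers the original labels (the names `le`, `mapping`, and `decoded` are illustrative):

```python
from sklearn.preprocessing import LabelEncoder

country = ['Taiwan', 'Australia', 'Ireland', 'Australia', 'Ireland', 'Taiwan']

le = LabelEncoder()
codes = le.fit_transform(country)

# classes_ holds the sorted labels; the position of each label is its integer code
mapping = {str(name): code for code, name in enumerate(le.classes_)}
print(mapping)

# inverse_transform maps the integer codes back to the original strings
decoded = le.inverse_transform(codes)
print(list(decoded))
```

Note that the integers impose an artificial order (Australia < Ireland < Taiwan), which is one reason to prefer one-hot encoding for nominal categories.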

In [None]:
from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()
data_le=pd.DataFrame(dic)
data_le['Country'] = labelencoder.fit_transform(data['Country'])
data_le

Unnamed: 0,Country,Age,Salary
0,2,25,20000
1,0,30,32000
2,1,45,59000
3,0,35,60000
4,1,22,43000
5,2,36,52000


## One hot encoding
Use `OneHotEncoder` to one-hot encode the Country column. The encoded columns appear in the following order:
<table><tr><td>Australia(0)</td><td>Ireland(1)</td><td>Taiwan(2)</td></tr></table>

In [None]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
# One-hot encode column 0 (Country); pass Age and Salary through unchanged
ct = ColumnTransformer([("Country", OneHotEncoder(), [0])], remainder = 'passthrough')
data_str_ohe=ct.fit_transform(data_le)
pd.DataFrame(data_str_ohe)

Unnamed: 0,0,1,2,3,4
0,0.0,0.0,1.0,25.0,20000.0
1,1.0,0.0,0.0,30.0,32000.0
2,0.0,1.0,0.0,45.0,59000.0
3,1.0,0.0,0.0,35.0,60000.0
4,0.0,1.0,0.0,22.0,43000.0
5,0.0,0.0,1.0,36.0,52000.0
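
The cell above encodes the label-encoded column, but `OneHotEncoder` also accepts string columns directly, so the label-encoding step can be skipped. The following sketch assumes the same data as this notebook and uses `categories_` to recover the column names; `encoder`, `encoded`, and `ohe_df` are illustrative names.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

data = pd.DataFrame({
    'Country': ['Taiwan', 'Australia', 'Ireland', 'Australia', 'Ireland', 'Taiwan'],
    'Age': [25, 30, 45, 35, 22, 36],
    'Salary': [20000, 32000, 59000, 60000, 43000, 52000],
})

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; toarray() makes it dense
encoded = encoder.fit_transform(data[['Country']]).toarray()

# categories_ holds one array of category names per encoded input column
categories = list(encoder.categories_[0])
ohe_df = pd.DataFrame(encoded, columns=categories)
print(ohe_df)
```

Keeping the result sparse (that is, omitting `toarray()`) is usually preferable when the number of categories is large.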


## pandas.get_dummies

In [None]:
# dtype=int keeps the 0/1 output shown below (pandas 2.0 and later default to booleans)
data_dum = pd.get_dummies(data, dtype=int)
data_dum

Unnamed: 0,Age,Salary,Country_Australia,Country_Ireland,Country_Taiwan
0,25,20000,0,0,1
1,30,32000,1,0,0
2,45,59000,0,1,0
3,35,60000,1,0,0
4,22,43000,0,1,0
5,36,52000,0,0,1
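
To connect the sections above: a one-hot matrix is the identity matrix indexed by the integer label codes. The sketch below, which assumes the same `country` list and uses `pd.Categorical` only as a convenient way to obtain sorted codes, checks that this manual construction matches the output of `pd.get_dummies`.

```python
import numpy as np
import pandas as pd

country = ['Taiwan', 'Australia', 'Ireland', 'Australia', 'Ireland', 'Taiwan']

# get_dummies on a Series: one 0/1 column per distinct value, sorted by name
dummies = pd.get_dummies(pd.Series(country), dtype=int)

# Manual construction: integer code per row, then pick that row of the identity matrix
cat = pd.Categorical(country)
manual = np.eye(len(cat.categories), dtype=int)[cat.codes]

same = np.array_equal(dummies.to_numpy(), manual)
print(same)
```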
