# DAY 22

***
## (Kaggle) Titanic Survival Prediction
https://www.kaggle.com/c/titanic

## [Assignment Goal]
- Following the example's approach, observe the effect of label encoding and one-hot encoding on Titanic survival prediction

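## [Assignment Focus]
- Answer with the observations made in the example
- Observe how label encoding and one-hot encoding each affect the number of features, the logistic regression score, and the logistic regression run time (In[3], Out[3], In[4], Out[4])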
***
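## [References]

**Data preprocessing: One-Hot Encoding and LabelEncoder: https://www.twblogs.net/a/5baab6e32b7177781a0e6859/zh-cn/**

- One-hot encoding and label encoding are the most common ways to encode categorical data, so the code that implements them is used frequently. The page above clearly demonstrates two common approaches to one-hot encoding: pandas.get_dummies and sklearn's OneHotEncoder. Today's example in this course uses the former, and the later leaf encoding lesson uses the latter, so it is worth learning how both are written beforehand.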
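The two one-hot approaches named above can be compared side by side. The following is a minimal sketch on a made-up `Embarked` column (the `toy` DataFrame is hypothetical, not the Titanic data), and it only illustrates the call pattern of each API:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy column, used only for illustration
toy = pd.DataFrame({'Embarked': ['S', 'C', 'S', 'Q']})

# pandas: returns a DataFrame with one indicator column per category
dummies = pd.get_dummies(toy)
print(dummies.columns.tolist())

# scikit-learn: fit learns the categories, transform returns a sparse matrix by default
encoder = OneHotEncoder(handle_unknown='ignore')
encoded = encoder.fit_transform(toy[['Embarked']]).toarray()
print(encoder.categories_[0].tolist())
print(encoded.shape)
```

One practical difference: a fitted `OneHotEncoder` remembers its categories and can be applied to new data later, whereas `pd.get_dummies` derives the columns from whatever DataFrame it is given.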

In [1]:
# Preparation before feature engineering (same as the previous example)
import pandas as pd
import numpy as np
import copy, time
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder

data_path = 'data/'
df_train = pd.read_csv(data_path + 'titanic_train.csv')
df_test = pd.read_csv(data_path + 'titanic_test.csv')

train_Y = df_train['Survived']
ids = df_test['PassengerId']
df_train = df_train.drop(['PassengerId', 'Survived'] , axis=1)
df_test = df_test.drop(['PassengerId'] , axis=1)
df = pd.concat([df_train,df_test])
df.head()

| | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [2]:
# Keep only the categorical (object) columns, stored in object_features
object_features = []
for dtype, feature in zip(df.dtypes, df.columns):
    if dtype == 'object':
        object_features.append(feature)
print(f'{len(object_features)} Object Features : {object_features}\n')

# Keep only the categorical columns
df = df[object_features]
df = df.fillna('None')
train_num = train_Y.shape[0]
df.head()

5 Object Features : ['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']



| | Name | Sex | Ticket | Cabin | Embarked |
|---|---|---|---|---|---|
| 0 | Braund, Mr. Owen Harris | male | A/5 21171 | None | S |
| 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | PC 17599 | C85 | C |
| 2 | Heikkinen, Miss. Laina | female | STON/O2. 3101282 | None | S |
| 3 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 113803 | C123 | S |
| 4 | Allen, Mr. William Henry | male | 373450 | None | S |
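The column selection in In [2] can also be written without an explicit loop. The sketch below is an alternative, not the course's own method: it assumes that "every non-numeric column" is the set wanted (which matches the five columns above), and it uses a small hypothetical `toy` DataFrame instead of the Titanic files.

```python
import pandas as pd

# Hypothetical stand-in for the concatenated train/test DataFrame
toy = pd.DataFrame({
    'Name': ['Braund', 'Cumings'],
    'Age': [22.0, 38.0],
    'Cabin': [None, 'C85'],
})

# Select every non-numeric column in one call
object_features = toy.select_dtypes(exclude='number').columns.tolist()
print(object_features)

# Keep only those columns and mark missing values with the string 'None'
toy_cat = toy[object_features].fillna('None')
print(toy_cat)
```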


## Assignment
**In the Titanic example, how do label encoding and one-hot encoding each affect the prediction results? (Hint: refer to today's example)**

In [3]:
# Label encoding (LabelEncoder) + logistic regression (LogisticRegression)
df_le = copy.deepcopy(df)
le = LabelEncoder()
print(f'Original df_le : {df_le.shape}')
print(f'Original df_le : \n {df_le.head()} \n')

for col in df_le.columns:
    df_le[col] = le.fit_transform(df_le[col])
    
print(f'After LE of df_le : {df_le.shape}')
print(f'After df_le : \n {df_le.head()} \n')

train_X_le = df_le[:train_num]
print(f'train_X_le : {train_X_le.shape}')
estimator = LogisticRegression(solver = 'lbfgs')
start = time.time()
print(f'Score of LabelEncoder : {cross_val_score(estimator, train_X_le, train_Y, cv=5).mean()}')
duration_le = time.time() - start
print(f'Duration of LabelEncoder : { duration_le * 1000} ms')

Original df_le : (1309, 5)
Original df_le : 
                                                 Name     Sex  \
0                            Braund, Mr. Owen Harris    male   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female   
2                             Heikkinen, Miss. Laina  female   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female   
4                           Allen, Mr. William Henry    male   

             Ticket Cabin Embarked  
0         A/5 21171  None        S  
1          PC 17599   C85        C  
2  STON/O2. 3101282  None        S  
3            113803  C123        S  
4            373450  None        S   

After LE of df_le : (1309, 5)
After df_le : 
    Name  Sex  Ticket  Cabin  Embarked
0   155    1     720    185         3
1   286    0     816    106         0
2   523    0     914    185         3
3   422    0      65     70         3
4    22    1     649    185         3 

train_X_le : (891, 5)
Score of LabelEncoder : 0.7789000729487724
Duration of LabelEncoder : 73.28152656555176 ms
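To read the integers in the encoded table, it helps to know that LabelEncoder numbers the categories in sorted order. A minimal sketch on a hypothetical list of `Embarked`-style values (including the `'None'` filler used above):

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical values, chosen to resemble the Embarked column after fillna('None')
values = ['S', 'C', 'S', 'Q', 'None']

le = LabelEncoder()
codes = le.fit_transform(values)

print(le.classes_.tolist())  # categories in sorted order
print(codes.tolist())        # position of each value in that sorted list
print(le.inverse_transform(codes).tolist())  # round trip back to the labels
```

The resulting order (`C` < `None` < `Q` < `S`) is purely alphabetical and carries no meaning, yet a linear model such as logistic regression treats the codes as magnitudes. This may be one reason label encoding scores lower here, although the notebook itself does not test that explanation.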

In [4]:
# One-hot encoding (One Hot Encoder) + logistic regression (LogisticRegression)
df_temp = copy.deepcopy(df)

print(f'Original df_temp : {df_temp.shape}')
print(f'Original df_temp : \n {df_temp.head()} \n')

df_ohe = pd.get_dummies(df_temp)
print(f'After OHE of df_ohe : {df_ohe.shape}')
print(f'After df_ohe : \n {df_ohe.head()} \n')

train_X_ohe = df_ohe[:train_num]
print(f'train_X_ohe : {train_X_ohe.shape}')
estimator = LogisticRegression(solver = 'lbfgs')
start = time.time()
print(f'Score of OneHotEncoder : {cross_val_score(estimator, train_X_ohe, train_Y, cv=5).mean()}')
duration_ohe = time.time() - start
print(f'Duration of OneHotEncoder : { duration_ohe * 1000} ms')

Original df_temp : (1309, 5)
Original df_temp : 
                                                 Name     Sex  \
0                            Braund, Mr. Owen Harris    male   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female   
2                             Heikkinen, Miss. Laina  female   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female   
4                           Allen, Mr. William Henry    male   

             Ticket Cabin Embarked  
0         A/5 21171  None        S  
1          PC 17599   C85        C  
2  STON/O2. 3101282  None        S  
3            113803  C123        S  
4            373450  None        S   

After OHE of df_ohe : (1309, 2429)
After df_ohe : 
    Name_Abbing, Mr. Anthony  Name_Abbott, Master. Eugene Joseph  \
0                         0                                   0   
1                         0                                   0   
2                         0                                   0   
3                    
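The jump from 5 to 2429 columns follows from how `pd.get_dummies` works: each string column becomes one indicator column per distinct value, so near-unique columns such as `Name` and `Ticket` presumably account for most of the growth. A small sketch with hypothetical data shows how the column count can be predicted before encoding:

```python
import pandas as pd

# Hypothetical data: one low-cardinality and one all-unique column
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male'],
    'Ticket': ['A1', 'B2', 'C3', 'D4'],
})

# One indicator column per distinct value in each column: 2 + 4
expected_columns = int(toy.nunique().sum())

toy_ohe = pd.get_dummies(toy)
print(expected_columns, toy_ohe.shape)
```

Running `df.nunique()` on the real DataFrame before encoding would show which columns drive the feature count.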

In [6]:
print(f'train_X_le : {train_X_le.shape}')
print(f'train_X_ohe : {train_X_ohe.shape}')
print(f'Score of LabelEncoder : {cross_val_score(estimator, train_X_le, train_Y, cv=5).mean()}')
print(f'Score of OneHotEncoder : {cross_val_score(estimator, train_X_ohe, train_Y, cv=5).mean()}')
print(f'Duration of LabelEncoder : { duration_le * 1000} ms')
print(f'Duration of OneHotEncoder : { duration_ohe * 1000} ms')

train_X_le : (891, 5)
train_X_ohe : (891, 2429)
Score of LabelEncoder : 0.7789000729487724
Score of OneHotEncoder : 0.8013346043513216
Duration of LabelEncoder : 73.28152656555176 ms
Duration of OneHotEncoder : 488.2171154022217 ms


**From the results above, we can see that:**
- One-hot encoding (via pd.get_dummies) created **far more columns** than LabelEncoder (2429 vs. 5)
- One-hot encoding achieved a **higher score** than LabelEncoder (about 0.801 vs. 0.779)
- One-hot encoding **took longer** to run than LabelEncoder (about 488 ms vs. 73 ms)
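The comparison can be wrapped in a small helper so that both encodings are scored and timed in the same way. The sketch below makes several assumptions: `score_and_time` is a hypothetical helper name, the data is synthetic rather than the Titanic files, and `max_iter=1000` is added only to avoid convergence warnings. A single timing run is noisy, so the milliseconds should be read as a rough indication.

```python
import time
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder


def score_and_time(X, y, cv=5):
    """Return (mean CV accuracy, elapsed milliseconds) for a logistic regression."""
    estimator = LogisticRegression(solver='lbfgs', max_iter=1000)
    start = time.perf_counter()
    score = cross_val_score(estimator, X, y, cv=cv).mean()
    elapsed_ms = (time.perf_counter() - start) * 1000
    return score, elapsed_ms


# Synthetic stand-in for the categorical columns (hypothetical data)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'Sex': rng.choice(['male', 'female'], size=200),
    'Embarked': rng.choice(['S', 'C', 'Q'], size=200),
})
y = (toy['Sex'] == 'female').astype(int)

# Label encoding: one integer column per original column
X_le = toy.copy()
for col in X_le.columns:
    X_le[col] = LabelEncoder().fit_transform(X_le[col])

# One-hot encoding: one indicator column per distinct value
X_ohe = pd.get_dummies(toy).astype(float)

for name, X in [('LabelEncoder', X_le), ('get_dummies', X_ohe)]:
    score, ms = score_and_time(X, y)
    print(f'{name}: {X.shape[1]} columns, score={score:.4f}, {ms:.1f} ms')
```

Unlike the cells above, the timer here wraps only `cross_val_score`, so the time spent formatting and printing the result is not counted.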