# DAY 17

## (Kaggle) Titanic Survival Prediction, Condensed Version
https://www.kaggle.com/c/titanic

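## [Assignment Goals]
Without relying on the explanations, answer the questions below using only the code that follows, to get a first sense of which block counts as "feature engineering".
- Of the five code blocks A~E below, which one is feature engineering? **Ans: Block C**
- Comparing the outputs of code blocks B and C, which columns are "categorical columns"? (The English column names are enough.) **Ans: Pclass, Name, Sex, Ticket, Cabin, Embarked**
- Following the previous question, which column is the "target value"? **Ans: 'Survived'**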

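## [References]
**1. Zhihu - What exactly is feature engineering: https://www.zhihu.com/question/29316149**
- The key point of this article is the figure on the right. The main goal is for students to get a rough idea of which parts feature engineering includes. If you are interested in the details, the article also introduces some of the concepts. Part of that material will be explained and practiced in later lessons; see the 100-day marathon syllabus for details.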

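**2. Pixnet - iT邦 2019 Ironman Contest: Why feature engineering matters: https://ithelp.ithome.com.tw/articles/10200041?sc=iThelpR**
- This article describes how time is divided in real-world data science work (figure below): most of it goes to data cleaning and a small part to feature mining. Both are part of feature engineering, but the share of time they take differs enormously between the learning stage and actual practice.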

**3. https://ithelp.ithome.com.tw/articles/10200327**

In [45]:
# Code block A: Initialize and read data
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

data_path = 'data/'
df_train = pd.read_csv(data_path + 'titanic_train.csv')
df_test = pd.read_csv(data_path + 'titanic_test.csv')
df_train.shape

(891, 12)

In [46]:
# Code block B: Extract the target 'Survived' and remove it from the source data, then combine the source data (df_train, df_test)
train_Y = df_train['Survived']
ids = df_test['PassengerId']
df_train = df_train.drop(['PassengerId', 'Survived'] , axis=1)
df_test = df_test.drop(['PassengerId'] , axis=1)
df = pd.concat([df_train,df_test]) # Combine train/test data together to do feature engineering
print(df.shape)
df.head()

(1309, 10)


| | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


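**The current sample answer lists only the columns whose dtype is object as categorical data, but that approach is not rigorous enough. A passenger "class" is never actually used in arithmetic, so Pclass should also count as categorical data.**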
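The note above can be sketched on a small example: collect the non-numeric columns automatically, then add integer-coded label columns such as Pclass by hand. The tiny DataFrame below is made-up illustration data, not the real Titanic file, and `toy`, `text_cols`, and `categorical_cols` are hypothetical names.

```python
import pandas as pd

# Toy data for illustration only (not the real Titanic file)
toy = pd.DataFrame({
    'Pclass': [3, 1, 3],
    'Sex': ['male', 'female', 'female'],
    'Age': [22.0, 38.0, 26.0],
    'Embarked': ['S', 'C', 'S'],
})

# Non-numeric columns are the ones a dtype check finds automatically
text_cols = [c for c in toy.columns if not pd.api.types.is_numeric_dtype(toy[c])]

# Pclass is stored as an integer but is a class label, so it is added by hand
categorical_cols = ['Pclass'] + text_cols
print(categorical_cols)
```

A dtype check alone would miss Pclass here, which is why the manual step is needed.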

In [22]:
# Code block C: Find the object-dtype columns, encode them with LabelEncoder, then rescale every column with MinMaxScaler
LEncoder = LabelEncoder()
MMEncoder = MinMaxScaler()
for c in df.columns:
    # Fill missing values with -1
    df[c] = df[c].fillna(-1)
    if df[c].dtype == 'object':
        # Encode string columns as integer codes
        df[c] = LEncoder.fit_transform(list(df[c].values))
    # Rescale the column to the range [0, 1]
    df[c] = MMEncoder.fit_transform(df[c].values.reshape(-1, 1))
df.head()

| | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.118683 | 1.0 | 0.283951 | 0.125 | 0.0 | 0.775862 | 0.016072 | 0.0 | 1.0 |
| 1 | 0.0 | 0.218989 | 0.0 | 0.481481 | 0.125 | 0.0 | 0.87931 | 0.140813 | 0.575269 | 0.333333 |
| 2 | 1.0 | 0.400459 | 0.0 | 0.333333 | 0.0 | 0.0 | 0.984914 | 0.017387 | 0.0 | 1.0 |
| 3 | 0.0 | 0.323124 | 0.0 | 0.444444 | 0.125 | 0.0 | 0.070043 | 0.10539 | 0.38172 | 1.0 |
| 4 | 1.0 | 0.016845 | 1.0 | 0.444444 | 0.0 | 0.0 | 0.699353 | 0.01763 | 0.0 | 1.0 |
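To see where values such as 1.0 and 0.333333 in the Embarked column can come from, here is a minimal sketch of the two transformations applied to a single toy column. It assumes the missing-value placeholder ends up as the string '-1' alongside the three port codes; the list `embarked` is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# Toy column for illustration only; '-1' stands in for a filled missing value
embarked = ['S', 'C', 'Q', '-1']

# LabelEncoder assigns integers by sorted order of the distinct values:
# '-1' -> 0, 'C' -> 1, 'Q' -> 2, 'S' -> 3
codes = LabelEncoder().fit_transform(embarked)

# MinMaxScaler rescales each value as (x - min) / (max - min)
scaled = MinMaxScaler().fit_transform(codes.reshape(-1, 1).astype(float)).ravel()

print(codes)
print(scaled)
```

Under this assumption 'S' maps to 1.0 and 'C' to 1/3, which is consistent with the Embarked values in the output above. The integer codes carry an ordering that has no real meaning for ports, so this encoding is a simple baseline rather than a recommended treatment of categorical data.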


In [37]:
# Code block D: Split df back into train_X and test_X after feature engineering
train_num = train_Y.shape[0]
print("train_num: ", train_num)
train_X = df[:train_num]
print("train_X:", train_X.shape)
test_X = df[train_num:]
print("test_X:", test_X.shape)

from sklearn.linear_model import LogisticRegression
estimator = LogisticRegression(solver = 'lbfgs')
estimator.fit(train_X, train_Y)
pred = estimator.predict(test_X)

train_num:  891
train_X: (891, 10)
test_X: (418, 10)
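The combine-then-split pattern used in blocks B and D can be checked on a small example. This is a sketch with made-up frames (`toy_train`, `toy_test`) rather than the Titanic data; it relies only on `pd.concat` preserving row order, so positional slicing by the training row count recovers both parts.

```python
import pandas as pd

# Toy frames for illustration only
toy_train = pd.DataFrame({'Fare': [7.25, 71.28, 7.92]})
toy_test = pd.DataFrame({'Fare': [8.05, 53.10]})

# Stack train on top of test so both receive identical transformations
combined = pd.concat([toy_train, toy_test])

# Slice by position: the first n_train rows are the training rows
n_train = len(toy_train)
train_part = combined[:n_train]
test_part = combined[n_train:]

print(train_part.shape, test_part.shape)
```

The combined index contains duplicate labels (0, 1, 2, 0, 1) because each frame keeps its own index. Positional slicing is unaffected, but passing `ignore_index=True` to `pd.concat` avoids surprises with label-based lookups.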


In [6]:
# Code block E: Write the predictions to a submission file
sub = pd.DataFrame({'PassengerId': ids, 'Survived': pred})
sub.to_csv('titanic_baseline.csv', index=False) 