# Homework: (Kaggle) Titanic Survival Prediction
https://www.kaggle.com/c/titanic

# [Homework Goal]
- Following the example, observe the effect of mean encoding on Titanic survival prediction

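# [Homework Focus]
- Following the example, complete predictions with label encoding and with mean encoding, each combined with logistic regression
- Observe how label encoding and mean encoding each affect the number of features, the logistic regression score, and the logistic regression running time (In[3], Out[3], In[4], Out[4])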

# Homework 1
* Following the example, re-implement the categorical features in the Titanic example using mean encoding
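
As a reminder of what mean encoding does, here is a minimal sketch on a made-up toy table (the rows below are hypothetical and not taken from the Titanic files): each category is replaced by the mean of the target within that category.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    'Sex':      ['male', 'female', 'female', 'male', 'male'],
    'Survived': [0, 1, 1, 0, 1],
})

# Mean encoding: replace each category with the mean target value of its group
toy['Sex_mean'] = toy.groupby('Sex')['Survived'].transform('mean')
print(toy)
```

Here `female` maps to 1.0 (both toy rows survived) and `male` maps to 1/3. Unlike label encoding, the encoded value carries information about the target, which is what makes it both useful and prone to leakage.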

In [1]:
# All preparation before feature engineering (same as the previous example)
import pandas as pd
import numpy as np
import copy, time
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder

data_path = '../data/'
df_train = pd.read_csv(data_path + 'titanic_train.csv')
df_test = pd.read_csv(data_path + 'titanic_test.csv')

train_Y = df_train['Survived']
ids = df_test['PassengerId']
df_train = df_train.drop(['PassengerId', 'Survived'] , axis=1)
df_test = df_test.drop(['PassengerId'] , axis=1)
df = pd.concat([df_train,df_test])
df.head()

Unnamed: 0,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [2]:
# Keep only the categorical (object) columns, stored in object_features
object_features = []
for dtype, feature in zip(df.dtypes, df.columns):
    if dtype == 'object':
        object_features.append(feature)
print(f'{len(object_features)} Object Features : {object_features}\n')

# Keep only the categorical columns
df = df[object_features]
df = df.fillna('None')
train_num = train_Y.shape[0]
df.info()

5 Object Features : ['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1309 entries, 0 to 417
Data columns (total 5 columns):
Name        1309 non-null object
Sex         1309 non-null object
Ticket      1309 non-null object
Cabin       1309 non-null object
Embarked    1309 non-null object
dtypes: object(5)
memory usage: 61.4+ KB


# Homework 2
* In Titanic survival prediction, compare mean encoding with label encoding: which one performs better, and what might be the reason?

In [3]:
# Control group: label encoding + logistic regression
df_temp = pd.DataFrame()
for c in df.columns:
    df_temp[c] = LabelEncoder().fit_transform(df[c])
train_X = df_temp[:train_num]
estimator = LogisticRegression(solver='lbfgs')
score = cross_val_score(estimator, train_X, train_Y, scoring='accuracy', cv=5)
print("score=", score, "mean=", score.mean(), "train_num=", train_num)
train_X.nunique()

score= [0.7877095  0.77653631 0.78089888 0.76404494 0.78531073] mean= 0.7789000729487724 train_num= 891


Name        891
Sex           2
Ticket      681
Cabin       148
Embarked      4
dtype: int64

In [4]:
# Mean encoding + logistic regression
# Note: the group means are computed on the same rows that are cross-validated below,
# so the target leaks into the features
data = pd.concat([df[:train_num], train_Y], axis=1)
print("data columns of data = ", data.columns)
print("data columns of df=", df.columns)
for c in df.columns:
    # Mean of Survived for each category of column c
    mean_df = data.groupby([c])['Survived'].mean().reset_index()
    mean_df.columns = [c, f'{c}_mean']
    # Attach the encoded column, then drop the original categorical column
    data = pd.merge(data, mean_df, on=c, how='left')
    data = data.drop([c], axis=1)
data = data.drop(['Survived'], axis=1)
score = cross_val_score(estimator, data, train_Y, scoring='accuracy', cv=5)
print("score=", score, "mean=", score.mean())

data columns of data =  Index(['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked', 'Survived'], dtype='object')
data columns of df= Index(['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked'], dtype='object')
score= [1. 1. 1. 1. 1.] mean= 1.0


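1. Every `Name` is unique, so the number of distinct names equals the number of training rows (891). `Name_mean` is therefore just each passenger's own `Survived` value, and the perfect score of 1.0 comes from over-fitting (target leakage), not from real predictive power.
2. This score will not carry over to real prediction, because the encoding was computed from the `Survived` column, which is not available for new passengers.
3. The test data has no `Survived` column, so the means cannot be computed there. They have to be learned from the training data only and then mapped onto the test rows, with a fallback value (for example the overall survival rate) for categories that never appear in training.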
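
One common way to address the points above, sketched here under assumptions rather than taken from the course example, is to learn the mapping from training rows only, encode each training row from the other folds (out-of-fold encoding), and give unseen categories the global mean. The helper names `fit_mean_encoding`, `apply_mean_encoding`, and `out_of_fold_encoding`, as well as the toy data, are hypothetical.

```python
import numpy as np
import pandas as pd

def fit_mean_encoding(values, target):
    """Learn category -> mean(target) from training data; also return the global mean."""
    return target.groupby(values).mean(), target.mean()

def apply_mean_encoding(values, mapping, global_mean):
    """Map categories to the learned means; unseen categories fall back to the global mean."""
    return values.map(mapping).fillna(global_mean)

def out_of_fold_encoding(values, target, n_splits=5, seed=0):
    """Encode each training row using means computed from the other folds only."""
    encoded = pd.Series(np.nan, index=values.index, dtype=float)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(values)), n_splits)
    for fold in folds:
        rest = np.setdiff1d(np.arange(len(values)), fold)
        mapping, global_mean = fit_mean_encoding(values.iloc[rest], target.iloc[rest])
        encoded.iloc[fold] = apply_mean_encoding(
            values.iloc[fold], mapping, global_mean).to_numpy()
    return encoded

# Hypothetical toy data: every name is unique, as in the Titanic training set
train = pd.DataFrame({'Name': list('abcdef'), 'Survived': [0, 1, 0, 1, 1, 0]})
test_names = pd.Series(['x', 'y'])  # names never seen in training

naive = train.groupby('Name')['Survived'].transform('mean')   # leaks the target
oof = out_of_fold_encoding(train['Name'], train['Survived'], n_splits=3)
mapping, global_mean = fit_mean_encoding(train['Name'], train['Survived'])
test_encoded = apply_mean_encoding(test_names, mapping, global_mean)
print(pd.DataFrame({'naive': naive, 'oof': oof}))
print(test_encoded)
```

In this toy case the naive encoding reproduces `Survived` exactly, while the out-of-fold version cannot see a row's own label and the unseen test names simply receive the global mean. That suggests a unique-valued column such as `Name` carries no usable signal once the leak is removed.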

In [5]:
data[:].nunique()

Name_mean        2
Sex_mean         2
Ticket_mean      8
Cabin_mean       6
Embarked_mean    4
dtype: int64

In [6]:
# Drop Name_mean: every name is unique, so this column only repeats the target
data = data.drop(['Name_mean'], axis=1)
print(data.head())
score = cross_val_score(estimator, data, train_Y, scoring='accuracy', cv=5)
print("score=", score, "mean=", score.mean())

   Sex_mean  Ticket_mean  Cabin_mean  Embarked_mean
0  0.188908          0.0    0.299854       0.336957
1  0.742038          1.0    1.000000       0.553571
2  0.742038          1.0    0.299854       0.336957
3  0.742038          0.5    0.500000       0.336957
4  0.188908          0.0    0.299854       0.336957
score= [0.98882682 0.95530726 0.98314607 0.96629213 0.98305085] mean= 0.9753246255834217
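
Even without `Name_mean`, `Ticket` has 681 distinct values across 891 training rows (see Out[3]), so many ticket groups contain a single passenger and `Ticket_mean` is often exactly 0.0 or 1.0. The score of about 0.975 is therefore probably still inflated by leakage. One common mitigation, which this notebook does not use, is to shrink each group mean toward the global mean in proportion to how few rows the group has. The sketch below assumes a hypothetical helper `smoothed_mean_encoding` and a smoothing weight `m` chosen arbitrarily.

```python
import pandas as pd

def smoothed_mean_encoding(values, target, m=10):
    """Blend each group mean with the global mean; small groups are pulled harder."""
    global_mean = target.mean()
    stats = target.groupby(values).agg(['mean', 'count'])
    smoothed = (stats['count'] * stats['mean'] + m * global_mean) / (stats['count'] + m)
    return values.map(smoothed)

# Hypothetical toy data: one single-passenger ticket and one shared ticket
toy = pd.DataFrame({
    'Ticket':   ['T1', 'T2', 'T2', 'T2', 'T2'],
    'Survived': [1, 0, 0, 1, 0],
})
toy['Ticket_smooth'] = smoothed_mean_encoding(toy['Ticket'], toy['Survived'], m=2)
print(toy)
```

With `m=2` and a global mean of 0.4, the single-passenger ticket `T1` is encoded as 0.6 instead of 1.0, and `T2` as 0.3 instead of 0.25. Smoothing reduces but does not remove leakage, so it is usually combined with fitting the encoding on training folds only.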
