# Titanic: Machine Learning from Disaster

**
This is a classification problem: predict whether each passenger survived or died from the given features.  
With TensorFlow (1.4+), there are roughly four ways to train a prediction model for it.
**  

1. Define the hypothesis, loss, weights (W), biases (b), and so on yourself and implement the neural network directly.  
2. Implement the neural network with a high-level API such as tensorflow.layers.  
3. Implement the neural network with a high-level API such as tensorflow.estimator.  
4. Implement the neural network with a high-level API such as tensorflow.keras.  

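Here we take the third approach and build a multi-layer neural network with the tensorflow.estimator.DNNClassifier API.
It does not seem to be a very good example (fine-grained control would probably require a custom estimator).  
The usage pattern also differs considerably from low-level TensorFlow, layers, and Keras.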

Things to check  
- Effect of preprocessing the input features and of which features are selected.
- Random weight initialization versus the Xavier/He initializers.
- Choice of activation function (sigmoid/tanh/relu).
- Number of layers and number of neurons per layer.
- Choice of optimizer (GradientDescent, Adam, etc.).
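
To make the initializer comparison concrete, here is a minimal NumPy sketch of the three schemes mentioned in the list. The function names, the layer size (256 x 512), and the seed are assumptions for illustration and do not come from the notebook; in a real model TensorFlow's own initializers would be used instead.

```python
import numpy as np

def random_normal_init(fan_in, fan_out, rng, stddev=1.0):
    """Plain random-normal weights with a fixed standard deviation."""
    return rng.normal(0.0, stddev, size=(fan_in, fan_out))

def xavier_uniform_init(fan_in, fan_out, rng):
    """Xavier/Glorot uniform: U(-limit, limit) with limit = sqrt(6 / (fan_in + fan_out))."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

def he_normal_init(fan_in, fan_out, rng):
    """He normal: N(0, sqrt(2 / fan_in)), commonly paired with ReLU."""
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

rng = np.random.RandomState(777)
fan_in, fan_out = 256, 512  # hypothetical layer size
for name, init in [("random", random_normal_init),
                   ("xavier", xavier_uniform_init),
                   ("he", he_normal_init)]:
    w = init(fan_in, fan_out, rng)
    print("%-7s std = %.4f" % (name, w.std()))
```

The Xavier and He variants scale the weights by the layer width, which keeps the activation variance from growing or shrinking layer by layer; the plain random-normal version does not.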

## Data Fields

  * **Survived** - Survival. 0 = No, 1 = Yes
  * **Pclass** - Ticket class. 1 = 1st, 2 = 2nd, 3 = 3rd
  * **Sex** - Sex.
  * **Age** - Age in years.
  * **SibSp** - # of siblings / spouses aboard the Titanic.
  * **Parch** - # of parents / children aboard the Titanic.
  * **Ticket** - Ticket number.
  * **Fare** - Passenger fare.
  * **Cabin** - Cabin number.
  * **Embarked** - Port of Embarkation. C = Cherbourg, Q = Queenstown, S = Southampton
  
  https://www.kaggle.com/c/titanic

In [29]:
# import library
import pandas as pd
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split

## Load Dataset

In [30]:
# train data
train = pd.read_csv("data/train.csv", index_col=["PassengerId"])
print(train.shape)
train.head()

(891, 11)


PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [31]:
# test data for the submission
test = pd.read_csv("data/test.csv", index_col=["PassengerId"])

print(test.shape)
test.head()

(418, 10)


PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


## Preprocessing

In [32]:
# compute the median age for a given Pclass
def median_age(data,Pclass):
    med_age = round(data["Age"][data["Pclass"]==Pclass].median())
    return med_age

### Missing Age Value

In [33]:
# train.loc[(train.Age.isnull()) & (train["Fare"] == 0), "Age"] = 0
# for i in [1,2,3]:
#     train.loc[(train.Age.isnull()) & (train["Pclass"] == i), "Age"] = median_age(train,i)
# train.loc[(train.Age < 1), "Age"] = 1

# print(train.shape)
# train.head()

In [34]:
# test.loc[(test.Age.isnull()) & (test["Fare"] == 0), "Age"] = 0
# for i in [1,2,3]:
#     test.loc[(test.Age.isnull()) & (test["Pclass"] == i), "Age"] = median_age(train,i)
# test.loc[(test.Age < 1), "Age"] = 1

# print(test.shape)
# test.head()
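
The commented-out cells above fill missing ages with the median age of the passenger's Pclass. As a rough sketch of the same idea, `groupby(...).transform("median")` can do it without an explicit loop. The toy frame below uses made-up values and is only an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train.csv (values are made up).
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age": [40.0, 50.0, np.nan, 20.0, np.nan, 30.0],
})

# Median age per Pclass, broadcast back to every row of that class.
class_median = toy.groupby("Pclass")["Age"].transform("median")

# Fill only the missing ages; observed ages are left untouched.
toy["Age"] = toy["Age"].fillna(class_median)
print(toy)
```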

### Missing Age Value

In [35]:
# combine the whole dataset to get the mean values of the total dataset
# (just be careful to not leak data)
combined_df = pd.concat([train, test])

# get mean values per gender
male_mean_age = combined_df[combined_df["Sex"]=="male"]["Age"].mean()
female_mean_age = combined_df[combined_df["Sex"]=="female"]["Age"].mean()
print ("female mean age: %1.0f" %female_mean_age )
print ("male mean age: %1.0f" %male_mean_age )

# fill the nan values 
train.loc[ (train["Sex"]=="male") & (train["Age"].isnull()), "Age"] = male_mean_age
train.loc[ (train["Sex"]=="female") & (train["Age"].isnull()), "Age"] = female_mean_age

test.loc[ (test["Sex"]=="male") & (test["Age"].isnull()), "Age"] = male_mean_age
test.loc[ (test["Sex"]=="female") & (test["Age"].isnull()), "Age"] = female_mean_age

female mean age: 29
male mean age: 31
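
The cell above computes the mean ages on train and test combined, which its own comment flags as a possible leak. One way to sidestep that, sketched here on made-up toy data, is to compute the statistics from the training rows only and apply them to both sets. The helper name `fill_age` is hypothetical.

```python
import numpy as np
import pandas as pd

toy_train = pd.DataFrame({"Sex": ["male", "male", "female", "female"],
                          "Age": [34.0, np.nan, 20.0, 40.0]})
toy_test = pd.DataFrame({"Sex": ["male", "female"],
                         "Age": [np.nan, np.nan]})

# Statistics come from the training rows only.
mean_age_by_sex = toy_train.groupby("Sex")["Age"].mean()

def fill_age(df, means):
    """Return a copy of df with missing Age replaced by the mean for that row's Sex."""
    out = df.copy()
    out["Age"] = out["Age"].fillna(out["Sex"].map(means))
    return out

toy_train = fill_age(toy_train, mean_age_by_sex)
toy_test = fill_age(toy_test, mean_age_by_sex)
print(toy_test)
```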


### Missing Embarked value

In [36]:
train["Embarked"] = train["Embarked"].fillna("S")
test["Embarked"] = test["Embarked"].fillna("S")

### Missing Cabin Value

In [37]:
train["Cabin"] = train["Cabin"].fillna("X")

print(train.shape)
train.head()

(891, 11)


PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,X,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,X,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,X,S


In [38]:
test["Cabin"] = test["Cabin"].fillna("X")

print(test.shape)
test.head()

(418, 10)


PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,X,Q
893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,X,S
894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,X,Q
895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,X,S
896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,X,S


In [39]:
print ("Train")
print (train.isnull().sum() )
print ("-------")
print ("Test")
print (test.isnull().sum() )

Train
Survived    0
Pclass      0
Name        0
Sex         0
Age         0
SibSp       0
Parch       0
Ticket      0
Fare        0
Cabin       0
Embarked    0
dtype: int64
-------
Test
Pclass      0
Name        0
Sex         0
Age         0
SibSp       0
Parch       0
Ticket      0
Fare        1
Cabin       0
Embarked    0
dtype: int64


### Add New Features

In [40]:
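# add the family size (the passenger plus siblings/spouses and parents/children)
train["Famille"] = 1+train["SibSp"]+train["Parch"]
# replace the fare with its log value (zero fares are left as 0)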
train["LogFare"] = train["Fare"].apply(lambda x: np.log(x) if x > 0 else x)

print(train.shape)
train.head()

(891, 13)


PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Famille,LogFare
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,X,S,2,1.981001
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,2,4.266662
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,X,S,1,2.070022
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,2,3.972177
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,X,S,1,2.085672
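
The LogFare transform above takes `log(x)` for positive fares and leaves zero fares at 0. A common alternative is `np.log1p`, i.e. `log(1 + x)`, which is defined at 0 and stays monotonic for all non-negative fares. The sketch below compares the two on three sample fares; it is an illustration, not a change the notebook makes.

```python
import numpy as np

fares = np.array([0.0, 7.25, 71.2833])

# Notebook transform: log(x) for x > 0, otherwise keep x as-is.
log_fare = np.array([np.log(x) if x > 0 else x for x in fares])

# Alternative: log(1 + x), which maps a zero fare to exactly 0.
log1p_fare = np.log1p(fares)

print(log_fare)
print(log1p_fare)
```

With the notebook's version, any fare strictly between 0 and 1 would map to a negative value, below the zero fares; `log1p` avoids that ordering quirk.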


In [41]:
# add the family size
test["Famille"] = 1+test["SibSp"]+test["Parch"]

# compute the mean 'Fare' of the train data
mean_fare = train["Fare"].mean()
print("Fare(Mean) = ${0:.3f}".format(mean_fare))

# only the test data has one row with a missing Fare, so fill it with the train mean
test.loc[pd.isnull(test["Fare"]), "Fare"] = mean_fare
test[pd.isnull(test["Fare"])]

# replace the fare with its log value (zero fares are left as 0)
test["LogFare"] = test["Fare"].apply(lambda x: np.log(x) if x > 0 else x)

print(test.shape)
test.head()

Fare(Mean) = $32.204
(418, 12)


PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Famille,LogFare
892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,X,Q,1,2.05786
893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,X,S,2,1.94591
894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,X,Q,1,2.270836
895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,X,S,1,2.159003
896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,X,S,3,2.508582


## Data for train

### Select target(label) value for Training

In [43]:
# select the field to predict
label_name = "Survived"

# prepare the label data set from the full train data
# (a hand-built softmax model would need one-hot labels; DNNClassifier takes the integer class labels directly)
y = train[label_name]

print(y.shape)
y.head()

(891,)


PassengerId
1    0
2    1
3    1
4    1
5    0
Name: Survived, dtype: int64

### Splitting train, cross validation data set

In [100]:
# sample 99% of the rows for training (frac=0.99)
train_set = train.sample(frac=0.99, replace=False, random_state=777)
# the remaining 1% is reserved for cross validation
cv_set = train.drop(train_set.index)

print ("train set shape (%i,%i)"  %train_set.shape)
print ("cv set shape (%i,%i)"   %cv_set.shape)
print ("Check if they have common indexes. The following line should be an empty set:")
print (set(train_set.index) & set(cv_set.index))

train set shape (882,13)
cv set shape (9,13)
Check if they have common indexes. The following line should be an empty set:
set()
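
With `frac=0.99` the split above holds out only 9 rows, so the cross-validation score is very noisy. A more conventional 80/20 split could look like the sketch below. The toy frame and the names `train_part` / `cv_part` are assumptions for illustration.

```python
import pandas as pd

# Toy frame with 100 rows, indexed like the Titanic data.
toy = pd.DataFrame({"Survived": [0, 1] * 50},
                   index=pd.RangeIndex(1, 101, name="PassengerId"))

# 80% of the rows for training, drawn without replacement.
train_part = toy.sample(frac=0.8, replace=False, random_state=777)
# Everything that was not sampled goes to cross validation.
cv_part = toy.drop(train_part.index)

print(train_part.shape, cv_part.shape)
```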


### Instantiating Features

- numeric_column: It defines that the feature will be a float32 number.
- bucketized_column: It defines a feature that will be bucketized. You can define the range of the buckets.
- categorical_column_with_vocabulary_list: Maps each string value to an integer ID using a fixed vocabulary list. For a DNN it has to be wrapped in an indicator_column (one-hot) or an embedding_column.
- categorical_column_with_hash_bucket: Encodes the categorical values by hashing them into a fixed number of buckets, which you choose. This is very useful when the vocabulary is not known in advance, but different values can collide in the same bucket.
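
As a rough NumPy illustration of what the bucketized and vocabulary-list columns above compute (not the TensorFlow implementation itself), the sketch below bucketizes ages with the boundaries used in this notebook and one-hot encodes a small `Sex` column. The sample values are made up.

```python
import numpy as np

ages = np.array([5, 12, 20, 21, 59, 60, 80])
boundaries = [12, 21, 60]

# Bucket index per age; each bucket includes its left boundary and excludes the right one,
# giving (-inf, 12), [12, 21), [21, 60), [60, +inf).
bucket_ids = np.digitize(ages, boundaries)

# One-hot encode the bucket ids: len(boundaries) + 1 buckets in total.
age_one_hot = np.eye(len(boundaries) + 1, dtype=np.float32)[bucket_ids]

# Vocabulary list: each string becomes its position in the list, then a one-hot row.
vocabulary = ["female", "male"]
sexes = ["male", "female", "male"]
sex_ids = np.array([vocabulary.index(s) for s in sexes])
sex_one_hot = np.eye(len(vocabulary), dtype=np.float32)[sex_ids]

print(bucket_ids)
print(sex_one_hot)
```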

In [140]:
# defining numeric columns
pclass_feature = tf.feature_column.numeric_column('Pclass')
parch_feature = tf.feature_column.numeric_column('Parch')  # used in feature_columns below
famille_feature = tf.feature_column.numeric_column('Famille')
fare_feature = tf.feature_column.numeric_column('LogFare')
age_feature = tf.feature_column.numeric_column('Age')

#defining buckets for children, teens, adults and elders.
age_bucket_feature = tf.feature_column.bucketized_column(age_feature,[12,21,60])

#defining a categorical column with predefined values
sex_feature = tf.feature_column.categorical_column_with_vocabulary_list(
    'Sex',['female','male']
)
# defining categorical columns whose values are not listed in advance
embarked_feature =  tf.feature_column.categorical_column_with_hash_bucket(
    'Embarked', 3 
)
cabin_feature =  tf.feature_column.categorical_column_with_hash_bucket(
    'Cabin', 100 
)
sex_embedding =  tf.feature_column.embedding_column(
    categorical_column = sex_feature,
    dimension = 2,
)

# a DNN needs dense inputs, so the hash-bucket categorical columns are wrapped in embedding columns too
embarked_embedding =  tf.feature_column.embedding_column(
    categorical_column = embarked_feature,
    dimension = 3,
)
cabin_embedding =  tf.feature_column.embedding_column(
    categorical_column = cabin_feature,
    dimension = 300,
)

feature_columns = [ pclass_feature, parch_feature, famille_feature, fare_feature, age_bucket_feature,
                   sex_embedding, embarked_embedding, cabin_embedding ]
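
Conceptually, a hash-bucket column followed by an embedding column maps each string to a bucket id and then looks up one dense row per id. The sketch below mimics that with NumPy. `zlib.crc32` is only a stand-in hash (TensorFlow uses its own fingerprint function), the embedding dimension of 4 is arbitrary, and the random table stands in for weights that would be learned during training.

```python
import zlib
import numpy as np

def hash_bucket(value, num_buckets):
    """Map a string to a bucket id (crc32 is a stand-in for TensorFlow's hash)."""
    return zlib.crc32(value.encode("utf-8")) % num_buckets

num_buckets, dimension = 100, 4  # dimension is an arbitrary choice for this sketch
rng = np.random.RandomState(0)
# One row per bucket; random values stand in for learned weights.
embedding_table = rng.normal(size=(num_buckets, dimension))

cabins = ["C85", "C123", "X"]
ids = np.array([hash_bucket(c, num_buckets) for c in cabins])
dense = embedding_table[ids]  # one dense vector per passenger

print(ids)
print(dense.shape)
```

Two different cabin strings that land in the same bucket share the same embedding row, which is the hash collision mentioned in the feature list.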

In [141]:
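# number of classes
n_classes = 2
# learning rate
learning_rate = 0.001
# epochs
EPOCHS = 100
# batch size (intended: use the whole data set at once, without splitting it)
# NOTE: EPOCHS and BATCH_SIZE are not passed to the input functions below
BATCH_SIZE = int(len(train)/1)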
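
Since `EPOCHS` and `BATCH_SIZE` are never handed to `pandas_input_fn`, training length is governed by `steps=1000` in the later `estimator.train` call and by the input function's default `batch_size` of 128. The arithmetic below is a back-of-the-envelope sketch using the 882 training rows from the split; the first-step loss comes from the training log further down.

```python
import math

n_train = 882       # rows in train_set (from the split output)
batch_size = 128    # pandas_input_fn default, since BATCH_SIZE is not passed
steps = 1000        # value passed to estimator.train

steps_per_epoch = math.ceil(n_train / batch_size)
approx_epochs = steps * batch_size / n_train

# The logged training loss is summed over the batch, so divide to get a per-example value.
first_step_loss = 90.8456
per_example_loss = first_step_loss / batch_size

print(steps_per_epoch, round(approx_epochs, 1), round(per_example_loss, 4))
```

Under these assumptions the run covers roughly 145 passes over the data rather than the 100 epochs named in the cell above, and the first-step loss per example is close to ln 2, as expected for an untrained binary classifier.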

### Estimator (model)

In [142]:
# optimizer 
optimizer=tf.train.AdamOptimizer(learning_rate=learning_rate)

# estimator
estimator = tf.estimator.DNNClassifier(
                    feature_columns=feature_columns,
                    hidden_units=[256, 512, 512, 256],
                    n_classes=n_classes,
                    optimizer=optimizer,
                    activation_fn=tf.nn.relu,
                    model_dir='model/dnn')

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': 'model/dnn', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x000001BD8E22D198>, '_task_type': 'worker', '_task_id': 0, '_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}


### Input function

In [143]:
# train input function
train_input_fn = tf.estimator.inputs.pandas_input_fn(
      x=train_set.drop('Survived', axis=1),  # keep the label out of the features, as in evaluate_input_fn
      y=train_set.Survived,
      num_epochs=None,  # repeat the data indefinitely; training length is set by `steps` in train()
      shuffle=True,
      target_column='target',
)

evaluate_input_fn = tf.estimator.inputs.pandas_input_fn(
      x=cv_set.drop('Survived', axis=1),
      y=cv_set.Survived,
      num_epochs=1, #We just want to use one epoch since this is only to score.
      shuffle=False  #It isn't necessary to shuffle the cross validation 
)

test_input_fn = tf.estimator.inputs.pandas_input_fn(
      x=test,
      y=None,
      num_epochs=1,
      shuffle=False
)

### Train

In [144]:
estimator.train(input_fn=train_input_fn, steps=1000)

INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Saving checkpoints for 1 into model/dnn\model.ckpt.
INFO:tensorflow:loss = 90.8456, step = 1
INFO:tensorflow:global_step/sec: 57.2934
INFO:tensorflow:loss = 45.2522, step = 101 (1.749 sec)
INFO:tensorflow:global_step/sec: 68.7205
INFO:tensorflow:loss = 37.5559, step = 201 (1.456 sec)
INFO:tensorflow:global_step/sec: 70.5149
INFO:tensorflow:loss = 35.851, step = 301 (1.419 sec)
INFO:tensorflow:global_step/sec: 68.485
INFO:tensorflow:loss = 33.3954, step = 401 (1.459 sec)
INFO:tensorflow:global_step/sec: 68.7678
INFO:tensorflow:loss = 26.1358, step = 501 (1.453 sec)
INFO:tensorflow:global_step/sec: 66.6575
INFO:tensorflow:loss = 30.9283, step = 601 (1.498 sec)
INFO:tensorflow:global_step/sec: 66.4359
INFO:tensorflow:loss = 29.7535, step = 701 (1.507 sec)
INFO:tensorflow:global_step/sec: 64.3397
INFO:tensorflow:loss = 37.7083, step = 801 (1.557 sec)
INFO:tensorflow:global_step/sec: 65.1792
INFO:tensorflow:loss = 29.1075, step = 9

<tensorflow.python.estimator.canned.dnn.DNNClassifier at 0x1bd8e1e27f0>

### Evaluate (Cross validation)

In [145]:
# evaluate and print the accuracy using the cross-validation input function
accuracy_score = estimator.evaluate(input_fn=evaluate_input_fn)["accuracy"]
print("\nTest Accuracy: {0:f}\n".format(accuracy_score))

INFO:tensorflow:Starting evaluation at 2018-02-28-12:45:23
INFO:tensorflow:Restoring parameters from model/dnn\model.ckpt-1000
INFO:tensorflow:Finished evaluation at 2018-02-28-12:45:25
INFO:tensorflow:Saving dict for global step 1000: accuracy = 0.888889, accuracy_baseline = 0.666667, auc = 1.0, auc_precision_recall = 1.0, average_loss = 0.313816, global_step = 1000, label/mean = 0.333333, loss = 2.82434, prediction/mean = 0.448059

Test Accuracy: 0.888889
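
The numbers in the evaluation log can be cross-checked by hand, which also shows how little data the score rests on. The sketch below only redoes the arithmetic with values copied from the log and the 9-row `cv_set`.

```python
n_cv = 9                  # rows in cv_set
average_loss = 0.313816   # from the evaluation log
label_mean = 0.333333     # fraction of survivors in cv_set
accuracy = 0.888889

total_loss = average_loss * n_cv               # 'loss' is the sum over the batch
baseline = max(label_mean, 1.0 - label_mean)   # always predicting the majority class
n_survivors = round(label_mean * n_cv)
n_correct = round(accuracy * n_cv)

print(round(total_loss, 5), round(baseline, 6), n_survivors, n_correct)
```

An accuracy of 0.888889 here means 8 of 9 passengers were classified correctly, so a single different prediction would move the score by about 11 percentage points.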



### Predict

In [146]:
pred = estimator.predict(input_fn=test_input_fn)

predictions = list(pred)
predictions[0]

INFO:tensorflow:Restoring parameters from model/dnn\model.ckpt-1000


{'class_ids': array([0], dtype=int64),
 'classes': array([b'0'], dtype=object),
 'logistic': array([ 0.07940722], dtype=float32),
 'logits': array([-2.45042849], dtype=float32),
 'probabilities': array([ 0.92059278,  0.07940722], dtype=float32)}
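
The fields of the prediction dictionary are related by the sigmoid function: `logistic` is the sigmoid of `logits`, `probabilities` is `[1 - logistic, logistic]`, and `class_ids` is the index of the larger probability. A minimal check using the values printed above:

```python
import math

logit = -2.45042849  # 'logits' of the first test passenger

# Sigmoid of the logit gives the predicted survival probability.
p_survived = 1.0 / (1.0 + math.exp(-logit))
probabilities = [1.0 - p_survived, p_survived]

# The predicted class is the index with the highest probability.
class_id = max(range(2), key=lambda i: probabilities[i])

print(probabilities, class_id)
```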

In [147]:
# collect the predicted class id (0 or 1) of every passenger
predicted_labels = np.array([p['class_ids'][0] for p in predictions]).astype(int)

### Make submit file

In [148]:
# build the data for the submission
d = {'PassengerId': test.index, 'Survived': predicted_labels}
prediction_df = pd.DataFrame(data=d)
print(prediction_df.shape)
prediction_df.head()

(418, 2)


,PassengerId,Survived
0,892,0
1,893,0
2,894,0
3,895,0
4,896,1


In [149]:
# add a timestamp to the file name so that saved files can be told apart
from datetime import datetime

current_date = datetime.now()
current_date = current_date.strftime("%Y-%m-%d_%H-%M-%S")

description = "titanic-multilayer-nn"

filename = "{date}_{desc}.csv".format(date=current_date, desc=description)
filepath = "data/{filename}".format(filename=filename)

prediction_df.to_csv(filepath, index=False)
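
Before uploading, it can help to sanity-check the submission frame: the Kaggle file needs a `PassengerId` and a `Survived` column, one row per test passenger (418 here), and 0/1 labels. The helper below is a hypothetical sketch, shown on a three-row toy frame rather than the real `prediction_df`.

```python
import pandas as pd

def check_submission(df, expected_rows=418):
    """Raise ValueError if df does not look like a Titanic submission."""
    if list(df.columns) != ["PassengerId", "Survived"]:
        raise ValueError("columns must be PassengerId, Survived")
    if len(df) != expected_rows:
        raise ValueError("expected %d rows, got %d" % (expected_rows, len(df)))
    if df["PassengerId"].duplicated().any():
        raise ValueError("duplicate PassengerId values")
    if not df["Survived"].isin([0, 1]).all():
        raise ValueError("Survived must contain only 0 or 1")
    return True

toy_submission = pd.DataFrame({"PassengerId": [892, 893, 894],
                               "Survived": [0, 0, 1]})
ok = check_submission(toy_submission, expected_rows=3)
print(ok)
```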