# Titanic: Machine Learning from Disaster

## Predict survival on the Titanic and get familiar with ML basics


The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.


## Data Fields

  * **Survived** - Survival. 0 = No, 1 = Yes
  * **Pclass** - Ticket class. 1 = 1st, 2 = 2nd, 3 = 3rd
  * **Sex** - Sex.
  * **Age** - Age in years.
  * **SibSp** - # of siblings / spouses aboard the Titanic.
  * **Parch** - # of parents / children aboard the Titanic.
  * **Ticket** - Ticket number.
  * **Fare** - Passenger fare.
  * **Cabin** - Cabin number.
  * **Embarked** - Port of Embarkation. C = Cherbourg, Q = Queenstown, S = Southampton


In [None]:
# import pandas 
import pandas as pd

## Load Dataset

In [None]:
train = pd.read_csv("data/train.csv", index_col=["PassengerId"])
# Check the number of rows/columns in the train data
print(train.shape)
# Show the first 5 rows of the train data
train.head()

In [None]:
# Inspect the structure of the train data (column dtypes and non-null counts)
train.info()

In [None]:
# Summary statistics for the numeric columns of the train data
train.describe()

In [None]:
test = pd.read_csv("data/test.csv", index_col=["PassengerId"])

print(test.shape)
test.head()

In [None]:
test.info()

In [None]:
test.describe()
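
The `info()` output above reports non-null counts per column, which is how missing values are usually spotted. As a minimal sketch, the same information can be obtained directly by counting nulls per column. The `toy` DataFrame below is a made-up stand-in, not the real Titanic data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic tables; the values are made up for illustration.
toy = pd.DataFrame(
    {
        "Pclass": [3, 1, 3, 2],
        "Age": [22.0, np.nan, 26.0, np.nan],
        "Fare": [7.25, 71.28, np.nan, 13.0],
    },
    index=pd.Index([1, 2, 3, 4], name="PassengerId"),
)

# Count missing values per column, largest first.
missing_counts = toy.isnull().sum().sort_values(ascending=False)
print(missing_counts)
```

Running the same two lines on `train` and `test` would show which columns need to be filled before they can be used as features.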

## Preprocessing

### Encode Sex

In [None]:
# Encode Sex as a number so it can be used as a feature
train.loc[train["Sex"] == "male", "Sex"] = 0
train.loc[train["Sex"] == "female", "Sex"] = 1

print(train.shape)
train.head()

In [None]:
# Predictions are made on the test data, so it must be preprocessed the same way as the train data
test.loc[test["Sex"] == "male", "Sex"] = 0
test.loc[test["Sex"] == "female", "Sex"] = 1


print(test.shape)
test.head()
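
Assigning through `.loc` as above can leave the column with the generic `object` dtype, because it originally held strings. One alternative, shown here only as a sketch on made-up data, is to map the labels through a dictionary, which produces an integer column in one step. The names `toy`, `sex_codes`, and `Sex_encoded` are hypothetical.

```python
import pandas as pd

# Made-up rows for illustration; only the Sex column matters here.
toy = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})

# Same encoding as in the notebook: male -> 0, female -> 1.
sex_codes = {"male": 0, "female": 1}

# Series.map looks up each value in the dictionary.
# Any label missing from the dictionary would become NaN.
toy["Sex_encoded"] = toy["Sex"].map(sex_codes)

print(toy)
print(toy["Sex_encoded"].dtype)
```

Keeping the mapping in one dictionary also makes it easy to apply the identical encoding to both `train` and `test`.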

### Fill in missing fare

In [None]:
# Compute the mean of 'Fare' in the train data
mean_fare = train["Fare"].mean()

print("Fare(Mean) = ${0:.3f}".format(mean_fare))

In [None]:
# Only the test data has a missing Fare (one row), so fill it with the train data's mean
test.loc[pd.isnull(test["Fare"]), "Fare"] = mean_fare

test[pd.isnull(test["Fare"])]
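
The same fill can be written with `fillna`. The sketch below uses two made-up DataFrames (`toy_train`, `toy_test`) to show that the statistic comes from the train data and is then applied to the test data.

```python
import numpy as np
import pandas as pd

# Made-up fares for illustration only.
toy_train = pd.DataFrame({"Fare": [10.0, 20.0, 30.0]})
toy_test = pd.DataFrame({"Fare": [8.0, np.nan, 12.0]})

# Compute the statistic on the train data only.
mean_fare = toy_train["Fare"].mean()

# Replace missing test fares with the train mean.
toy_test["Fare"] = toy_test["Fare"].fillna(mean_fare)

# Confirm that no missing fares remain.
print(toy_test["Fare"].isnull().sum())
```

Computing the mean on the train data keeps the preprocessing independent of the data being predicted.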

### Encode Embarked

In [None]:
# One-hot encoding: add three columns for the Embarked values (C, S, Q), each holding 0 or 1
# Boolean values (True/False) are treated as 1/0, so no explicit numeric conversion is needed
train["Embarked_C"] = train["Embarked"] == "C"
train["Embarked_S"] = train["Embarked"] == "S"
train["Embarked_Q"] = train["Embarked"] == "Q"

print(train.shape)
train[["Embarked", "Embarked_C", "Embarked_S", "Embarked_Q"]].head()

In [None]:
test["Embarked_C"] = test["Embarked"] == "C"
test["Embarked_S"] = test["Embarked"] == "S"
test["Embarked_Q"] = test["Embarked"] == "Q"

print(test.shape)
test[["Embarked", "Embarked_C", "Embarked_S", "Embarked_Q"]].head()
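
pandas also provides `pd.get_dummies` for one-hot encoding. The following is a sketch on made-up data, not the notebook's method. The `reindex` step is an added precaution so that train and test end up with the same columns even if one of them lacks a port. `expected_columns` is a hypothetical name.

```python
import pandas as pd

# Made-up ports for illustration; None stands for a missing value.
toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S", None]})

# Fix the column set up front so train and test always line up.
expected_columns = ["Embarked_C", "Embarked_Q", "Embarked_S"]

# One indicator column per observed port; a missing port gives an all-zero row.
dummies = pd.get_dummies(toy["Embarked"], prefix="Embarked")
dummies = dummies.reindex(columns=expected_columns, fill_value=0)

toy = toy.join(dummies)
print(toy)
```

A row with a missing port gets zeros in all three columns, which matches the behavior of the explicit comparisons used above.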

## Train

In [None]:
# Select the features to train on
feature_names = ["Pclass", "Sex", "Fare",
                 "Embarked_C", "Embarked_Q", "Embarked_S"]

# Build the feature DataFrame from the full train data
X_train = train[feature_names]

print(X_train.shape)
X_train.head()

In [None]:
# Select the field to predict
label_name = "Survived"

# Build the label Series from the full train data
y_train = train[label_name]

print(y_train.shape)
y_train.head()

In [None]:
# Import DecisionTreeClassifier from scikit-learn
from sklearn.tree import DecisionTreeClassifier

# random_state: fixes the randomness inside DecisionTreeClassifier so results are reproducible
# Without it, you cannot tell whether a change in score came from a better model or from random variation
seed = 37

# Create the prediction model: a decision tree
model = DecisionTreeClassifier(max_depth=5,
                               random_state=seed)

In [None]:
# Fit the model on the train features and labels
model.fit(X_train, y_train)
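
The effect of `random_state` can be checked with a small experiment. The sketch below uses synthetic data from `make_classification` instead of the Titanic features, and the names `model_a` and `model_b` are hypothetical. Two trees built with the same seed on the same data should give identical predictions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data; a stand-in for the Titanic features.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

seed = 37

# Two models with identical settings and the same seed.
model_a = DecisionTreeClassifier(max_depth=5, random_state=seed).fit(X, y)
model_b = DecisionTreeClassifier(max_depth=5, random_state=seed).fit(X, y)

# With a fixed seed, both trees make the same predictions.
same = np.array_equal(model_a.predict(X), model_b.predict(X))
print(same)
```

With the seed fixed, any change in the predictions can be attributed to a change in the features or the model settings.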

## Predict

In [None]:
# Prepare the test dataset (using the same features as the train dataset)
X_test = test[feature_names]

print(X_test.shape)
X_test.head()
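
Because the model expects the same features at prediction time, it can help to verify the test features before calling `predict`. This is a sketch on made-up rows. The checks (`columns_match`, `missing_per_column`) are hypothetical helpers and not part of the original notebook.

```python
import pandas as pd

feature_names = ["Pclass", "Sex", "Fare",
                 "Embarked_C", "Embarked_Q", "Embarked_S"]

# Made-up, already-preprocessed rows for illustration only.
toy_train = pd.DataFrame({
    "Pclass": [3, 1], "Sex": [0, 1], "Fare": [7.25, 71.28],
    "Embarked_C": [False, True], "Embarked_Q": [False, False],
    "Embarked_S": [True, False], "Survived": [0, 1],
})
toy_test = pd.DataFrame({
    "Pclass": [2, 3], "Sex": [1, 0], "Fare": [13.0, 8.05],
    "Embarked_C": [False, False], "Embarked_Q": [True, False],
    "Embarked_S": [False, True],
})

X_train = toy_train[feature_names]
X_test = toy_test[feature_names]

# Same columns in the same order, and no missing values left in the test features.
columns_match = list(X_train.columns) == list(X_test.columns)
missing_per_column = X_test.isnull().sum()

print(columns_match)
print(missing_per_column)
```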

In [None]:
# Predict the labels of the test dataset with the trained model
prediction = model.predict(X_test)

print(prediction.shape)
prediction[:20]

## Submit

In [None]:
# Load the template file used to build the submission
submission = pd.read_csv("data/gender_submission.csv", index_col="PassengerId")

# Overwrite the Survived column with the predictions
submission["Survived"] = prediction

print(submission.shape)
submission.head()

In [None]:
# Add a timestamp to the filename so saved files can be told apart
from datetime import datetime

current_date = datetime.now()
current_date = current_date.strftime("%Y-%m-%d_%H-%M-%S")

description = "titanic-decision-tree"

filename = "{date}_{desc}.csv".format(date=current_date, desc=description)
filepath = "data/{filename}".format(filename=filename)

submission.to_csv(filepath, index=True)
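
As a quick sanity check, a saved submission can be read back to confirm that it has the two expected columns, `PassengerId` and `Survived`. The sketch below builds a toy submission from made-up IDs and predictions and writes it to a temporary directory, so it does not touch the `data/` folder.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Made-up IDs and predictions for illustration only.
passenger_ids = pd.Index([892, 893, 894], name="PassengerId")
prediction = np.array([0, 1, 0])

submission = pd.DataFrame({"Survived": prediction}, index=passenger_ids)

# Write to a temporary directory, then read the file back.
with tempfile.TemporaryDirectory() as tmp_dir:
    filepath = os.path.join(tmp_dir, "toy_submission.csv")
    submission.to_csv(filepath, index=True)
    reloaded = pd.read_csv(filepath)

print(list(reloaded.columns))
print(len(reloaded))
```

Writing with `index=True` is what puts `PassengerId` into the file as the first column, since it is stored as the DataFrame index.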