# Titanic: Machine Learning from Disaster
### Predict survival on the Titanic
* Defining the problem statement
* Collecting the data
* Exploratory data analysis
* Feature engineering
* Modelling
* Testing

## 1. Defining the problem statement
The goal is to analyse what sorts of people were likely to survive the sinking of the Titanic.  
In particular, we apply machine learning tools to predict which passengers survived the tragedy.

In [79]:
from IPython.display import Image
Image(url="https://static1.squarespace.com/static/5006453fe4b09ef2252ba068/5095eabce4b06cb305058603/5095eabce4b02d37bef4c24c/1352002236895/100_anniversary_titanic_sinking_by_esai8mellows-d4xbme8.jpg")

## 2. Collecting the data
[Kaggle Titanic datasets](https://www.kaggle.com/c/titanic/data)

### Load datasets using pandas

In [80]:
import pandas as pd
import numpy as np

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

## 3. Exploratory data analysis

In [81]:
train.head()

| |PassengerId|Survived|Pclass|Name|Sex|Age|SibSp|Parch|Ticket|Fare|Cabin|Embarked|
|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
|0|1|0|3|Braund, Mr. Owen Harris|male|22.0|1|0|A/5 21171|7.25|NaN|S|
|1|2|1|1|Cumings, Mrs. John Bradley (Florence Briggs Th...|female|38.0|1|0|PC 17599|71.2833|C85|C|
|2|3|1|3|Heikkinen, Miss. Laina|female|26.0|0|0|STON/O2. 3101282|7.925|NaN|S|
|3|4|1|1|Futrelle, Mrs. Jacques Heath (Lily May Peel)|female|35.0|1|0|113803|53.1|C123|S|
|4|5|0|3|Allen, Mr. William Henry|male|35.0|0|0|373450|8.05|NaN|S|


### Data Dictionary
|Variable|Definition|Key|
|:-:|:-:|:-:|
|Survived|Survival|0 = No, 1 = Yes|
|Pclass|Ticket Class|1 = 1st, 2 = 2nd, 3 = 3rd|
|Sex|Sex||
|Age|Age in years||
|SibSp|# of siblings / spouses aboard the Titanic||
|Parch|# of parents / children aboard the Titanic||
|Ticket|Ticket number||
|Fare|Passenger fare||
|Cabin|Cabin number||
|Embarked|Port of Embarkation|C = Cherbourg, Q = Queenstown, S = Southampton|

### Total rows and columns

In [82]:
print(train.shape, test.shape)

(891, 12) (418, 11)


## 4. Feature engineering

### Handling null values

In [83]:
train.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

#### Age

In [84]:
train['Age_Mean'] = train['Age'].fillna(train['Age'].mean())
train['Age_Mean'].isnull().sum()

0

In [85]:
test['Age_Mean'] = test['Age'].fillna(test['Age'].mean())
test['Age_Mean'].isnull().sum()

0
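The mean imputation used for `Age` can be illustrated on a small example. This is a minimal sketch on hypothetical toy values (the `ages` series is not taken from the dataset); it only shows that `mean()` skips missing values and that `fillna` leaves no nulls behind.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values standing in for the Age column
ages = pd.Series([22.0, np.nan, 26.0, 36.0, np.nan], name="Age")

# mean() ignores NaN by default, so only the three known ages are averaged
age_mean = ages.mean()

# Replace every missing value with that mean
ages_filled = ages.fillna(age_mean)

print(age_mean)                    # 28.0
print(ages_filled.tolist())        # [22.0, 28.0, 26.0, 36.0, 28.0]
print(ages_filled.isnull().sum())  # 0
```

Note that the notebook fills each dataset with its own mean. A common alternative is to reuse the training mean for the test set so that both sets go through exactly the same transformation.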

### Creating new features

#### Gender

In [86]:
train['Sex'].value_counts()

male      577
female    314
Name: Sex, dtype: int64

In [87]:
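# Note: string comparison is case-sensitive. The values are lowercase ('male', 'female'),
# so comparing against 'Female' matches no rows and every result is False.
train['Sex'] == 'Female'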

0      False
1      False
2      False
3      False
4      False
       ...  
886    False
887    False
888    False
889    False
890    False
Name: Sex, Length: 891, dtype: bool
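The all-`False` result above comes from the capital `F`: the column stores lowercase `female`. A minimal sketch on hypothetical toy values shows the difference, along with one possible way to make the comparison robust to casing (the `.str.lower()` step is an illustration, not something the notebook does).

```python
import pandas as pd

# Hypothetical toy values standing in for the Sex column
sex = pd.Series(["male", "female", "female", "male"], name="Sex")

wrong_case = sex == "Female"               # case-sensitive: matches nothing
right_case = sex == "female"               # matches the stored lowercase values
normalized = sex.str.lower() == "female"   # lowercases the data before comparing

print(wrong_case.sum(), right_case.sum(), normalized.sum())  # 0 2 2
```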

In [88]:
train['Gender'] = (train['Sex'] == 'female')
test['Gender'] = (test['Sex'] == 'female')

In [89]:
print(train['Gender'].value_counts()) 
print(test['Gender'].value_counts())

False    577
True     314
Name: Gender, dtype: int64
False    266
True     152
Name: Gender, dtype: int64


#### Embarked

In [90]:
train['Embarked'].value_counts()

S    644
C    168
Q     77
Name: Embarked, dtype: int64

In [91]:
test['Embarked'].value_counts()

S    270
C    102
Q     46
Name: Embarked, dtype: int64

In [92]:
train['Embarked_S'] = train['Embarked'] == 'S'
train['Embarked_C'] = train['Embarked'] == 'C'
train['Embarked_Q'] = train['Embarked'] == 'Q'

test['Embarked_S'] = test['Embarked'] == 'S'
test['Embarked_C'] = test['Embarked'] == 'C'
test['Embarked_Q'] = test['Embarked'] == 'Q'

In [93]:
print(train['Embarked_S'].sum(), train['Embarked_C'].sum(), train['Embarked_Q'].sum())
print(test['Embarked_S'].sum(), test['Embarked_C'].sum(), test['Embarked_Q'].sum())

644 168 77
270 102 46
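The three flag columns are a hand-written one-hot encoding. As a rough sketch, `pd.get_dummies` produces the same kind of columns in one call; the toy `embarked` series below is hypothetical. The train counts above sum to 889 rather than 891 because two rows have no `Embarked` value, and such rows get `False` in all three flags. `get_dummies` treats missing values the same way by default.

```python
import pandas as pd

# Hypothetical toy values standing in for the Embarked column (one value missing)
embarked = pd.Series(["S", "C", None, "Q", "S"], name="Embarked")

# One column per category, named Embarked_<value>; columns come out sorted
dummies = pd.get_dummies(embarked, prefix="Embarked")

print(list(dummies.columns))        # ['Embarked_C', 'Embarked_Q', 'Embarked_S']
print(int(dummies.iloc[2].sum()))   # 0: the missing row has no flag set
```

Writing the comparisons by hand, as the notebook does, guarantees that train and test end up with identical columns even if a category were absent from one of them.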


#### Family
|Family Size|Family Type|
|:-:|:-:|
|FamilySize(x) == 1|S|
|2 <= FamilySize(x) < 5|M|
|FamilySize(x) >= 5|L|

In [94]:
# Family for train data
train['FamilySize'] = train['SibSp'] + train['Parch'] + 1
# Copy as object dtype so the string labels below can be assigned without a dtype conflict
train['Family'] = train['FamilySize'].astype(object)

train.loc[train['FamilySize'] == 1, 'Family'] = 'S'
train.loc[(train['FamilySize'] >= 2) & (train['FamilySize'] < 5), 'Family'] = 'M'
train.loc[train['FamilySize'] >= 5, 'Family'] = 'L'

# Family for test data
test['FamilySize'] = test['SibSp'] + test['Parch'] + 1
# Same object-dtype copy for the test set
test['Family'] = test['FamilySize'].astype(object)

test.loc[test['FamilySize'] == 1, 'Family'] = 'S'
test.loc[(test['FamilySize'] >= 2) & (test['FamilySize'] < 5), 'Family'] = 'M'
test.loc[test['FamilySize'] >= 5, 'Family'] = 'L'

In [96]:
train[['FamilySize', 'Family']].head()

| |FamilySize|Family|
|:-:|:-:|:-:|
|0|2|M|
|1|2|M|
|2|1|S|
|3|2|M|
|4|1|S|


In [97]:
test[['FamilySize', 'Family']].head()

| |FamilySize|Family|
|:-:|:-:|:-:|
|0|1|S|
|1|2|M|
|2|1|S|
|3|1|S|
|4|3|M|


In [99]:
print(train['Family'].value_counts()) 
print(test['Family'].value_counts())

S    537
M    292
L     62
Name: Family, dtype: int64
S    253
M    145
L     20
Name: Family, dtype: int64
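The size-to-type rule from the Family table can also be expressed as binning. The following is a minimal sketch using `pd.cut` on hypothetical toy sizes; the bin edges are chosen here to reproduce the table (1 is `S`, 2 to 4 is `M`, 5 and above is `L`) and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy family sizes
family_size = pd.Series([1, 2, 4, 5, 11], name="FamilySize")

# Right-closed bins: (0, 1] -> S, (1, 4] -> M, (4, inf] -> L
family = pd.cut(family_size, bins=[0, 1, 4, np.inf], labels=["S", "M", "L"])

print(family.tolist())  # ['S', 'M', 'M', 'L', 'L']
```

`pd.cut` returns a categorical column, whereas the `.loc` assignments in the notebook produce plain strings; either form works with the `== 'S'` style comparisons that follow.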


In [106]:
train['Family_S'] = train['Family'] == 'S'
train['Family_M'] = train['Family'] == 'M'
train['Family_L'] = train['Family'] == 'L'

test['Family_S'] = test['Family'] == 'S'
test['Family_M'] = test['Family'] == 'M'
test['Family_L'] = test['Family'] == 'L'

#### Pclass to categorical data

In [100]:
train['Pclass'] = train['Pclass'].astype('category')
train['Pclass']

0      3
1      1
2      3
3      1
4      3
      ..
886    2
887    1
888    3
889    1
890    3
Name: Pclass, Length: 891, dtype: category
Categories (3, int64): [1, 2, 3]

## 5. Modelling

Training: features `x_train` and labels `y_label` -> modelling (decision tree algorithm) => model  
Prediction: test input data `x_test` -> fed into the model => predicted survival
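The flow described above can be sketched end to end on a tiny example. All `*_toy` names and values below are hypothetical and exist only to show the fit-then-predict sequence with `DecisionTreeClassifier`; they are not drawn from the Titanic data.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training features and labels
x_train_toy = pd.DataFrame({
    "Gender": [True, True, False, False],
    "Age_Mean": [30.0, 25.0, 40.0, 35.0],
})
y_label_toy = pd.Series([1, 1, 0, 0], name="Survived")

# Hypothetical unseen rows with the same columns as the training features
x_test_toy = pd.DataFrame({
    "Gender": [True, False],
    "Age_Mean": [28.0, 50.0],
})

# Learn from features and labels, then predict labels for the unseen rows
model_toy = DecisionTreeClassifier(max_depth=3, random_state=2020)
model_toy.fit(x_train_toy, y_label_toy)
prediction_toy = model_toy.predict(x_test_toy)

print(prediction_toy.tolist())  # [1, 0]
```

The test frame must contain the same feature columns, in the same order, as the frame used for fitting.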

In [110]:
# Independent variables (input features)
fn = ['Gender', 'Age_Mean', 'Embarked_S', 'Embarked_C', 'Embarked_Q', 'Family_S', 'Family_M', 'Family_L']

# Training inputs (the "problem")
x_train = train[fn]

# Labels (the "answer"): dependent variable, i.e. the output data
y_label = train['Survived']

# Test input data
x_test = test[fn]

In [123]:
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(max_depth=3, random_state=2020)
model.fit(x_train, y_label)

prediction = model.predict(x_test)
prediction.shape

(418,)
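Because the test set has no labels, accuracy is only known after submitting. One way to get a local estimate first is cross-validation on the training data. The sketch below uses synthetic data (the `toy` frame and `toy_label` are hypothetical, with the label simply mirroring `Gender`), so its scores say nothing about the real dataset; with the notebook's variables the call would take `x_train` and `y_label` instead.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training features
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "Gender": rng.random(n) < 0.35,
    "Age_Mean": rng.uniform(1, 80, size=n),
})
# Hypothetical label: in this synthetic data survival mirrors Gender
toy_label = toy["Gender"].astype(int)

# 5-fold cross-validated accuracy for the same model settings as above
model_cv = DecisionTreeClassifier(max_depth=3, random_state=2020)
scores = cross_val_score(model_cv, toy, toy_label, cv=5, scoring="accuracy")

print(scores.mean())
```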

## 6. Testing

In [129]:
test['Survived'] = prediction
result = test[['PassengerId', 'Survived']]
result.head()

| |PassengerId|Survived|
|:-:|:-:|:-:|
|0|892|0|
|1|893|1|
|2|894|0|
|3|895|0|
|4|896|1|


In [131]:
result.to_csv("submission.csv", index=False)
pd.read_csv("submission.csv").head()

| |PassengerId|Survived|
|:-:|:-:|:-:|
|0|892|0|
|1|893|1|
|2|894|0|
|3|895|0|
|4|896|1|
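The `index=False` argument matters for the submission file. As a minimal sketch on a hypothetical two-row frame, writing the CSV with the default index adds an extra unnamed column when the file is read back, while `index=False` keeps only the two expected columns.

```python
import io
import pandas as pd

# Hypothetical two-row result frame
result_toy = pd.DataFrame({"PassengerId": [892, 893], "Survived": [0, 1]})

# to_csv() without a path returns the CSV text, so no file is needed here
with_index = pd.read_csv(io.StringIO(result_toy.to_csv()))
without_index = pd.read_csv(io.StringIO(result_toy.to_csv(index=False)))

print(list(with_index.columns))     # ['Unnamed: 0', 'PassengerId', 'Survived']
print(list(without_index.columns))  # ['PassengerId', 'Survived']
```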


## Result
`Score` (Kaggle submission accuracy): 0.77272