# Titanic Survival Classification Problem
- Source : https://www.kaggle.com/c/titanic/data

### Background
- On April 15, 1912, the Titanic sank after colliding with an iceberg
- 1502 of the 2224 passengers and crew died
- Although luck played some part in survival, some groups of people were more likely to survive than others

### Goal 
- Classify passengers who survived using features in the dataset

## 1. Load Packages : Set up an environment

In [1]:
import numpy as np
import pandas as pd

## 2. Load Titanic Data

In [2]:
data = pd.read_csv('./train.csv')

In [3]:
print(data.shape)
print(data.describe(include='all'))

(891, 12)
        PassengerId    Survived      Pclass  \
count    891.000000  891.000000  891.000000   
unique          NaN         NaN         NaN   
top             NaN         NaN         NaN   
freq            NaN         NaN         NaN   
mean     446.000000    0.383838    2.308642   
std      257.353842    0.486592    0.836071   
min        1.000000    0.000000    1.000000   
25%      223.500000    0.000000    2.000000   
50%      446.000000    0.000000    3.000000   
75%      668.500000    1.000000    3.000000   
max      891.000000    1.000000    3.000000   

                                           Name   Sex         Age       SibSp  \
count                                       891   891  714.000000  891.000000   
unique                                      891     2         NaN         NaN   
top     Ford, Mrs. Edward (Margaret Ann Watson)  male         NaN         NaN   
freq                                          1   577         NaN         NaN   
mean                

In [4]:
data.head(10)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


## 3. Understand the data
- Features unlikely to be informative in an ML model : "PassengerId", "Name" (not sure : "Ticket")
- Target (Outcome) Variable : "Survived"
- Features (Inputs) : "Sex", "Age", "SibSp", "Parch", "Pclass" (not sure : "Fare", "Cabin", "Embarked")

In [5]:
print(data.Ticket.isna().sum())   # number of missing ticket values
print(data.Ticket.nunique())      # number of distinct ticket values

0
681


In [6]:
data.loc[data.Ticket=="349909"]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
24,25,0,3,"Palsson, Miss. Torborg Danira",female,8.0,3,1,349909,21.075,,S
374,375,0,3,"Palsson, Miss. Stina Viola",female,3.0,3,1,349909,21.075,,S
567,568,0,3,"Palsson, Mrs. Nils (Alma Cornelia Berglund)",female,29.0,0,4,349909,21.075,,S


In [7]:
data.loc[data.Ticket=="347742"]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
172,173,1,3,"Johnson, Miss. Eleanor Ileen",female,1.0,1,1,347742,11.1333,,S
869,870,1,3,"Johnson, Master. Harold Theodor",male,4.0,1,1,347742,11.1333,,S
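
The two lookups above suggest that passengers who share a ticket number travelled together, so `Ticket` may be more useful as a group-size feature than as a raw string. The following is a minimal sketch of that idea, built only from rows shown in the outputs above; the column name `TicketGroupSize` is an assumption introduced here for illustration and is not part of the dataset.

```python
import pandas as pd

# Rows copied from the outputs above (subset of columns only).
rows = pd.DataFrame({
    "PassengerId": [8, 25, 375, 568, 9, 173, 870, 1],
    "Ticket": ["349909", "349909", "349909", "349909",
               "347742", "347742", "347742", "A/5 21171"],
    "Survived": [0, 0, 0, 0, 1, 1, 1, 0],
})

# Count how many passengers share each ticket and attach the count to every row.
# "TicketGroupSize" is a hypothetical feature name.
rows["TicketGroupSize"] = rows.groupby("Ticket")["PassengerId"].transform("count")
print(rows)
```

The same `groupby(...).transform("count")` call should apply to the full `data` frame, although whether the resulting feature improves a model would need to be checked.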


## 4. Preprocessing

### (1) Data Conversion
- Category to Numeric : male -> 0, female -> 1

In [8]:
binary_vals = { "Sex": {"male": 0, "female": 1}}

In [9]:
data.replace(binary_vals, inplace=True)
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",0,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",1,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",0,35.0,0,0,373450,8.05,,S


In [10]:
## Pclass : treat it as ordinal (keep 1, 2, 3) or nominal (one-hot encode)?
# print(data.Pclass.unique())
# pd.get_dummies(data, columns=["Pclass"]).head()
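
To make the ordinal-versus-nominal question concrete, here is a small sketch of both encodings. It uses a toy frame holding only the `Pclass` values of the first five passengers shown earlier; the variable names `toy`, `ordinal`, and `nominal` are assumptions for illustration.

```python
import pandas as pd

# Pclass values of the first five passengers in data.head() above.
toy = pd.DataFrame({"Pclass": [3, 1, 3, 1, 3]})

# Ordinal view: keep the single integer column, so 1 < 2 < 3 is preserved.
ordinal = toy[["Pclass"]]

# Nominal view: one 0/1 indicator column per class that appears in the data.
nominal = pd.get_dummies(toy, columns=["Pclass"], dtype=int)
print(nominal)
```

A decision tree can often work with the integer coding directly, while models that assume a linear effect may benefit from the indicator columns; which choice is better here is an empirical question.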

### (2) Divide data into Training / Test sets

In [11]:
X = data[["Sex", "Age", "SibSp", "Parch", "Pclass"]]
y = data["Survived"]

In [51]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

### (3) Missing Value Imputation

In [52]:
from sklearn.impute import SimpleImputer
# Learn the mean age from the training set only, then reuse it for the test set
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
X_train = X_train.copy()
X_test = X_test.copy()
X_train['Age'] = imp.fit_transform(X_train[['Age']]).ravel()
X_test['Age'] = imp.transform(X_test[['Age']]).ravel()
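
The cell above calls `fit_transform` on the training set and only `transform` on the test set, so the fill value comes from the training data alone. The sketch below illustrates that behaviour on made-up ages (these numbers are not taken from `train.csv`).

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy ages, invented for illustration only.
train = pd.DataFrame({"Age": [20.0, 30.0, np.nan, 40.0]})
test = pd.DataFrame({"Age": [np.nan, 60.0]})

imp = SimpleImputer(missing_values=np.nan, strategy="mean")
train_filled = imp.fit_transform(train[["Age"]]).ravel()  # learns the training mean
test_filled = imp.transform(test[["Age"]]).ravel()        # reuses the training mean

print(imp.statistics_)  # fill value learned from the training set
print(test_filled)
```

The missing test age is filled with the training mean rather than the test mean, which keeps information from the test set out of the preprocessing step.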

## 5. Train Classification Models

### (1) Classification Tree
- General Description of Tree : https://scikit-learn.org/stable/modules/tree.html
- Documentation of DecisionTreeClassifier : https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier

In [53]:
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
model_1 = DecisionTreeClassifier()
model_1 = model_1.fit(X_train, y_train)

In [15]:
# List the public attributes and methods of the fitted tree
[name for name in dir(model_1) if not name.startswith('_')]

In [None]:
# predict() requires a feature matrix; use the held-out test features
model_1.predict(X_test)
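
`predict` expects a feature matrix with the same columns used for fitting, and its output can then be compared with the true labels. Below is a minimal, self-contained sketch of that fit, predict, and score sequence. The toy data is made up (the label simply equals `Sex`) and `accuracy_score` is one possible metric, not necessarily the one intended for this notebook.

```python
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

# Made-up toy data (not from train.csv): the label is identical to Sex.
X_train = pd.DataFrame({"Sex": [0, 0, 1, 1, 0, 1], "Pclass": [3, 1, 3, 1, 2, 2]})
y_train = pd.Series([0, 0, 1, 1, 0, 1])
X_test = pd.DataFrame({"Sex": [1, 0, 1], "Pclass": [3, 3, 1]})
y_test = pd.Series([1, 0, 1])

# Fit a tree, predict on held-out rows, and score the predictions.
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
y_pred = model.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print(y_pred, acc)
```

In the notebook itself the equivalent call would be `accuracy_score(y_test, model_1.predict(X_test))`, assuming the train/test split and imputation cells above have been run.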

In [None]:
import sklearn
sklearn.__version__