# Titanic Survival Prediction

## Approach

**Data preparation**

- Handling missing values
- Handling categorical features

**Train & Tune Model**

- Train model
- Test accuracy
- Tune model parameters

**Run Prediction**

- Update test data set
- Sanity check

## Part 1: Load Data & Handling Missing Values

In [1]:
# read the Titanic training data
import pandas as pd
path = '../data/'
url = path + 'train.csv'
titanic = pd.read_csv(url, index_col='PassengerId')
titanic.shape

(891, 11)

Most scikit-learn estimators, including the KNN classifier used below, require every feature to be **numeric** and every value to **hold meaning**. Missing values (`NaN`) therefore have to be filled in or dropped before training.
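
As a quick illustration of that constraint, the sketch below fits a KNN classifier on a tiny made-up frame containing one `NaN`. The toy values and the names `X_toy`, `y_toy`, and `knn_toy` are assumptions for illustration, not part of the Titanic data. The point is only that the fit is rejected until the gap is filled.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Toy data (hypothetical values) with one missing Age
X_toy = pd.DataFrame({'Age': [22.0, np.nan, 35.0, 54.0],
                      'Fare': [7.25, 71.28, 8.05, 51.86]})
y_toy = [0, 1, 0, 1]

knn_toy = KNeighborsClassifier(n_neighbors=1)

try:
    knn_toy.fit(X_toy, y_toy)
    rejected = False
except ValueError:
    # scikit-learn validates its input and refuses NaN here
    rejected = True

# After imputing with the column median, the same call succeeds
X_filled = X_toy.fillna(X_toy.median())
knn_toy.fit(X_filled, y_toy)
print(rejected)
```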

In [2]:
# check for missing values
titanic.isnull().sum()

Survived      0
Pclass        0
Name          0
Sex           0
Age         177
SibSp         0
Parch         0
Ticket        0
Fare          0
Cabin       687
Embarked      2
dtype: int64

In [3]:
# fill missing values for Age with the median age
# (assign the result back instead of using inplace=True on a column accessor)
titanic['Age'] = titanic.Age.fillna(titanic.Age.median())
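
To make the effect of median imputation concrete, here is a small sketch on a made-up Series (the ages are hypothetical). It assigns the filled result to a new variable rather than filling in place, which is one way to avoid chained-assignment surprises in recent pandas versions.

```python
import pandas as pd

# Hypothetical ages with two gaps
ages = pd.Series([22.0, None, 38.0, 26.0, None, 35.0])

median_age = ages.median()          # NaN values are skipped by default
filled = ages.fillna(median_age)    # returns a new Series

print(median_age)
print(filled.tolist())
```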

In [5]:
titanic.Embarked.value_counts()

S    644
C    168
Q     77
Name: Embarked, dtype: int64

In [7]:
# fill missing values for Embarked with the mode ('S', the most frequent port above)
titanic['Embarked'] = titanic.Embarked.fillna('S')
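
Rather than reading the most frequent value off the `value_counts()` output and typing it in, the mode can be computed. A minimal sketch on a made-up Series (`ports` is a hypothetical name, and the values are invented):

```python
import pandas as pd

ports = pd.Series(['S', 'C', None, 'S', 'Q', 'S', None])

# mode() returns a Series because ties are possible; take the first entry
most_common = ports.mode()[0]
ports_filled = ports.fillna(most_common)

print(most_common)
print(ports_filled.isnull().sum())
```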

In [8]:
# read the Titanic test data
import pandas as pd
path = '../data/'
url = path + 'test.csv'
titanic_test = pd.read_csv(url, index_col='PassengerId')
titanic_test.shape

(418, 10)

In [9]:
# check for missing values
titanic_test.isnull().sum()

Pclass        0
Name          0
Sex           0
Age          86
SibSp         0
Parch         0
Ticket        0
Fare          1
Cabin       327
Embarked      0
dtype: int64

In [10]:
# fill missing Age and Fare values with the test-set medians
titanic_test['Age'] = titanic_test.Age.fillna(titanic_test.Age.median())
titanic_test['Fare'] = titanic_test.Fare.fillna(titanic_test.Fare.median())
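
The cell above fills the test set with its own medians. A common alternative, shown here only as a sketch on made-up frames, is to compute the fill values on the training data and reuse them for the test data, so that both sets go through the same transformation. The frame names and values below are assumptions for illustration.

```python
import pandas as pd

# Hypothetical stand-ins for the training and test frames
train_toy = pd.DataFrame({'Age': [20.0, 30.0, None, 40.0],
                          'Fare': [10.0, 20.0, 30.0, 40.0]})
test_toy = pd.DataFrame({'Age': [None, 50.0],
                         'Fare': [15.0, None]})

# Learn the fill values from the training data only
train_medians = train_toy[['Age', 'Fare']].median()

# fillna with a Series fills each column using the matching label
train_filled = train_toy.fillna(train_medians)
test_filled = test_toy.fillna(train_medians)

print(train_medians.to_dict())
print(test_filled)
```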

## Part 2: Handling categorical features

- **Ordered categories:** transform them to sensible numeric values (example: small=1, medium=2, large=3)
- **Unordered categories:** use dummy encoding (0/1)
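
The two strategies above can be sketched on a small made-up frame. The column names `size` and `color` and their values are hypothetical and chosen only to show one ordered and one unordered feature.

```python
import pandas as pd

df = pd.DataFrame({'size': ['small', 'large', 'medium'],
                   'color': ['red', 'blue', 'red']})

# Ordered category: map to numbers that preserve the order
df['size_num'] = df['size'].map({'small': 1, 'medium': 2, 'large': 3})

# Unordered category: one 0/1 column per level
color_dummies = pd.get_dummies(df['color'], prefix='color', dtype=int)
df = pd.concat([df, color_dummies], axis=1)

print(df[['size_num', 'color_blue', 'color_red']])
```

With k levels, only k-1 dummy columns carry independent information, which is why the cells below drop one of the `Embarked` columns.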

In [11]:
# Create and encode Female feature
titanic['Female'] = titanic.Sex.map({'male':0, 'female':1})
titanic_test['Female'] = titanic_test.Sex.map({'male':0, 'female':1})

In [12]:
# create a DataFrame of dummy variables for Embarked
embarked_dummies = pd.get_dummies(titanic.Embarked, prefix='Embarked')
# drop the first dummy column (Embarked_C), which serves as the baseline
embarked_dummies.drop(embarked_dummies.columns[0], axis=1, inplace=True)

# concatenate the original DataFrame and the dummy DataFrame
titanic = pd.concat([titanic, embarked_dummies], axis=1)

In [13]:
# create a DataFrame of dummy variables for Embarked in the test set
embarked_dummies = pd.get_dummies(titanic_test.Embarked, prefix='Embarked')
# drop the first dummy column (Embarked_C), matching the training set
embarked_dummies.drop(embarked_dummies.columns[0], axis=1, inplace=True)

# concatenate the original DataFrame and the dummy DataFrame
titanic_test = pd.concat([titanic_test, embarked_dummies], axis=1)
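
Dropping the first dummy column by position works here because both sets contain the same three ports. If a category were missing from one set, the positions would no longer line up. The sketch below, on made-up Series, shows one way to guard against that by reindexing the test dummies to the training columns and dropping the baseline by name. The variable names are hypothetical.

```python
import pandas as pd

train_emb = pd.Series(['S', 'C', 'Q', 'S'])
test_emb = pd.Series(['S', 'S', 'Q'])      # no 'C' in this toy test set

train_d = pd.get_dummies(train_emb, prefix='Embarked', dtype=int)
test_d = pd.get_dummies(test_emb, prefix='Embarked', dtype=int)

# Give the test dummies exactly the training columns; absent levels become 0
test_d = test_d.reindex(columns=train_d.columns, fill_value=0)

# Drop the same baseline column by name in both frames
train_d = train_d.drop(columns='Embarked_C')
test_d = test_d.drop(columns='Embarked_C')

print(list(test_d.columns))
```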

## Part 3: Train and Tune the Model

In [15]:
feature_cols = ['Pclass', 'Age', 'Female', 'Embarked_S', 'SibSp', 'Fare', 'Parch']
X_train = titanic[feature_cols]
y_train = titanic.Survived

# import the KNN class we need from scikit-learn
from sklearn.neighbors import KNeighborsClassifier

# instantiate the estimator
knn = KNeighborsClassifier(n_neighbors=2, weights='distance') # Tune these parameters!

# fit the model on the training data
knn.fit(X_train, y_train)

# store the model's predictions for the training rows
titanic['y_pred_class_knn'] = knn.predict(X_train)

# Training accuracy: the model is scored on the rows it was fit on. Each row
# finds itself at distance zero and weights='distance' lets that match decide
# the vote, so this number is optimistic.
print(knn.score(X_train, y_train))

0.978675645342
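
The score printed above is computed on the same rows the model was fit on, so it says little about unseen passengers. One way to tune `n_neighbors` more honestly is k-fold cross-validation. The sketch below uses synthetic data from `make_classification` as a stand-in for `X_train` and `y_train`, and the candidate values of k are arbitrary. The `StandardScaler` step is an addition that the notebook does not use; it is included on the assumption that a distance-based model benefits when features on different scales are standardized.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_train / y_train (not the Titanic data)
X_demo, y_demo = make_classification(n_samples=200, n_features=7, random_state=0)

cv_scores = {}
for k in [1, 3, 5, 11, 21]:
    # The scaler sits inside the pipeline, so it is refit on each training fold
    model = make_pipeline(StandardScaler(),
                          KNeighborsClassifier(n_neighbors=k, weights='distance'))
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring='accuracy')
    cv_scores[k] = scores.mean()

# Pick the k with the highest mean held-out accuracy
best_k = max(cv_scores, key=cv_scores.get)
print(best_k, cv_scores[best_k])
```

On the real data, `X_demo` and `y_demo` would be replaced by `X_train` and `y_train`, and the chosen k passed to the final estimator before predicting.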


## Part 4: Make Predictions

### Update Test Dataset & Create Submission File

In [16]:
X_test = titanic_test[feature_cols]

# make predictions for testing set
titanic_test['Survived'] = knn.predict(X_test)

# write the predictions, indexed by PassengerId, without a header row
path = '../data/'
url = path + 'submit.csv'
titanic_test.to_csv(path_or_buf=url, columns=['Survived'], header=False)
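
The call above writes the index and the `Survived` column with no header row. If whatever reads the file expects column names (an assumption; the notebook does not say), the header can be kept. The sketch below writes to an in-memory buffer so the result can be inspected. The passenger IDs and predictions are made up.

```python
import io
import pandas as pd

# Hypothetical predictions indexed by PassengerId
preds = pd.DataFrame({'Survived': [0, 1, 1]},
                     index=pd.Index([892, 893, 894], name='PassengerId'))

buf = io.StringIO()
# header=True writes the index name and the column name as the first row
preds.to_csv(buf, columns=['Survived'], header=True)

lines = buf.getvalue().splitlines()
print(lines)
```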

### Sanity Check

In [20]:
# compare the survival rate in the training labels with the predicted rate on the test set
print(titanic.Survived.value_counts() / titanic.Survived.count())
print(titanic_test.Survived.value_counts() / titanic_test.Survived.count())

0    0.616162
1    0.383838
Name: Survived, dtype: float64
0    0.595694
1    0.404306
Name: Survived, dtype: float64
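
The same class proportions can be obtained with the `normalize=True` option of `value_counts()`, which divides by the number of non-null values. A small sketch on a made-up Series (`labels` is a hypothetical name):

```python
import pandas as pd

labels = pd.Series([0, 0, 0, 1, 1])

manual = labels.value_counts() / labels.count()
rates = labels.value_counts(normalize=True)

print(rates)
```

A predicted survival rate far from the training rate would be a hint that something went wrong in preprocessing or prediction. Similar rates are reassuring but do not by themselves show that the individual predictions are correct.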


In [21]:
titanic_test.shape

(418, 14)