# Titanic Survival Prediction

## Approach

**Data preparation**

- Handling missing values
- Handling categorical features

**Train & Tune Model**

- Train model
- Test accuracy
- Tune model parameters

**Make Prediction**

- Update test data set
- Sanity check

## Part 1: Load Data & Handling Missing Values

In [162]:
# read the Titanic training data
import pandas as pd
path = '../data/'
url = path + 'train.csv'
titanic = pd.read_csv(url, index_col='PassengerId')
titanic.shape

(891, 11)

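Most scikit-learn estimators expect every feature value to be **numeric** and to **hold meaning** (the numbers should reflect a real quantity or order). Missing values (`NaN`) are therefore rejected by most estimators and must be filled in or dropped before fitting.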
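A quick pre-flight check can make this requirement concrete. The sketch below is only an illustration on a made-up toy frame; the helper name `find_model_blockers` is hypothetical and not part of the original notebook.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def find_model_blockers(df):
    """Return the columns that are non-numeric or contain missing values.

    Hypothetical helper for illustration only.
    """
    non_numeric = [c for c in df.columns if not is_numeric_dtype(df[c])]
    has_missing = [c for c in df.columns if df[c].isnull().any()]
    return {"non_numeric": non_numeric, "has_missing": has_missing}


# Toy frame (made-up values) with one text column and one missing age
toy = pd.DataFrame({
    "Age": [22.0, None, 35.0],
    "Sex": ["male", "female", "female"],
    "Fare": [7.25, 71.28, 8.05],
})

blockers = find_model_blockers(toy)
print(blockers)
```

Running a check like this after each preparation step shows which columns still need imputing or encoding.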

In [163]:
# check for missing values
titanic.isnull().sum()

Survived      0
Pclass        0
Name          0
Sex           0
Age         177
SibSp         0
Parch         0
Ticket        0
Fare          0
Cabin       687
Embarked      2
dtype: int64

In [164]:
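# fill missing values for Age with the median age
# (assign the result back; a chained inplace fillna may not update the frame in recent pandas)
titanic['Age'] = titanic['Age'].fillna(titanic['Age'].median())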

In [165]:
titanic.Embarked.value_counts()

S    644
C    168
Q     77
Name: Embarked, dtype: int64

In [166]:
# fill missing values for Embarked with the mode ('S', per the counts above)
titanic['Embarked'] = titanic['Embarked'].fillna('S')

In [167]:
# read the Titanic test data
import pandas as pd
path = '../data/'
url = path + 'test.csv'
titanic_test = pd.read_csv(url)
titanic_test.shape

(418, 11)

In [168]:
# check for missing values
titanic_test.isnull().sum()

PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64

In [169]:
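# fill missing Age and Fare values in the test set with that set's medians
titanic_test['Age'] = titanic_test['Age'].fillna(titanic_test['Age'].median())
titanic_test['Fare'] = titanic_test['Fare'].fillna(titanic_test['Fare'].median())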
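The cell above fills the test set with its own medians. An alternative, sketched here on made-up toy values, is to learn the fill values from the training set once and reuse them on both frames so the two are preprocessed identically. The names `train`, `test`, and `fill_values` are hypothetical stand-ins, not variables from this notebook.

```python
import pandas as pd

# Toy stand-ins for the training and test frames (made-up values)
train = pd.DataFrame({"Age": [20.0, None, 40.0, 30.0],
                      "Fare": [10.0, 20.0, 30.0, 40.0]})
test = pd.DataFrame({"Age": [None, 50.0],
                     "Fare": [None, 5.0]})

# Learn the fill values from the training data only
fill_values = {"Age": train["Age"].median(), "Fare": train["Fare"].median()}

# Apply the same values to both frames; fillna with a dict fills column by column
train = train.fillna(fill_values)
test = test.fillna(fill_values)

print(fill_values)
print(test)
```

Whether this changes the final score on a data set this small is an open question; the main benefit is that the test rows never influence the preprocessing.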

## Part 2: Handling categorical features

- **Ordered categories:** transform them to sensible numeric values (example: small=1, medium=2, large=3)
- **Unordered categories:** use dummy encoding (0/1)
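Both encodings can be sketched on a small made-up frame. The column names `size` and `port` are hypothetical and chosen only to mirror the two cases in the list above.

```python
import pandas as pd

# Toy data: 'size' has a natural order, 'port' does not
toy = pd.DataFrame({
    "size": ["small", "large", "medium"],
    "port": ["S", "C", "Q"],
})

# Ordered category: map each level to a number that preserves the order
toy["size_num"] = toy["size"].map({"small": 1, "medium": 2, "large": 3})

# Unordered category: one 0/1 column per level, dropping the first level
# ('C') as the baseline so the remaining columns are not redundant
port_dummies = pd.get_dummies(toy["port"], prefix="port", drop_first=True, dtype=int)
toy = pd.concat([toy, port_dummies], axis=1)

print(toy)
```

Dropping one dummy column loses no information: a row with zeros in every remaining column belongs to the dropped level.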

In [170]:
# Create and encode Female feature
titanic['Female'] = titanic.Sex.map({'male':0, 'female':1})
titanic_test['Female'] = titanic_test.Sex.map({'male':0, 'female':1})

In [171]:
# create a DataFrame of dummy variables for Embarked
embarked_dummies = pd.get_dummies(titanic.Embarked, prefix='Embarked')
embarked_dummies.drop(embarked_dummies.columns[0], axis=1, inplace=True)

# concatenate the original DataFrame and the dummy DataFrame
titanic = pd.concat([titanic, embarked_dummies], axis=1)

In [172]:
# create a DataFrame of dummy variables for Embarked (test set)
embarked_dummies = pd.get_dummies(titanic_test.Embarked, prefix='Embarked')
embarked_dummies.drop(embarked_dummies.columns[0], axis=1, inplace=True)

# concatenate the original DataFrame and the dummy DataFrame
titanic_test = pd.concat([titanic_test, embarked_dummies], axis=1)

In [173]:
# create a DataFrame of dummy variables for Pclass
pclass_dummies = pd.get_dummies(titanic.Pclass, prefix='Pclass')
pclass_dummies.drop(pclass_dummies.columns[0], axis=1, inplace=True)

# concatenate the original DataFrame and the dummy DataFrame
titanic = pd.concat([titanic, pclass_dummies], axis=1)

In [174]:
# create a DataFrame of dummy variables for Pclass (test set)
pclass_dummies = pd.get_dummies(titanic_test.Pclass, prefix='Pclass')
pclass_dummies.drop(pclass_dummies.columns[0], axis=1, inplace=True)

# concatenate the original DataFrame and the dummy DataFrame
titanic_test = pd.concat([titanic_test, pclass_dummies], axis=1)
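Encoding the training and test sets separately works here because both contain every level of `Embarked` and `Pclass`. If a level were absent from one set, the dummy columns would no longer line up. A minimal sketch of one way to guard against that, on made-up toy frames (`train` and `test` are hypothetical names):

```python
import pandas as pd

train = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})
test = pd.DataFrame({"Embarked": ["S", "C"]})  # no 'Q' in this toy test set

train_dummies = pd.get_dummies(train["Embarked"], prefix="Embarked", dtype=int)
test_dummies = pd.get_dummies(test["Embarked"], prefix="Embarked", dtype=int)

# Give the test dummies exactly the training columns, in the same order,
# filling any level that never appears in the test set with 0
test_dummies = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(test_dummies)
```

The baseline column can then be dropped from both frames by name, which also keeps the same column as the baseline in both sets.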


In [175]:
# Combine the sibling/spouse and parent/child counts into one binary flag:
# Family = 1 if the passenger travelled with any relatives, else 0
titanic['Family'] = ((titanic['Parch'] + titanic['SibSp']) > 0).astype(int)

titanic_test['Family'] = ((titanic_test['Parch'] + titanic_test['SibSp']) > 0).astype(int)

In [176]:
# drop the raw columns that have been encoded or are not used as features
drop_cols = ['Cabin', 'Name', 'Sex', 'Ticket', 'Embarked', 'Pclass', 'Parch', 'SibSp']
titanic = titanic.drop(drop_cols, axis=1)
titanic_test = titanic_test.drop(drop_cols, axis=1)

In [177]:
titanic.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 9 columns):
Survived      891 non-null int64
Age           891 non-null float64
Fare          891 non-null float64
Female        891 non-null int64
Embarked_Q    891 non-null float64
Embarked_S    891 non-null float64
Pclass_2      891 non-null float64
Pclass_3      891 non-null float64
Family        891 non-null int64
dtypes: float64(6), int64(3)
memory usage: 69.6 KB


In [178]:
titanic_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 9 columns):
PassengerId    418 non-null int64
Age            418 non-null float64
Fare           418 non-null float64
Female         418 non-null int64
Embarked_Q     418 non-null float64
Embarked_S     418 non-null float64
Pclass_2       418 non-null float64
Pclass_3       418 non-null float64
Family         418 non-null int64
dtypes: float64(6), int64(3)
memory usage: 29.5 KB


## Part 3: Train and Tune the Model

In [179]:

# define training and testing sets

X_train = titanic.drop("Survived",axis=1)
y_train = titanic["Survived"]
X_test  = titanic_test.drop("PassengerId",axis=1).copy()

# import the KNN class we need from scikit-learn
from sklearn.neighbors import KNeighborsClassifier

# instantiate the estimator 
knn = KNeighborsClassifier(n_neighbors=2, weights='distance') # Tune these parameters!

# run a knn.fit on the data to build the model
knn.fit(X_train, y_train)

titanic['y_pred_class_knn']=knn.predict(X_train)

# Check accuracy on the training data (an optimistic estimate: the model has already seen these rows)
print(knn.score(X_train, y_train))

0.978675645342
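The score above is measured on the same rows the model was fit on. With `weights='distance'`, each training row is its own nearest neighbor at distance zero, so KNN largely reproduces the labels it memorized. Cross-validation gives a more honest estimate. The sketch below uses synthetic data from `make_classification` as a stand-in for `X_train` and `y_train`, so its numbers say nothing about the Titanic data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for X_train / y_train (not the Titanic data)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

knn = KNeighborsClassifier(n_neighbors=2, weights="distance")

# Accuracy on the rows the model was fit on
train_accuracy = knn.fit(X, y).score(X, y)

# 5-fold cross-validation: each fold is scored by a model that never saw it
cv_scores = cross_val_score(knn, X, y, cv=5, scoring="accuracy")
cv_accuracy = cv_scores.mean()

print("training accuracy:", train_accuracy)
print("cross-validated accuracy:", cv_accuracy)
```

The gap between the two numbers is a rough indicator of how much the training score overstates performance on unseen passengers.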


In [180]:
# Random Forests
from sklearn.ensemble import RandomForestClassifier

random_forest = RandomForestClassifier(n_estimators=100)

random_forest.fit(X_train, y_train)

# accuracy on the training data (optimistic, as with KNN above)
random_forest.score(X_train, y_train)

0.97979797979797978
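The approach lists parameter tuning as a step. One way it could look is a cross-validated grid search; the sketch below is an assumption about how to do it, not the method used for the submission. The grid values are arbitrary and the data is synthetic, standing in for `X_train` and `y_train`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for X_train / y_train (not the Titanic data)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Candidate values to try (arbitrary choices for illustration)
param_grid = {
    "n_neighbors": [1, 3, 5, 11, 21],
    "weights": ["uniform", "distance"],
}

# Evaluate every combination with 5-fold cross-validation
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```

After fitting, `search.best_estimator_` is refit on all of the data with the best parameters and can be used for prediction directly. The same pattern applies to `RandomForestClassifier` with a grid over parameters such as `n_estimators` and `max_depth`.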

## Part 4: Make Predictions

### Update Test Dataset & Create Submission File

In [181]:
# make predictions for testing set
titanic_test['Survived'] = random_forest.predict(X_test)


path = '../data/'
url = path + 'submit_randomforest.csv'
titanic_test.index = titanic_test.PassengerId
titanic_test.to_csv(columns = ['Survived'], path_or_buf = url, header=True)
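The submission file needs two columns, `PassengerId` and `Survived`. An alternative to re-indexing the test frame is to build a small dedicated frame, sketched here with made-up IDs and predictions in place of `titanic_test.PassengerId` and `random_forest.predict(X_test)`.

```python
import io

import pandas as pd

# Made-up stand-ins for the real passenger IDs and model predictions
passenger_ids = [892, 893, 894]
predictions = [0, 1, 0]

submission = pd.DataFrame({"PassengerId": passenger_ids, "Survived": predictions})

# Written to an in-memory buffer for this sketch; pass a file path to save to disk
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

print(csv_text)
```

Keeping the submission separate leaves the feature frame unchanged, so it can be reused with another model without first dropping a `Survived` column.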

### Sanity Check

In [182]:
# compare the share of survivors in the training labels with the share in the predictions
print(titanic.Survived.value_counts() / titanic.Survived.count())
print(titanic_test.Survived.value_counts() / titanic_test.Survived.count())

0    0.616162
1    0.383838
Name: Survived, dtype: float64
0    0.636364
1    0.363636
Name: Survived, dtype: float64


In [183]:
titanic_test.shape

(418, 10)

In [184]:
path = '../data/'
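my_file = 'submit.csv' # NOTE: the cell above wrote 'submit_randomforest.csv'; point this at the file you want to check
comp_file = 'OmarElGabry.csv' # a submission downloaded from Kaggle, used for comparison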

url = path + my_file
my_df = pd.read_csv(url, index_col='PassengerId')

url = path + comp_file
comp_df = pd.read_csv(url, index_col='PassengerId')

joined_df = pd.concat([my_df, comp_df], axis=1)

In [185]:
joined_df.columns=['Mine', 'Compare']
joined_df.head()

| PassengerId | Mine | Compare |
|-------------|------|---------|
| 892         | 0    | 0       |
| 893         | 0    | 0       |
| 894         | 0    | 0       |
| 895         | 1    | 1       |
| 896         | 1    | 0       |


In [186]:
joined_df[joined_df.Mine != joined_df.Compare].count()

Mine       44
Compare    44
dtype: int64
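The count is easier to read as a rate: 44 disagreements out of 418 rows is about 10.5%, so the two submissions agree on roughly 89.5% of passengers. The sketch below computes the same rate and a cross-tabulation using only the five rows shown in the `head()` output above; `joined` is a hypothetical stand-in for `joined_df`.

```python
import pandas as pd

# The five rows from the head() output above
joined = pd.DataFrame(
    {"Mine": [0, 0, 0, 1, 1], "Compare": [0, 0, 0, 1, 0]},
    index=pd.Index([892, 893, 894, 895, 896], name="PassengerId"),
)

# Number of rows where the two submissions disagree, and the agreement rate
n_diff = int((joined["Mine"] != joined["Compare"]).sum())
agreement = 1 - n_diff / len(joined)

# Cross-tabulate to see in which direction the disagreements go
table = pd.crosstab(joined["Mine"], joined["Compare"])

print(n_diff, agreement)
print(table)
```

Agreement with another submission is not accuracy, since the reference file can be wrong too, but a large disagreement is a useful warning sign before uploading.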