# Titanic Survival Prediction

## Approach

**Data preparation**

- Handling missing values
- Handling categorical features

**Train & Tune Model**

- Train model
- Test accuracy
- Tune model parameters

**Make Prediction**

- Update test data set
- Sanity check

## Part 1: Load Data & Handling Missing Values

In [341]:
# read the Titanic training data
import pandas as pd
path = '../data/'
url = path + 'train.csv'
titanic = pd.read_csv(url, index_col='PassengerId')
titanic.shape

(891, 11)

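Most scikit-learn estimators expect every feature value to be **numeric** and **meaningful**, and they raise an error when the input contains missing values (NaN). Missing values therefore have to be filled in or dropped before training.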
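As a rough illustration of that constraint, the sketch below fits a model on a tiny made-up matrix containing one NaN, catches the resulting error, and then fills the gap with the column median. The data, the choice of `LogisticRegression`, and the use of `SimpleImputer` are assumptions for the example only; the notebook itself fills values with pandas.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

# Toy feature matrix with one missing value (made-up numbers, not the Titanic data)
X = np.array([[22.0, 7.25], [38.0, 71.28], [np.nan, 7.92], [35.0, 53.10]])
y = np.array([0, 1, 1, 0])

# Fitting directly on data that contains NaN is expected to raise a ValueError
try:
    LogisticRegression().fit(X, y)
    fit_failed = False
except ValueError:
    fit_failed = True

# Replace each NaN with the median of its column, then fit again
X_filled = SimpleImputer(strategy="median").fit_transform(X)
model = LogisticRegression().fit(X_filled, y)

print(fit_failed)
print(X_filled)
```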

In [342]:
# check for missing values
titanic.isnull().sum()

Survived      0
Pclass        0
Name          0
Sex           0
Age         177
SibSp         0
Parch         0
Ticket        0
Fare          0
Cabin       687
Embarked      2
dtype: int64

In [343]:
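# fill missing values for Age with the median age
# (assigning the result back avoids relying on in-place modification through attribute access)
titanic['Age'] = titanic['Age'].fillna(titanic['Age'].median())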

In [344]:
titanic.Embarked.value_counts()

S    644
C    168
Q     77
Name: Embarked, dtype: int64

In [345]:
# fill missing values for Embarked with the mode ('S', the most frequent port above)
titanic['Embarked'] = titanic['Embarked'].fillna('S')

In [346]:
# read the Titanic test data
import pandas as pd
path = '../data/'
url = path + 'test.csv'
titanic_test = pd.read_csv(url)
titanic_test.shape

(418, 11)

In [347]:
# check for missing values
titanic_test.isnull().sum()

PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64

In [348]:
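# fill missing Age and Fare values in the test set with the test-set medians
titanic_test['Age'] = titanic_test['Age'].fillna(titanic_test['Age'].median())
titanic_test['Fare'] = titanic_test['Fare'].fillna(titanic_test['Fare'].median())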
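The cells above compute fill values separately for each file. A common alternative is to learn the fill values from the training data only and reuse them on the test data, so both sets are treated identically. The sketch below shows that idea on small made-up frames; `train`, `test`, and `fill_values` are hypothetical names, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the training and test frames (made-up values)
train = pd.DataFrame({"Age": [22.0, np.nan, 30.0, 40.0],
                      "Fare": [7.25, 8.05, 71.28, 13.0],
                      "Embarked": ["S", "C", None, "S"]})
test = pd.DataFrame({"Age": [np.nan, 50.0],
                     "Fare": [np.nan, 26.0],
                     "Embarked": ["Q", "S"]})

# Learn the fill values from the training data only
fill_values = {
    "Age": train["Age"].median(),
    "Fare": train["Fare"].median(),
    "Embarked": train["Embarked"].mode()[0],  # most frequent value, NaN ignored
}

# Apply the same values to both frames
train_filled = train.fillna(value=fill_values)
test_filled = test.fillna(value=fill_values)

print(fill_values)
print(test_filled)
```

Computing the mode with `mode()[0]` also avoids hard-coding `'S'` if the data changes.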

## Part 2: Handling categorical features

- **Ordered categories:** transform them to sensible numeric values (example: small=1, medium=2, large=3)
- **Unordered categories:** use dummy encoding (0/1)
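The two cases can be sketched on a toy frame as follows. The `size` and `color` columns are made up for illustration; `Series.map` handles the ordered case and `pd.get_dummies` the unordered one.

```python
import pandas as pd

df = pd.DataFrame({"size": ["small", "large", "medium"],
                   "color": ["red", "blue", "red"]})

# Ordered category: map each level to a number that preserves the order
df["size_num"] = df["size"].map({"small": 1, "medium": 2, "large": 3})

# Unordered category: one 0/1 column per level; drop_first removes the
# first level (alphabetically), which becomes the implicit baseline
color_dummies = pd.get_dummies(df["color"], prefix="color", drop_first=True, dtype=int)
df = pd.concat([df, color_dummies], axis=1)

print(df)
```

Dropping one level is the same thing the notebook does by hand with `drop(columns[0])`: with k levels, k-1 dummy columns carry all the information.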

In [349]:
# Create and encode Female feature - Replaced this below with a more granular definition
#titanic['Female'] = titanic.Sex.map({'male':0, 'female':1})
#titanic_test['Female'] = titanic_test.Sex.map({'male':0, 'female':1})

In [350]:
# create a DataFrame of dummy variables for Embarked
embarked_dummies = pd.get_dummies(titanic.Embarked, prefix='Embarked')
embarked_dummies.drop(embarked_dummies.columns[0], axis=1, inplace=True)

# concatenate the original DataFrame and the dummy DataFrame
titanic = pd.concat([titanic, embarked_dummies], axis=1)

In [351]:
# create a DataFrame of dummy variables for Embarked (test set)
embarked_dummies = pd.get_dummies(titanic_test.Embarked, prefix='Embarked')
embarked_dummies.drop(embarked_dummies.columns[0], axis=1, inplace=True)

# concatenate the original DataFrame and the dummy DataFrame
titanic_test = pd.concat([titanic_test, embarked_dummies], axis=1)

In [352]:
# create a DataFrame of dummy variables for Pclass
pclass_dummies = pd.get_dummies(titanic.Pclass, prefix='Pclass')
pclass_dummies.drop(pclass_dummies.columns[0], axis=1, inplace=True)

# concatenate the original DataFrame and the dummy DataFrame
titanic = pd.concat([titanic, pclass_dummies], axis=1)

In [353]:
# create a DataFrame of dummy variables for Pclass (test set)
pclass_dummies = pd.get_dummies(titanic_test.Pclass, prefix='Pclass')
pclass_dummies.drop(pclass_dummies.columns[0], axis=1, inplace=True)

# concatenate the original DataFrame and the dummy DataFrame
titanic_test = pd.concat([titanic_test, pclass_dummies], axis=1)
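Encoding the training and test sets separately works here because both files happen to contain every level of `Embarked` and `Pclass`. If a level were missing from one file, the two frames would end up with different columns. One possible guard, sketched on made-up data, is to reindex the test dummies to the training columns; the frames and variable names below are hypothetical.

```python
import pandas as pd

train = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})
test = pd.DataFrame({"Embarked": ["S", "S", "C"]})  # no "Q" in this toy test set

train_dummies = pd.get_dummies(train["Embarked"], prefix="Embarked", dtype=int)
test_dummies = pd.get_dummies(test["Embarked"], prefix="Embarked", dtype=int)

# Give the test frame exactly the training columns; levels absent from the
# test data become all-zero columns
test_dummies = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(test_dummies)
```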


In [354]:
# Combine the sibling/spouse and parent/child counts into a binary Family flag
# (1 = travelling with at least one relative, 0 = travelling alone)
titanic['Family'] = titanic["Parch"] + titanic["SibSp"]
titanic.loc[titanic['Family'] > 0, 'Family'] = 1

# This approach did not improve the accuracy score
#titanic['FamilySize'] =  titanic["Parch"] + titanic["SibSp"]
#titanic.loc[titanic.FamilySize==1,'FamilyLabel'] = 'Single'
#titanic.loc[titanic.FamilySize==2,'FamilyLabel'] = 'Couple'
#titanic.loc[(titanic.FamilySize>2)&(titanic.FamilySize<=4),'FamilyLabel'] = 'Small'
#titanic.loc[titanic.FamilySize>4,'FamilyLabel'] = 'Big'

titanic_test['Family'] = titanic_test["Parch"] + titanic_test["SibSp"]
titanic_test.loc[titanic_test['Family'] > 0, 'Family'] = 1

#titanic_test['FamilySize'] =  titanic_test["Parch"] + titanic_test["SibSp"]
#titanic_test.loc[titanic_test.FamilySize==1,'FamilyLabel'] = 'Single'
#titanic_test.loc[titanic_test.FamilySize==2,'FamilyLabel'] = 'Couple'
#titanic_test.loc[(titanic_test.FamilySize>2)&(titanic_test.FamilySize<=4),'FamilyLabel'] = 'Small'
#titanic_test.loc[titanic_test.FamilySize>4,'FamilyLabel'] = 'Big'




In [355]:
# create a DataFrame of dummy variables for FamilyLabel
#familylabel_dummies = pd.get_dummies(titanic.FamilyLabel, prefix='FamilyLabel')
#familylabel_dummies.drop(familylabel_dummies.columns[0], axis=1, inplace=True)

# concatenate the original DataFrame and the dummy DataFrame
#titanic = pd.concat([titanic, familylabel_dummies], axis=1)

In [356]:
# create a DataFrame of dummy variables for FamilyLabel
#familylabel_dummies = pd.get_dummies(titanic_test.FamilyLabel, prefix='FamilyLabel')
#familylabel_dummies.drop(familylabel_dummies.columns[0], axis=1, inplace=True)

# concatenate the original DataFrame and the dummy DataFrame
#titanic_test = pd.concat([titanic_test, familylabel_dummies], axis=1)

In [357]:
# Children have a high rate of survival regardless of sex, so treat them as separate
def get_person(passenger):
    age,sex = passenger
    return 'child' if age < 15 else sex
    
titanic['Person'] = titanic[['Age','Sex']].apply(get_person,axis=1)
titanic_test['Person']    = titanic_test[['Age','Sex']].apply(get_person,axis=1)

# create dummy variables for Person column, & drop Male as it has the lowest average of survived passengers
person_dummies_titanic  = pd.get_dummies(titanic['Person'])
person_dummies_titanic.columns = ['Child','Female','Male']
person_dummies_titanic.drop(['Male'], axis=1, inplace=True)

person_dummies_test  = pd.get_dummies(titanic_test['Person'])
person_dummies_test.columns = ['Child','Female','Male']
person_dummies_test.drop(['Male'], axis=1, inplace=True)

titanic = pd.concat([titanic, person_dummies_titanic], axis=1)
titanic_test = pd.concat([titanic_test, person_dummies_test], axis=1)
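The same `Person` feature can also be built without a row-wise `apply`. The sketch below is one possible vectorized variant on made-up rows: `Series.mask` overwrites `Sex` with `'child'` wherever the age is below 15, and reindexing the dummies to an explicit list of levels fixes the column order before renaming, so the positional rename cannot mislabel a column if a level is missing.

```python
import pandas as pd

df = pd.DataFrame({"Age": [4.0, 30.0, 14.0, 15.0, 62.0],
                   "Sex": ["male", "female", "female", "male", "female"]})

# Keep Sex, but overwrite it with 'child' wherever Age is below 15
df["Person"] = df["Sex"].mask(df["Age"] < 15, "child")

# Fix the column order explicitly before renaming by position
levels = ["child", "female", "male"]
person_dummies = pd.get_dummies(df["Person"], dtype=int).reindex(columns=levels, fill_value=0)
person_dummies.columns = ["Child", "Female", "Male"]

# Drop Male as the baseline, as in the cell above
person_dummies = person_dummies.drop(columns=["Male"])

print(pd.concat([df, person_dummies], axis=1))
```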

In [358]:
# drop the raw columns that were replaced by engineered features or are not used
# ("FamilySize" and "FamilyLabel" would be added here if the FamilyLabel experiment were enabled)
drop_cols = ["Cabin", "Name", "Sex", "Ticket", "Embarked", "Pclass", "Parch", "SibSp", "Person"]

titanic.drop(drop_cols, axis=1, inplace=True)
titanic_test.drop(drop_cols, axis=1, inplace=True)

In [359]:
titanic.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 10 columns):
Survived      891 non-null int64
Age           891 non-null float64
Fare          891 non-null float64
Embarked_Q    891 non-null float64
Embarked_S    891 non-null float64
Pclass_2      891 non-null float64
Pclass_3      891 non-null float64
Family        891 non-null int64
Child         891 non-null float64
Female        891 non-null float64
dtypes: float64(8), int64(2)
memory usage: 76.6 KB


In [360]:
titanic_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 10 columns):
PassengerId    418 non-null int64
Age            418 non-null float64
Fare           418 non-null float64
Embarked_Q     418 non-null float64
Embarked_S     418 non-null float64
Pclass_2       418 non-null float64
Pclass_3       418 non-null float64
Family         418 non-null int64
Child          418 non-null float64
Female         418 non-null float64
dtypes: float64(8), int64(2)
memory usage: 32.7 KB


## Part 3: Train and Tune the Model

In [361]:

# define training and testing sets

X_train = titanic.drop("Survived",axis=1)
y_train = titanic["Survived"]
X_test  = titanic_test.drop("PassengerId",axis=1).copy()

# import the KNN class we need from scikit-learn
from sklearn.neighbors import KNeighborsClassifier

# instantiate the estimator 
knn = KNeighborsClassifier(n_neighbors=2, weights='distance') # Tune these parameters!

# run a knn.fit on the data to build the model
knn.fit(X_train, y_train)

titanic['y_pred_class_knn']=knn.predict(X_train)

# Accuracy on the training data (an optimistic estimate, not a held-out test score)
print(knn.score(X_train, y_train))

0.978675645342
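The "Tune these parameters!" note can be acted on with a cross-validated grid search. The sketch below is a minimal example on synthetic data, not the Titanic features: it scales the inputs first (KNN distances are otherwise dominated by wide-ranging columns such as `Fare`) and searches over `n_neighbors` and `weights`. The grid values and the 5-fold setting are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_train / y_train (not the Titanic data)
X, y = make_classification(n_samples=300, n_features=9, random_state=0)

# Scale features, then classify; the step names are used in the grid keys
pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])

param_grid = {"knn__n_neighbors": [1, 3, 5, 9, 15],
              "knn__weights": ["uniform", "distance"]}

# Each parameter combination is scored by 5-fold cross-validated accuracy
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Because every candidate is scored on held-out folds, `best_score_` is a more realistic figure than accuracy measured on the training rows.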


In [362]:
# Random Forests
from sklearn.ensemble import RandomForestClassifier

random_forest = RandomForestClassifier(n_estimators=100)

random_forest.fit(X_train, y_train)

# accuracy on the training data (not a held-out estimate)
random_forest.score(X_train, y_train)

0.978675645342312
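Both scores above are measured on the rows the models were trained on, so they say little about how well the models handle unseen passengers. A hedged sketch of a fairer check, using synthetic data and 5-fold cross-validation (both assumptions for the example), compares training accuracy with cross-validated accuracy for a random forest:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic, deliberately noisy stand-in for X_train / y_train
X, y = make_classification(n_samples=300, n_features=9, flip_y=0.2, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)

# Accuracy on the same rows the model was fitted on
train_accuracy = forest.fit(X, y).score(X, y)

# Mean accuracy over 5 folds, each scored on rows held out from fitting
cv_accuracy = cross_val_score(forest, X, y, cv=5, scoring="accuracy").mean()

print(round(train_accuracy, 3), round(cv_accuracy, 3))
```

A large gap between the two numbers is a sign of overfitting; the cross-validated figure is the one to compare across models.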

## Part 4: Make Predictions

### Update Test Dataset & Create Submission File

In [363]:
# make predictions for testing set
titanic_test['Survived'] = random_forest.predict(X_test)


path = '../data/'
url = path + 'submit_randomforest_v3.csv'
titanic_test.index = titanic_test.PassengerId
titanic_test.to_csv(columns = ['Survived'], path_or_buf = url, header=True)

### Sanity Check

In [364]:
# compare the share of survivors in the training labels with the share in the predictions
print(titanic.Survived.value_counts() / titanic.Survived.count())
print(titanic_test.Survived.value_counts() / titanic_test.Survived.count())

0    0.616162
1    0.383838
Name: Survived, dtype: float64
0    0.633971
1    0.366029
Name: Survived, dtype: float64
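The same comparison can be laid out side by side. This is a small sketch on made-up labels: `value_counts(normalize=True)` returns class shares directly, and the `difference` column shows how far the predicted shares drift from the training shares. The series and column names are illustrative only.

```python
import pandas as pd

actual = pd.Series([0, 0, 1, 0, 1, 0, 0, 1], name="Survived")     # toy training labels
predicted = pd.Series([0, 1, 0, 0, 0, 1, 0, 0], name="Survived")  # toy test predictions

# normalize=True gives the share of each class instead of raw counts
comparison = pd.DataFrame({
    "train": actual.value_counts(normalize=True),
    "predicted": predicted.value_counts(normalize=True),
}).sort_index()
comparison["difference"] = comparison["predicted"] - comparison["train"]

print(comparison)
```

Similar shares do not prove the predictions are right, but a large difference is a cheap early warning that something in the pipeline is off.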


In [365]:
titanic_test.shape

(418, 11)

In [366]:
path = '../data/'
my_file = 'submit_randomforest_v3.csv'
comp_file = 'OmarElGabry.csv' # Downloaded this file from Kaggle as a comparison.

url = path + my_file
my_df = pd.read_csv(url, index_col='PassengerId')

url = path + comp_file
comp_df = pd.read_csv(url, index_col='PassengerId')

joined_df = pd.concat([my_df, comp_df], axis=1)

In [367]:
joined_df.columns=['Mine', 'Compare']
joined_df.head()

PassengerId,Mine,Compare
892,0,0
893,0,0
894,0,0
895,1,1
896,0,0


In [368]:
joined_df[joined_df.Mine != joined_df.Compare].count()

Mine       41
Compare    41
dtype: int64
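With 41 differing rows out of 418, the two submissions disagree on roughly 9.8% of passengers. The sketch below shows one way to compute that rate and to see in which direction the disagreements go, using a made-up frame shaped like `joined_df`. Note that this measures agreement with another submission, not accuracy against the true labels.

```python
import pandas as pd

# Toy frame with the same layout as joined_df (values are made up)
joined = pd.DataFrame({"Mine": [0, 0, 1, 1, 0, 1],
                       "Compare": [0, 1, 1, 0, 0, 1]},
                      index=pd.Index([892, 893, 894, 895, 896, 897], name="PassengerId"))

# Boolean mask of rows where the two predictions differ
differs = joined["Mine"] != joined["Compare"]
n_disagree = int(differs.sum())
agreement_rate = 1 - differs.mean()

# Rows: my prediction, columns: the comparison file's prediction
table = pd.crosstab(joined["Mine"], joined["Compare"])

print(n_disagree, round(agreement_rate, 3))
print(table)
```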

In [369]:
joined_df[joined_df.Mine != joined_df.Compare]

PassengerId,Mine,Compare
898,0,1
903,0,1
913,1,0
919,1,0
928,1,0
931,1,0
933,1,0
953,0,1
974,0,1
983,1,0
