# Titanic Survival Prediction

## Approach

**Data preparation**

- Handling missing values
- Handling categorical features

**Train & Tune Model**

- Train model
- Test accuracy
- Tune model parameters

**Make Prediction**

- Update test data set
- Sanity Check

## Part 1: Load Data & Handling Missing Values

In [75]:
# Tuning flags: set to True to re-run the grid searches, False to reuse previously tuned parameters
rf_tune = False
knn_tune = False

In [76]:
# read the Titanic training data
import numpy as np
import pandas as pd
path = '../data/'
url = path + 'train.csv'
titanic = pd.read_csv(url, index_col='PassengerId')
titanic.shape

(891, 11)

Most scikit-learn estimators expect every feature value to be **numeric** and to **hold meaning**, so missing values must be filled in or removed before a model is fitted.
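As a minimal sketch of that requirement, the toy frame below (hypothetical values, not the Titanic data) counts the missing entries per column, then fills a numeric column with its median and a categorical column with its most frequent value. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with gaps in a numeric and a categorical column
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, 58.0, np.nan],
    "Embarked": ["S", "C", None, "S", "Q"],
})

# Count missing values per column
missing_before = toy.isnull().sum()

# Numeric column: fill with the median of the observed values
toy["Age"] = toy["Age"].fillna(toy["Age"].median())

# Categorical column: fill with the most frequent value (the mode)
toy["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].mode()[0])

missing_after = toy.isnull().sum()
print(missing_before)
print(missing_after)
```

Assigning the result of `fillna` back to the column, rather than calling it with `inplace=True` on an attribute, keeps the update explicit.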

In [77]:
# check for missing values
titanic.isnull().sum()

Survived      0
Pclass        0
Name          0
Sex           0
Age         177
SibSp         0
Parch         0
Ticket        0
Fare          0
Cabin       687
Embarked      2
dtype: int64

In [78]:
# fill missing values for Age with the median age
titanic['Age'] = titanic['Age'].fillna(titanic['Age'].median())

# get average, std, and number of NaN values in titanic
# NOTE: the median fill above leaves no NaN values, so count_nan_age_titanic
# is 0 and the random fill below changes nothing
average_age_titanic   = titanic["Age"].mean()
std_age_titanic       = titanic["Age"].std()
count_nan_age_titanic = titanic["Age"].isnull().sum()

# generate random integers in [mean - std, mean + std)
rand_1 = np.random.randint(int(average_age_titanic - std_age_titanic),
                           int(average_age_titanic + std_age_titanic),
                           size = count_nan_age_titanic)

# fill NaN values in Age column with random values generated
# (.loc on the DataFrame avoids chained assignment and the SettingWithCopyWarning)
titanic.loc[titanic["Age"].isnull(), "Age"] = rand_1

# convert from float to int
titanic['Age'] = titanic['Age'].astype(int)
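The comments in the cell above describe a second imputation strategy: drawing random ages within one standard deviation of the mean. The following is a minimal sketch of that strategy on its own, using a toy series with hypothetical values and a seeded generator; it is an illustration under those assumptions, not the notebook's actual pipeline.

```python
import numpy as np
import pandas as pd

# Toy ages (hypothetical values) with three missing entries
ages = pd.Series([22.0, 38.0, np.nan, 35.0, np.nan, 54.0, 2.0, np.nan])

# Statistics are computed from the observed values only (NaN is skipped)
mean_age = ages.mean()
std_age = ages.std()
n_missing = int(ages.isnull().sum())

# Seeded generator so the sketch is reproducible
rng = np.random.RandomState(0)

# Draw integers in [mean - std, mean + std); randint excludes the upper bound
low, high = int(mean_age - std_age), int(mean_age + std_age)
random_ages = rng.randint(low, high, size=n_missing)

# Remember which positions were missing, then assign through .loc
was_missing = ages.isnull()
ages.loc[was_missing] = random_ages
ages = ages.astype(int)
print(ages.tolist())
```

Compared with a single median value, random draws preserve some of the spread of the age distribution, at the cost of making results depend on the seed.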


In [79]:
titanic.Embarked.value_counts()

S    644
C    168
Q     77
Name: Embarked, dtype: int64

In [80]:
# fill missing values for Embarked with the mode ('S' is the most frequent port above)
titanic['Embarked'] = titanic['Embarked'].fillna('S')

In [81]:
# read the Titanic test data
import pandas as pd
path = '../data/'
url = path + 'test.csv'
titanic_test = pd.read_csv(url)
titanic_test.shape

(418, 11)

In [82]:
# check for missing values
titanic_test.isnull().sum()

PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64

In [83]:
# fill missing values for Age with the median age
titanic_test['Age'] = titanic_test['Age'].fillna(titanic_test['Age'].median())

# get average, std, and number of NaN values in titanic_test
# NOTE: the median fill above leaves no NaN values, so count_nan_age_test
# is 0 and the random fill below changes nothing
average_age_test   = titanic_test["Age"].mean()
std_age_test       = titanic_test["Age"].std()
count_nan_age_test = titanic_test["Age"].isnull().sum()

# generate random integers in [mean - std, mean + std)
rand_2 = np.random.randint(int(average_age_test - std_age_test),
                           int(average_age_test + std_age_test),
                           size = count_nan_age_test)

# fill NaN values in Age column with random values generated
# (.loc on the DataFrame avoids chained assignment and the SettingWithCopyWarning)
titanic_test.loc[titanic_test["Age"].isnull(), "Age"] = rand_2

# convert from float to int
titanic_test['Age'] = titanic_test['Age'].astype(int)

# fill the single missing Fare with the median fare
titanic_test['Fare'] = titanic_test['Fare'].fillna(titanic_test['Fare'].median())


## Part 2: Handling categorical features

- **Ordered categories:** transform them to sensible numeric values (example: small=1, medium=2, large=3)
- **Unordered categories:** use dummy encoding (0/1)
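The two encoding rules above can be sketched on a toy frame. The column names `Size` and `Port` and their values are hypothetical and only serve the illustration.

```python
import pandas as pd

# Toy data (hypothetical): one ordered and one unordered categorical feature
df = pd.DataFrame({
    "Size": ["small", "large", "medium", "small"],
    "Port": ["S", "C", "Q", "S"],
})

# Ordered category: map each level to a number that preserves the ordering
df["Size_num"] = df["Size"].map({"small": 1, "medium": 2, "large": 3})

# Unordered category: one 0/1 column per level, dropping the first level
# because it is implied when all remaining dummy columns are 0
port_dummies = pd.get_dummies(df["Port"], prefix="Port", drop_first=True).astype(int)
df = pd.concat([df, port_dummies], axis=1)
print(df)
```

`drop_first=True` has the same effect as dropping the first dummy column by hand, which is what the cells below do.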

In [84]:
# Create and encode Female feature - Replaced this below with a more granular definition
#titanic['Female'] = titanic.Sex.map({'male':0, 'female':1})
#titanic_test['Female'] = titanic_test.Sex.map({'male':0, 'female':1})

In [85]:
# create a DataFrame of dummy variables for Embarked
embarked_dummies = pd.get_dummies(titanic.Embarked, prefix='Embarked')
embarked_dummies.drop(embarked_dummies.columns[0], axis=1, inplace=True)

# concatenate the original DataFrame and the dummy DataFrame
titanic = pd.concat([titanic, embarked_dummies], axis=1)

In [86]:
# create a DataFrame of dummy variables for Embarked
embarked_dummies = pd.get_dummies(titanic_test.Embarked, prefix='Embarked')
embarked_dummies.drop(embarked_dummies.columns[0], axis=1, inplace=True)

# concatenate the original DataFrame and the dummy DataFrame
titanic_test = pd.concat([titanic_test, embarked_dummies], axis=1)
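Encoding the training and test sets separately works here because both contain the same `Embarked` categories. As a hedged sketch of a guard for the case where they differ, the toy example below (hypothetical data in which the test split lacks one category) reindexes the test dummies to the training columns.

```python
import pandas as pd

# Toy data (hypothetical): the test split happens to lack the 'Q' category
train = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})
test = pd.DataFrame({"Embarked": ["S", "C", "S"]})

train_dummies = pd.get_dummies(train["Embarked"], prefix="Embarked").astype(int)
test_dummies = pd.get_dummies(test["Embarked"], prefix="Embarked").astype(int)

# Give the test dummies exactly the training columns, in the same order;
# categories missing from the test split become all-zero columns
test_dummies = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(list(test_dummies.columns))
```

Without such a step, a model fitted on the training columns would receive a test matrix with a different number of features.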

In [87]:
# create a DataFrame of dummy variables for Pclass
pclass_dummies = pd.get_dummies(titanic.Pclass, prefix='Pclass')
pclass_dummies.drop(pclass_dummies.columns[0], axis=1, inplace=True)

# concatenate the original DataFrame and the dummy DataFrame
titanic = pd.concat([titanic, pclass_dummies], axis=1)

In [88]:
# create a DataFrame of dummy variables for Pclass
pclass_dummies = pd.get_dummies(titanic_test.Pclass, prefix='Pclass')
pclass_dummies.drop(pclass_dummies.columns[0], axis=1, inplace=True)

# concatenate the original DataFrame and the dummy DataFrame
titanic_test = pd.concat([titanic_test, pclass_dummies], axis=1)


In [89]:
# Combine Sibling and Parent Columns
titanic['Family'] =  titanic["Parch"] + titanic["SibSp"]
# collapse to a binary flag: 1 = travelling with any relative, 0 = alone
titanic.loc[titanic['Family'] > 0, 'Family'] = 1

# This approach did not improve the accuracy score
#titanic['FamilySize'] =  titanic["Parch"] + titanic["SibSp"]
#titanic.loc[titanic.FamilySize==1,'FamilyLabel'] = 'Single'
#titanic.loc[titanic.FamilySize==2,'FamilyLabel'] = 'Couple'
#titanic.loc[(titanic.FamilySize>2)&(titanic.FamilySize<=4),'FamilyLabel'] = 'Small'
#titanic.loc[titanic.FamilySize>4,'FamilyLabel'] = 'Big'

titanic_test['Family'] =  titanic_test["Parch"] + titanic_test["SibSp"]
# collapse to a binary flag: 1 = travelling with any relative, 0 = alone
titanic_test.loc[titanic_test['Family'] > 0, 'Family'] = 1

#titanic_test['FamilySize'] =  titanic_test["Parch"] + titanic_test["SibSp"]
#titanic_test.loc[titanic_test.FamilySize==1,'FamilyLabel'] = 'Single'
#titanic_test.loc[titanic_test.FamilySize==2,'FamilyLabel'] = 'Couple'
#titanic_test.loc[(titanic_test.FamilySize>2)&(titanic_test.FamilySize<=4),'FamilyLabel'] = 'Small'
#titanic_test.loc[titanic_test.FamilySize>4,'FamilyLabel'] = 'Big'
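The binary `Family` flag can also be computed in one vectorized step. This is a minimal sketch on a toy frame with hypothetical values that reuses the Titanic column names.

```python
import pandas as pd

# Toy data (hypothetical values) using the same column names as the Titanic data
df = pd.DataFrame({"Parch": [0, 1, 0, 2], "SibSp": [0, 0, 1, 3]})

# 1 if the passenger travelled with any relative, 0 otherwise
df["Family"] = ((df["Parch"] + df["SibSp"]) > 0).astype(int)
print(df["Family"].tolist())
```

The comparison yields a boolean Series, and `astype(int)` turns it into the 0/1 values the model expects.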




In [90]:
# create a DataFrame of dummy variables for FamilyLabel
#familylabel_dummies = pd.get_dummies(titanic.FamilyLabel, prefix='FamilyLabel')
#familylabel_dummies.drop(familylabel_dummies.columns[0], axis=1, inplace=True)

# concatenate the original DataFrame and the dummy DataFrame
#titanic = pd.concat([titanic, familylabel_dummies], axis=1)

In [91]:
# create a DataFrame of dummy variables for FamilyLabel
#familylabel_dummies = pd.get_dummies(titanic_test.FamilyLabel, prefix='FamilyLabel')
#familylabel_dummies.drop(familylabel_dummies.columns[0], axis=1, inplace=True)

# concatenate the original DataFrame and the dummy DataFrame
#titanic_test = pd.concat([titanic_test, familylabel_dummies], axis=1)

In [92]:
# Children have a high rate of survival regardless of sex, so treat them as separate
def get_person(passenger):
    age,sex = passenger
    return 'child' if age < 16 else sex
    
titanic['Person'] = titanic[['Age','Sex']].apply(get_person,axis=1)
titanic_test['Person']    = titanic_test[['Age','Sex']].apply(get_person,axis=1)

# create dummy variables for Person column, & drop Male as it has the lowest average of survived passengers
person_dummies_titanic  = pd.get_dummies(titanic['Person'])
person_dummies_titanic.columns = ['Child','Female','Male']
person_dummies_titanic.drop(['Male'], axis=1, inplace=True)

person_dummies_test  = pd.get_dummies(titanic_test['Person'])
person_dummies_test.columns = ['Child','Female','Male']
person_dummies_test.drop(['Male'], axis=1, inplace=True)

titanic = pd.concat([titanic, person_dummies_titanic], axis=1)
titanic_test = pd.concat([titanic_test, person_dummies_test], axis=1)
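Assigning `['Child','Female','Male']` to the dummy columns relies on `get_dummies` returning them in alphabetical order. The sketch below, on toy data with hypothetical values, builds the same `Person` feature with `np.where` and renames the columns by label so the result does not depend on column order.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) using the same column names as the Titanic data
df = pd.DataFrame({
    "Age": [4, 30, 15, 16, 40],
    "Sex": ["male", "female", "female", "male", "female"],
})

# Anyone younger than 16 is labelled 'child'; otherwise keep the recorded sex
df["Person"] = np.where(df["Age"] < 16, "child", df["Sex"])

# Build the dummies, then rename by label instead of relying on column order
person_dummies = pd.get_dummies(df["Person"]).astype(int)
person_dummies = person_dummies.rename(
    columns={"child": "Child", "female": "Female", "male": "Male"}
)

# Drop 'Male' so that it acts as the baseline category
person_dummies = person_dummies.drop(columns=["Male"])
print(person_dummies)
```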

In [93]:
# drop the raw columns now replaced by engineered features
# (FamilySize and FamilyLabel belong to the discarded experiment above)
drop_cols = ["Cabin", "Name", "Sex", "Ticket", "Embarked", "Pclass", "Parch", "SibSp", "Person"]
titanic.drop(drop_cols, axis=1, inplace=True)
titanic_test.drop(drop_cols, axis=1, inplace=True)

In [94]:
titanic.Embarked_Q = titanic.Embarked_Q.astype(int)
titanic.Embarked_S = titanic.Embarked_S.astype(int)
titanic.Pclass_2 = titanic.Pclass_2.astype(int)
titanic.Pclass_3 = titanic.Pclass_3.astype(int)
titanic.Family = titanic.Family.astype(int)
titanic.Child = titanic.Child.astype(int)
titanic.Female = titanic.Female.astype(int)
titanic.Survived = titanic.Survived.astype(int)

titanic_test.Embarked_Q = titanic_test.Embarked_Q.astype(int)
titanic_test.Embarked_S = titanic_test.Embarked_S.astype(int)
titanic_test.Pclass_2 = titanic_test.Pclass_2.astype(int)
titanic_test.Pclass_3 = titanic_test.Pclass_3.astype(int)
titanic_test.Family = titanic_test.Family.astype(int)
titanic_test.Child = titanic_test.Child.astype(int)
titanic_test.Female = titanic_test.Female.astype(int)

In [95]:
titanic.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 10 columns):
Survived      891 non-null int32
Age           891 non-null int32
Fare          891 non-null float64
Embarked_Q    891 non-null int32
Embarked_S    891 non-null int32
Pclass_2      891 non-null int32
Pclass_3      891 non-null int32
Family        891 non-null int32
Child         891 non-null int32
Female        891 non-null int32
dtypes: float64(1), int32(9)
memory usage: 45.2 KB


In [96]:
titanic_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 10 columns):
PassengerId    418 non-null int64
Age            418 non-null int32
Fare           418 non-null float64
Embarked_Q     418 non-null int32
Embarked_S     418 non-null int32
Pclass_2       418 non-null int32
Pclass_3       418 non-null int32
Family         418 non-null int32
Child          418 non-null int32
Female         418 non-null int32
dtypes: float64(1), int32(8), int64(1)
memory usage: 19.7 KB


In [97]:
train_df = titanic.copy()
test_df = titanic_test.copy()

test_df.index = test_df.PassengerId
test_df.drop('PassengerId', axis=1, inplace=True)
print test_df.info()
print ' ----------------------- '
print train_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 418 entries, 892 to 1309
Data columns (total 9 columns):
Age           418 non-null int32
Fare          418 non-null float64
Embarked_Q    418 non-null int32
Embarked_S    418 non-null int32
Pclass_2      418 non-null int32
Pclass_3      418 non-null int32
Family        418 non-null int32
Child         418 non-null int32
Female        418 non-null int32
dtypes: float64(1), int32(8)
memory usage: 19.6 KB
None
 ----------------------- 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 10 columns):
Survived      891 non-null int32
Age           891 non-null int32
Fare          891 non-null float64
Embarked_Q    891 non-null int32
Embarked_S    891 non-null int32
Pclass_2      891 non-null int32
Pclass_3      891 non-null int32
Family        891 non-null int32
Child         891 non-null int32
Female        891 non-null int32
dtypes: float64(1), int32(9)
memory usage: 45.2 KB
None


In [98]:
merged_df = pd.concat([train_df, test_df], axis=0)

path = '../data/'
url = path + 'merged_train_and_test.csv'
merged_df.to_csv(columns = merged_df.columns, path_or_buf = url, header=True)

## Part 3: Train and Tune the Model

## Define Training and Test dataframes
### Start with the full list of features and evaluate them with each model

In [99]:
X_train = titanic.drop("Survived",axis=1)
y_train = titanic["Survived"]
X_test  = titanic_test.drop("PassengerId",axis=1).copy()

## K Nearest Neighbors Classifier
## Tune n_neighbors parameter

In [101]:
from sklearn.neighbors import KNeighborsClassifier
# sklearn.cross_validation and sklearn.grid_search were merged into sklearn.model_selection
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
import matplotlib.pyplot as plt
%matplotlib inline



# search for an optimal value of K for KNN
def knn_nestimators_tuning():
    knn = KNeighborsClassifier()
    k_range = range(1, 51)
    weight_options = ['uniform', 'distance']
    param_grid = dict(n_neighbors=k_range, weights=weight_options)
    grid = GridSearchCV(knn, param_grid, cv=10, scoring='accuracy')
    grid.fit(X_train, y_train)

    
    print grid.best_score_
    print grid.best_params_
    return  grid.best_estimator_
    
if knn_tune == True:
    knn = knn_nestimators_tuning()
else:
    knn = KNeighborsClassifier(n_neighbors=33, weights='distance')
    print cross_val_score(knn, X_train, y_train, cv=10, scoring='accuracy').mean()

0.731903018954


## Feature Evaluation
### Start with the features that have the highest correlation and build up from there

In [105]:
# Feature Evaluation - Started with Fare, Age and Female and added features until the cross validation score stopped increasing
# manually ran this step with different feature combinations
feature_cols_knn = ['Fare',  'Age', 'Female', 'Pclass_3', 'Child', 'Embarked_S', 'Family']
X_train = titanic[feature_cols_knn]
X_test = titanic_test[feature_cols_knn]

if knn_tune == True:
    # recalculate accuracy
    print cross_val_score(knn, X_train, y_train, cv=10, scoring='accuracy').mean()
    # repeat nestimators tuning to see if anything has changed since downselecting features
    knn = knn_nestimators_tuning()
else:
    knn.fit(X_train, y_train)

In [106]:
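# knn was already fitted on the selected features in the previous cell
#knn.fit(X_train, y_train)

y_pred_class_knn = knn.predict(X_train)

# Accuracy on the training data itself. With weights='distance' each passenger
# is its own nearest neighbor, so this score is optimistic and is not an
# estimate of accuracy on unseen data
print knn.score(X_train, y_train)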

0.979797979798
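The gap between the training accuracy above and the cross-validated accuracy reported earlier can be reproduced on synthetic data. The sketch below is an illustration only: it uses `make_classification` rather than the Titanic features, and `knn_demo` is a hypothetical name chosen to avoid clashing with the notebook's `knn`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data (not the Titanic data)
X, y = make_classification(n_samples=300, n_features=6, flip_y=0.1, random_state=0)

knn_demo = KNeighborsClassifier(n_neighbors=33, weights="distance")

# Accuracy on the data the model was fitted on: every point is its own
# zero-distance neighbour, so distance weighting reproduces its label
train_acc = knn_demo.fit(X, y).score(X, y)

# 10-fold cross-validated accuracy: each fold is scored on rows the model
# did not see during fitting
cv_acc = cross_val_score(knn_demo, X, y, cv=10, scoring="accuracy").mean()

print(train_acc, cv_acc)
```

The cross-validated figure is the one to compare across models, since the training score mostly measures memorization.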


## Random Forest Classifier

### Start with the Full List of Features and Edit down from there

In [107]:
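X_train = titanic.drop("Survived",axis=1)
# reset X_test to the same full feature set (it was narrowed to feature_cols_knn above)
X_test  = titanic_test.drop("PassengerId",axis=1)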

In [108]:
# Random Forests
from sklearn.ensemble import RandomForestClassifier

## Tuning the n_estimators and max_features parameters

In [109]:

def rfclsf_nestimators_tuning():
    rfclsf = RandomForestClassifier(random_state=1)
    estimator_range = range(10, 310, 10)
    feature_range = range(1, len(X_train.columns)+1)
    param_grid = dict(n_estimators=estimator_range, max_features=feature_range)
    # score the classifier by accuracy, consistent with the cross-validation below
    grid = GridSearchCV(rfclsf, param_grid, cv=10, scoring='accuracy')
    grid.fit(X_train, y_train)


    print grid.best_score_
    print grid.best_params_
    return  grid.best_estimator_

if rf_tune == True:
    rfclsf = rfclsf_nestimators_tuning()
else:
    rfclsf = RandomForestClassifier(n_estimators=260, max_features=6, oob_score=True, random_state=1)
    print cross_val_score(rfclsf, X_train, y_train, cv=10, scoring='accuracy').mean()
    # compute the out-of-bag score (accuracy for a classifier, not R-squared)
    rfclsf.fit(X_train, y_train)
    print rfclsf.oob_score_

0.821656735898
0.817059483726


In [110]:
X_train.columns

Index([u'Age', u'Fare', u'Embarked_Q', u'Embarked_S', u'Pclass_2', u'Pclass_3',
       u'Family', u'Child', u'Female'],
      dtype='object')

In [111]:
# compute feature importances
pd.DataFrame({'feature':X_train.columns, 'importance':rfclsf.feature_importances_}).sort_values('importance', ascending=False)

| | feature | importance |
|---|---|---|
| 1 | Fare | 0.301945 |
| 0 | Age | 0.261031 |
| 8 | Female | 0.253307 |
| 5 | Pclass_3 | 0.083178 |
| 7 | Child | 0.025755 |
| 3 | Embarked_S | 0.022752 |
| 6 | Family | 0.021671 |
| 4 | Pclass_2 | 0.019157 |
| 2 | Embarked_Q | 0.011204 |


In [112]:
y_pred_class_rfclsf = rfclsf.predict(X_train)

In [274]:
# Support Vector Machines
from sklearn.svm import SVC, LinearSVC

svc = SVC()
svc.fit(X_train, y_train)

y_pred_class_svc = svc.predict(X_train)

svc.score(X_train, y_train)

0.8630751964085297

In [275]:
from sklearn.naive_bayes import GaussianNB
# Gaussian Naive Bayes
gaussian = GaussianNB()
gaussian.fit(X_train, y_train)

y_pred_class_gaussian = gaussian.predict(X_train)

gaussian.score(X_train, y_train)

0.78114478114478114

In [276]:
# Logistic Regression
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()

logreg.fit(X_train, y_train)

y_pred_class_logreg = logreg.predict(X_train)

logreg.score(X_train, y_train)

0.80471380471380471

In [278]:
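from sklearn import metrics
# Average the five 0/1 votes and round to get the ensemble prediction.
# NOTE: under Python 2, `/5` on integer NumPy arrays is a floor division
# (unless true division is enabled), so the result is 1 only when all five
# models agree; divide by 5.0 to get a majority vote.
y_pred_class_ensemble = ((y_pred_class_logreg + y_pred_class_gaussian + y_pred_class_svc + y_pred_class_rfclsf
                                     + y_pred_class_knn)/5).round(0).astype(int)
print metrics.accuracy_score(titanic.Survived, y_pred_class_ensemble)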

0.763187429854
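The ensemble accuracy above is lower than that of each individual model, which is what a floored vote would produce. As a minimal sketch of the difference, the toy example below (hypothetical predictions, written for Python 3, where `//` plays the role of Python 2's integer `/`) compares a majority vote with the floored version.

```python
import numpy as np

# Toy 0/1 predictions (hypothetical) from five models for four passengers
votes = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [1, 0, 0, 0],
    [1, 1, 1, 0],
    [1, 0, 0, 0],
])
vote_sum = votes.sum(axis=0)  # number of models predicting 1 per passenger

# True division followed by rounding gives a majority vote
majority = (vote_sum / 5.0).round(0).astype(int)

# Floor division (what Python 2's `/` does on integer arrays) only
# yields 1 when all five models agree
unanimous_only = vote_sum // 5

print(majority, unanimous_only)
```

With an odd number of voters there are no ties, so rounding the mean and testing `vote_sum >= 3` are equivalent.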


In [286]:
path = '../data/'
url = path + 'train_ensemble.csv'
ensemble_df = pd.DataFrame({'PassengerId' : titanic.index,
                        'Survived' : y_train, 
                        'y_pred_class_ensemble' : y_pred_class_ensemble,
                        'y_pred_class_logreg' : y_pred_class_logreg,
                        'y_pred_class_gaussian' : y_pred_class_gaussian, 
                        'y_pred_class_svc' : y_pred_class_svc,
                        'y_pred_class_rfclsf' : y_pred_class_rfclsf, 
                        'y_pred_class_knn' : y_pred_class_knn}).set_index('PassengerId')
#titanic.index = titanic.PassengerId
ensemble_df.to_csv(columns = ensemble_df.columns, path_or_buf = url, header=True)

## Part 4: Make Predictions

## Update Test Dataset & Create Submission File

In [117]:
X_test.shape
titanic_test.shape

(418, 10)

In [118]:
#titanic_test['y_pred_class_knn']=knn.predict(X_test)
#titanic_test['y_pred_class_logreg']=logreg.predict(X_test)
#titanic_test['y_pred_class_gaussian']=gaussian.predict(X_test)
#titanic_test['y_pred_class_svc']=svc.predict(X_test)
titanic_test['y_pred_class_rfclsf']=rfclsf.predict(X_test)

In [648]:
# NOTE: this cell needs the per-model prediction columns that are commented out in the previous cell
titanic_test['y_pred_class_ensemble'] = ((titanic_test['y_pred_class_logreg'] + titanic_test['y_pred_class_gaussian'] + 
    titanic_test['y_pred_class_svc'] + titanic_test['y_pred_class_rfclsf'] + titanic_test['y_pred_class_knn'])/5).round(0).astype(int)

In [651]:
# make predictions for testing set
titanic_test['Survived'] = titanic_test['y_pred_class_ensemble']

path = '../data/'
url = path + 'submit_ensemble_v1.csv'
titanic_test.index = titanic_test.PassengerId
titanic_test.to_csv(columns = ['Survived'], path_or_buf = url, header=True)

In [119]:
# make predictions for testing set
titanic_test['Survived'] = titanic_test['y_pred_class_rfclsf']

path = '../data/'
url = path + 'submit_randomforest_v6.csv'
titanic_test.index = titanic_test.PassengerId
titanic_test.to_csv(columns = ['Survived'], path_or_buf = url, header=True)

## Sanity Check ##

In [120]:
print titanic.Survived.value_counts() / titanic.Survived.count()
print titanic_test.Survived.value_counts() / titanic_test.Survived.count()

0    0.616162
1    0.383838
Name: Survived, dtype: float64
0    0.636364
1    0.363636
Name: Survived, dtype: float64
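The sanity check compares the share of survivors in the training labels with the share in the predictions; a large gap would suggest a problem in the pipeline. A minimal sketch on toy labels (hypothetical values and variable names) follows.

```python
import pandas as pd

# Toy labels (hypothetical): known training outcomes and model predictions
train_survived = pd.Series([0, 0, 0, 1, 1, 0, 1, 0])
predicted_survived = pd.Series([0, 1, 0, 0, 1, 0])

# normalize=True returns class proportions instead of raw counts
train_rates = train_survived.value_counts(normalize=True)
pred_rates = predicted_survived.value_counts(normalize=True)

# Absolute gap between the two survival rates
gap = abs(train_rates.loc[1] - pred_rates.loc[1])
print(train_rates.loc[1], pred_rates.loc[1], gap)
```

`value_counts(normalize=True)` is equivalent to dividing the counts by the total, as the cell above does.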


In [121]:
titanic_test.shape

(418, 12)

In [125]:
path = '../data/'
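# NOTE: submit_randomforest_v6.csv is the file written by this notebook above, so the
# file downloaded from Kaggle as a comparison appears to be OmarElGabry.csv; if so, the
# 'Mine' and 'Compare' labels assigned below are swapped
my_file = 'OmarElGabry.csv'
comp_file = 'submit_randomforest_v6.csv'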

url = path + my_file
my_df = pd.read_csv(url, index_col='PassengerId')

url = path + comp_file
comp_df = pd.read_csv(url, index_col='PassengerId')

joined_df = pd.concat([my_df, comp_df], axis=1)

In [126]:
joined_df.columns=['Mine', 'Compare']
joined_df.head()

| PassengerId | Mine | Compare |
|---|---|---|
| 892 | 0 | 0 |
| 893 | 0 | 0 |
| 894 | 0 | 0 |
| 895 | 1 | 1 |
| 896 | 0 | 0 |


In [127]:
joined_df[joined_df.Mine != joined_df.Compare].count()

Mine       44
Compare    44
dtype: int64
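In the output above, 44 of the 418 predictions differ, which corresponds to roughly 89.5% agreement between the two files. The same comparison can be sketched with a boolean mask on toy submissions (hypothetical values; the names `mine` and `other` are illustrative).

```python
import pandas as pd

# Toy submissions (hypothetical values) indexed by PassengerId
ids = [892, 893, 894, 895, 896]
mine = pd.Series([0, 0, 1, 1, 0], index=ids, name="Mine")
other = pd.Series([0, 1, 1, 0, 0], index=ids, name="Compare")

joined = pd.concat([mine, other], axis=1)

# Boolean mask of rows where the two submissions disagree
differs = joined["Mine"] != joined["Compare"]

n_diff = int(differs.sum())        # number of disagreements
agreement = 1.0 - differs.mean()   # share of rows with identical predictions
print(n_diff, agreement)
```

Summing the mask counts disagreements directly, which avoids filtering the frame and counting each column separately.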

In [661]:
joined_df[joined_df.Mine != joined_df.Compare]

| PassengerId | Mine | Compare |
|---|---|---|
| 895 | 0 | 1 |
| 898 | 1 | 0 |
| 909 | 0 | 1 |
| 919 | 0 | 1 |
| 920 | 0 | 1 |
| 924 | 0 | 1 |
| 927 | 0 | 1 |
| 933 | 0 | 1 |
| 941 | 0 | 1 |
| 967 | 1 | 0 |
