# Titanic Predictor

This notebook loads the Titanic CSV data from Kaggle, cleans it, and fits a random forest classifier. A future goal is to add more features to improve prediction accuracy. The interaction variables from Emma Gertlowitz's blog are:

- fare per person
- AgeClass interaction
- sexClass interaction
- combined_age
- family (sibsp + parch)
- (age_squared – combined_age) squared
- (age_class_squared – age_class) squared

Other features to add are the title extracted from each name and dummy variables for Pclass. I've also heard that the Age column is useless, so I'm not sure whether to drop it. Another idea is to swap the model for an SVM.
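
Here is a rough sketch of how a few of these candidate features could be built with pandas. The column names follow Kaggle's `train.csv`. The helper name `add_features`, the new column names, and the exact formulas (for example, fare divided by family size plus one) are my own assumptions and may not match the blog's definitions. The three rows are made-up toy data, not Kaggle records.

```python
import pandas as pd

def add_features(df):
    """Return a copy of df with a few candidate features added (hypothetical helper)."""
    out = df.copy()
    # Family size: siblings/spouses plus parents/children aboard
    out['Family'] = out['SibSp'] + out['Parch']
    # Fare per person: assumes one fare covers the passenger and their family
    out['FarePerPerson'] = out['Fare'] / (out['Family'] + 1)
    # Age-class interaction: assumed here to be a simple product
    out['AgeClass'] = out['Age'] * out['Pclass']
    # Title: the text between the comma and the first period in the name
    out['Title'] = out['Name'].str.extract(r',\s*([^.]+)\.', expand=False).str.strip()
    # Dummy (one-hot) columns for passenger class
    dummies = pd.get_dummies(out['Pclass'], prefix='Pclass')
    return pd.concat([out, dummies], axis=1)

# Made-up rows in the shape of the Kaggle data, for illustration only
toy = pd.DataFrame({
    'Name': ['Doe, Mr. John', 'Roe, Mrs. Jane', 'Poe, Miss. Ann'],
    'Pclass': [3, 1, 3],
    'Age': [22.0, 38.0, 26.0],
    'SibSp': [1, 1, 0],
    'Parch': [0, 0, 0],
    'Fare': [7.25, 71.0, 8.0],
})
featured = add_features(toy)
print(featured[['Family', 'FarePerPerson', 'AgeClass', 'Title']])
```

Since the cleaning step in this notebook drops the `Name` column, the title would have to be extracted before that drop.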

In [91]:
import numpy as np
import pandas as pd
import sklearn as sk
import matplotlib.pyplot as plt
# Import scikit-learn's random forest classifier
from sklearn.ensemble import RandomForestClassifier

In [92]:
def cleanCSV(filename):
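    # read_csv accepts a path directly, so no file handle is left open
    newdf = pd.read_csv(filename)
    # Drop the Name, Ticket, Cabin, and Embarked columns
    # (PassengerId is kept here; the test set needs it to label predictions)
    newdf.drop(['Name', 'Ticket', 'Cabin', 'Embarked'], axis=1, inplace=True)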
    # One-hot encoding Gender
    newdf['m'] = (newdf['Sex'] == 'male')
    newdf['f'] = (newdf['Sex'] == 'female')
    newdf.drop(['Sex'], axis=1, inplace=True)
    # Replace NaN values in the Age column with the mean age
    values={'Age': newdf['Age'].mean()}
    newdf.fillna(value=values, inplace=True)
    return newdf

In [93]:
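# Load and clean both files. PassengerId is dropped from the training set
# only, since it is not a feature but is needed to label the test predictions.
df = cleanCSV('train.csv')
df.drop(['PassengerId'], axis=1, inplace=True)
test = cleanCSV('test.csv')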

In [94]:
# Create a list of the feature column names (every column after 'Survived')
features = df.columns[1:]
# Fill missing values in the test set's Fare column with the mean fare
test['Fare'] = test['Fare'].fillna(test['Fare'].mean())

In [95]:
# Create a random forest Classifier. By convention, clf means 'Classifier'
clf = RandomForestClassifier(n_jobs=2, random_state=0)

In [96]:
# Survival labels: this is what we want to predict
y = df['Survived']

In [97]:
# Train the classifier to learn how the training features relate
# to the training y (survival)
clf.fit(df[features], y);
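
Before predicting on the test set, it may help to estimate accuracy on held-out folds of the training data, so that any new feature can be judged by whether the score moves. Below is a minimal sketch using scikit-learn's `cross_val_score`. The `X_toy` and `y_toy` data are synthetic stand-ins (not Kaggle rows); in this notebook `df[features]` and `y` would take their place. The `n_estimators=50` and `cv=5` settings are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data, for illustration only
rng = np.random.RandomState(0)
X_toy = pd.DataFrame({
    'Pclass': rng.randint(1, 4, size=40),
    'Age': rng.uniform(1, 70, size=40),
    'Fare': rng.uniform(5, 100, size=40),
})
# Label is 1 when the fare is above the median, giving a balanced 20/20 split
y_toy = (X_toy['Fare'] > X_toy['Fare'].median()).astype(int)

cv_clf = RandomForestClassifier(n_estimators=50, random_state=0)
# 5-fold cross-validation; each score is the accuracy on one held-out fold
scores = cross_val_score(cv_clf, X_toy, y_toy, cv=5)
print('mean accuracy: %.3f' % scores.mean())
```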

In [98]:
# Apply the Classifier we trained to the test data
survive = clf.predict(test[features])

In [99]:
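# Pair each PassengerId with its prediction; list() materializes the zip
# iterator so the DataFrame constructor receives concrete rows
final = list(zip(test['PassengerId'], survive))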

In [100]:
final_prediction = pd.DataFrame(final, columns=['PassengerId', 'Survived'])

In [101]:
final_prediction.to_csv('final_prediction8.csv', index=False)
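
For the SVM idea mentioned at the top, here is a minimal sketch of what the swap could look like. SVMs are sensitive to feature scale, so this wraps `SVC` in a pipeline with `StandardScaler`. The helper name `make_svm`, the RBF kernel, and `C=1.0` are assumptions for illustration, not tuned choices, and the data is again a synthetic stand-in for `df[features]` and `y`.

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def make_svm():
    """Build a scaled SVM classifier (hypothetical helper, untuned defaults)."""
    return make_pipeline(StandardScaler(), SVC(kernel='rbf', C=1.0))

# Synthetic stand-in data, for illustration only
rng = np.random.RandomState(0)
X_toy = pd.DataFrame({
    'Pclass': rng.randint(1, 4, size=40),
    'Age': rng.uniform(1, 70, size=40),
    'Fare': rng.uniform(5, 100, size=40),
})
y_toy = (X_toy['Fare'] > X_toy['Fare'].median()).astype(int)

svm_clf = make_svm()
svm_clf.fit(X_toy, y_toy)
# Same fit/predict interface as the random forest used above
svm_pred = svm_clf.predict(X_toy)
```

Because the pipeline exposes the same `fit` and `predict` methods, it could replace `clf` in the cells above without other changes.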