# Random Forest on Titanic Dataset
Getting started on Kaggle with the Titanic dataset:
I fit a random forest to the training data and write the predictions for the test set to a .csv file suitable for uploading.

## Preamble

In [3]:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

In [4]:
pd.options.mode.chained_assignment = None 

## Load data

In [5]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

## Inspect data

In [6]:
train.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [7]:
train.describe()

,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


## Impute values, and transform non-numeric categories
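These steps required changing the chained_assignment option. Statements of the form train.Sex[train.Sex == 'male'] = 1 use chained indexing: pandas first selects a slice and then assigns into it, and it cannot always tell whether that slice is a view of the dataframe or a copy. In the pandas version used here, it therefore emits a SettingWithCopyWarning by default. I don't fully understand the details yet; see: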
https://stackoverflow.com/questions/21463589/pandas-chained-assignments
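As a minimal sketch of the usual way around the warning, the snippet below fills a missing value with a single `.loc` call instead of chained indexing. The `toy` frame and its values are made up for illustration and are not part of the Titanic data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (hypothetical values).
toy = pd.DataFrame({'Fare': [7.25, np.nan, 8.05], 'Pclass': [3, 1, 3]})

# Chained form (two steps): toy.Fare[1] = ... selects the column first and
# then assigns into that intermediate object, which may be a copy.
# Single-step form: .loc picks the rows and the column in one call on toy
# itself, so the assignment is applied to the original frame.
toy.loc[toy['Fare'].isna(), 'Fare'] = toy['Fare'].median()

print(toy)
```

Because the row mask and the column name are passed together, there is no intermediate slice for pandas to be unsure about.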

In [8]:
train.Age = train.Age.fillna(train.Age.median())
train.Sex[train.Sex == 'male'] = 1
train.Sex[train.Sex == 'female'] = 0
train.Embarked = train.Embarked.fillna('S')
train.Embarked[train.Embarked == 'S'] = 0
train.Embarked[train.Embarked == 'C'] = 1
train.Embarked[train.Embarked == 'Q'] = 2

# Fill the missing Fare (row 152 of test.csv) with the median fare
test.Fare = test.Fare.fillna(test.Fare.median())
test.Embarked = test.Embarked.fillna('S')
test.Embarked[test.Embarked == 'S'] = 0
test.Embarked[test.Embarked == 'C'] = 1
test.Embarked[test.Embarked == 'Q'] = 2
test.Sex[test.Sex == 'male'] = 1
test.Sex[test.Sex == 'female'] = 0
test.Age = test.Age.fillna(test.Age.median())
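The same imputation and encoding steps are applied to both dataframes, so they could be collected in one helper. The following is only a sketch: the function name `preprocess`, the two mapping dictionaries, and the `toy` frame are hypothetical, while the codes (male = 1, female = 0, S = 0, C = 1, Q = 2) and the default port 'S' are taken from the cell above.

```python
import numpy as np
import pandas as pd

# Category codes copied from the notebook cell above.
SEX_CODES = {'male': 1, 'female': 0}
EMBARKED_CODES = {'S': 0, 'C': 1, 'Q': 2}

def preprocess(df):
    """Return a copy of df with missing values filled and categories encoded."""
    out = df.copy()
    # Fill missing numeric values with the column median.
    out['Age'] = out['Age'].fillna(out['Age'].median())
    out['Fare'] = out['Fare'].fillna(out['Fare'].median())
    # Replace whole columns via map, which avoids chained assignment.
    out['Sex'] = out['Sex'].map(SEX_CODES)
    out['Embarked'] = out['Embarked'].fillna('S').map(EMBARKED_CODES)
    return out

# Made-up rows, only to show the helper running.
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0],
    'Fare': [7.25, 71.2833, np.nan],
    'Sex': ['male', 'female', 'female'],
    'Embarked': ['S', 'C', None],
})
clean = preprocess(toy)
print(clean)
```

With the real data this would be called as `train = preprocess(train)` and `test = preprocess(test)`. Each frame is still filled with its own medians, as in the original cell.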

## Build the model
Select the feature matrix and the target vector, then initialize and fit the random forest.

In [9]:
features = train[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']]
target = train.Survived

In [10]:
forest = RandomForestClassifier(max_depth=10, min_samples_split=2, n_estimators=100)
forest = forest.fit(features, target)

## Inspect the model

In [11]:
# list(...) is needed on Python 3, where zip returns a lazy iterator
list(zip(['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked'], forest.feature_importances_))

[('Pclass', 0.10643746408457368),
 ('Sex', 0.31376480997473871),
 ('Age', 0.21080908846988261),
 ('SibSp', 0.053744657456297015),
 ('Parch', 0.041783188236818265),
 ('Fare', 0.23892497031844959),
 ('Embarked', 0.034535821459240112)]
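The importances above sum to 1, so each value can be read as a feature's share of the total. A labelled, sorted view makes the ranking easier to see. The sketch below assumes a fitted scikit-learn forest. To stay self-contained it trains one on synthetic data, with the made-up column names `informative` and `noise`, rather than on the Titanic features.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: the target depends only on the 'informative' column.
rng = np.random.RandomState(0)
X = pd.DataFrame({'informative': rng.rand(200), 'noise': rng.rand(200)})
y = (X['informative'] > 0.5).astype(int)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Pair each importance with its column name and sort from most to least important.
importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

For the notebook's model, the equivalent would be `pd.Series(forest.feature_importances_, index=features.columns).sort_values(ascending=False)`.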

## Predict
Use the model to predict the target (Survived) for each passenger in the test set.

In [12]:
test_features = test[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']]
prediction = forest.predict(test_features)

## Generate output
Use a dataframe to write a CSV file for upload to Kaggle.

In [13]:
solution = pd.DataFrame(prediction, index=test.PassengerId, columns=['Survived'])
solution

PassengerId,Survived
892,0
893,0
894,0
895,0
896,0
897,0
898,0
899,0
900,1
901,0


In [14]:
solution.to_csv('forest_sol.csv', index_label=['PassengerId'])
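Before uploading, it can help to read the file back and confirm it has the two columns written above, PassengerId and Survived. This sketch uses an in-memory buffer instead of a file on disk, and the three predictions and ids are hypothetical stand-ins for `prediction` and `test.PassengerId`.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-ins for `prediction` and `test.PassengerId`.
prediction = np.array([0, 1, 0])
passenger_id = pd.Series([892, 893, 894], name='PassengerId')

solution = pd.DataFrame({'Survived': prediction}, index=passenger_id)

# Write to an in-memory buffer, then parse it again as a fresh CSV.
buffer = io.StringIO()
solution.to_csv(buffer, index_label='PassengerId')
check = pd.read_csv(io.StringIO(buffer.getvalue()))

print(check.columns.tolist())
print(len(check))
```

For the real file the same check would be `pd.read_csv('forest_sol.csv')`, and the row count should match `len(test)`.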