Import libraries

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
import random
import numpy as np
import pandas as pd
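# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection is its replacement
from sklearn import datasets, svm, model_selection, tree, preprocessing, metrics
cross_validation = model_selection  # keep the old name usable by later cells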
import sklearn.ensemble as ske

Load data

In [None]:
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

In [None]:
train

In [None]:
test

Do some initial exploratory analysis

Bar Charts

In [None]:
# count of male and female
bar1 = train.groupby(['Sex'])['Sex'].count()

bar1.plot.bar()

In [None]:
# average fare by class
bar2 = train.groupby(['Pclass'])['Fare'].mean()

bar2.plot.bar()

In [None]:
# Number of passengers from each point of departure
bar3 = train.groupby(['Embarked'])['Embarked'].count()

bar3.plot.bar()

In [None]:
# Average Age by Passenger Class

bar4 = train.groupby(['Pclass'])['Age'].mean()

bar4.plot.bar()

In [None]:
# Survival rates by age bins
age_bins = pd.cut(train["Age"], np.arange(0, 90, 10))
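# select 'Survived' before averaging so text columns (Name, Sex, ...) are not aggregated
bar5 = train.groupby(age_bins, observed=False)['Survived'].mean()
bar5.plot.bar()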

In [None]:
# Survival rates by class
# select 'Survived' before averaging so text columns (Name, Sex, ...) are not aggregated
bar6 = train.groupby(["Pclass"])['Survived'].mean()
bar6.plot.bar()

Drop the variables that we will not use as features in the model.

We're dropping these variables for the following reasons:
PassengerId - A row identifier. It is helpful for keeping track of the data but carries no information about survival. (It stays in the test set because the submission file needs it.)
Name - This is simply a name and so does not come into play in a model like this.
Cabin - Another label attached to the passenger rather than a feature we model here.
Ticket - Another label attached to the passenger rather than a feature we model here.
Embarked - Any value from Embarked would come from the general class of people boarding at each port. If the passengers boarding at 'S' were wealthier, they might have a better chance of surviving because they travelled in a higher class. In that case Pclass would capture the effect at least as well, and the multicollinearity between the two would likely be very high.
Parch - Likely to overlap heavily with the other variables. As with Embarked, its effect would probably be captured by the variables we keep.
SibSp - Likely to overlap heavily with the other variables. As with Embarked, its effect would probably be captured by the variables we keep.
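The overlap argued above can be checked rather than assumed. The sketch below is one possible way to do it, run on a small invented frame (`toy`, not the real `train.csv`): a row-normalised cross-tabulation shows how passenger classes are distributed within each port, and a correlation matrix shows how the numeric candidates relate to each other. Running the same two calls on `train` before the drop would show whether the overlap is really as high as assumed.

```python
import pandas as pd

# Toy stand-in for train.csv; these rows are invented for illustration only.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 2, 3, 3, 3, 1, 2],
    "Embarked": ["C", "C", "S", "S", "Q", "S", "S", "S"],
    "SibSp":    [1, 0, 0, 3, 0, 1, 0, 1],
    "Parch":    [0, 0, 1, 2, 0, 0, 0, 1],
    "Fare":     [71.3, 80.0, 13.0, 21.1, 7.8, 8.1, 53.1, 26.0],
})

# Share of each passenger class within each port of embarkation (each row sums to 1).
embarked_vs_class = pd.crosstab(toy["Embarked"], toy["Pclass"], normalize="index")

# Pairwise Pearson correlations among the numeric candidate features.
numeric_corr = toy[["Pclass", "SibSp", "Parch", "Fare"]].corr()

print(embarked_vs_class)
print(numeric_corr)
```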

In [None]:
# drop those variables from both sets; test keeps PassengerId for the submission file
train = train.drop(['PassengerId','Name','Cabin','Ticket','Parch','SibSp','Embarked'], axis=1)
test = test.drop(['Name','Cabin','Ticket','Parch','SibSp','Embarked'], axis=1)

How are we going to deal with missing data?

For age and fare, we can simply fill the missing values with the column mean, calculated from the remaining (non-missing) values.
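As a minimal sketch of that idea, the means can be computed once from the training rows and reused for both sets instead of being typed in by hand. The frames `toy_train` and `toy_test` and their values are invented for illustration, and filling the test set with the training means is an assumption about what is wanted here, not something the notebook states.

```python
import numpy as np
import pandas as pd

# Toy frames standing in for train/test; values are invented for illustration.
toy_train = pd.DataFrame({"Age": [22.0, np.nan, 38.0], "Fare": [7.25, 71.28, np.nan]})
toy_test = pd.DataFrame({"Age": [np.nan, 30.0], "Fare": [8.05, np.nan]})

# Learn the fill values from the training rows only (NaNs are skipped by default).
fill_values = toy_train[["Age", "Fare"]].mean()

# Apply the same training means to both sets so they are treated consistently.
toy_train_filled = toy_train.fillna(fill_values)
toy_test_filled = toy_test.fillna(fill_values)

print(fill_values)
print(toy_test_filled)
```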

In [None]:
train.count()

In [None]:
# find the averages for age . . .
train["Age"].mean(skipna=True)

In [None]:
# . . . and fare
train["Fare"].mean(skipna=True)

In [None]:
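# assign the result back instead of calling fillna(inplace=True) on a column selection,
# which newer pandas versions warn about and may not apply to the DataFrame
train["Age"] = train["Age"].fillna(29.7)
train["Fare"] = train["Fare"].fillna(34.69)

test["Age"] = test["Age"].fillna(29.7)
test["Fare"] = test["Fare"].fillna(34.69)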

In [None]:
train.count()

Identify the data types in the dataset. Use this to find the object columns that need converting to numbers

In [None]:
train.dtypes

Turn Female/Male into 0/1

In [None]:
# map the labels and assign back; inplace replace on a column selection is unreliable in newer pandas
train['Sex'] = train['Sex'].map({'female': 0, 'male': 1})
test['Sex'] = test['Sex'].map({'female': 0, 'male': 1})

In [None]:
train['Sex']

Make datasets for training. Split the train dataset into an 80% training set and a 20% held-out test set

In [None]:
X = train.drop(["Survived"], axis = 1).values
Y = train["Survived"].values

In [None]:
from sklearn.model_selection import train_test_split  # cross_validation was removed in scikit-learn 0.20
train_X, test_X, train_Y, test_Y = train_test_split(X, Y, test_size = 0.2)

Model it, and score it

Random Forest

In [None]:
model = ske.RandomForestClassifier(n_estimators=100)

In [None]:
model.fit(train_X , train_Y)

In [None]:
print (model.score( train_X , train_Y ) , model.score( test_X , test_Y ))

Conclusion:
Since there is a large gap between the training and test scores, this model is overfit to the training data. Even so, it seems to give a reasonably strong prediction on previously unseen data. I recommend extending this with other models and comparing them against each other. That said, based on previous work with this dataset, I've found that Random Forest provides the best model for prediction.
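One possible way to run the comparison recommended above is k-fold cross-validation, which scores every candidate on the same folds instead of a single 80/20 split. The sketch below uses synthetic data from `make_classification` as a stand-in for the four Titanic features. The candidate list (including the depth-limited forest, which is one common way to narrow a train/test gap) is an illustrative assumption, not a result from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the four features; this is not the Titanic data.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

# Hypothetical candidates to compare; swap in any other scikit-learn classifier.
candidates = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "shallow_forest": RandomForestClassifier(n_estimators=100, max_depth=3, random_state=0),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

# Mean accuracy over 5 folds for each candidate.
cv_means = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X_demo, y_demo, cv=5)
    cv_means[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Replacing `X_demo` and `y_demo` with the `X` and `Y` arrays built earlier would give the same comparison on the real training data.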

Submission

In [None]:
# the model was fit on a NumPy array, so predict on one too (avoids a feature-name warning)
submission_data = test.loc[:, test.columns != 'PassengerId'].values
submission_predictions = model.predict(submission_data)

In [None]:
submission = pd.DataFrame({
        "PassengerId": test["PassengerId"],
        "Survived": submission_predictions
    })
submission.to_csv('Virshup BAX 452.csv', index=False)