# Kaggle Machine Learning Competition: Predicting Titanic Survivors

Kaggle is an online platform for competitive data science (now owned by Google).

They publish datasets and ask participants to submit algorithms that produce a desired result. The desired result depends on the problem, but you never get to see the true answers for the test dataset that they use for scoring.

This is one of their "introductory" data science competitions. It's particularly interesting because it raises some very important ethical questions.

If you'd like to know more about the competition, visit the original [competition site](https://www.kaggle.com/c/titanic-gettingStarted).

## Dataset

The data has been loaded into this IPython notebook for you. Please use it for training and validation.

## Competition time!

We will run a little internal competition and the winner will get to talk through their result. Feel free to work in teams, or by yourself.

## Evaluation

For each passenger in the test set, you must predict whether or not they survived the sinking (0 for deceased, 1 for survived). Your score is the percentage of passengers you predict correctly (i.e. accuracy).
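As a minimal sketch of the scoring rule described above, accuracy is the fraction of predictions that match the true labels. The arrays `y_true` and `y_pred` and the helper `accuracy` are made up for illustration; they are not competition data or the official scoring code.

```python
import numpy as np

# Hypothetical labels for illustration only (0 = deceased, 1 = survived)
y_true = np.array([1, 0, 0, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 1, 1, 0, 0, 0, 0])


def accuracy(y_true, y_pred):
    """Return the fraction of predictions that equal the true labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))


score = accuracy(y_true, y_pred)
print(f"Accuracy: {score:.1%}")  # 6 of 8 predictions match -> 75.0%
```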

We will compare scores at the end.

Good luck!

## Data Set

<pre>
VARIABLE DESCRIPTIONS:
survival        Survival
                (0 = No; 1 = Yes)
pclass          Passenger Class
                (1 = 1st; 2 = 2nd; 3 = 3rd)
name            Name
sex             Sex
age             Age
sibsp           Number of Siblings/Spouses Aboard
parch           Number of Parents/Children Aboard
ticket          Ticket Number
fare            Passenger Fare
cabin           Cabin
embarked        Port of Embarkation
                (C = Cherbourg; Q = Queenstown; S = Southampton)

SPECIAL NOTES:
Pclass is a proxy for socio-economic status (SES)
 1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower

Age is in Years; Fractional if Age less than One (1)
 If the Age is Estimated, it is in the form xx.5

With respect to the family relation variables (i.e. sibsp and parch)
some relations were ignored.  The following are the definitions used
for sibsp and parch.

Sibling:  Brother, Sister, Stepbrother, or Stepsister of Passenger Aboard Titanic
Spouse:   Husband or Wife of Passenger Aboard Titanic (Mistresses and Fiances Ignored)
Parent:   Mother or Father of Passenger Aboard Titanic
Child:    Son, Daughter, Stepson, or Stepdaughter of Passenger Aboard Titanic

Other family relatives excluded from this study include cousins,
nephews/nieces, aunts/uncles, and in-laws.  Some children travelled
only with a nanny, therefore parch=0 for them.  As well, some
travelled with very close friends or neighbors in a village, however,
the definitions do not support such relations.
</pre>

## Setup Imports and Variables

In [0]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Set the global default size of matplotlib figures
plt.rc('figure', figsize=(10, 5))

# Size of matplotlib figures that contain subplots
figsize_with_subplots = (10, 10)

# Size of matplotlib histogram bins
bin_size = 10

## Explore the Data

Read the data:

In [2]:
X_train = pd.read_csv("https://raw.githubusercontent.com/winderresearch/training-data-cleaning/master/data/titanic.csv")
X_train.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2.0,,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.92,1,2,113781,151.55,C22 C26,S,11.0,,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"


In [3]:
X_train.tail()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
1304,3,0,"Zabour, Miss. Hileni",female,14.5,1,0,2665,14.4542,,C,,328.0,
1305,3,0,"Zabour, Miss. Thamine",female,,1,0,2665,14.4542,,C,,,
1306,3,0,"Zakarian, Mr. Mapriededer",male,26.5,0,0,2656,7.225,,C,,304.0,
1307,3,0,"Zakarian, Mr. Ortin",male,27.0,0,0,2670,7.225,,C,,,
1308,3,0,"Zimmerman, Mr. Leo",male,29.0,0,0,315082,7.875,,S,,,


View the data types of each column:

In [4]:
X_train.dtypes

pclass         int64
survived       int64
name          object
sex           object
age          float64
sibsp          int64
parch          int64
ticket        object
fare         float64
cabin         object
embarked      object
boat          object
body         float64
home.dest     object
dtype: object

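Object types (usually strings) are a problem for most algorithms, so you'll usually have to convert them into usable numeric values. Tree-based methods can in principle split on categories directly, but many implementations (scikit-learn's, for example) still expect numeric input.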
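One possible way to turn string columns into numbers is sketched below, using a direct mapping for a two-valued column and one-hot encoding for a column with several categories. The small DataFrame is a hypothetical sample that only mirrors the `sex` and `embarked` columns, and the `sex_code` column name is an assumption made for this example.

```python
import pandas as pd

# Hypothetical sample mirroring the dataset's "sex" and "embarked" columns
df = pd.DataFrame({
    "sex": ["female", "male", "female", "male"],
    "embarked": ["S", "C", "Q", "S"],
})

# Two-valued column: map each string to an integer
df["sex_code"] = df["sex"].map({"female": 0, "male": 1})

# Multi-valued column: one-hot encode into one indicator column per category
embarked_dummies = pd.get_dummies(df["embarked"], prefix="embarked").astype(int)

# Keep only the numeric columns
encoded = pd.concat([df[["sex_code"]], embarked_dummies], axis=1)
print(encoded)
```

One-hot encoding avoids implying an order between ports, which a single integer code (C = 0, Q = 1, S = 2) would do.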

Get some basic information on the DataFrame:

In [5]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 14 columns):
pclass       1309 non-null int64
survived     1309 non-null int64
name         1309 non-null object
sex          1309 non-null object
age          1046 non-null float64
sibsp        1309 non-null int64
parch        1309 non-null int64
ticket       1309 non-null object
fare         1308 non-null float64
cabin        295 non-null object
embarked     1307 non-null object
boat         486 non-null object
body         121 non-null float64
home.dest    745 non-null object
dtypes: float64(3), int64(4), object(7)
memory usage: 143.2+ KB


We can see that age, cabin and embarked have missing (null) values, and so do fare, boat, body and home.dest. These will have to be imputed or removed.
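The `info()` output can be hard to scan, so here is a small sketch that counts the nulls per column directly. The DataFrame is a hypothetical sample with gaps, not the real data; in the notebook the same calls could be applied to `X_train`.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with gaps, for illustration only
df = pd.DataFrame({
    "age": [29.0, np.nan, 2.0, np.nan],
    "cabin": ["B5", None, None, None],
    "fare": [211.3375, 151.55, 151.55, 7.225],
})

# Number and fraction of missing values in each column
missing_count = df.isnull().sum()
missing_fraction = df.isnull().mean()

summary = pd.DataFrame({"missing": missing_count, "fraction": missing_fraction})
print(summary.sort_values("missing", ascending=False))
```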

Generate various descriptive statistics on the DataFrame:

In [6]:
X_train.describe()

Unnamed: 0,pclass,survived,age,sibsp,parch,fare,body
count,1309.0,1309.0,1046.0,1309.0,1309.0,1308.0,121.0
mean,2.294882,0.381971,29.881138,0.498854,0.385027,33.295479,160.809917
std,0.837836,0.486055,14.413493,1.041658,0.86556,51.758668,97.696922
min,1.0,0.0,0.17,0.0,0.0,0.0,1.0
25%,2.0,0.0,21.0,0.0,0.0,7.8958,72.0
50%,3.0,0.0,28.0,0.0,0.0,14.4542,155.0
75%,3.0,1.0,39.0,1.0,0.0,31.275,256.0
max,3.0,1.0,80.0,8.0,9.0,512.3292,328.0


Now that we have a general idea of what the dataset looks like, I would recommend that you use the following steps as a guideline. But of course, it's entirely up to you as to which approach you take.

## Feature Analysis

Go through each feature and create some visual aids to help with the analysis. I would start with histograms. For features with many distinct values (e.g. age) you may want to use a kernel density estimate rather than a histogram.

Then remember the goal: to predict whether a passenger survived or not. So for each feature, create a bar chart that shows the survived and not-survived counts for each value of that feature.
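The survived/not-survived breakdown described above can be sketched with a cross-tabulation. The data below is a hypothetical sample, and the helpers `survival_counts` and `plot_survival_counts` are names invented for this example.

```python
import pandas as pd

# Hypothetical sample, for illustration only
df = pd.DataFrame({
    "pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "survived": [1, 1, 1, 0, 0, 0, 1, 0],
})


def survival_counts(df, feature):
    """Count deceased (0) and survived (1) passengers for each feature value."""
    return pd.crosstab(df[feature], df["survived"])


def plot_survival_counts(counts):
    """Draw a stacked bar chart of the counts (requires matplotlib)."""
    ax = counts.plot(kind="bar", stacked=True)
    ax.set_ylabel("Number of passengers")
    return ax


counts = survival_counts(df, "pclass")
rates = df.groupby("pclass")["survived"].mean()  # survival rate per class
print(counts)
print(rates)
```

In the notebook, `plot_survival_counts(survival_counts(X_train, "pclass"))` would draw the chart for the real data. For a continuous feature, `X_train["age"].plot(kind="hist", bins=bin_size)` gives a histogram, and `kind="kde"` gives a density estimate if SciPy is installed.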

## Data Cleaning

Once you have a thorough understanding of the data, it is time to clean it up. You will have to choose what to do for each feature. For example: drop rows or columns entirely? Impute values? Encode the null as a value of its own?

If you're finding this hard, it might be easier to just start with a few features that are complete, or near complete.
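The cleaning options mentioned above might look like the following sketch. The sample data is hypothetical, the `age_missing` column name is an assumption, and the choices shown (median, most frequent value, dropping `cabin`) are only one reasonable starting point rather than the recommended answer.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with gaps, for illustration only
df = pd.DataFrame({
    "age": [29.0, np.nan, 2.0, 30.0],
    "embarked": ["S", None, "S", "C"],
    "cabin": ["B5", None, None, None],
})

cleaned = df.copy()

# Encode the null itself: flag which ages were missing before filling them
cleaned["age_missing"] = cleaned["age"].isnull().astype(int)

# Impute a numeric column with its median
cleaned["age"] = cleaned["age"].fillna(cleaned["age"].median())

# Impute a categorical column with its most frequent value
cleaned["embarked"] = cleaned["embarked"].fillna(cleaned["embarked"].mode()[0])

# Drop a column that is mostly empty
cleaned = cleaned.drop(columns=["cabin"])

print(cleaned)
```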

## Modelling

This is the fun part. Start picking models that you can train your data upon. I'd recommend starting with something easy and as you gain confidence start considering more complex models.
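As one example of "something easy", the sketch below fits a logistic regression with scikit-learn and scores it on a held-out validation split. This is an assumption about tooling, not a prescribed approach: the data is randomly generated, and the feature names `pclass` and `sex_code` stand in for whatever cleaned, numeric columns you end up with.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical, randomly generated data standing in for cleaned features
rng = np.random.default_rng(0)
n = 200
features = pd.DataFrame({
    "pclass": rng.integers(1, 4, size=n),    # values 1, 2 or 3
    "sex_code": rng.integers(0, 2, size=n),  # 0 = female, 1 = male
})

# Synthetic label loosely tied to sex_code, so there is something to learn
noise = rng.random(n)
is_female = features["sex_code"] == 0
target = ((is_female & (noise < 0.8)) | (noise < 0.15)).astype(int)

# Hold out a quarter of the rows for validation
X_fit, X_val, y_fit, y_val = train_test_split(
    features, target, test_size=0.25, random_state=0, stratify=target
)

model = LogisticRegression()
model.fit(X_fit, y_fit)

# score() returns accuracy for classifiers
val_accuracy = model.score(X_val, y_val)
print(f"Validation accuracy: {val_accuracy:.2f}")
```

Scoring on rows the model has not seen gives a more honest estimate than scoring on the rows it was fitted to.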

Bear in mind that you will spend a lot of time going backwards and forwards to re-clean the data and tune the algorithm. So don't try to compare too many models. Pick a model and understand its weaknesses before you move on.

You may even consider taking an entirely statistical approach at this point. A Bayesian interpretation of the data could yield some very interesting results (although this is probably more difficult at this stage; this isn't a statistics course :-) ).

## Evaluation

When you have finished, we will compare everyone's models. I will provide some code that you can all run.

To set expectations:
    
- 50% is just a random guess. This is your baseline.
- 75-85% is pretty good. You should be around this range.
- 100% is world-beating, but possible. If you get to 100% you've probably done something wrong. :-D
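To put the numbers above in context, here is a small sketch of two naive baselines. It uses the mean of the `survived` column from the `describe()` output earlier in this notebook; the data used for the final comparison may have a different class balance, so treat the result as a rough guide only.

```python
# Mean of the "survived" column, copied from the describe() output above
survival_rate = 0.381971

# A fair coin flip is right half the time, whatever the class balance
coin_flip_baseline = 0.5 * survival_rate + 0.5 * (1 - survival_rate)

# Always predicting the more common class is right for every member of that class
majority_baseline = max(survival_rate, 1 - survival_rate)

print(f"Coin flip:      {coin_flip_baseline:.1%}")
print(f"Majority class: {majority_baseline:.1%}")
```

On this data, always predicting "did not survive" already scores about 62%, so a model needs to beat that to be learning anything useful.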