# Kaggle: Titanic - Machine Learning from Disaster

This is the Kaggle machine learning tutorial using data from the Titanic disaster. From the competition website http://www.kaggle.com/c/titanic-gettingStarted:

>The sinking of the RMS Titanic is one of the most infamous shipwrecks in history.  On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

>One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

>In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

Much of the code in this project is adapted from the Kaggle benchmark script myfirstforest.py, available at: https://www.kaggle.com/c/titanic/data.

In [66]:
import pandas as pd 
import numpy as np
import csv
from sklearn.ensemble import RandomForestClassifier
# Eliminate false positive SettingWithCopyWarning
pd.options.mode.chained_assignment = None


# Data loading and cleaning:

Load the data into pandas from the CSV files:

In [69]:
train_df = pd.read_csv("data/train.csv", header = 0) #training data
test_df = pd.read_csv("data/test.csv", header = 0) #test data
test_ids=test_df['PassengerId'].values #store Ids from test data for later

Take a look at the contents of the training data:

In [70]:
train_df.head() #training data

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


Per Kaggle (https://www.kaggle.com/c/titanic/data) the meaning of the variables is:

>VARIABLE DESCRIPTIONS:
>survival        Survival
>                (0 = No; 1 = Yes)
>pclass          Passenger Class
>                (1 = 1st; 2 = 2nd; 3 = 3rd)
>name            Name
>sex             Sex
>age             Age
>sibsp           Number of Siblings/Spouses Aboard
>parch           Number of Parents/Children Aboard
>ticket          Ticket Number
>fare            Passenger Fare
>cabin           Cabin
>embarked        Port of Embarkation
>                (C = Cherbourg; Q = Queenstown; S = Southampton)

>SPECIAL NOTES:
>Pclass is a proxy for socio-economic status (SES)
> 1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower

>Age is in Years; Fractional if Age less than One (1)
> If the Age is Estimated, it is in the form xx.5

>With respect to the family relation variables (i.e. sibsp and parch)
>some relations were ignored.  The following are the definitions used
>for sibsp and parch.

>Sibling:  Brother, Sister, Stepbrother, or Stepsister of Passenger Aboard Titanic
>Spouse:   Husband or Wife of Passenger Aboard Titanic (Mistresses and Fiances Ignored)
>Parent:   Mother or Father of Passenger Aboard Titanic
>Child:    Son, Daughter, Stepson, or Stepdaughter of Passenger Aboard Titanic

>Other family relatives excluded from this study include cousins,
>nephews/nieces, aunts/uncles, and in-laws.  Some children travelled
>only with a nanny, therefore parch=0 for them.  As well, some
>travelled with very close friends or neighbors in a village, however,
>the definitions do not support such relations.

Next, we can review the data to check its completeness:

In [71]:
train_df.count()


PassengerId    891
Survived       891
Pclass         891
Name           891
Sex            891
Age            714
SibSp          891
Parch          891
Ticket         891
Fare           891
Cabin          204
Embarked       889
dtype: int64

In [72]:
test_df.count()

PassengerId    418
Pclass         418
Name           418
Sex            418
Age            332
SibSp          418
Parch          418
Ticket         418
Fare           417
Cabin           91
Embarked       418
dtype: int64

We can see that the following fields have missing data:
* Age
* Fare
* Cabin
* Embarked
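
The list above can also be derived directly rather than read off the `count()` output. The following is a minimal sketch on a made-up toy DataFrame (the values are illustrative, not taken from the Titanic files) that counts missing entries per column with `isnull().sum()`:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic data (values are made up for illustration)
toy_df = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Fare": [7.25, 71.28, np.nan, 8.05],
    "Embarked": ["S", "C", None, "S"],
})

# Count missing values per column, then keep only columns with at least one
missing_counts = toy_df.isnull().sum()
columns_with_missing = missing_counts[missing_counts > 0]
print(columns_with_missing)
```

Applied to `train_df` and `test_df`, the same two lines would list only the incomplete columns.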

The cabin and ticket variables don't appear to have useful data for predicting survival and can be dropped. The age, fare and embarked fields could be useful for predicting survival, so their missing values are filled in: missing ages and fares are estimated from the median for other passengers in the same class, and missing ports of embarkation are set to the most common port.
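
As a compact illustration of class-based estimation, here is a sketch that fills missing ages with the median age of the same passenger class using `groupby` and `transform`. The toy data is made up, and this is an alternative formulation of the idea, not the code used in the cleaning function of this notebook:

```python
import numpy as np
import pandas as pd

# Toy data: two passenger classes, one missing age in each (made-up values)
toy_df = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age": [40.0, 50.0, np.nan, 20.0, 30.0, np.nan],
})

# Median age within each passenger class, aligned to the original rows
class_median_age = toy_df.groupby("Pclass")["Age"].transform("median")

# Fill only the missing entries with the median of the matching class
toy_df["Age"] = toy_df["Age"].fillna(class_median_age)
print(toy_df)
```

The missing first-class age becomes 45.0 (the median of 40 and 50) and the missing third-class age becomes 25.0.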

In addition, the Sex and Embarked fields need to be converted to numerical values for analysis. We can encode Sex as Female = 0, Male = 1 (stored in a new Gender column), and Embarked as 0 = Cherbourg, 1 = Queenstown and 2 = Southampton.
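
A small sketch of this kind of encoding on made-up rows is shown below. It uses explicit lookup dictionaries whose values mirror the encoding applied in the `clean_data` function (female = 0, male = 1; C = 0, Q = 1, S = 2); the names `sex_codes` and `port_codes` are hypothetical and chosen only for this example:

```python
import pandas as pd

# Toy rows with the two text columns that need encoding (made-up values)
toy_df = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

# Explicit lookup tables; the values mirror the encoding used in clean_data
sex_codes = {"female": 0, "male": 1}
port_codes = {"C": 0, "Q": 1, "S": 2}

toy_df["Gender"] = toy_df["Sex"].map(sex_codes).astype(int)
toy_df["Embarked"] = toy_df["Embarked"].map(port_codes).astype(int)
print(toy_df)
```

One possible advantage of fixed dictionaries is that the training and test sets receive identical codes even if a port happens to be absent from one of them.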

The following function will clean the data:

In [73]:
def clean_data(passed_df):

    # Encode sex as an integer in a new column: female = 0, male = 1
    passed_df['Gender'] = passed_df['Sex'].map({"female": 0, "male": 1}).astype(int)

    # Fill missing embarkation ports with the most common port
    if passed_df.Embarked.isnull().any():
        most_common_port = passed_df.Embarked.dropna().mode()[0]
        passed_df.loc[passed_df.Embarked.isnull(), 'Embarked'] = most_common_port

    # Encode ports as integers in alphabetical order (C = 0, Q = 1, S = 2)
    Ports = list(enumerate(np.unique(passed_df['Embarked'])))
    Ports_dict = {name: i for i, name in Ports}
    passed_df['Embarked'] = passed_df['Embarked'].map(Ports_dict).astype(int)

    # Replace missing ages with the median age for the passenger's class
    if passed_df.Age.isnull().any():
        median_age = np.zeros(3)
        # Calculate the median age for each class
        for f in range(0, 3):
            median_age[f] = passed_df[passed_df.Pclass == f + 1]['Age'].dropna().median()
        # Assign age based on passenger class
        for f in range(0, 3):
            passed_df.loc[passed_df.Age.isnull() & (passed_df.Pclass == f + 1), 'Age'] = median_age[f]

    # Replace missing fares with the median fare for the passenger's class
    if passed_df.Fare.isnull().any():
        median_fare = np.zeros(3)
        for f in range(0, 3):
            median_fare[f] = passed_df[passed_df.Pclass == f + 1]['Fare'].dropna().median()
        for f in range(0, 3):
            passed_df.loc[passed_df.Fare.isnull() & (passed_df.Pclass == f + 1), 'Fare'] = median_fare[f]

    # Drop fields not used for machine learning
    passed_df = passed_df.drop(["Name", "Sex", "Ticket", "Cabin", "PassengerId"], axis=1)

    return passed_df


Finally, apply the data cleaning function to the training and test data:

In [74]:
train_df=clean_data(train_df)
test_df=clean_data(test_df)

# Perform Machine Learning:

We then convert the cleaned data into NumPy arrays and train a random forest classifier from scikit-learn:

In [75]:
# Convert the DataFrames to NumPy arrays for the forest algorithm
train_data = train_df.values
test_data = test_df.values

# Column 0 holds the Survived label; the remaining columns are the features
forest = RandomForestClassifier(n_estimators=10000)
forest = forest.fit(train_data[:, 1:], train_data[:, 0])
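
To make the array slicing in the training step concrete, here is a minimal sketch on a made-up array laid out like the cleaned training data, with the label in column 0 and features in the remaining columns. The array values, `n_estimators=10` and `random_state=0` are illustrative assumptions, not the settings used above:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy array laid out like the cleaned training data:
# column 0 is the label (Survived), remaining columns are features
toy_train = np.array([
    [0, 3, 22.0, 1],
    [1, 1, 38.0, 0],
    [1, 3, 26.0, 0],
    [0, 3, 35.0, 1],
])

features = toy_train[:, 1:]  # every row, all columns after the first
labels = toy_train[:, 0]     # every row, first column only

# n_estimators and random_state are illustrative choices, not tuned values
toy_forest = RandomForestClassifier(n_estimators=10, random_state=0)
toy_forest.fit(features, labels)
toy_predictions = toy_forest.predict(features).astype(int)
print(toy_predictions)
```

The slice `[:, 1:]` is equivalent to the `[0::,1::]` form inherited from the benchmark script.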

We then apply the trained model to the test data set and output the results to a CSV file:

In [76]:
output = forest.predict(test_data).astype(int)

# Open in text mode with newline="" as the csv module expects in Python 3
with open("titanicpredictions.csv", "w", newline="") as predictions_file:
    open_file_object = csv.writer(predictions_file)
    open_file_object.writerow(["PassengerId", "Survived"])
    open_file_object.writerows(zip(test_ids, output))
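
As an alternative to the `csv` module, the same two-column file could be produced with pandas. This is only a sketch: `example_ids` and `example_predictions` are hypothetical stand-ins for `test_ids` and `output`, and the result is written to an in-memory buffer instead of a file so the example has no side effects:

```python
import io
import pandas as pd

# Hypothetical IDs and predictions standing in for test_ids and output
example_ids = [892, 893, 894]
example_predictions = [0, 1, 0]

submission = pd.DataFrame({
    "PassengerId": example_ids,
    "Survived": example_predictions,
})

# Write to an in-memory buffer here; pass a file path instead to save to disk
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Passing `index=False` keeps the DataFrame's row index out of the file, so only the two named columns are written.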

The results were submitted to Kaggle, where the model achieved a prediction accuracy of 0.75598.
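
For reference, accuracy here means the fraction of passengers whose outcome is predicted correctly. The sketch below computes it on made-up labels, since the true test labels are held by Kaggle and are not available locally:

```python
# Hypothetical labels and predictions, made up purely for illustration
true_labels = [0, 1, 1, 0, 1]
predicted_labels = [0, 1, 0, 0, 1]

# Accuracy is the number of matching predictions divided by the total
correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
accuracy = correct / len(true_labels)
print(accuracy)
```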