# Machine Learning in Python

by [Piotr Migdał](http://p.migdal.pl/) & [Dominik Krzemiński](https://github.com/dokato/)

for El Passion, 2017

## 8. Titanic project

!["titanic"](https://upload.wikimedia.org/wikipedia/commons/5/51/Titanic-New_York_Herald_front_page.jpeg)

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
plt.style.use('ggplot')

%matplotlib inline

On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew.

## Data exploration

In [None]:
data = pd.read_csv("data/titanic.csv")

Let's take a look at the first 10 rows of the data.

In [None]:
data.head(10)

Description:

- _Survived_ - Survival (0 = No, 1 = Yes)
- _Pclass_ - Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
- _Sex_ - sex
- _Age_ - age in years
- _SibSp_ - number of siblings / spouses aboard the Titanic
- _Parch_ - number of parents / children aboard the Titanic
- _Ticket_ - ticket number
- _Fare_ - passenger fare
- _Cabin_ - cabin number
- _Embarked_ - port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)


Let's summarize quantitative data.

In [None]:
data.describe()

## Visualization

In this part we will make some plots to better understand what our data can tell us about passengers.

In [None]:
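# value_counts() is called on the column (the top-level pd.value_counts is deprecated);
# sort_index() puts the bars in class order 1, 2, 3
data.Pclass.value_counts().sort_index().plot(kind='bar', rot=0)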
plt.title('Number of passengers in each class')
plt.xlabel('Passenger Class')
plt.show()

In [None]:
data.Sex.value_counts().plot(kind='bar', rot=0, color='g')
plt.title('Number of passengers depending on gender')
plt.show()

In [None]:
ax = plt.subplot(111)
# sort_index() guarantees the order 0, 1, so the labels below match the bars
data.Survived.value_counts().sort_index().plot(kind='bar', rot=0, color='b', ax=ax)
ax.set_xticklabels(['Not survived', 'Survived'])
plt.title('Number of survivors')
plt.show()

In [None]:
facet = sns.FacetGrid(data, hue='Survived', aspect=4, row='Sex')
# fill=True replaces the deprecated shade=True (seaborn >= 0.11)
facet.map(sns.kdeplot, 'Age', fill=True)
facet.set(xlim=(0, data['Age'].max()))
facet.add_legend()

The plots above show how the number of Titanic passengers depends on the chosen features: class, gender, survival and age.

- The first plot shows that the majority of travellers were third-class passengers;
- From the second we can conclude that more than 60% of them were male;
- The bar plot in the third picture presents the well-known fact that most passengers did not survive;
- From the density plots above, it appears that children were saved first and that elderly people had the lowest chances of survival.
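Claims like "more than 60%" are easier to check with numbers than by reading bar heights. The sketch below shows one way to get such shares; it uses a small made-up table (`toy` is hypothetical, not the real data), so the printed values say nothing about the actual Titanic passengers.

```python
import pandas as pd

# Toy stand-in for the Titanic table (hypothetical values, not the real data)
toy = pd.DataFrame({
    "Sex": ["male", "male", "male", "female", "female"],
    "Pclass": [3, 3, 1, 2, 3],
    "Survived": [0, 0, 1, 1, 0],
})

# normalize=True returns the share of each category instead of raw counts
sex_share = toy["Sex"].value_counts(normalize=True)
class_share = toy["Pclass"].value_counts(normalize=True).sort_index()

# The mean of a 0/1 column is the fraction of ones, i.e. the survival rate
survival_rate = toy["Survived"].mean()

print(sex_share)
print(class_share)
print(f"Survival rate: {survival_rate:.2f}")
```

Running the same three expressions on `data` instead of `toy` gives the figures behind the plots above.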

In [None]:
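# groupby is a DataFrame method; the top-level pd.groupby no longer exists
grdata = data.groupby('Pclass')
# the plot type has to be passed as a keyword argument
grdata.Fare.mean().plot(kind='bar', color='y', rot=0)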
plt.title('Average passenger fare in each class')
plt.xlabel('Passenger Class')
plt.show()

In [None]:
sns.pointplot(x="Pclass", y="Survived", hue="Sex", data=data,
              palette={"male": "blue", "female": "red"},
              markers=["*", "o"], linestyles=["-", "--"])
plt.show()

In the first plot we see the average fare in each class. It's not surprising that first class was the most expensive, but the gap between first and second class is striking. What about the chances of survival? The second plot shows that well-off passengers had a greater chance of surviving, which also agrees with the picture of the catastrophe painted in James Cameron's movie.
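The same comparison can be written as a table with `groupby`. This is a minimal sketch on a made-up table (`toy` and its values are hypothetical); only the column names follow the dataset description.

```python
import pandas as pd

# Hypothetical passengers, used only to illustrate the grouping
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 3, 3, 3, 3, 3],
    "Sex":      ["female", "male", "male", "female", "female", "male", "male", "male"],
    "Fare":     [80.0, 60.0, 70.0, 8.0, 10.0, 7.0, 9.0, 6.0],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 0],
})

# Average fare per class (what the bar plot shows)
mean_fare = toy.groupby("Pclass")["Fare"].mean()

# Survival rate per class and sex (what the point plot shows);
# unstack() moves Sex into columns so the result reads as a table
survival = toy.groupby(["Pclass", "Sex"])["Survived"].mean().unstack()

print(mean_fare)
print(survival)
```

Replacing `toy` with `data` gives the exact values that the two plots display.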

In [None]:
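# factorplot was renamed to catplot (and size to height) in seaborn 0.9;
# kind='point' reproduces the old factorplot default
sns.catplot(x='SibSp', y='Survived', data=data, kind='point', height=7)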
plt.show()

## Warm-up exercises

(a) Create a similar plot for the number of parents / children aboard the Titanic (the `Parch` column).

(b) What does it tell us about survival?

## Filling-in missing data

In [None]:
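# assign the result back: fillna(inplace=True) on a selected column is unreliable in recent pandas
data['Fare'] = data['Fare'].fillna(data['Fare'].median())
data['Embarked'] = data['Embarked'].fillna('S')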
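Before filling anything in, it helps to see which columns actually contain gaps. The sketch below works on a made-up table (`toy` is hypothetical). It also fills `Age` with the median and `Embarked` with its most frequent value; both are illustrative assumptions, not choices made in this notebook, and whether `Age` has gaps in `data/titanic.csv` is something to check rather than assume.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with a few gaps
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan, 54.0],
    "Fare": [7.25, 71.28, np.nan, 8.05, 51.86],
    "Embarked": ["S", "C", None, "S", "Q"],
})

# Number of missing values in each column
missing_before = toy.isnull().sum()
print(missing_before)

# Numeric columns: fill with the median (an assumption; other strategies exist)
toy["Age"] = toy["Age"].fillna(toy["Age"].median())
toy["Fare"] = toy["Fare"].fillna(toy["Fare"].median())

# Categorical column: fill with the most frequent value
toy["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].mode()[0])

missing_after = toy.isnull().sum()
print(missing_after)
```

Running `data.isnull().sum()` on the real table shows which columns need this treatment before they can be used as features.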

## Prediction

Your task is to write a classifier that predicts the `Survived` label from the given Titanic data.

Before writing any code, think about the following questions:

- what kind of problem are you facing: regression or classification?

- what kind of algorithms can you use?

- which features are the most informative?

- how will you test your classifier's performance?

In [None]:
preddataX = data[[ ... ]].copy() # list the features you find interesting
                                 # as comma-separated strings, e.g. 'Age', 'Pclass'

# if you want to use gender, encode it as binary values
#preddataX['Sex'] = (preddataX['Sex']=='male').astype(int)

preddataY = data['Survived']

# transform to numpy arrays (as_matrix() was removed in pandas 1.0)
X = preddataX.to_numpy()
y = preddataY.to_numpy()
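Binary coding works for `Sex`, but a column with three unordered values such as `Embarked` needs a different treatment. One common option, shown here only as a sketch on a hypothetical `toy` table, is one-hot encoding with `pd.get_dummies`; the notebook itself does not prescribe this.

```python
import pandas as pd

# Hypothetical rows; column names follow the dataset description
toy = pd.DataFrame({
    "Pclass": [3, 1, 2],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

# Work on a copy so the original table is left untouched
features = toy[["Pclass", "Sex", "Embarked"]].copy()

# Two categories: a single 0/1 column is enough
features["Sex"] = (features["Sex"] == "male").astype(int)

# Three unordered categories: one 0/1 column per port
features = pd.get_dummies(features, columns=["Embarked"], dtype=int)

X_toy = features.to_numpy()
print(features)
```

Each port becomes its own column (`Embarked_C`, `Embarked_Q`, `Embarked_S`), so the classifier does not read an artificial ordering into the port codes.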

Let's split the data into train and test sets.

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
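To illustrate how such a split is typically used, here is a minimal sketch on synthetic data. The data comes from `make_classification`, and logistic regression is just one possible model picked for the example, not the expected solution. All names ending in `_demo` or starting with `Xd_`/`yd_` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic binary classification problem (not the Titanic data)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=0
)

# Hold out 20% of the samples; random_state makes the split reproducible
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Fit on the training part only, then score on the unseen test part
clf = LogisticRegression()
clf.fit(Xd_train, yd_train)
test_accuracy = accuracy_score(yd_test, clf.predict(Xd_test))
print(f"Test accuracy: {test_accuracy:.2f}")

# 5-fold cross-validation gives a less split-dependent estimate
cv_scores = cross_val_score(LogisticRegression(), X_demo, y_demo, cv=5)
print(f"Cross-validation accuracy: {cv_scores.mean():.2f}")
```

The pattern is the same for any scikit-learn classifier: fit on the training set, predict on the test set, and compare the predictions with the true labels.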

That's all you need to perform the task. Good luck :)