# Who was on the Titanic?

In this notebook, we will use Python and a few of its libraries to take an in-depth look at the Titanic data set. In particular, we'll look for key patterns in the demographics, such as the survival rates of males vs females, the share of first vs third class passengers, and families with children.

Let's import the pandas library and read in the train.csv file from Kaggle.

In [5]:
# Import required packages
import pandas as pd
from pandas import Series, DataFrame

# Read in file
titanic_df = pd.read_csv('train.csv')

### A brief look at the training data

In [6]:
titanic_df.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


We have 12 columns in total. Since this is the training set, it includes the "Survived" column. We can use this to see who survived the sinking and who didn't.
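As a quick numeric check before plotting, the "Survived" column can simply be tallied. The sketch below rebuilds the five rows shown by `head()` by hand so that it runs without `train.csv`; `sample_df` and `survival_counts` are illustrative names, not part of the original notebook. In the notebook itself you would pass `titanic_df` instead.

```python
import pandas as pd

# The five rows displayed by head() above, rebuilt by hand (a subset of columns).
# This is only a stand-in so the example runs without train.csv.
sample_df = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5],
    "Survived": [0, 1, 1, 1, 0],
    "Pclass": [3, 1, 3, 1, 3],
    "Sex": ["male", "female", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
})


def survival_counts(df):
    """Count passengers per Survived value (0 = died, 1 = survived)."""
    return df["Survived"].value_counts().sort_index()


counts = survival_counts(sample_df)
print(counts)
```

Five rows are far too few to say anything about the ship as a whole; the point is only the shape of the call.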

### Who survived? Who didn't?

In [7]:
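# Note: in plotly 4 and later, this module moved to chart_studio.plotly
import plotly.plotly as py
# The cells below refer to go.Layout, go.Figure and go.Histogram
import plotly.graph_objs as go
import numpy as np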

In [8]:
import plotly 
plotly.tools.set_credentials_file(username='robinphetsa', api_key='hdfngeg4cd')

#### We'll first take a look at the age of our passengers. The two charts below offer us a glimpse at how age is distributed.

In [36]:
age_by_class = [{'y': data.values, 
                 'name': Pclass,
                 'type': 'box',
                 'boxpoints': 'all', 
                 'jitter': 0.3} for Pclass,data in list(titanic_df.groupby('Pclass')['Age'])]

layout1 = go.Layout(
    title = "Age distribution by Class",
    yaxis = dict(title = 'Age'),
    xaxis = dict(title = 'Class')
)

fig = go.Figure(data=age_by_class, layout=layout1)
py.iplot(fig)

Here we see how age is distributed across the three classes. Judging by the median line in each box, the typical age of 1st and 2nd class passengers looks slightly higher than that of 3rd class passengers. This would make sense: passengers in the upper classes tended to be older and were most likely traveling for business or pleasure. Third class passengers were poorer and were likely seeking a new beginning in the United States, so they were more likely to be young adults. The older or younger passengers in 3rd class are most likely relatives of such a passenger.
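The comparison read off the box plot can also be stated as numbers. A minimal sketch, assuming the same column names as `train.csv` and using only the five rows shown by `head()` as stand-in data (`sample_df` and `age_summary_by_class` are hypothetical names):

```python
import pandas as pd

# Pclass and Age for the five passengers shown by head(); a stand-in for titanic_df.
sample_df = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 3],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
})


def age_summary_by_class(df):
    """Return count, median and mean of Age for each passenger class.

    Missing ages (NaN) are skipped by pandas' aggregations.
    """
    return df.groupby("Pclass")["Age"].agg(["count", "median", "mean"])


summary = age_summary_by_class(sample_df)
print(summary)
```

Calling `age_summary_by_class(titanic_df)` in the notebook would give the medians that the box plot draws as the center line of each box.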

In [27]:
trace1 = go.Histogram(
    x=titanic_df[(titanic_df['Survived']==1)].Age,
    opacity=0.75,
    name = "Survived"
)
trace2 = go.Histogram(
    x=titanic_df[(titanic_df['Survived']==0)].Age,
    opacity=0.75,
    name = "Died"
)
data = [trace2, trace1]
layout = go.Layout(
    barmode='overlay',
    title = "Age Distribution of Passengers who Survived vs Died",
    xaxis = dict(title = 'Age'),
    yaxis = dict(title = 'Count')
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig)

Here we have the age distribution of passengers who survived overlaid on the age distribution of those who died. Right off the bat, we can make one observation about the factors that affected the probability of survival: the two distributions look nearly identical! Age on its own may therefore not be a strong predictor if we were to build a model to predict who survived.
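One way to test the "nearly identical" impression is to compute the survival rate within age bands rather than compare histogram shapes by eye. The sketch below is only an illustration: the cutoff at 30 and the band labels are arbitrary assumptions, `survival_rate_by_age_band` is a hypothetical helper, and the data is again the five rows from `head()`.

```python
import pandas as pd

# Survived and Age for the five passengers shown by head(); a stand-in for titanic_df.
sample_df = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
})


def survival_rate_by_age_band(df, bins=(0, 30, 100), labels=("30 or under", "over 30")):
    """Return the share of survivors in each age band.

    The default bins and labels are arbitrary choices for illustration.
    """
    known_age = df.dropna(subset=["Age"])  # ignore passengers with no recorded age
    bands = pd.cut(known_age["Age"], bins=list(bins), labels=list(labels))
    # The mean of a 0/1 column is the fraction of 1s, i.e. the survival rate
    return known_age.groupby(bands, observed=True)["Survived"].mean()


rates = survival_rate_by_age_band(sample_df)
print(rates)
```

On the full data set, finer bins (for example, separating children from adults) would show whether any age group departs from the overall pattern.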

### Let's take a look at gender

In [39]:
trace1 = go.Histogram(
    x=titanic_df[(titanic_df['Survived']==1)].Sex,
    opacity=0.75,
    name = "Survived"
)
trace2 = go.Histogram(
    x=titanic_df[(titanic_df['Survived']==0)].Sex,
    opacity=0.75,
    name = "Died"
)
data = [trace2, trace1]
layout = go.Layout(
    barmode='overlay',
    title = "Gender Count of Passengers Who Survived vs Died",
    xaxis = dict(title = 'Gender'),
    yaxis = dict(title = 'Count')
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig)

Here we can clearly see the disparity between the survival rates of males and females. For every female who died, nearly 6 males died. You can get a better look at the differences by playing around with the interactive chart above; for example, you can click on the legend in the upper right corner of the plot to hide or show the bars for "Died" and "Survived".
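The counts behind the chart can be tabulated directly with a cross-tabulation, and the survival rate per sex follows from a group mean. This is a sketch on the five rows from `head()`, not the notebook's own code; `sample_df`, `table` and `rate_by_sex` are illustrative names.

```python
import pandas as pd

# Sex and Survived for the five passengers shown by head(); a stand-in for titanic_df.
sample_df = pd.DataFrame({
    "Sex": ["male", "female", "female", "female", "male"],
    "Survived": [0, 1, 1, 1, 0],
})

# Rows are sexes, columns are Survived values (0 = died, 1 = survived)
table = pd.crosstab(sample_df["Sex"], sample_df["Survived"])

# Fraction of each sex that survived
rate_by_sex = sample_df.groupby("Sex")["Survived"].mean()

print(table)
print(rate_by_sex)
```

With only five passengers the rates come out as all-or-nothing, which is an artifact of the tiny sample; running the same two calls on `titanic_df` gives the figures the histogram displays.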

In [44]:
trace1 = go.Histogram(
    x=titanic_df[(titanic_df['Sex']=='male')].Pclass,
    opacity=0.75,
    name = "male"
)
trace2 = go.Histogram(
    x=titanic_df[(titanic_df['Sex']=='female')].Pclass,
    opacity=0.75,
    name = "female"
)
data = [trace2, trace1]
layout = go.Layout(
    barmode='group',
    title = "Gender Count of Passengers by Class",
    xaxis = dict(title = 'Class'),
    yaxis = dict(title = 'Count')
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig)