# Data Science 
## Titanic Worked Example 
Author: Andrew Szwec

<img src="./images/titanic.jpg" alt="Titanic" width=400 height=600>

<img src="./images/titanic_voyage.jpg" alt="Voyage" width=800>

## Hypothesis

**In the Titanic passenger dataset downloaded from https://github.com/generalassembly-studio/dat10syd/tree/master/data, containing data from 2/4/1912 to 14/4/1912, passengers in 1st class (Pclass==1) have a higher likelihood of survival than passengers in 3rd class (Pclass==3).**

<img src="./images/smart.png" alt="Smart" width=600>

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [None]:
df = pd.read_csv('titanic.csv')
df.head()

In [None]:
df[['Survived', 'Pclass', 'Sex', 'Age', 'SibSp','Parch', 'Fare', 'Embarked']]

## Count of Survived by passenger class

In [None]:
df.groupby(['Pclass', 'Survived']).count()
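
Calling `count()` on the grouped frame repeats the same tally in every column. A more compact view is `pd.crosstab`, which gives one row per class and one column per outcome. The sketch below uses a small made-up frame named `toy`; its rows are illustrative and not taken from `titanic.csv`.

```python
import pandas as pd

# Toy stand-in for the Titanic frame; these six rows are invented for illustration.
toy = pd.DataFrame({
    'Pclass':   [1, 1, 1, 3, 3, 3],
    'Survived': [1, 1, 0, 0, 0, 1],
})

# Rows are passenger classes, columns are the Survived flag (0 = perished, 1 = survived).
counts = pd.crosstab(toy['Pclass'], toy['Survived'])
print(counts)
```

The same call on the real frame would be `pd.crosstab(df['Pclass'], df['Survived'])`.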

In [None]:
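# Using a function
def survived_name(x):
    """Map the 0/1 Survived flag to a readable label."""
    if x == 0:
        return 'Perished'
    else:
        return 'Survived'

# Quick check: 0 maps to 'Perished'; any other value maps to 'Survived'
survived_name(0)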

In [None]:
# Using Dictionary
class_dict = {1: '1st', 2: '2nd', 3: '3rd'}
    
df['SurvivedDesc'] = df['Survived'].apply(survived_name)
df['PclassDesc'] = df['Pclass'].apply(class_dict.get)

df.head()
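
As a possible alternative to `apply`, `Series.map` accepts a dictionary directly. A minimal sketch on invented values (the series `pclass` and the unknown code `9` are assumptions made for illustration):

```python
import pandas as pd

# Invented values for illustration; 9 is a deliberately unknown class code.
pclass = pd.Series([1, 3, 2, 9])
class_dict = {1: '1st', 2: '2nd', 3: '3rd'}

# Series.map looks each value up in the dictionary; keys that are missing become NaN.
labels = pclass.map(class_dict)
print(labels)
```

Checking the result for NaN is a quick way to spot class codes that the dictionary does not cover.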

In [None]:
df.groupby(['PclassDesc', 'SurvivedDesc'])['PassengerId'].count() \
    .plot(kind='bar', stacked=True, title='Count Survived by Pclass',
          legend=True, fontsize=15, colormap='spring')

Raw counts alone do not show which class had the better chance of survival, because the classes differ in size. A class can have many survivors simply because it carried many passengers. What matters is the share of each class that survived, so how many 3rd class passengers were there compared with 1st class?
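
To make the difference between counts and rates concrete, here is a minimal sketch on invented rows (the frame `toy` is an assumption for illustration, not the Titanic data). Because `Survived` is a 0/1 flag, its sum is the number of survivors and its mean is the survival rate.

```python
import pandas as pd

# Invented rows: class 3 is larger than class 1, as a stand-in for unequal class sizes.
toy = pd.DataFrame({
    'Pclass':   [1, 1, 1, 1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
    'Survived': [1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
})

by_class = toy.groupby('Pclass')['Survived']
survivors = by_class.sum()        # raw count of survivors per class
rate = by_class.mean() * 100      # share of each class that survived, in percent

print(survivors)  # class 3 has more survivors in absolute terms
print(rate)       # class 1 has the higher survival rate
```

In this toy frame the larger class has more survivors but the lower rate, which is why the comparison has to be made on percentages.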

In [None]:
# Count passengers in each (class, outcome) group
grp_class_surv = df.groupby(['PclassDesc', 'SurvivedDesc']).agg({'PassengerId': 'count'})
print(grp_class_surv)

# Calculate percentage: divide each count by the total for its class.
# transform('sum') keeps the original index, so the division aligns row by row.
grp_perc = grp_class_surv / grp_class_surv.groupby(level=0).transform('sum') * 100
# Print table
print(grp_perc)

ax = grp_perc.plot(kind='bar', stacked=True, title='Percent Survived by Pclass', legend=True, fontsize=15, colormap='autumn')

# Label each bar with its rounded percentage
for p in ax.patches:
    ax.annotate(str(np.round(p.get_height(),0))+'%', (p.get_x() * 1.005, p.get_height() * 1.005))

In [None]:
grp_perc.index

## Let's look at Survived only

In [None]:
# Count passengers in each (class, outcome) group
grp_class_surv = df.groupby(['PclassDesc', 'SurvivedDesc']).agg({'PassengerId': 'count'})
# Calculate percentage: divide each count by the total for its class
grp_perc = grp_class_surv / grp_class_surv.groupby(level=0).transform('sum') * 100


ax = grp_perc.plot(kind='bar', stacked=True, title='Percent Survived by Pclass', legend=True, fontsize=15, colormap='autumn')

for p in ax.patches:
    ax.annotate(str(np.round(p.get_height(),0))+'%', (p.get_x() * 1.005, p.get_height() * 1.005))

In [None]:
# Keep only the 'Survived' rows; the (PclassDesc, SurvivedDesc) index is preserved
a = grp_perc[grp_perc.index.get_level_values('SurvivedDesc') == 'Survived']

ax = a.plot(kind='bar', stacked=True, title='Percent Survived by Pclass', legend=True, fontsize=15, colormap='autumn')

for p in ax.patches:
    ax.annotate(str(np.round(p.get_height(),0))+'%', (p.get_x() * 1.005, p.get_height() * 1.005))

## Conclusion
After analysing the Titanic dataset, the data support the hypothesis. The figure above shows that 63% of 1st class passengers survived, compared with 24% of 3rd class passengers.
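
A gap between two percentages does not by itself say whether the difference could have arisen by chance. One common way to check, which this notebook does not perform, is a two-proportion z-test. The sketch below is a minimal version using only the standard library; the function name `two_proportion_z` and the counts passed to it are hypothetical and would need to be replaced by the counts in `grp_class_surv`.

```python
import math

def two_proportion_z(success_a, n_a, success_b, n_b):
    """Return (z, two-sided p-value) for H0: both groups share one survival rate."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    # Pooled rate under the null hypothesis of no difference between groups
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts for illustration only, not taken from the dataset.
z, p_value = two_proportion_z(success_a=60, n_a=100, success_b=25, n_b=100)
print(f"z = {z:.2f}, p = {p_value:.2g}")
```

A small p-value would indicate that a gap of this size is unlikely if both classes really had the same survival rate.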


  


## Let's explore this dataset further

In [None]:
# count() tallies every passenger with a non-null Survived value, not only the survivors
grp = df.groupby('Sex')['Survived'].count().reset_index()
grp

## Pandas Plotting Cookbook
### Charts
https://pandas.pydata.org/pandas-docs/stable/visualization.html
### Plotting Options
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.plot.html
### Colour Maps
https://matplotlib.org/users/colormaps.html

In [None]:
grp.set_index('Sex').plot.bar(title='Count of Survived', legend=True, fontsize=15, colormap='summer')
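
Note that `count()` on the `Survived` column counts rows, so it gives the number of passengers in each group rather than the number who survived. A minimal sketch of the distinction, on invented rows (the frame `toy` and the output column names are assumptions for illustration):

```python
import pandas as pd

# Invented rows for illustration, not taken from titanic.csv.
toy = pd.DataFrame({
    'Sex':      ['female', 'female', 'female', 'male', 'male', 'male', 'male'],
    'Survived': [1, 1, 0, 0, 0, 1, 0],
})

summary = toy.groupby('Sex')['Survived'].agg(
    passengers='count',    # non-null rows, survivors or not
    survivors='sum',       # rows where Survived == 1
    survival_rate='mean',  # fraction of the group that survived
)
print(summary)
```

Plotting `survivors` or `survival_rate` instead of `passengers` would show survival rather than group size.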

In [None]:
grp = df.groupby(['Sex', 'Pclass']).count()['Survived'].reset_index()
grp

In [None]:
df['Fare'].plot.hist(alpha=0.5, title='Distribution of Fare', legend=True, fontsize=15, colormap='winter')

In [None]:
df['Age'].plot.box( title='Distribution of Age', legend=True, fontsize=15, colormap='viridis')
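
The numbers behind a box plot can also be computed directly. The sketch below uses invented ages (the series `age` is an assumption for illustration) and the conventional 1.5 × IQR rule, which is matplotlib's default for deciding which points are drawn as outliers.

```python
import numpy as np
import pandas as pd

# Invented ages for illustration; NaN stands in for a passenger with no recorded age.
age = pd.Series([2, 18, 22, 30, 30, 41, 80, np.nan], name='Age')

# quantile() skips NaN values
q1, median, q3 = age.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Points outside these limits are drawn as individual outliers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = age[(age < lower) | (age > upper)]

print(f"median={median}, IQR={iqr}, missing={age.isna().sum()}")
print(outliers)
```

Counting the missing values alongside the quartiles is useful here, since the plot silently leaves them out.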



In [None]:
# Select Age before averaging so that non-numeric columns are never passed to mean()
df.groupby('Sex')['Age'].mean().plot(kind='bar', title='Mean of Age by Gender', legend=True, fontsize=15, colormap='viridis')


In [None]:
# Select Age before averaging so that non-numeric columns are never passed to mean()
df.groupby('Pclass')['Age'].mean().plot(kind='bar', title='Mean Age by Passenger Class', legend=True, fontsize=15, colormap='summer')


In [None]:
df.groupby('SibSp').count()['Survived'].plot(kind='bar', title='Count Survived by Number Siblings/Spouse', legend=True, fontsize=15, colormap='inferno')

## Subset to the useful columns

In [None]:
dd = df[['Survived', 'SibSp', 'Pclass', 'Sex', 'Age', 'Parch', 'Fare', 'Embarked']]
dd.head()

## Correlation of key variables and Target

In [None]:
sns.set(style="white")

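# Compute the correlation matrix over the numeric columns only.
# Sex and Embarked are strings; numeric_only requires pandas >= 1.5.
corr = dd.corr(numeric_only=True)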

corr
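
String columns such as `Sex` cannot take part in a Pearson correlation as they are. One option, sketched here on invented rows, is to encode them as 0/1 first. The frame `toy` and the column name `IsFemale` are assumptions for illustration.

```python
import pandas as pd

# Invented rows for illustration, not taken from titanic.csv.
toy = pd.DataFrame({
    'Survived': [1, 1, 0, 0, 1, 1],
    'Sex':      ['female', 'female', 'male', 'male', 'female', 'male'],
    'Fare':     [80.0, 30.0, 8.0, 7.5, 26.0, 13.0],
})

# Encode Sex as 0/1 so that it can take part in the Pearson correlation.
encoded = toy.assign(IsFemale=(toy['Sex'] == 'female').astype(int)).drop(columns='Sex')
corr_toy = encoded.corr()
print(corr_toy.round(2))
```

The same encoding applied to `dd` would add a `Sex`-derived row and column to the correlation matrix.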

In [None]:
# Generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=bool)  # np.bool was removed in NumPy 1.24; use the builtin bool
mask[np.triu_indices_from(mask)] = True

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

## Export the dataset for later use

In [None]:
# Write the column subset to CSV; the row index is written as the first column
dd.to_csv('titanic_preprocessed.csv')
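
Because `to_csv` writes the row index by default, the file gains an unnamed first column. A minimal sketch of reading such a file back, using an in-memory buffer and invented rows instead of the real file:

```python
import io
import pandas as pd

# Invented rows for illustration.
toy = pd.DataFrame({'Survived': [1, 0], 'Pclass': [1, 3]})

# Default to_csv writes the row index as an unnamed first column.
buffer = io.StringIO()
toy.to_csv(buffer)

buffer.seek(0)
naive = pd.read_csv(buffer)                   # index comes back as 'Unnamed: 0'
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)   # first column is used as the index again

print(naive.columns.tolist())
print(restored.columns.tolist())
```

Passing `index_col=0` when loading `titanic_preprocessed.csv` later would avoid the extra column; writing with `index=False` is another option.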