# Project: The 1994 Sinking of the MS Estonia

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
    <ul>
        <li><a href="#crewcorrect">Crew List Correction</a></li>
        <li><a href="#datacorrect">Data Type Correction</a></li>
    </ul>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

In the early morning of September 28, 1994, shortly after midnight, the seafaring vessel <i>MS Estonia</i> sank into the Baltic Sea amidst poor weather conditions. Passengers scrambled to the deck for their only chance at survival, but ultimately many met their fate trapped in the cabins as the ship capsized. Out of the 989 people onboard the ship (of whom 803 were listed as passengers), there were 852 deaths and 137 survivors.

In this analysis, I will explore how surviving the disaster (the dependent variable) relates to a person's age, gender, and whether they were a crewmember or passenger (the independent variables).

Was age a factor in survival? Did a greater proportion of crewmembers survive the disaster than passengers? These are the questions I will attempt to answer with my analysis. This dataset was provided by <a href="https://www.kaggle.com/christianlillelund/passenger-list-for-the-estonia-ferry-disaster">Christian Lillelund at Kaggle</a>.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from seaborn import axes_style
from IPython.display import display, HTML

%matplotlib inline

<a id='wrangling'></a>
## Data Wrangling

### General Properties

To begin the process, I will print the first few lines of the dataframe and check the datatypes for each of the columns. 

In [None]:
# Read CSV file into dataframe
passenger_df = pd.read_csv('../input/passenger-list-for-the-estonia-ferry-disaster/estonia-passenger-list.csv') 

# Display first few lines of dataframe
display(passenger_df.head())

# Display information and data types of dataframe
passenger_df.info()

Note that the PassengerId, Age, and Survived columns are showing as data type 'int64', while the rest of the columns are of data type 'object'.

PassengerId should be changed to an 'object' type since we are not performing mathematical operations on the column. It seems the PassengerId column is intended to be the row index so I will change it accordingly.

Survived would be better represented as a 'bool' type as we have only two possible values for the column: True (1) or False (0).

The passengers are differentiated by the Category column with the possible values 'C' (Crewmember) or 'P' (Passenger).

In [None]:
print(passenger_df.groupby('Category').count())

We can see that there are 193 crewmembers and 796 passengers on the list. Hold on, didn't I say there were 803 people categorized as passengers in the introduction?

As it turns out, sources conflict on the number of crewmembers onboard the ship when it sank. Some crewmembers onboard were either off-duty or simply not accounted for, so they were not recorded in the crew list.

The Joint Accident Investigation Commission of Estonia, Finland, and Sweden released a report in December 1997 stating that there were 186 crewmembers onboard. <a href="https://web.archive.org/web/20040626020651/http://www.onnettomuustutkinta.fi/estonia/chapt04.html#2">Link</a>

Estline, the shipping company that owned the MS Estonia at the time of the sinking, made available the official crew list of the MS Estonia. The list recorded the names of 187 crewmembers and their respective job titles. <a href="https://estline.ee/en/munsterroll-crewlist">Link</a>

Besides the official crew list by Estline, I could not locate many other lists detailing the individuals who were part of the crew at the time of the sinking. I stumbled upon a Blogger blog <i>Estonia disaster</i> by user "passante", which had a crew list that seemed consistent with the official Estline crew list. It also had the job titles listed in English and gave additional context on whether the crewmember was actually off-duty at the time. This list states there were 189 crewmembers onboard. <a href="http://estoniadisaster1994.blogspot.com/2017/02/estonia-crew-list.html">Link</a>

I will have to compare the dataset with the Estline provided crew list and perhaps crosscheck it with the Blogger list as well, then make adjustments to the categorization of the passengers accordingly. I will also add a Fullname column containing the combined first and last names of the passengers to make comparison easier.

<a id='crewcorrect'></a>
### Crew List Correction

In order to begin comparing the dataset with the Estline crew list, I started a spreadsheet in Google Sheets with the column names Lastname, Firstname, and Fullname. I then filled the spreadsheet with the names provided in the Estline crew list. The spreadsheet was then exported as a CSV file 'munsterroll-crewlist.csv' and imported into Jupyter as 'crew_df'. 

In [None]:
# Read CSV file into dataframe
crew_df = pd.read_csv('../input/estonia-crew-list/munsterroll-crewlist.csv')

# Print first few lines of dataframe
crew_df.head()

Next, I want to add the column Fullname to 'passenger_df' in order to make the comparison and merge with 'crew_df' easier.

In [None]:
# Concatenate the Firstname and Lastname columns
passenger_df['Fullname'] = passenger_df['Firstname'] + ' ' + passenger_df['Lastname']

# Print first few lines of dataframe
passenger_df.head()
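
Because the later merge relies on exact string equality of Fullname, stray whitespace or differences in letter case would show up as false mismatches. The following is a minimal sketch of normalizing names before concatenating them. The helper `build_fullname` and the toy rows are assumptions for illustration and are not part of the original notebook or dataset.

```python
import pandas as pd

def build_fullname(df):
    """Return a normalized 'FIRST LAST' Series (hypothetical helper)."""
    first = df['Firstname'].str.strip().str.upper()
    last = df['Lastname'].str.strip().str.upper()
    # Collapse runs of internal whitespace to a single space
    return (first + ' ' + last).str.replace(r'\s+', ' ', regex=True)

# Toy rows for illustration only
toy_df = pd.DataFrame({
    'Firstname': [' Anna ', 'LARS  MAGNUS'],
    'Lastname': ['Example', 'andersson '],
})
toy_df['Fullname'] = build_fullname(toy_df)
print(toy_df['Fullname'].tolist())
```

Normalization of this kind only removes formatting noise; genuinely different spellings still need manual review.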

Next, I need to filter the dataset to display only people categorized as 'C' for 'Crewmember'. I placed the filtered results in a new dataframe called 'passenger_df_crew'.

I also defined a new function to filter out dataframe rows by category and return the filtered dataframe.

In [None]:
# crewmembersonly is a temporary variable to filter the results for the passenger_df_crew dataframe
crewmembersonly = passenger_df['Category'] == 'C'
passenger_df_crew = passenger_df[crewmembersonly]

# Print first few lines of dataframe
passenger_df_crew.head()

In [None]:
# Filter rows in dataframe by the Category column, df is dataframe and category is the string value for the column
def filter_by_category(df, category):
    filterrows = df['Category'] == category
    filtered = df[filterrows]
    
    return filtered

I then outer merge the 'passenger_df_crew' dataframe with the 'crew_df' dataframe on Fullname and place the result in the new dataframe 'crewoutermerge'. I want to display all rows without truncation, so I set the display option for max rows to None (no limit).

In [None]:
# Set to display all rows without truncating
pd.set_option('display.max_rows', None)

# Merge the two dataframes on the Fullname column and display all rows even if they are not shared
crewoutermerge = passenger_df_crew.merge(crew_df, on='Fullname', how='outer')

# Display entire merged dataframe
crewoutermerge

Scrolling towards the bottom of the dataframe we can see a lot of NaN cells. This is because the CSV file I created from the Estline crew list did not contain columns for PassengerId, Country, Sex, Age, Category, and Survived. For easier comparison, I will go ahead and drop the Country, Sex, Age, Category, and Survived columns for 'crewoutermerge'. I am choosing to keep the PassengerId column so I can identify the rows needing to be edited as needed in the original dataset dataframe.

In [None]:
# Drop the unnecessary columns and reassign the result to crewoutermerge
crewoutermerge = crewoutermerge.drop(columns=['Country','Sex','Age','Category','Survived'])

# Print the first few lines of the dataframe
crewoutermerge.head()

Now I want to filter all of the rows that contain NaN for any of the columns. NaN values in the 'Firstname_y' and 'Lastname_y' columns tell me that these are values not found in the CSV file I created but are present in the original dataset. NaN values in the 'Firstname_x' and 'Lastname_x' columns tell me that these are values not found in the original dataset but are present in my created CSV file.

In [None]:
# Filter for any rows displaying a NaN value in any column
crewoutermerge = crewoutermerge[crewoutermerge.isna().any(axis=1)]

crewoutermerge

It turns out a lot of the exclusive rows between the two dataframes are due to different spellings of names in the original dataset. You can see this with 'PETER TYYR' in the original dataset and 'PEETER TUUR' in my created CSV file. Another example is 'LARS MAGNUS ANDERSSON' in the original dataset and 'MAGNUS ANDERSSON' in my created CSV file.
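
Spelling variants like these can also be surfaced automatically before correcting them by hand. As a rough sketch, the standard library's `difflib.get_close_matches` can suggest the closest dataset name for each unmatched crew-list name. The helper `suggest_matches`, the 0.6 cutoff, and the short candidate list are assumptions for illustration, and any suggestion would still need to be checked manually.

```python
from difflib import get_close_matches

# Names quoted in the text above; the candidate list is a toy subset
dataset_names = ['PETER TYYR', 'LARS MAGNUS ANDERSSON', 'ADAM HALTER']
crew_list_names = ['PEETER TUUR', 'MAGNUS ANDERSSON']

def suggest_matches(names, candidates, cutoff=0.6):
    """Map each name to its closest candidate, or None if nothing clears the cutoff."""
    suggestions = {}
    for name in names:
        matches = get_close_matches(name, candidates, n=1, cutoff=cutoff)
        suggestions[name] = matches[0] if matches else None
    return suggestions

suggestions = suggest_matches(crew_list_names, dataset_names)
for crew_name, dataset_name in suggestions.items():
    print(crew_name, '->', dataset_name)
```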

Before I make the corrections, I'm going to go ahead and store my above steps for merging and dropping columns in a function since I will be using the same procedure a couple more times to make comparisons between the dataframes.

In [None]:
def merge_and_clean(df_left, df_right):
    merged_df = df_left.merge(df_right, on='Fullname', how='outer')
    merged_df = merged_df.drop(columns=['Country','Sex','Age','Category','Survived'])
    merged_df = merged_df[merged_df.isna().any(axis=1)]
    return merged_df
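
The function above identifies one-sided rows by looking for NaN values. An alternative sketch, shown here on toy frames, uses the `indicator=True` option of `DataFrame.merge`, which adds a `_merge` column stating whether each row came from the left frame only, the right frame only, or both. The helper name `unmatched_rows` and the toy names are assumptions for illustration.

```python
import pandas as pd

# Toy frames for illustration only
left = pd.DataFrame({'Fullname': ['A ONE', 'B TWO', 'C THREE']})
right = pd.DataFrame({'Fullname': ['B TWO', 'C THREE', 'D FOUR']})

def unmatched_rows(df_left, df_right, key='Fullname'):
    """Outer-merge on key and keep rows present in only one frame (hypothetical helper)."""
    merged = df_left.merge(df_right, on=key, how='outer', indicator=True)
    return merged[merged['_merge'] != 'both']

result = unmatched_rows(left, right)
print(result)
```

This avoids depending on which columns happen to contain NaN, which matters if a matched row has a genuinely missing value.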

I corrected the name spellings in my CSV file to match the original dataset's spellings and saved it as a new CSV file, 'munsterroll-crewlist-corrected.csv', which I stored into the dataframe 'correctedcrew_df'. Now we can see the original dataset's spelling of 'PETER TYYR' reflected in my corrected CSV file.

In [None]:
# Read CSV file into dataframe
correctedcrew_df = pd.read_csv('../input/estonia-crew-list/munsterroll-crewlist-corrected.csv')

# Example of different spelling
correctedcrew_df.loc[correctedcrew_df['Fullname'] == 'PETER TYYR']

Along with correcting the spellings in the CSV file, I went ahead and made the following changes:
<ul>
<li>Deleted ANNE ROOSIPOLD and LEA OTS from the CSV file. They are not even present in the original dataset, which leads me to believe that while they were recorded in the crew list, they were not actually present onboard the ship when it sank.</li>
<li>Added JURI AAVIK, SIRJE KANTER, and SUSANNE PUNDI to the CSV file. They are listed as crewmembers in the original dataset but were not present in the Estline crew list. I checked the <i>Estonia disaster</i> blog and they were either off-duty or trainees, and were not recorded in the Estline crew list. Going forward, I am including the off-duty crewmembers as part of the total crew count.</li>
</ul>

So after correcting the spellings to match the original dataset, who's left? I will use my function I created earlier to merge the 'passenger_df_crew' dataframe with the corrected CSV file.

In [None]:
# Invoke the merge_and_clean function and store the results in the crewoutermerge dataframe
crewoutermerge = merge_and_clean(passenger_df_crew, correctedcrew_df)

display(crewoutermerge)

I can clearly identify a typo: the first name for PassengerId 712 is listed as 'TAAV!' when it should be 'TAAVI' as listed in my CSV file.

When I attempt to locate the passenger using index 712, I am given PassengerId 713 instead.

In [None]:
passenger_df.loc[712]

As expected, when I use index 711, I am given PassengerId 712.

In [None]:
passenger_df.loc[711]

I need to correct this issue by setting the index to PassengerId.

In [None]:
passenger_df = passenger_df.set_index('PassengerId')

Now when I search by index 712, I am given PassengerId 712.

In [None]:
passenger_df.loc[712]
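
The off-by-one behaviour comes from `.loc` looking up index labels rather than column values: with the default index, labels start at 0 while the IDs start one higher. A small sketch on a toy frame (the rows are made up for illustration) shows the difference between label-based `.loc` and position-based `.iloc` before and after `set_index`.

```python
import pandas as pd

# Toy frame with 1-based IDs, mirroring the one-row offset seen above
toy_df = pd.DataFrame({'PassengerId': [1, 2, 3], 'Firstname': ['A', 'B', 'C']})

# Default index labels are 0, 1, 2, so label 2 is the row with PassengerId 3
print(toy_df.loc[2, 'PassengerId'])

indexed_df = toy_df.set_index('PassengerId')
# After set_index, .loc looks up by PassengerId label
print(indexed_df.loc[2, 'Firstname'])
# .iloc is still purely positional: position 2 is the third row
print(indexed_df.iloc[2]['Firstname'])
```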

I will need to correct the typo for both the Firstname and Fullname columns.

In [None]:
# Change values using index 712 for both Firstname and Fullname
passenger_df.at[712, 'Firstname'] = 'TAAVI'
passenger_df.at[712, 'Fullname'] = 'TAAVI RABA'
passenger_df.loc[712]

With the typo corrected, I checked the rest of the names that were exclusive to the original dataset's crew list. None of these names appeared in the Estline crew list or the Blogger list, but a Google search for ANNELI METSALLIK turned up this <a href="https://www.delfi.ee/archive/delfi-fotod-ja-video-estoniaga-kaks-tutart-kaotanud-isa-tiit-metsallik-lehvitasin-kodust-lahkuvatele-lastele-koogiaknast?id=69820105">article</a>. The article is in Estonian, but translated to English it states that ANNELI was actually an off-duty crewmember, so I will leave her status as 'C' in accordance with my rule of keeping off-duty crewmembers in the crew count. I will go ahead and change the rest of them to Category 'P' using their PassengerId values.

In [None]:
# Change the Category value at the provided PassengerId index
passenger_df.at[474, 'Category'] = 'P'
passenger_df.at[485, 'Category'] = 'P'
passenger_df.at[536, 'Category'] = 'P'
passenger_df.at[537, 'Category'] = 'P'
passenger_df.at[683, 'Category'] = 'P'
passenger_df.at[712, 'Category'] = 'P'
passenger_df.at[728, 'Category'] = 'P'
passenger_df.at[754, 'Category'] = 'P'
passenger_df.at[760, 'Category'] = 'P'
passenger_df.at[860, 'Category'] = 'P'
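
Repeated single-cell assignments like the ones above can also be written as one label-based assignment with `.loc` and a list of index labels. The sketch below uses a toy frame; the IDs and the list name `ids_to_passenger` are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Toy frame indexed by ID; IDs and categories are illustrative only
toy_df = pd.DataFrame(
    {'Category': ['C', 'C', 'C', 'P']},
    index=pd.Index([10, 11, 12, 13], name='PassengerId'),
)

ids_to_passenger = [10, 12]  # hypothetical IDs to recategorize
# Assign the same value to every listed row in a single statement
toy_df.loc[ids_to_passenger, 'Category'] = 'P'
print(toy_df['Category'].tolist())
```

Keeping the IDs in one list also makes it easier to review exactly which rows were changed.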

With that determined, I reran the code to initialize the 'passenger_df_crew' dataframe to reflect the changes. I then merged the 'passenger_df_crew' and 'correctedcrew_df' using the merge_and_clean function to display the remaining names that were exclusive to my CSV file.

In [None]:
passenger_df_crew = filter_by_category(passenger_df, 'C')

crewoutermerge = merge_and_clean(passenger_df_crew, correctedcrew_df)

display(crewoutermerge)

After crosschecking with the Blogger list for more context, I determined these people were indeed crewmembers.

Before I correct their Category from 'P' to 'C', I need to see if they were in the original dataset first by merging the 'passenger_df' and 'crewoutermerge' dataframes.

In [None]:
passenger_df.merge(crewoutermerge, on='Fullname')

Since they are indeed in the original dataset, I will then locate each of their PassengerId values.

In [None]:
display(passenger_df.loc[passenger_df['Fullname'] == 'ADAM HALTER'])
display(passenger_df.loc[passenger_df['Fullname'] == 'MAIGA JARVI'])
display(passenger_df.loc[passenger_df['Fullname'] == 'KRISTA KOOP'])
display(passenger_df.loc[passenger_df['Fullname'] == 'AULIS LEE'])
display(passenger_df.loc[passenger_df['Fullname'] == 'HELIN PAEORG'])
display(passenger_df.loc[passenger_df['Fullname'] == 'TAAVI RABA'])

Now I can make the changes to their Category from 'P' to 'C' using their PassengerId values.

In [None]:
# Change the value for Category for the rows fetched using index
passenger_df.at[230, 'Category'] = 'C'
passenger_df.at[324, 'Category'] = 'C'
passenger_df.at[443, 'Category'] = 'C'
passenger_df.at[470, 'Category'] = 'C'
passenger_df.at[646, 'Category'] = 'C'
passenger_df.at[712, 'Category'] = 'C'

In [None]:
# Display changed Category values of passengers
display(passenger_df.loc[passenger_df['Fullname'] == 'ADAM HALTER'])
display(passenger_df.loc[passenger_df['Fullname'] == 'MAIGA JARVI'])
display(passenger_df.loc[passenger_df['Fullname'] == 'KRISTA KOOP'])
display(passenger_df.loc[passenger_df['Fullname'] == 'AULIS LEE'])
display(passenger_df.loc[passenger_df['Fullname'] == 'HELIN PAEORG'])
display(passenger_df.loc[passenger_df['Fullname'] == 'TAAVI RABA'])

With those crew list changes made, what does the count look like now?

In [None]:
passenger_df.groupby('Category').count()

The crew count is now 189, which matches the <i>Estonia disaster</i> blog's total of 189 including off-duty crewmembers, so this is the desired result.
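
Since `groupby('Category').count()` repeats the count for every column, a more compact way to read the totals is `value_counts()` on the Category column alone. This is a sketch on toy data, not the notebook's actual output.

```python
import pandas as pd

# Toy data for illustration only
toy_df = pd.DataFrame({'Category': ['P', 'P', 'P', 'C']})

# One count per category, and the same split expressed as proportions
counts = toy_df['Category'].value_counts()
shares = toy_df['Category'].value_counts(normalize=True).round(2)
print(counts)
print(shares)
```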

Now that I am done with the crew list corrections, I will go ahead and drop the Fullname column since I was mainly using it for merging.

In [None]:
passenger_df = passenger_df.drop(columns='Fullname')
passenger_df.head()

<a id='datacorrect'></a>
### Data Type Correction

I will go ahead and display the dataframe's data types again so I can see which changes I need to make.

In [None]:
passenger_df.info()

Since I used PassengerId as the row index, it is no longer listed as one of the column names. The only change I will have to make is the Survived data type from 'int64' to 'bool'.

In [None]:
# Cast the Survived column as bool data type
passenger_df['Survived'] = passenger_df['Survived'].astype(bool)
passenger_df.info()

With the data type changed to 'bool', it now displays True and False for values 1 and 0 respectively.

In [None]:
passenger_df.tail()
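
Casting with `astype(bool)` turns every nonzero value into True, so a stray value such as 2 would silently become True. One defensive option, sketched here with a hypothetical helper `to_bool_flag`, is to confirm that the column contains only 0 and 1 before casting.

```python
import pandas as pd

def to_bool_flag(series):
    """Cast a 0/1 integer Series to bool, refusing any other values (hypothetical helper)."""
    unexpected = set(series.unique()) - {0, 1}
    if unexpected:
        raise ValueError(f'Unexpected values in flag column: {sorted(unexpected)}')
    return series.astype(bool)

# Toy flags for illustration only
flags = to_bool_flag(pd.Series([0, 1, 1, 0], name='Survived'))
print(flags.tolist())
```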

<a id='eda'></a>
## Exploratory Data Analysis



### Was age a factor in survival?

First I will create a histogram of the overall age distribution, along with some measures of central tendency. I defined a function to keep my histogram style consistent; its parameters set the lower and upper limits of the x and y-axis.

In [None]:
plt.rcParams["patch.force_edgecolor"] = True # Adds dark edges to plotted charts

# The data parameter is the dataframe column; xlim1, xlim2 and ylim1, ylim2 are the lower and upper limits for the x and y-axis.
def styled_histogram(data, xlim1, xlim2, ylim1, ylim2):
    with axes_style({'axes.grid': True}):
        sns.distplot(data, kde=False).set(xlim=(xlim1, xlim2), ylim=(ylim1, ylim2))

styled_histogram(passenger_df['Age'], 0, 87, 0, 130)

# Define median and mode since they are not part of the describe() function
median = passenger_df['Age'].median()
mode = passenger_df['Age'].mode()

# Count the number of rows attached to each modal number
modecount = passenger_df[passenger_df['Age'] == mode.iloc[0]].shape[0]

print(passenger_df['Age'].describe()) # Print descriptive statistics including count, mean, std deviation, min/max, percentiles
print('median', median)
print('mode', mode.iloc[0], ',', mode.iloc[1], '(', modecount, 'each )')

The histogram shows an approximately normal distribution of ages, with a mean of about 44.58 and a median of 44, meaning roughly half of the people onboard were older than 44. The youngest person was less than a year old, while the oldest was 87 years old. There are two modal ages, 21 and 45, with 27 people each.

Additionally, I defined a function mode_counter, which reuses the steps above to return the number of rows that share the first modal value.

In [None]:
# Returns the number of rows sharing the modal number
# df is the dataframe, mode is the series returned by the mode() function
def mode_counter(df, mode):
    modecount = df[df['Age'] == mode.iloc[0]].shape[0]
    return modecount
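
The print statements in this section index the modes by position, which assumes a known number of tied modal values. A sketch that handles any number of ties derives both the modal values and their shared count from `value_counts()`. The helper `modes_with_count` and the toy ages are assumptions for illustration.

```python
import pandas as pd

def modes_with_count(series):
    """Return (sorted list of modal values, count per mode) for any number of ties."""
    counts = series.value_counts()
    top = counts.max()
    modes = sorted(counts[counts == top].index.tolist())
    return modes, int(top)

# Toy ages for illustration only
toy_ages = pd.Series([21, 21, 45, 45, 30])
modes, count = modes_with_count(toy_ages)
print('mode', modes, '(', count, 'each )')
```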

Next, I will create a histogram based on the distribution of ages amongst survivors.

In [None]:
styled_histogram(passenger_df[passenger_df['Survived']==True].Age, 0, 87, 0, 120)

# Define median and mode since they are not part of the describe() function
median = passenger_df[passenger_df['Survived']==True].Age.median()
mode = passenger_df[passenger_df['Survived']==True].Age.mode()

# First filtering modecount rows that only meet the criteria of 'Survived' == True, then
# plug the result into the mode_counter function
modecount = passenger_df[passenger_df['Survived']==True]
modecount = mode_counter(modecount, mode)

print(passenger_df[passenger_df['Survived']==True].describe())
print('median', median)
print('mode', mode.iloc[0], ',', mode.iloc[1], ',', mode.iloc[2], '(', modecount,'each )')

Among survivors, the mean age is about 34.01 years and the median is 32 years. The modal ages are 21, 23, and 43, with 8 survivors each. The youngest survivor was 12 years old, while the oldest was 67 years old.

In [None]:
styled_histogram(passenger_df[passenger_df['Survived']==False].Age, 0, 87, 0, 120)

# Define median and mode since they are not part of the describe() function.
median = passenger_df[passenger_df['Survived']==False].Age.median()
mode = passenger_df[passenger_df['Survived']==False].Age.mode()

# First filtering modecount rows that only meet the criteria of 'Survived' == False, then
# plug the result into the mode_counter function
modecount = passenger_df[passenger_df['Survived']==False]
modecount = mode_counter(modecount, mode)

print(passenger_df[passenger_df['Survived']==False].describe())
print('median', median)
print('mode', mode.iloc[0], ',', mode.iloc[1], '(', modecount, 'each )')

Among those who died, the mean age is about 46.27 years and the median is 46.5 years. The modal ages are 49 and 67, with 24 victims each. The youngest victim was under a year old (recorded as age 0), while the oldest was 87 years old.

Here are the above two histograms overlaid together, showing the difference in the distribution of death and survival by age.

In [None]:
# Disable edge coloring for overlay cleanliness
plt.rcParams["patch.force_edgecolor"] = False

with axes_style({'axes.grid': True}):
    ax = sns.distplot(passenger_df[passenger_df['Survived']==False].Age, kde=False)
    ax = sns.distplot(passenger_df[passenger_df['Survived']==True].Age, kde=False)
    plt.xlim([0,87])
    plt.ylim([0,120])
    ax.legend(['Died','Survived'])

Lastly, I created a logistic regression plot, whose fitted curve shows a downward trend in survival as age increases.

In [None]:
# Pass x and y as keywords; newer seaborn releases no longer accept them positionally
sns.lmplot(x='Age', y='Survived', data=passenger_df, logistic=True)
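
A complementary way to check the downward trend numerically is to group ages into bands and compute the share of survivors in each band. The sketch below uses made-up rows and hypothetical band edges, so the printed rates say nothing about the real passenger list.

```python
import pandas as pd

# Toy data for illustration only; not the real passenger list
toy_df = pd.DataFrame({
    'Age': [5, 25, 28, 35, 52, 58, 70, 81],
    'Survived': [False, True, False, True, False, False, False, False],
})

# Hypothetical band edges; right=False makes each band include its lower edge
bins = [0, 20, 40, 60, 100]
labels = ['0-19', '20-39', '40-59', '60+']
toy_df['AgeBand'] = pd.cut(toy_df['Age'], bins=bins, labels=labels, right=False)

# The mean of a boolean column is the share of True values
rate_by_band = toy_df.groupby('AgeBand', observed=False)['Survived'].mean()
print(rate_by_band)
```

Band widths are a judgment call: wider bands give steadier rates, while narrower bands show more detail but with fewer people in each.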

### Did a greater proportion of crewmembers survive than passengers?

To start answering this question, let's look at the distribution of crewmembers and passengers onboard the ship. I created a function row_count to assist with counting rows filtered by column value, and another function find_prop to find the rounded proportion of one value out of the total.

In [None]:
plt.rcParams["patch.force_edgecolor"] = True # Adds dark edges to plotted charts

# Create a countplot with x-axis labels Passenger and Crewmember
with axes_style({'axes.grid': True}):
    # Fix the bar order explicitly so the tick labels always match the bars
    sns.countplot(data=passenger_df, x='Category', order=['P', 'C']).set_xticklabels(['Passenger','Crewmember'])

# df is the dataframe, column is the desired column to filter by and value is the desired column value to filter by
def row_count(df, column, value):
    return df[df[column] == value].shape[0]

# count1 is the value to calculate the proportion with, count2 just makes up the other half of the total
def find_prop(count1, count2):
    return np.around(count1 / (count1 + count2), 2)
    
pcount = row_count(passenger_df, 'Category', 'P') # Count the number of rows with Category 'P'
ccount =  row_count(passenger_df, 'Category', 'C') # Count the number of rows with Category 'C'

pprop = find_prop(pcount, ccount) # Calculate the proportion of rows with Category 'P'
cprop = find_prop(ccount, pcount) # Calculate the proportion of rows with Category 'C'

print(pcount, 'passengers (', pprop, ')')
print(ccount, 'crewmembers (', cprop, ')')

Out of the 989 people on the ship there were 800 passengers and 189 crewmembers, or 81% passengers and 19% crewmembers. I then broke this distribution down further by gender within each category. Since I am filtering rows on two criteria, I defined the function adv_row_count, which adds a second column and value pair for the second criterion. I also defined the function output_stats to avoid a lot of copy-pasting when printing the total counts and proportions.

In [None]:
# Added hue to countplot to separate each Category by Sex
with axes_style({'axes.grid': True}):
        # Fix the bar order explicitly so the tick labels always match the bars
        ax = sns.countplot(data=passenger_df, x='Category', hue='Sex',
                           order=['P', 'C'], palette='Set2')
        ax.set_xticklabels(['Passenger', 'Crewmember'])
        ax.set_xlabel('Total count by Category')

# df is the dataframe, column1 and value1 are for the first criteria, while column2 and value2 are for the second criteria
def adv_row_count(df, column1, value1, column2, value2):
    return df[(df[column1] == value1) & (df[column2] == value2)].shape[0]
    
mpcount = adv_row_count(passenger_df, 'Category', 'P', 'Sex', 'M') # Row count of male passengers
fpcount = adv_row_count(passenger_df, 'Category', 'P', 'Sex', 'F') # Row count of female passengers
mpprop = find_prop(mpcount, fpcount) # Proportion of male passengers
fpprop = find_prop(fpcount, mpcount) # Proportion of female passengers

mccount = adv_row_count(passenger_df, 'Category', 'C', 'Sex', 'M') # Row count of male crewmembers
fccount = adv_row_count(passenger_df, 'Category', 'C', 'Sex', 'F') # Row count of female crewmembers
mcprop = find_prop(mccount, fccount) # Proportion of male crewmembers
fcprop = find_prop(fccount, mccount) # Proportion of female crewmembers

# title is the top label, label1 with count1 and prop1 in the middle, then label2 with count2 and prop2 in the bottom
def output_stats(title, label1, count1, prop1, label2, count2, prop2):
    print(title)
    print(label1, count1, '(', prop1, ')')
    print(label2, count2, '(', prop2, ')')

output_stats('Passengers', 'Male:', mpcount, mpprop, 'Female:', fpcount, fpprop)
print('\n')
output_stats('Crewmembers', 'Male:', mccount, mcprop, 'Female:', fccount, fcprop)

The split between males and females is nearly even in both categories, with Passengers leaning slightly male (52% versus 48%) and Crewmembers leaning female (55% versus 45%). Next I will determine the total distribution of deceased and survivors by Category.

In [None]:
with axes_style({'axes.grid': True}):
    # Fix the bar order explicitly so the tick labels always match the bars
    ax = sns.countplot(data=passenger_df, x='Category', hue='Survived',
                       order=['P', 'C'], palette='Set1')
    ax.set_xticklabels(['Passenger','Crewmember'])
    ax.set_xlabel('Total living status by Category')
    ax.legend(['Died','Survived'])
    
pdeath_count = adv_row_count(passenger_df, 'Category', 'P', 'Survived', False) # Row count of passenger deaths
psurvive_count = adv_row_count(passenger_df, 'Category', 'P', 'Survived', True) # Row count of passenger survivors
pdeath_prop = find_prop(pdeath_count, psurvive_count) # Proportion of passenger deaths
psurvive_prop = find_prop(psurvive_count, pdeath_count) # Proportion of passenger survivors

cdeath_count = adv_row_count(passenger_df, 'Category', 'C', 'Survived', False) # Row count of crew deaths 
csurvive_count = adv_row_count(passenger_df, 'Category', 'C', 'Survived', True) # Row count of crew survivors
cdeath_prop = find_prop(cdeath_count, csurvive_count) # Proportion of crew deaths
csurvive_prop = find_prop(csurvive_count, cdeath_count) # Proportion of crew survivors
        
output_stats('Passengers', 'Deceased:', pdeath_count, pdeath_prop, 'Survivors:', psurvive_count, psurvive_prop)
print('\n')
output_stats('Crewmembers', 'Deceased:', cdeath_count, cdeath_prop, 'Survivors:', csurvive_count, csurvive_prop)

It seems that indeed a greater proportion of Crewmembers survived (22%) than Passengers (12%). I further broke down the deaths and survivors for each Category by gender.

In [None]:
# Filter results for Category == C
passenger_df_crew = passenger_df[passenger_df['Category'] == 'C']

# Further split into living status
crew_deaths = passenger_df_crew[passenger_df_crew['Survived'] == False]
crew_survived = passenger_df_crew[passenger_df_crew['Survived'] == True]

fdeath_count = row_count(crew_deaths, 'Sex', 'F') # Row count for female deaths
fsurvive_count = row_count(crew_survived, 'Sex', 'F') # Row count for female survivors
mdeath_count = row_count(crew_deaths, 'Sex', 'M') # Row count for male deaths
msurvive_count = row_count(crew_survived, 'Sex', 'M') # Row count for male survivors

fdeath_prop = find_prop(fdeath_count, fsurvive_count) # Proportion of female deaths
fsurvive_prop = find_prop(fsurvive_count, fdeath_count) # Proportion of female survivors
mdeath_prop = find_prop(mdeath_count, msurvive_count) # Proportion of male deaths
msurvive_prop = find_prop(msurvive_count, mdeath_count) # Proportion of male survivors

output_stats('Female Crewmembers', 'Deaths:', fdeath_count, fdeath_prop, 'Survivors:', fsurvive_count, fsurvive_prop)
print('\n')
output_stats('Male Crewmembers', 'Deaths:', mdeath_count, mdeath_prop, 'Survivors:', msurvive_count, msurvive_prop)

with axes_style({'axes.grid': True}):
    ax = sns.countplot(data=passenger_df_crew, x='Sex', hue='Survived', palette='Set1', order=['F','M'])
    ax.set_xlabel('Crew living status by gender')
    ax.set_ylim([0,400])
    ax.legend(['Died','Survived'])

In [None]:
# Filter results for Category == P
passenger_df_excludecrew = passenger_df[passenger_df['Category'] == 'P']

# Further split into living status
passenger_deaths = passenger_df_excludecrew[passenger_df_excludecrew['Survived'] == False]
passenger_survived = passenger_df_excludecrew[passenger_df_excludecrew['Survived'] == True]

fdeath_count = row_count(passenger_deaths, 'Sex', 'F')
fsurvive_count = row_count(passenger_survived, 'Sex', 'F')
mdeath_count = row_count(passenger_deaths, 'Sex', 'M')
msurvive_count = row_count(passenger_survived, 'Sex', 'M')

fdeath_prop = find_prop(fdeath_count, fsurvive_count)
fsurvive_prop = find_prop(fsurvive_count, fdeath_count)
mdeath_prop = find_prop(mdeath_count, msurvive_count)
msurvive_prop = find_prop(msurvive_count, mdeath_count)

output_stats('Female Passengers', 'Deaths:', fdeath_count, fdeath_prop, 'Survivors:', fsurvive_count, fsurvive_prop)
print('\n')
output_stats('Male Passengers', 'Deaths:', mdeath_count, mdeath_prop, 'Survivors:', msurvive_count, msurvive_prop)

with axes_style({'axes.grid': True}):
    ax = sns.countplot(data=passenger_df_excludecrew, x='Sex', hue='Survived',
                       palette='Set1', order=['F','M'])
    ax.set_xlabel('Passenger living status by Gender')
    ax.set_ylim([0,400])
    ax.legend(['Died','Survived'])

Males survived at a higher rate than females in both the Passenger and Crewmember categories; the imbalance is especially disproportionate in the Passenger category.
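
The same breakdown by category and gender can be produced in a single table with `pd.crosstab`, which counts combinations and, with `normalize='index'`, converts each row to proportions. The sketch below runs on made-up rows, so the rates it prints are not the notebook's results.

```python
import pandas as pd

# Toy data for illustration only
toy_df = pd.DataFrame({
    'Category': ['P', 'P', 'P', 'P', 'C', 'C', 'C', 'C'],
    'Sex':      ['M', 'M', 'F', 'F', 'M', 'M', 'F', 'F'],
    'Survived': [True, False, False, False, True, True, False, True],
})

# Readable column labels instead of True/False
status = toy_df['Survived'].map({True: 'Survived', False: 'Died'})

# Row-normalized table: each (Category, Sex) row sums to 1
rates = pd.crosstab(
    [toy_df['Category'], toy_df['Sex']],
    status,
    normalize='index',
).round(2)
print(rates)
```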

<a id='conclusions'></a>
## Conclusions

I found that the average age of the 137 survivors was 34.01 years old, while the average age of the 852 deceased was 46.27 years old. There were no survivors under 12 years old, and none older than 67 years old. Although the ages onboard were approximately normally distributed, survival trended downward as age increased. A greater proportion of crewmembers survived, with 22% of the crew and only 12% of passengers surviving. Furthermore, a greater proportion of males survived in both categories, with 34% of male crewmembers and 19% of male passengers surviving. Only 12% of female crewmembers and 4% of female passengers survived.

The limitations of this analysis include a lack of data on which individuals were able to board the lifeboats or liferafts that were deployed, and on the exact location of individuals and their cabins at the time of the disaster. Since the ship was listing to starboard, whether an individual's cabin was on the port or starboard side could have affected their chances of escaping.