# **Introduction**
![](https://cdn.fishki.net/upload/post/2019/09/28/3099754/tn/50000.jpg)
On September 27, 1994, the ferry Estonia set sail on a night voyage across the Baltic Sea from the port of Tallinn in Estonia to Stockholm. She departed at 19:00 carrying 989 passengers and crew, as well as vehicles, and was due to dock at 09:30 the following morning. Tragically, the Estonia never arrived.

The weather was typically stormy for the time of year but, like all the other scheduled ferries on that day, the Estonia set off as usual. At roughly 01:00 a worrying sound of screeching metal was heard, but an immediate inspection of the bow visor showed nothing untoward. The ship suddenly listed 15 minutes later and soon alarms were sounding, including the lifeboat alarm. Shortly afterwards the Estonia rolled drastically to starboard. Those who had reached the decks had a chance of survival but those who had not were doomed as the angled corridors had become death traps. A Mayday signal was sent but power failure meant the ship’s position was given imprecisely. The Estonia disappeared from the responding ships’ radar screens at about 01:50.

The Mariella arrived at the scene at 02:12 and the first helicopter at 03:05. Of the 138 people rescued alive, one died later in hospital.

Of the 310 people who had reached the decks, almost a third died of hypothermia. The final death toll was shockingly high – more than 850 people.

An official inquiry found that failure of the locks on the bow visor, which broke away under the punishing waves, caused water to flood the car deck and quickly capsize the ship. The report also noted a lack of action, delay in sounding the alarm, lack of guidance from the bridge and a failure to light distress flares.

The sinking of the Estonia was Europe’s worst postwar maritime disaster.

Questions:
1. **Who's more likely to survive the sinking based on data?**
2. **Is age an indicator for survival?**
3. **Is gender an indicator for survival?**
4. **Did the crew aboard have a higher chance of survival than passengers?**

Thanks to artem1337 for some data visualization ideas.

Link to the kernel: https://www.kaggle.com/artem1337/data-visualization

Importing the necessary libraries.

In [None]:
import pandas as pd 
import numpy as np
pd.plotting.register_matplotlib_converters()
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_style('dark')
print('Setup complete')

Below, I define a small helper for labelling the axes and title of a histogram.

In [None]:
def cust_hist(hist_obj, xlabel=None, ylabel=None, title=None):
    """Set axis labels and title on a plot given as a (figure, axes) pair."""
    _, ax = hist_obj
    if xlabel:
        ax.set_xlabel(xlabel)
    if ylabel:
        ax.set_ylabel(ylabel)
    if title:
        ax.set_title(title)

Let's read the data and look at the information about the dataset.

In [None]:
path = '..//input//passenger-list-for-the-estonia-ferry-disaster//estonia-passenger-list.csv'

In [None]:
df = pd.read_csv(path, index_col='PassengerId')
df.head(5)

In [None]:
df.info()

Let's see how many people were on the ship at the time of the sinking and which countries they came from.

In [None]:
# Count people per country once and reuse the result
country_counts = df['Country'].value_counts()
print('The number of passengers from different countries:')
print(country_counts)

fig, ax = plt.subplots(figsize=(10, 6))
sns.barplot(x=country_counts.index, y=country_counts.values, ax=ax)
ax.tick_params(axis='x', rotation=35)
ax.set_title('The number of passengers from different countries')
plt.show()

Most of the people on board fall into three groups: Swedes, Estonians and others. This is to be expected, because the ferry was sailing from Tallinn to Stockholm. There are also relatively many citizens of Latvia and Finland, neighboring countries in Northern Europe.

# **Answer the question: Is age an indicator for survival?**

First, let's look at the age distribution of people who were on the ferry and the distribution of the ages of the dead and survivors.

In [None]:
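fig, axarr = plt.subplots(1, 2, figsize=(14, 8))

# Left panel: overall age distribution (histplot replaces the deprecated distplot)
axarr[0].set_title('Age distribution')
sns.histplot(df['Age'], color='g', bins=15, kde=True, stat='density', ax=axarr[0])

# Right panel: age densities of survivors and victims (fill replaces the deprecated shade)
axarr[1].set_title('Age distribution for the two subpopulations')
sns.kdeplot(df.loc[df['Survived'] == 1, 'Age'], fill=True, ax=axarr[1], label='Survived')
sns.kdeplot(df.loc[df['Survived'] == 0, 'Age'], fill=True, ax=axarr[1], label='Not Survived')
axarr[1].set_xlabel('Age')
axarr[1].legend()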

The overall age distribution is approximately normal, with a slight tendency towards bimodality. This means that we can apply Student's t-test to test the statistical hypothesis that the mean ages of survivors and victims are equal.

The second graph hints that this hypothesis will be rejected, but we will still check it using the t-test implemented in the `scipy.stats` package.

In [None]:
fig, ax = plt.subplots(figsize=(10, 8))
ax.set_title('Age distribution for the two subpopulations')
# fill=True replaces the deprecated shade=True
sns.kdeplot(df.loc[df['Survived'] == 1, 'Age'], fill=True, ax=ax, label='Survived')
sns.kdeplot(df.loc[df['Survived'] == 0, 'Age'], fill=True, ax=ax, label='Not Survived')
ax.set_xlabel('Age')
ax.legend()
ax.grid()

Let's see how different the age distributions of men and women are.

In [None]:
fig, ax = plt.subplots(figsize=(10, 8))
ax.set_title('Age distribution of men and women')
# fill=True replaces the deprecated shade=True
sns.kdeplot(df.loc[df['Sex'] == 'M', 'Age'], fill=True, ax=ax, label='Male')
sns.kdeplot(df.loc[df['Sex'] == 'F', 'Age'], fill=True, ax=ax, label='Female')
ax.set_xlabel('Age')
ax.legend()
ax.grid()

As you can see, the distributions are quite similar, except that the peak of the distribution for women lies slightly to the right, which suggests that the women on the ferry were typically slightly older than the men.

This shift may be related to life expectancy, chance, or other factors; the plot alone does not tell us what causes it.

Let's test the hypothesis.

Creating masks to identify groups of survivors and victims. 

In [None]:
from scipy import stats

mask = df['Survived'] == 1
x = df['Age'][mask]

mask = df['Survived'] == 0
y = df['Age'][mask]

Let's look at the statistics of survivors.

In [None]:
x.describe()

The mean and the median age do not differ much, which suggests that the distribution is roughly symmetric and consistent with a normal distribution. Of interest: the youngest survivor was 12 years old and the oldest was 67.

In [None]:
y.describe()

A similar pattern is observed here. Of interest: there is one outlier, a passenger recorded with age 0 (possibly a pregnant woman, or rather her child), and the oldest passenger was 87 years old.
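Comparing the mean with the median is only a rough symmetry check. As a hedged sketch of a slightly stronger check, the snippet below computes the sample skewness and excess kurtosis, both of which are close to 0 for normally distributed data. The helper name `skew_and_excess_kurtosis` and the ages are invented for illustration and are not the Estonia data; on the real data one would pass `x` or `y` instead.

```python
import statistics

def skew_and_excess_kurtosis(values):
    """Biased (population) skewness and excess kurtosis from central moments."""
    n = len(values)
    mean = statistics.mean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3

# Toy ages, invented for illustration only
ages = [19, 23, 27, 31, 35, 38, 41, 44, 48, 52, 57, 63]
skew, kurt = skew_and_excess_kurtosis(ages)
print(f'skewness = {skew:.3f}, excess kurtosis = {kurt:.3f}')
```

A clearly non-zero skewness would be a reason to treat the normality assumption with more caution, or to fall back on a rank-based test.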

Let's check whether age affects survival.

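Take the null hypothesis H0: the mean ages of the dead and the survivors are equal.

Let's use Student's t-test, since the distributions are close to normal. Passing `equal_var=False` selects Welch's variant, which does not assume that the two groups have equal variances.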

In [None]:
res = stats.ttest_ind(x, y, equal_var=False)
print('p-value:', res[1])

The p-value turned out to be very low, well below the usual threshold of statistical significance, so we reject H0. We can say with confidence that age is a significant factor for survival in this disaster.
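To make it clearer what `stats.ttest_ind(x, y, equal_var=False)` computes, here is a minimal sketch of Welch's t statistic and the Welch-Satterthwaite degrees of freedom, using only the standard library. The helper name `welch_t` and the ages are assumptions made up for illustration; they are not the Estonia data.

```python
import math
import statistics

def welch_t(a, b):
    """Return Welch's t statistic and the approximate degrees of freedom."""
    n_a, n_b = len(a), len(b)
    # Sample variances (n - 1 in the denominator), each divided by its sample size
    va = statistics.variance(a) / n_a
    vb = statistics.variance(b) / n_b
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom
    dof = (va + vb) ** 2 / (va ** 2 / (n_a - 1) + vb ** 2 / (n_b - 1))
    return t, dof

# Toy ages, invented for illustration only
survived_ages = [22, 25, 31, 34, 38, 41]
victim_ages = [35, 44, 52, 58, 61, 66, 70]

t_stat, dof = welch_t(survived_ages, victim_ages)
print(f't = {t_stat:.3f}, dof = {dof:.1f}')
```

The p-value reported by SciPy is then the two-sided tail probability of a t distribution with `dof` degrees of freedom at `t_stat`.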

# **Did the crew aboard have a higher chance of survival than passengers?**
# **Is gender an indicator for survival?**

To answer these questions, we need to study the survival statistics by gender and by the category (passenger or crew) to which each person belonged.

Let's see how many people were female and male among the passengers and crew members.

In [None]:
fig, ax = plt.subplots(figsize=(10, 8))
p = sns.countplot(x = 'Category', hue = 'Sex', data = df, 
                  ax=ax).set_title('Number of men and women in different categories')

Note that men outnumbered women among the passengers, but there were more women than men among the crew.

Let's look at the number of survivors among passengers and crew members.

In [None]:
fig, ax = plt.subplots(figsize=(10, 8))
p = sns.countplot(x = 'Category', hue = 'Survived', data = df, 
                  ax=ax).set_title('Number of survivors among passengers and crew')

In absolute numbers, far more passengers survived than crew members, but we need to look at the percentage of survivors in each category to give a more accurate estimate of survival.

In [None]:
c_surv = df['Category'][(df['Category'] == 'C')& (df['Survived'] == 1)].count()
c_all = df['Category'][df['Category'] == 'C'].count()
p_surv = df['Category'][(df['Category'] == 'P') & (df['Survived'] == 1)].count()
p_all = df['Category'][df['Category'] == 'P'].count()
c_percent = c_surv / c_all * 100
p_percent = p_surv / p_all * 100
print(f'The percentage of survivors among the crew members: {round(c_percent, 3)}%')
print(f'The percentage of survivors among the passengers: {round(p_percent, 3)}%')

After normalizing by group size, we see that the survival rate among crew members is almost twice that among passengers. One plausible reading is that the crew members were better able to save themselves and their colleagues than the passengers were.
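The same percentages can be obtained more compactly with `groupby`. Because `Survived` is coded as 0 or 1, its mean within a group is the share of survivors in that group. The sketch below uses a tiny made-up DataFrame with the same column names as the dataset (`Category`, `Sex`, `Survived`); the values are invented and the resulting rates say nothing about the real disaster.

```python
import pandas as pd

# Toy data with the same column names as the dataset; values are invented
toy = pd.DataFrame({
    'Category': ['C', 'C', 'C', 'C', 'P', 'P', 'P', 'P', 'P', 'P'],
    'Sex':      ['M', 'M', 'F', 'F', 'M', 'M', 'M', 'F', 'F', 'F'],
    'Survived': [1,   0,   1,   0,   1,   0,   0,   0,   0,   1],
})

# Survived is 0/1, so the group mean is the share of survivors
rate_by_category = toy.groupby('Category')['Survived'].mean() * 100
rate_by_cat_sex = toy.groupby(['Category', 'Sex'])['Survived'].mean() * 100

print(rate_by_category.round(1))
print(rate_by_cat_sex.round(1))
```

On the real data, replacing `toy` with `df` should reproduce the manually computed percentages, since both approaches divide the number of survivors by the group size.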

Let's look at more detailed statistics on both categories, namely the survival rate of men and women among passengers and crew of the ferry.

In [None]:
# Crew(C-category)
print('Statistics on crew')
f_surv = df['Sex'][(df['Sex'] == 'F') & (df['Category'] == 'C') & (df['Survived'] == 1)].count()
f_all = df['Sex'][(df['Sex'] == 'F') & (df['Category'] == 'C')].count()

m_surv = df['Sex'][(df['Sex'] == 'M') & (df['Category'] == 'C') & (df['Survived'] == 1)].count()
m_all = df['Sex'][(df['Sex'] == 'M') & (df['Category'] == 'C')].count()

f_percent = f_surv / f_all * 100
m_percent = m_surv / m_all * 100
print(f'Percentage of male survivors among crew members: {round(m_percent, 3)}%')
print(f'Percentage of female survivors among crew members: {round(f_percent, 3)}%')


# Passengers(P-category)
print('\nStatistics on passengers')

f_surv = df['Sex'][(df['Sex'] == 'F') & (df['Category'] == 'P') & (df['Survived'] == 1)].count()
f_all = df['Sex'][(df['Sex'] == 'F') & (df['Category'] == 'P')].count()

m_surv = df['Sex'][(df['Sex'] == 'M') & (df['Category'] == 'P') & (df['Survived'] == 1)].count()
m_all = df['Sex'][(df['Sex'] == 'M') & (df['Category'] == 'P')].count()

f_percent = f_surv / f_all * 100
m_percent = m_surv / m_all * 100
print(f'Percentage of male survivors among passengers: {round(m_percent, 3)}%')
print(f'Percentage of female survivors among passengers: {round(f_percent, 3)}%')

We see that among the crew members, men had the higher survival rate, about 33%. Among the passengers, men also survived at a higher rate than women, so the pattern is the same in both categories. One possible interpretation is that men were more concerned for their own lives than for the lives of women and were the first to escape.

![](https://avatars.mds.yandex.net/get-zen_doc/1110951/pub_5d9dcffe2fda8600b1530c07_5d9df12986c4a900b246e1ad/scale_1200)

In [None]:
all_peop = df['Sex'][df['Sex'] == 'F'].count()
surv = df['Sex'][(df['Survived'] == 1) & (df['Sex'] == 'F')].count()
percent = surv / all_peop
print(f'Percent of survived Female: {round(percent * 100, 3)}%')

all_peop = df['Sex'][df['Sex'] == 'M'].count()
surv = df['Sex'][(df['Survived'] == 1) & (df['Sex'] == 'M')].count()
percent = surv / all_peop
print(f'Percent of survived Male: {round(percent * 100, 3)}%')

The percentage of male survivors is much higher than the percentage of female survivors. We can confidently say that gender has a strong impact on survival. It can also be stated that the crew members had a better chance of saving their lives than the passengers.
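Just as the age question was backed by a t-test, the difference between two survival rates can be checked with a formal test. Below is a minimal sketch of a two-sided two-proportion z-test using only the standard library. The helper name `two_proportion_z` and the counts are placeholders invented for illustration; the real survivor counts and group sizes from the cells above would have to be plugged in.

```python
import math

def two_proportion_z(success_a, n_a, success_b, n_b):
    """Two-sided z-test for the difference between two proportions."""
    p_a, p_b = success_a / n_a, success_b / n_b
    # Pooled proportion under H0: both groups share one survival rate
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Placeholder counts, invented for illustration only
z, p_value = two_proportion_z(success_a=40, n_a=200, success_b=15, n_b=200)
print(f'z = {z:.3f}, p-value = {p_value:.5f}')
```

The same function could be reused for the crew versus passenger comparison by passing the survivor counts and sizes of those two groups.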

![](http://images4.fanpop.com/image/photos/20400000/Captain-Jack-Sparrow-in-DMC-captain-jack-sparrow-20487552-998-424.jpg)

# **Who's more likely to survive the sinking based on data?**

Summing up, we can say that the person with the greatest chance of survival was a man aged 20-40 who was a member of the crew. Note the following facts:
1. Men were more likely to survive than women.
2. Crew members were more likely to survive than passengers.
3. Young people were more likely to survive than older people.
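The age range in the summary is read from the density plots. A hedged sketch of how such a claim could be checked numerically is to bin ages with `pd.cut` and compute the survival rate per bin. The bin edges and the toy values below are assumptions chosen for illustration, not results from the Estonia data.

```python
import pandas as pd

# Toy data with the same column names as the dataset; values are invented
toy = pd.DataFrame({
    'Age':      [18, 24, 29, 33, 38, 45, 52, 58, 63, 71],
    'Survived': [1,  1,  0,  1,  0,  0,  1,  0,  0,  0],
})

# Right-closed age bands; include_lowest=True keeps an age of exactly 0
bins = [0, 20, 40, 60, 100]
toy['AgeGroup'] = pd.cut(toy['Age'], bins=bins, include_lowest=True)

# Share of survivors and group size per age band
summary = toy.groupby('AgeGroup', observed=True)['Survived'].agg(['mean', 'count'])
print(summary)
```

Reporting the group size next to the rate helps to avoid over-interpreting bands that contain only a handful of people.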

This concludes my analysis of this data set. Leave your comments and suggestions for improvement. 

It was a pleasure to share my thoughts with you! 

Good luck!