# Introduction

This kernel is about a ferry disaster. On September 27, 1994, a ferry named Estonia set sail from Tallinn, Estonia, to Stockholm. She departed at 19:00 carrying 989 passengers and crew. A Mayday signal was sent, but a power failure meant the ship’s position was given imprecisely. The Estonia disappeared from the responding ships’ radar screens at about 01:50. The final death toll was more than 850 people.

We will look for answers to the following questions using exploratory data analysis.

* Who's more likely to survive the sinking based on data?
* Is age an indicator for survival?
* Is gender an indicator for survival?
* Did the crew aboard have a higher chance of survival than passengers?

<font color='blue'>
Content
    
1. [Read Data and PreCheck](#1)
1. [Variable Description](#2)
   * [Univariate Variable Analysis](#3)
       * [Categorical Variables](#4)
       * [Numerical Variables](#5)
1. [Basic Data Analysis](#6)
1. [Visualization](#7)
   * [Category -- Survived](#8)
   * [Sex -- Survived](#9)
   * [Age -- Survived](#10)
   * [Country -- Category -- Survived](#11)
   * [Sex -- Category -- Survived](#12)
   * [Sex -- Country -- Survived](#13)
1. [Conclusion](#14)

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory
import warnings
warnings.filterwarnings('ignore')
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

        
# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

<a id='1'></a>
# Read Data and PreCheck 

In [None]:
df = pd.read_csv("../input/passenger-list-for-the-estonia-ferry-disaster/estonia-passenger-list.csv")
df.tail()

In [None]:
df.describe()

In [None]:
df.columns

In [None]:
df.isnull().sum()

<a id = '2'></a>
# Variable Description

* **Country:** Country of origin	
* **Firstname:** Firstname of passenger	
* **Lastname:**	Lastname of passenger	
* **Sex:** Gender of passenger	(M = Male, F = Female)
* **Age:** Age of passenger at the time of sinking	
* **Category:**	The type of passenger	(C = Crew, P = Passenger)
* **Survived:**	Survival (0 = No, 1 = Yes)

In [None]:
df.info()

 * **object(5):** Country, Firstname, Lastname, Sex, Category
 * **int64(3):** PassengerId, Age, Survived
 * PassengerId is unnecessary for this analysis, so we can drop it.

In [None]:
df.drop(["PassengerId"],axis=1,inplace = True)

<a id='3'></a>
## Univariate Variable Analysis

* **Categorical Variables:** Country, Firstname, Lastname, Sex, Category, Survived
* **Numerical Variables:** Age

<a id='4'></a>
### Categorical Variables

In [None]:
def bar_plot(feature,figsize = (22,5)):
    
    value = df[feature].value_counts()
    
    plt.figure(figsize = figsize)
    plt.bar(value.index,value.values)
    plt.ylabel("Frequency")
    plt.xlabel(feature)
    plt.title("Distribution of " +str(feature))
    plt.show()
    
    print(value)

In [None]:
categorical_features = ["Country", "Sex", "Category", "Survived"]

for c in categorical_features:
    bar_plot(c)
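
The raw counts printed by `bar_plot` are easier to compare when shown next to proportions. The cell below is a minimal sketch on a made-up `Category` column (the four values in `toy` are hypothetical, not taken from the dataset). It relies on `value_counts(normalize=True)` to turn counts into shares.

```python
import pandas as pd

# Hypothetical toy column; only the name mirrors the dataset.
toy = pd.Series(["P", "P", "C", "P"], name="Category")

counts = toy.value_counts()                 # absolute frequency per value
shares = toy.value_counts(normalize=True)   # relative frequency per value

# Both Series share the same index, so they align row by row.
summary = pd.DataFrame({"Count": counts, "Share": shares})
print(summary)
```

The same two calls could be applied to `df[feature]` inside `bar_plot` if shares are wanted for the real columns.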

<a id='5'></a>
### Numerical Variables


In [None]:
plt.figure(figsize = (9,5))
plt.hist(df.Age,bins = 50)
plt.xlabel("Age")
plt.ylabel("Frequency")
plt.title("Age Distribution")
plt.show()


<a id='6'></a>
# Basic Data Analysis

* Country -- Survived
* Category -- Survived
* Sex -- Survived
* Country -- Category -- Survived
* Sex -- Category -- Survived
* Sex -- Country -- Survived

In [None]:
# Country -- Survived
df_country_survived = df[["Country","Survived"]].groupby("Country").sum()
countries = df.Country.value_counts()

# Dead = people per country minus survivors (aligned on the Country index)
df_country_survived["Dead"] = countries - df_country_survived["Survived"]
df_country_survived["Mean_of_Survived"] = df[["Country","Survived"]].groupby("Country").mean()
df_country_survived

In [None]:
# Category -- Survived
df_category_survived = df[["Category","Survived"]].groupby("Category").sum()
number_of_c = df.Category[df.Category == "C"].value_counts().values[0]
number_of_p = df.Category[df.Category == "P"].value_counts().values[0]
df_category_survived["Dead"] = [(number_of_c - df_category_survived.Survived.values[0]),(number_of_p - df_category_survived.Survived.values[1])]
df_category_survived["Mean_of_Survived"] = df[["Category","Survived"]].groupby("Category").mean()
df_category_survived

In [None]:
# Sex -- Survived
df_sex_survived = df[["Sex","Survived"]].groupby("Sex").sum()
# Dead = people of each sex minus survivors of that sex (aligned on the Sex index)
number_by_sex = df.Sex.value_counts()
df_sex_survived["Dead"] = number_by_sex - df_sex_survived["Survived"]
df_sex_survived["Mean_of_Survived"] = df[["Sex","Survived"]].groupby("Sex").mean()
df_sex_survived
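
The three single-feature tables above follow one pattern: survivors, deaths and survival rate per group. As a sketch of how that pattern could be factored out, the cell below uses named aggregation on a small made-up frame. The helper `survival_table` and the rows in `toy` are hypothetical and only the column names match the dataset.

```python
import pandas as pd

# Hypothetical toy rows; only the column names mirror the dataset.
toy = pd.DataFrame({
    "Sex": ["M", "M", "F", "F", "M", "F"],
    "Category": ["C", "P", "P", "C", "P", "P"],
    "Survived": [1, 0, 0, 1, 1, 0],
})

def survival_table(frame, by):
    """Return survivors, deaths and survival rate for each value of `by`."""
    table = frame.groupby(by)["Survived"].agg(
        Survived="sum",            # number of survivors in the group
        Total="count",             # group size
        Mean_of_Survived="mean",   # survival rate
    )
    table["Dead"] = table["Total"] - table["Survived"]
    return table[["Survived", "Dead", "Mean_of_Survived"]]

sex_table = survival_table(toy, "Sex")
print(sex_table)
```

Because the group size comes from the same `groupby`, the `Dead` column cannot be paired with counts from a different feature by mistake.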

In [None]:
#Country -- Category -- Survived
df.groupby(["Country","Category","Survived"]).size().reset_index(name = "Count")

In [None]:
#Sex -- Category -- Survived
df.groupby(["Sex","Category","Survived"]).size().reset_index(name = "Count")

In [None]:
#Sex -- Country -- Survived
df.groupby(["Sex","Country","Survived"]).size().reset_index(name = "Count")
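
The long-format counts above list every combination on its own row, which makes rates hard to compare at a glance. One possible complement is a pivot table of survival rates, sketched here on hypothetical toy rows (not real passengers) with the same column names.

```python
import pandas as pd

# Hypothetical toy rows covering every Sex/Category combination.
toy = pd.DataFrame({
    "Sex":      ["M", "M", "M", "M", "F", "F", "F", "F"],
    "Category": ["C", "C", "P", "P", "C", "C", "P", "P"],
    "Survived": [1,   1,   1,   0,   1,   0,   0,   0],
})

# Rows are Sex, columns are Category, cells hold the mean of Survived,
# i.e. the survival rate of that subgroup.
rate = toy.pivot_table(index="Sex", columns="Category",
                       values="Survived", aggfunc="mean")
print(rate)
```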

<a id='7'></a>
# Visualization

<a id='8'></a>
## Category -- Survived

In [None]:
g = sns.catplot(x = "Category", y="Survived", kind = 'bar',data = df, height = 5)
g.set_ylabels("Survived Probability")
plt.show()

* The survival rate is higher for the crew.
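
With `kind = 'bar'`, each bar is the mean of `Survived` in that group, and by default seaborn adds an error bar based on a bootstrap confidence interval. The cell below is a rough, standard-library sketch of that idea, not seaborn's own implementation. The function `bootstrap_ci` and the flags in `crew` are hypothetical.

```python
import random
from statistics import mean

def bootstrap_ci(values, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_boot))
    lo_idx = int((1 - level) / 2 * n_boot)
    hi_idx = int((1 + level) / 2 * n_boot) - 1
    return means[lo_idx], means[hi_idx]

# Hypothetical 0/1 survival flags for one group (not real data).
crew = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]

rate = mean(crew)               # bar height: share of survivors
low, high = bootstrap_ci(crew)  # approximate error-bar limits
print(f"rate={rate:.2f}, interval=({low:.2f}, {high:.2f})")
```

A wide interval signals a small group, so differences between bars with overlapping intervals should be read with care.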

<a id='9'></a>
## Sex -- Survived

In [None]:
g = sns.catplot(x = "Sex", y="Survived", kind = 'bar',data = df, height = 5)
g.set_ylabels("Survived Probability")
plt.show()

* The survival rate is higher for men.

<a id='10'></a>
## Age -- Survived

In [None]:
#df_age_categorical = df.copy()
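# Bin ages into decades: 0 for 0-9, 1 for 10-19, ..., 8 for 80 and above.
# Chained comparisons are used because `|` binds tighter than `>=` and `<`.
df.Age = [0 if a < 10 
                          else 1 if 10 <= a < 20 
                          else 2 if 20 <= a < 30 
                          else 3 if 30 <= a < 40 
                          else 4 if 40 <= a < 50 
                          else 5 if 50 <= a < 60 
                          else 6 if 60 <= a < 70 
                          else 7 if 70 <= a < 80 
                          else 8 
                          for a in df.Age.values]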
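
A note on binning ages by decade: in Python, `|` is bitwise OR and binds more tightly than `>=` and `<`, so an expression such as `a >= 10 | a < 20` parses as the chained comparison `a >= (10 | a) < 20` rather than as a range check. The plain-Python sketch below illustrates the difference for one age and shows an integer-division alternative. The helper `age_group` is hypothetical.

```python
age = 12  # clearly inside the 10-19 group

# Parsed as `age >= (10 | age) < 20`; 10 | 12 is 14, so this is False.
mis_parsed = age >= 10 | age < 20

# A chained comparison expresses the intended range check.
range_check = 10 <= age < 20

def age_group(a):
    """Map an age in years to a decade index: 0 for 0-9, ..., 8 for 80+."""
    return min(a // 10, 8)

print(mis_parsed, range_check, age_group(age))
```

Integer division keeps the mapping in one expression, and `min(..., 8)` folds every age of 80 or more into the last group.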



In [None]:
g = sns.catplot(x = "Age", y="Survived", kind = 'bar',data = df, height = 5,aspect = 3)
g.set_xticklabels(["0-9","10-19","20-29","30-39","40-49","50-59","60-69","70-79","80-87"])
g.set_ylabels("Survived Probability")
plt.show()

* People between 20 and 40 years old had a higher chance of survival.
* Nobody aged 0-9 or 70-79 was saved.

<a id='11'></a>
## Country -- Category -- Survived

In [None]:
g = sns.catplot(x="Country", y="Survived", hue="Category", kind="bar", data=df,height = 5,aspect = 3);
g.set_ylabels("Survived Probability")
plt.show()

<a id='12'></a>
## Sex -- Category -- Survived

In [None]:
g = sns.catplot(x="Category", y="Survived", hue="Sex", kind="bar", data=df,height = 5,aspect = 3);
g.set_ylabels("Survived Probability")
plt.show()

<a id='13'></a>
## Sex -- Country -- Survived

In [None]:
g = sns.catplot(x="Country", y="Survived", hue="Sex", kind="bar", data=df,height = 5,aspect = 3);
g.set_ylabels("Survived Probability")
plt.show()

<a id = '14'></a>
# Conclusion

* Q1 - Who's more likely to survive the sinking based on data? <br/>
  A1 - Male crew members were more likely to survive the sinking.
* Q2 - Is age an indicator for survival?<br/>
  A2 - People between 20 and 40 years old were more likely to survive the sinking.
* Q3 - Is gender an indicator for survival?<br/>
  A3 - Men seem to have been more likely to survive the sinking.
* Q4 - Did the crew aboard have a higher chance of survival than passengers?<br/>
  A4 - Yes, the crew aboard had a higher chance of survival than passengers.