In [344]:
import pandas as pd
import numpy as np
pd.plotting.register_matplotlib_converters()
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

# Path of the file to read.
# This notebook was run on Kaggle; when running locally, point this path to your copy of train.csv.
train_filepath = "../input/tabular-playground-series-apr-2021/train.csv"

df = pd.read_csv(train_filepath, index_col="PassengerId")

print("Finished setup, dataset is available")

# Context

Let's take a quick look at the data set that we are planning to explore.

In [345]:
# Look at the first five rows of the dataframe
df.head()

In [346]:
# Also check the last five rows of the dataframe
df.tail()

## What do the different columns mean

The following information is taken from the competition description. It explains what the different columns represent and which values (keys) are expected in them.

| Variable |                 Definition                 |                       Key                      |
|:--------:|:------------------------------------------:|:----------------------------------------------:|
| survival |                  Survival                  |                 0 = No, 1 = Yes                |
|  pclass  |                Ticket class                |            1 = 1st, 2 = 2nd, 3 = 3rd           |
|    sex   |                     Sex                    |                                                |
|    Age   |                Age in years                |                                                |
|   sibsp  | # of siblings / spouses aboard the Titanic |                                                |
|   parch  | # of parents / children aboard the Titanic |                                                |
|  ticket  |                Ticket number               |                                                |
|   fare   |               Passenger fare               |                                                |
|   cabin  |                Cabin number                |                                                |
| embarked |             Port of Embarkation            | C = Cherbourg, Q = Queenstown, S = Southampton |

# Data quality assessment

We start with some simple assessments, following the guidance of this tutorial cheat sheet: https://github.com/cmawer/pycon-2017-eda-tutorial/blob/master/EDA-cheat-sheet.md

## Overall info from Pandas about this dataframe

We start by gathering summary statistics for each numerical column using the **.describe()** method of pandas.

In [347]:
df.describe()

Now let's look at the columns, their types and their non-null counts.

In [348]:
df.info()

As we can see, there are 100 000 entries in the dataset, and some columns contain NaNs. Let's visualise the columns with NaN/null values to get an overall understanding of how incomplete the data in these columns is.

### Total count of NaN values in each column that contains them

In [349]:
total = df.shape[0]

age_nan_number = df['Age'].isna().sum()
ticket_nan_number = df['Ticket'].isna().sum()
fare_nan_number = df['Fare'].isna().sum()
cabin_nan_number = df['Cabin'].isna().sum()
embarked_nan_number = df['Embarked'].isna().sum()

name_of_nan_contains_columns = ['Age', 'Ticket', 'Fare', 'Cabin', 'Embarked']
list_nans = [age_nan_number, ticket_nan_number, fare_nan_number, cabin_nan_number, embarked_nan_number]
# Non-NaN count is the total number of rows minus the NaN count
list_value = [total - n for n in list_nans]
df_nans = pd.DataFrame(list(zip(name_of_nan_contains_columns, list_nans, list_value)), columns=['Name', 'NaN count', 'Non-NaN count'])
plt.figure(figsize=(12, 7))
sns.barplot(x='Name', y='NaN count', data=df_nans)
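
The per-column counting above can also be written more compactly. The following is a minimal sketch on a made-up toy frame (the values and the `toy` name are illustrative, not taken from the competition data); it relies only on `DataFrame.isna().sum()` and reports the share of missing rows as well.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the competition data (values are made up).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Fare": [7.25, 71.28, np.nan, 8.05],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Pclass": [3, 1, 3, 3],
})

nan_counts = toy.isna().sum()                        # NaNs per column
nan_counts = nan_counts[nan_counts > 0]              # keep only columns with gaps
nan_share = (nan_counts / len(toy) * 100).round(1)   # percentage of rows

summary = pd.DataFrame({"NaN count": nan_counts, "NaN share, %": nan_share})
print(summary.sort_values("NaN count", ascending=False))
```

Applied to the real `df`, the same three lines would pick up every column with missing values without listing the column names by hand.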

The number of NaN values is extremely high for the Cabin column (close to 70%), so we will remove this column from the dataset.

In [350]:
df.drop(columns="Cabin", inplace=True)

## Duplicates
Full duplicates are not valuable in the context of passengers/survivors: each person is unique in the real world, so duplicated rows can be safely deleted.

In [351]:
df = df.drop_duplicates()
df

Since no rows were dropped, we conclude that there are no duplicates in the dataframe.
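
Instead of comparing row counts by eye, the number of duplicates can be counted explicitly. This is a small sketch on made-up data (the `toy` frame is hypothetical). Note that `duplicated()` compares column values only, so the `PassengerId` index is ignored.

```python
import pandas as pd

# Toy frame with one fully duplicated row (values are made up).
toy = pd.DataFrame(
    {"Name": ["A", "B", "A"], "Age": [30, 41, 30]},
    index=pd.Index([0, 1, 2], name="PassengerId"),
)

n_duplicates = toy.duplicated().sum()   # rows that repeat an earlier row
deduplicated = toy.drop_duplicates()    # keeps the first occurrence

print(f"duplicates found: {n_duplicates}")
print(f"rows dropped: {len(toy) - len(deduplicated)}")
```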

## Random sample checks
Finally, we can run **df.sample()** in Pandas and try to spot something interesting in random samples of the data.

> Note: since the command returns a new sample every time it is run, I will describe anything interesting in the Markdown section below.

In [352]:
df.sample(10).sort_values(by=['Name'])

In [353]:
df[['Pclass','Fare']].sample(10).sort_values(by=['Fare'])

In [354]:
df.sample(10).sort_values(by=['Ticket'])
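
If repeatable samples are preferred, `sample()` accepts a `random_state` seed. A minimal sketch on made-up data (the `toy` frame and the seed value 42 are arbitrary choices for illustration):

```python
import pandas as pd

# Toy frame standing in for the passenger data (values are made up).
toy = pd.DataFrame({
    "Pclass": [1, 2, 3, 1, 2, 3],
    "Fare": [80.0, 20.0, 8.0, 60.0, 15.0, 9.5],
})

# The same seed yields the same rows on every run.
first = toy.sample(3, random_state=42).sort_values(by=["Fare"])
second = toy.sample(3, random_state=42).sort_values(by=["Fare"])

print(first.equals(second))
```

With a fixed seed, observations written down in Markdown keep matching the rows shown in the output cell.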

## Conclusions
From a quick review of the samples we noticed some tendencies that might be interesting for further exploration:
* Pclass and Fare are not strongly correlated: sometimes a higher class costs less than the class below it. This may be related to the port of embarkation.
* The Cabin column contains an extremely high number of NaNs.
* There are people with siblings and parents but without a cabin number. Unless a relation can be built via the Ticket number, this information is not really valuable for finding relationships between passengers.
* There are entries for relatively young people (below 8) who had neither parents nor siblings aboard.
* The dataset does not directly represent the real-world Titanic case but is rather training material, because no cruise ship in the world could fit 100 000 passengers. However, it might have been created from one or several real-world datasets, which would also explain the NaNs in many of the columns.

# Data exploration

We can start the EDA with something we already did during the data quality check, namely getting an overview of the mean, min, max, quartiles and other statistics for the numerical columns.

In [355]:
df.describe()

Looking at the output of the describe function, we can draw some conclusions, for example:
* The minimum fare for a ticket was 0.68
* The 75th percentile of Pclass is 3, so at least a quarter of the passengers travelled in 3rd class
* The mean age of the passengers is around 38
* The maximum number of parents and children aboard was 9
* etc.
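
Percentiles of a class label are easy to misread, so it can help to compute the class shares directly. The sketch below uses a made-up `pclass` series (not the competition data) to show that a 75th percentile of 3 does not mean that 75% of passengers are in 3rd class.

```python
import pandas as pd

# Made-up class labels: half of the passengers travel in 3rd class.
pclass = pd.Series([1, 1, 2, 2, 3, 3, 3, 3], name="Pclass")

q75 = pclass.quantile(0.75)                               # what describe() reports as "75%"
shares = pclass.value_counts(normalize=True).sort_index() # fraction of passengers per class

print(f"75th percentile of Pclass: {q75}")
print(shares)
```

On the real data, `df['Pclass'].value_counts(normalize=True)` would give the actual share of each class.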

## Single variables and visualisation
We start by looking at single columns of the dataset and visualizing them using the Seaborn library.

## Age
Let's build a plot to see the overall distribution of passengers by age. Since we already saw during the data quality check that this column contains NaNs, we drop them first.

In [356]:
df_age = df['Age'].dropna().reset_index().groupby('Age').count()

plt.figure(figsize=(12,7))
sns.scatterplot(data = df_age)

As you can see in the plot, there are a lot of dots in the bottom part, as well as many dots around (0,0). The most likely reason is that some ages are not whole numbers, and that a number of passengers aged between 0 and 1 each end up in a group of their own. To make the visualisation more consistent, let's round the float values and set to 0 those ages that lie between 0 and 1.

In [357]:
df.loc[df['Age'] < 1, 'Age'] = 0
df['Age'] = df['Age'].round(0)
df_rounded_age = df['Age'].dropna().reset_index().groupby('Age').count()
plt.figure(figsize=(12,7))
plt.title("Distribution of passengers' age")
sns.scatterplot(data = df_rounded_age)

We can also use lineplot to make it more continuous.

In [358]:
plt.figure(figsize=(12,7))
sns.lineplot(data=df_rounded_age)
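
The per-age counts plotted above can also be obtained with `value_counts()`, which skips NaNs by default. This is a sketch on a made-up `age` series; `mask()` is used here in place of the `.loc` assignment so that the toy example does not modify any dataframe in place.

```python
import numpy as np
import pandas as pd

# Made-up ages, including fractional values, infants and a missing entry.
age = pd.Series([0.42, 22.0, 22.4, 35.6, np.nan, 35.0, 0.9], name="Age")

# Ages below 1 become 0, everything else is rounded to a whole number.
age = age.mask(age < 1, 0).round(0)

# Count passengers per age; NaNs are dropped by value_counts() by default.
age_counts = age.value_counts().sort_index()
print(age_counts)
```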

## Fare/Class
During the data quality check we discovered that some sample values of Fare in 1st class are below the Fare in class 2/3. Using mean values, we can look a bit deeper into that and visualise the average fare per passenger class.

In [359]:
pclass_fare_mean = df.groupby('Pclass').agg({'Fare':'mean'}).reset_index()

plt.figure(figsize=(12,7))
plt.title('Average fare per Class')
sns.barplot(x=pclass_fare_mean['Pclass'], y=pclass_fare_mean['Fare']);

We can say that, on average, 1st class is approximately 4 times more expensive than 2nd/3rd class, while the fare difference between 2nd and 3rd class is not that large.
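
The "times more expensive" comparison can be computed rather than read off the bars. A minimal sketch on made-up fares (the `toy` frame and its numbers are illustrative only, chosen so that the arithmetic is easy to follow):

```python
import pandas as pd

# Made-up fares: the class means are 80, 20 and 20.
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3],
    "Fare": [90.0, 70.0, 25.0, 15.0, 22.0, 18.0],
})

mean_fare = toy.groupby("Pclass")["Fare"].mean()

# How many times more expensive 1st class is than each class.
ratio_to_first = mean_fare.loc[1] / mean_fare
print(ratio_to_first)
```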

### Filling in missing values in **Fare** column
As we recall from the data quality check, **Fare** is one of the columns with a small proportion of NaNs. Relying on the fact that class and fare are directly related, we can fill the missing values in the **Fare** column with the mean fare of the corresponding class.

In [360]:
# Fill missing fares with the mean fare of the passenger's own class
df.loc[(df['Pclass'] == 1) & (df['Fare'].isnull()), 'Fare'] = df[(df['Pclass'] == 1) & (df['Fare'].notnull())]['Fare'].mean()
df.loc[(df['Pclass'] == 2) & (df['Fare'].isnull()), 'Fare'] = df[(df['Pclass'] == 2) & (df['Fare'].notnull())]['Fare'].mean()
df.loc[(df['Pclass'] == 3) & (df['Fare'].isnull()), 'Fare'] = df[(df['Pclass'] == 3) & (df['Fare'].notnull())]['Fare'].mean()
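
The same per-class fill can be expressed without repeating a line for each class. The sketch below is one possible alternative, shown on a made-up `toy` frame; it uses `groupby(...).transform("mean")`, which ignores NaNs when computing each class mean.

```python
import numpy as np
import pandas as pd

# Made-up fares with one gap in 1st class and one in 2nd class.
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2, 3, 3],
    "Fare": [80.0, 100.0, np.nan, 20.0, np.nan, 8.0, 10.0],
})

# Mean fare of each row's own class, aligned with the original index.
class_mean = toy.groupby("Pclass")["Fare"].transform("mean")

# Only the missing fares are replaced; existing values stay untouched.
toy["Fare"] = toy["Fare"].fillna(class_mean)
print(toy)
```

This variant also covers any class value that is present in the data, so no class can be skipped by a copy-paste slip.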

## Sex
Let's see the overall number of male and female passengers.

In [361]:
df_sex = df[['Sex']].groupby(['Sex']).size()

plt.figure(figsize=(12,7))
plt.pie(df_sex, labels = ['Female', 'Male'], autopct='%.0f%%')

As we see there are slightly more men than women in this dataset.

## Survivals

In [362]:
df_surv = df[['Survived']].groupby(['Survived']).size()

plt.figure(figsize=(12,7))
plt.pie(df_surv, labels = ['Not survived', 'Survived'], autopct = '%.0f%%', colors = sns.color_palette('light:#5A9')[0:2])

## Exploring relationships
Looking at the raw data is not as interesting as finding correlations and dependencies :)

We will analyse this dataset with the goal of finding correlations and relationships that affected the survival rate, and we will visualize them using different plots from Seaborn.

### Number of survived passengers per age

In [363]:
df_age_surv = df.groupby(['Age'])['Survived'].agg(['sum'])

plt.figure(figsize=(12,12))
sns.heatmap(df_age_surv)

This heatmap shows the number of survivors at each age, but let's also have a look at how the survival rate is distributed across all of the passengers.

In [364]:
plt.figure(figsize=(15,7))

sns.violinplot(data = df, x = 'Age', y = 'Survived')

The violin plot above lets us see the whole distribution of the survival rate across all ages, but since there are passengers for essentially every year of age between 0 and 87, the graph is a bit hard to read.

A possible solution can be to group ages into broader categories, e.g., kids, adults and seniors, and then create a more readable plot for each of such categories.

In [365]:
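# Start with an empty label; rows with a missing Age keep no age group
df['Age group'] = None
# Assign with .loc and boolean masks; chained .iloc assignment is unreliable
df.loc[df['Age'] < 18, 'Age group'] = "Children"
df.loc[(df['Age'] >= 18) & (df['Age'] <= 60), 'Age group'] = "Adults"
df.loc[df['Age'] > 60, 'Age group'] = "Seniors"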

plt.figure(figsize=(15,7))

sns.barplot(data = df, x = 'Age group', y = 'Survived')

Now we can state that the survival rate for senior passengers is the highest among all age groups. However, if we recall the heatmap above, the number of survivors in this age group is relatively small compared to adults.
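
To see the rate and the absolute numbers side by side, the age groups can be aggregated in one table. The following sketch uses made-up data and `pd.cut` as an alternative way to build the groups. The bin edges assume whole-number ages (as after the rounding step), so `(-inf, 17]` corresponds to "below 18".

```python
import numpy as np
import pandas as pd

# Made-up passengers, including one with a missing age.
toy = pd.DataFrame({
    "Age": [5.0, 17.0, 18.0, 40.0, 60.0, 61.0, 75.0, np.nan],
    "Survived": [1, 0, 0, 1, 0, 1, 1, 0],
})

# Right-closed bins: (-inf, 17], (17, 60], (60, inf); NaN ages stay unlabelled.
toy["Age group"] = pd.cut(
    toy["Age"],
    bins=[-np.inf, 17, 60, np.inf],
    labels=["Children", "Adults", "Seniors"],
)

# Survival rate, number of survivors and group size per age group.
stats = toy.groupby("Age group", observed=True)["Survived"].agg(
    rate="mean", survivors="sum", passengers="count"
)
print(stats)
```

A table like this makes it visible when a high rate rests on a small group.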

## Survival rate per sex
Let's find out how survival relates to the sex of the passenger using a simple barplot.

In [366]:
plt.figure(figsize=(15,7))

sns.barplot(data = df, x = 'Sex', y = 'Survived')

As we see in the plot above, the survival rate for women is much higher than for men. This can be explained by the common rule that women and children, as well as seniors, are prioritized during an evacuation.

## Survival rate per Pclass
Let's use a barplot to see whether holding a 1st class ticket would have increased the chances of survival.

In [367]:
plt.figure(figsize=(15,7))

sns.barplot(data = df, x = 'Pclass', y = 'Survived')

The difference in survival rate between 1st and 2nd class is not very large, whereas the chances of surviving with a 3rd class ticket were pretty low.

## Survival rate per class and sex
We saw that a 1st or 2nd class ticket would increase a passenger's chances of survival, and that being a woman would do so as well. Now let's check whether a male passenger could have increased his chances of survival by buying a higher class ticket, compared to women of any passenger class. For that we can use a pointplot with hue.

In [368]:
plt.figure(figsize=(15,7))

sns.pointplot(data = df, x = 'Pclass', y = 'Survived', hue = 'Sex')

As shown in the plot above, a man with a higher class ticket had better chances of survival only compared to other men with lower class tickets; his chances were still not high enough to surpass those of women with a 3rd class ticket.
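
The comparison read off the pointplot can be checked numerically with a pivot table of survival rates. This is a sketch on made-up data (the `toy` frame is hypothetical and its rates say nothing about the real dataset).

```python
import pandas as pd

# Made-up passengers in two classes.
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 1, 3, 3, 3, 3],
    "Sex": ["male", "male", "female", "female", "male", "male", "female", "female"],
    "Survived": [1, 0, 1, 1, 0, 0, 1, 0],
})

# Rows: passenger class, columns: sex, cells: mean of Survived (survival rate).
rates = toy.pivot_table(index="Pclass", columns="Sex", values="Survived", aggfunc="mean")

best_male = rates["male"].max()        # highest male rate over all classes
worst_female = rates["female"].min()   # lowest female rate over all classes

print(rates)
print(f"best male rate: {best_male}, worst female rate: {worst_female}")
```

On the real `df`, comparing `best_male` with `worst_female` would state the conclusion above as a single inequality.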

# Summary
Let's sum up some interesting findings that we have observed during the EDA of the given dataset:
* 43% of the passengers survived
* Women have a significantly higher survival rate than men, and senior passengers and children have a higher survival rate than adults. Overall, the rule that women, children, and seniors are saved first is confirmed
* On average, paying extra for 1st class significantly increases survival chances for men. For women, there is no significant difference between the 1st and 2nd class survival rates, meaning that women could have saved some money
* The fare gap between 1st and 2nd class is much larger than the gap between 2nd and 3rd.