# 0. Loading dataset

In [None]:
# Import pandas, matplotlib and seaborn libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Set the parameters and the style for plotting
params = {'figure.figsize':(12,8),
         'axes.labelsize':13,
         'axes.titlesize':16,
         'xtick.labelsize':11,
         'ytick.labelsize':11
         }
plt.rcParams.update(params)
sns.set_style("whitegrid")

We will again use the well-known Titanic dataset, this time to explore missing data. Let's start by loading the dataset.

In [None]:
# Load the dataset 'Data/titanic_data.csv' and store it to variable data
data = pd.read_csv('Data/titanic_data.csv')
data

# 1. First look at the missing values

To detect missing values, we can use the pandas `isnull()` method together with `.sum()`.

In [None]:
# Get the summary of missing values using .isnull().sum()
data.isnull().sum()

We can see that three columns contain missing values: 'Age', 'Cabin' and 'Embarked'. To compute the proportion of missing values instead of the count, we can use `.mean()`; multiplying by 100 turns the proportion into a percentage.

In [None]:
# Compute the percentage of missing values in each column of the dataset
print(data.isnull().mean()*100)
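
As a possible extension, the count and the percentage can be combined into one summary table. The sketch below uses a small made-up frame `toy` (not the Titanic data) so that it runs on its own; the names `toy`, `missing_summary`, `n_missing` and `pct_missing` are our own choices for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Titanic data (values are made up for illustration)
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, 35.0],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Fare': [7.25, 71.28, 7.92, 53.10],
})

# Combine the count and the percentage of missing values in one table
missing_summary = pd.DataFrame({
    'n_missing': toy.isnull().sum(),
    'pct_missing': toy.isnull().mean() * 100,
})

# Keep only the columns that contain missing values, most affected first
missing_summary = missing_summary[missing_summary['n_missing'] > 0]
missing_summary = missing_summary.sort_values('pct_missing', ascending=False)
print(missing_summary)
```

The same expressions should work on `data` if `toy` is replaced by it.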

If we want to visualize the location of missing values, we can use seaborn's `heatmap`, which shows where the missing values occur. We set the parameter `cbar = False` because the colorbar does not need to be drawn in this case. Alternatively, we can use a basic bar plot of the number of missing values per column.

In [None]:
# Visualize missing values using heatmap
sns.heatmap(data.isnull(), cbar = False);

In [None]:
# TASK 1 >>> Select only the three columns that contain missing values and assign them to variable data_copy
#            Visualize them using seaborn

In [None]:
# Visualize missing values using barplot
data.isnull().sum().plot(kind = 'bar');

# 2. Concepts of missing values

According to Rubin's theory $^{1}$, every datapoint has some probability of being missing in the dataset. The process that governs these probabilities is called **the missing data mechanism**. 
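
One common way to formalize this, sketched here in notation that is our own assumption rather than part of this notebook: let $M$ be an indicator equal to 1 when a value is missing, and split the data into an observed part $Y_{obs}$ and a missing part $Y_{mis}$. The three mechanisms discussed in the sections below then differ in what the probability of being missing is allowed to depend on:

$$
\begin{aligned}
\text{MCAR:} \quad & P(M = 1 \mid Y_{obs}, Y_{mis}) = P(M = 1) \\
\text{MAR:} \quad & P(M = 1 \mid Y_{obs}, Y_{mis}) = P(M = 1 \mid Y_{obs}) \\
\text{MNAR:} \quad & P(M = 1 \mid Y_{obs}, Y_{mis}) \text{ still depends on } Y_{mis}
\end{aligned}
$$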

## 2.1 MNAR: Missing data Not At Random

MNAR means that the probability of being missing varies for reasons that are unknown to us. Let's look at the columns 'Age' and 'Cabin'. We found that the column 'Cabin' contains approximately 77% missing values and the column 'Age' almost 20%.
The age or cabin could not be established for people who did not survive that night, whereas we assume that survivors were asked for this information. But can we infer this from the data? If our assumption holds, observations of people who did not survive should have more missing values. Let's find out.



In [None]:
# Filter the dataset based on people who survived
survived = data.query('Survived == 1')
survived

In [None]:
# Print the percentage of missing values in column 'Cabin' in case of survivors
print('The percentage of missing values: {0:.1f} %'.format(survived['Cabin'].isna().mean()*100))

In [None]:
# Filter the dataset based on people who did not survive
not_survived = data.query('Survived == 0')
not_survived

In [None]:
# Print the percentage of missing values in column 'Cabin' in case of people who didn't survive
print('The percentage of missing values: {0:.1f} %'.format(not_survived['Cabin'].isna().mean()*100))

The results match our expectation: the column 'Cabin' has more missing values for people who did not survive (approximately 87.6%) than for survivors (60.2%).
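
The two filtered comparisons above can also be written as a single grouped computation. This is a minimal sketch on a made-up frame `toy`, so the printed numbers are not the Titanic results; the cabin labels and the name `missing_by_group` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame with the same column names as the Titanic data (values are made up)
toy = pd.DataFrame({
    'Survived': [1, 1, 1, 1, 0, 0, 0, 0],
    'Cabin': ['A1', 'A2', np.nan, np.nan, np.nan, np.nan, np.nan, 'A3'],
})

# Flag the missing cabins, then average the flags within each 'Survived' group
missing_by_group = toy['Cabin'].isna().groupby(toy['Survived']).mean() * 100
print(missing_by_group)
```

Grouping the boolean mask avoids creating one filtered copy of the dataset per group, which can be convenient when the grouping column has more than two values.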

In [None]:
# TASK 2 >>>> Now it's your turn to explore the column 'Age' in the same way 
#             and think about whether the values are missing not at random

## 2.2 MCAR: Missing data Completely At Random 

When data are missing completely at random it means that the probability of being missing is the same for all observations in the dataset, i.e. the cause of the missing data is unrelated to the data.
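
To make this definition concrete, we can simulate it. The sketch below builds a synthetic frame `sim` (the columns `group` and `value`, the 10% rate and the seed are all assumptions for illustration, not part of the Titanic data) and deletes values with the same probability for every row.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)
n = 10_000

# Synthetic data: an observed grouping column and a numeric column
sim = pd.DataFrame({
    'group': rng.choice(['A', 'B'], size=n),
    'value': rng.normal(size=n),
})

# MCAR: every row has the same 10% chance of losing its value
mcar_mask = rng.random(n) < 0.10
sim.loc[mcar_mask, 'value'] = np.nan

# The share of missing values should be roughly the same in both groups
rates = sim['value'].isna().groupby(sim['group']).mean()
print(rates)
```

Because the mask ignores every column of `sim`, the missing rate should come out close to 10% in each group, up to sampling noise.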

Let's take the column 'Embarked' and its missing values as an example.

In [None]:
# Get the rows where the values in 'Embarked' column are missing
data[data['Embarked'].isnull()]

Mrs. Stone was travelling in first class with her maid, Miss Amelie Icard. They occupied the same cabin, B28, but the data about their port of embarkation are missing. We cannot tell whether the missing 'Embarked' values depend on any other variable. We can also see that both women survived, so we assume that they were asked for this information. It could be that the information was simply lost when the dataset was created. If so, the probability of losing this information is the same for every person on the Titanic, although it would probably be impossible to prove.

Out of curiosity, you can find more information about Mrs. Stone and her maid [here](https://www.encyclopedia-titanica.org/titanic-survivor/martha-evelyn-stone.html), where the port of embarkation is filled in.

## 2.3 MAR: Missing At Random

We can say that the data are missing at random if the probability of being missing is the same only within groups defined by the observed data. An example of this case is a sample taken from a population where the probability of being included depends on some known property.
Unfortunately, I have not yet been able to find a suitable dataset to demonstrate this.
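
In the absence of a real dataset, a simulation can at least illustrate the idea. The sketch below is built on assumptions of our own: a synthetic frame `sim` with an always-observed column `group`, and missing probabilities of 5% and 30% chosen arbitrarily for the two groups.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)
n = 10_000

# Synthetic data: 'group' is always observed, 'value' will get missing entries
sim = pd.DataFrame({
    'group': rng.choice(['A', 'B'], size=n),
    'value': rng.normal(size=n),
})

# MAR: the chance of being missing depends only on the observed 'group' column
p_missing = sim['group'].map({'A': 0.05, 'B': 0.30})
mar_mask = rng.random(n) < p_missing.to_numpy()
sim.loc[mar_mask, 'value'] = np.nan

# Within each group the missing rate is constant, but it differs between groups
rates = sim['value'].isna().groupby(sim['group']).mean()
print(rates)
```

The missing rate should come out near 5% in group A and near 30% in group B. Within a group, whether a value is missing does not depend on the value itself, which is what separates this setup from MNAR.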

# Appendix

$^{1}$ Inference and missing data, Donald B. Rubin, Biometrika, Volume 63, Issue 3, December 1976, Pages 581–592.