# Missing values

Missing data is the absence of values in certain observations of a variable. Missing data is an unavoidable problem in most data sources and may have a significant impact on the conclusions that we derive from the data.


## Why is the data missing?

The source of missing data can vary. These are just some examples:

- The value was forgotten, lost, or not stored properly.

- The value does not exist.

- The value can't be known or identified.

To give real-life examples, a person may choose not to complete all fields in a form if they are not mandatory. That would introduce missing data. Sometimes people do not want to disclose some information, for example, income, or they do not know the answers to the questions being asked. 

Sometimes the value for a certain variable does not exist. For example, in the variable "total debt as percentage of total income" (very common in financial data), if the person has no income, the calculation involves a division by zero. The percentage is undefined, and therefore it will be a missing value.
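As a minimal sketch of the debt-to-income case, the snippet below uses a made-up table of three borrowers. The column names `total_debt`, `total_income` and `debt_pct_income` are assumptions for illustration and do not come from the course datasets.

```python
import numpy as np
import pandas as pd

# Hypothetical borrowers; the column names are assumptions for illustration.
loans = pd.DataFrame({
    'total_debt': [5000.0, 1200.0, 0.0],
    'total_income': [40000.0, 0.0, 0.0],
})

# With float columns, dividing by zero income yields inf (debt > 0)
# or NaN (0 / 0) instead of raising an error.
ratio = loans['total_debt'] / loans['total_income'] * 100

# Treat the undefined ratios as missing values.
loans['debt_pct_income'] = ratio.replace([np.inf, -np.inf], np.nan)

print(loans)
```

Only the first borrower has a defined percentage; the other two end up as NaN.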

It's important to understand why data is missing, in other words, the mechanism of missing data. We may process the missing information differently depending on this mechanism. Furthermore, identifying the source of missing data allows us to take steps to regulate that source and reduce the amount of missing data as data collection progresses.


## Missing data mechanisms

There are 3 mechanisms that lead to missing data. In two of them, the data goes missing at random, and in the third one there is a systematic loss of data.


### Missing Completely at Random, MCAR:

If the likelihood of a value being missing is the same for all observations, and is unrelated to any other value in the dataset, the data is missing completely at random (MCAR). The data points that are missing are a random subset of the observations.

If data is MCAR, then disregarding observations with missing data would not bias the inferences made.
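The claim can be checked with a small simulation. This is only a sketch on synthetic data: the variable `age`, its distribution and the 20% missing rate are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic variable (not taken from the course datasets).
n = 100_000
age = pd.Series(rng.normal(loc=30, scale=10, size=n))

# MCAR: every observation has the same 20% chance of being missing.
mcar_mask = rng.random(n) < 0.2
age_mcar = age.mask(mcar_mask)

# mean() skips NaN, so this is the mean of the complete cases.
print('True mean:         ', round(age.mean(), 2))
print('Complete-case mean:', round(age_mcar.mean(), 2))
```

The two means should be very close, because the removed values are a random subset of all the values.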


### Missing at Random, MAR: 

If the probability of a value being missing depends on available information (i.e., other observed variables), then the value is missing at random (MAR). There is a relationship between the likelihood of a value being missing and the observed data.

For example, if men are more likely to disclose their weight than women, weight is MAR. The weight information will be missing at random for the men and women who did not disclose their weight, but as men are more prone to disclose it, there will be more missing values for women than for men.

In this example, missing data points are no longer a random subset of the total observations.

If we decide to proceed with the variable with missing values (in this case weight), we might want to include the variable gender to control the bias in weight for the missing observations.
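The weight example can be simulated as follows. This is a sketch on synthetic data: the weight distributions and the disclosure rates (10% missing for men, 50% for women) are made-up assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100_000

# Synthetic data; distributions and disclosure rates are assumptions.
gender = rng.choice(['male', 'female'], size=n)
weight = np.where(gender == 'male',
                  rng.normal(80, 10, n),
                  rng.normal(65, 10, n))
df = pd.DataFrame({'gender': gender, 'weight': weight})

# MAR: the chance of a missing weight depends only on the observed gender.
p_missing = np.where(df['gender'] == 'male', 0.1, 0.5)
df['weight_obs'] = df['weight'].mask(rng.random(n) < p_missing)

# The overall observed mean is pulled towards the men's weights...
print(df[['weight', 'weight_obs']].mean())

# ...but within each gender, observed and true means agree closely.
print(df.groupby('gender')[['weight', 'weight_obs']].mean())
```

Ignoring gender overestimates the mean weight, whereas conditioning on gender removes most of the bias. This is why including gender in the analysis helps.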


### Missing Not at Random, MNAR: 

If the probability of a value being missing depends on the missing value itself, or on information that was not recorded, then that data is missing not at random (MNAR). For example, if people failed to fill out a depression survey because of their depression, those data points would be missing not at random. The missing data occurs due to the depression.
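A rough simulation of the depression survey illustrates the effect. The score scale (0 to 1) and the rule that the probability of not responding equals the score are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100_000

# Synthetic depression score between 0 and 1 (an assumption).
score = pd.Series(rng.uniform(0, 1, n))

# MNAR: the probability of not responding grows with the score itself.
score_obs = score.mask(rng.random(n) < score)

print('True mean:    ', round(score.mean(), 2))
print('Observed mean:', round(score_obs.mean(), 2))
```

The observed mean underestimates the true mean, and no other variable in the data can explain or correct the gap.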

Depending on the mechanism by which missing values occur, we may choose different missing data imputation methods.

## In this Demo:

In this notebook we will:

- Detect and quantify missing values.

- Identify the mechanisms of missing data.

We will use the financial dataset from a peer-to-peer finance company and the Titanic dataset.

To obtain the datasets, visit the lecture **Download datasets** in **Section 2**.

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

# To display all columns in the dataset.
pd.set_option('display.max_columns', None)

In [2]:
# Let's load the titanic dataset.
data = pd.read_csv('../titanic.csv')

# Let's inspect the first 5 rows.
data.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2.0,,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22,S,11.0,,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22,S,,,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22,S,,,"Montreal, PQ / Chesterville, ON"


In pandas, missing values are represented as NaN. See, for example, the empty value of the variable body in the first row.
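A short sketch of how NaN behaves, using a made-up Series rather than the Titanic data:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.5, np.nan, None])

# None is converted to NaN in a float Series.
print(s)

# NaN is not equal to itself, so == cannot be used to find it.
print(np.nan == np.nan)

# isnull() (alias: isna()) is the reliable way to detect missing values.
print(s.isnull())
```

This is why the cells that follow rely on `isnull()` rather than on an equality comparison.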

In [3]:
# We can quantify the missing values using
# the isnull() method plus the sum() method:

data.isnull().sum()

pclass          0
survived        0
name            0
sex             0
age           263
sibsp           0
parch           0
ticket          0
fare            1
cabin        1014
embarked        2
boat          823
body         1188
home.dest     564
dtype: int64

There are 263 missing values for Age, 1014 for Cabin and 2 for Embarked.

In [4]:
# We can also use the mean() method after isnull()
# to obtain the fraction of missing values:

data.isnull().mean()

pclass       0.000000
survived     0.000000
name         0.000000
sex          0.000000
age          0.200917
sibsp        0.000000
parch        0.000000
ticket       0.000000
fare         0.000764
cabin        0.774637
embarked     0.001528
boat         0.628724
body         0.907563
home.dest    0.430863
dtype: float64

In the variable Age, 20% of the values are missing.

In the variable Cabin (the cabin in which the passenger was traveling), 77% of the values are missing.

In the variable Embarked (the port from which the passenger boarded the Titanic), about 0.15% of the values are missing.
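Counts and percentages can also be combined in one table. The helper `missing_summary` below is a hypothetical convenience function, not part of pandas, and it is demonstrated on a tiny made-up frame.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Return count and percentage of missing values per column,
    keeping only columns with at least one missing value."""
    summary = pd.DataFrame({
        'n_missing': df.isnull().sum(),
        'pct_missing': df.isnull().mean() * 100,
    })
    summary = summary[summary['n_missing'] > 0]
    return summary.sort_values('pct_missing', ascending=False)

# Tiny made-up frame standing in for the Titanic data.
toy = pd.DataFrame({
    'age': [29.0, np.nan, 2.0, 30.0],
    'cabin': ['B5', None, None, None],
    'sex': ['female', 'male', 'female', 'male'],
})

print(missing_summary(toy))
```

Sorting by percentage puts the most affected variables at the top, which is convenient for wide datasets.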

## Mechanisms of Missing Data

### Missing data Not At Random (MNAR)

The missing values of the variables **age** and **cabin** were introduced systematically. For many of those who did not survive, the **age** or the **cabin** remains unknown. The people who survived, in contrast, could be asked for that information.

Can we infer this by looking at the data?

If data is MNAR, we could expect a greater number of missing values for people who did not survive.

Let's have a look.

In [5]:
# Let's create a binary variable that indicates 
# if the value of cabin is missing.

data['cabin_null'] = np.where(data['cabin'].isnull(), 1, 0)

In [6]:
# Let's evaluate the percentage of missing values in
# cabin for the people who survived vs the non-survivors.

# The variable Survived takes the value 1 if the passenger
# survived, or 0 otherwise.

# Group data by Survived vs Non-Survived
# and find the percentage of NaN for Cabin.

data.groupby(['survived'])['cabin_null'].mean()

survived
0    0.873918
1    0.614000
Name: cabin_null, dtype: float64

In [7]:
# Another way of doing the above, with fewer lines
# of code:

data['cabin'].isnull().groupby(data['survived']).mean()

survived
0    0.873918
1    0.614000
Name: cabin, dtype: float64

The percentage of missing values is higher for those who did not survive (87% vs 61% for survivors). This finding could support our hypothesis that the data is missing because after people died, the information could not be retrieved.

**Note**: to truly understand whether the data is missing not at random, we would need to become very familiar with the way the data was collected. Analysing the dataset can only point us in the right direction or help us make assumptions.

In [8]:
# Let's do the same for the variable age:

# First, we create a binary variable to indicate
# if a value is missing.

data['age_null'] = np.where(data['age'].isnull(), 1, 0)

# Then we look at the mean in survivors and non-survivors:
data.groupby(['survived'])['age_null'].mean()

survived
0    0.234858
1    0.146000
Name: age_null, dtype: float64

In [9]:
# The same with simpler code :)

data['age'].isnull().groupby(data['survived']).mean()

survived
0    0.234858
1    0.146000
Name: age, dtype: float64

We observe more missing data points for the people who did not survive. The analysis therefore suggests that there was a systematic loss of data: people who did not survive had more missing information. Presumably, the method chosen to gather the information contributes to the generation of this missing data.
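The same comparison can be run for every column at once by grouping the boolean frame returned by `isnull()`. The sketch below uses a few made-up rows in place of the Titanic data.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for the Titanic data.
toy = pd.DataFrame({
    'survived': [0, 0, 0, 1, 1, 1],
    'age': [np.nan, np.nan, 40.0, 29.0, np.nan, 2.0],
    'cabin': [None, None, None, 'B5', 'C22', None],
    'fare': [7.25, 8.05, 13.0, 211.34, 151.55, 26.0],
})

# Fraction of missing values per column, split by survival.
by_group = (
    toy.drop(columns='survived')
       .isnull()
       .groupby(toy['survived'])
       .mean()
)

print(by_group)
```

Columns whose missing fraction differs markedly between the groups are candidates for a closer look at how the data was collected.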

### Missing data Completely At Random (MCAR)

In [10]:
# In the titanic dataset, there are also missing values
# for the variable Embarked.

# Let's have a look.

# Let's slice the dataframe to show only the observations
# with missing values for Embarked.

data[data['embarked'].isnull()]

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest,cabin_null,age_null
168,1,1,"Icard, Miss. Amelie",female,38.0,0,0,113572,80.0,B28,,6,,,0,0
284,1,1,"Stone, Mrs. George Nelson (Martha Evelyn)",female,62.0,0,0,113572,80.0,B28,,6,,"Cincinatti, OH",0,0


These 2 women were traveling together. Miss Icard was the maid of Mrs. Stone.

A priori, there does not seem to be an indication that the missing information in the variable "embarked" is dependent on any other variable, and the fact that these women survived means that they could have been asked for this information.

It is very likely the values were lost at the time of building the dataset.

If these values are MCAR, the likelihood of data missing for these two women is the same as the likelihood of data missing for any other person on the Titanic. Of course, this is hard, if not impossible, to prove.

### Missing data at Random (MAR)

We will use the financial dataset from the peer-to-peer lending company.

We will look at the variables "employment" and "years in employment", both declared by the borrowers at the time of applying for a loan. 

In this example, data missing in employment are associated with data missing in time in employment.

In [11]:
# Let's load the dataset with just the 2
# variables.

data = pd.read_csv('../loan.csv', usecols=['employment', 'time_employed'])

data.head()

Unnamed: 0,employment,time_employed
0,Teacher,<=5 years
1,Accountant,<=5 years
2,Statistician,<=5 years
3,Other,<=5 years
4,Bus driver,>5 years


In [12]:
# Let's check the percentage of missing data.

data.isnull().mean()

employment       0.0611
time_employed    0.0529
dtype: float64

Both variables have roughly the same percentage of missing observations.

In [13]:
# Let's inspect the different employment types.

# Number of different employments.
print('Number of employments: {}'.format(
    len(data['employment'].unique())))

# Examples of employments.
data['employment'].unique()

Number of employments: 12


array(['Teacher', 'Accountant', 'Statistician', 'Other', 'Bus driver',
       'Secretary', 'Software developer', 'Nurse', 'Taxi driver', nan,
       'Civil Servant', 'Dentist'], dtype=object)

Note the missing data along with the different employment values.

In [14]:
# Let's inspect the variable time employed.

data['time_employed'].unique()

array(['<=5 years', '>5 years', nan], dtype=object)

The customer can't enter a value for employment time if they are not employed. They could be students, retired, self-employed, or something else. Note how these 2 variables are related to each other.

In [15]:
# Let's calculate the proportion of missing data 
# in time_employed variable when
# customers declared employment.

# Customers who declared employment
t = data[~data['employment'].isnull()]

# Percentage of missing data in time employed
t['time_employed'].isnull().mean()

0.0005325380764724678

In [16]:
# Let's do the same for those borrowers who did not 
# report employment.

# Customers who did not declare employment.
t = data[data['employment'].isnull()]

# Percentage of missing data in time employed.
t['time_employed'].isnull().mean()

0.8576104746317512

The number of borrowers who have reported occupation and have missing values in time_employed is minimal. Customers who did not report an occupation, on the other hand, mostly show missing values in the time_employed variable.

This further supports the hypothesis that the missing values in employment are related to the missing values in time_employed.

This is an example of MAR.
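The two proportions computed above can also be obtained in a single step with `pd.crosstab`. The rows below are made up for illustration and stand in for the loan data.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for the loan data.
toy = pd.DataFrame({
    'employment': ['Teacher', None, 'Nurse', None, 'Dentist', None],
    'time_employed': ['<=5 years', None, '>5 years', None, None, '<=5 years'],
})

# Label each value as missing or reported for a readable table.
labels = {True: 'missing', False: 'reported'}
emp = toy['employment'].isnull().map(labels)
time = toy['time_employed'].isnull().map(labels)

# Rows: employment status. Columns: time_employed status.
# normalize='index' turns each row into proportions.
table = pd.crosstab(emp, time, normalize='index')

print(table)
```

Each row shows, for borrowers with reported or missing employment, the share whose time_employed is missing or reported.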

**That is all for this demonstration. I hope you enjoyed the notebook, and I'll see you in the next one.**