# Feature Engineering - Analysis of Missing Values and Their Mechanisms

Missing data, or missing values, occur when no value is stored for certain observations of a variable.

Incomplete data is an unavoidable problem in most data sources, and may have a significant impact on the conclusions that can be derived from the data.

#### Why is data missing?
Missing data can come from many different sources. These are just a few examples:

- A value is missing because it was forgotten, lost or not stored properly
- For a certain observation, the value does not exist
- The value can’t be known or identified

In many organisations, information is collected into a form by a person talking with a client on the phone, or alternatively, by customers filling forms online. Often, the person entering the data does not complete all the fields in the form. Many of the fields are not compulsory, which may lead to missing values.

The reasons for omitting the information can vary. Perhaps the person does not want to disclose some information (for example, income), does not know the answer, or the question does not apply to their circumstances. Alternatively, the person collecting the data may want to spare the customer some time and skip the questions they consider less relevant.

There are other cases where the value for a certain variable does not exist. For example, in the variable ‘total debt as percentage of total income’ (very common in financial data), if the person has no income, the ratio would require dividing by zero and is undefined, so it is recorded as a missing value.
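As a minimal sketch of the debt-to-income case, the snippet below builds a tiny hypothetical table (the column names `total_debt`, `total_income` and `debt_pct_income` are made up for illustration) and marks the ratio as missing when income is zero. Without the `replace` step, pandas float division would return `inf` rather than a missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the third customer has no income
loans = pd.DataFrame({
    "total_debt": [5000.0, 12000.0, 3000.0],
    "total_income": [40000.0, 60000.0, 0.0],
})

# Treat zero income as "ratio undefined" instead of dividing by zero
income = loans["total_income"].replace(0, np.nan)
loans["debt_pct_income"] = 100 * loans["total_debt"] / income

print(loans)
```

The first two customers get 12.5 and 20.0, while the customer with no income gets `NaN`, which is a missing value that exists for a structural reason rather than because someone forgot to record it.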

It is important to understand the mechanisms by which missing data are introduced in a dataset. Depending on the mechanism, we may choose to process the missing values differently. In addition, by knowing the source of missing data, we may be able to control that source and reduce the amount of missing information in future data collection.

#### Missing Data Mechanisms
There are three mechanisms that lead to missing data. Two of them involve data going missing randomly or almost randomly, and the third involves a systematic loss of data.

#### Missing Completely at Random (MCAR):
A variable is missing completely at random (MCAR) if the probability of being missing is the same for all the observations. When data is MCAR, there is absolutely no relationship between the data missing and any other values, observed or missing, within the dataset. In other words, those missing data points are a random subset of the data. There is nothing systematic going on that makes some data more likely to be missing than others. If values for observations are missing completely at random, then disregarding those cases would not bias the inferences made, although it does reduce the sample size.
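To make the MCAR definition concrete, here is a small simulation sketch. The data are synthetic and the 20% missing rate is an arbitrary assumption: every row gets the same chance of losing its `weight` value, regardless of any other column.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Synthetic data, for illustration only
df = pd.DataFrame({
    "gender": rng.choice(["male", "female"], size=n),
    "weight": rng.normal(75, 12, size=n),
})

# MCAR: every row has the same 20% chance of losing its weight value
mcar_mask = rng.random(n) < 0.20
df["weight_mcar"] = df["weight"].mask(mcar_mask)

overall_rate = df["weight_mcar"].isnull().mean()
rate_by_gender = df["weight_mcar"].isnull().groupby(df["gender"]).mean()

print(overall_rate)
print(rate_by_gender)
print(df["weight"].mean(), df["weight_mcar"].mean())
```

Under this setup the missing rate should be roughly the same for both genders, and the mean of the remaining values should stay close to the mean of the full data, which is what "dropping the rows does not bias the inference" looks like in practice.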

#### Missing at Random (MAR):
MAR occurs when there is a relationship between the propensity of missing values and the observed data. In other words, the probability of an observation being missing depends on available information (i.e., other variables in the dataset). For example, if men are more likely to disclose their weight than women, weight is MAR. The weight information will be missing at random for those men and women who do not disclose their weight, but as men are more prone to disclose it, there will be more missing values for women than for men.

In a situation like the above, if we decide to proceed with the variable with missing values (in this case weight), we might benefit from including gender to control the bias in weight for the missing observations.
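The weight example can be sketched with synthetic data. All numbers below are assumptions chosen for illustration (men heavier on average, women withholding weight 40% of the time and men 10%). The sketch compares the naive mean of the observed weights with a mean computed after filling the gaps within each gender.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 10_000

gender = rng.choice(["male", "female"], size=n)
# Assumed toy distributions: men are heavier on average
weight = np.where(gender == "male",
                  rng.normal(82, 10, size=n),
                  rng.normal(68, 10, size=n))
df = pd.DataFrame({"gender": gender, "weight": weight})

# MAR: the probability of a missing weight depends only on the observed gender
p_missing = np.where(df["gender"] == "female", 0.40, 0.10)
df["weight_mar"] = df["weight"].mask(rng.random(n) < p_missing)

rate_by_gender = df["weight_mar"].isnull().groupby(df["gender"]).mean()

true_mean = df["weight"].mean()
naive_mean = df["weight_mar"].mean()  # women are under-represented here

# Use gender to control the bias: fill each gap with the mean of its gender
group_means = df.groupby("gender")["weight_mar"].transform("mean")
adjusted_mean = df["weight_mar"].fillna(group_means).mean()

print(rate_by_gender)
print(true_mean, naive_mean, adjusted_mean)
```

In this toy setting the naive mean drifts upward because fewer women are observed, while the gender-aware fill brings the estimate back close to the true mean. Group-wise mean filling is only one simple way of using the observed variable, not the only option.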

#### Missing Not at Random (MNAR):
Data are missing not at random (MNAR) when there is a systematic mechanism behind the missing values: the probability of a value being missing depends on the value itself or on factors that were not recorded. For example, MNAR would occur if people failed to fill in a depression survey because of their level of depression. Here, the missingness is related to the outcome, depression. Similarly, when a financial company asks customers for bank and identity documents in order to prevent identity fraud, fraudsters impersonating someone else will typically not upload documents, because they do not have them. Therefore, there is a systematic relationship between the missing documents and the target we want to predict: fraud.
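The depression-survey example can be simulated in a few lines. This is a sketch under invented assumptions: a hypothetical score on a 0 to 10 scale, and a non-response probability that rises linearly with the score itself.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 10_000

# Hypothetical depression score on a 0-10 scale
score = rng.uniform(0, 10, size=n)

# MNAR: the chance of a missing answer grows with the score itself
p_missing = 0.05 + 0.07 * score  # 5% at score 0, 75% at score 10
observed = pd.Series(score).mask(rng.random(n) < p_missing)

true_mean = score.mean()
observed_mean = observed.mean()
missing_rate = observed.isnull().mean()

print(true_mean, observed_mean, missing_rate)
```

Because high scores are the ones that tend to disappear, the mean of the observed answers underestimates the true mean, and no other column in the data can explain why. That is what makes MNAR the hardest mechanism to correct for.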

Understanding the mechanism by which data is missing is important to decide which methods to use to impute the missing values.

#### In this blog
In the following cells we will:

- Learn how to detect and quantify missing values
- Try to identify the 3 different mechanisms of missing data introduction

We will use the toy Loan dataset and the Titanic dataset.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
data = pd.read_csv("data/titanic.csv")
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


## Find null values and their counts

In [3]:
data.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [4]:
data.isnull().mean()

PassengerId    0.000000
Survived       0.000000
Pclass         0.000000
Name           0.000000
Sex            0.000000
Age            0.198653
SibSp          0.000000
Parch          0.000000
Ticket         0.000000
Fare           0.000000
Cabin          0.771044
Embarked       0.002245
dtype: float64
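The count and the fraction can also be combined into one table that lists only the affected columns. The helper below is a hypothetical convenience function (`missing_summary` is not a pandas API), shown on a tiny made-up frame so that it runs on its own.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Return count and fraction of missing values per column, worst first."""
    summary = pd.DataFrame({
        "n_missing": df.isnull().sum(),
        "frac_missing": df.isnull().mean(),
    })
    # Keep only the columns that actually have missing values
    summary = summary[summary["n_missing"] > 0]
    return summary.sort_values("frac_missing", ascending=False)

# Made-up toy frame, for illustration only
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.1],
})

summary = missing_summary(toy)
print(summary)
```

Calling `missing_summary(data)` on the Titanic frame should surface the same three columns found above (Cabin, Age and Embarked), ordered by how much is missing.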

## Missing Not at Random (MNAR)

In [6]:
# Flag each row: 1 if Cabin is missing, else 0
data["Cabin_Null"] = np.where(data["Cabin"].isnull(), 1, 0)

In [7]:
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Cabin_Null
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,1
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,0
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,1


In [8]:
# Fraction of missing Cabin values for passengers who did not survive (0) and who survived (1)
data.groupby(["Survived"])["Cabin_Null"].mean()

Survived
0    0.876138
1    0.602339
Name: Cabin_Null, dtype: float64

In [11]:
# Fraction of missing Age values for passengers who did not survive (0) and who survived (1)
data["Age"].isnull().groupby(data["Survived"]).mean()

Survived
0    0.227687
1    0.152047
Name: Age, dtype: float64

As we can see, there is a pattern: both Cabin and Age are missing more often for passengers who did not survive (about 88% and 23%) than for those who survived (about 60% and 15%). So the missing values do not seem to be random.
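The two group-by checks above can be generalised to every column that has missing values. The function below is a hypothetical helper (`missing_rate_by_group` is a name chosen here, not part of pandas), demonstrated on a small made-up frame so that it is self-contained.

```python
import numpy as np
import pandas as pd

def missing_rate_by_group(df, group_col):
    """Fraction of missing values per column, split by the values of group_col."""
    cols = [c for c in df.columns
            if c != group_col and df[c].isnull().any()]
    return df[cols].isnull().groupby(df[group_col]).mean()

# Made-up toy frame, for illustration only
toy = pd.DataFrame({
    "Survived": [0, 0, 0, 0, 1, 1, 1, 1],
    "Cabin": [np.nan, np.nan, np.nan, "C1", "B2", np.nan, "D3", "E4"],
    "Age": [np.nan, 30.0, 40.0, 22.0, 25.0, 31.0, 28.0, np.nan],
    "Fare": [7.25, 8.05, 7.9, 53.1, 71.3, 80.0, 26.0, 13.0],
})

rates = missing_rate_by_group(toy, "Survived")
print(rates)
```

A large gap between the rows of the result, as for `Cabin` in this toy frame, hints at a systematic mechanism. Similar rates across groups, as for `Age` here, give no such hint, although they do not prove the data are missing completely at random.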

## Missing Completely at Random (MCAR)

In [13]:
data["Embarked"].isnull().mean()

0.002244668911335578

In [14]:
data[data["Embarked"].isnull()]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Cabin_Null
61,62,1,1,"Icard, Miss. Amelie",female,38.0,0,0,113572,80.0,B28,,0
829,830,1,1,"Stone, Mrs. George Nelson (Martha Evelyn)",female,62.0,0,0,113572,80.0,B28,,0


Here there is no such pattern to observe: only two passengers (about 0.2% of the rows) have a missing Embarked value, which is too few to suggest any systematic mechanism. So this could be just random.