## Understanding Missing Value Mechanics

In [1]:
#imports
import pandas as pd
import numpy as np

### A. Functions and methods in pandas for missing values

The dataset we are going to use records UFO sightings at different locations in the USA.

In [2]:
ufo = pd.read_csv("http://bit.ly/uforeports")
ufo.head()

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
0,Ithaca,,TRIANGLE,NY,6/1/1930 22:00
1,Willingboro,,OTHER,NJ,6/30/1930 20:00
2,Holyoke,,OVAL,CO,2/15/1931 14:00
3,Abilene,,DISK,KS,6/1/1931 13:00
4,New York Worlds Fair,,LIGHT,NY,4/18/1933 19:00


<h4>Evaluating for Missing Data</h4>

When the file is read, missing values are converted to NaN. pandas provides two complementary methods to detect them:
<ol>
    <li><b>.isnull() alias .isna()</b></li>
    <li><b>.notnull() alias .notna()</b></li>
</ol>
Each method returns an object of the same shape as its input, filled with boolean values: .isna() marks missing values with True, while .notna() marks them with False.

In [3]:
ufo.isna().tail()

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
18236,False,True,False,False,False
18237,False,True,False,False,False
18238,False,True,True,False,False
18239,False,False,False,False,False
18240,False,True,False,False,False


In [4]:
ufo.notna().tail()

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
18236,True,False,True,True,True
18237,True,False,True,True,True
18238,True,False,False,True,True
18239,True,True,True,True,True
18240,True,False,True,True,True


<h4>Count missing values in each column</h4>
<p>
Calling the DataFrame method isna() followed by sum() quickly gives the number of missing values in each column. In the output of isna(), True marks a missing value and False marks a value that is present in the dataset. Because sum() treats True as 1 and False as 0, each column sum is the count of missing values in that column.
</p>

In [5]:
pd.Series([True, False, True]).sum()

2

In [6]:
ufo.isnull().sum()

City                  25
Colors Reported    15359
Shape Reported      2644
State                  0
Time                   0
dtype: int64
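Raw counts are hard to compare across datasets of different sizes. As a small sketch on a made-up DataFrame (the `toy` frame and its column names are assumptions for illustration, not part of the UFO data), replacing sum() with mean() gives the share of missing values per column, since the mean of a boolean column is the fraction of True values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, used only to illustrate the idea
toy = pd.DataFrame({
    "city": ["Ithaca", None, "Holyoke", "Abilene"],
    "color": [np.nan, np.nan, "RED", np.nan],
    "state": ["NY", "LA", "CO", "KS"],
})

# Mean of booleans = fraction of True values = share of missing entries
missing_share = toy.isna().mean()

# The same information expressed as a percentage
missing_pct = (missing_share * 100).round(1)
print(missing_pct)
```

The same pattern should apply to the UFO data as `ufo.isna().mean()`.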

We can also look at the rows where a particular column has a missing value, as below.

In [7]:
ufo.loc[ufo.City.isnull(), :]

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
21,,,,LA,8/15/1943 0:00
22,,,LIGHT,LA,8/15/1943 0:00
204,,,DISK,CA,7/15/1952 12:30
241,,BLUE,DISK,MT,7/4/1953 14:00
613,,,DISK,NV,7/1/1960 12:00
1877,,YELLOW,CIRCLE,AZ,8/15/1969 1:00
2013,,,,NH,8/1/1970 9:30
2546,,,FIREBALL,OH,10/25/1973 23:30
3123,,RED,TRIANGLE,WV,11/25/1975 23:00
4736,,,SPHERE,CA,6/23/1982 23:00


We may need to drop missing values. "dropna()" is both a DataFrame and Series method which helps us to drop rows and columns with missing values.

In [8]:
ufo.shape

(18241, 5)

When using "dropna()" on a DataFrame, we need to pay attention to the "how" parameter. how="any" (the default) deletes a row or column if any value in that row or column is NaN. how="all" deletes a row or column only if all the values in it are NaN. By default "axis=0", so the method drops rows, and "inplace=False", so a new DataFrame is returned and the original is left unchanged.

In [9]:
ufo.dropna(how="any").shape

(2486, 5)

Since we are dropping every row that has at least one missing value, we lose a huge part of our DataFrame: only 2,486 of 18,241 rows remain.

In [10]:
ufo.dropna(how="all").shape

(18241, 5)

Since we are dropping a row only if all of its values are missing, nothing is dropped.
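The cells above only drop rows. As a hedged sketch on a made-up DataFrame (the `toy` frame is an assumption for illustration), the same method can drop columns through "axis". The "thresh" parameter, which is not covered above, keeps only rows with at least a given number of non-missing values and is used instead of "how".

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: column "b" is entirely missing
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [np.nan, np.nan, np.nan, np.nan],
    "c": ["x", "y", "z", None],
})

# Drop columns (not rows) in which every value is missing
cols_all = toy.dropna(axis="columns", how="all")

# Keep only rows that have at least 2 non-missing values
rows_thresh = toy.dropna(thresh=2)

# inplace defaults to False, so the original frame is untouched
print(cols_all.columns.tolist(), rows_thresh.index.tolist(), toy.shape)
```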

Another important parameter of "dropna()" is "subset". When dropping rows, it takes a list of column labels (when dropping columns, a list of row labels), and "how" is then applied only to that subset. For example, with how="all", a row is dropped only if all of its values in the subset columns are NaN.

In [11]:
ufo.dropna(subset=["City", "Shape Reported"], how="all").shape

(18237, 5)

"dropna" is also a parameter for "value_counts()" method. By default, its value is True. We can change the value to False, to show the number of missing values in a particular column. Check for the count of NaN in the second line of code below.

In [12]:
ufo["Shape Reported"].value_counts()

LIGHT        2803
DISK         2122
TRIANGLE     1889
OTHER        1402
CIRCLE       1365
SPHERE       1054
FIREBALL     1039
OVAL          845
CIGAR         617
FORMATION     434
VARIOUS       333
RECTANGLE     303
CYLINDER      294
CHEVRON       248
DIAMOND       234
EGG           197
FLASH         188
TEARDROP      119
CONE           60
CROSS          36
DELTA           7
ROUND           2
CRESCENT        2
PYRAMID         1
DOME            1
FLARE           1
HEXAGON         1
Name: Shape Reported, dtype: int64

In [13]:
ufo["Shape Reported"].value_counts(dropna=False)

LIGHT        2803
NaN          2644
DISK         2122
TRIANGLE     1889
OTHER        1402
CIRCLE       1365
SPHERE       1054
FIREBALL     1039
OVAL          845
CIGAR         617
FORMATION     434
VARIOUS       333
RECTANGLE     303
CYLINDER      294
CHEVRON       248
DIAMOND       234
EGG           197
FLASH         188
TEARDROP      119
CONE           60
CROSS          36
DELTA           7
ROUND           2
CRESCENT        2
FLARE           1
PYRAMID         1
DOME            1
HEXAGON         1
Name: Shape Reported, dtype: int64

We can fill missing values using "fillna()", which is available as both a Series and a DataFrame method.

In [14]:
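#assign the result back; inplace=True on a selected column may not update ufo in recent pandas versions
ufo["Shape Reported"] = ufo["Shape Reported"].fillna("VARIOUS")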

In [15]:
ufo["Shape Reported"].value_counts(dropna=False)

VARIOUS      2977
LIGHT        2803
DISK         2122
TRIANGLE     1889
OTHER        1402
CIRCLE       1365
SPHERE       1054
FIREBALL     1039
OVAL          845
CIGAR         617
FORMATION     434
RECTANGLE     303
CYLINDER      294
CHEVRON       248
DIAMOND       234
EGG           197
FLASH         188
TEARDROP      119
CONE           60
CROSS          36
DELTA           7
ROUND           2
CRESCENT        2
PYRAMID         1
DOME            1
FLARE           1
HEXAGON         1
Name: Shape Reported, dtype: int64
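The cell above uses "fillna()" as a Series method. As a minimal sketch of the DataFrame form, on a made-up frame (the `toy` data and the "UNKNOWN" label are assumptions for illustration), a dictionary can map column names to fill values so that each column gets its own replacement. Columns not listed in the dictionary are left as they are.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values in every column
toy = pd.DataFrame({
    "shape": ["DISK", None, "LIGHT"],
    "color": [None, "RED", None],
    "count": [1.0, np.nan, 3.0],
})

# Per-column fill values: a new label for "shape", the median for "count"
filled = toy.fillna({"shape": "UNKNOWN", "count": toy["count"].median()})

# "color" was not listed, so its missing values remain
print(filled)
print(filled.isna().sum())
```

Note that in the UFO example the 2,644 missing shapes were merged into the existing VARIOUS category (333 + 2,644 = 2,977). Filling with a label that does not already occur, as in this sketch, keeps the imputed rows distinguishable from genuinely reported values.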

<hr>

### B. Introduction to Titanic Dataset

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

The dataset about the passengers is now being used to work with Machine Learning algorithms to predict whether someone survived or not. You can learn more about the dataset <a href="https://www.kaggle.com/c/titanic">here</a>.

In [16]:
#each row represents one passenger
titanic = pd.read_csv("http://bit.ly/kaggletrain")
titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [17]:
#checking shape
titanic.shape

(891, 12)

In [18]:
#checking data types
titanic.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

<hr>

### C. Different nature of missing values

#### 1. Missing Completely at Random, MCAR:
When data are MCAR, the missingness is independent of both the observed and the unobserved features. In most cases it is just human error and occurs at a consistent rate. For example, say you are trying to find out how many overdue books students on a campus have on average. You collect the data from the library in-charge's computer and notice that the number of overdue books is missing for some students. You consult the library in-charge about it. He informs you that about 1% of the time, a librarian forgets to enter the value. You try to confirm whether the missing values are related to a particular field or anything else. The in-charge informs you that it is completely random.

In these instances, the missing data do not introduce bias. MCAR is generally regarded as a strong and often unrealistic assumption. To check whether data are plausibly MCAR, we can group them along various axes, for example the sex of the respondent. If the missing-value rate is about the same across these groups, the data are consistent with MCAR. On the other hand, if the rate differs between groups, the data are more likely **Missing at Random (MAR)**.

In [19]:
titanic.isna().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [20]:
titanic.Embarked.value_counts(dropna=False)

S      644
C      168
Q       77
NaN      2
Name: Embarked, dtype: int64

There are three columns with missing values. Notice that the 'Embarked' column has only two missing values. Embarked here means the port of embarkation and takes three values: C (Cherbourg), Q (Queenstown), and S (Southampton). We will begin by creating a column that flags whether Embarked is missing. Then we will group the data by columns with categorical values and compare the missing rate across categories. Whether the rate turns out to be the same or different, we will have numbers to present to a person with domain knowledge. Keep in mind that a procedure alone is never enough to determine the type of missing data; domain knowledge plays an irreplaceable role.

In [21]:
titanic['Emb_null'] = titanic.Embarked.isnull()

In [22]:
titanic.groupby("Survived").Emb_null.mean()

Survived
0    0.000000
1    0.005848
Name: Emb_null, dtype: float64

Notice that the rate of missing data is 0 for people who didn't survive and almost 0 (about 0.6%) for people who survived. This needs to be verified by a domain expert.

In [23]:
titanic.groupby("Pclass").Emb_null.mean()

Pclass
1    0.009259
2    0.000000
3    0.000000
Name: Emb_null, dtype: float64

Notice that the rate of missing data is 0 or almost 0 for every passenger class (at most about 0.9%, in first class). This needs to be verified by a domain expert.

In [24]:
titanic.groupby("Sex").Emb_null.mean()

Sex
female    0.006369
male      0.000000
Name: Emb_null, dtype: float64

Notice that the rate of missing data is almost 0 for both sexes (0 for males, about 0.6% for females). This needs to be verified by a domain expert.

In [25]:
titanic.drop("Emb_null", axis="columns", inplace=True)
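The three groupby cells above follow the same pattern, so they can be folded into one helper. This is only a sketch: the function name `missing_rate_by_group` and the `toy` frame are made up for illustration. It groups the boolean missing indicator directly, so no temporary column has to be added and dropped afterwards.

```python
import pandas as pd

def missing_rate_by_group(df, target, group_cols):
    """Return {group column: Series of missing rates of `target` per group}."""
    is_missing = df[target].isna()
    # Grouping the indicator by another column of the same frame aligns on the index
    return {col: is_missing.groupby(df[col]).mean() for col in group_cols}

# Hypothetical toy data shaped like a few Titanic columns
toy = pd.DataFrame({
    "Embarked": ["S", None, "C", "Q", "S", "S"],
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Pclass": [3, 1, 1, 2, 3, 3],
})

rates = missing_rate_by_group(toy, "Embarked", ["Sex", "Pclass"])
for col, rate in rates.items():
    print(rate, end="\n\n")
```

On the real data, the equivalent call would presumably be `missing_rate_by_group(titanic, "Embarked", ["Survived", "Pclass", "Sex"])`.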

#### 2. Missing at Random, MAR: 

When data are MAR, the missingness is systematically related to another observed feature or a combination of observed features. For example, instead of getting the data from the library in-charge, you decide to ask all students to fill in a form asking how many overdue books they have. Say 90% of female students fill in the form but only 70% of male students do. You have a list of all students along with their sex, and you add the survey results to that list. The number of overdue books will be missing for some students, and the data are no longer missing completely at random: your data are much more representative of female students than of male students. That is, if the probability of completing the survey (or whatever method was used to collect the data) is related to sex (or any other fully observed feature) but not to the number of overdue books a student has (the variable under consideration, which has missing values), then the data may be regarded as MAR.

Properly accounting for the known factors (in the above example, sex) can produce unbiased results. That is, if we know that the data cover 90% of female students but only 70% of male students, we can adjust for this imbalance in the analysis, for example by weighting each group according to its share of the full student list.
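To make the library example concrete, here is a small simulation. Everything in it is an assumption for illustration: the student population is synthetic, the 90%/70% response rates come from the example above, and the average numbers of overdue books (1 for female and 3 for male students) are invented. It compares the naive mean over respondents with a mean that weights each sex by its share of the full student list.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Synthetic population: sex is fully observed for every student
sex = rng.choice(["female", "male"], size=n)
# Invented assumption: overdue counts differ by sex
overdue = np.where(sex == "female", rng.poisson(1.0, n), rng.poisson(3.0, n))

# MAR mechanism: response depends on sex only, not on the overdue count
respond_prob = np.where(sex == "female", 0.9, 0.7)
responded = rng.random(n) < respond_prob

students = pd.DataFrame({"sex": sex, "overdue": overdue.astype(float)})
students.loc[~responded, "overdue"] = np.nan

true_mean = overdue.mean()                 # unknown in practice
naive_mean = students["overdue"].mean()    # ignores who is missing

# Adjusted estimate: per-sex mean of respondents, weighted by population share
group_means = students.groupby("sex")["overdue"].mean()
group_shares = students["sex"].value_counts(normalize=True)
adjusted_mean = (group_means * group_shares).sum()

print(f"true={true_mean:.3f} naive={naive_mean:.3f} adjusted={adjusted_mean:.3f}")
```

In this setup the naive mean leans toward the better-represented female students, while the weighted mean should land much closer to the true value. The adjustment works only because sex is observed for everyone and is the sole driver of the missingness.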

In [26]:
titanic.isna().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

Let's check if Age is MAR

In [27]:
titanic["Age_null"] = titanic.Age.isnull()

In [28]:
titanic.groupby("Survived").Age_null.mean()

Survived
0    0.227687
1    0.152047
Name: Age_null, dtype: float64

Here, the difference between the missing rates is about 7.6 percentage points. Is it significant? What may be the cause? This needs to be verified by a domain expert.

In [29]:
titanic.groupby("Pclass").Age_null.mean()

Pclass
1    0.138889
2    0.059783
3    0.276986
Name: Age_null, dtype: float64

Here, the difference between the highest and lowest missing rates is about 22 percentage points. Is it significant? What may be the cause? This needs to be verified by a domain expert.

In [30]:
titanic.groupby("Sex").Age_null.mean()

Sex
female    0.168790
male      0.214905
Name: Age_null, dtype: float64

Here, the difference between the missing rates is about 5 percentage points. Is it significant? What may be the cause? This needs to be verified by a domain expert.

In [31]:
titanic.drop("Age_null", axis="columns", inplace=True)

Let's check if Cabin is MAR

In [32]:
titanic["Cabin_null"] = titanic.Cabin.isnull()

In [33]:
titanic.groupby("Survived").Cabin_null.mean()

Survived
0    0.876138
1    0.602339
Name: Cabin_null, dtype: float64

Here, the difference between the missing rates is about 27 percentage points. Is it significant? What may be the cause? This needs to be verified by a domain expert.

In [34]:
titanic.groupby("Pclass").Cabin_null.mean()

Pclass
1    0.185185
2    0.913043
3    0.975560
Name: Cabin_null, dtype: float64

Here, the difference between the highest and lowest missing rates is about 79 percentage points. Is it significant? What may be the cause? This needs to be verified by a domain expert.

In [35]:
titanic.groupby("Sex").Cabin_null.mean()

Sex
female    0.691083
male      0.814558
Name: Cabin_null, dtype: float64

Here, the difference between the missing rates is about 12 percentage points. Is it significant? What may be the cause? This needs to be verified by a domain expert.

In [36]:
titanic.drop("Cabin_null", axis="columns", inplace=True)
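The cells above repeatedly ask whether a difference in missing rates is significant. One possible way to put a number on that question, which is not part of the original analysis, is a permutation test: shuffle the missing indicator many times and count how often the shuffled gap between group rates is at least as large as the observed one. The helper names `rate_gap` and `permutation_p_value` and the toy data are made up for this sketch.

```python
import numpy as np
import pandas as pd

def rate_gap(is_missing, groups):
    """Gap between the highest and lowest missing rate across groups."""
    rates = pd.Series(is_missing).groupby(groups).mean()
    return float(rates.max() - rates.min())

def permutation_p_value(is_missing, groups, n_permutations=1000, seed=0):
    """Share of shuffles whose gap is at least as large as the observed gap."""
    rng = np.random.default_rng(seed)
    is_missing = np.asarray(is_missing, dtype=float)
    groups = np.asarray(groups)
    observed = rate_gap(is_missing, groups)
    hits = 0
    for _ in range(n_permutations):
        # Shuffling breaks any link between missingness and group
        shuffled = rng.permutation(is_missing)
        if rate_gap(shuffled, groups) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero
    return (hits + 1) / (n_permutations + 1)

# Hypothetical toy data: 30% missing in group "a", 5% in group "b"
groups = np.array(["a"] * 300 + ["b"] * 300)
unequal = np.array([True] * 90 + [False] * 210 + [True] * 15 + [False] * 285)
# Same missing rate (10%) in both groups
equal = np.array([True] * 30 + [False] * 270 + [True] * 30 + [False] * 270)

p_unequal = permutation_p_value(unequal, groups)
p_equal = permutation_p_value(equal, groups)
print(p_unequal, p_equal)
```

For the Titanic data, the call would presumably look like `permutation_p_value(titanic.Age.isnull(), titanic.Pclass)`. A small p-value only says the gap is unlikely to arise from random shuffling; it does not explain the cause, which still needs domain knowledge.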

#### 3. Missing not at random, MNAR:

When data are MNAR, the fact that a value is missing is systematically related to the missing value itself, or to events or factors that were not measured by the researcher. To extend the previous example, say the same proportion of male and female students responded to the survey, but students with a greater number of overdue books didn't take part. That is, students with only a few overdue books took the survey, while students with many overdue books, due to guilt or some other reason, did not. Now the missingness is a result of the true value of the data itself.

Because the missingness depends on the very values that are missing (a circular problem), this issue is difficult to address in analysis, and the estimates will likely be biased.
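The following simulation sketches why MNAR is hard to spot and hard to fix. All of it is assumed for illustration: the population is synthetic, and the response probabilities (0.9 for students with at most two overdue books, 0.4 otherwise) are invented. Sex is generated independently of everything else, so the group-wise check used earlier shows nearly equal missing rates even though the observed mean is biased.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

sex = rng.choice(["female", "male"], size=n)
overdue = rng.poisson(2.0, n).astype(float)

# MNAR mechanism: students with many overdue books respond less often
respond_prob = np.where(overdue <= 2, 0.9, 0.4)
responded = rng.random(n) < respond_prob

students = pd.DataFrame({"sex": sex, "overdue": overdue})
students["overdue"] = students["overdue"].where(responded)  # NaN if no response

true_mean = overdue.mean()                  # unknown in practice
observed_mean = students["overdue"].mean()  # biased toward small counts

# The group-wise check cannot see the problem: rates by sex are similar
rate_by_sex = students["overdue"].isna().groupby(students["sex"]).mean()

print(f"true={true_mean:.3f} observed={observed_mean:.3f}")
print(rate_by_sex)
```

Unlike the MAR case, no observed column explains the missingness here, so reweighting by sex would not remove the bias.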

<hr>

### D. All the techniques of handling missing values

1. Row Deletion
2. Mean/Median/Mode Imputation
3. Random Sample Imputation
4. Capturing NaN values with a new feature
5. Arbitrary value imputation
6. Replacing NaN with a new category