# Handling Missing Data

In this section, we will study ways to identify and treat missing data. We will:
- Identify missing data in dataframes
- Treat (ignore, remove or impute) missing values

Data can be missing for various reasons: it was not entered during manual data entry (human error), it was not available (e.g. the date of birth of certain people), and so on.

In Python, missing data is usually represented by one of two objects: ```NaN``` (Not a Number, available as ```np.nan```) or ```None```. Pandas treats both as missing. We will not go into the differences between them or how Python stores them internally. Instead, we will focus on ways to identify and treat missing values in pandas dataframes.

There are four main methods to identify and treat missing data:
- ```isnull()```: Returns a boolean mask that is True wherever a value is missing
- ```notnull()```: Opposite of ```isnull()```; True wherever a value is present
- ```dropna()```: Drops the rows (or columns) containing missing values and returns the rest
- ```fillna()```: Fills (or imputes) the missing values with a specified value
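
As a quick illustration of the four methods, here is a minimal sketch on a small made-up dataframe. The dataframe ```toy``` and its column names are hypothetical and are not part of the credit approval dataset; filling the numeric column with its mean is just one possible imputation choice.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the credit approval dataset)
toy = pd.DataFrame({
    "age": [25.0, np.nan, 40.0],
    "city": ["Pune", "Delhi", None],
})

missing_mask = toy.isnull()     # True wherever a value is missing
present_mask = toy.notnull()    # exact opposite of isnull()
complete_rows = toy.dropna()    # keeps only rows without any missing value

# Impute: column mean for 'age', a placeholder string for 'city'
filled = toy.fillna({"age": toy["age"].mean(), "city": "unknown"})

print(missing_mask)
print(complete_rows)
print(filled)
```

Note that none of these calls modifies ```toy``` itself; each returns a new object.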


For this exercise, we will use the **credit approval dataset**. Both the dataset and the data description are <a href="https://archive.ics.uci.edu/ml/datasets/Credit+Approval">available here</a>. 

The columns have been given arbitrary names to protect data confidentiality.




In [32]:
import numpy as np
import pandas as pd

bank = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/crx.data", header=None)
bank.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
0,b,30.83,0.0,u,g,w,v,1.25,t,t,1,f,g,202,0,+
1,a,58.67,4.46,u,g,q,h,3.04,t,t,6,f,g,43,560,+
2,a,24.5,0.5,u,g,q,h,1.5,t,f,0,f,g,280,824,+
3,b,27.83,1.54,u,g,w,v,3.75,t,t,5,t,g,100,3,+
4,b,20.17,5.625,u,g,w,v,1.71,t,f,0,f,s,120,0,+


In [46]:
# renaming the columns
bank.columns = ['a', 'b', 'c', 'd', 'e', 'f',
                'g', 'h', 'i', 'j', 'k', 'l',
                'm', 'n', 'o', 'p']
bank.head()

Unnamed: 0,a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p
0,b,30.83,0.0,u,g,w,v,1.25,t,t,1,f,g,202,0,+
1,a,58.67,4.46,u,g,q,h,3.04,t,t,6,f,g,43,560,+
2,a,24.5,0.5,u,g,q,h,1.5,t,f,0,f,g,280,824,+
3,b,27.83,1.54,u,g,w,v,3.75,t,t,5,t,g,100,3,+
4,b,20.17,5.625,u,g,w,v,1.71,t,f,0,f,s,120,0,+


In [47]:
# shape
print(bank.shape)

# variable types etc. (info() prints its summary itself and returns None,
# so wrapping it in print() would add a stray "None" to the output)
bank.info()

(690, 16)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 16 columns):
a    690 non-null object
b    690 non-null object
c    690 non-null float64
d    690 non-null object
e    690 non-null object
f    690 non-null object
g    690 non-null object
h    690 non-null float64
i    690 non-null object
j    690 non-null object
k    690 non-null int64
l    690 non-null object
m    690 non-null object
n    690 non-null object
o    690 non-null int64
p    690 non-null object
dtypes: float64(2), int64(2), object(12)
memory usage: 53.9+ KB


### Identifying Missing Values

The methods ```isnull()``` and ```notnull()``` are the most common ways of identifying missing values. 

While handling missing data, you first need to identify the rows and columns containing missing values, count the number of missing values, and then decide how you want to treat them.

```isnull()``` returns a boolean dataframe of the same shape as the original (True where a value is missing, False otherwise), which can then be used to find the rows or columns containing missing values.
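
One way to go from the boolean mask to the affected rows or columns is to combine it with ```any()```. The sketch below uses a small hypothetical dataframe (```toy```, with made-up columns ```x```, ```y```, ```z```) rather than the bank data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the credit approval dataset)
toy = pd.DataFrame({
    "x": [1.0, np.nan, 3.0],
    "y": ["a", "b", "c"],
    "z": [np.nan, np.nan, 9.0],
})

mask = toy.isnull()

# One True/False per column: does the column contain any missing value?
cols_with_missing = mask.any(axis=0)

# Keep only the rows that contain at least one missing value
rows_with_missing = toy[mask.any(axis=1)]

print(cols_with_missing)
print(rows_with_missing)
```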

In [48]:
# isnull()
bank.isnull()

Unnamed: 0,a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p
0,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
5,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
6,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
7,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
8,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
9,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False


It is hard to understand how missing values are distributed by looking at the boolean dataframe above. Thus, we need to do some counting.

####  Counting Missing Values in Columns

You can calculate the number of missing values in each column using ```df.isnull().sum()```. This works because each True counts as 1 when summed.

In [49]:
# summing up the missing values (column-wise)
bank.isnull().sum()

a    0
b    0
c    0
d    0
e    0
f    0
g    0
h    0
i    0
j    0
k    0
l    0
m    0
n    0
o    0
p    0
dtype: int64

It looks like the dataset has no missing values at all, but that is not true. Although pandas represents missing values as ```NaN``` or ```None```, datasets often follow other conventions.

For example, in this dataset, missing values are represented by question marks ```?```.
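
Because ```?``` is an ordinary string, ```isnull()``` does not flag it, but an element-wise comparison does. The following is a minimal sketch on a hypothetical dataframe (```toy``` is made up; only the ```?``` convention is taken from the dataset) that counts the question marks in each column.

```python
import pandas as pd

# Hypothetical toy data that mimics the '?' placeholder convention
toy = pd.DataFrame({
    "a": ["b", "?", "a", "?"],
    "b": ["30.83", "24.5", "?", "40.83"],
    "c": [0.0, 12.75, 0.5, 3.5],
})

# isnull() sees nothing, because '?' is a regular string
print(toy.isnull().sum())

# Element-wise comparison gives a boolean dataframe; summing counts the '?'s
question_marks = (toy == "?").sum()
print(question_marks)
```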

In [51]:
# find question marks in column a
bank.loc[(bank.a == "?"), :]


Unnamed: 0,a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p
248,?,24.5,12.75,u,g,c,bb,4.75,t,t,2,f,g,73,444,+
327,?,40.83,3.5,u,g,i,bb,0.5,f,f,0,f,s,1160,0,-
346,?,32.25,1.5,u,g,c,v,0.25,f,f,0,t,g,372,122,-
374,?,28.17,0.585,u,g,aa,v,0.04,f,f,0,f,g,260,1004,-
453,?,29.75,0.665,u,g,w,v,0.25,f,f,0,t,g,300,0,-
479,?,26.5,2.71,y,p,?,?,0.085,f,f,0,f,s,80,0,-
489,?,45.33,1.0,u,g,q,v,0.125,f,f,0,t,g,263,0,-
520,?,20.42,7.5,u,g,k,v,1.5,t,t,1,f,g,160,234,+
598,?,20.08,0.125,u,g,q,v,1.0,f,t,1,f,g,240,768,+
601,?,42.25,1.75,y,p,?,?,0.0,f,f,0,t,g,150,1,-


Note that there are ```?```s in other columns as well (for example, in columns ```f``` and ```g``` of rows 479 and 601).

The ideal way to handle this data is to:
- First identify all the ```?```s,
- Convert them into a standard missing value representation such as ```NaN```, and
- Then use the inbuilt functions such as ```df.isnull()``` to work with them.
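
The steps above can be sketched as follows. This is only an illustration on a hypothetical dataframe (```toy```), assuming ```?``` is the only placeholder in use; it is not necessarily how the rest of this notebook proceeds.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data that mimics the '?' placeholder convention
toy = pd.DataFrame({
    "a": ["b", "?", "a"],
    "b": ["30.83", "?", "24.5"],
})

# Steps 1 and 2: locate every '?' and convert it to NaN in one call
cleaned = toy.replace("?", np.nan)

# Column 'b' held numbers stored as strings; convert it now that '?' is gone
cleaned["b"] = pd.to_numeric(cleaned["b"])

# Step 3: the standard tools now see the missing values
print(cleaned.isnull().sum())
```

An alternative is to declare the placeholder while loading the file, by passing ```na_values="?"``` to ```pd.read_csv```, so that the question marks are read as ```NaN``` from the start.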



In [None]:
# Find all question marks in the dataset