# Methods for Dealing with Missing Values
More often than not, the datasets you work with will contain missing values. This can happen when no measurement was available for a data point, or when some kind of error occurred (user, machine, etc.). Sometimes it's a random value here or there; sometimes it's an entire column of data.

### Imputation
Imputation means "filling in information based upon other values in your dataset". 
- Done carelessly, imputation changes the distribution of your dataset
- For example, if you have male and female subjects in a pregnancy study, you must impute data separately for each group or you could end up with very wonky results
- The most common methods are to fill in the missing values with the mean or median of that category
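
As a minimal sketch of imputing separately for each group, the example below fills each missing value with the mean of its own group and compares that with filling from the overall mean. The column names `sex` and `weight_kg` and all of the values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical study data: column names and values are invented for this sketch
study = pd.DataFrame({
    "sex": ["female", "female", "female", "male", "male", "male"],
    "weight_kg": [60.0, np.nan, 64.0, 80.0, 84.0, np.nan],
})

# Mean of each group, broadcast back to one value per row
group_means = study.groupby("sex")["weight_kg"].transform("mean")

# Fill each missing value with the mean of its own group
study["filled_by_group"] = study["weight_kg"].fillna(group_means)

# For comparison: fill with the overall mean, ignoring the groups
study["filled_overall"] = study["weight_kg"].fillna(study["weight_kg"].mean())

print(study)
```

With these numbers the group-wise fill uses 62 and 82, while the overall fill puts 72 into both groups, a value that is typical of neither.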

### Dropping Missing Values
If imputation doesn't seem valid or meaningful for your dataset, an alternative is to drop the entire row containing the missing value:
- You can lose a lot of data this way
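
One way to see how much data would be lost is to compare row counts before and after dropping. This is only a sketch on a made-up frame (`df` and its columns are hypothetical), where each row but one has a single missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: three of the four rows are missing exactly one value
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, 2.0, 3.0, 4.0],
    "c": [1.0, 2.0, np.nan, 4.0],
})

rows_before = len(df)
rows_after = len(df.dropna())          # rows with no missing values at all
fraction_lost = 1 - rows_after / rows_before

print(f"{rows_after} of {rows_before} rows survive ({fraction_lost:.0%} lost)")
```

Here only 3 of the 12 cells are missing, yet dropping rows discards 75% of the data.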

### Setting the Missing Value to 0
Be careful with this one: think about what a 0 actually means for the data you are filling. For example, if you are measuring temperature, 0 is a real, meaningful reading, so filling with it could skew your dataset.
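
The effect is easy to see on a small, made-up series of temperature readings (the values below are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical temperature readings with one missing measurement
temps = pd.Series([21.0, 23.0, np.nan, 22.0])

# pandas skips NaN by default when computing the mean
mean_ignoring_missing = temps.mean()

# Filling with 0 treats the gap as a real reading of 0 degrees
mean_zero_filled = temps.fillna(0).mean()

print(mean_ignoring_missing)  # 22.0
print(mean_zero_filled)       # 16.5
```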

## Finding Missing Data
Pandas has two ways to mark missing values: 
- NaN = Not a Number. This comes from the NumPy package (Numerical Python), which stores nulls as a special floating-point value rather than a heftier object data type
- None = Python's built-in null object, which pandas can only store in a column of object data type
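
A short sketch of how the two markers behave. The series names are made up; the point is that pandas converts `None` to `NaN` in a numeric column, keeps it as `None` in an object column, and treats both as null.

```python
import numpy as np
import pandas as pd

# NaN is an ordinary Python float, and it never compares equal to itself
nan_is_float = isinstance(np.nan, float)
nan_equals_itself = np.nan == np.nan

# In an otherwise numeric series, None is converted to NaN (dtype float64)
numeric = pd.Series([1, None, 3])

# In a series that also holds strings, the dtype is object
mixed = pd.Series([1, None, "hello"])

print(nan_is_float, nan_equals_itself)   # True False
print(numeric.dtype, mixed.dtype)        # float64 object
print(numeric.isna().tolist())           # [False, True, False]
print(mixed.isna().tolist())             # [False, True, False]
```

Because `NaN` is not equal to itself, comparing with `==` will not find missing values; use `isna()` or `isnull()` instead.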


Let's look at some different ways to find null values in a dataset. 

In [15]:
import pandas as pd
import numpy as np

# create test dataset
data = pd.DataFrame({"col1": [1, np.nan, 'hello', None],
                    "col2:" : [2, 'world', None, np.nan]})
data

Unnamed: 0,col1,col2:
0,1,2
1,,world
2,hello,
3,,


In the above dataframe we can clearly see the two different types of missing information. But how do we find them in a larger dataset? 

In [16]:
data.isnull()

Unnamed: 0,col1,col2:
0,False,False
1,True,False
2,False,True
3,True,True


This gives us booleans letting us know where in the dataset we have null values.
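
On a larger dataset a full table of booleans is hard to read, so it often helps to summarize it. The sketch below rebuilds the same test dataset and counts the nulls per column, then selects the rows that contain at least one null.

```python
import numpy as np
import pandas as pd

# Same test dataset as above (including the "col2:" column name)
data = pd.DataFrame({"col1": [1, np.nan, "hello", None],
                     "col2:": [2, "world", None, np.nan]})

# True counts as 1, so summing the boolean mask counts nulls per column
missing_per_column = data.isnull().sum()

# Total number of null cells in the whole frame
total_missing = int(data.isnull().sum().sum())

# Rows that contain at least one null
rows_with_missing = data[data.isnull().any(axis=1)]

print(missing_per_column)
print(total_missing)        # 4
print(rows_with_missing)
```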

## Dropping Null Values
If we have a lot of data and simply want to drop null values from the dataset, we can use the pandas method `dropna()`:  

In [24]:
data.dropna(axis=0)

Unnamed: 0,col1,col2:
0,1,2


We can see that it drops every row that contains at least one null value, whether NaN or None. 

In [25]:
data.dropna(axis=1)

0
1
2
3


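Setting `axis=1` drops columns instead of rows. Here both columns contain at least one null, so both are dropped and only the index remains. 

This tosses out a row or column if it contains any null at all, but what if we only want to discard rows or columns that are entirely null? 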

In [26]:
data.dropna(how='all')

Unnamed: 0,col1,col2:
0,1,2
1,,world
2,hello,
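
Between dropping on any null and dropping only when everything is null, `dropna()` also accepts `thresh` (the minimum number of non-null values a row needs to be kept) and `subset` (the columns to check). A minimal sketch on the same test dataset:

```python
import numpy as np
import pandas as pd

# Same test dataset as above (including the "col2:" column name)
data = pd.DataFrame({"col1": [1, np.nan, "hello", None],
                     "col2:": [2, "world", None, np.nan]})

# Keep only rows that have at least 2 non-null values
at_least_two = data.dropna(thresh=2)

# Only look at col1 when deciding which rows to drop
col1_present = data.dropna(subset=["col1"])

print(at_least_two.index.tolist())   # [0]
print(col1_present.index.tolist())   # [0, 2]
```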


## Filling Null Values
Since our table doesn't have a lot of data, it might be better for us to impute those missing values or pad them with 0 so the rows don't get dropped. 

We can fill every null with a single value: 

In [27]:
data.fillna(0)

Unnamed: 0,col1,col2:
0,1,2
1,0,world
2,hello,0
3,0,0


We can forward fill, which carries the last valid value above each null down into it: 

In [28]:
# ffill() is equivalent to fillna(method="ffill"), which newer pandas versions deprecate
data.ffill()

Unnamed: 0,col1,col2:
0,1,2
1,1,world
2,hello,world
3,hello,world


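Backward fill does the reverse, pulling the next valid value below each null up into it. Nulls with no valid value after them stay null: 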

In [30]:
# bfill() is equivalent to fillna(method="bfill"), which newer pandas versions deprecate
data.bfill()

Unnamed: 0,col1,col2:
0,1,2
1,hello,world
2,hello,
3,,
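
The direction of each fill is easier to see on a purely numeric series. The sketch below uses a made-up series `s`, and also shows the `limit` argument, which caps how many consecutive nulls get filled.

```python
import numpy as np
import pandas as pd

# Hypothetical series with a gap of two missing values in the middle
s = pd.Series([1.0, np.nan, np.nan, 4.0])

forward = s.ffill()           # last valid value is carried downward
backward = s.bfill()          # next valid value is pulled upward
limited = s.ffill(limit=1)    # fill at most one consecutive null

print(forward.tolist())       # [1.0, 1.0, 1.0, 4.0]
print(backward.tolist())      # [1.0, 4.0, 4.0, 4.0]
print(limited.tolist())       # [1.0, 1.0, nan, 4.0]
```

Forward and backward fill make the most sense when row order is meaningful, for example readings sorted by time.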


## Imputing Null Values
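And finally, for numeric columns, we can impute missing values with a summary statistic such as the mean, median, or mode. Both columns in our test dataset mix numbers with strings, so neither has a numeric mean, and the call below leaves the nulls untouched: 

In [33]:
# numeric_only=True skips non-numeric columns instead of raising an error in newer pandas versions
data.fillna(data.mean(numeric_only=True))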

Unnamed: 0,col1,col2:
0,1,2
1,,world
2,hello,
3,,
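
To see the imputation take effect, the columns need to be numeric. The sketch below uses a made-up numeric frame (`scores`) for the mean and median, and a made-up categorical series (`colors`) for the mode.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric data with one missing value per column
scores = pd.DataFrame({"x": [1.0, np.nan, 3.0, 8.0],
                       "y": [10.0, 20.0, np.nan, 60.0]})

# Each column is filled with its own mean (x: 4.0, y: 30.0)
filled_mean = scores.fillna(scores.mean())

# Each column is filled with its own median (x: 3.0, y: 20.0)
filled_median = scores.fillna(scores.median())

# For categorical data, the mode (most frequent value) is the usual choice
colors = pd.Series(["red", "blue", None, "red"])
filled_mode = colors.fillna(colors.mode()[0])

print(filled_mean)
print(filled_median)
print(filled_mode.tolist())   # ['red', 'blue', 'red', 'red']
```

The median is less sensitive to outliers than the mean, which is why the two fills differ here.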
