## Handle Missing Values

### Introduce Missing Values
We're going to load the Iris dataset from sklearn and introduce some NaN values of our own. Run the code below and don't worry too much about the details: it reads in the dataset, appends a few rows with missing values, and adds a column that contains only nulls.

In [7]:
from sklearn import datasets
import pandas as pd
import numpy as np
%matplotlib inline

# import iris dataset
iris = datasets.load_iris()
iris_df = pd.DataFrame(iris.data)
iris_df.columns = ["SEPAL_LENGTH", "SEPAL_WIDTH", "PETAL_LENGTH", "PETAL_WIDTH"]
# np.nan is used here because the np.NaN alias was removed in NumPy 2.0
iris_df.loc[150] = [np.nan, np.nan, np.nan, np.nan]  # adding a row of only null values
iris_df.loc[151] = [np.nan, np.nan, 0, 5.5]  # adding a row with two null values
iris_df.loc[152] = [np.nan, np.nan, 0, 5.5]  # adding a row with two null values
iris_df.loc[153] = [np.nan, np.nan, np.nan, 5.6]  # adding a row with three null values
iris_df["TEST"] = np.nan  # adding a column of NaNs to remove later

iris_df.tail()

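| | SEPAL_LENGTH | SEPAL_WIDTH | PETAL_LENGTH | PETAL_WIDTH | TEST |
|---|---|---|---|---|---|
| 149 | 5.9 | 3.0 | 5.1 | 1.8 | NaN |
| 150 | NaN | NaN | NaN | NaN | NaN |
| 151 | NaN | NaN | 0.0 | 5.5 | NaN |
| 152 | NaN | NaN | 0.0 | 5.5 | NaN |
| 153 | NaN | NaN | NaN | 5.6 | NaN |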


### Find Missing Values
If we have a large enough dataset and only a few rows contain missing data, we can choose to simply remove the missing data points.

Check iris_df to see how many rows contain null values. 

In [None]:
# find rows with missing values
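
If you get stuck, here is one possible way to count rows with nulls. This is only a sketch on a small stand-in DataFrame: `toy_df` and its values are made up for illustration and are not the real Iris data. It combines `isnull()` with `any(axis=1)` to flag rows, and `sum(axis=1)` to count nulls per row.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for iris_df; the values are illustrative only
toy_df = pd.DataFrame(
    {
        "SEPAL_LENGTH": [5.1, 4.9, np.nan, np.nan],
        "SEPAL_WIDTH": [3.5, 3.0, np.nan, np.nan],
        "PETAL_LENGTH": [1.4, 1.4, 0.0, np.nan],
        "PETAL_WIDTH": [0.2, 0.2, 5.5, 5.6],
        "TEST": [np.nan, np.nan, np.nan, np.nan],
    }
)

# Boolean mask: True for every row that holds at least one null
rows_with_nulls = toy_df.isnull().any(axis=1)

# How many rows are affected, and how many nulls each row holds
n_rows_with_nulls = int(rows_with_nulls.sum())
nulls_per_row = toy_df.isnull().sum(axis=1)

print(n_rows_with_nulls)
print(nulls_per_row.tolist())
```

Because the all-null `TEST` column is included, every row in the toy frame is flagged, which mirrors what you should see on `iris_df`.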

### Remove columns that contain only missing values
As we can see above, every row contains at least one missing value, so removing all rows with missing values would cost us our entire dataset. What we can do for starters is get rid of the "TEST" column, as it contains only null values and won't add any information to our analysis.

In [None]:
iris_df2 = ...  # TODO: remove the TEST column
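
Two approaches that could work here, sketched on a hypothetical `toy_df` rather than the real data: dropping the column by name with `drop(columns=...)`, or dropping every column that is entirely null with `dropna(axis=1, how="all")`. Both return a new DataFrame and leave the original untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for iris_df; the values are illustrative only
toy_df = pd.DataFrame(
    {
        "SEPAL_LENGTH": [5.1, np.nan],
        "PETAL_WIDTH": [0.2, 5.5],
        "TEST": [np.nan, np.nan],
    }
)

# Option 1: drop the column by name
dropped_by_name = toy_df.drop(columns=["TEST"])

# Option 2: drop every column whose values are all null
dropped_all_null = toy_df.dropna(axis=1, how="all")

print(list(dropped_by_name.columns))
print(list(dropped_all_null.columns))
```

The second option is handy when you don't know in advance which columns are empty. Note that `how="all"` matters: the default `how="any"` would drop every column that has even a single null.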

### Remove all rows that contain at least three null values
We saw that just a few rows contain at least three null values, so we can safely remove these because they contain too many missing values to give us meaningful information.

Wait! You didn't teach us that! That's right, I also want you to practice your googling skills. Try doing a Google search for "pandas remove rows with more than two missing values".

In [None]:
iris_df3 = ...  # TODO: remove rows with at least three missing values
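
After you've searched, you can compare against this sketch, which again uses a made-up `toy_df` with four columns (the "TEST" column is assumed to be gone already). One likely result of your search is the `thresh` parameter of `dropna`, which is the minimum number of non-null values a row needs in order to be kept. With four columns, "at least three nulls" means "fewer than two non-null values", so `thresh=2`. An explicit boolean mask gives the same result and may be easier to read.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the frame with TEST already removed
toy_df = pd.DataFrame(
    {
        "SEPAL_LENGTH": [5.1, np.nan, np.nan, np.nan],
        "SEPAL_WIDTH": [3.5, np.nan, np.nan, np.nan],
        "PETAL_LENGTH": [1.4, 0.0, np.nan, np.nan],
        "PETAL_WIDTH": [0.2, 5.5, 5.6, np.nan],
    }
)

max_nulls_allowed = 2  # rows with three or more nulls are dropped

# Option 1: thresh counts the non-null values a row must have to be kept
kept_thresh = toy_df.dropna(thresh=toy_df.shape[1] - max_nulls_allowed)

# Option 2: count nulls per row and filter with a boolean mask
kept_mask = toy_df[toy_df.isnull().sum(axis=1) <= max_nulls_allowed]

print(kept_thresh.index.tolist())
```

Computing `thresh` from `toy_df.shape[1]` keeps the rule correct if the number of columns changes, which is why this step is best run after the all-null column has been removed.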

### Imputation: Replace Missing Values with Mean
Now that we've gotten rid of rows and columns that won't provide us with meaningful information, we can retain the rows with fewer than three missing values by replacing the missing values with the column mean.

#### First calculate the mean for each column: 

In [8]:
iris_df["SEPAL_LENGTH"].mean()
# you do the rest

5.843333333333335

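We can use the expression for each column's mean along with the `fillna` function to replace each column's `NaN` values with the column mean. Note that `fillna` returns a new Series, so assign the result back to the column if you want to keep it.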

In [10]:
iris_df["SEPAL_LENGTH"].fillna(iris_df["SEPAL_LENGTH"].mean()).head()

0    5.1
1    4.9
2    4.7
3    4.6
4    5.0
Name: SEPAL_LENGTH, dtype: float64

In [None]:
# replace the remaining columns' null values with their column means, as above
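
As a shortcut, the per-column steps can be collapsed into a single call. The sketch below is one possible approach on a hypothetical `toy_df`, not the real data: `DataFrame.mean()` skips `NaN` values by default and returns one mean per column, and passing that Series to `fillna` fills each column with its own mean.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the cleaned frame; the values are illustrative only
toy_df = pd.DataFrame(
    {
        "SEPAL_LENGTH": [5.0, 7.0, np.nan],
        "PETAL_LENGTH": [1.0, np.nan, 3.0],
    }
)

# One mean per column; NaN values are ignored by default (skipna=True)
column_means = toy_df.mean()

# fillna matches the Series index against column names,
# so each column is filled with its own mean
filled_df = toy_df.fillna(column_means)

print(filled_df)
```

Doing it column by column, as in the cell above, is still worth practicing, since it makes it easy to use a different fill value for each column.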