**This notebook is mostly derived from the tutorial on [Data Cleaning](https://www.kaggle.com/learn/data-cleaning) by [Rachael Tatman](https://www.kaggle.com/rtatman) at [kaggle](https://www.kaggle.com/)**

# Handling Missing Values

## Importing Libraries

In [1]:
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings('ignore')

## Loading Dataset

In [2]:
data = pd.read_csv("NFL Play by Play 2009-2016 (v3) (copy).csv")

## Let's Take a Look at the Data
The first thing to do is check whether the data has any missing values.

In [3]:
data.head()

Unnamed: 0,Date,GameID,Drive,qtr,down,time,TimeUnder,TimeSecs,PlayTimeDiff,SideofField,...,yacEPA,Home_WP_pre,Away_WP_pre,Home_WP_post,Away_WP_post,Win_Prob,WPA,airWPA,yacWPA,Season
0,2009-09-10,2009091000,1,1,,15:00,15,3600.0,0.0,TEN,...,,0.485675,0.514325,0.546433,0.453567,0.485675,0.060758,,,2009
1,2009-09-10,2009091000,1,1,1.0,14:53,15,3593.0,7.0,PIT,...,1.146076,0.546433,0.453567,0.551088,0.448912,0.546433,0.004655,-0.032244,0.036899,2009
2,2009-09-10,2009091000,1,1,2.0,14:16,15,3556.0,37.0,PIT,...,,0.551088,0.448912,0.510793,0.489207,0.551088,-0.040295,,,2009
3,2009-09-10,2009091000,1,1,3.0,13:35,14,3515.0,41.0,PIT,...,-5.031425,0.510793,0.489207,0.461217,0.538783,0.510793,-0.049576,0.106663,-0.156239,2009
4,2009-09-10,2009091000,1,1,4.0,13:27,14,3507.0,8.0,PIT,...,,0.461217,0.538783,0.558929,0.441071,0.461217,0.097712,,,2009


As we can see, there are some missing values in our data.

## How many missing values do we have?
Now that we know our data has missing values, let's find out how many there are in each column.

In [4]:
missing_values_count = data.isnull().sum()
missing_values_count[:10]

Date                0
GameID              0
Drive               0
qtr                 0
down            54218
time              188
TimeUnder           0
TimeSecs          188
PlayTimeDiff      374
SideofField       450
dtype: int64

As we can see, a lot of data points are missing. Let's calculate the percentage of missing values to get a better idea.

In [8]:
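# Total number of cells in the dataset (rows * columns)
total_cells = np.prod(data.shape)
# Total number of missing cells across all columns
total_missing = missing_values_count.sum()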

missing_percentage = (total_missing/total_cells)*100
print(missing_percentage)

24.85847694188906


Almost a quarter of the cells in the data are missing. 
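
The overall percentage hides which columns are responsible for most of the gaps. As a minimal sketch, the same idea can be applied per column; the `toy` frame below is a made-up stand-in for the NFL data (its values are assumptions for illustration only), so it runs without the CSV file.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the NFL data (values are made up for illustration)
toy = pd.DataFrame({
    "down": [1.0, np.nan, 3.0, np.nan],
    "TimeSecs": [3600.0, 3593.0, np.nan, 3515.0],
    "Season": [2009, 2009, 2009, 2009],
})

# isnull() yields booleans; their mean per column is the fraction missing
missing_pct_by_column = toy.isnull().mean() * 100

# Overall percentage: missing cells divided by all cells
overall_pct = toy.isnull().sum().sum() / toy.size * 100

print(missing_pct_by_column.sort_values(ascending=False))
print(overall_pct)
```

Sorting the per-column percentages puts the worst columns first, which is usually where it is worth asking why the values are missing.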

## Why is the data missing?
Using our intuition, we can ask one question:<br>
    **Is the value missing because it wasn't recorded or because it doesn't exist?** <br>
If the value is missing because it doesn't exist, then there is no sense in guessing what it might be, and we should leave it as NaN.
If the value is missing because it wasn't recorded, then we can try to guess what it might have been based on the other values in that column and row. This is called **imputation**.

In [10]:
missing_values_count[:10]

Date                0
GameID              0
Drive               0
qtr                 0
down            54218
time              188
TimeUnder           0
TimeSecs          188
PlayTimeDiff      374
SideofField       450
dtype: int64

Let's consider the "TimeSecs" column. According to the dataset's documentation, this column contains the number of seconds left in the game. So values in this column are missing because they were not recorded, not because they don't exist, and we can try to guess them. <br> Now consider the "PenalizedTeam" column. Even without looking at the documentation, we can say that values are missing here because they simply don't exist: if there was no penalty on a play, it makes no sense to say which team was penalized. It is therefore better to leave these values empty or to add a third value like "neither".
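
The two cases can be sketched on a small made-up frame. Linear interpolation is only one possible way to guess an unrecorded time and is an assumption here, not a method taken from the tutorial; the `toy` values are hypothetical as well.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the two columns discussed above
toy = pd.DataFrame({
    "TimeSecs": [3600.0, np.nan, 3556.0, 3515.0],
    "PenalizedTeam": [np.nan, "PIT", np.nan, np.nan],
})

# Value does not exist (no penalty on the play): label it explicitly
toy["PenalizedTeam"] = toy["PenalizedTeam"].fillna("neither")

# Value was not recorded: estimate it from the neighbouring plays
# (linear interpolation is an assumed choice for this sketch)
toy["TimeSecs"] = toy["TimeSecs"].interpolate()

print(toy)
```

After this, neither column contains NaN, but only the "TimeSecs" entry is a guess; the "PenalizedTeam" entries just state that no team was penalized.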

## Drop missing values
If you don't want to go into the details of **why the data is missing**, you can simply use the pandas dropna() method to drop the rows with missing values. (NOTE: This is not recommended, as you may lose important information.)

In [12]:
#data.dropna()

In the code above, make sure to add the axis=1 parameter; otherwise we may lose all the rows if every row has at least one missing value. axis=1 specifies that we want to drop the columns that have at least one missing value.
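
To see the difference between the two directions, here is a minimal sketch on a made-up three-row frame in which every row has at least one gap. The `toy` data is an assumption for illustration only.

```python
import numpy as np
import pandas as pd

# Every row of this hypothetical frame has at least one missing value
toy = pd.DataFrame({
    "GameID": [1, 2, 3],
    "down": [1.0, np.nan, 3.0],
    "airEPA": [np.nan, 0.5, np.nan],
})

# Default (axis=0): drop every row that contains a NaN
rows_kept = toy.dropna()

# axis=1: drop every column that contains a NaN
cols_kept = toy.dropna(axis=1)

print(rows_kept.shape)
print(cols_kept.shape)
```

Dropping rows leaves an empty frame here, while dropping columns keeps all three rows but only the fully populated "GameID" column.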

In [15]:
data_with_dropped_columns = data.dropna(axis=1)
data_with_dropped_columns.head(10)

Unnamed: 0,Date,GameID,Drive,qtr,TimeUnder,ydstogo,ydsnet,PlayAttempted,Yards.Gained,sp,...,Timeout_Indicator,Timeout_Team,posteam_timeouts_pre,HomeTimeouts_Remaining_Pre,AwayTimeouts_Remaining_Pre,HomeTimeouts_Remaining_Post,AwayTimeouts_Remaining_Post,ExPoint_Prob,TwoPoint_Prob,Season
0,2009-09-10,2009091000,1,1,15,0,0,1,39,0,...,0,,3,3,3,3,3,0.0,0.0,2009
1,2009-09-10,2009091000,1,1,15,10,5,1,5,0,...,0,,3,3,3,3,3,0.0,0.0,2009
2,2009-09-10,2009091000,1,1,15,5,2,1,-3,0,...,0,,3,3,3,3,3,0.0,0.0,2009
3,2009-09-10,2009091000,1,1,14,8,2,1,0,0,...,0,,3,3,3,3,3,0.0,0.0,2009
4,2009-09-10,2009091000,1,1,14,8,2,1,0,0,...,0,,3,3,3,3,3,0.0,0.0,2009
5,2009-09-10,2009091000,2,1,14,10,0,1,0,0,...,0,,3,3,3,3,3,0.0,0.0,2009
6,2009-09-10,2009091000,2,1,13,10,4,1,4,0,...,0,,3,3,3,3,3,0.0,0.0,2009
7,2009-09-10,2009091000,2,1,13,6,2,1,-2,0,...,0,,3,3,3,3,3,0.0,0.0,2009
8,2009-09-10,2009091000,2,1,12,8,2,1,11,0,...,0,,3,3,3,3,3,0.0,0.0,2009
9,2009-09-10,2009091000,3,1,12,10,3,1,3,0,...,0,,3,3,3,3,3,0.0,0.0,2009


**How much data did we lose?**

In [21]:
print("Columns in original dataset: %d \n" %data.shape[1])
print("Columns after dropping na's: %d" %data_with_dropped_columns.shape[1])

Columns in original dataset: 102 

Columns after dropping na's: 41


Originally our data had 102 columns; after dropping every column with at least one missing value, we are left with only 41. At this point we have removed all NaNs, at the cost of more than half of the columns.

## Filling Missing Values Automatically
Instead of dropping the columns we can try to fill the missing values. Let's see how:

In [24]:
# Make a small subset of data for study purpose
subset_data = data.loc[:, 'EPA':'Season'].head(10)
subset_data

Unnamed: 0,EPA,airEPA,yacEPA,Home_WP_pre,Away_WP_pre,Home_WP_post,Away_WP_post,Win_Prob,WPA,airWPA,yacWPA,Season
0,2.014474,,,0.485675,0.514325,0.546433,0.453567,0.485675,0.060758,,,2009
1,0.077907,-1.068169,1.146076,0.546433,0.453567,0.551088,0.448912,0.546433,0.004655,-0.032244,0.036899,2009
2,-1.40276,,,0.551088,0.448912,0.510793,0.489207,0.551088,-0.040295,,,2009
3,-1.712583,3.318841,-5.031425,0.510793,0.489207,0.461217,0.538783,0.510793,-0.049576,0.106663,-0.156239,2009
4,2.097796,,,0.461217,0.538783,0.558929,0.441071,0.461217,0.097712,,,2009
5,-0.696302,,,0.558929,0.441071,0.578453,0.421547,0.441071,-0.019524,,,2009
6,-0.179149,-0.343085,0.163935,0.578453,0.421547,0.582881,0.417119,0.421547,-0.004427,-0.010456,0.006029,2009
7,-1.119477,,,0.582881,0.417119,0.617544,0.382456,0.417119,-0.034663,,,2009
8,-0.021313,,,0.617544,0.382456,0.591489,0.408511,0.382456,0.026054,,,2009
9,-0.215293,-0.756894,0.541602,0.591489,0.408511,0.585405,0.414595,0.591489,-0.006084,-0.024526,0.018442,2009


We can use pandas' fillna() method to fill in missing values in our dataframe. With fillna() we specify what NaN should be replaced with. For instance, to replace all NaNs with 0 (zero):

In [25]:
subset_data.fillna(0)

Unnamed: 0,EPA,airEPA,yacEPA,Home_WP_pre,Away_WP_pre,Home_WP_post,Away_WP_post,Win_Prob,WPA,airWPA,yacWPA,Season
0,2.014474,0.0,0.0,0.485675,0.514325,0.546433,0.453567,0.485675,0.060758,0.0,0.0,2009
1,0.077907,-1.068169,1.146076,0.546433,0.453567,0.551088,0.448912,0.546433,0.004655,-0.032244,0.036899,2009
2,-1.40276,0.0,0.0,0.551088,0.448912,0.510793,0.489207,0.551088,-0.040295,0.0,0.0,2009
3,-1.712583,3.318841,-5.031425,0.510793,0.489207,0.461217,0.538783,0.510793,-0.049576,0.106663,-0.156239,2009
4,2.097796,0.0,0.0,0.461217,0.538783,0.558929,0.441071,0.461217,0.097712,0.0,0.0,2009
5,-0.696302,0.0,0.0,0.558929,0.441071,0.578453,0.421547,0.441071,-0.019524,0.0,0.0,2009
6,-0.179149,-0.343085,0.163935,0.578453,0.421547,0.582881,0.417119,0.421547,-0.004427,-0.010456,0.006029,2009
7,-1.119477,0.0,0.0,0.582881,0.417119,0.617544,0.382456,0.417119,-0.034663,0.0,0.0,2009
8,-0.021313,0.0,0.0,0.617544,0.382456,0.591489,0.408511,0.382456,0.026054,0.0,0.0,2009
9,-0.215293,-0.756894,0.541602,0.591489,0.408511,0.585405,0.414595,0.591489,-0.006084,-0.024526,0.018442,2009
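
A single fill value rarely suits every column. As a hedged sketch, fillna() also accepts a dictionary that maps column names to fill values; the frame and the chosen fill values below are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one numeric and one text column
toy = pd.DataFrame({
    "airEPA": [np.nan, -1.07, np.nan],
    "SideofField": ["TEN", np.nan, "PIT"],
})

# Each key is a column name; each value is the replacement for that column
filled = toy.fillna({"airEPA": 0.0, "SideofField": "unknown"})

print(filled)
```

Columns not listed in the dictionary are left untouched, so any NaN in them would remain.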


We can also replace each missing value with whatever value comes directly after it in the same column. This makes a lot of sense for datasets where the observations have some sort of logical order.

In [28]:
# Replace each NaN with the next valid value in the same column;
# any NaN that has no value after it is filled with 0
subset_data.bfill(axis=0).fillna(0)

Unnamed: 0,EPA,airEPA,yacEPA,Home_WP_pre,Away_WP_pre,Home_WP_post,Away_WP_post,Win_Prob,WPA,airWPA,yacWPA,Season
0,2.014474,-1.068169,1.146076,0.485675,0.514325,0.546433,0.453567,0.485675,0.060758,-0.032244,0.036899,2009
1,0.077907,-1.068169,1.146076,0.546433,0.453567,0.551088,0.448912,0.546433,0.004655,-0.032244,0.036899,2009
2,-1.40276,3.318841,-5.031425,0.551088,0.448912,0.510793,0.489207,0.551088,-0.040295,0.106663,-0.156239,2009
3,-1.712583,3.318841,-5.031425,0.510793,0.489207,0.461217,0.538783,0.510793,-0.049576,0.106663,-0.156239,2009
4,2.097796,-0.343085,0.163935,0.461217,0.538783,0.558929,0.441071,0.461217,0.097712,-0.010456,0.006029,2009
5,-0.696302,-0.343085,0.163935,0.558929,0.441071,0.578453,0.421547,0.441071,-0.019524,-0.010456,0.006029,2009
6,-0.179149,-0.343085,0.163935,0.578453,0.421547,0.582881,0.417119,0.421547,-0.004427,-0.010456,0.006029,2009
7,-1.119477,-0.756894,0.541602,0.582881,0.417119,0.617544,0.382456,0.417119,-0.034663,-0.024526,0.018442,2009
8,-0.021313,-0.756894,0.541602,0.617544,0.382456,0.591489,0.408511,0.382456,0.026054,-0.024526,0.018442,2009
9,-0.215293,-0.756894,0.541602,0.591489,0.408511,0.585405,0.414595,0.591489,-0.006084,-0.024526,0.018442,2009
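
In the ten-row subset every gap happens to have a value after it, so the final fillna(0) has nothing left to do. A minimal sketch on a made-up Series (the values are assumptions for illustration) shows the case where it matters:

```python
import numpy as np
import pandas as pd

# Hypothetical column that ends with a missing value
s = pd.Series([np.nan, 1.5, np.nan, np.nan, 4.0, np.nan])

# Backward fill: each NaN takes the next valid value below it
back_filled = s.bfill()

# The last NaN has nothing after it, so it survives bfill and needs a fallback
complete = back_filled.fillna(0)

print(back_filled.tolist())
print(complete.tolist())
```

Only the trailing gap falls back to 0; every other gap takes the next recorded value.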
