# Data cleaning: handling missing data

This notebook is adapted from Kaggle's 5-Day Challenge.

The **goal** of this exercise is to clean missing entries. 

The assignment will be **evaluated** on:

* Design process and thinking as a data engineer.
* Validation of knowledge on the different tools and steps throughout the process.
* Storytelling and visualisation of the insights.

Exercise **workflow**:

* Import dependencies & download dataset from [here](https://www.kaggle.com/maxhorowitz/nflplaybyplay2009to2016/download).
* Explore missing data points and values.
* Assess the reason for the missing data points and values.
* Evaluate a method to drop the missing values.
* Evaluate a method to fill the missing values.
    
Notes:

* Write your code into the `TODO` cells.
* Feel free to choose how to present the results throughout the exercise, including which libraries (e.g., seaborn, bokeh) and/or tools (e.g., Power BI or Tableau) to use.

## Preamble
________

In [31]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")

np.random.seed(0) 

## Read data

In [32]:
org_dt = pd.read_csv("NFL_Play_ by_Play_ 2009-2018_(v5).csv")


# Create a copy of the original DataFrame
org_cpy = org_dt.copy()

# Print a sample of the DataFrame
org_dt.head(5)

Unnamed: 0,play_id,game_id,home_team,away_team,posteam,posteam_type,defteam,side_of_field,yardline_100,game_date,...,penalty_player_id,penalty_player_name,penalty_yards,replay_or_challenge,replay_or_challenge_result,penalty_type,defensive_two_point_attempt,defensive_two_point_conv,defensive_extra_point_attempt,defensive_extra_point_conv
0,46,2009091000,PIT,TEN,PIT,home,TEN,TEN,30.0,2009-09-10,...,,,,0.0,,,0.0,0.0,0.0,0.0
1,68,2009091000,PIT,TEN,PIT,home,TEN,PIT,58.0,2009-09-10,...,,,,0.0,,,0.0,0.0,0.0,0.0
2,92,2009091000,PIT,TEN,PIT,home,TEN,PIT,53.0,2009-09-10,...,,,,0.0,,,0.0,0.0,0.0,0.0
3,113,2009091000,PIT,TEN,PIT,home,TEN,PIT,56.0,2009-09-10,...,,,,0.0,,,0.0,0.0,0.0,0.0
4,139,2009091000,PIT,TEN,PIT,home,TEN,PIT,56.0,2009-09-10,...,,,,0.0,,,0.0,0.0,0.0,0.0


## Data
________

**TODO**

* Download the data from [here](https://www.kaggle.com/maxhorowitz/nflplaybyplay2009to2016/download)
* Get some info regarding the dataframe (e.g., shape, dimensions, column names, etc.)

In [33]:
# Size, shape and number of dimensions of the DataFrame

print(" Size of data : ", org_dt.size)
print(" Shape (Row, columns) of data : ", org_dt.shape)
print(" Dimensions of data : ", org_dt.ndim)


 Size of data :  6531060
 Shape (Row, columns) of data :  (25612, 255)
 Dimensions of data :  2


In [34]:
# Column names
org_dt.columns.values

array(['play_id', 'game_id', 'home_team', 'away_team', 'posteam',
       'posteam_type', 'defteam', 'side_of_field', 'yardline_100',
       'game_date', 'quarter_seconds_remaining', 'half_seconds_remaining',
       'game_seconds_remaining', 'game_half', 'quarter_end', 'drive',
       'sp', 'qtr', 'down', 'goal_to_go', 'time', 'yrdln', 'ydstogo',
       'ydsnet', 'desc', 'play_type', 'yards_gained', 'shotgun',
       'no_huddle', 'qb_dropback', 'qb_kneel', 'qb_spike', 'qb_scramble',
       'pass_length', 'pass_location', 'air_yards', 'yards_after_catch',
       'run_location', 'run_gap', 'field_goal_result', 'kick_distance',
       'extra_point_result', 'two_point_conv_result',
       'home_timeouts_remaining', 'away_timeouts_remaining', 'timeout',
       'timeout_team', 'td_team', 'posteam_timeouts_remaining',
       'defteam_timeouts_remaining', 'total_home_score',
       'total_away_score', 'posteam_score', 'defteam_score',
       'score_differential', 'posteam_score_post', 'defteam_

## Exploration of missing data points and values
___

**TODO**

* How many missing values are there?
* What's the percentage of missing values?
* How many missing data points per column are there?

In [42]:
# Count and percentage of missing values for each column
missing_count = org_dt.isnull().sum()
missing_percent = missing_count * 100 / len(org_dt)
miss_orgdt = pd.DataFrame({'missing (count)': missing_count, 'missing (%)': missing_percent})
miss_orgdt.sort_values(by='missing (%)', ascending=False)

Unnamed: 0,missing (count),missing (%)
assist_tackle_4_team,25612,100.0
assist_tackle_3_player_id,25612,100.0
assist_tackle_3_player_name,25612,100.0
lateral_rusher_player_id,25612,100.0
lateral_rusher_player_name,25612,100.0
...,...,...
extra_point_prob,0,0.0
total_away_score,0,0.0
total_home_score,0,0.0
away_timeouts_remaining,0,0.0
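
The per-column table answers the third question. For the first two (the total number of missing values and their overall percentage), one possible approach is sketched below. It runs on `toy_dt`, a small hypothetical stand-in for `org_dt` with made-up values, so the numbers it prints say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for org_dt; the values are invented for illustration.
toy_dt = pd.DataFrame({
    "play_id": [46, 68, 92, 113],
    "penalty_yards": [np.nan, 5.0, np.nan, np.nan],
    "penalty_type": [None, "Offside", None, None],
})

# Sum the per-column counts to get the total number of missing cells.
total_missing = int(toy_dt.isnull().sum().sum())

# DataFrame.size is rows * columns, i.e. the total number of cells.
total_cells = toy_dt.size

percent_missing = 100 * total_missing / total_cells
print(f"{total_missing} of {total_cells} cells are missing ({percent_missing:.1f}%)")
```

Replacing `toy_dt` with `org_dt` should give the corresponding figures for the full dataset.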


## Assessment of missing data points and values
____
 
**TODO**

* Look at the number of missing data points in every column that has at least one, sorted in descending order.
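
A minimal sketch of this step, assuming a small hypothetical frame `toy_dt` in place of `org_dt` (the column names are borrowed from the dataset, the values are made up): count the missing entries per column, keep only the columns where the count is above zero, and sort.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for org_dt; the values are invented for illustration.
toy_dt = pd.DataFrame({
    "play_id": [46, 68, 92, 113],
    "down": [1.0, 2.0, np.nan, 1.0],
    "penalty_yards": [np.nan, 5.0, np.nan, np.nan],
    "timeout_team": [None, None, None, None],
})

# Missing entries per column.
missing_count = toy_dt.isnull().sum()

# Keep only columns with at least one missing entry, largest count first.
nonzero_missing = missing_count[missing_count > 0].sort_values(ascending=False)
print(nonzero_missing)
```

Columns that are empty on every row (such as `timeout_team` here) often describe events that simply did not happen on those plays, whereas a column with only a few gaps is more likely to contain values that were not recorded. Telling the two apart usually requires reading the dataset documentation.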

## Drop missing values
___

**TODO**

* Evaluate removing all rows with missing values
* Evaluate removing all columns with at least one missing value
* Compare the original dataframe and the filtered ones.
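
The three steps above can be sketched with `DataFrame.dropna`, here on a hypothetical `toy_dt` with made-up values rather than on `org_dt`. Comparing shapes is only one simple way to compare the frames.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for org_dt; the values are invented for illustration.
toy_dt = pd.DataFrame({
    "play_id": [46, 68, 92, 113],
    "game_id": [2009091000] * 4,
    "down": [1.0, 2.0, np.nan, 1.0],
    "penalty_yards": [np.nan, 5.0, np.nan, np.nan],
})

# Drop every row that contains at least one missing value (default axis=0).
rows_dropped = toy_dt.dropna()

# Drop every column that contains at least one missing value.
cols_dropped = toy_dt.dropna(axis=1)

print("Original shape:       ", toy_dt.shape)
print("After dropping rows:  ", rows_dropped.shape)
print("After dropping columns:", cols_dropped.shape)
```

On a wide frame where some column is missing on nearly every row, dropping rows can leave little or nothing behind, so it is worth checking the resulting shape before committing to either option.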

## Fill in missing values
_____

**TODO**

* Select a subset of the dataset
* Evaluate replacing all NaNs with 0
* Evaluate replacing all NaNs with the next value in the same column
* Compare the original dataframe and the one with the filled NaNs
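
One way these steps might look, assuming a hypothetical `toy_dt` with made-up values in place of `org_dt`. The column slice used for the subset is arbitrary. After the backward fill, any NaNs left at the end of a column (where no later value exists) are replaced with 0, which is a choice made for this sketch rather than a requirement of the exercise.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for org_dt; the values are invented for illustration.
toy_dt = pd.DataFrame({
    "play_id": [46, 68, 92, 113, 139],
    "down": [1.0, np.nan, 3.0, np.nan, np.nan],
    "penalty_yards": [np.nan, 5.0, np.nan, np.nan, 10.0],
})

# Select a subset: a label-based column slice and the first five rows.
subset = toy_dt.loc[:, "down":"penalty_yards"].head(5)

# Option 1: replace every NaN with 0.
filled_zero = subset.fillna(0)

# Option 2: backward fill (take the next valid value in the same column),
# then fall back to 0 for trailing NaNs that have no later value.
filled_next = subset.bfill().fillna(0)

print("NaNs in subset:     ", int(subset.isnull().sum().sum()))
print("NaNs after fillna(0):", int(filled_zero.isnull().sum().sum()))
print("NaNs after bfill:    ", int(filled_next.isnull().sum().sum()))
```

Filling with 0 is only sensible where 0 is a plausible value for the column, and backward fill assumes that neighbouring rows are related, so both should be checked against what each column means.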
