# Step 1: Load in the libraries and datasets


In [None]:
# modules we'll use
import pandas as pd
import numpy as np

# read in all our data
nfl_data = pd.read_csv("../input/nflplaybyplay2009to2016/NFL Play by Play 2009-2017 (v4).csv")
sf_permits = pd.read_csv("../input/building-permit-applications-data/Building_Permits.csv")

# set seed for reproducibility
np.random.seed(0) 

# Step 2: Take a look at some data: *.sample()

In [None]:
# look at a few rows of the nfl_data file. I can see a handful of missing data already!
nfl_data.sample(5)
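
As a side note on reproducibility, `sample()` also accepts a `random_state` argument, so a single call can be pinned without relying on the global NumPy seed. A minimal sketch, assuming a made-up `toy` dataframe (hypothetical, not part of the NFL data):

```python
import pandas as pd

# hypothetical toy data standing in for nfl_data
toy = pd.DataFrame({"x": range(10)})

# the same random_state gives the same rows on every call
first = toy.sample(3, random_state=0)
second = toy.sample(3, random_state=0)

print(first.equals(second))
```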

In [None]:
# your turn! Look at a couple of rows from the sf_permits dataset. Do you notice any missing data?
sf_permits.sample(10)

# Step 3: See how many missing values: *.isnull().sum()

In [None]:
# get the number of missing data points per column
missing_values_count = nfl_data.isnull().sum()

# look at the # of missing points in the first ten columns
missing_values_count[0:10]

In [None]:
# how many total missing values do we have?
# shape is (n_rows, n_columns), so the product of its entries is the total number of cells
total_cells = np.prod(nfl_data.shape)
total_missing = missing_values_count.sum()

# percent of data that is missing
(total_missing/total_cells) * 100
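
The same idea can be broken down per column. This is a minimal sketch on a made-up dataframe (the `toy` frame and its column names are hypothetical): the mean of the boolean mask returned by `isnull()` is the fraction of missing values in each column.

```python
import numpy as np
import pandas as pd

# hypothetical toy data standing in for nfl_data
toy = pd.DataFrame({
    "down": [1.0, np.nan, 3.0, np.nan],
    "yards": [5.0, 2.0, np.nan, 7.0],
    "team": ["SF", "NE", "SF", "NE"],
})

# isnull() gives a boolean mask; its column means are the fraction missing
percent_missing_per_column = toy.isnull().mean() * 100

# overall percentage: missing cells divided by all cells
total_cells = toy.shape[0] * toy.shape[1]
percent_missing_overall = toy.isnull().sum().sum() / total_cells * 100

print(percent_missing_per_column)
print(percent_missing_overall)
```

Sorting `percent_missing_per_column` in descending order is a quick way to see which columns deserve a closer look first.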

In [None]:
# your turn! Find out what percent of the sf_permits dataset is missing
missing_value_count1 = sf_permits.isnull().sum()
missing_value_count1[0:10]




In [None]:
# total cells in sf_permits, and total missing values summed across its columns
total_cells = np.prod(sf_permits.shape)
total_missing = missing_value_count1.sum()
(total_missing/total_cells) * 100

# Step 4: Figure out why the data is missing

> **Is this value missing because it wasn't recorded or because it doesn't exist?**

If a value is missing because it doesn't exist (like the height of the oldest child of someone who doesn't have any children), then it doesn't make sense to try and guess what it might be. These values you probably do want to keep as NaN. On the other hand, if a value is missing because it wasn't recorded, then you can try to guess what it might have been based on the other values in that column and row. (This is called "imputation" and we'll learn how to do it next!)

Let's work through an example. Looking at the number of missing values in the nfl_data dataframe, I notice that the column `TimeSecs` has a lot of missing values in it:

In [None]:
# look at the # of missing points in the first ten columns
missing_values_count[0:10]

By looking at [the documentation](https://www.kaggle.com/maxhorowitz/nflplaybyplay2009to2016), I can see that this column has information on the number of seconds left in the game when the play was made. This means that these values are probably missing because they were not recorded, rather than because they don't exist. So, it would make sense for us to try and guess what they should be rather than just leaving them as NA's.

On the other hand, there are other fields, like `PenalizedTeam`, that also have a lot of missing values. In this case, though, the value is missing because there was no penalty, so it doesn't make sense to say *which* team was penalized. For this column, it would make more sense to either leave it empty or to add a third value like "neither" and use that to replace the NA's.
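
A minimal sketch of the "neither" idea, assuming a made-up `plays` dataframe (hypothetical; only the column name `PenalizedTeam` comes from the NFL data):

```python
import numpy as np
import pandas as pd

# hypothetical plays: a penalized team only exists when there was a penalty
plays = pd.DataFrame({
    "play_id": [1, 2, 3, 4],
    "PenalizedTeam": ["SF", np.nan, np.nan, "NE"],
})

# treat "no penalty" as its own category rather than guessing a team
plays["PenalizedTeam"] = plays["PenalizedTeam"].fillna("neither")

print(plays["PenalizedTeam"].value_counts())
```

Filling only this one column keeps the decision local: other columns with missing values can still be handled differently.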

# Step 5 (1): Drop missing values: *.dropna()
___
(not generally recommended)
Scenario: you're in a hurry or don't have a reason to figure out why your values are missing.
Action: remove any rows or columns that contain missing values with `dropna()`.

In [None]:
# remove all the rows that contain a missing value
nfl_data.dropna()

In [None]:
# remove all columns with at least one missing value
columns_with_na_dropped = nfl_data.dropna(axis=1)
# axis : {0 or 'index', 1 or 'columns'}, default 0. Determines whether rows or columns with missing values are removed.
#   0 or 'index': drop rows which contain missing values. 1 or 'columns': drop columns which contain missing values.
# how : {'any', 'all'}, default 'any'. Determines whether a row or column is removed when it has at least one NA or all NA.
#   'any': drop the row or column if any NA values are present. 'all': drop it only if all values are NA.
columns_with_na_dropped.head()
# Dataframe.head() Return the first n rows. It is useful for quickly testing if your object has the right type of data in it.
# Parameters: n : int, default 5. Number of rows to select.

In [None]:
# just how much data did we lose?
print("Columns in original dataset: %d \n" % nfl_data.shape[1])
print("Columns with na's dropped: %d" % columns_with_na_dropped.shape[1])

We've lost quite a bit of data, but at this point we have successfully removed all the `NaN`'s from our data. 
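
To see how the `axis`, `how`, and `thresh` parameters of `dropna()` change what survives, here is a minimal sketch on a made-up dataframe (the `toy` frame is hypothetical, chosen so that one column is entirely empty):

```python
import numpy as np
import pandas as pd

# hypothetical toy data: column "b" is entirely missing
toy = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0],
    "b": [np.nan, np.nan, np.nan, np.nan],
    "c": ["w", "x", "y", "z"],
})

# drop rows with any NaN: every row has a NaN in "b", so nothing is left
rows_any = toy.dropna()

# drop columns with any NaN: only "c" is complete
cols_any = toy.dropna(axis=1)

# drop only the columns where every value is NaN: "a" and "c" remain
cols_all = toy.dropna(axis=1, how="all")

# keep rows that have at least 2 non-missing values
rows_thresh = toy.dropna(thresh=2)

print(rows_any.shape, cols_any.shape, cols_all.shape, rows_thresh.shape)
```

The `how="all"` and `thresh` variants are gentler options when dropping every row or column with any `NaN` throws away too much.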

In [None]:
# Your turn! Try removing all the rows from the sf_permits dataset that contain missing values. How many are left?
sf_permits.dropna()


In [None]:
# Now try removing all the columns with empty values. Now how much of your data is left?
dropped = sf_permits.dropna(axis = 1)
dropped.head()
print("Columns in sf_permits: %d \n" % sf_permits.shape[1])
print("Columns in the dropped dataset: %d \n" % dropped.shape[1])

# Step 5(2): Filling in missing values automatically: *.fillna()

In [None]:
# get a small subset of the NFL dataset
subset_nfl_data = nfl_data.loc[:, 'EPA':'Season'].head()
subset_nfl_data

Option 1: specify what we want the `NaN` values to be replaced with. Here, I'm saying that I would like to replace all the `NaN` values with 0.

In [None]:
# replace all NA's with 0
subset_nfl_data.fillna(0)

Option 2: replace each missing value with whatever value comes directly after it in the same column. (This makes a lot of sense for datasets where the observations have some sort of logical order to them.)

In [None]:
# replace each NA with the value that comes directly after it in the same column,
# then replace all the remaining NA's with 0
subset_nfl_data.bfill(axis=0).fillna(0)
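
To make the direction of the fill concrete, here is a minimal sketch on a made-up ordered column (the `readings` frame is hypothetical). It contrasts backward fill with forward fill, which takes the value directly *before* the gap instead:

```python
import numpy as np
import pandas as pd

# hypothetical ordered observations with gaps
readings = pd.DataFrame({"value": [np.nan, 1.0, np.nan, np.nan, 4.0, np.nan]})

# backward fill: each NaN takes the next valid value below it;
# the trailing NaN has nothing after it, so it stays NaN
back = readings["value"].bfill()

# forward fill: each NaN takes the last valid value above it;
# the leading NaN has nothing before it, so it stays NaN
forward = readings["value"].ffill()

# anything backward fill could not reach is replaced with 0
back_then_zero = back.fillna(0)

print(back_then_zero.tolist())
print(forward.tolist())
```

Which direction is appropriate depends on what the ordering of the rows means for your data.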

In [None]:
sf_permits.head()

In [None]:
# Your turn! Try replacing all the NaN's in the sf_permits data with the one that
# comes directly after it and then replacing any remaining NaN's with 0
filled_with_0=sf_permits.fillna(0)
filled_with_0.head()

In [None]:
# fill each NaN with the value that comes directly after it, then fill any remaining NaN's with 0
filled_with_next = sf_permits.bfill(axis=0).fillna(0)
filled_with_next.head()

And that's it for today! If you have any questions, be sure to post them in the comments below or [on the forums](https://www.kaggle.com/questions-and-answers). 

Remember that your notebook is private by default, and in order to share it with other people or ask for help with it, you'll need to make it public. First, you'll need to save a version of your notebook that shows your current work by hitting the "Commit & Run" button. (Your work is saved automatically, but versioning your work lets you go back and look at what it was like at the point you saved it. It also lets you share a nice compiled notebook instead of just the raw code.) Then, once your notebook is finished running, you can go to the Settings tab in the panel to the left (you may have to expand it by hitting the [<] button next to the "Commit & Run" button) and set the "Visibility" dropdown to "Public".

# More practice!
___

If you're looking for more practice handling missing values, check out these extra-credit\* exercises:

* [Handling Missing Values](https://www.kaggle.com/dansbecker/handling-missing-values): In this notebook Dan shows you several approaches to imputing missing data using scikit-learn's imputer. 
* Look back at the `Zipcode` column in the `sf_permits` dataset, which has some missing values. How would you go about figuring out what the actual zipcode of each address should be? (You might try using another dataset. You can search for datasets about San Francisco on the [Datasets listing](https://www.kaggle.com/datasets).)

\* no actual credit is given for completing the challenge, you just learn how to clean data real good :P