**This notebook is an exercise in the [Data Cleaning](https://www.kaggle.com/learn/data-cleaning) course.  You can reference the tutorial at [this link](https://www.kaggle.com/alexisbcook/handling-missing-values).**

---


In this exercise, you'll apply what you learned in the **Handling missing values** tutorial.

# Setup

The questions below will give you feedback on your work. Run the following cell to set up the feedback system.

# 1) Take a first look at the data

Run the next code cell to load in the libraries and dataset you'll use to complete the exercise.

In [None]:
# modules we'll use
import pandas as pd
import numpy as np

# read in all our data
sf_permits = pd.read_csv("../input/building-permit-applications-data/Building_Permits.csv")
nfl_data = pd.read_csv("../input/nfl-play-by-play-20092016-v3csv/NFL Play by Play 2009-2016 (v3).csv")

# set seed for reproducibility
np.random.seed(0) 

Use the code cell below to print the first five rows of the `sf_permits` DataFrame.

In [None]:
# show the first five rows
sf_permits.head()

# 2) How many missing data points do we have?

What percentage of the values in the dataset are missing?  Your answer should be a number between 0 and 100.  (If 1/4 of the values in the dataset are missing, the answer is 25.)

In [None]:
# get the number of missing data points per column
missing_values_count = sf_permits.isnull().sum()


In [None]:
# how many total missing values do we have?
total_cells = np.prod(sf_permits.shape)  # np.product was removed in NumPy 2.0
total_missing = missing_values_count.sum()

# percent of data that is missing
percent_missing = (total_missing/total_cells) * 100
print(percent_missing)
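
The same calculation can be checked by hand on a tiny frame. This is only a sketch: `toy` and its values are made up for illustration and are not part of the exercise data. It has 12 cells, 3 of them missing, so it matches the "1/4 of the values" case described above.

```python
import numpy as np
import pandas as pd

# hypothetical toy data: 3 rows x 4 columns = 12 cells, 3 of them missing
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": ["x", "y", None],
    "c": [np.nan, 2.0, 3.0],
    "d": [1, 2, 3],
})

# count missing cells over the whole frame, then divide by the number of cells
toy_total_missing = toy.isnull().sum().sum()
toy_percent_missing = toy_total_missing / toy.size * 100
print(toy_percent_missing)  # 25.0
```

`DataFrame.size` is the number of rows times the number of columns, so it plays the same role as `np.prod(toy.shape)`.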

# 3) Figure out why the data is missing

Look at the columns **"Street Number Suffix"** and **"Zipcode"** from the [San Francisco Building Permits dataset](https://www.kaggle.com/aparnashastry/building-permit-applications-data). Both of these contain missing values. 
- Which, if either, are missing because they don't exist? 
- Which, if either, are missing because they weren't recorded?  

Once you have an answer, run the code cell below.

In [None]:
# get a subset of sf_permits with the two columns in question.
# Is a value missing because it wasn't recorded (1) or because it doesn't exist (2)?
# "Zipcode" belongs to category 1 and "Street Number Suffix" belongs to category 2
sf_permits_subset = sf_permits[['Street Number Suffix', 'Zipcode']]
sf_permits_subset

In [None]:
# look at the # of missing points in the first ten columns
missing_values_count[0:10]
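
One way to support the answer is to compare how much of each column is missing. The sketch below uses invented rows (the `toy` frame is hypothetical, only the column names come from the dataset): a column that is almost entirely empty is more likely to hold values that don't exist, while a column with a few gaps is more likely to have values that weren't recorded.

```python
import numpy as np
import pandas as pd

# hypothetical rows shaped like the two columns discussed above (values invented)
toy = pd.DataFrame({
    "Street Number Suffix": [np.nan, "A", np.nan, np.nan],
    "Zipcode": [94102.0, 94109.0, np.nan, 94103.0],
})

# the mean of a boolean mask is the fraction of True values,
# i.e. the fraction of missing entries in each column
fraction_missing = toy.isnull().mean()
print(fraction_missing)
```

The fraction alone is not proof either way. It is a hint to combine with what you know about the column.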

# 4) Drop missing values: rows

If you removed all of the rows of `sf_permits` with missing values, how many rows are left?

**Note**: Do not change the value of `sf_permits` when checking this.  

In [None]:
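# remove all the rows that contain a missing value and count the rows that remain
# (dropna returns a new DataFrame, so sf_permits itself is unchanged)
sf_permits.dropna().shape[0]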
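
As a small illustration that `dropna()` leaves the original untouched, here is a sketch on a made-up frame (`toy` is hypothetical, not the permits data):

```python
import numpy as np
import pandas as pd

# hypothetical toy data: only the last row has no missing values
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 2.0, 3.0]})

# dropna() returns a new DataFrame; toy keeps all of its rows
rows_left = toy.dropna().shape[0]
print(rows_left)     # 1
print(toy.shape[0])  # 3
```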


# 5) Drop missing values: columns

Now try removing all the columns with empty values.  
- Create a new DataFrame called `sf_permits_with_na_dropped` that has all of the columns with empty values removed.  
- How many columns were removed from the original `sf_permits` DataFrame? Use this number to set the value of the `dropped_columns` variable below.

In [None]:
# remove all columns with at least one missing value
sf_permits_with_na_dropped = sf_permits.dropna(axis=1)
sf_permits_with_na_dropped.head()

In [None]:
# just how much data did we lose?
print("Columns in original dataset: %d \n" % sf_permits.shape[1])
print("Columns with na's dropped: %d" % sf_permits_with_na_dropped.shape[1])

# number of columns removed from the original DataFrame
dropped_columns = sf_permits.shape[1] - sf_permits_with_na_dropped.shape[1]
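
The number of removed columns is just the difference between the two column counts. A minimal sketch on invented data (the `toy` names are hypothetical):

```python
import numpy as np
import pandas as pd

# hypothetical toy data: columns "a" and "b" each contain a missing value
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "a": [1.0, np.nan, 3.0],
    "b": ["x", None, "z"],
    "c": [7, 8, 9],
})

# axis=1 drops columns (rather than rows) that contain any missing value
toy_with_na_dropped = toy.dropna(axis=1)
toy_dropped_columns = toy.shape[1] - toy_with_na_dropped.shape[1]
print(list(toy_with_na_dropped.columns))  # ['id', 'c']
print(toy_dropped_columns)                # 2
```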

# 6) Fill in missing values automatically

Try replacing each NaN in the `sf_permits` data with the value that comes directly after it in the same column, and then replacing any remaining NaN's with 0.  Set the result to a new DataFrame `sf_permits_with_na_imputed`.

In [None]:
# get a subset of the sf_permits dataset with the columns with missing values
sf_permits_data = sf_permits[ ['Street Number Suffix','Street Suffix']].head()
sf_permits_data

In [None]:
# get a small subset of the sf_permits dataset (columns 'Block' through 'Site Permit')
sf_permits_data = sf_permits.loc[:, 'Block':'Site Permit'].head()
sf_permits_data

In [None]:
# replace each NA with the value that comes directly after it in the same column,
# then replace all the remaining NA's with 0
# (DataFrame.bfill replaces fillna(method='bfill'), which is deprecated in recent pandas)
sf_permits_data.bfill(axis=0).fillna(0)
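
To see why the second `fillna(0)` is needed, here is a sketch on made-up data (`toy` is hypothetical). A backward fill can only borrow from a later row, so a NaN with nothing valid below it stays NaN until the zero fill.

```python
import numpy as np
import pandas as pd

# hypothetical toy data: both columns end in NaN, so bfill alone cannot fill them
toy = pd.DataFrame({
    "a": [np.nan, 2.0, np.nan],
    "b": [1.0, np.nan, np.nan],
})

# step 1: copy the next valid value in the same column upward
# step 2: anything still missing (no valid value below it) becomes 0
toy_imputed = toy.bfill().fillna(0)
print(toy_imputed)
```

In column `a` the first NaN becomes 2.0 and the trailing NaN becomes 0. In column `b` both NaN's become 0 because no valid value follows them.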

In [None]:
#additional tests:

In [None]:
# replace all NA's with 0 (I put it in a new dataframe)
tmp_data = sf_permits_data.fillna(0)

In [None]:
tmp_data

In [None]:
# get the new number of missing data points per column
missing_values_count = tmp_data.isnull().sum()

In [None]:
# how many total missing values do we have in the new data-frame?
total_cells = np.prod(tmp_data.shape)  # np.product was removed in NumPy 2.0
total_missing = missing_values_count.sum()

# percent of data that is missing
(total_missing/total_cells) * 100

In [None]:
# write the cleaned DataFrame to a new file
# (note: sep='\t' makes this tab-separated even though the file name ends in .csv)
tmp_data.to_csv("cleaned-data.csv", sep='\t', encoding='utf-8')
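
Because the file is written with a tab separator and the default index column, reading it back needs matching arguments. A minimal sketch, using an in-memory buffer and invented values instead of the real file (`toy`, `buffer` and `restored` are hypothetical names):

```python
import io

import pandas as pd

# hypothetical stand-in for the cleaned data (values invented)
toy = pd.DataFrame({"Block": [326, 306], "Lot": [23, 7]})

# write tab-separated text, including the index, as to_csv does by default
buffer = io.StringIO()
toy.to_csv(buffer, sep="\t")
buffer.seek(0)

# read it back with the same separator and treat the first column as the index
restored = pd.read_csv(buffer, sep="\t", index_col=0)
print(restored)
```

Without `sep="\t"`, `read_csv` would assume commas and put each line into a single column.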