Welcome to day 1 of the 5-Day Data Challenge! Today, we're going to be looking at how to deal with missing values. To get started, click the blue "Fork Notebook" button in the upper right-hand corner. This will create a private copy of this notebook that you can edit and play with. Once you're finished with the exercises, you can choose to make your notebook public to share with others. :)

Here's what we're going to do today:

* [Take a first look at the data](#Take-a-first-look-at-the-data)
* [See how many missing data points we have](#See-how-many-missing-data-points-we-have)
* [Figure out why the data is missing](#Figure-out-why-the-data-is-missing)
* [Drop missing values](#Drop-missing-values)
* [Filling in missing values automatically](#Filling-in-missing-values-automatically)

Let's get started!

# Take a first look at the data
________

The first thing we'll need to do is load in the libraries and datasets we'll be using. For today, I'll be using a dataset of events that occurred in American Football games for demonstration, and you'll be using a dataset of building permits issued in San Francisco.

> **Important!** Make sure you run this cell yourself or the rest of your code won't work!

In [61]:
# modules we'll use
import pandas as pd
import numpy as np

# read in all our data
# the NFL data is the demo dataset, so it is commented out here
# nfl_data = pd.read_csv("../input/nflplaybyplay2009to2016/NFL Play by Play 2009-2017 (v4).csv")
sf_permits = pd.read_csv("../input/building-permit-applications-data/Building_Permits.csv")

# set seed for reproducibility
np.random.seed(0) 
print(sf_permits.shape)

Note:

The San Francisco Building Permits dataset has:

1. 198,900 rows (also called samples)
2. 43 columns (also called features)

This first glance tells us how big the dataset is.

In [9]:
# your turn! Look at a couple of rows from the sf_permits dataset. Do you notice any missing data?
sf_permits.sample(5)
# your code goes here :)

# See how many missing data points we have
___

Ok, now we know that we do have some missing values. Let's see how many we have in each column. 

In [288]:
# your turn! Find out what percent of the sf_permits dataset is missing
sf_col_missing_count = sf_permits.isnull().sum()
sf_col_nomissing_count = sf_permits.notnull().sum()
sf_col_count = pd.DataFrame({"Valid":sf_col_nomissing_count,"Missing":sf_col_missing_count})
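
The cell above collects per-column counts. To turn counts into the single "percent missing" figure the exercise asks for, one possible sketch is shown below. It uses a small made-up `toy` DataFrame as a stand-in for `sf_permits` (the values are invented for illustration), so it runs on its own.

```python
import numpy as np
import pandas as pd

# toy stand-in for sf_permits; the values are made up for illustration
toy = pd.DataFrame({
    "Permit Number": [1, 2, 3, 4],
    "Zipcode": [94103.0, np.nan, 94110.0, 94103.0],
    "Unit Suffix": [np.nan, np.nan, "A", np.nan],
})

total_cells = toy.shape[0] * toy.shape[1]   # rows * columns
total_missing = toy.isnull().sum().sum()    # missing cells summed over all columns
percent_missing = total_missing / total_cells * 100

print("{:.2f}% of cells are missing".format(percent_missing))
```

Swapping `toy` for `sf_permits` should give the overall share of missing cells for the real dataset.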

# Visualizing the percentage of missing data


In [110]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_context("poster")

fig,ax = plt.subplots(1,2,figsize=(20,5))
sf_col_count.sum().plot.pie(autopct='%.2f',ax=ax[0]) # show the percentage of missing cells
ax[0].set_title("Percentage of missing cells")


# True for every column that has at least one missing value
IsMissingData_cols = sf_col_count.Missing>0
# pass the data as x= explicitly; recent seaborn versions do not accept it positionally
sns.countplot(x=IsMissingData_cols,ax=ax[1])
#tmp_col_counter = (sf_col_count>0).sum()  # number of columns with missing data
#tmp_col_counter.plot.pie(autopct='%.2f',ax=ax[1]) # share of columns with missing data vs. total columns
ax[1].set_title("Number of columns with missing data")
print("Number of columns with missing data: {0}".format(IsMissingData_cols.sum()))


In [220]:

ax = sf_col_count.sort_values(by="Missing",ascending =False).plot.bar(stacked=True,figsize = (20,5),rot=-30)
ax.set_title("Dataset Missing numbers - San Francisco Building Permits ")

In [228]:
sf_col_count["Missing"].sort_values().tail(5)

Note:

* **26% of cells** are missing, out of 198,900 * 43 cells in total.
* **31 columns** contain missing data, out of 43 columns in total.
* The 5 columns with the most missing values are "TIDF Compliance", "Voluntary Soft-Story Retrofit", "Unit Suffix", "Street Number Suffix", and "Site Permit". Almost all of their values are missing.
         


**Next action:**
The overall share of missing data is well above 10%, so the missing values cannot simply be dropped.

**Reference: Multivariate Data Analysis, chapter 2, "Examining Your Data"**

**How Much Missing Data Is Too Much?**
* Missing data **under 10 percent** for an individual case or observation can generally be ignored, **except** when the missing data occurs in a **specific nonrandom fashion** (e.g., concentration in a specific set of questions, attrition at the end of the questionnaire, etc.) [19, 20]
* The number of cases with no missing data **must be sufficient for the selected analysis technique** if replacement values will not be substituted (imputed) for the missing data
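
To make the 10 percent rule of thumb concrete, here is a minimal sketch that computes the share of missing values per row (per case) and flags rows above the threshold. The `toy` DataFrame and the 0.10 cutoff variable are assumptions for illustration; whether the missingness is random still has to be judged separately.

```python
import numpy as np
import pandas as pd

# toy data: 4 cases (rows) with 20 variables (columns) each, values made up
toy = pd.DataFrame(np.arange(80, dtype=float).reshape(4, 20),
                   columns=["c{}".format(i) for i in range(20)])
toy.iloc[0, 0] = np.nan           # row 0: 1 of 20 values missing (5%)
toy.iloc[2, [1, 2, 3]] = np.nan   # row 2: 3 of 20 values missing (15%)

threshold = 0.10                                  # rule-of-thumb cutoff from the reference
row_missing_share = toy.isnull().mean(axis=1)     # fraction of missing values per row
rows_over_threshold = row_missing_share[row_missing_share > threshold]

print(rows_over_threshold)
```

Using `axis=0` instead of `axis=1` would give the same share per column rather than per case.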

# Figure out why the data is missing
____
 
> **Is this value missing because it wasn't recorded or because it doesn't exist?**

## Your turn!

* Look at the columns `Street Number Suffix` and `Zipcode` from the `sf_permits` dataset. Both of these contain missing values. Which, if either, of these are missing because they don't exist? Which, if either, are missing because they weren't recorded?

In [134]:
check_cols =["Street Number Suffix","Zipcode"]
sf_permits[check_cols].sample(5)

Note: from a random sample, we found that
* almost all "Street Number Suffix" values are NaN, i.e. missing;
* no missing values showed up in the "Zipcode" sample (this time; the result depends on the random draw).

Next action:
## Check the extent of missing data

In [268]:
# one axis per column, so each pie (Valid vs. Missing) gets its own subplot
fig,axes = plt.subplots(1,2,figsize = (10,5))
sf_col_count.loc[check_cols].T.plot.pie(subplots=True,legend=False,
                                        autopct='%.2f',fontsize=10,
                                        startangle=90,ax=axes)

Note:
1. The "Street Number Suffix" column has more than 99% of its data missing.
2. The "Zipcode" column has less than 0.9% of its data missing.
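
The same percentages can also be read off as numbers instead of from the pie charts. A minimal sketch, assuming a made-up `toy` DataFrame in place of `sf_permits`:

```python
import numpy as np
import pandas as pd

# toy stand-in for sf_permits; the values are made up for illustration
toy = pd.DataFrame({
    "Street Number Suffix": [np.nan, np.nan, np.nan, "A", np.nan],
    "Zipcode": [94103.0, 94110.0, np.nan, 94103.0, 94107.0],
})
check_cols = ["Street Number Suffix", "Zipcode"]

# isnull() gives True/False per cell; the mean of booleans is the missing fraction
percent_missing_by_col = toy[check_cols].isnull().mean() * 100

print(percent_missing_by_col.round(2))
```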

To explore the missing data, the first step is to decide whether it is ignorable or non-ignorable:
* **Ignorable missing data** is expected to be missing and is under the control of the research design. Since almost all of "Street Number Suffix" is missing, it might be ignorable missing data.
* **Non-ignorable missing data** comes from procedural factors, for example data entry errors, failure to complete the whole questionnaire, restrictions, and so on. Since only a tiny part of "Zipcode" is missing, it might be non-ignorable missing data.

Either way, we need to check the data dictionary to verify the assumptions above.

Next action:

## Check the data dictionary

In [133]:
Data_Dictionary = pd.read_excel("../input/building-permit-applications-data/DataDictionaryBuildingPermit.xlsx")

mask = Data_Dictionary["Column name"].isin( check_cols)
print(Data_Dictionary[mask])

Note:
According to the descriptions in the data dictionary,
* "Street Number Suffix" is derived from the address. As I understand it, it is a kind of second-level (engineered) feature, so it may legitimately have no value.
* "Zipcode" is the zipcode of the building address. It is a first-level feature and should usually be filled in. The missing values might be caused by entry errors or a lack of information. They could be filled in by copying the zipcode of the nearest building.
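
As a rough sketch of the "copy from a nearby building" idea, the example below fills a missing zipcode with the most common zipcode on the same street. This is only a stand-in for true nearest-building matching, which would need coordinates. The `toy` data, the `most_common` helper, and the `Zipcode_filled` column are all hypothetical and exist only for illustration.

```python
import numpy as np
import pandas as pd

# toy data, values made up; street name is used as a crude proxy for "nearby"
toy = pd.DataFrame({
    "Street Name": ["Market", "Market", "Market", "Mission", "Mission", "Geary"],
    "Zipcode": [94103.0, np.nan, 94103.0, 94110.0, np.nan, np.nan],
})

def most_common(series):
    """Return the most frequent non-missing value, or NaN if there is none."""
    modes = series.mode()  # mode() ignores NaN by default
    return modes.iloc[0] if not modes.empty else np.nan

# for every row, the most common zipcode among rows on the same street
street_zip = toy.groupby("Street Name")["Zipcode"].transform(most_common)

# only fill where the original value is missing
toy["Zipcode_filled"] = toy["Zipcode"].fillna(street_zip)

print(toy)
```

Streets with no known zipcode at all (here "Geary") stay missing, which is a reminder that this kind of fill needs an external reference for the remaining cases.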



# Drop missing values
___

If you're in a hurry or don't have a reason to figure out why your values are missing, one option you have is to just remove any rows or columns that contain missing values. (Note: I don't generally recommend this approach for important projects! It's usually worth it to take the time to go through your data and really look at all the columns with missing values one-by-one to really get to know your dataset.)  

If you're sure you want to drop rows with missing values, pandas does have a handy function, `dropna()`, to help you do this. Let's try it out on the `sf_permits` dataset!

In [271]:
# Your turn! Try removing all the rows from the sf_permits dataset that contain missing values. How many are left?
sf_permits.dropna()


**Note:**
* All the data is gone after directly dropping every row that contains a missing value.
* So this is not an applicable approach here.
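
A small illustration of why every row can disappear: `dropna()` removes a row as soon as any one of its columns is missing. The sketch below uses a made-up `toy` DataFrame and also shows the `subset` argument, which restricts the check to the columns you actually care about.

```python
import numpy as np
import pandas as pd

# toy data, values made up: every row has at least one NaN somewhere
toy = pd.DataFrame({
    "Permit Number": [1, 2, 3],
    "Zipcode": [94103.0, np.nan, 94110.0],
    "Unit Suffix": [np.nan, "B", np.nan],
})

rows_left = toy.dropna().shape[0]                           # drop on any NaN
rows_left_subset = toy.dropna(subset=["Zipcode"]).shape[0]  # only look at Zipcode

print("Rows left after dropna(): %d" % rows_left)
print("Rows left when only Zipcode must be present: %d" % rows_left_subset)
```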

**Next actions:**
   * Try dropping missing data by column.

In [277]:
# Now try removing all the columns with empty values. Now how much of your data is left?
columns_with_na_dropped_sf =sf_permits.dropna(axis=1)
#columns_with_na_dropped_sf.head()
# just how much data did we lose?
print("Columns in original dataset: %d \n" % sf_permits.shape[1])
print("Columns with na's dropped: %d" % columns_with_na_dropped_sf.shape[1])

**Note:**
1. Only 12 of the 43 columns remain. Dropping missing data by column looks better than dropping by row.

However, we can do better.
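
One way to be less drastic, sketched here as an option rather than a recommendation, is to drop only the columns that are mostly empty. `dropna()` has a `thresh` argument that keeps a column only if it has at least that many non-missing values. The `toy` DataFrame and the 50% cutoff are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# toy data, values made up
toy = pd.DataFrame({
    "Permit Number": [1, 2, 3, 4],
    "Zipcode": [94103.0, np.nan, 94110.0, 94103.0],
    "Unit Suffix": [np.nan, np.nan, np.nan, "A"],
})

# keep columns with at least 50% non-missing values (the cutoff is an arbitrary choice)
min_valid = int(0.5 * len(toy))
mostly_complete = toy.dropna(axis=1, thresh=min_valid)

print(list(mostly_complete.columns))
```

Here "Zipcode" survives with one gap left to fill, while the nearly empty "Unit Suffix" column is removed.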
  
**Next actions:** fill in the missing values instead of dropping them.

# Filling in missing values automatically
_____


We can use pandas' `fillna()` function to fill in missing values in a dataframe for us. One option we have is to specify what we want the `NaN` values to be replaced with. Here, I'm saying that I would like to replace all the `NaN` values with 0.
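
A minimal sketch of that first option, using a made-up `toy` DataFrame since the demo dataset is not loaded in this notebook:

```python
import numpy as np
import pandas as pd

# toy data, values made up
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 5.0, np.nan]})

# replace every NaN with 0; fillna returns a new DataFrame and leaves toy unchanged
filled_zero = toy.fillna(0)

print(filled_zero)
```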

I could also be a bit more savvy and replace missing values with whatever value comes directly after it in the same column. (This makes a lot of sense for datasets where the observations have some sort of logical order to them.)
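
The "value that comes directly after" option is a backward fill. The sketch below, again on a made-up `toy` DataFrame, backfills each column and then uses 0 for anything still missing (values at the end of a column have nothing after them to copy).

```python
import numpy as np
import pandas as pd

# toy data, values made up
toy = pd.DataFrame({
    "a": [np.nan, 2.0, np.nan, 4.0],
    "b": [1.0, np.nan, np.nan, np.nan],
})

# bfill copies the next valid value upwards within each column;
# fillna(0) then handles the trailing NaNs that bfill cannot reach
filled = toy.bfill(axis=0).fillna(0)

print(filled)
```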

Filling in missing values is also known as "imputation", and you can find more exercises on it [in this lesson, also linked under the "More practice!" section](https://www.kaggle.com/dansbecker/handling-missing-values). First, however, why don't you try replacing some of the missing values in the `sf_permits` dataset?

In [284]:
# Your turn! Try replacing all the NaN's in the sf_permits data with the one that
# comes directly after it and then fill any remaining NaN's with 0
# (bfill() replaces fillna(method="bfill"), which is deprecated in recent pandas)
sf_permits_autofillna = sf_permits.bfill(axis=0).fillna(0)


In [286]:
sf_permits_autofillna.isnull().sum().sum()

**Note:**
There are 0 missing values now.
By backfilling and then filling what is left with 0, the missing data issue seems solved.
* `fillna` supports both `bfill` and `ffill` methods. We just take `bfill` as an example.

However,
* `bfill` and `ffill` only make sense when the rows have a meaningful order, such as time series data. We need to fill the values in a more specific way.
* That will be tomorrow's work.

Next actions:
1.  Fill in the missing data in an appropriate way.
  * Fill in data using the sklearn imputer.
      Benefit: the sklearn imputer can be a component of a pipeline. With a pipeline, the machine learning workflow is
          * more abstract, focused on the process instead of the coding
          * easier to tune: one pipeline holds all the steps and their hyperparameters
          * safer: it avoids forgetting to preprocess the test data, which would lead to wrong results.
  
  * Fill in data using an external dataset as a reference.
      Benefit:
        Filling in missing values and finding outliers both contribute key value to feature engineering.
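
To give an idea of the pipeline approach, here is a minimal sketch using scikit-learn's `SimpleImputer` inside a `Pipeline`. The data, the median strategy, and the `LinearRegression` model are placeholder assumptions; the point is only that the imputer learns its fill values from the training data and then applies the same values to the test data automatically.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# toy numeric data with gaps, values made up
X_train = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan], [4.0, 5.0]])
y_train = np.array([1.0, 2.0, 3.0, 4.0])
X_test = np.array([[np.nan, 4.0]])

model = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # fill NaN with the column median
    ("regress", LinearRegression()),               # placeholder model
])

model.fit(X_train, y_train)      # medians are learned from the training data only
pred = model.predict(X_test)     # the same medians are reused for the test data

print(model.named_steps["impute"].statistics_)
```

Because the imputation lives inside the pipeline, there is no separate preprocessing step to forget when new data arrives.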

# More practice! (TBD)
___

If you're looking for more practice handling missing values, check out these extra-credit\* exercises:

* [Handling Missing Values](https://www.kaggle.com/dansbecker/handling-missing-values): In this notebook Dan shows you several approaches to imputing missing data using scikit-learn's imputer. 
* Look back at the `Zipcode` column in the `sf_permits` dataset, which has some missing values. How would you go about figuring out what the actual zipcode of each address should be? (You might try using another dataset. You can search for datasets about San Francisco on the [Datasets listing](https://www.kaggle.com/datasets).) 

\* no actual credit is given for completing the challenge, you just learn how to clean data real good :P