This notebook describes how to handle missing values in Python. I followed a `Kaggle` tutorial and tried several methods of handling missing values on the `Kaggle` data set of building permits issued in San Francisco.

In [None]:
# load the libraries
import pandas as pd
import numpy as np

# read in all our data
sf_permits = pd.read_csv("../input/building-permit-applications-data/Building_Permits.csv")

# set seed for reproducibility
np.random.seed(0) 

Once the data set is loaded, I look at a few observations to see what the missing values look like. We can do this with `data.sample(10)`. Cells showing `NaN` or `None` are missing values.


In [None]:
# Let's look at 10 observations to see the missing values
sf_permits.sample(10)

Now let's look at how many data points are missing in the data set.

# How many missing data points in each column?
___

Let's count how many cells are missing in each column. `data.isnull()` returns `True` for a missing value and `False` for a non-missing value. To get the number of missing values per column, we sum over each column with `sum()`.

In [None]:
missing_values = sf_permits.isnull()
missing_values_count = missing_values.sum()

#display first 10 columns to see missing counts for each
missing_values_count[0:10]
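
To make the counting step concrete, here is a minimal sketch on a small made-up dataframe. The column names and values are hypothetical and are not taken from the permits file.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the permits file
toy = pd.DataFrame({
    "street": ["Main", None, "Oak", "Pine"],
    "suffix": [None, None, "A", None],
    "zipcode": [94102.0, 94103.0, np.nan, 94110.0],
})

mask = toy.isnull()      # same shape as toy, True where a cell is missing
per_column = mask.sum()  # True counts as 1, so this counts missing cells per column
print(per_column)
```

Both `None` and `NaN` are treated as missing, which is why the text column and the numeric column can be counted the same way.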

That seems like a lot! Now we can compute what percentage of the values in our data set is missing.

In [None]:
total_cells = np.prod(sf_permits.shape)   ## Total number of cells (np.product was removed in NumPy 2.0)
total_missing = missing_values_count.sum()   ## Total number of missing cells
missing_perc = round((total_missing)/(total_cells)*100,2)
missing_perc
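
The overall percentage can hide the fact that a few columns account for most of the gaps. As a small sketch, again on hypothetical toy data, the mean of the boolean mask gives the share of missing cells per column:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the permits file
toy = pd.DataFrame({
    "street": ["Main", None, "Oak", "Pine"],
    "suffix": [None, None, "A", None],
    "zipcode": [94102.0, 94103.0, np.nan, 94110.0],
})

# The mean of a boolean column is the fraction of True values in it
pct_missing_per_column = (toy.isnull().mean() * 100).round(2)

# The same idea over every cell gives the overall percentage
overall_pct = round(float(toy.isnull().to_numpy().mean()) * 100, 2)

print(pct_missing_per_column)
print(overall_pct)
```

Sorting `pct_missing_per_column` with `sort_values(ascending=False)` would put the sparsest columns first.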

We can see that 26% of the cells in this data set are empty. Next, we will look at some of the columns with missing values and try to work out why the data is missing.

# Figure out why the data is missing
____
 
This is the point at which we get into the part of data science that I like to call "data intuition", by which I mean "really looking at your data and trying to figure out why it is the way it is and how that will affect your analysis". It can be a frustrating part of data science, especially if you're newer to the field and don't have a lot of experience. For dealing with missing values, you'll need to use your intuition to figure out why the value is missing. One of the most important questions you can ask yourself to help figure this out is this:

> **Is this value missing because it wasn't recorded or because it doesn't exist?**

If a value is missing because it doesn't exist (like the height of the oldest child of someone who doesn't have any children), then it doesn't make sense to try to guess what it might be. You probably want to keep these values as `NaN`. On the other hand, if a value is missing because it wasn't recorded, then you can try to guess what it might have been based on the other values in that column and row. (This is called "imputation" and we'll learn how to do it next! :)

Let's work through an example by looking at the number of missing values per column in the `sf_permits` dataframe:

In [None]:
# look at the number of missing points in the first 30 columns
missing_values_count[0:30]


It is important to understand the pattern behind a missing value: was it simply not recorded, or does the value not exist at all? If the value does not exist, it makes sense to leave the cell as `NaN`. If it is missing because it was not recorded, you need to decide how to handle it rather than ignoring it or dropping it during the analysis. There are also situations where you have to remove missing values without imputing replacements.

From the previous output, it is clear that `Street Number Suffix` and `Zipcode` have many missing values.

## Drop missing values
___

In most situations it is important to understand why values are missing before deciding how to handle them. This is not always practical, because there may be no way to find out why they are missing. In that case, one option is to remove the rows or columns that contain missing cells. This can be done using the `data.dropna()` command.

In [None]:
# Remove every row of "sf_permits" that contains at least one missing value
sf_permits.dropna()

Ooooh! That is not a good way to handle missing values here. It removes all the rows, because every row in the data set contains at least one empty cell. What if we instead remove every column that has at least one missing cell? This can be done using the `data.dropna(axis=1)` command.

In [None]:
columns_with_na_dropped = sf_permits.dropna(axis=1)
columns_with_na_dropped.head()

We lost some of the data, but the result no longer contains any missing values. We can check how many columns we lost in the process.

In [None]:
# See how much data we lost?
print("columns in original dataset: %d \n" % sf_permits.shape[1])
print("columns with NA's dropped: %d" % columns_with_na_dropped.shape[1])
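
The difference between dropping rows and dropping columns is easier to see on a tiny example. The sketch below uses hypothetical toy data and also shows the `thresh` parameter of `dropna`, which is not used in this notebook, as a possible middle ground that keeps columns that are mostly complete.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the permits file
toy = pd.DataFrame({
    "permit": ["P1", "P2", "P3", "P4"],
    "suffix": [None, None, "A", None],
    "zipcode": [94102.0, 94103.0, np.nan, 94110.0],
})

rows_dropped = toy.dropna()        # keeps only rows without any missing cell
cols_dropped = toy.dropna(axis=1)  # keeps only columns without any missing cell

# Keep a column only if it has at least 3 non-missing values
mostly_complete = toy.dropna(axis=1, thresh=3)

print(rows_dropped.shape)
print(list(cols_dropped.columns))
print(list(mostly_complete.columns))
```

As in the permits data, every row of the toy frame has at least one gap, so dropping rows leaves nothing behind.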

12 columns had at least one empty cell, and those columns have been removed.

## Filling in missing values automatically
_____

Another option is to try to fill in the missing values. Let's take a few columns from the data set to explore how to replace missing values. To select the columns from `A` to `B` (both included), you can use `data.loc[:, 'A':'B']`.

In [None]:
# get a small subset of the SF permits dataset
subset_sf_permits = sf_permits.loc[:, 'Permit Number' : 'Zipcode'].head()
subset_sf_permits.head()

Now let's replace all the missing values in the subset with a specific value, `0`. This can be done using the `fillna(0)` method of a `pandas` dataframe.

In [None]:
# replace all NA's in the subset with 0
subset_sf_permits.fillna(0)

It often makes more sense to replace a missing value with a value from the same column than with an arbitrary constant. One option is to replace each missing value with the value that comes directly after it in the same column, and then fill any cells that are still empty with `0`. This is called a backward fill and can be done with the `bfill` method.


In [None]:
# replace each NA with the value that comes directly after it in the same column,
# then replace any remaining empty cells with 0
# (fillna(method='bfill') is deprecated in recent pandas, so bfill() is used instead)
subset_sf_permits.bfill(axis=0).fillna(0)
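
The two-step fill is easier to follow on a few rows. This is a minimal sketch on hypothetical toy data; it shows why the second `fillna(0)` is still needed after the backward fill.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the permits file
toy = pd.DataFrame({
    "unit": [np.nan, 2.0, np.nan, np.nan],
    "zipcode": [94102.0, np.nan, 94110.0, np.nan],
})

back_filled = toy.bfill()       # each NaN takes the next valid value below it
filled = back_filled.fillna(0)  # trailing NaNs have nothing below them, so use 0

print(back_filled)
print(filled)
```

A backward fill only makes sense when neighbouring rows are related in some way, for example when the data is sorted, so treat it as one option rather than a default.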

There are different ways to handle missing values; a detailed overview is given in the [handling missing values section](https://www.kaggle.com/dansbecker/handling-missing-values).

Next, let's explore how to handle missing values using imputation. Although imputation is not an ideal replacement for real values, it usually gives a more accurate model than dropping the entire column from the analysis. In some cases, missing values can be filled in from sources that are already available, for example `city` or `zip code`.

Imputation can be done using the `SimpleImputer` class in `sklearn.impute` (older versions of scikit-learn provided it as `Imputer` in `sklearn.preprocessing`). Before moving on to imputation, we first look at the type of each column to see which variables are continuous and which are categorical. This can be done using the `dtypes` attribute in `pandas`.


In [None]:
sf_permits.dtypes.sample(10)

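# Kept commented out: the default strategy (mean) only works on numeric columns,
# so fitting on the full dataframe with its text columns would raise an error.
#from sklearn.impute import SimpleImputer
#sf_permits_imputer = SimpleImputer()
#sf_permits_with_imputed_values = sf_permits_imputer.fit_transform(sf_permits)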
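
As a rough sketch of how mean imputation could be applied, the example below restricts the imputer to the numeric columns of a hypothetical toy dataframe. The column names are made up, and `SimpleImputer` from `sklearn.impute` is assumed to be available (scikit-learn 0.20 or later).

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data, not taken from the permits file
toy = pd.DataFrame({
    "estimated_cost": [1000.0, np.nan, 3000.0],
    "stories": [2.0, 4.0, np.nan],
    "street": ["Main", None, "Oak"],
})

# Mean imputation is only defined for numbers, so select the numeric columns first
numeric = toy.select_dtypes(include="number")

imputer = SimpleImputer(strategy="mean")
imputed = pd.DataFrame(
    imputer.fit_transform(numeric),  # returns a NumPy array
    columns=numeric.columns,
    index=numeric.index,
)
print(imputed)
```

Text columns such as `street` are left out here; they would need a different strategy (for example `strategy="most_frequent"`) or encoding first.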

The `object` dtype indicates a column of text, which cannot be used directly in most models. Text variables should be converted to dummy (one-hot) variables using the `get_dummies` function in `pandas`.

In [None]:
encoded_sf_permits = pd.get_dummies(sf_permits)

encoded_sf_permits.iloc[:, 10:].head(5)
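
To show what dummy coding produces, here is a small sketch on hypothetical toy data. The `dummy_na=True` option is an addition not used above; it gives missing values their own indicator column instead of silently encoding them as all zeros.

```python
import pandas as pd

# Hypothetical toy data, not taken from the permits file
toy = pd.DataFrame({
    "permit_type": ["alteration", "new", None, "alteration"],
    "stories": [2, 4, 1, 3],
})

# One indicator column per category, plus one extra column for missing values
encoded = pd.get_dummies(toy, columns=["permit_type"], dummy_na=True)
print(encoded)
```

On a real data set it is usually worth limiting `columns` to variables with few distinct values, because each distinct value becomes its own column.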