### All days of the challenge:

* [Day 1: Handling missing values](https://www.kaggle.com/rtatman/data-cleaning-challenge-handling-missing-values)

* A data cleaning challenge by Rachael Tatman.



Here's what I'm going to focus on today:

* [Take a first look at the data](#Take-a-first-look-at-the-data)
* [See how many missing data points we have](#See-how-many-missing-data-points-we-have)
* [Figure out why the data is missing](#Figure-out-why-the-data-is-missing)
* [Drop missing values](#Drop-missing-values)
* [Filling in missing values automatically](#Filling-in-missing-values-automatically)

Let's get started!

# Take a first look at the data
________

The first thing I'll need to do is load in the libraries and datasets. For today, I'll be using a dataset of building permits issued in San Francisco.


In [None]:
# modules we'll use
import pandas as pd
import numpy as np

# read in all our data

nfl_data = pd.read_csv("../input/nflplaybyplay2009to2016/NFL Play by Play 2009-2017 (v4).csv")
sf_permits = pd.read_csv("../input/building-permit-applications-data/Building_Permits.csv")

# set seed for reproducibility
np.random.seed(0) 

The first thing I do when I get a new dataset is take a look at some of it. This lets me check that it all read in correctly and get an idea of what's going on with the data. In this case, I'm looking for any missing values, which will be represented with `NaN` or `None`.

Let's play with the `sf_permits` dataset.

In [None]:
sf_permits.sample(5)

In [None]:
# count the total number of cells in the dataframe
# (np.prod replaces np.product, which was removed in NumPy 2.0)
total_cells = np.prod(sf_permits.shape)

# See how many missing data points we have
___

Ok, now we know that we do have some missing values. Let's see how many we have in each column. 

In [None]:
# get the number of missing data points per column
missing_values_count = sf_permits.isnull().sum()

# look at the # of missing points in the first ten columns
missing_values_count[0:10]

That seems like a lot! It might be helpful to see what percentage of the values in our dataset were missing to give us a better sense of the scale of this problem:

In [None]:
# what percent of the sf_permits dataset is missing?
# (np.prod replaces np.product, which was removed in NumPy 2.0)
total_cells = np.prod(sf_permits.shape)
total_missing = missing_values_count.sum()

# percentage of data that is missing
(total_missing / total_cells) * 100
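
A single overall percentage can hide the fact that a few columns account for most of the gaps. As a minimal sketch, assuming a made-up `toy_permits` frame as a stand-in for `sf_permits` (the values are invented for illustration), the same idea can be applied per column:

```python
import numpy as np
import pandas as pd

# toy stand-in for sf_permits; the values are made up for illustration
toy_permits = pd.DataFrame({
    "Permit Number": ["A1", "A2", "A3", "A4"],
    "Street Number Suffix": [np.nan, "A", np.nan, np.nan],
    "Zipcode": [94110.0, np.nan, 94103.0, 94107.0],
})

# isnull() gives True/False per cell; the mean of booleans is the share of True
percent_missing_by_column = toy_permits.isnull().mean() * 100

# sort so the columns with the most missing data come first
percent_missing_by_column = percent_missing_by_column.sort_values(ascending=False)
print(percent_missing_by_column)

# overall percentage, for comparison with the per-column numbers
overall_percent_missing = toy_permits.isnull().sum().sum() / toy_permits.size * 100
print(overall_percent_missing)
```

Swapping `toy_permits` for `sf_permits` should give the per-column picture for the real data.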

# Figure out why the data is missing
____
 

Let's work through an example. Looking again at the number of missing values per column in the `sf_permits` dataframe:

In [None]:
# look at the # of missing points in the first ten columns
missing_values_count[0:10]


It looks like the column `Street Number Suffix` in `sf_permits` has a lot of missing values. There are two possible reasons for this:
1. The values are missing because they don't exist (for example, an address that simply has no suffix).
2. The values exist but weren't recorded.
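
One way to judge which of the two reasons applies is to look at the values that were recorded. The sketch below uses a made-up `toy_permits` frame (a hypothetical stand-in for `sf_permits`, with invented values): if the recorded suffixes are just occasional letters, most addresses probably have no suffix at all, whereas every address should have a zip code, so a missing `Zipcode` more likely wasn't recorded. This is a heuristic and an assumption about the data, not a test.

```python
import numpy as np
import pandas as pd

# toy stand-in for sf_permits; the values are made up for illustration
toy_permits = pd.DataFrame({
    "Street Number": [140, 440, 1647, 1230, 950],
    "Street Number Suffix": [np.nan, "A", np.nan, np.nan, "B"],
    "Zipcode": [94102.0, np.nan, 94109.0, 94109.0, 94133.0],
})

# what do the recorded suffixes look like? (value_counts skips NaN by default)
suffix_values = toy_permits["Street Number Suffix"].value_counts()
print(suffix_values)

# how many rows are missing each of the two columns?
rows_missing_suffix = toy_permits["Street Number Suffix"].isnull().sum()
rows_missing_zipcode = toy_permits["Zipcode"].isnull().sum()
print(rows_missing_suffix, rows_missing_zipcode)
```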



# Drop missing values
___




In [None]:
# remove all the rows that contain a missing value
sf_permits.dropna()

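Oh dear, it looks like that's removed all our data! That's because every row in the dataset has at least one missing value. We might have better luck removing the columns that have at least one missing value instead.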

In [None]:
# remove all columns with at least one missing value
columns_with_na_dropped = sf_permits.dropna(axis=1)
columns_with_na_dropped.head()

In [None]:
# just how much data did we lose?
print("Columns in original dataset: %d \n" % sf_permits.shape[1])
print("Columns with na's dropped: %d" % columns_with_na_dropped.shape[1])

We've lost quite a bit of data, but at this point we have successfully removed all the `NaN`'s from our data. 
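
Dropping every row or column that has any gap is the bluntest option. A gentler variant, sketched here on a made-up `toy_permits` frame (a hypothetical stand-in for `sf_permits`), uses the `thresh` and `subset` parameters of `dropna()` to keep columns that are mostly complete, or to drop rows only when a column we care about is missing. The threshold of 3 is an arbitrary choice for this toy data.

```python
import numpy as np
import pandas as pd

# toy stand-in for sf_permits; the values are made up for illustration
toy_permits = pd.DataFrame({
    "Permit Number": ["A1", "A2", "A3", "A4"],
    "Street Number Suffix": [np.nan, np.nan, np.nan, "A"],
    "Zipcode": [94110.0, np.nan, 94103.0, 94107.0],
})

# keep only the columns that have at least 3 non-missing values
mostly_complete_columns = toy_permits.dropna(axis=1, thresh=3)
print(mostly_complete_columns.columns.tolist())

# drop a row only when its Zipcode is missing; gaps elsewhere are tolerated
rows_with_zipcode = toy_permits.dropna(subset=["Zipcode"])
print(len(rows_with_zipcode))
```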

# Filling in missing values automatically
_____

Another option is to try to fill in the missing values. For this next bit, I'm taking a small subsection of the permits data so that it will print well.

In [None]:
# get a small subset of the permits dataset
subset_sf_permits = sf_permits.loc[:, 'Permit Number':'Street Suffix'].head()
subset_sf_permits



We can use pandas' `fillna()` method to fill in missing values in a dataframe for us. One option we have is to specify what we want the `NaN` values to be replaced with. Here, I'm saying that I would like to replace all the `NaN` values with 0.

In [None]:
# replace all NA's with 0
subset_sf_permits.fillna(0)

I could also be a bit more savvy and replace each missing value with whatever value comes directly after it in the same column. (This makes a lot of sense for datasets where the observations have some sort of logical order to them.)

In [None]:
# replace all NA's with the value that comes directly after it in the same column,
# then replace all the remaining NA's with 0
# (bfill() replaces fillna(method='bfill'), which is deprecated in recent pandas)
subset_sf_permits.bfill(axis=0).fillna(0)

**Filling in missing values is also known as "imputation".**
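
Filling everything with 0 treats every column the same way, which is rarely what we want. As a rough sketch, `fillna()` also accepts a dictionary mapping column names to fill values, so each column can get its own rule. The `toy_permits` frame, its `Estimated Cost` values, and the two rules chosen here (empty string for a missing suffix, median for a missing cost) are assumptions made for illustration, not recommendations for the real dataset.

```python
import numpy as np
import pandas as pd

# toy stand-in for sf_permits; the values are made up for illustration
toy_permits = pd.DataFrame({
    "Street Number Suffix": [np.nan, "A", np.nan],
    "Estimated Cost": [4000.0, np.nan, 2000.0],
})

# one fill rule per column
fill_values = {
    # assumption: a missing suffix means the address has no suffix
    "Street Number Suffix": "",
    # assumption: the median of the recorded costs is a reasonable guess
    "Estimated Cost": toy_permits["Estimated Cost"].median(),
}

imputed = toy_permits.fillna(value=fill_values)
print(imputed)
```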