### All days of the challenge:

* [Day 1: Handling missing values](https://www.kaggle.com/rtatman/data-cleaning-challenge-handling-missing-values)
* [Day 2: Scaling and normalization](https://www.kaggle.com/rtatman/data-cleaning-challenge-scale-and-normalize-data)
* [Day 3: Parsing dates](https://www.kaggle.com/rtatman/data-cleaning-challenge-parsing-dates/)
* [Day 4: Character encodings](https://www.kaggle.com/rtatman/data-cleaning-challenge-character-encodings/)
* [Day 5: Inconsistent Data Entry](https://www.kaggle.com/rtatman/data-cleaning-challenge-inconsistent-data-entry/)

Here's what we're going to do today:

* [Take a first look at the data](#Take-a-first-look-at-the-data)
* [See how many missing data points we have](#See-how-many-missing-data-points-we-have)
* [Figure out why the data is missing](#Figure-out-why-the-data-is-missing)
* [Drop missing values](#Drop-missing-values)
* [Filling in missing values](#Filling-in-missing-values)

Let's get started!

In [None]:
# modules we'll use
import pandas as pd
import numpy as np
import math
# read in all our data
sf_permits = pd.read_csv("../input/building-permit-applications-data/Building_Permits.csv")
usa_zipcodes= pd.read_csv("../input/usa-zip-codes-to-locations/US Zip Codes from 2013 Government Data.csv")
# set seed for reproducibility
np.random.seed(0) 

In [None]:
from subprocess import check_output
#print(check_output(["ls", "../input"]).decode("utf8"))
print(check_output(["ls", "../input/"]).decode("utf8"))


In [None]:

# great-circle distance in km between two (lat, long) points given in degrees
def haversine(lat1,long1, lat2,long2):
    radius = 6371 # mean Earth radius in km
    dlat = math.radians(lat2-lat1)
    dlon = math.radians(long2-long1)
    a = math.sin(dlat/2) * math.sin(dlat/2) + math.cos(math.radians(lat1)) \
        * math.cos(math.radians(lat2)) * math.sin(dlon/2) * math.sin(dlon/2)
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1-a))
    d = radius * c
    return d

def findZipcodeForLocation (lat1, long1, zipcodes):
    minDistance = float('inf')  # start at infinity so any real distance is smaller
    zipCode=0
    for i in range(len(zipcodes)):
        currDistance = haversine(lat1,long1, zipcodes.iloc[i,1], zipcodes.iloc[i,2])
        if currDistance < minDistance : 
            minDistance = currDistance
            zipCode = zipcodes.iloc[i,0]
    return int(round(zipCode))
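
The row-by-row search above can be slow on large tables. As a rough sketch only, the same nearest-zip-code lookup can be vectorized with NumPy. The names `haversine_km`, `nearest_zip`, and `toy_zipcodes` are hypothetical, the column names `ZIP`, `LAT`, and `LNG` are assumptions about the zip code table, and the coordinates are approximate values made up for illustration.

```python
import numpy as np
import pandas as pd

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; lat2/lon2 may be NumPy arrays."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * np.arcsin(np.sqrt(a))

def nearest_zip(lat, lon, zipcodes):
    """Return the ZIP whose (LAT, LNG) is closest to the given point."""
    distances = haversine_km(lat, lon,
                             zipcodes["LAT"].to_numpy(),
                             zipcodes["LNG"].to_numpy())
    return int(zipcodes["ZIP"].iloc[int(np.argmin(distances))])

# hypothetical lookup table with approximate, illustrative coordinates
toy_zipcodes = pd.DataFrame({
    "ZIP": [94102, 94110, 94121],
    "LAT": [37.7793, 37.7486, 37.7786],
    "LNG": [-122.4193, -122.4158, -122.4892],
})

nearest = nearest_zip(37.7600, -122.4148, toy_zipcodes)
print(nearest)
```

Computing all distances in one array operation avoids calling `iloc` once per zip code, which is usually the slow part of the loop.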


In [None]:
sf_zipcode = pd.read_csv("../input/sf-zipcodes-limited/SFZ.csv")
sf_zipcode.sample(10)

In [None]:
print ("Number of rows in sf zipcodes: %d \n" % sf_zipcode.shape[0] )

The first thing I do when I get a new dataset is take a look at some of it. This lets me see that it all read in correctly and get an idea of what's going on with the data. In this case, I'm looking to see if there are any missing values, which will be represented with `NaN` or `None`.

Yep, it looks like there are some missing values. What about in the sf_permits dataset?

In [None]:
# your turn! Look at a couple of rows from the sf_permits dataset. Do you notice any missing data?
sf_permits.sample(5)


Wow, almost a quarter of the cells in this dataset are empty! In the next step, we're going to take a closer look at some of the columns with missing values and try to figure out what might be going on with them.

In [None]:
# your turn! Find out what percent of the sf_permits dataset is missing
sf_missing_values_count = sf_permits.isnull().sum()
# look at the # of missing points in the first ten columns
sf_missing_values_count[0:10]
sf_total_cells = np.prod(sf_permits.shape)  # np.product was removed in NumPy 2.0
sf_total_missing = sf_missing_values_count.sum()

# percent of data that is missing
(sf_total_missing/sf_total_cells) * 100

# Figure out why the data is missing

> **Is this value missing because it wasn't recorded or because it doesn't exist?**

If a value is missing because it doesn't exist (like the height of the oldest child of someone who doesn't have any children) then it doesn't make sense to try and guess what it might be. These values you probably do want to keep as NaN. On the other hand, if a value is missing because it wasn't recorded, then you can try to guess what it might have been based on the other values in that column and row. (This is called "imputation" and we'll learn how to do it next! :)
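
To make the distinction concrete, here is a toy sketch. The `people` table and its column names are made up, and the column median is just one simple imputation choice, used here for illustration. Heights that don't exist stay as `NaN`, while a height that simply wasn't recorded is filled in.

```python
import numpy as np
import pandas as pd

# hypothetical data: people without children have no "oldest child" height
people = pd.DataFrame({
    "num_children": [0, 2, 1, 0],
    "oldest_child_height_cm": [np.nan, 140.0, 120.0, np.nan],  # doesn't exist
    "own_height_cm": [170.0, np.nan, 165.0, 180.0],            # not recorded
})

imputed = people.copy()
# fill only the column where the value exists but wasn't recorded
own_median = imputed["own_height_cm"].median()
imputed["own_height_cm"] = imputed["own_height_cm"].fillna(own_median)

print(imputed)
```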

Let's work through an example. Looking at the number of missing values in the nfl_data dataframe, I notice that the column `TimesSec` has a lot of missing values in it: 

We've lost quite a bit of data, but at this point we have successfully removed all the `NaN`'s from our data. 
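
As a small illustration of how much dropping can cost, here is a sketch on a made-up table (`toy` is hypothetical, not part of the challenge data). `dropna()` removes every row that contains a `NaN`, and `dropna(axis=1)` removes every column that contains one.

```python
import numpy as np
import pandas as pd

# hypothetical table with a few missing cells
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [4.0, 5.0, 6.0],
    "c": [np.nan, np.nan, 9.0],
})

rows_kept = toy.dropna()        # keep only rows with no missing values
cols_kept = toy.dropna(axis=1)  # keep only columns with no missing values

print(rows_kept.shape, cols_kept.shape)
```

Comparing the shapes before and after is a quick way to see how much data each option throws away.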

# Filling in missing values automatically
_____

Another option is to try and fill in the missing values. For this next bit, I'm getting a small sub-section of the NFL data so that it will print well.

I could also be a bit more savvy and replace missing values with whatever value comes directly after it in the same column. (This makes a lot of sense for datasets where the observations have some sort of logical order to them.)
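
A minimal sketch of that idea on a made-up table (`toy` is hypothetical): `bfill()` copies the next valid value in each column backwards into the gap, and a final `fillna(0)` covers any trailing gaps that have nothing after them.

```python
import numpy as np
import pandas as pd

# hypothetical ordered observations with gaps
toy = pd.DataFrame({
    "x": [1.0, np.nan, 3.0, np.nan],
    "y": [np.nan, 2.0, np.nan, 4.0],
})

# backfill from the next row in the same column, then fill leftovers with 0
filled = toy.bfill().fillna(0)

print(filled)
```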

Filling in missing values is also known as "imputation", and you can find more exercises on it [in this lesson, also linked under the "More practice!" section](https://www.kaggle.com/dansbecker/handling-missing-values). First, however, why don't you try replacing some of the missing values in the `sf_permits` dataset?

In [None]:
sf_permits.loc[:, 'Neighborhoods - Analysis Boundaries':'Zipcode'].sample(20)

In [None]:
sf_permits.rename(columns={'Neighborhoods - Analysis Boundaries': 'Neighborhood'}, inplace=True)
sf_nhoods = sf_permits['Neighborhood'].unique()
#sf_nhoods.sort()
sf_nhoods

In [None]:

sf_zipcode.drop(["Unnamed: 2","Unnamed: 3"], axis=1, inplace=True)
sf_zipcode.head()

In [None]:
sfz= sf_zipcode['Neighborhood'].unique()
sfz.sort()
sfz

In [None]:
sf_zipcode.Neighborhood = sf_zipcode.Neighborhood.replace("\xa0", "", regex=True)
sfzz = sf_zipcode.Neighborhood.unique()
sfzz.sort()
sfzz

In [None]:
sf_permits.Neighborhood.isna().sum()
sf_permits.Zipcode.isna().sum()
sf_permits_nn=sf_permits.query('Neighborhood.isnull() and Zipcode.isnull()', engine='python')
sf_permits_nn.shape[0]

In [None]:
sf_permits[['LAT','LNG']] = sf_permits.Location.str.split(',', expand = True)
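# strip the literal parentheses (regex=False) and convert to numbers for the distance math
sf_permits.LAT = sf_permits.LAT.str.replace('(', '', regex=False).astype(float)
sf_permits.LNG = sf_permits.LNG.str.replace(')', '', regex=False).astype(float)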
sf_permits.LNG.sample(10)
sf_permits.LAT.sample(10)

In [None]:
sf_permits.Zipcode.unique()

In [None]:
sfz_unique= pd.DataFrame(sf_zipcode.Zipcode.unique())
sfz_unique.columns =['Zipcode']
sfz_unique.shape[0]
#sfz_unique.sample(10)

In [None]:
sfzz_unique =pd.DataFrame(sf_permits.Zipcode.unique())
sfzz_unique.columns =['Zipcode']
sfzz_unique.shape[0]

In [None]:
# keep only California zip codes; ca_zipcodes is used in the cells below
ca_zipcodes = usa_zipcodes[(usa_zipcodes['ZIP'] >= 90001) & (usa_zipcodes['ZIP'] <= 96162)]
caf_zipcodes = pd.merge(usa_zipcodes,sfz_unique, left_on =['ZIP'], right_on=['Zipcode'],how='inner')
caf_zipcodes.shape[0]

In [None]:
ca_zipcodes.head(5)

In [None]:
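# Find the missing zip codes from the Location column (now split into LAT and LNG) using
# the USA zip code dataset filtered to California zip codes
sf_permits.sample(5)
# only look up rows that lack a zip code but do have coordinates, so existing zip codes are kept
needs_zip = sf_permits['Zipcode'].isna() & sf_permits['LAT'].notna() & sf_permits['LNG'].notna()
sf_permits.loc[needs_zip, 'Zipcode'] = sf_permits[needs_zip].apply(lambda row: findZipcodeForLocation(float(row['LAT']), float(row['LNG']), ca_zipcodes), axis=1)
sf_permits.Zipcode.isna().sum()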

In [None]:

# backfill each column from the next row, then fill any remaining gaps with 0
sff_permits = sf_permits.bfill(axis=0).fillna(0)

In [None]:
n_m_values_count = sff_permits.isnull().sum()
n_cols = sff_permits.shape[1]
n_m_values_count[0:n_cols]

In [None]:
zsf_missing_values_count = sff_permits.isnull().sum()
# total number of cells in the filled dataset
zsf_total_cells = np.prod(sff_permits.shape)
zsf_total_missing = zsf_missing_values_count.sum()
# percent of data that is missing
(zsf_total_missing/zsf_total_cells) * 100

And that's it for today! If you have any questions, be sure to post them in the comments below or [on the forums](https://www.kaggle.com/questions-and-answers). 

Remember that your notebook is private by default, and in order to share it with other people or ask for help with it, you'll need to make it public. First, you'll need to save a version of your notebook that shows your current work by hitting the "Commit & Run" button. (Your work is saved automatically, but versioning your work lets you go back and look at what it was like at the point you saved it. It also lets you share a nice compiled notebook instead of just the raw code.) Then, once your notebook is finished running, you can go to the Settings tab in the panel to the left (you may have to expand it by hitting the [<] button next to the "Commit & Run" button) and set the "Visibility" dropdown to "Public".

# More practice!
___

If you're looking for more practice handling missing values, check out these extra-credit\* exercises:

* [Handling Missing Values](https://www.kaggle.com/dansbecker/handling-missing-values): In this notebook Dan shows you several approaches to imputing missing data using scikit-learn's imputer. 
* Look back at the `Zipcode` column in the `sf_permits` dataset, which has some missing values. How would you go about figuring out what the actual zipcode of each address should be? (You might try using another dataset. You can search for datasets about San Francisco on the [Datasets listing](https://www.kaggle.com/datasets).) 

\* no actual credit is given for completing the challenge, you just learn how to clean data real good :P