## Take a first look at the data
Load the libraries and import the datasets. We work with two datasets: a dataset of events that occurred in American football games, used for demonstration, and a dataset of building permits issued in San Francisco.

In [1]:
# modules we'll use
import pandas as pd
import numpy as np

# read in all our data
nfl_data = pd.read_csv("../input/nflplaybyplay2009to2016/NFL Play by Play 2009-2017 (v4).csv")
sf_permits = pd.read_csv("../input/building-permit-applications-data/Building_Permits.csv")

# set seed for reproducibility
np.random.seed(0) 


* Look inside the datasets to check the data and see whether there are any missing values.

In [2]:
# sample 5 rows from the nfl_data dataset; there are already many NaNs
nfl_data.sample(5)

In [3]:
# sample 5 rows from the sf_permits dataset; likewise a handful of NaNs
sf_permits.sample(5)

## Check how many missing data points we have

In [5]:
#get the number of missing data points per column
missing_val_count=nfl_data.isnull().sum()
#look at the number of missing values in the first ten columns
missing_val_count[0:10]

Compute what percentage of the values in the dataset are missing to get a better sense of the scale of this problem.

In [7]:
# how many total missing values do we have?
total_cells = np.prod(nfl_data.shape)  # np.product was removed in NumPy 2.0
total_missing=missing_val_count.sum()

#percentage of data that is missing
(total_missing/total_cells)*100


Almost 25% of the cells in this dataset are empty. Next, check what's going on in the sf_permits dataset.

In [11]:
missing_sf_count=sf_permits.isnull().sum()
missing_sf_count[0:10]
# some columns here have many missing values too

In [12]:
# find the percentage of missing values by dividing the missing cells by the total cells
total_cells_sf = np.prod(sf_permits.shape)  # np.product was removed in NumPy 2.0
total_missing_sf = missing_sf_count.sum()

#percentage of data that is missing
(total_missing_sf/total_cells_sf)*100

# here over 25% of the values are missing
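
Since the same calculation was just done twice, it can be wrapped in a small helper. The sketch below is one possible way to do that, not part of the original notebook: the function name `missing_summary` and the toy DataFrame are hypothetical, and the values are made up so the example runs without the Kaggle files.

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Return the overall and per-column percentage of missing cells."""
    # share of NaNs in each column, largest first
    per_column = (df.isnull().mean() * 100).sort_values(ascending=False)
    # share of NaNs over every cell in the frame
    overall = df.isnull().sum().sum() / df.size * 100
    return overall, per_column


# toy frame standing in for nfl_data / sf_permits (values are made up)
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": ["x", "y", None, "z"],
    "c": [1, 2, 3, 4],
})

overall_pct, per_column_pct = missing_summary(toy)
print(round(overall_pct, 2))
print(per_column_pct)
```

The per-column view is often more useful than the single overall number, because it shows which columns are responsible for most of the gaps.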

## Find out why the data is missing

> Are the values missing because they weren't recorded or because they don't exist?

If it doesn't exist, it should be kept as NaN. If it wasn't recorded, it's better to replace it with some guess (imputation).

**nfl_data**
Let's work through an example. Looking at the number of missing values in the nfl_data DataFrame, I notice that the column TimeSecs has missing values in it.
By looking at the documentation, I can see that this column has information on the number of seconds left in the game when the play was made. This means that these values are probably missing because they were not recorded, rather than because they don't exist. So, it would make sense for us to try and guess what they should be rather than just leaving them as NA's.

On the other hand, there are other fields, like PenalizedTeam, that also have a lot of missing values. In this case, though, the value is missing because there was no penalty, so it doesn't make sense to say which team was penalized. For this column, it would make more sense to either leave it empty or to add a third value like "neither" and use that to replace the NA's.
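
The two cases above call for different treatment, which can be sketched as follows. This is an illustration under stated assumptions rather than the notebook's own method: the column names `seconds_left` and `penalized_team` are stand-ins for the two NFL columns discussed above, the rows are made up, and using the median as the guess is just one simple choice.

```python
import numpy as np
import pandas as pd

# made-up rows that only mimic the two kinds of column discussed above
plays = pd.DataFrame({
    "seconds_left": [3600.0, np.nan, 3540.0, 3500.0],
    "penalized_team": [np.nan, "NE", np.nan, np.nan],
})

cleaned = plays.copy()

# the value does not exist: make that explicit with a placeholder category
cleaned["penalized_team"] = cleaned["penalized_team"].fillna("neither")

# the value was not recorded: impute it (the median is an assumed, simple choice)
cleaned["seconds_left"] = cleaned["seconds_left"].fillna(cleaned["seconds_left"].median())

print(cleaned)
```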

**sf_permits**
Street Number Suffix has many missing values, and its description says it is part of the address. Most likely, NaN here means that the address has no suffix, so the value does not exist. Zipcode is the zip code of the building address, so NaN there means the value wasn't recorded.

## Drop missing values
If you're in a hurry or don't have a reason to figure out why your values are missing, one option is to remove any rows or columns that contain missing values. (Note: I don't generally recommend this approach for important projects! It's usually worth taking the time to go through the columns with missing values one by one to really get to know your dataset.)

If you're sure you want to drop rows with missing values, pandas does have a handy function, dropna() to help you do this. Let's try it out on our NFL dataset!

In [13]:
# remove all the rows that contain a missing value
nfl_data.dropna()


In [14]:
#it dropped all of the data, because every row had an NaN value. Try dropping columns with NaNs instead:
# remove all columns with at least one missing value
columns_with_na_dropped = nfl_data.dropna(axis=1)
columns_with_na_dropped.head()

In [15]:
# just how much data did we lose?
print("Columns in original dataset: %d \n" % nfl_data.shape[1])
print("Columns with na's dropped: %d" % columns_with_na_dropped.shape[1])

In [17]:
# remove all rows that contain a missing value from the sf_permits dataset
sf_permits.dropna()
# again, all the data is dropped

In [19]:
#Try dropping columns with NaNs instead:
# remove all columns with at least one missing value
columns_with_na_dropped_sf = sf_permits.dropna(axis=1)
columns_with_na_dropped_sf.head()

In [21]:
# just how much data did we lose?
print("Columns in original dataset: %d \n" %sf_permits.shape[1])
print("Columns with na's dropped: %d" % columns_with_na_dropped_sf.shape[1])
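
Dropping every row or every column with a single NaN is drastic, as the results above show. A less aggressive variant, sketched here on made-up data, uses the `thresh` and `subset` parameters of `dropna()`. The 75% cutoff and the column names are assumptions chosen for illustration only.

```python
import numpy as np
import pandas as pd

# made-up frame: "mostly_full" has one gap, "mostly_empty" has three
toy = pd.DataFrame({
    "complete": [1, 2, 3, 4],
    "mostly_full": [1.0, np.nan, 3.0, 4.0],
    "mostly_empty": [np.nan, np.nan, np.nan, 4.0],
})

# keep only columns with at least 75% non-missing values (assumed cutoff)
min_non_missing = int(0.75 * len(toy))
kept = toy.dropna(axis=1, thresh=min_non_missing)

# drop rows only when a column we actually need is missing
rows_kept = toy.dropna(subset=["mostly_full"])

print(list(kept.columns))
print(len(rows_kept))
```

With `thresh`, a column survives as long as it has enough real values, so a few gaps no longer cost you the whole column.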

## Filling in missing values automatically

In [22]:
# get a small subset of the NFL dataset
subset_nfl_data = nfl_data.loc[:, 'EPA':'Season'].head()
subset_nfl_data

# replace all NA's with 0
subset_nfl_data.fillna(0)

In [24]:
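# replace all NA's with the value that comes directly after it in the same column,
# then replace all the remaining NA's with 0
# (fillna(method='bfill') is deprecated in recent pandas; bfill() does the same thing,
# and the final fill uses the number 0 rather than the string "0")
subset_nfl_data.bfill(axis=0).fillna(0)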
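
To make the effect of a backward fill easier to see, here is a minimal sketch on a single made-up column. The column name `score` and its values are hypothetical and are not taken from the NFL data.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"score": [np.nan, 7.0, np.nan, np.nan, 3.0, np.nan]})

# backward fill: each NaN takes the next valid value below it in the column
backfilled = toy.bfill()

# the trailing NaN has nothing after it, so it is still missing; fill it with 0
filled = backfilled.fillna(0)

print(filled["score"].tolist())
```

Backward fill only makes sense when the rows have a meaningful order. `ffill()` works the same way in the opposite direction, carrying the previous valid value forward.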