# The Airport dataset
For some simple data cleaning and preparation studies we will use an airport dataset from the [pnwflights14 repository](https://github.com/ismayc/pnwflights14).
It contains information about all flights that departed from the two major airports of the Pacific Northwest (PNW), SEA in Seattle and PDX in Portland, in 2014: 162,049 flights in total.

In [0]:
import pandas as pd

flights = pd.read_csv("https://raw.githubusercontent.com/big-data-analytics-physics/data/master/flights/flights.csv")
print(flights.head())

# Dealing with missing data
Sometimes you have datasets in which some rows are missing values for one or more features.

How do we find these rows? If the data is truly missing (meaning there is nothing in the place where a value should be), pandas has a tool to find it. A quick web search for "pandas dataframe find rows with nan" turns it up.

The answer looks something like this: `df1 = df[df.isnull().any(axis=1)]`
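
To see what that one-liner does, it can help to break it into steps on a tiny dataframe. This is only a sketch: the dataframe `toy`, its column names, and its values are made up for illustration and are not taken from the flights data.

```python
import numpy as np
import pandas as pd

# Toy dataframe (made-up values) with two missing entries
toy = pd.DataFrame({
    "dep_delay": [5.0, np.nan, -2.0, 10.0],
    "arr_delay": [3.0, 7.0, np.nan, 12.0],
    "carrier": ["AS", "UA", "DL", "AS"],
})

# Step 1: isnull() returns a dataframe of True/False values,
# with True wherever a value is missing
null_mask = toy.isnull()

# Step 2: any(axis=1) collapses each row to a single True/False,
# True if at least one column in that row is missing
rows_with_nulls = null_mask.any(axis=1)

# Step 3: use the boolean Series to select only those rows
toy_nulls = toy[rows_with_nulls]
print(toy_nulls)
```

Only the rows with index labels 1 and 2 survive the selection, and they keep their original labels.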

So let's try it with our flights dataframe:

In [0]:
flights_nulls = flights[flights.isnull().any(axis=1)]
print(flights_nulls.head(20))

Notice the first column printed out above - the one with 408, then 409, etc.   This is the pandas dataframe **index**.  If you watched the video I pointed the class to, this should be familiar.

We can print out some of these same rows from the **original** dataframe by using this index.   We can do this using "**loc**" or "**iloc**".   The difference is:
1.  loc gets rows (or columns) with particular labels from the index. 
2.  iloc gets rows (or columns) at particular integer positions, counting from zero, regardless of their labels.

In our original dataframe (flights) the *label* is the same as the *position*.   This is **not** true for the derived dataframe flights_nulls.    
 
 Here is a specific row from the original dataframe:

In [0]:
print(flights.loc[408, :])     ## row with index label 408
print(flights.iloc[408, :])    ## row at position 408 (the same row here)

Here are rows from the derived dataframe:

In [0]:
print(flights_nulls.loc[408, :])    ## this uses the index label
print(flights_nulls.iloc[0, :])     ## this uses the position - the zeroth row
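
The difference between label and position is easier to see on a small example. The sketch below uses a made-up dataframe `toy` (not the flights data) and shows that, after filtering, asking for label 0 fails while asking for position 0 works. The `reset_index(drop=True)` step at the end is an optional extra, not something the cells above rely on.

```python
import numpy as np
import pandas as pd

# Toy dataframe (made-up values); rows 1 and 3 have missing data
toy = pd.DataFrame({"dep_delay": [5.0, np.nan, -2.0, np.nan]})
toy_nulls = toy[toy.isnull().any(axis=1)]   # keeps index labels 1 and 3

# Position 0 of the filtered dataframe is the row labelled 1
first_by_position = toy_nulls.iloc[0]
same_by_label = toy_nulls.loc[1]

# Label 0 does not exist in the filtered dataframe, so loc raises KeyError
try:
    toy_nulls.loc[0]
    label_zero_found = True
except KeyError:
    label_zero_found = False

# Optional: renumber the rows 0, 1, ... so label and position agree again
renumbered = toy_nulls.reset_index(drop=True)

print("label of the row at position 0:", first_by_position.name)
print("label 0 found:", label_zero_found)
print("new index:", list(renumbered.index))
```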

Back to dealing with missing data!

Here are some options:

1.  Simply remove rows with missing data.
2.  Replace the missing data with the mean of that column.
3.  Replace the missing data with the mean of that column computed only over rows that are similar to the row in question. For example, we could choose rows that are geographically similar.

It is important to think carefully about the data when choosing which option to use.  It might not make sense to replace missing values with the means of the respective columns.    On the other hand removing every row which has a missing value might remove too much data.
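
Option 3 can be sketched with a group-wise mean. The example below is a minimal illustration on made-up data, assuming that "similar" rows are those sharing the same value in a grouping column (here a hypothetical `origin` column). Which grouping makes sense depends on the dataset.

```python
import numpy as np
import pandas as pd

# Toy dataframe (made-up values): delays for two origin airports
toy = pd.DataFrame({
    "origin": ["SEA", "SEA", "SEA", "PDX", "PDX", "PDX"],
    "dep_delay": [10.0, np.nan, 20.0, 2.0, 4.0, np.nan],
})

# Mean of dep_delay within each origin group, aligned row by row
# with the original dataframe (missing values are skipped in the mean)
group_means = toy.groupby("origin")["dep_delay"].transform("mean")

# Fill each missing value with the mean of its own group
toy["dep_delay_filled"] = toy["dep_delay"].fillna(group_means)
print(toy)
```

The missing SEA value is filled with the SEA mean and the missing PDX value with the PDX mean, rather than both receiving the overall column mean.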

Here is how you would remove rows with missing data:

In [0]:
flights_nonulls = flights.dropna()
print("Length of flights DF:",len(flights))
print("Length of flights with no nulls DF:",len(flights_nonulls))
print("Length of flights with nulls DF:",len(flights_nulls))




Removing the rows that contain nulls affects only 0.8% of the data, so it is probably the easiest approach.
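
A percentage like this can be computed directly from the dataframe lengths. The sketch below does so on a small made-up dataframe (so the numbers differ from the flights data) and also counts the nulls per column, which shows where the missing values are concentrated.

```python
import numpy as np
import pandas as pd

# Toy dataframe (made-up values) with missing entries in two rows
toy = pd.DataFrame({
    "dep_delay": [5.0, np.nan, -2.0, 10.0, 1.0],
    "arr_delay": [3.0, np.nan, np.nan, 12.0, 0.0],
})

n_total = len(toy)
n_dropped = n_total - len(toy.dropna())   # rows with at least one null
pct_dropped = 100.0 * n_dropped / n_total

# Number of missing values in each column
nulls_per_column = toy.isnull().sum()

print(f"{pct_dropped:.1f}% of rows contain at least one null")
print(nulls_per_column)
```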

Just for completeness, let's also look at how to replace null data with the mean of each column:

**NOTE** This will take some time!

In [0]:
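## numeric_only=True skips text columns, whose mean is undefined
## (recent pandas versions raise an error without it)
flights.fillna(flights.mean(numeric_only=True), inplace=True)
print(flights.loc[408, :])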

We see that the row we looked at previously (index label 408) has its missing values filled in with the means of the corresponding columns. We can check this against the column means (this will take some time as well!):

In [0]:
print("Column means\n", flights.mean(numeric_only=True))
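
The same check can be done on a small scale. The sketch below uses a made-up dataframe with one numeric and one text column (not the flights data) and confirms that the filled value equals the column mean, while the text column, which has no mean, keeps its missing value.

```python
import numpy as np
import pandas as pd

# Toy dataframe (made-up values): one numeric column, one text column
toy = pd.DataFrame({
    "dep_delay": [4.0, np.nan, 8.0],
    "carrier": ["AS", None, "UA"],
})

# Means of the numeric columns only
col_means = toy.mean(numeric_only=True)

# fillna matches the means to columns by name; columns without a mean
# (here "carrier") are left untouched
filled = toy.fillna(col_means)

print(filled)
print("Column means\n", col_means)
```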