<a href="https://colab.research.google.com/github/tbonne/IntroDataScience/blob/main/InClassNotebooks/IntroData3_MissingData.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img src='http://drive.google.com/uc?export=view&id=100E8FHwZxTg2d27HMWeXf05Av4IIqXT9' width="300" align = 'left'> 

# <font color='lightblue'>Missing Data</font>

In this exercise we will learn how to handle missing data. 

Outline:
*  How to find missing data
*  How to remove missing data (if appropriate!)



## <font color='lightblue'>Load Data</font>

<img src='http://drive.google.com/uc?export=view&id=1WC4tXGCEF-1_2LQ74gIxJAZ-GLXCwBdK' width="100" align = 'left'>

First, load the pandas library and read in the NYC flight data.

In [None]:
#Import pandas library
??

#read in the NYC flight data
df_flights = ??

Give a thumbs up in Slack once you're done!

# <font color='lightblue'>Identifying Missing Data</font>

Most data that you'll find will have missing values, or values that cannot be true (i.e., errors). Here we will look at how to find missing values, and how to handle them.

Below we use the method **isna()** to identify cells in the dataframe where the values are **NA**. We then apply the **sum()** method on top of **isna()** to count the number of **NA** values in each column.

> Note: when you combine methods in this way the outputs of the first (going left to right) act as inputs for the second. This is called method chaining, and we will use it with pandas objects!

In [None]:
#count Null values for each column of the dataframe
df_flights.isna().sum()

Unnamed: 0         0
year               0
month              0
day                0
dep_time          28
sched_dep_time     0
dep_delay         28
arr_time          31
sched_arr_time     0
arr_delay         47
carrier            0
flight             0
tailnum            6
origin             0
dest               0
air_time          47
distance           0
hour               0
minute             0
time_hour          0
dtype: int64
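
To make the chain concrete, here is a minimal sketch on a small made-up DataFrame (the `toy` frame and its values are illustrative assumptions, not part of the flight data). It shows that `isna()` first returns a table of True/False values, and that `sum()` then counts the True values in each column.

```python
import numpy as np
import pandas as pd

# Small illustrative DataFrame (made-up values, not the flight data)
toy = pd.DataFrame({
    "air_time": [227.0, np.nan, 160.0, np.nan],
    "carrier": ["UA", "AA", None, "DL"],
})

# Step 1: isna() returns a DataFrame of True/False, one value per cell
na_mask = toy.isna()

# Step 2: sum() adds up each column, counting True as 1 and False as 0
na_counts = na_mask.sum()

# The chained version does both steps in one line
chained = toy.isna().sum()

print(na_mask)
print(na_counts)
```

Splitting the chain into two named steps like this can help when you want to check an intermediate result.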

So we can see that there are not that many missing values in this dataframe. Let's take a look at how many rows of data we have:

In [None]:
len(df_flights)

3614

And at what proportion of data is missing:

In [None]:
#proportion of missing values in each column
df_flights.isna().sum() / len(df_flights)

Unnamed: 0        0.000000
year              0.000000
month             0.000000
day               0.000000
dep_time          0.007748
sched_dep_time    0.000000
dep_delay         0.007748
arr_time          0.008578
sched_arr_time    0.000000
arr_delay         0.013005
carrier           0.000000
flight            0.000000
tailnum           0.001660
origin            0.000000
dest              0.000000
air_time          0.013005
distance          0.000000
hour              0.000000
minute            0.000000
time_hour         0.000000
dtype: float64
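
As a possible shortcut, the mean of a True/False table gives the same proportions, because True counts as 1 and False as 0. The sketch below compares the two approaches on a made-up DataFrame (the `toy` frame is an illustrative assumption, not the flight data).

```python
import numpy as np
import pandas as pd

# Small illustrative DataFrame (made-up values, not the flight data)
toy = pd.DataFrame({
    "dep_delay": [2.0, np.nan, 4.0, -1.0],
    "distance": [1400, 1416, 1089, 1576],
})

# Count of missing values divided by the number of rows
prop_by_division = toy.isna().sum() / len(toy)

# Mean of the True/False table gives the same proportions
prop_by_mean = toy.isna().mean()

print(prop_by_mean)
```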

We can take a look at these missing values within the dataframe by opening the file in Colab (Files - double-click on the file - then filter by NA).
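
The same inspection can also be done in code. The sketch below, on a made-up DataFrame (the `toy` frame is an illustrative assumption), selects only the rows that have at least one missing value so they can be looked at directly.

```python
import numpy as np
import pandas as pd

# Small illustrative DataFrame (made-up values, not the flight data)
toy = pd.DataFrame({
    "flight": [1545, 1714, 1141, 725],
    "dep_time": [517.0, np.nan, 542.0, 544.0],
    "air_time": [227.0, np.nan, 160.0, np.nan],
})

# True for each row that has at least one missing value
row_has_na = toy.isna().any(axis=1)

# Keep only those rows so the missing entries can be inspected
rows_with_na = toy[row_has_na]

print(rows_with_na)
```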

So we can see in this dataset there is very little in the way of missing data! But what should we do with those data entries? We could:

1. Understand how/why they are missing (data story!)
2. Remove those data entries (missing at random?)
3. Fill them in with estimates (impute the missing data)
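
As a rough illustration of the imputation option, the sketch below fills missing values with the column median using `fillna()`. The `toy` frame, its values, and the choice of the median are assumptions made for this example only; whether any particular fill value is justified depends on why the data are missing.

```python
import numpy as np
import pandas as pd

# Small illustrative DataFrame (made-up values, not the flight data)
toy = pd.DataFrame({"air_time": [100.0, np.nan, 200.0, 300.0]})

# Median of the observed values (missing values are skipped by default)
median_air_time = toy["air_time"].median()

# Store the filled values in a new column so the original stays available
toy["air_time_filled"] = toy["air_time"].fillna(median_air_time)

print(toy)
```

Keeping the original column alongside the filled one makes it easy to see which values were estimated.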

# <font color='lightblue'>Drop Missing Data</font>

Below we will remove the rows with missing data in one column. So if there is no data in this column then it will remove that entire row from the data.

In [None]:
#keep only the rows where air time is not missing
df_flights_airtime_na = df_flights[df_flights.air_time.notna()]

#take a look at the new length of the dataframe
len(df_flights_airtime_na)


3567
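
An alternative to the boolean mask is `dropna()` with the `subset` argument, which only looks at the listed columns when deciding which rows to drop. The sketch below compares the two on a made-up DataFrame (the `toy` frame and its column values are illustrative assumptions).

```python
import numpy as np
import pandas as pd

# Small illustrative DataFrame (made-up values, not the flight data)
toy = pd.DataFrame({
    "dep_delay": [2.0, np.nan, 4.0, -1.0],
    "dest": ["IAH", "MIA", None, "ATL"],
})

# Boolean-mask approach: keep rows where dep_delay is not missing
kept_by_mask = toy[toy["dep_delay"].notna()]

# dropna approach: only dep_delay is checked when dropping rows
kept_by_dropna = toy.dropna(subset=["dep_delay"])

print(kept_by_dropna)
```

Note that the row with a missing `dest` is kept in both cases, because only `dep_delay` was checked.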

It is also possible to remove any row that has missing values.

In [None]:
#drop rows if any column contains missing values
df_flights_sub_na = df_flights.dropna(how='any')

len(df_flights_sub_na)

3567
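
The `how` argument controls how strict the drop is. The sketch below contrasts `how='any'` with `how='all'` on a made-up DataFrame (the `toy` frame is an illustrative assumption): `'any'` drops a row if at least one value is missing, while `'all'` drops a row only if every value is missing.

```python
import numpy as np
import pandas as pd

# Small illustrative DataFrame (made-up values, not the flight data)
toy = pd.DataFrame({
    "dep_time": [517.0, np.nan, np.nan],
    "arr_time": [830.0, 850.0, np.nan],
})

# Drop rows with at least one missing value
drop_any = toy.dropna(how="any")

# Drop rows only when every column is missing
drop_all = toy.dropna(how="all")

print(len(drop_any), len(drop_all))
```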

<img src='http://drive.google.com/uc?export=view&id=1WC4tXGCEF-1_2LQ74gIxJAZ-GLXCwBdK' width="100" align = 'left'>

Try removing all the rows with NAs in arr_time.

In [None]:
#drop rows if arrival time contains missing values
df_flights_arr_time_na = ?

#check the length

# <font color='lightblue'>Add column of missingness</font>

Here the idea is that missing data might be useful for making predictions. Let's add another column to the DataFrame to identify missing data in *air time*.

> Here we use method chaining to select the column 'air_time' and then check whether each row has an NA value. This gives us a True or False value for each row.

In [None]:
#create new column with true/false if there is missing data in air time
df_flights['missingAirTime'] = df_flights.air_time.isna()

#take a look 
df_flights

Unnamed: 0.1,Unnamed: 0,year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour,missingAirTime
0,1,2013,1,1,517.0,515,2.0,830.0,819,11.0,UA,1545,N14228,EWR,IAH,227.0,1400,5,15,2013-01-01 05:00:00,False
1,2,2013,1,1,533.0,529,4.0,850.0,830,20.0,UA,1714,N24211,LGA,IAH,227.0,1416,5,29,2013-01-01 05:00:00,False
2,3,2013,1,1,542.0,540,2.0,923.0,850,33.0,AA,1141,N619AA,JFK,MIA,160.0,1089,5,40,2013-01-01 05:00:00,False
3,4,2013,1,1,544.0,545,-1.0,1004.0,1022,-18.0,B6,725,N804JB,JFK,BQN,183.0,1576,5,45,2013-01-01 05:00:00,False
4,5,2013,1,1,554.0,600,-6.0,812.0,837,-25.0,DL,461,N668DN,LGA,ATL,116.0,762,6,0,2013-01-01 06:00:00,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3609,3610,2013,1,4,,1830,,,2044,,9E,3716,,EWR,DTW,,488,18,30,2013-01-04 18:00:00,True
3610,3611,2013,1,4,,920,,,1245,,AA,721,N541AA,LGA,DFW,,1389,9,20,2013-01-04 09:00:00,True
3611,3612,2013,1,4,,1245,,,1550,,AA,745,N3BGAA,LGA,DFW,,1389,12,45,2013-01-04 12:00:00,True
3612,3613,2013,1,4,,1430,,,1735,,AA,883,N200AA,EWR,DFW,,1372,14,30,2013-01-04 14:00:00,True


Let's add one more step to that method chain and convert the True/False values into integers. To do so we'll add the method **astype('int')**, which converts True/False into 1/0.

In [None]:
#create new column with 1/0 if there is missing data in air time
df_flights['missingAirTime'] = df_flights.air_time.isna().astype('int')

#take a look 
df_flights

Unnamed: 0.1,Unnamed: 0,year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour,missingAirTime
0,1,2013,1,1,517.0,515,2.0,830.0,819,11.0,UA,1545,N14228,EWR,IAH,227.0,1400,5,15,2013-01-01 05:00:00,0
1,2,2013,1,1,533.0,529,4.0,850.0,830,20.0,UA,1714,N24211,LGA,IAH,227.0,1416,5,29,2013-01-01 05:00:00,0
2,3,2013,1,1,542.0,540,2.0,923.0,850,33.0,AA,1141,N619AA,JFK,MIA,160.0,1089,5,40,2013-01-01 05:00:00,0
3,4,2013,1,1,544.0,545,-1.0,1004.0,1022,-18.0,B6,725,N804JB,JFK,BQN,183.0,1576,5,45,2013-01-01 05:00:00,0
4,5,2013,1,1,554.0,600,-6.0,812.0,837,-25.0,DL,461,N668DN,LGA,ATL,116.0,762,6,0,2013-01-01 06:00:00,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3609,3610,2013,1,4,,1830,,,2044,,9E,3716,,EWR,DTW,,488,18,30,2013-01-04 18:00:00,1
3610,3611,2013,1,4,,920,,,1245,,AA,721,N541AA,LGA,DFW,,1389,9,20,2013-01-04 09:00:00,1
3611,3612,2013,1,4,,1245,,,1550,,AA,745,N3BGAA,LGA,DFW,,1389,12,45,2013-01-04 12:00:00,1
3612,3613,2013,1,4,,1430,,,1735,,AA,883,N200AA,EWR,DFW,,1372,14,30,2013-01-04 14:00:00,1
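
One way a 1/0 missingness column can be put to use is to check whether missing values cluster in particular groups. The sketch below, on a made-up DataFrame (the `toy` frame and its values are illustrative assumptions), computes the share of missing air times per carrier; the mean of a 1/0 column is the proportion of rows flagged as missing.

```python
import numpy as np
import pandas as pd

# Small illustrative DataFrame (made-up values, not the flight data)
toy = pd.DataFrame({
    "carrier": ["UA", "UA", "AA", "AA", "AA"],
    "air_time": [227.0, 160.0, np.nan, 183.0, np.nan],
})

# 1 where air_time is missing, 0 otherwise
toy["missingAirTime"] = toy["air_time"].isna().astype("int")

# Share of rows with missing air_time within each carrier
missing_rate_by_carrier = toy.groupby("carrier")["missingAirTime"].mean()

print(missing_rate_by_carrier)
```

If the rates differ a lot between groups, that is a hint the data may not be missing at random.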


<img src='http://drive.google.com/uc?export=view&id=1WC4tXGCEF-1_2LQ74gIxJAZ-GLXCwBdK' width="100" align = 'left'>

Try adding a column of missingness for plane tail number (i.e., tailnum).

In [None]:
#create new column with 1/0 if there is missing data in tailnum
df_flights[?] = ?

#take a look
?


# <font color='lightblue'>Further reading</font>

There are many ways to deal with [missing data with pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html). Just remember that some ways are more justifiable than others, and a good understanding of how the data came to be is key in deciding which ways might work best.