<a href="https://colab.research.google.com/github/christophermalone/DSCI325/blob/main/Module3_Part2_Python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Module 3 | Part 2 | Python: Data Verb - FILTER()

<table width='100%' ><tr><td bgcolor='green'></td></tr></table>

### Example 3.2.P
For this notebook, we will consider airline data from the Bureau of Transportation Statistics.  Using the form provided on their website, one can obtain a variety of information about flight delays.
 

The following 17 fields will be considered here:

*   Day Information: DAY_OF_MONTH, DAY_OF_WEEK
*   Origin Information: ORIGIN, ORIGIN_STATE
*   Destination Information: DEST, DEST_STATE
*   Departure Information: DEP_TIME, DEP_DELAY, DEP_DEL15
*   Arrival Information: ARR_TIME, ARR_DELAY, ARR_DEL15
*   Reason for Delay: CARRIER_DELAY, WEATHER_DELAY, NAS_DELAY, SECURITY_DELAY, LATE_AIRCRAFT_DELAY


<br>Data Source:  https://www.transtats.bts.gov/DL_SelectFields.asp?gnoyr_VQ=FGJ&QO_fu146_anzr=b0-gvzr

<table width='100%' ><tr><td bgcolor='green'></td></tr></table>


The following command imports the <strong>pandas</strong> package. The pandas functions are then available under the local name pd.

In [None]:
import pandas as pd

Next, read the Ontime_Reporting.csv file into Python.

In [None]:
OnTime = pd.read_csv("/content/sample_data/Ontime_Reporting.csv") 

Use the head() method to display the first few rows of this dataframe.

In [None]:
OnTime.head()

Unnamed: 0,DAY_OF_MONTH,DAY_OF_WEEK,ORIGIN,ORIGIN_STATE,DEST,DEST_STATE,DEP_TIME,DEP_DELAY,DEP_DEL15,ARR_TIME,ARR_DELAY,ARR_DEL15,CARRIER_DELAY,WEATHER_DELAY,NAS_DELAY,SECURITY_DELAY,LATE_AIRCRAFT_DELAY
0,1,2,CLT,NC,MCO,FL,1252.0,-7.0,0.0,1421.0,-19.0,0.0,,,,,
1,1,2,MCO,FL,CLT,NC,1525.0,-11.0,0.0,1701.0,-20.0,0.0,,,,,
2,1,2,DFW,TX,MCO,FL,840.0,-5.0,0.0,1200.0,-13.0,0.0,,,,,
3,1,2,MCO,FL,DFW,TX,1328.0,-5.0,0.0,1530.0,-5.0,0.0,,,,,
4,1,2,EWR,NJ,DFW,TX,604.0,-6.0,0.0,835.0,-47.0,0.0,,,,,


Use the <strong>info()</strong> method to inspect the structure of the dataframe.

In [None]:
OnTime.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 371357 entries, 0 to 371356
Data columns (total 17 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   DAY_OF_MONTH         371357 non-null  int64  
 1   DAY_OF_WEEK          371357 non-null  int64  
 2   ORIGIN               371357 non-null  object 
 3   ORIGIN_STATE         371357 non-null  object 
 4   DEST                 371357 non-null  object 
 5   DEST_STATE           371357 non-null  object 
 6   DEP_TIME             367620 non-null  float64
 7   DEP_DELAY            367619 non-null  float64
 8   DEP_DEL15            367619 non-null  float64
 9   ARR_TIME             367418 non-null  float64
 10  ARR_DELAY            366940 non-null  float64
 11  ARR_DEL15            366940 non-null  float64
 12  CARRIER_DELAY        43532 non-null   float64
 13  WEATHER_DELAY        43532 non-null   float64
 14  NAS_DELAY            43532 non-null   float64
 15  SECURITY_DELAY   

Use the <strong>shape</strong> attribute to obtain the number of rows and columns in the dataframe.

In [None]:
OnTime.shape

(371357, 17)

# Using dfply package

The following snippet of code will install the dfply package.

In [None]:
pip install dfply



The dfply code has been downloaded; now import it into the current IPython Notebook using the following code.


In [None]:
from dfply import *

To begin, suppose the goal is to obtain only flights whose ORIGIN airport was Rochester, MN.  The airport code for Rochester, MN is <strong>RST</strong>.

<p align='center'><img src="https://drive.google.com/uc?export=view&id=1cZ49YcqPChfiBZP0Hq7Ahzxn4Sg3ktG-" width='25%' height='25%'></p>

In [None]:
#Piping in dfply and using filter_by() to grab RST rows.
RST = (
          OnTime
          >> filter_by(X.ORIGIN == "RST")
        )
 
RST.shape

(238, 17)
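For comparison, the same kind of row filter can be written in plain pandas with a boolean mask. The following is a minimal sketch on a small made-up dataframe; the `toy` rows are hypothetical and are not taken from Ontime_Reporting.csv.

```python
import pandas as pd

# Hypothetical toy rows, not taken from Ontime_Reporting.csv
toy = pd.DataFrame({
    "ORIGIN": ["RST", "MSP", "RST", "DLH"],
    "DEST":   ["MSP", "RST", "ORD", "MSP"],
})

# Comparing a column to a value yields a boolean Series (the mask)
mask = toy["ORIGIN"] == "RST"

# Indexing the dataframe with the mask keeps only the rows where it is True
rst_rows = toy[mask]
print(rst_rows.shape)
```

The piped filter_by() call above plays the same role as indexing with the mask here.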

Next, let us collect the rows where the ORIGIN airport is RST and the destination airport is MSP, i.e. Minneapolis, MN.

<p align='center'><img src="https://drive.google.com/uc?export=view&id=1c142gdrEPwS_j09pqK1x8kpQjYDYSs47" width='25%' height='25%'></p>

In [None]:
#Piping in dfply and using filter_by() to grab RST to MSP rows.
RST_to_MSP = (
          OnTime
          >> filter_by(X.ORIGIN == "RST", X.DEST == "MSP")
        )
 
RST_to_MSP.shape

(118, 17)
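Listing two conditions in filter_by(), as in the cell above, keeps the rows that satisfy both. A plain pandas version of this AND logic might look like the sketch below, again on hypothetical toy rows rather than the real flight data.

```python
import pandas as pd

# Hypothetical toy rows, not taken from Ontime_Reporting.csv
toy = pd.DataFrame({
    "ORIGIN": ["RST", "RST", "DLH", "MSP"],
    "DEST":   ["MSP", "ORD", "MSP", "RST"],
})

# Each condition is wrapped in parentheses, then combined with & (element-wise AND)
rst_to_msp = toy[(toy["ORIGIN"] == "RST") & (toy["DEST"] == "MSP")]
print(len(rst_to_msp))
```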

Next, let us collect the rows where the ORIGIN airport is RST or the ORIGIN airport is DLH, i.e. Duluth, MN.

<p align='center'><img src="https://drive.google.com/uc?export=view&id=1zQzTj9PJkooEfkkaioffNW1O49Cocy7Y" width='25%' height='25%'></p>

The <strong>OR</strong> condition can be applied using a vertical bar, i.e. |.  Because | is evaluated before == in Python, each condition must be wrapped in its own parentheses inside filter_by().  Consider the following.

<p align='center'>
<strong>filter_by( (statement 1) | (statement 2) )
</strong>
</p>


*   The following code does **not** work: filter_by( X.ORIGIN == "RST" | X.ORIGIN == "DLH" ) 
*   The following code does work: filter_by( (X.ORIGIN == "RST") | (X.ORIGIN == "DLH" ) )





In [None]:
#Piping in dfply and using filter_by() to grab requested rows.
RST_or_DLH = (
          OnTime
          #>> filter_by( X.ORIGIN == "RST" | X.ORIGIN == "DLH" )  #This line does not work: | is evaluated before ==, so each condition needs parentheses
          >> filter_by( (X.ORIGIN == "RST") | (X.ORIGIN == "DLH") )
          
        )
 
RST_or_DLH.shape

(359, 17)
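The need for parentheses around each condition comes from Python's operator precedence, not from dfply itself. One way to see this is to ask Python how it parses the two expressions, using the standard-library ast module. In this sketch the name ORIGIN is only a placeholder inside a string; nothing is evaluated.

```python
import ast

# Parse both spellings without evaluating them
unparenthesized = ast.parse('ORIGIN == "RST" | ORIGIN == "DLH"', mode="eval").body
parenthesized = ast.parse('(ORIGIN == "RST") | (ORIGIN == "DLH")', mode="eval").body

# Without parentheses, the top-level node is a chained comparison
# whose middle operand is the expression "RST" | ORIGIN
print(type(unparenthesized).__name__)                  # Compare
print(type(unparenthesized.comparators[0]).__name__)   # BinOp

# With parentheses, the top-level node is the | operation joining two comparisons
print(type(parenthesized).__name__)                    # BinOp
```

In other words, the unparenthesized form is read as ORIGIN == ("RST" | ORIGIN) == "DLH", which is not the intended OR of two comparisons.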

Next, suppose the requested rows are flights leaving from RST OR DLH that are flying into MSP.

In [None]:
#Piping in dfply and using filter_by() to grab requested rows.
RST_or_DLH_to_MSP = (
          OnTime
          >> filter_by( (X.ORIGIN == "RST") | (X.ORIGIN == "DLH") )
          >> filter_by(X.DEST == "MSP") 
        )
 
RST_or_DLH_to_MSP.shape

(239, 17)

Next, let us collect the rows where the ORIGIN airport is in MN and purposely exclude MSP.  This will be done in several ways below, including a sequence of <strong>OR</strong> statements.  The first step is to list the distinct ORIGIN airports in MN.

<p align='center'><img src="https://drive.google.com/uc?export=view&id=1hMSS_XOgk5izrSa1PNvKSiWkM62c9xMn" width='25%' height='25%'></p>

In [None]:
#Piping in dfply and using filter_by() to grab requested rows.
MN_Airports = (
          OnTime
          >> filter_by(X.ORIGIN_STATE == "MN")
          >> distinct(X.ORIGIN)
          >> select(X.ORIGIN) 
        )
MN_Airports

Unnamed: 0,ORIGIN
14,MSP
4498,RST
5294,INL
5320,HIB
5333,BJI
5345,BRD
5763,DLH
54997,STC


Similar to R, there is an option for checking a column against many values at once, analogous to the <strong>IN</strong> feature.  To invoke this procedure, create a list containing the various regional airports in MN (excluding MSP). The .isin() method can then be used to check the ORIGIN airport against this list.


Other methods are commonly invoked using this approach. For example, the pandas string methods contains() and startswith() are reached through the .str accessor, i.e. X.ORIGIN.str.contains() or X.ORIGIN.str.startswith().

In [None]:
#Piping in dfply and using filter_by() to grab requested rows.
MN_Airport_List = ["BJI","BRD","DLH","HIB","INL","RST","STC"]
All_MN_NoMSP = (
          OnTime
          >> filter_by(X.ORIGIN.isin(MN_Airport_List) )
        )
All_MN_NoMSP.shape

(593, 17)
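The same membership check can be tried in plain pandas on a small example. The sketch below uses hypothetical toy rows and a shortened, made-up list of airports. It also shows a string method: in pandas, methods such as startswith() are reached through the .str accessor.

```python
import pandas as pd

# Hypothetical toy rows and a shortened airport list, for illustration only
toy = pd.DataFrame({"ORIGIN": ["RST", "MSP", "DLH", "ORD", "STC"]})
regional = ["DLH", "RST", "STC"]

# isin() returns True for each row whose ORIGIN appears in the list
in_list = toy[toy["ORIGIN"].isin(regional)]

# String methods are applied element-wise through the .str accessor
starts_with_r = toy[toy["ORIGIN"].str.startswith("R")]

print(list(in_list["ORIGIN"]))
print(list(starts_with_r["ORIGIN"]))
```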

Of course, checking against all airports in MN (excluding MSP) can also be done with a sequence of OR conditions.

In [None]:
#Piping in dfply and using filter_by() to grab requested rows.
All_MN_NoMSP = (
          OnTime
          >> filter_by( (X.ORIGIN == "BJI") | (X.ORIGIN == "BRD") | (X.ORIGIN == "DLH") | (X.ORIGIN == "HIB") | (X.ORIGIN == "INL") | (X.ORIGIN == "RST") |(X.ORIGIN == "STC") )
        
        )
All_MN_NoMSP.shape

(593, 17)

The last iteration obtains the rows for the MN regional airports using the <strong>NOT</strong> condition.  There are two ways to invoke the NOT condition.


*   The ~ character can be used for the NOT condition; wrap the comparison in parentheses, e.g. ~(X.ORIGIN == "MSP")
*   The != syntax can also be used for the NOT condition



In [None]:
All_MN_NoMSP = (
          OnTime
         >> filter_by( X.ORIGIN_STATE == "MN" )
         #>> filter_by( ~(X.ORIGIN == "MSP") )     #Using ~ to invoke NOT; parentheses are needed because ~ is evaluated before ==
         >> filter_by( X.ORIGIN != "MSP" )         #Using != to invoke NOT
        )
All_MN_NoMSP.shape

(593, 17)
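The two spellings of NOT can be checked against each other in plain pandas. This is a minimal sketch on hypothetical toy rows. Note that the ~ form wraps the whole comparison in parentheses, since ~ is evaluated before ==.

```python
import pandas as pd

# Hypothetical toy rows, not taken from Ontime_Reporting.csv
toy = pd.DataFrame({"ORIGIN": ["MSP", "RST", "DLH", "MSP"]})

# ~ negates the boolean mask produced by the parenthesized comparison
not_msp_tilde = toy[~(toy["ORIGIN"] == "MSP")]

# != builds the negated mask directly
not_msp_ne = toy[toy["ORIGIN"] != "MSP"]

# Both approaches select the same rows
print(not_msp_tilde.equals(not_msp_ne))
```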

# Filter Action with NA


The following code is used to <strong>drop all</strong> rows that contain a missing value from a dataframe.

In [None]:
#Get all flights from RST and exclude all flights that have any missing value, i.e. NaN
RST = (
         OnTime
         >> filter_by( X.ORIGIN == "RST" )
      )
#The dropna() method drops every row that contains at least one missing value
RST_NoNaN = RST.dropna()
RST_NoNaN.shape


(20, 17)

In the following variation, only flights with a missing value for CARRIER_DELAY will be excluded.  The notnull() method is applied inside filter_by() here.

In [None]:
RST_NoNaN_CarrierDelay = (
         OnTime
         >> filter_by( X.ORIGIN == "RST" )
         >> filter_by(X.CARRIER_DELAY.notnull())
      )
RST_NoNaN_CarrierDelay.shape
#RST_NoNaN_CarrierDelay.head()


(20, 17)
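The two approaches to missing values need not agree in general: dropna() removes a row if any column is missing, while a notnull() check looks at one column only. The following sketch, on hypothetical toy rows, shows the difference. It also uses the subset argument of pandas dropna() as an assumed alternative way to restrict the check to one column.

```python
import pandas as pd

# Hypothetical toy rows; None becomes NaN in these float columns
toy = pd.DataFrame({
    "DEP_DELAY":     [5.0, None, 30.0, -2.0],
    "CARRIER_DELAY": [None, 7.0, 12.0, None],
})

# Drops every row that has a NaN in any column
no_missing_anywhere = toy.dropna()

# Keeps rows where CARRIER_DELAY is present, ignoring other columns
carrier_present = toy[toy["CARRIER_DELAY"].notnull()]

# Same result using dropna() restricted to one column
carrier_present_alt = toy.dropna(subset=["CARRIER_DELAY"])

print(len(no_missing_anywhere), len(carrier_present))
```

Here the two results differ in size, whereas for the RST flights above both happen to return 20 rows.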