<a href="https://colab.research.google.com/github/christophermalone/DSCI325/blob/main/Module3_Part2_Python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Module 3 | Part 2 | Python: Data Verb - FILTER()

<table width='100%' ><tr><td bgcolor='green'></td></tr></table>

### Example 3.2.P
For this notebook, we will consider airline data from the Bureau of Transportation Statistics.  Using the form provided on their website, one can obtain a variety of information about flight delays.
 

The following 17 fields will be considered here:

*   Day Information: DAY_OF_MONTH, DAY_OF_WEEK
*   Origin Information: ORIGIN, ORIGIN_STATE_ABR
*   Destination Information: DEST, DEST_STATE_ABR
*   Departure Information: DEP_TIME, DEP_DELAY, DEP_DEL15
*   Arrival Information: ARR_TIME, ARR_DELAY, ARR_DEL15
*   Reason for Delay: CARRIER_DELAY, WEATHER_DELAY, NAS_DELAY, SECURITY_DELAY, LATE_AIRCRAFT_DELAY


<br>Data Source:  https://www.transtats.bts.gov/DL_SelectFields.asp?gnoyr_VQ=FGJ&QO_fu146_anzr=b0-gvzr

<table width='100%' ><tr><td bgcolor='green'></td></tr></table>


The following command imports the <strong>pandas</strong> package. The local name for the pandas suite of functions is pd here.

In [40]:
import pandas as pd

Next, read the OnTime_Reporting.csv file into Python.

In [41]:
OnTime = pd.read_csv("/content/sample_data/OnTime_Reporting.csv", index_col=False) 

Use the head() method to display the first few rows of this dataframe.

In [42]:
OnTime.head()

Unnamed: 0,DAY_OF_MONTH,DAY_OF_WEEK,ORIGIN,ORIGIN_STATE_ABR,DEST,DEST_STATE_ABR,DEP_TIME,DEP_DELAY,DEP_DEL15,ARR_TIME,ARR_DELAY,ARR_DEL15,CARRIER_DELAY,WEATHER_DELAY,NAS_DELAY,SECURITY_DELAY,LATE_AIRCRAFT_DELAY
0,1,1,SEA,WA,YKM,WA,1258.0,-2.0,0.0,1346.0,-9.0,0.0,,,,,
1,1,1,BOI,ID,SEA,WA,1536.0,-9.0,0.0,1621.0,-14.0,0.0,,,,,
2,1,1,SEA,WA,YKM,WA,2308.0,-4.0,0.0,5.0,8.0,0.0,,,,,
3,1,1,YKM,WA,SEA,WA,526.0,-4.0,0.0,601.0,-9.0,0.0,,,,,
4,1,1,IDA,ID,SEA,WA,902.0,2.0,0.0,1048.0,28.0,1.0,0.0,0.0,28.0,0.0,0.0


Use the <strong>info()</strong> method to identify the structure of the dataframe.

In [43]:
OnTime.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 547559 entries, 0 to 547558
Data columns (total 17 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   DAY_OF_MONTH         547559 non-null  int64  
 1   DAY_OF_WEEK          547559 non-null  int64  
 2   ORIGIN               547559 non-null  object 
 3   ORIGIN_STATE_ABR     547559 non-null  object 
 4   DEST                 547559 non-null  object 
 5   DEST_STATE_ABR       547559 non-null  object 
 6   DEP_TIME             544272 non-null  float64
 7   DEP_DELAY            544263 non-null  float64
 8   DEP_DEL15            544263 non-null  float64
 9   ARR_TIME             543991 non-null  float64
 10  ARR_DELAY            543401 non-null  float64
 11  ARR_DEL15            543401 non-null  float64
 12  CARRIER_DELAY        81673 non-null   float64
 13  WEATHER_DELAY        81673 non-null   float64
 14  NAS_DELAY            81673 non-null   float64
 15  SECURITY_DELAY   

Use the <strong>shape</strong> attribute to identify the number of rows and columns in the dataframe.

In [44]:
OnTime.shape

(547559, 17)

# Using dfply package

The following snippet of code will install the dfply package.

In [45]:
pip install dfply



The dfply package has been installed; now import it into the current IPython notebook using the following code.


In [46]:
from dfply import *

To begin, suppose the goal is to obtain only flights whose ORIGIN airport was Rochester, MN.  The airport code for Rochester, MN is <strong>RST</strong>.

<p align='center'><img src="https://drive.google.com/uc?export=view&id=1cZ49YcqPChfiBZP0Hq7Ahzxn4Sg3ktG-" width='25%' height='25%'></p>

In [48]:
#Piping in dfply and using filter_by() to grab RST rows.
RST = (
          OnTime
          >> filter_by(X.ORIGIN == "RST")
        )
 
RST.shape

(183, 17)
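For comparison, the same kind of row filter can be written in plain pandas with a boolean mask. The sketch below uses a small made-up dataframe (the name `toy` and its values are hypothetical, not the BTS data), so its row count will not match the output above.

```python
import pandas as pd

# Hypothetical toy data standing in for OnTime (not the real BTS file)
toy = pd.DataFrame({
    "ORIGIN": ["RST", "MSP", "RST", "DLH"],
    "DEST":   ["MSP", "RST", "ORD", "MSP"],
})

# A comparison on a column returns a True/False Series (the mask)
mask = toy["ORIGIN"] == "RST"

# Indexing with the mask keeps only the rows where it is True
toy_rst = toy[mask]
print(toy_rst.shape)
```

The mask is the same kind of object that filter_by() receives when X.ORIGIN == "RST" is evaluated against the piped dataframe.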

Next, let us collect the rows where the ORIGIN airport is RST and the destination airport is MSP, i.e. Minneapolis, MN.

<p align='center'><img src="https://drive.google.com/uc?export=view&id=1c142gdrEPwS_j09pqK1x8kpQjYDYSs47" width='25%' height='25%'></p>

In [49]:
#Piping in dfply and using filter_by() to grab RST to MSP rows.
RST_to_MSP = (
          OnTime
          >> filter_by(X.ORIGIN == "RST", X.DEST == "MSP")
        )
 
RST_to_MSP.shape

(85, 17)
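Conditions separated by commas inside filter_by() appear to be combined with AND, which is consistent with the smaller row count above. A minimal plain-pandas sketch of the same idea uses the & operator; the dataframe `toy` is hypothetical and only illustrates the pattern.

```python
import pandas as pd

# Hypothetical toy data (not the real BTS file)
toy = pd.DataFrame({
    "ORIGIN": ["RST", "MSP", "RST", "DLH"],
    "DEST":   ["MSP", "RST", "ORD", "MSP"],
})

# & is the element-wise AND; each condition sits in its own parentheses
rst_to_msp = toy[(toy["ORIGIN"] == "RST") & (toy["DEST"] == "MSP")]
print(rst_to_msp)
```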

Next, let us collect the rows where the ORIGIN airport is RST or the ORIGIN airport is DLH, i.e. Duluth, MN.

<p align='center'><img src="https://drive.google.com/uc?export=view&id=1zQzTj9PJkooEfkkaioffNW1O49Cocy7Y" width='25%' height='25%'></p>

The <strong>OR</strong> condition can be applied using a vertical bar, i.e. |.  Because | binds more tightly than == in Python, each condition passed to filter_by() must be wrapped in its own parentheses.  Consider the following.

<p align='center'>
<strong>filter_by( (statement 1) | (statement 2) )
</strong>
</p>


*   The following code does **not** work: filter_by( X.ORIGIN == "RST" | X.ORIGIN == "DLH" ) 
*   The following code does work: filter_by( (X.ORIGIN == "RST") | (X.ORIGIN == "DLH" ) )
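The difference between the two forms comes from Python's operator precedence rather than from dfply itself. The sketch below uses the standard-library `ast` module to show how Python groups each expression; the names `a` and `b` are placeholders for the two column references.

```python
import ast

# Parse the expression without parentheses (a and b are placeholder names)
node = ast.parse('a == "RST" | b == "DLH"', mode="eval").body

# Python reads this as one chained comparison: a == ("RST" | b) == "DLH"
print(type(node).__name__)                  # Compare
print(len(node.ops))                        # 2
print(type(node.comparators[0]).__name__)   # BinOp, i.e. "RST" | b

# With parentheses, the top-level operation is the OR itself
fixed = ast.parse('(a == "RST") | (b == "DLH")', mode="eval").body
print(type(fixed).__name__, type(fixed.op).__name__)   # BinOp BitOr
```

In the first form the | is applied to "RST" and the second column before any comparison happens, which is not the intended test.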





In [50]:
#Piping in dfply and using filter_by() to grab requested rows.
RST_or_DLH = (
          OnTime
          #>> filter_by( X.ORIGIN == "RST" | X.ORIGIN == "DLH" )  #This line does not work: | is evaluated before ==, so each condition needs parentheses
          >> filter_by( (X.ORIGIN == "RST") | (X.ORIGIN == "DLH") )
          
        )
 
RST_or_DLH.shape

(329, 17)

Next, suppose the requested rows are flights leaving from RST OR DLH that are flying into MSP.

In [51]:
#Piping in dfply and using filter_by() to grab requested rows.
RST_or_DLH_to_MSP = (
          OnTime
          >> filter_by( (X.ORIGIN == "RST") | (X.ORIGIN == "DLH") )
          >> filter_by(X.DEST == "MSP") 
        )
 
RST_or_DLH_to_MSP.shape

(231, 17)
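Chaining two filters is equivalent to a single filter that ANDs the second condition onto the first. The plain-pandas sketch below checks this on a hypothetical dataframe `toy` (made-up values, not the BTS data); note that the OR group keeps its own outer parentheses.

```python
import pandas as pd

# Hypothetical toy data (not the real BTS file)
toy = pd.DataFrame({
    "ORIGIN": ["RST", "DLH", "RST", "MSP", "DLH"],
    "DEST":   ["MSP", "MSP", "ORD", "RST", "ORD"],
})

# Two steps, mirroring two chained filters
step1 = toy[(toy["ORIGIN"] == "RST") | (toy["ORIGIN"] == "DLH")]
two_step = step1[step1["DEST"] == "MSP"]

# One step: (OR group) & (destination test)
one_step = toy[((toy["ORIGIN"] == "RST") | (toy["ORIGIN"] == "DLH"))
               & (toy["DEST"] == "MSP")]

print(two_step.equals(one_step))
```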

Next, let us collect the rows where the ORIGIN airport is in MN and purposely exclude MSP.  This will be done here with a sequence of <strong>OR</strong> statements.  As a first step, list the distinct ORIGIN airports in MN.

<p align='center'><img src="https://drive.google.com/uc?export=view&id=1hMSS_XOgk5izrSa1PNvKSiWkM62c9xMn" width='25%' height='25%'></p>

In [53]:
#Piping in dfply and using filter_by() to grab requested rows.
MN_Airports = (
          OnTime
          >> filter_by(X.ORIGIN_STATE_ABR == "MN")
          >> distinct(X.ORIGIN)
          >> select(X.ORIGIN) 
        )
MN_Airports

Unnamed: 0,ORIGIN
509,MSP
730,DLH
853,INL
862,BRD
883,HIB
904,BJI
4707,RST
45364,STC


Similar to R, there exists an option for checking across many values using the <strong>IN</strong> feature.  To use it, create a list containing the regional airports in MN (excluding MSP). The .isin() method then checks the ORIGIN airport against this list.


Other methods are commonly used with this approach. For example, the pandas string methods contains() and startswith() are reached through the .str accessor, i.e. X.ORIGIN.str.contains() or X.ORIGIN.str.startswith().

In [54]:
#Piping in dfply and using filter_by() to grab requested rows.
MN_Airport_List = ["BJI","BRD","DLH","HIB","INL","RST","STC"]
All_MN_NoMSP = (
          OnTime
          >> filter_by(X.ORIGIN.isin(MN_Airport_List) )
        )
All_MN_NoMSP.shape

(547, 17)
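The string-matching methods mentioned earlier follow the same masking pattern. In pandas they are reached through the `.str` accessor of a column; the sketch below uses plain pandas on a hypothetical dataframe `toy`, and the same expressions would presumably work inside filter_by() with X.ORIGIN in place of toy["ORIGIN"].

```python
import pandas as pd

# Hypothetical toy data (not the real BTS file)
toy = pd.DataFrame({"ORIGIN": ["RST", "MSP", "STC", "DLH", "BRD"]})

# String methods live under the .str accessor of a pandas Series
starts_with_s = toy[toy["ORIGIN"].str.startswith("S")]
contains_r = toy[toy["ORIGIN"].str.contains("R")]

print(starts_with_s["ORIGIN"].tolist())
print(contains_r["ORIGIN"].tolist())
```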

Of course, checking against all airports in MN (excluding MSP) can also be done with a sequence of OR conditions.

In [55]:
#Piping in dfply and using filter_by() to grab requested rows.
All_MN_NoMSP = (
          OnTime
          >> filter_by( (X.ORIGIN == "BJI") | (X.ORIGIN == "BRD") | (X.ORIGIN == "DLH") | (X.ORIGIN == "HIB") | (X.ORIGIN == "INL") | (X.ORIGIN == "RST") |(X.ORIGIN == "STC") )
        
        )
All_MN_NoMSP.shape

(547, 17)
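The matching row counts suggest that the list-based check and the chain of OR conditions select the same rows. A small plain-pandas sketch of that equivalence follows; the dataframe `toy` and the shortened list `airports` are hypothetical, and the OR chain is built from the list with `functools.reduce` instead of being typed out.

```python
from functools import reduce
import operator

import pandas as pd

# Hypothetical toy data and a shortened airport list
toy = pd.DataFrame({"ORIGIN": ["BJI", "MSP", "RST", "ORD", "STC"]})
airports = ["BJI", "RST", "STC"]

# Membership test against the list
by_isin = toy[toy["ORIGIN"].isin(airports)]

# Same test as an explicit chain: (== "BJI") | (== "RST") | (== "STC")
or_mask = reduce(operator.or_, [toy["ORIGIN"] == code for code in airports])
by_or = toy[or_mask]

print(by_isin.equals(by_or))
```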

The last iteration gets the rows for the MN regional airports using the <strong>NOT</strong> condition.  There are two ways to invoke the NOT condition.


*   The ~ character can be used for the NOT condition
*   The != syntax can also be used for the NOT condition



In [56]:
All_MN_NoMSP = (
          OnTime
         >> filter_by( X.ORIGIN_STATE_ABR == "MN" )
         #>> filter_by( ~ (X.ORIGIN == "MSP") )    #Using ~ to invoke NOT; parentheses are needed because ~ binds tighter than ==
         >> filter_by( X.ORIGIN != "MSP" )         #Using != to invoke NOT
        )
All_MN_NoMSP.shape

(547, 17)
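Both forms of NOT can be compared side by side in plain pandas. In the sketch below, `toy` is a hypothetical dataframe with made-up values; the ~ operator negates an entire mask, so the comparison it applies to is placed in parentheses.

```python
import pandas as pd

# Hypothetical toy data (not the real BTS file)
toy = pd.DataFrame({
    "ORIGIN":           ["MSP", "RST", "DLH", "ORD"],
    "ORIGIN_STATE_ABR": ["MN",  "MN",  "MN",  "IL"],
})

in_mn = toy["ORIGIN_STATE_ABR"] == "MN"

# ~ negates a whole mask, so the comparison goes inside parentheses
with_tilde = toy[in_mn & ~(toy["ORIGIN"] == "MSP")]

# != builds the negated comparison directly
with_ne = toy[in_mn & (toy["ORIGIN"] != "MSP")]

# ~ also pairs with isin() to exclude a list of values
with_isin = toy[in_mn & ~toy["ORIGIN"].isin(["MSP"])]

print(with_tilde.equals(with_ne), with_ne.equals(with_isin))
```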

# Filter Action with NA


The following code is used to <strong>drop all</strong> rows that contain any missing values from a dataframe.

In [59]:
#Get all flights from RST and exclude all flights that have any missing, i.e. NaN
RST = (
         OnTime
         >> filter_by( X.ORIGIN == "RST" )
      )
#The dropna() method drops every row that has at least one missing value (NaN)
RST_NoNaN = RST.dropna()
RST_NoNaN.shape


(14, 17)
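Before dropping rows, it can help to see how many values are missing in each column. The sketch below does this on a hypothetical dataframe `toy` (made-up values, not the BTS data) and then applies dropna() with its default behavior.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with some missing values
toy = pd.DataFrame({
    "ORIGIN":        ["RST", "RST", "RST"],
    "DEP_DELAY":     [-2.0, 30.0, np.nan],
    "CARRIER_DELAY": [np.nan, 30.0, np.nan],
})

# Count of missing values per column
missing_counts = toy.isna().sum()
print(missing_counts)

# Default dropna(): a row is removed if ANY of its columns is missing
no_nan = toy.dropna()
print(no_nan.shape)
```

Because a single missing value is enough to remove a row, a column that is mostly empty can remove most of a dataframe.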

In the following variation, only flights with a missing CARRIER_DELAY are excluded.  The notnull() method is applied inside filter_by() here.

In [60]:
RST_NoNaN_CarrierDelay = (
         OnTime
         >> filter_by( X.ORIGIN == "RST" )
         >> filter_by(X.CARRIER_DELAY.notnull())
      )
RST_NoNaN_CarrierDelay.shape
#RST_NoNaN_CarrierDelay.head()


(14, 17)
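The two row counts happen to agree in the outputs above, but in general filtering on one column and dropping every incomplete row can give different results. The plain-pandas sketch below illustrates this on a hypothetical dataframe `toy` (made-up values), and also shows dropna(subset=...) as an alternative to the notnull() mask.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row 1 has CARRIER_DELAY but is missing ARR_TIME
toy = pd.DataFrame({
    "ORIGIN":        ["RST", "RST", "RST"],
    "ARR_TIME":      [1346.0, np.nan, 601.0],
    "CARRIER_DELAY": [np.nan, 30.0, 0.0],
})

# Keep rows where CARRIER_DELAY is present, whatever the other columns hold
by_mask = toy[toy["CARRIER_DELAY"].notnull()]

# dropna() restricted to one column selects the same rows
by_subset = toy.dropna(subset=["CARRIER_DELAY"])

# Unrestricted dropna() also removes the row with the missing ARR_TIME
all_complete = toy.dropna()

print(by_mask.equals(by_subset))
print(len(by_mask), len(all_complete))
```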