**SA433 &#x25aa; Data Wrangling and Visualization &#x25aa; Fall 2024**

# Lesson 13. Filtering Observations in Pandas

## Overview

- Over the next several lessons, we'll learn how to perform 5 fundamental data wrangling tasks with tabular data:

    1. Filter rows by their values
    2. Sort the rows
    3. Select and drop columns
    4. Create new columns with functions of existing columns
    5. Collapse many values down to a summary

- These 5 tasks combined will allow us to solve a majority of our data wrangling challenges

## In this lesson...

- How do we filter, or pick rows, based on their values?


- An aside: how are missing values represented in Pandas?

<hr style="border-top: 2px solid gray; margin-top: 1px; margin-bottom: 1px"></hr>

## The nycflights13 dataset

- Let's begin by importing Pandas

In [None]:
import pandas as pd

- In the `data` folder next to this notebook is a file named `nycflights13_flights.csv.zip`
    - This is a ZIP file containing a single CSV file
    - ZIP files not only package multiple files together, they also compress their contents to save disk space

- In this case, there's no need to extract the CSV file: `pd.read_csv()` can read ZIPped CSV files directly:

In [None]:
df = pd.read_csv('data/nycflights13_flights.csv.zip')

- This dataset, based on data from the [US Bureau of Transportation Statistics](https://www.transtats.bts.gov/), contains information about all 336,776 flights that departed from New York City in 2013


- Each row corresponds to one flight, and contains the following data:

| Column | Description |
| :- | :- |
| `year`, `month`, `day` | Date of departure |
| `dep_time`, `arr_time` | Actual departure and arrival times (format HHMM or HMM), local timezone |
| `sched_dep_time`, `sched_arr_time` | Scheduled departure and arrival times (format HHMM or HMM), local timezone |
| `dep_delay`, `arr_delay` | Departure and arrival delays, in minutes; negative values represent early departures/arrivals |
| `carrier` | Two letter carrier abbreviation |
| `flight` | Flight number |
| `tailnum` | Plane tail number |
| `origin`, `dest` | Origin and destination |
| `air_time` | Amount of time spent in the air, in minutes |
| `distance` | Distance between airports, in miles |
| `hour`, `minute` | Time of scheduled departure broken into hour and minutes |
| `time_hour` | Scheduled date and hour of the flight |



- Let's take a quick look:

In [None]:
df.head()

In [None]:
df.info()

<hr style="border-top: 2px solid gray; margin-top: 1px; margin-bottom: 1px"></hr>

## The query method

- We can filter observations based on their values using the `.query()` method of a DataFrame


- `.query()` takes a Python expression as a *string*, in which the variable names refer to the columns of the DataFrame


- For example, we can select all flights operated by United Airlines (UA) with:
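- A minimal sketch of this kind of query, run on a small made-up DataFrame named `toy` (a hypothetical stand-in, not the real data) so that it works on its own; with the full dataset, the same query string would be passed to `df.query()`:

```python
import pandas as pd

# Hypothetical stand-in for the flights data (values are made up)
toy = pd.DataFrame({
    'carrier': ['UA', 'AA', 'UA', 'DL'],
    'flight': [1545, 1141, 1714, 461],
    'dest': ['IAH', 'MIA', 'IAH', 'ATL'],
})

# Keep only the rows whose carrier is "UA".
# The inner double quotes mark a string literal inside the query string.
ua_flights = toy.query('carrier == "UA"')
print(ua_flights)
```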

- Note that `df.query()` returns a new DataFrame: it doesn't remove the filtered-out observations from `df` itself


- If you want to save the resulting DataFrame, you can assign it to another variable:
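- As a rough illustration, assuming a small made-up DataFrame `toy` and a hypothetical variable name `toy_ua`, we can check that the original is left untouched after saving the result:

```python
import pandas as pd

# Hypothetical stand-in for the flights data (values are made up)
toy = pd.DataFrame({
    'carrier': ['UA', 'AA', 'UA'],
    'flight': [1545, 1141, 1714],
})

# Save the filtered rows under a new name
toy_ua = toy.query('carrier == "UA"')

print(len(toy))     # the original still has all of its rows
print(len(toy_ua))  # the saved result has only the UA rows
```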

### Comparison operators

- The standard Python comparison operators (i.e., `==`, `!=`, `>`, `<`, `>=`, `<=`) all work with `.query()`


- For example, we can select all flights that spent more than 400 minutes in the air like this:
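- One way this could look, sketched on a small made-up DataFrame `toy` rather than the full dataset (with the real data, the same query string would go to `df.query()`):

```python
import pandas as pd

# Hypothetical stand-in for the flights data (values are made up)
toy = pd.DataFrame({
    'flight': [1545, 1141, 51, 15],
    'air_time': [227.0, 160.0, 645.0, 401.0],
})

# Keep only the rows with more than 400 minutes in the air
long_flights = toy.query('air_time > 400')
print(long_flights)
```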

### Logical operators

- The standard Python logical operators (i.e., `and`, `or`, `not`) also work with `.query()`


- For example, we can select all flights on January 1 like this:
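- A possible version of this query, shown on a small made-up DataFrame `toy` so it runs on its own; both conditions must hold for a row to be kept:

```python
import pandas as pd

# Hypothetical stand-in for the flights data (values are made up)
toy = pd.DataFrame({
    'month': [1, 1, 2, 12],
    'day': [1, 2, 1, 1],
    'flight': [1545, 1141, 725, 461],
})

# `and` keeps a row only when both comparisons are true
jan1 = toy.query('month == 1 and day == 1')
print(jan1)
```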

### Membership operators

- The standard Python membership operators (i.e., `in`, `not in`) work with `.query()` as well


- For example, we can select all flights headed to Atlanta (ATL), Chicago O'Hare (ORD), or Los Angeles (LAX) like this:
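- A sketch of a membership query, again assuming a small made-up DataFrame `toy` in place of the full dataset:

```python
import pandas as pd

# Hypothetical stand-in for the flights data (values are made up)
toy = pd.DataFrame({
    'flight': [461, 1545, 301, 194, 1141],
    'dest': ['ATL', 'IAH', 'ORD', 'LAX', 'MIA'],
})

# Keep the rows whose destination appears in the list
to_hubs = toy.query('dest in ["ATL", "ORD", "LAX"]')
print(to_hubs)
```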

### Using Python variables

- We can refer to variables in the Python environment by prefixing them with `@`


- For example, we can rewrite the query above as follows:
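- A minimal sketch of the `@` prefix, assuming a made-up DataFrame `toy` and a hypothetical list variable named `airports`:

```python
import pandas as pd

# Hypothetical stand-in for the flights data (values are made up)
toy = pd.DataFrame({
    'flight': [461, 1545, 301, 194, 1141],
    'dest': ['ATL', 'IAH', 'ORD', 'LAX', 'MIA'],
})

# An ordinary Python variable defined outside the query string
airports = ['ATL', 'ORD', 'LAX']

# `@airports` refers to the Python variable, not to a column
to_airports = toy.query('dest in @airports')
print(to_airports)
```

- Keeping the list in a variable means it can be changed in one place without editing the query string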

### Column names with spaces

- If one of the columns has a name with a space, we can refer to it by surrounding its name in backticks `` ` ``


- For example, if we wanted to filter for rows whose values of columns `A` and `B B` are not equal, we could write:

    ```python
    df.query('A != `B B`')
    ```

<hr style="border-top: 2px solid gray; margin-top: 1px; margin-bottom: 1px"></hr>

## Missing values

- Let's take a look at flight AA 133 on January 2:
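- A sketch of how such a lookup might be written, using a made-up DataFrame `toy` whose values, including the placement of the missing ones, are placeholders rather than the real record; with the full dataset, the same query string would be passed to `df.query()`:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the flights data (all values are made up)
toy = pd.DataFrame({
    'month': [1, 1, 1],
    'day': [1, 2, 2],
    'carrier': ['UA', 'AA', 'AA'],
    'flight': [1545, 133, 1141],
    'dep_time': [517.0, np.nan, 558.0],
    'arr_delay': [11.0, np.nan, 33.0],
})

# Combine four conditions to pin down a single flight
aa133 = toy.query('carrier == "AA" and flight == 133 and month == 1 and day == 2')
print(aa133)
```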

- As we can see, many of the variables for this particular flight have a value of `NaN` (short for "Not a Number")


- By default, `NaN` is how Pandas represents **missing values**: values that, for one reason or another, do not exist


- We can pick the rows that have a missing value in a given column with `.isna()`


- For example, we can filter for the rows with missing tail numbers like this:
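- One possible form of this query, sketched on a made-up DataFrame `toy`; the sketch passes `engine='python'` so that it should behave the same whether or not `numexpr` is installed:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the flights data (values are made up)
toy = pd.DataFrame({
    'flight': [1545, 133, 1141, 791],
    'tailnum': ['N14228', np.nan, 'N619AA', np.nan],
})

# Call the column's .isna() method inside the query string
missing_tail = toy.query('tailnum.isna()', engine='python')
print(missing_tail)
```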

- *Note.* If the `numexpr` package is installed in your Python environment, using `.isna()` inside `.query()` requires forcing `.query()` to use the `python` engine with the keyword argument `engine='python'`


- Perhaps we want to filter for the rows that *do not* have missing tail numbers instead, which we can do like this:
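- A sketch of the negated filter, assuming a made-up DataFrame `toy`; `not` in front of the condition flips it:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the flights data (values are made up)
toy = pd.DataFrame({
    'flight': [1545, 133, 1141, 791],
    'tailnum': ['N14228', np.nan, 'N619AA', np.nan],
})

# `not` inverts the condition, keeping rows with a tail number
has_tail = toy.query('not tailnum.isna()', engine='python')
print(has_tail)
```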

- It turns out that in this case, we can also use the `.notna()` method, like this:
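- A short sketch on a made-up DataFrame `toy`, which also checks that `.notna()` and `not ... .isna()` select the same rows:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the flights data (values are made up)
toy = pd.DataFrame({
    'flight': [1545, 133, 1141, 791],
    'tailnum': ['N14228', np.nan, 'N619AA', np.nan],
})

# .notna() is True wherever a value is present
with_tail = toy.query('tailnum.notna()', engine='python')

# Compare against the negated .isna() form
same_rows = with_tail.equals(toy.query('not tailnum.isna()', engine='python'))
print(with_tail)
print(same_rows)
```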

- We'll learn about ways to deal with missing values in a later lesson

<hr style="border-top: 2px solid gray; margin-top: 1px; margin-bottom: 1px"></hr>

## Problems

For the problems below, use the nycflights13 dataset we used in this lesson.

### Problem 1

Find all flights that had an arrival delay of 2 or more hours.

### Problem 2

Find all flights that flew to Houston (IAH or HOU).

### Problem 3

Find all flights that were operated by Southwest (WN), Frontier (F9), or Alaska (AS).
If you didn't already in Problem 2, try using the `in` membership operator.

### Problem 4

Find all flights that departed in the summer (June, July, August, September).

### Problem 5

Find all flights that arrived more than 2 hours late, but didn't leave late.

### Problem 6

Find all flights that departed in the morning (before 1200) *or* in the evening (after 1800) on July 19. 

### Problem 7

Find all flights that were delayed by at least an hour, but made up over 30 minutes in flight.

*Hint.* You can use arithmetic operators (`+`, `-`, `*`, `/`) in combination with comparison operators directly in `.query()`.

### Problem 8

How many flights have a missing departure time? What other variables are missing for these flights? What might these rows represent?

*Write your answer here. Double-click to edit.*

<hr style="border-top: 2px solid gray; margin-top: 1px; margin-bottom: 1px"></hr>

## Notes and sources

- From the [Pandas User Guide](https://pandas.pydata.org/docs/user_guide/index.html):
    - [The query method](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#the-query-method)

- Lesson and problems inspired by Chapter 5 of [R for Data Science](https://r4ds.had.co.nz/)