# Data: Isolating Data

*Purpose*: One of the keys to a successful analysis is the ability to *focus* on particular topics. When analyzing a dataset, our ability to focus is tied to our facility at *isolating data*. In this exercise, you will practice isolating columns with `tf_select()`, picking specific rows with `tf_filter()`, and sorting your data with `tf_arrange()` to see what rises to the top.

*Aside*: The data-management verbs in grama are heavily inspired by the [dplyr](https://dplyr.tidyverse.org/) package in the R programming language.


## Setup


In [None]:
import grama as gr
DF = gr.Intention()
%matplotlib inline

We'll use the `nycflights13` package in this exercise; it provides a dataset of flights involving the New York City area during 2013.


In [None]:
from nycflights13 import flights as df_flights
df_flights

# The DataFrame Object

The variable `df_flights` above is a *DataFrame*, a way of storing tabular data (rows and columns) in Python.

*Aside*: The DataFrame class is provided by the [Pandas](https://pandas.pydata.org/docs/index.html) package, but we will use [Grama](https://github.com/zdelrosario/py_grama) to work with DataFrames.


## Head and Tail

Looking at an entire DataFrame is usually overwhelming; it is useful to be able to focus on a small subset of the data. One of our most basic tools is to get the first or last few rows of a DataFrame, which we can do with `DataFrame.head(n)` and `DataFrame.tail(n)`. These functions are called with the following syntax:

```python
df_flights.head(10)
```


### __q1__ Get the tail of a DataFrame

Get the last 10 rows of `df_flights`.


In [None]:
# TASK: Get the last 10 rows of df_flights
# Your code here


# Piping

Often, when carrying out studies with data, we want to perform *multiple* operations on the same dataset. We could do this by assigning *intermediate variables*, such as a temporary DataFrame:


In [None]:
# NOTE: No need to edit
df_tmp = df_flights.head(10) # Temporary DataFrame
df_tmp.tail(5)

Alternatively, we could *chain* together calls:


In [None]:
# NOTE: No need to edit
df_flights.head(10).tail(5)

Rather than these two approaches, using [grama](https://github.com/zdelrosario/py_grama) we can form a *data pipeline* using the pipe operator `>>`. The following code demonstrates the use of the pipe operator:


In [None]:
# NOTE: No need to edit
(
    df_flights
    >> gr.tf_head(10)
    >> gr.tf_tail(5)
)

It's useful to think of the pipe operator as the English word "then". This allows us to translate code:

```python
(
    df_flights
    >> gr.tf_head(10)
    >> gr.tf_tail(5)
)
```

into something that reads like a plain-language sentence:

```
(
    Start with df_flights
    "then" take the first 10
    "then" take the last 5.
)
```


*Aside*: Most of the grama functions we'll use in this exercise start with the `tf_` prefix; this means they take in a DataFrame and return a DataFrame (they `t`rans`f`orm data).
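
*Aside*: If you are curious how `>>` can mean "then", here is a minimal sketch of the general idea using plain Python lists. This is *not* grama's actual implementation; the names `Pipeable`, `head`, and `tail` are hypothetical and exist only for this illustration.

```python
# Hypothetical sketch: NOT grama's implementation, just the general idea
class Pipeable:
    """Wrap a function so that `data >> Pipeable(f)` calls `f(data)`."""

    def __init__(self, func):
        self.func = func

    def __rrshift__(self, data):
        # Python calls this for `data >> self` when `data` itself
        # does not know how to handle `>>` (plain lists do not)
        return self.func(data)


def head(n):
    # Build a pipeable step that keeps the first n items
    return Pipeable(lambda data: data[:n])


def tail(n):
    # Build a pipeable step that keeps the last n items
    return Pipeable(lambda data: data[-n:])


numbers = list(range(20))
# Start with numbers, "then" take the first 10, "then" take the last 5
result = numbers >> head(10) >> tail(5)
print(result)
```

Each step receives whatever the previous step returned, which is why a pipeline reads top to bottom like a sentence.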


### __q2__ Convert to piped code

Convert the following code to a pipe-enabled version.

*Note*: The pipe-enabled version of `DataFrame.head()` is `tf_head()`.


In [None]:
# TASK: Convert the following code to pipe-enabled form
df_flights.head(10)

(
    df_flights
    # Complete this code

)

# Column Selection

The grama function `gr.tf_select()` allows us to select particular columns, which is helpful for getting a more "focused" view of a dataset.


### __q3__ Select particular columns

Select the columns "month", "day", "origin", and "dest".


In [None]:
## TASK: Select the columns "month", "day", "origin", and "dest"
(
    df_flights

)

## Selection Helpers

The `gr.tf_select()` function is helpful, but it is made *extremely powerful* with a few selection helpers. Rather than specify specific column names, we can use helpers to *match* names that satisfy different criteria.

Here are a few of the most important selection helpers: 

| Helper | Selects |
|---|---|
| `gr.starts_with(s)` | Columns that start with string `s` |
| `gr.ends_with(s)` | Columns that end with string `s` |
| `gr.contains(s)` | Columns that contain the string `s` |
| `gr.everything()` | All columns *not already selected* |

For instance, the following code will select all the columns whose name starts with `"dep_"`.


In [None]:
# NOTE: No need to edit
(
    df_flights
    >> gr.tf_select(gr.starts_with("dep_"))
)
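
*Aside*: Conceptually, a selection helper just tests each column *name* against a rule. The sketch below mimics that idea with plain Pandas and Python string methods on a small made-up DataFrame (`df_toy` is hypothetical; only its column names resemble `df_flights`). It is meant to build intuition, not to describe how grama implements its helpers.

```python
import pandas as pd

# Made-up toy data; only the column names mimic df_flights
df_toy = pd.DataFrame({
    "dep_time": [517, 533],
    "dep_delay": [2, 4],
    "arr_time": [830, 850],
    "origin": ["EWR", "LGA"],
})

# Similar in spirit to gr.starts_with("dep_")
cols_dep = [col for col in df_toy.columns if col.startswith("dep_")]
# Similar in spirit to gr.contains("rr")
cols_rr = [col for col in df_toy.columns if "rr" in col]

print(df_toy[cols_dep])
print(cols_rr)
```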

### __q4__ Match columns

Select only those columns whose name ends with `"_time"`.


In [None]:
## TASK: Select only those columns whose name ends with "_time"
(
    df_flights

)

## Re-arranging columns with `gr.everything()`

The `gr.everything()` helper may seem silly, but it is actually *extremely* useful; since the everything helper selects all columns *not already selected*, we can use it to re-arrange the columns in a DataFrame for a more convenient view. For instance, we can bring a few columns closer together to aid in column comparisons, as the following code demonstrates:


In [None]:
## NOTE: No need to edit
(
    df_flights
    >> gr.tf_select("origin", "dest", "distance", gr.everything())
)

### __q5__ Re-arrange columns

Re-arrange the columns to place `dest`, `origin`, and `carrier` at the left, but retain all other columns.


In [None]:
# TASK: Bring "dest", "origin", and "carrier" to the left,
# but keep all other columns
(
    df_flights

)

# Row Filtering

Just as we can select particular columns, we can *filter* to obtain particular rows.

## Accessing column values

To access a single column of a DataFrame, we can use bracket `[]` notation:


In [None]:
# NOTE: No need to edit
(
    df_flights["origin"]
)

Note that this returns a different datatype: a Pandas *Series*.


## Making a comparison

Remember that we have the following comparison operators:

| Symbol | Compares |
|---|---|
| `x < y` | `x` less than `y` |
| `x <= y` | `x` less than or equal to `y` |
| `x > y` | `x` greater than `y` |
| `x >= y` | `x` greater than or equal to `y` |
| `x == y` | `x` (exactly) equals `y` |
|          | Note that `==` works for strings too! |
| `x = y` | Not a comparison! A single `=` is assignment |

With a single column, we can make a comparison against a desired value:


In [None]:
# NOTE: No need to edit
(
    df_flights["origin"] == "JFK"
)

The code above gives us `True` when the `"origin"` is `"JFK"`, and `False` when it is not.
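
To make this concrete, here is a small sketch with a made-up Series standing in for the `"origin"` column (the values are invented for illustration). It shows that the comparison produces one `True`/`False` per row, and that summing those values counts the matches.

```python
import pandas as pd

# Made-up values standing in for the "origin" column
origin = pd.Series(["JFK", "EWR", "JFK", "LGA"])

# One True/False value per row
is_jfk = origin == "JFK"
print(is_jfk.tolist())

# True counts as 1 when summing, so this counts the matching rows
n_jfk = is_jfk.sum()
print(n_jfk)
```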

## Filtering using comparisons

These `True`/`False` values are useful, because we can use them to *filter* to only those rows where the comparison yields `True`:


In [None]:
# NOTE: No need to edit
(
    df_flights
    >> gr.tf_filter(df_flights["origin"] == "JFK")
    # Show that the "origin" is indeed "JFK"
    >> gr.tf_select("origin", gr.everything())
)

### __q6__ Find the early departures

Use `gr.tf_filter()` to find all the flights that left early.


In [None]:
# TASK: Filter for only those rows with a negative "dep_delay"
(
    df_flights

)

## The data pronoun `DF`

Way back at the beginning of this notebook, you may have noticed this line of code:

```python
DF = gr.Intention()
```

This assigned the "data pronoun" to the variable `DF`. This pronoun can be used inside grama functions to refer to the data *as it is* at any stage in the pipeline. Rather than:

```python
(
    df_flights
    >> gr.tf_filter(df_flights["dep_delay"] < 0)
)
```

we instead write

```python
(
    df_flights
    >> gr.tf_filter(DF["dep_delay"] < 0)
)
```

The data pronoun `DF` is really just an alias for the DataFrame, so we can still use bracket notation `[]`.


The data pronoun `DF` is not just convenient; it is *necessary* to make some operations work! Imagine we wanted to find all cases that satisfy both `dep_delay < 0` and `arr_delay < 0`. Consider the following code:

```python
(
    df_flights
    # This filter works properly
    >> gr.tf_filter(df_flights["dep_delay"] < 0)
    # We now have fewer rows! The next filter will fail
    >> gr.tf_filter(df_flights["arr_delay"] < 0)
)
```

The data pronoun `DF` allows us to refer to the data *as it is* in the pipeline; this resolves the issue with having fewer rows in the second filter. You'll get a chance to fix this code in the next task.
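
The row-count mismatch is easy to see on a tiny made-up DataFrame. The sketch below uses plain Pandas and only tracks the bookkeeping (how many rows each object has); it does not reproduce grama's internals, and `df_toy` is hypothetical.

```python
import pandas as pd

# Made-up toy data with the same column names as the example above
df_toy = pd.DataFrame({
    "dep_delay": [-5, 10, -2, -1],
    "arr_delay": [-8, 3, 4, -6],
})

# First filter: keep the early departures
df_early = df_toy[df_toy["dep_delay"] < 0]

# A comparison made on the ORIGINAL data still has one entry per original row...
mask_stale = df_toy["arr_delay"] < 0
# ...while a comparison made on the CURRENT data matches the filtered rows
mask_fresh = df_early["arr_delay"] < 0

print(len(df_early), len(mask_stale), len(mask_fresh))

# Using the comparison built from the current data gives the intended result
df_both = df_early[mask_fresh]
print(df_both)
```

The "stale" comparison describes rows that are no longer in the pipeline; referring to the data as it currently is avoids that mismatch.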


### __q7__ Use `DF` to fix this code

Use the data pronoun `DF` to fix the following code:


In [None]:
# TASK: Fix this code using the data pronoun DF
# Uncomment to begin this task
# (
#     df_flights
#     >> gr.tf_filter(df_flights["dep_delay"] < 0)
#     >> gr.tf_filter(df_flights["arr_delay"] < 0)
# )


# Arranging

Filtering is particularly helpful when combined with *sorting*; we can sort on any column using the function `gr.tf_arrange()`. For instance, the following code sorts by `"distance"`, pulling the shortest flights to the top of the DataFrame:


In [None]:
# NOTE: No need to edit
(
    df_flights
    # Sorts from smallest to largest
    >> gr.tf_arrange(DF["distance"])
    # Inspect the route
    >> gr.tf_select("distance", "origin", "dest", gr.everything())
)

We can also reverse the order of the sort with the `gr.desc()` helper, as shown below:


In [None]:
# NOTE: No need to edit
(
    df_flights
    # Sorts from largest to smallest (*desc*ending)
    >> gr.tf_arrange(gr.desc(DF["distance"]))
    # Inspect the route
    >> gr.tf_select("distance", "origin", "dest", gr.everything())
)
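
*Aside*: For comparison, here is a sketch of ascending and descending sorts in plain Pandas using `DataFrame.sort_values()`. The data are made up for illustration; this is offered as a rough analogue, not as a statement of how `gr.tf_arrange()` works internally.

```python
import pandas as pd

# Made-up toy data
df_toy = pd.DataFrame({
    "route": ["a", "b", "c"],
    "distance": [300, 100, 200],
})

# Smallest to largest (the default)
df_up = df_toy.sort_values("distance")
# Largest to smallest
df_down = df_toy.sort_values("distance", ascending=False)

print(df_up)
print(df_down)
```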

### __q8__ Find the earliest departures

Find the top 10 *earliest* departures. How early did these depart? Do these early departures have anything in common?

*Hint*: You will need to combine functions to accomplish this.


In [None]:
# TASK: Find the top 10 earliest departures
(
    df_flights

)

# Isolating to answer questions

Before we close this exercise, let's use these data isolation tools to answer a real question about the dataset:


### __q9__ What are these data for?

What are these data for? In particular, in what way are they "focused on the NYC area"? Complete the following tasks, and answer the questions under *observations* below.

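*Hint*: You might find it helpful to combine comparisons with the element-wise *or* operator, written with the symbol `|`. The `or` keyword does not work inside a `gr.tf_filter()`. Wrap each comparison in its own parentheses, because `|` is evaluated before `==`.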
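
The sketch below shows the difference between `|` and `or` on a made-up Pandas Series (the codes are invented and are not real airports). It uses plain Pandas rather than `gr.tf_filter()`, so treat it as an illustration of the operator behavior only.

```python
import pandas as pd

# Made-up values standing in for a column of codes
code = pd.Series(["AAA", "BBB", "CCC", "AAA"])

# Parentheses matter: `|` is evaluated before `==`
mask_or = (code == "AAA") | (code == "BBB")
print(mask_or.tolist())

# The `or` keyword needs a single True/False, which a whole Series cannot give
or_failed = False
try:
    (code == "AAA") or (code == "BBB")
except ValueError as err:
    or_failed = True
    print("`or` failed:", err)
```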


In [None]:
# TASK: Filter to only those cases where the destination airport
# is one of "JFK", "LGA", or "EWR"
df_q9dest = (
    df_flights

)

# NOTE: No need to edit; use this to check your work
assert \
    df_q9dest.shape[0] == 1, \
    "Incorrect filter"

df_q9dest 

In [None]:
# TASK: Filter to only those cases where the origin airport
# is one of "JFK", "LGA", or "EWR"
df_q9origin = (
    df_flights

)

# NOTE: No need to edit; use this to check your work
assert \
    df_q9origin.shape[0] == df_flights.shape[0], \
    "Incorrect filter"

df_q9origin 

*Observations*

- How many rows have either "JFK", "LGA", or "EWR" as their *destination*?
  - (Your response here)
- How many rows have either "JFK", "LGA", or "EWR" as their *origin*?
  - (Your response here)
- Would we be able to answer questions related to flights *entering* the NYC area using this dataset?
  - (Your response here)
- In what sense is this dataset "focused on NYC"?
  - (Your response here)


*Aside*: Data are not just numbers. Data are *numbers with context*. Every dataset is put together for some reason. This reason will inform what observations (rows) and variables (columns) are *in the data*, and which are *not in the data*. Conversely, thinking carefully about what data a person or organization bothered to collect---and what they ignored---can tell you something about the *perspective* of those who collected the data. Thinking about these issues is partly what separates __data science__ from programming or machine learning. (`end-rant`)
