## dask <code>compute()</code> Deferred Computing

We're going to build a somewhat interesting workload and then run it a couple of different ways.  Let's start by loading the NYC flight data.

This exercise will reinforce dask dataframe programming concepts by building a set of analyses. We will then use these types of `groupby` and aggregate queries to look at execution properties.

Code that you need to write is indicated with #TODO. I've left the output of the reference implementation in the cells so that you can refer to it for correctness.  You can refer to the read-only shared version for this output.

Read in the NYC Flights data (the cell below reads the CSV files from `../data/nycflight/`) and then print the dataframe metadata.

In [None]:
import dask.dataframe as dd

df = dd.read_csv("../data/nycflight/*.csv",
                 parse_dates={'Date': [0, 1, 2]},
                 dtype={'TailNum': str,
                        'CRSElapsedTime': float,
                        'Cancelled': bool})
df

Let's build a set of queries around the performance of particular planes, identified by tail number.  The pattern will be to <code>groupby('TailNum')</code> and then compute statistics.
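As a minimal sketch of this group-then-reduce pattern, here is a toy example in pandas. The frame and its column names (`plane`, `delay`) are made up for illustration and are not part of the flight data. Dask dataframes mirror this API, except that the result stays lazy until you call `compute()`.

```python
import pandas as pd

# Toy data; 'plane' and 'delay' are hypothetical column names.
toy = pd.DataFrame({
    "plane": ["A", "A", "B", "B", "B"],
    "delay": [10.0, 20.0, -5.0, 5.0, 30.0],
})

# Group rows by key, select one column, and reduce each group to one statistic.
mean_delay = toy.groupby("plane")["delay"].mean()
print(mean_delay)
```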

__Query__: What is the average departure delay 'DepDelay' for each plane?

In [None]:
#TODO
df_delay

Interesting: some planes were early. Let's plot a histogram of the distribution with 1000 bins.
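To see what binning does before plotting, the sketch below counts a few made-up values into equal-width bins with `numpy.histogram`. The values and the bin count of 5 are assumptions chosen to keep the output small; a single large outlier stretches the range and leaves most bins empty, which is the shape to expect from a long-tailed delay distribution.

```python
import numpy as np

# Hypothetical average delays, including one extreme value.
values = np.array([-5.0, 0.0, 2.0, 3.0, 40.0])

# Split the range [min, max] into 5 equal-width bins and count values per bin.
counts, edges = np.histogram(values, bins=5)
print(counts)
print(edges)
```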

In [None]:
%matplotlib inline
#TODO

OK, we have very few chronically bad planes.  Let's find those that are 30 (or more) minutes late on average.

In [None]:
import numpy as np
lateplanes = #TODO
print(np.sort(lateplanes))

OK, this is a hard query.
Build a dataframe that is a subset of all the data associated with the late planes.  There are many ways to solve this problem. I would recommend looking at the `isin()` function in dask.
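For reference, this is roughly how `isin()` behaves, shown with pandas on a toy frame. The data, the column names, and the `wanted` list are hypothetical. A dask Series offers an `isin()` method with the same meaning, producing a lazy boolean mask.

```python
import pandas as pd

# Toy data; names and values are made up for illustration.
toy = pd.DataFrame({
    "plane": ["A", "B", "C", "A", "C"],
    "delay": [40.0, 5.0, 35.0, 50.0, 45.0],
})

wanted = ["A", "C"]  # hypothetical list of keys to keep

# isin() returns a boolean mask that is True where the value is in the list.
mask = toy["plane"].isin(wanted)

# Indexing with the mask keeps only the matching rows.
subset = toy[mask]
print(subset)
```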

In [None]:
df_late = #TODO
df_late

Double-check that the plane indexes match.

In [None]:
import numpy as np
latelist = #TODO
print(np.sort(latelist))

Now, let's get a sense of what airports these planes fly out of.  For the planes in `latelist`, let's find the total delay at these airports, the average delay by airport, and the total number of flights at each airport.
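The three statistics correspond to the `sum`, `mean`, and `count` reductions over a grouped column. The sketch below computes all three at once with `agg` on a toy pandas frame whose data and column names are made up. Dask groupby objects provide a similar `agg` method, though here the exercise asks for each statistic in its own cell.

```python
import pandas as pd

# Toy data; 'airport' and 'delay' are hypothetical column names.
toy = pd.DataFrame({
    "airport": ["X", "X", "Y"],
    "delay": [10.0, 30.0, 20.0],
})

# One row per airport, one column per reduction.
stats = toy.groupby("airport")["delay"].agg(["sum", "mean", "count"])
print(stats)
```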

In [None]:
#TODO total DepDelay for planes by Origin airport

In [None]:
#TODO average DepDelay for planes by Origin airport

In [None]:
#TODO number of late flights by Origin airport

I'm not sure that all of these statistics make sense, but checking them is part of debugging.

## Deferred computing

We are going to show the value of deferred computation by timing the following
queries in two different ways:

```python
df1 = df.groupby(['Origin','TailNum']).DepDelay.mean()
df2 = df.groupby(['TailNum','Origin']).DepDelay.mean()
df3 = df.groupby(['Origin','TailNum']).DepDelay.max()
df4 = df.groupby(['TailNum','Origin']).DepDelay.max()
```

 1. In one cell, add these lines and call `compute()` on each one.
 2. In the next cell, add the lines and call `compute()` only once, at the end.

 First reload the data:

In [None]:
import dask.dataframe as dd
df = dd.read_csv("../data/nycflight/*.csv",
                 parse_dates={'Date': [0, 1, 2]},
                 dtype={'TailNum': str,
                        'CRSElapsedTime': float,
                        'Cancelled': bool})

Run the workload calling `compute()` on every line.

In [None]:
%%time

#TODO

Load the data again to make sure that intermediate results are not cached, then run the entire workload calling `compute()` just once.

In [None]:
import dask.dataframe as dd
df = dd.read_csv("../data/nycflight/*.csv",
                 parse_dates={'Date': [0, 1, 2]},
                 dtype={'TailNum': str,
                        'CRSElapsedTime': float,
                        'Cancelled': bool})

In [None]:
%%time

#TODO



### Outcomes
* Wrestled with dataframe syntax and concepts.  Good for you.
* Witnessed the benefit of deferred computation.

### Questions

1. On computational reuse in execution graphs:
    1. How much faster is it to defer the computation to the end versus calling `compute()` on every line?
    2. What computations are shared in the workflow?  Be specific, i.e. identify the code.
    3. Explain the speedup realized in 1(a). Why is it not faster? Why is it not slower?
