


**Author:** Steffen Schober



## Acknowledgment



This notebook is based on the Dask tutorial.



## Prepare Data



Prepare the data. Make sure that `03_prep.py` is in the same directory as this notebook.



In [None]:
%run 03_prep.py -d flights

In [None]:
%run 03_prep.py -d accounts

## Setup



In [None]:
from dask.distributed import Client

client = Client(n_workers=4)
client

You can access the dashboard in your web browser. The link is printed here:



In [None]:
print(client.cluster.dashboard_link)

Explore the dashboard; it contains a lot of information.
Note that the `Info` page shows the TCP endpoint of the scheduler,
which you can use to connect to the cluster via the `Client`.



## First Example - Loading CSV file



In [None]:
import os
import dask
filename = os.path.join('data', 'accounts.*.csv')
filename

In [None]:
import dask.dataframe as dd
df = dd.read_csv(filename)
df.head()

In [None]:
# load and count number of rows
len(df)

## Flights Data Set



In [None]:
# lazily load the flights data; only a small sample is read at this point
df = dd.read_csv(os.path.join('data', 'nycflights', '*.csv'),
                 parse_dates={'Date': [0, 1, 2]},
                 dtype={'TailNum': str,
                        'CRSElapsedTime': float,
                        'Cancelled': bool})
df

Notice that the representation of the dataframe object contains no data: Dask has done just enough to read the start of the first file and infer the column names and dtypes.
We enforce the dtype for three columns because they contain no data in the first rows, so
type inference would fail (you can check this by omitting `dtype` in `read_csv()`).



In [None]:
df.dtypes

In [None]:
df.head()

Unlike `pandas.read_csv` which reads in the entire file before inferring datatypes,
`dask.dataframe.read_csv` only reads in a sample from the beginning of the file (or first file if using a glob).
These inferred datatypes are then enforced when reading all partitions.
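
The effect of sample-based inference can be reproduced without Dask. The following is a minimal pure-Python sketch, not Dask's actual implementation. The helper names (`infer_dtype`, `infer_from_sample`, `parse_all`) and the toy CSV are hypothetical, and typing an all-empty sample column as `float` is an assumption chosen to mirror how pandas treats an all-missing column. The sketch infers types from the first two rows only, then shows that enforcing them on the whole file fails until the type is given explicitly.

```python
import csv
import io


def infer_dtype(values):
    """Infer a column type from sample values; empty strings count as missing."""
    present = [v for v in values if v != ""]
    if not present:
        # Assumption: an all-missing sample is typed as float (NaN-like).
        return float
    for caster in (int, float):
        try:
            for v in present:
                caster(v)
            return caster
        except ValueError:
            continue
    return str


def infer_from_sample(text, sample_rows):
    """Infer one type per column using only the first `sample_rows` rows."""
    reader = csv.DictReader(io.StringIO(text))
    columns = {name: [] for name in reader.fieldnames}
    for i, row in enumerate(reader):
        if i >= sample_rows:
            break
        for name, value in row.items():
            columns[name].append(value)
    return {name: infer_dtype(vals) for name, vals in columns.items()}


def parse_all(text, dtypes):
    """Parse every row, enforcing the given types (missing values become None)."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {name: (dtypes[name](v) if v != "" else None) for name, v in row.items()}
        for row in reader
    ]


# Toy data: TailNum is empty in the sampled rows and a string later on.
CSV_TEXT = """DepDelay,TailNum
5,
-3,
12,N123AA
"""

sampled = infer_from_sample(CSV_TEXT, sample_rows=2)

try:
    parse_all(CSV_TEXT, sampled)
    failed = False
except ValueError:
    # The sampled type for TailNum (float) cannot represent "N123AA".
    failed = True

# Overriding the inferred type plays the role of read_csv's dtype argument.
explicit = dict(sampled, TailNum=str)
rows = parse_all(CSV_TEXT, explicit)
```

In this toy case the override plays the same role as the `dtype` argument passed to `dd.read_csv` for `TailNum` above.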



### Some Analysis



We compute the maximum of the `DepDelay` column. With pandas alone, we would loop over the files to find each file's maximum, then take the maximum over those individual maximums:

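    from glob import glob
    import pandas as pd

    filenames = sorted(glob('data/nycflights/*.csv'))
    maxes = []
    for fn in filenames:
        df = pd.read_csv(fn)  # one file at a time
        maxes.append(df.DepDelay.max())

    final_max = max(maxes)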

We could wrap that `pd.read_csv` with `dask.delayed` so that it runs in parallel.
Regardless, we still have to think about loops, intermediate results (one per file), and the final reduction (the max of the intermediate maxes).
That is bookkeeping around the real task, which pandas expresses for a single file as:

    df = pd.read_csv(filename, dtype=dtype)
    df.DepDelay.max()
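
The per-file bookkeeping described above (a loop, one intermediate result per file, and a final reduction) looks roughly like this when written out by hand. This is a hedged sketch using only the standard library: `ThreadPoolExecutor` stands in for `dask.delayed`, the helpers `file_max` and `parallel_max` are hypothetical, and the three toy files are invented for illustration. Threads are used here only to show the structure; they are not guaranteed to speed up CPU-bound parsing.

```python
import csv
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def file_max(path, column="DepDelay"):
    """Return the maximum of one numeric column in a single CSV file."""
    with open(path, newline="") as handle:
        values = [
            float(row[column])
            for row in csv.DictReader(handle)
            if row[column] != ""
        ]
    return max(values)


def parallel_max(paths, column="DepDelay"):
    """Map file_max over all files concurrently, then reduce with max."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        partial_maxes = list(pool.map(lambda p: file_max(p, column), paths))
    return max(partial_maxes)


# Write three small toy files (hypothetical data) and reduce over them.
with tempfile.TemporaryDirectory() as tmp:
    toy = {"1990.csv": [3, 41, -2], "1991.csv": [17, 8], "1992.csv": [120, 0]}
    paths = []
    for name, delays in toy.items():
        path = os.path.join(tmp, name)
        with open(path, "w", newline="") as handle:
            writer = csv.writer(handle)
            writer.writerow(["DepDelay"])
            writer.writerows([[d] for d in delays])
        paths.append(path)
    final_max = parallel_max(paths)
```

Even in this small form, the map step, the list of partial results, and the final `max` all have to be managed explicitly.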

`dask.dataframe` lets us write pandas-like code that operates on larger-than-memory datasets in parallel.
Here we compute the max of `DepDelay`:



In [None]:
%time df.DepDelay.max().compute()

Let's visualize the graph:



In [None]:
# notice the parallelism
df.DepDelay.max().visualize()

## Exercises



Try to answer the following questions:

1.  How many rows are in our dataset?
2.  In total, how many non-cancelled flights were taken?
3.  In total, how many non-cancelled flights were taken from each airport?
4.  What day of the week has the worst average departure delay?

Hint for the third question:
use `groupby` with the aggregate function `count`.
See [https://pandas.pydata.org/pandas-docs/stable/groupby.html](https://pandas.pydata.org/pandas-docs/stable/groupby.html).



## Sharing Intermediate Results



When computing all of the above, we sometimes did the same operation more than once.
For most operations, `dask.dataframe` hashes the arguments, allowing duplicate computations to be shared, and only computed once.

For example, let's compute the mean and standard deviation of the departure delay of all non-cancelled flights.
Since Dask operations are lazy, those values aren't the final results yet; they're just the recipe required to get the result.

If we compute them with two separate calls to `compute`, there is no sharing of intermediate computations.



In [None]:
non_cancelled = df[~df.Cancelled]
mean_delay = non_cancelled.DepDelay.mean()
std_delay = non_cancelled.DepDelay.std()

In [None]:
%%time

mean_delay_res = mean_delay.compute()
std_delay_res = std_delay.compute()

Now let's pass both to a single `dask.compute` call.



In [None]:
%%time
mean_delay_res, std_delay_res = dask.compute(mean_delay, std_delay)

The task graphs for both results are merged when calling `dask.compute`, allowing shared operations to only be done once instead of twice.
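
To illustrate why merging graphs saves work, here is a toy model of lazy values keyed by a hash of their recipe. It is a sketch under stated assumptions, not Dask's internals: the `LazyValue` class, the `compute` function, and the `load_delays` data are all hypothetical, and the key is simply a SHA-1 of the function name and dependency keys. Two lazy values that depend on the same upstream step share its key, so a merged graph runs that step once.

```python
import hashlib
import statistics


class LazyValue:
    """A recipe: a function plus the recipes it depends on, keyed by a hash."""

    def __init__(self, func, *deps):
        self.func = func
        self.deps = deps
        payload = func.__name__ + "|" + ",".join(d.key for d in deps)
        self.key = hashlib.sha1(payload.encode()).hexdigest()

    def graph(self):
        """Return {key: LazyValue} for this recipe and all of its dependencies."""
        tasks = {self.key: self}
        for dep in self.deps:
            tasks.update(dep.graph())
        return tasks


def compute(*values):
    """Merge the graphs of all requested values and run each unique task once."""
    merged = {}
    for value in values:
        merged.update(value.graph())
    results = {}

    def run(key):
        if key not in results:
            task = merged[key]
            results[key] = task.func(*(run(d.key) for d in task.deps))
        return results[key]

    return tuple(run(v.key) for v in values)


calls = {"load": 0}  # counts how often the expensive step actually runs


def load_delays():
    """Stand-in for reading and filtering the data (hypothetical values)."""
    calls["load"] += 1
    return [5.0, -3.0, 12.0, 40.0]


delays = LazyValue(load_delays)
mean_delay = LazyValue(statistics.mean, delays)
std_delay = LazyValue(statistics.stdev, delays)

# Two separate calls: the load step runs once per call.
compute(mean_delay)
compute(std_delay)
separate_loads = calls["load"]

# One call: the graphs are merged and the shared load step runs once.
calls["load"] = 0
mean_res, std_res = compute(mean_delay, std_delay)
shared_loads = calls["load"]
```

In this model `separate_loads` counts two runs of the load step while `shared_loads` counts one, which mirrors the timing difference between the two notebook cells above.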

