# ETL and Analytics with Dask Dataframe

The mayor is convinced that Cascadia City's library usage patterns and trends will resemble those of Seattle. So we've been asked to take a dataset from Seattle and prepare it for a number of reporting tasks. 

This will involve
* consuming the raw Seattle library data, which is in a plaintext format
* selecting specific columns and rows which are important to Cascadia City, while discarding others
* writing the refined dataset out in a more efficient binary format (Apache Parquet)
* generating various reports on usage

Along the way, we need to make sure we understand Dask dataframe well enough to help out the library administration and other departments.

<img src='images/library.jpg' width=600>

We'll start out by getting access to our Dask cluster

In [None]:
import coiled
from dask.distributed import Client

cluster = coiled.Cluster(name="training-cluster")
client = Client(cluster)
client

Next, we'll use Dask dataframe to access the data

In [None]:
import dask.dataframe as ddf

loans = ddf.read_csv('s3://coiled-training/data/checkouts-small.csv', storage_options={"anon": True})

loans

__This looks a bit different from a Pandas dataframe ... so:__

## What *is* a Dask dataframe?

A Dask dataframe is a collection of Pandas dataframes, divided along the index. You can picture it like this:

<img src='images/dask-dataframe.svg' width=400>

The smaller Pandas dataframes that make up the larger, virtual Dask dataframe are called *partitions*.

So, at the top of the following output, the label __npartitions=__ refers to the number of constituent Pandas dataframes. You'll notice that Dask automatically chose a number of partitions to use, although you can customize that if you want to.

In [None]:
loans

You won't often need to interact with individual partitions, but you can if you need to:

In [None]:
loans.partitions[2]

Wait ... I thought you said the partition was a __Pandas__ dataframe!

It is ... but we haven't computed it yet.

In order to minimize extra computation, data movement, memory, and time, Dask's data structures try to be *lazy*

This allows them to optimize their operation: for example, maybe you end up needing just 2 columns out of a 900-column-wide table ... in that case, it makes sense to see what's really needed before loading all of the data

When we want to materialize a local Python object, we add `.compute()` to the API call

So, to tell Dask that we want to load up that partition locally, we could type

In [None]:
loans.partitions[2].compute()

That looks a lot like Pandas output, but we can check to be sure:

In [None]:
type(loans.partitions[2].compute())

If we just want to see a few records, we don't need to load even a single partition, though. Dask will give a preview with the `.head()` API

In [None]:
loans.head()

## Wait, wait ... I thought you just said I needed `.compute()`!

#### __When do I need `.compute()` and how can I tell?__

__Do call__ `.compute()` when you want a full Pandas object -- Dataframe or Series -- calculated for you *and* you want it loaded up in your local Python process (where your `Client` object lives).

This is typical for small, report-type outputs, like we'll do later in this notebook.

__Don't__ call `.compute()` on a huge Dask dataframe, because it likely won't fit in local memory anyway

__Don't__ call `.compute()` if the goal is to write out a large dataset (perhaps one that you've transformed) to disk. There are APIs for doing that directly from the cluster, in parallel, so that your local process doesn't have to deal with all that data.

__Don't__ call `.compute()` if there are simpler APIs designed for human, interactive consumption that might be more efficient, like `.head()` or `len()`

E.g., if we want to count the total number of rows in our dataframe, we can do this:

In [None]:
len(loans)

## What can I do with Dask dataframe and how do I do it?

If you're used to Pandas, it takes a little adjustment to get used to working with data without seeing all those nice rows and columns on the screen. But most of the operations you're used to -- selecting and transforming columns, filtering rows, grouping and aggregating -- still work.

In our first library task, we need to throw out the __Publisher__ and old row number ("__Unnamed: 0__") columns as well as "old" data.

First, let's drop the columns

In [None]:
loans2 = loans.drop(columns=['Publisher', 'Unnamed: 0'])
loans2

But what is "old" data? Let's find all of the years in the dataset

In [None]:
loans2['CheckoutYear'].unique()

Hmm... this is a "lazy" Dask Series, but we really want the actual, concrete Series of unique years.
We know this will be a small collection, so __it's time for `.compute()`__

In [None]:
loans2['CheckoutYear'].unique().compute()

After checking with the bosses, it looks like we only want records from 2010 onward, and we want to omit the 2020 data since it's ... anomalous. So we can filter that whole dataset using a Pandas-style filter

In [None]:
loans3 = loans2.query('CheckoutYear >= 2010 & CheckoutYear < 2020')

#NOTE: loans2[(loans2['CheckoutYear'] >= 2010) & (loans2['CheckoutYear'] < 2020)] would work but is less efficient

loans3

Next, we've been asked to *drop all incomplete records* and then write out the cleaned 2010-2019 dataset in Apache Parquet format, *partitioned by CheckoutYear*.

>
> __Aside: Why Apache Parquet?__
>
> Apache Parquet is one of the most popular, efficient, and performant formats for large-scale structured data. 
>
> Why? Because Parquet is a compressed, self-describing, binary *columnar* data format, which means that each column is stored apart from the others. So when we need to query just a few columns in a wide table, we can physically access just the ones we need on disk. The fastest data to process is the data you never load in the first place!
>
> Moreover, if we know what sorts of queries we will need to perform, we can *partition* by those values on disk as well. In our case, since we're partitioning by CheckoutYear, if we subsequently need records from 2016, we can access those and only those directly from the disk. (This kind of on-disk partitioning is sometimes called "Hive style" partitioning, after an Apache Hive pattern.)
>
> And even if we don't do that sort of on-disk partitioning, we can benefit from metadata stored along with our data.
>
> Review more details and Parquet format benefits in this Coiled blog post: https://coiled.io/blog/parquet-column-pruning-predicate-pushdown/
>

Let's `dropna()` and write out our data

In [None]:
loans3.dropna().to_parquet('cleaned-loans', write_index=False, partition_on='CheckoutYear')

<span style='color:red'>__WARNING and topic for discussion__</span>
> 
> Where did we just write the data? What would happen if we tried to read it back?

In [None]:
import os

bucket = os.environ['WRITE_BUCKET']

bucket

In [None]:
loans3.dropna().to_parquet('s3://' + bucket + '/cleaned-loans', write_index=False, partition_on='CheckoutYear')

In [None]:
import s3fs

fs = s3fs.S3FileSystem()
fs.ls(bucket + '/cleaned-loans')

__Progress Dashboard__

That seems to take a little while (well, a few seconds, at least). Let's open another dashboard view that will let us track progress.

From the Dask dashboard palette, click `Progress` and drag that to snap at the bottom of the JupyterLab window.

<img src='images/progress.png' width=900>

We'll run the same logic again. But since we now have a clear spec and a good understanding of the data, we can compress this workflow into a "1-liner ETL".

While it's running, you should see several colored progress bars. The colors correspond to specific functions being run (when those are functions you've defined, they'll show your function names; in this case, they're function names from the Dask dataframe library).

In [None]:
ddf.read_csv('s3://coiled-training/data/checkouts-small.csv', storage_options={"anon": True}) \
    .drop(columns=['Publisher', 'Unnamed: 0']) \
    .query('CheckoutYear >= 2010 & CheckoutYear < 2020') \
    .dropna() \
    .to_parquet('s3://' + bucket + '/cleaned-loans', write_index=False, partition_on='CheckoutYear')

In a later module, we'll remove some of the magic from the Dask dataframe by giving a little example of how you could build your own ... but first, let's take advantage of our new, structured dataset to query checkouts.

We want to track trends in physical vs. digital loans ("UsageClass")

In [None]:
report = ddf.read_parquet('s3://' + bucket + '/cleaned-loans') \
    .groupby(['CheckoutYear', 'UsageClass']).agg({'Checkouts': 'sum'}).compute()
    
report

And since this is just a Pandas dataframe, we can plot it

In [None]:
report.unstack(level=1).plot()

#### Optimizations

Parquet allows a number of optimizations at query time, and with the current version of Dask we need to provide some hints in order to take advantage of those capabilities.

To illustrate the issue, let's get average checkouts each month for 2017 and 2018. We'll also track the time for these operations.

In [None]:
%%time 

report2 = ddf.read_parquet('s3://' + bucket + '/cleaned-loans') \
            .query('CheckoutYear == 2017 | CheckoutYear == 2018') \
            .groupby(['CheckoutYear', 'CheckoutMonth']).agg({'Checkouts': 'mean'}).compute()

In this query, we're filtering on a *partition column* -- that is, we broke up our data by CheckoutYear specifically to speed up this sort of report. Providing this info to `read_parquet` (via its `filters` argument) allows "predicate pushdown" or a "pushed filter" where we only read a minimal set of data from disk.

Since this is the complete filtering operation for us, we can skip the `query` call.

But how exactly do we specify the filters? The filters kwarg value is
* a list of filter groups, where we get the UNION of the filter group results (equivalent to OR across all of the groups)
    * in our example, one filter group is `CheckoutYear = 2017` and another is `CheckoutYear = 2018` -- we want records matching either of them (a OR b)
* where each filter group is a list of filter tuples, and we get results from a filter group when all of the filter tuples in that group are true (AND)
    * in our example, there is no AND condition, so our filter groups will just contain one condition each, expressed as a filter tuple
* a filter tuple is a tuple containing the filter (predicate), broken up into three parts
    * the column to test (in our case, CheckoutYear)
    * the comparison operator (for us, =)
    * the value to test against (here, 2017 for the tuple in one filter group, and 2018 for the tuple in the other)

That is a mouthful!

But it's not too bad in code. Let's try it

In [None]:
%%time 

report2b = ddf.read_parquet('s3://' + bucket + '/cleaned-loans', \
                filters=[[('CheckoutYear', '=', 2017)],[('CheckoutYear', '=', 2018)]]) \
           .groupby(['CheckoutYear', 'CheckoutMonth']).agg({'Checkouts': 'mean'}).compute()

We definitely get a speedup! 

> NOTE: You may have noticed that we've used the word 'partition' in two different ways so far
> 1. to refer to the constituent Pandas dataframes or chunks that make up our bigger Dask dataframe
> 2. for dividing up our data by the values in a particular column (for us, CheckoutYear) when we wrote and queried Parquet data
> 
> These two ideas are related but not the same, and they usually refer to different ways of dividing the data.
> It's a common source of confusion!

In this example, we read from a dataset on disk which we partitioned specifically to support this kind of query.

Although that will give you the best performance, it obviously has drawbacks... it's
* inflexible (partitioning that supports one set of queries may be more costly for others)
* impractical for high-cardinality columns (partitioning by, say, Customer ID would be a very bad idea in most scenarios)

Parquet predicate pushdown *can work* when you haven't partitioned on the predicated columns, but it will return a superset of the requested records. Parquet files store records in row groups: Dask can skip row groups where the metadata indicates zero interesting records, but if a row group contains any interesting records, the whole row group will be read. So you'll have to filter your data again within your Dask query. There are a couple of other fine points, so definitely check out the docs at https://docs.dask.org/en/latest/generated/dask.dataframe.read_parquet.html#dask-dataframe-read-parquet

__We can do even better with another optimization__

This report only really pays attention to 3 columns: CheckoutYear, CheckoutMonth, and Checkouts

Since Parquet stores our data in a columnar format, we can provide another hint to only look at these 3 columns.

This kind of optimization is called *column pruning*

Let's try it

In [None]:
%%time 

report2c = ddf.read_parquet('s3://' + bucket + '/cleaned-loans', \
                filters=[[('CheckoutYear', '=', 2017)],[('CheckoutYear', '=', 2018)]],
                columns=['CheckoutYear', 'CheckoutMonth', 'Checkouts']) \
            .groupby(['CheckoutYear', 'CheckoutMonth']).agg({'Checkouts': 'mean'}).compute()

This gets us a further speedup!

It's especially nice when we consider that
1. these filter and column values are just Python, so you can automate manipulation of them if you need to (e.g., for a reporting tool)
2. a future version of Dask will be able to apply this for you automatically, by analyzing the original, simpler query
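
As a sketch of the first point, the filter and column hints can be generated by ordinary Python. The helper names `year_filters` and `report_kwargs` below are hypothetical (they are not part of Dask); they just build the nested list-of-lists-of-tuples structure described earlier.

```python
def year_filters(years, column="CheckoutYear"):
    """Build one single-condition filter group per year.

    The groups are ORed together, so a record matches if its year
    equals any of the requested years.
    """
    return [[(column, "=", year)] for year in years]


def report_kwargs(years, columns):
    """Bundle the filter and column-pruning hints for read_parquet."""
    return {"filters": year_filters(years), "columns": list(columns)}


kwargs = report_kwargs([2017, 2018], ["CheckoutYear", "CheckoutMonth", "Checkouts"])
print(kwargs["filters"])
print(kwargs["columns"])
```

A reporting tool could then pass these along with something like `ddf.read_parquet(path, **kwargs)`, where `path` points at the partitioned dataset.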

In your own work, the amount of speedup will vary depending on how much data your filters exclude and what sort of computation is going on with the remaining data.

For more detail and all of the dataframe docs, bookmark https://docs.dask.org/en/latest/dataframe.html

__Task Stream Dashboard__

Before we try Dask dataframe out with some lab projects, let's look at one more dashboard. From the palette, choose Task Stream, and snap it somewhere convenient.

`%%time` is useful, but doesn't show us a lot of detail about what Dask is doing. The Progress bars are great too, but they don't show the time relationships between different tasks.

With the Task Stream open, re-run the previous reports. Zoom in to where you can see individual tasks across your cluster cores -- color coded to match the other views like Progress -- as well as time spent transferring data (the red "overlay" boxes).

<img src='images/taskstream.png' width=900>

This sort of X-ray vision into what's happening in the cluster makes tuning and troubleshooting a lot easier than doing so with log messages and summary stats.

## Lab: Analyze library records

*Note: for all of these labs, we'll go back to the original CSV data, not the Parquet data. If you have extra time, you're welcome to investigate whether we can generate them faster via Parquet.*

#### Activity 1: Digital media schemes for the city library

We need to perform an analysis over time, similar to the "Digital vs Physical by Year" report, but we want to compare the various licensing managers for digital media.

Essentially, we want to count by year and `CheckoutType` records where
* UsageClass is Digital
* Year is prior to 2020

We want to keep as many records as we can which meet those criteria, and write a reasonably efficient query from the original CSV data.

#### Activity 2: Publishers

What are the top 50 publishers in the Seattle library system by...
* checkout activity (easier)
* library material holdings (harder)

Hints:
* Try to use Dask's `nlargest` or `nsmallest` for ordered results with a limit (like 50).
    * That approach is vastly more efficient than trying to sort a big dataset.
* For top publishers by library holdings...
    * the same item may appear in many months of data
    * Pandas/Dask doesn't have the same "COUNT DISTINCT" operator as SQL so you may have to get a bit creative
    * if you don't narrow (hint!) down the data, it will be hard to run this query with the allocated cluster resources
* If your logic works but the computation isn't running successfully, feel free to start over with more memory in your cluster (try 2 GB workers instead of 1 GB)

#### Activity 3: Popular subjects

*Bonus Project*

Notice that the Subjects field contains a string list of subjects.

If we want to analyze checkouts by subject, we might start by trying to parse this field into a Python list. Like Pandas, Dask allows us to split strings as well as explode collections into multiple rows.

Try to find the top 10 subjects by checkout activity. Hint: Try to eliminate as much data as you can from the dataset as early as possible.

## No Magic

There is a lot of power to the Dask dataframe API. But it's not magic. 

To illustrate a little bit of how a parallel dataframe can work, as well as give insight into how Dask's low-level constructs can be assembled to create high-level ones, we'll build a toy Dask dataframe in a future module (once we've covered the relevant APIs).

## Additional key features of Dask and Dask dataframe

### Caching

One benefit of using a cluster is having more processing power (cores). But equally valuable is having an expanded pool of memory: for example, most of us don't have 250GB of RAM in our laptop, while even a small cluster is likely to have that much memory available.

To materialize a Dask dataframe (or any other Dask collection) in the distributed RAM of the cluster, we use the `.persist()` API

`.persist` is not lazy: it immediately starts working ... but it returns right away, without waiting for the work to finish. What we get back is still a Dask dataframe, which acts as a token or handle to results that live (or will soon live) in cluster memory. And, actually, a token is what we want: the whole point is that we want the big data in the cluster, not in our local Python runtime!

In [None]:
client.restart() # clear out some room

In [None]:
loans_cached = ddf.read_csv('s3://coiled-training/data/checkouts-small.csv', storage_options={"anon": True}).persist()

Now we can run some queries or transformations over the data in memory... or can we?
How do we know if the data is loaded up yet?

There are several ways!

First, we can look at the __Graph Dashboard__: from the dashboard palette, click "Graph"

Each Task (delayed Python function) gets a little square, and the key explains the color coding: red boxes are tasks whose result is stored in memory. 

For a big job (and a huge graph), we can watch the boxes turn red in real time ... a sort of RAM-storage progress bar.

We can also access the information programmatically.

In [None]:
import dask.distributed  # make sure the dask.distributed submodule is loaded

dask.distributed.futures_of(loans_cached)

A Future is another kind of handle (similar to Promises in some languages) representing a task that was started but may not have finished (or may have failed altogether). In this example, we can see that each of the Futures is `finished`. 

We can also wait for all of them:

In [None]:
dask.distributed.wait(loans_cached)

Our queries should be faster with the data in memory

In [None]:
%%time

loans_cached.mean().compute()

Compare to the non-cached timing:

In [None]:
%%time

ddf.read_csv('s3://coiled-training/data/checkouts-small.csv', storage_options={"anon": True}).mean().compute()

In practice, your speedups will depend on how expensive the I/O is relative to the computation. 

The slower, larger, and more distant the source data, the more of an improvement you'll see.

On the other hand, the more expensive and complex your computation is, the less improvement you'll see.

### Custom Functions with `.apply`

Often, you'll want to apply your own logic to data in a Dask dataframe. Like Pandas, Dask supports the `.apply` method to run your own code over rows of data.

In [None]:
def my_custom_length(field):
    return len(field)

loans_cached.Title.apply(my_custom_length)

It runs, but Dask warns that it had to infer the output type, and suggests we add some schema information (the `meta` argument) to help out.

In [None]:
loans_cached.Title.apply(my_custom_length, meta=('Title', 'int64')).head()

We can also apply to rows, allowing us to perform calculations or transformations depending on multiple columns

In [None]:
def my_combo_length(fields):
    # look fields up by column label rather than by integer position
    return len(fields['Title']) + len(fields['Subjects'])

loans_cached[['Title', 'Subjects']].dropna().apply(my_combo_length, axis=1, meta=(None, 'int64')).head()

Custom functions are also supported for aggregations and rolling ("window") computations.

Cached objects can get cleaned up automatically when the client process no longer has handles to the objects... but if we want a quick reset on cluster memory, we can use

In [None]:
client.restart()

### Quirks and limitations

As you've probably noticed, Dask dataframe implements a lot of the Pandas API. At the same time, there are also some quirks to get used to (e.g., the schema hints we just provided) as well as functionality that is not implemented ... at least not yet.

You can refer to the docs to see which APIs are implemented differently (or not at all). But another approach is to try your planned computation (based on Pandas knowledge) on a small subset of your data -- in a non-destructive way -- and see if it runs and the results check out. Different users will likely prefer one or the other of these tactics.

### Best practices

Some additional best practices for working with Dask dataframe as well as patterns/anti-patterns are documented here
* https://docs.dask.org/en/latest/dataframe-best-practices.html
* https://docs.dask.org/en/latest/dataframe.html#common-uses-and-anti-uses

Coiled has some short, useful blog posts explaining...
* Repartitioning dataframes and dealing with small/empty partitions https://coiled.io/blog/repartition-dataframe/
* Using the index to select within dataframes, and Dask's 'known divisions' index concept https://coiled.io/blog/filter-dataframes-loc/
* Calculating memory size of dataframes and partitions https://coiled.io/blog/dask-memory-usage/
    * Especially valuable since you'll want to maintain a relationship between partition sizes and worker memory size

Common scenarios are explained in the docs, including...
* Shuffles https://docs.dask.org/en/latest/dataframe-groupby.html
* Joins https://docs.dask.org/en/latest/dataframe-joins.html
* Categorical types https://docs.dask.org/en/latest/dataframe-categoricals.html

And if you're curious about how it all works, a design description of the internals is at https://docs.dask.org/en/latest/dataframe-design.html ... from there you can take a look at the source if you'd still like to see more.

### Swappable partition dataframe implementations and RAPIDS cuDF

Since Dask dataframe is architected around proxying to Pandas dataframes ...
and Python allows us to swap in alternative objects, provided they implement the same protocol or interface ("duck typing") ...
we can use Dask with other dataframe implementations.

Most notably, this supports scalable GPU-based dataframes by placing Dask on top of cuDF dataframes in NVIDIA RAPIDS
* https://docs.rapids.ai/api/cudf/stable/10min.html#
* https://docs.rapids.ai/api/cudf/stable/dask-cudf.html