<img style="float: right" src="img/saturn.png" width="300" />

# Machine Learning on Big Data with Dask

## Introduction to Dask
<img src="https://docs.dask.org/en/latest/_images/dask_horizontal_no_pad.svg" width="300" alt="dask" />

Before we get into too much complexity, let's talk about the essentials of Dask.

## What is Dask?

Dask is an open-source framework that enables parallelization of Python code. This can be applied to all kinds of Python use cases, not just machine learning. Dask is designed to work well on single-machine setups and on multi-machine clusters. You can use Dask with pandas, NumPy, scikit-learn, and other Python libraries. If you want to learn more about the other areas where Dask can be useful, there's a [great website explaining all of that](https://dask.org/).

## Why Parallelize?

For machine learning use cases, parallelizing work with Dask can be useful when:

- Data sizes exceed the memory of a single node
- Complex data transformations are slow on a single node
- Complex models require a lot of resources
- Many compute tasks can execute at the same time (think hyperparameter tuning or ensemble models)

## Initialize Dask cluster

The `dask_saturn` package makes the Dask Cluster that we created in Saturn Cloud accessible in our notebook. If the cluster has already been created, we do not need to specify any arguments when initializing `SaturnCluster`, but it is a good idea to do so for reproducibility. The arguments to `SaturnCluster` match the fields presented when editing a Dask Cluster in Saturn Cloud.

In [None]:
from dask_saturn import SaturnCluster
from dask.distributed import Client

cluster = SaturnCluster(
    scheduler_size='medium',
    worker_size='xlarge',
    n_workers=5,
    nthreads=4,
)
client = Client(cluster)

<br>

To see the options for scheduler and worker sizes, and how they match up to the options presented in Saturn Cloud, run the following:

In [None]:
from dask_saturn.core import describe_sizes
describe_sizes()

<br>

The `Client` object is our "entry point" to Dask. Most Dask operations will automatically detect the client and run across the cluster, but sometimes it's necessary to pass a `client` object explicitly when performing more advanced operations. Previewing the `client` object shows details about the cluster and a link to the Dashboard. Open up the Dashboard now and keep it visible in a separate window - you'll see it light up when we run Dask operations!

In [None]:
client

The following cell will block until all workers are available. You can also view cluster status and access the Dashboard link from the Project page in Saturn Cloud.

In [None]:
client.wait_for_workers(5)
print('Ready to go!')

## Lazy evaluation - dask.delayed

Delaying a task with Dask queues up a set of transformations or calculations so that they are ready to run later, in parallel. This is what's known as "lazy" evaluation - Dask won't evaluate the requested computations until explicitly told to. This differs from ordinary Python functions, which compute as soon as they are called. Many common and handy functions have native Dask equivalents, which means they are lazy (delayed computation) without you ever having to ask.

However, sometimes you will have complicated custom code written in pandas, scikit-learn, or even base Python that isn't natively available in Dask. Other times, you may just not have the time or energy to refactor your code to take advantage of native Dask elements.
If this is the case, you can decorate your functions with `@dask.delayed`, which manually establishes that the function should be lazy and not evaluate until you tell it to. You tell it to with the methods `.compute()` or `.persist()`, described in the next section. We'll use `@dask.delayed` several times in this workshop to make PyTorch tasks easily parallelized.

Let's start with a small example. We have a function `multiply()` that multiplies two numbers together. We can call the function to see its result:


In [None]:
def multiply(x, y):
    return x * y

multiply(2, 3)

Now we can decorate the function with `@dask.delayed` to indicate that we want the function to execute lazily on our cluster:

In [None]:
import dask

@dask.delayed
def multiply_dask(x, y):
    return x * y

multiply_dask(2, 3)

This output is quite different from what a normal function call returns. That's because Dask hasn't done anything yet! Call `.compute()` to get the actual result.
> Tip: Open up the Dask Dashboard to see the task executing on the cluster!

In [None]:
multiply_dask(2, 3).compute()
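
If a function already exists and you would rather not edit its definition, `dask.delayed` can also be applied as a plain function call instead of a decorator. The following is a minimal sketch; `add_tax()` is a hypothetical stand-in for existing code, not part of this workshop.

```python
import dask
from dask.delayed import Delayed


def add_tax(price, rate):
    # Hypothetical existing function that we leave untouched
    return price * (1 + rate)


# Wrap the function at call time instead of decorating its definition
lazy_total = dask.delayed(add_tax)(100, 0.5)

# Still lazy: this is a Delayed object, not a number
is_lazy = isinstance(lazy_total, Delayed)

# compute() triggers the actual calculation
total = lazy_total.compute()
print(is_lazy, total)
```

Both forms build the same kind of lazy object, so the choice mostly depends on whether you own the function's source.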

We can even chain together multiple delayed functions:

In [None]:
x = multiply_dask(2, 3)
y = multiply_dask(3, 4)
z = x * y
z

### Exercise

Get the result of `z`!

In [None]:
<<< FILL IN >>>

In [None]:
z.compute()
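
The same pattern scales from two delayed calls to many independent tasks that are combined at the end. The following is a minimal sketch, assuming a hypothetical `square()` helper; if no cluster client is active, Dask falls back to a local scheduler, so the snippet also runs on a single machine.

```python
import dask


@dask.delayed
def square(n):
    # Hypothetical helper, used only for illustration
    return n * n


# Building the list only records tasks; nothing runs yet
tasks = [square(n) for n in range(5)]

# Wrap the built-in sum so that combining the results is lazy too
total = dask.delayed(sum)(tasks)

# A single compute() call runs the whole task graph
result = total.compute()
print(result)
```

Because the `square()` calls do not depend on each other, Dask is free to run them at the same time before summing.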

## Persist vs Compute

Lots of new users of Dask find the `.persist()` and `.compute()` methods confusing. This is understandable! But the answer is not as hard as you might think.

First, remember when working with a cluster we have several machines working for us. We have our Jupyter instance right here running on one, and then our cluster of worker machines also.

If we use `.compute()`, we are asking Dask to take all the computations and adjustments to the data that we have queued up, run them, and bring the result to the surface here, in Jupyter. That means that if the object was distributed, we want to convert it into a local object here and now. If it's a Dask DataFrame, when we call `.compute()`, we're saying "Run the transformations we've queued, and convert this into a pandas DataFrame immediately." If our data is too big to fit in the memory of the Jupyter instance, this can be a disaster! But if it is small, then we might be fine.

If we use `.persist()`, we are asking Dask to take all the computations and adjustments to the data that we have queued up and run them, but the object will remain distributed and live on the cluster, not on the Jupyter instance. So when we do this with a Dask DataFrame, we are telling our cluster "Run the transformations we've queued, and leave this as a distributed Dask DataFrame."

So, if you want to process all the delayed tasks you've applied to a Dask object, either of these methods will do it. **The difference is where your object will live at the end.**

We will use `persist()` in later examples when working with Dask DataFrames.