# Wrapup

### Best-practices link roundup

The Dask docs collect a number of best practices:

-   Dataframe: <https://docs.dask.org/en/latest/dataframe-best-practices.html>
-   Array: <https://docs.dask.org/en/latest/array-best-practices.html>
-   Delayed: <https://docs.dask.org/en/latest/delayed-best-practices.html>
-   Overall: <https://docs.dask.org/en/latest/best-practices.html>

### Partitions/Chunks and Tasks

Remember that Dask is a scheduler for regular Python functions operating on (and producing) regular Python objects.

Your partitions, chunks, or data segments should be small enough to comfortably fit in RAM for each worker thread/core.

At the limit, that means...

-   if you have a 1 GB worker with 1 core, you want to keep your partitions below 1 GB
-   with 2 x 1 GB workers with 1 core each, we still want partitions below 1 GB
-   with n x 4 GB workers with 2 cores per worker, we want partitions below 2 GB

But...

- Some tools, like Pandas, may need significant working memory beyond the partition itself. In that case, you may want partitions small enough that 2x-4x their size fits in the RAM available to each core.
- It's also helpful to have a few chunks of data available to keep Dask's worker cores busy.

So we might want to take those numbers above and make them 2-4x smaller (or, equivalently, create 2-4x as many partitions).
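
The arithmetic above can be sketched in a few lines. This is only a rule-of-thumb calculator, not part of Dask: the function name `max_partition_bytes` and the `headroom_factor` parameter are hypothetical names introduced for illustration.

```python
GB = 10**9  # decimal gigabyte, used here only to keep the numbers readable


def max_partition_bytes(worker_memory_bytes, cores_per_worker, headroom_factor=1):
    """Rule-of-thumb upper bound on partition size (hypothetical helper).

    Each core should be able to hold one partition in RAM, optionally
    leaving `headroom_factor` times that space for processing overhead.
    """
    if cores_per_worker < 1 or headroom_factor < 1:
        raise ValueError("cores_per_worker and headroom_factor must be >= 1")
    return worker_memory_bytes // (cores_per_worker * headroom_factor)


# The limiting cases from the list above
print(max_partition_bytes(1 * GB, 1) / GB)  # 1.0
print(max_partition_bytes(4 * GB, 2) / GB)  # 2.0

# Leaving 4x headroom for tools such as Pandas
print(max_partition_bytes(4 * GB, 2, headroom_factor=4) / GB)  # 0.5
```

Note that the number of workers does not appear in the calculation: adding workers increases how many partitions are processed at once, not how large each one can be.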

Generally speaking, a lot of tasks is not a bad thing...
* Scheduling overhead for each additional task is typically less than 1 millisecond, and can be a lot less
* That said, if you have, say, a million tasks, those milliseconds will add up to minutes. In that case you may want to simplify your task graph or use larger (and hence fewer) partitions/chunks.
* We want each task to do significant work relative to its scheduling cost, so task compute times of hundreds or even thousands of milliseconds are fine
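
To make the trade-off concrete, here is a back-of-the-envelope sketch. It assumes a flat 1 millisecond of scheduling overhead per task (the upper end of the figure quoted above); the helper names are hypothetical and real overhead varies with the workload.

```python
def scheduling_overhead_seconds(n_tasks, overhead_per_task_s=0.001):
    """Total scheduler time, assuming a fixed per-task overhead."""
    return n_tasks * overhead_per_task_s


def overhead_fraction(task_compute_s, overhead_per_task_s=0.001):
    """Share of a task's total time that is spent on scheduling."""
    return overhead_per_task_s / (task_compute_s + overhead_per_task_s)


# One million tasks at ~1 ms each: roughly 17 minutes of pure overhead
total_s = scheduling_overhead_seconds(1_000_000)
print(f"{total_s / 60:.1f} minutes")

# A 500 ms task spends well under 1% of its time on scheduling,
# while a 1 ms task spends about half of it there
print(f"{overhead_fraction(0.5):.2%}")
print(f"{overhead_fraction(0.001):.2%}")
```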

### Caching (Persistence)

The results of computations can be cached in cluster memory, so that they are available for reuse or for deriving subsequent results.

(See: `persist`, which is available on `Client`, `Bag`, `Array`, `DataFrame`, etc.; `Future` results are cached by default.)

Use caching wisely (not indiscriminately) and monitor memory usage using the dashboard.
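
As a minimal sketch of the idea, the snippet below persists an intermediate Dask array and reuses it for two downstream results. It assumes `dask` and `numpy` are installed and runs on the default local scheduler; the array shape and chunk sizes are arbitrary illustration values. With a distributed `Client`, the same `persist` call would keep the chunks in worker memory instead.

```python
import dask.array as da

# A small lazy array split into 16 chunks of 250 x 250
x = da.ones((1000, 1000), chunks=(250, 250))

# An intermediate result that several later computations depend on
y = (x + 1) * 2

# persist() computes the chunks once and keeps them in memory;
# the returned collection refers to those cached chunks
y_cached = y.persist()

# Both results reuse the cached chunks rather than recomputing y
total = y_cached.sum().compute()
mean = y_cached.mean().compute()

print(total, mean)  # 4000000.0 4.0
```

Persisting pays off only when the cached result is reused; each persisted collection occupies memory until the last reference to it is released.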

### Data

__Location__

Choose data locations...
* to minimize __(amount of data)\*(network cost of moving the data)\*(how often that data needs to move)__
* there may not be a perfect arrangement, especially across different workloads
* decompression cost is usually smaller than the cost of moving uncompressed data
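
The product in the first bullet can be compared across candidate layouts with a small sketch. Every number below is made up for illustration, the cost units are arbitrary, and `movement_cost` is a hypothetical helper rather than anything Dask provides.

```python
def movement_cost(data_gb, cost_per_gb, times_moved):
    """(amount of data) * (network cost of moving it) * (how often it moves)."""
    return data_gb * cost_per_gb * times_moved


# Hypothetical layouts for the same workload
layouts = {
    # large raw dataset pulled across regions once per job
    "remote_raw": movement_cost(data_gb=100, cost_per_gb=10, times_moved=1),
    # small filtered extract kept near the workers but re-read often
    "local_extract": movement_cost(data_gb=5, cost_per_gb=1, times_moved=50),
}

best = min(layouts, key=layouts.get)
print(layouts)  # {'remote_raw': 1000, 'local_extract': 250}
print(best)     # local_extract
```

A different workload (for example, one that reads the extract thousands of times) could reverse the ranking, which is why no single arrangement is best everywhere.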

__Formats and Compression__

Use compression schemes that are *splittable* and allow random access (e.g., Snappy or LZ4 rather than gzip), so that your files can be processed in parallel more flexibly.
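
To show why independently compressed blocks help, here is a toy sketch using only the standard library. It uses `zlib` purely as a stand-in codec and is not how Snappy, LZ4, or Parquet are implemented; the block size and helper names are assumptions chosen for illustration.

```python
import zlib


def compress_blocks(data, block_size):
    """Compress each fixed-size block independently of the others."""
    return [
        zlib.compress(data[start:start + block_size])
        for start in range(0, len(data), block_size)
    ]


def read_block(blocks, index):
    """Decompress one block without touching any of the others."""
    return zlib.decompress(blocks[index])


# 10,000 fixed-width rows of 11 bytes each, grouped 1,000 rows per block
data = b"".join(f"row-{i:06d}\n".encode() for i in range(10_000))
blocks = compress_blocks(data, block_size=11 * 1000)

# Any worker can jump straight to block 7 and start reading there
print(len(blocks))                # 10
print(read_block(blocks, 7)[:11])  # b'row-007000\n'
```

With a single compressed stream, a reader that wants row 7000 generally has to decompress everything before it, so the file cannot be divided among workers as easily.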

For datasets, consider files (and collections of files) in Parquet, ORC, HDF5, etc.


### Dask project documentation quick links
* Main project page https://dask.org/
* Core documentation https://docs.dask.org/en/latest/
* Distributed (scheduler) https://distributed.dask.org/en/latest/
* Machine learning https://ml.dask.org/
* Deployment tools
    * Kubernetes https://kubernetes.dask.org/en/latest/
    * AWS or Azure https://cloudprovider.dask.org/en/latest/
    * YARN https://yarn.dask.org/en/latest/

### Community resources
* Dask issues and source code https://github.com/dask
* Dask GitHub Discussions https://github.com/dask/dask/discussions
* StackOverflow https://stackoverflow.com/questions/tagged/dask
* Gitter https://gitter.im/dask/dask

### Coiled Computing
* https://coiled.io/
* Coiled Cloud https://coiled.io/cloud/
* Join the Coiled Slack community 

# Q & A