# Keeping Results Growing on the Server Farm:<br> Data and Resiliency

We've looked at passing data from task to task, emphasizing leaving results in the cluster and letting Dask's scheduler place tasks to minimize data movement.

<img src='images/pumpkin.jpg' width=400>
But let's get really detailed about the three big concerns around data movement:

1. Supplying input data to your computations
2. Storing the results of computations
3. What happens when intermediate data exceeds cluster memory

We'll also look at general resilience, performance, and debugging issues.

In [None]:
import coiled
from dask.distributed import Client

cluster = coiled.Cluster(name="training-cluster")
client = Client(cluster)
client

## Providing data to begin a computation

### Implicit read from client process

For (very) small amounts of source data, we can load it "implicitly" from our local process (where the `Client` is running) via parameters...

In [None]:
def sum_of_squares(a_list):
    # a_list is serialized on the client and shipped along with the task
    return sum(a * a for a in a_list)

r1 = client.submit(sum_of_squares, [1,2,3])
r1.result()

... or outer-scope references ...

In [None]:
outer_scope_list = [1,2,3]

def sum_of_squares():
    # outer_scope_list is captured from the enclosing scope and travels with the function
    return sum(a * a for a in outer_scope_list)

r2 = client.submit(sum_of_squares)
r2.result()

While this will technically work for larger chunks of data (up to a point), it will degrade performance, since it makes our local process into a bottleneck for data loading.
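One rough way to judge whether data is small enough to send implicitly is to measure its serialized size before submitting. The sketch below uses the standard library's `pickle` as a stand-in for whatever serialization Dask applies. The 1 MB cutoff and the helper name `small_enough_to_inline` are hypothetical, chosen only for illustration.

```python
import pickle

# Hypothetical cutoff for illustration only; this is not a Dask setting.
INLINE_LIMIT_BYTES = 1_000_000

def small_enough_to_inline(obj, limit=INLINE_LIMIT_BYTES):
    """Return True if the pickled size of obj is at or below the limit."""
    return len(pickle.dumps(obj)) <= limit

tiny = [1, 2, 3]
bulky = list(range(1_000_000))

print(small_enough_to_inline(tiny))
print(small_enough_to_inline(bulky))
```

Data that fails a check like this is a candidate for loading from shared storage inside the task, or for an explicit `scatter`.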

### Explicit read from a shared source

We want to load large amounts of data from some shared location, in parallel. The best-case scenario for loading data is for the tasks to get it directly from a shared filesystem that is both network-local and very fast.

If your __Task Stream__ and __Graph__ dashboards aren't open, open those up.

In [None]:
import dask.dataframe as dd

df = dd.read_csv("s3://nyc-tlc/trip data/yellow_tripdata_2019-01.csv", 
                 storage_options={"anon": True},
                 blocksize='32MB')
df.partitions[:2].map_partitions(len).compute()

As we see from the Task Stream and Graph, we're processing these in parallel from the source.

### Use of cached results

We've seen that we can cache datasets in the cluster memory via `.persist()` and get back handles to cached data as well as `Future`s we can use to track status.

We can then use these handles (to cached data structures) to run subsequent operations. This approach is a great way to supply data to computations *if* the caching makes sense in the first place.

When does caching make sense?

* Caching makes sense when we (or others using our cluster) will be making multiple uses of the stored data
    * (if we only ever use a dataset once, there's no benefit to caching it)
* __and__ we can fit much/most/all of it in memory
* __and__ we feel that the RAM is better used for this dataset than another dataset we may want to cache
* __and__ the form of the data in RAM is sensible relative to alternatives
    * e.g., if we have a wide table cached in RAM, and we expect to run multiple queries over it, but perhaps using only a few columns or using coarse-grained predicates that lend themselves to on-disk partitioning, then we might do better to make a local scratch copy in Apache Parquet format, or trimmed down, etc.
* Conversely, if the alternative is expensive and time-consuming network reads from distant locations, then caching may make sense even when some of the above conditions don't hold.
    * There are definitely "gray areas" ... for example, suppose you can't keep the whole dataset in RAM the whole time, so some parts are spilling to disk. That may still be faster than some remote reads.
    
The purpose of these guidelines is *not* to discourage caching -- after all, you're paying for a bunch of RAM and you should try to get maximum use from it -- but rather to point out that caching should be a deliberate part of your design. Putting `.persist()` throughout your code without further consideration is not a good pattern.

### Explicitly distributing local data

Sometimes we have datasets which are not accessible/addressable directly for the workers, and too large to "implicitly" send with tasks.

> Example: MB or GB scale datasets that we have intentionally created or retrieved to the client, and which we want to modify locally and then supply for future computation.

Dask's `scatter` feature will distribute these objects to workers.

In [None]:
import numpy as np
import sys

medium_array = np.random.uniform(0, 1, (100, 100, 100))
sys.getsizeof(medium_array)  # about 8 MB: 1,000,000 float64 values plus a small header

In [None]:
f = client.scatter(medium_array)
f.status

Notice that the object lands on one worker -- look for the `ndarray` entry in the output.

In [None]:
client.has_what()

If we break up our data and `scatter` a list, it will be distributed across workers round-robin in proportion to their cores. As with task submission, specific worker destination(s) can be specified if needed.

In [None]:
part1 = medium_array[:50]
part2 = medium_array[50:]

f2 = client.scatter([part1, part2])

In [None]:
client.has_what()

We can then use the futures as parameters to any task that needs that data.

In [None]:
client.submit(np.sum, f2[0]).result()

## Storing data (or expensive handles) in the cluster between computations

In between operations, we may need to store large results or other expensive items (e.g., connection handles which are small but expensive to open and not serializable, so they cannot persist or traverse the network).

### Implicit worker storage of results

We've seen that the results of computations which we may still need are implicitly stored in the cluster memory. We can refer to these results indirectly with futures.

But the cluster has memory limits. As memory fills up, Dask will spill this data to disk ... and may eventually have to restart the worker.

The core heuristic works like this:

1. At 60% of memory load (as estimated by `sizeof`), spill least recently used data to disk
2. At 70% of memory load, spill least recently used data to disk regardless of what is reported by sizeof
3. At 80% of memory load, stop accepting new work on local thread pool
4. At 95% of memory load, terminate and restart the worker

Even if the worker never reaches stage 3 or 4, data will be spilled to local disk.
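The four thresholds above can be sketched as a small decision function. This is only an illustration of the logic as described here, not Dask's implementation. The function name `memory_actions` and its two inputs (managed memory as estimated by `sizeof`, and total process memory, each as a fraction of the worker's memory limit) are assumptions made for the example.

```python
# Default fractions of the worker memory limit, as listed above.
THRESHOLDS = {"target": 0.60, "spill": 0.70, "pause": 0.80, "terminate": 0.95}

def memory_actions(managed_fraction, process_fraction, thresholds=THRESHOLDS):
    """Return the actions the heuristic would call for.

    managed_fraction: memory estimated via sizeof, divided by the memory limit
    process_fraction: memory in use by the whole process, divided by the limit
    """
    actions = []
    if (managed_fraction >= thresholds["target"]
            or process_fraction >= thresholds["spill"]):
        actions.append("spill")
    if process_fraction >= thresholds["pause"]:
        actions.append("pause")
    if process_fraction >= thresholds["terminate"]:
        actions.append("restart")
    return actions

print(memory_actions(0.50, 0.55))  # []
print(memory_actions(0.62, 0.65))  # ['spill']
print(memory_actions(0.40, 0.85))  # ['spill', 'pause']
print(memory_actions(0.40, 0.97))  # ['spill', 'pause', 'restart']
```

The third case shows why stage 2 matters: memory that `sizeof` does not account for can still push a worker into spilling and pausing.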

__The location for spill__ is the `local_directory` with which the `Worker` is instantiated (`--local-directory` if launched from the command line); if that's not present, it falls back to the `temporary-directory` Dask config option, and after that to the OS current working directory of the Worker's process. 

__Why mention that detail?__ Because configuring that to point to a fast local scratch space will improve overall performance, especially in operations that are likely to spill due to large intermediate results. Conversely, slow storage (spinning disk as opposed to SSD/NVMe) or network-mounted storage can degrade performance.
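As a minimal sketch of the configuration route, the `temporary-directory` option can be set through `dask.config`. The scratch path here is a throwaway temporary directory standing in for a fast local disk; it is an assumption for the example, not a recommended location.

```python
import tempfile

import dask

# Hypothetical scratch location; in practice this would be a fast local disk.
scratch_dir = tempfile.mkdtemp(prefix="dask-scratch-")

with dask.config.set({"temporary-directory": scratch_dir}):
    # Workers started while this setting is active, and without an explicit
    # local_directory, would fall back to this value.
    active_value = dask.config.get("temporary-directory")

print(active_value)
```

Note that this changes the configuration of the current process only. For a remote cluster, the setting has to be in effect in the environment where the workers start.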

In general, memory is a critical resource and swapping to disk, while sometimes necessary, can impose hard-to-see costs. This blog post -- https://coiled.io/blog/tackling-unmanaged-memory-with-dask/ -- describes some recent improvements to Dask's memory management as well as dashboard reporting.

### Manual storage of temporary/intermediate results

Your tasks are free to persist their own intermediate results, large or not-so-large, to a cluster-accessible filesystem or database. Currently, Dask does not include such a scratch facility, but a colocated system like Redis can be added if necessary.

In some cases, even a slower local storage medium -- say, a cluster-local object store like MinIO -- can yield significant gains.

Consider the earlier example of loading NYC taxi data from S3 and imagine we want to perform lots of analytic queries on this dataset. Suppose further that we can't keep it all in cluster memory. Manually creating a local cache, in a performant format like Apache Parquet, may be vastly faster than repeatedly retrieving the data from S3, even if it's not as quick as having the whole dataframe in RAM.

### Storage of expensive objects or handles

Sometimes we have an object, like a connection to a remote system, which cannot be serialized. In a pure, stateless task pattern, we'd recreate it as needed. But sometimes bending the rules is helpful.

In [None]:
from dask.distributed import get_worker

def worker_local_demo():
    try:
        local_val = get_worker().data['my_key']
    except KeyError:    
        get_worker().data['my_key'] = 'my_val'
        return 'stored'
    return local_val

client.submit(worker_local_demo).result()

Depending on where this task gets dispatched, you might have to run it twice to see `my_val` returned. Why?

In [None]:
client.submit(worker_local_demo).result()

The stored value is local to the worker process, but access is not necessarily thread safe. Thread-specific storage could be used via a `threading.local` object.
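A minimal sketch of the thread-specific approach follows. `FakeConnection` is a hypothetical stand-in for an expensive, non-serializable handle, and `get_connection` is an illustrative helper, not a Dask API.

```python
import threading

# One namespace per thread: attributes set here are invisible to other threads.
_thread_store = threading.local()

class FakeConnection:
    """Hypothetical stand-in for an expensive, non-serializable handle."""
    def __init__(self):
        self.owner_thread = threading.get_ident()

def get_connection():
    """Return this thread's connection, creating it on first use."""
    conn = getattr(_thread_store, "conn", None)
    if conn is None:
        conn = FakeConnection()
        _thread_store.conn = conn
    return conn

# Repeated calls from the same thread reuse one handle ...
main_conn = get_connection()
same_thread_reuse = get_connection() is main_conn

# ... while a different thread gets its own.
other = {}

def run_in_other_thread():
    other["conn"] = get_connection()

t = threading.Thread(target=run_in_other_thread)
t.start()
t.join()

print(same_thread_reuse)           # True
print(other["conn"] is main_conn)  # False
```

In a Dask setting, the `threading.local` object would need to live somewhere that persists on the worker, such as a module the workers can import, rather than being shipped along with the task.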

## Handling output (result) data

If your output is small -- a report that can be viewed locally, or a set of model parameters -- then you may want to collect it to your local process via `.compute` or `.result`. 

But often we have a large result:
* we've filtered or joined large datasets and the result is large
* we've performed an expensive transformation on array data but the result is still a very large array
* etc.

In these cases, we do *not* want to retrieve the full result locally, but instead __write from our tasks in parallel to a destination__ like a shared filesystem, database, Kafka topic(s), etc.

Here we'll find some NYC cab rides over $300. For this demo, there are only a few hundred such rides, but we might imagine that for many (lower) ride thresholds or larger fare datasets, the results could be too large to collect locally.

For built-in collections, we'll use APIs like `.to_parquet`, `.to_zarr`, `.to_textfiles` and the like, which will write partition-wise results to separate output in parallel.

In [None]:
import os

bucket = os.environ['WRITE_BUCKET']

In [None]:
df[df['total_amount']>300].to_csv('s3://' + bucket + '/expensive-rides')

It is sometimes disconcerting to users that they end up with a bunch of `.part` files ... but this is usually what we want. This allows parallel writing and subsequent efficient parallel reading.

In [None]:
dd.read_csv('s3://' + bucket + '/expensive-rides/*').describe().compute()

If you really need to combine the output `.part` files into a single file -- perhaps for consumption by another tool -- you can do this with
* OS-level operations (`cat`)
* filesystem-level helpers (HDFS `-getmerge`)
* or a separate step in your workflow altogether (S3 doesn't support in-place merge) so that it doesn't bottleneck your work

(Dask can write to a single CSV file with the `single_file` flag, but make sure you want that behavior and plan for the cost.)
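For files on a local or mounted filesystem, the `cat`-style merge can be sketched in a few lines of Python. This sketch assumes every part file starts with the same header row and that file names sort in partition order. The helper name `merge_csv_parts` and the demo files are made up for illustration.

```python
import glob
import os
import tempfile

def merge_csv_parts(pattern, destination):
    """Concatenate CSV part files in sorted order, keeping only the first header."""
    paths = sorted(glob.glob(pattern))
    with open(destination, "w", newline="") as out:
        for i, path in enumerate(paths):
            with open(path, newline="") as part:
                header = part.readline()
                if i == 0:
                    out.write(header)
                # Stream the remaining rows so no file is held in memory at once.
                for line in part:
                    out.write(line)
    return len(paths)

# Demo with two tiny, made-up part files.
work_dir = tempfile.mkdtemp()
for i, rows in enumerate([["a,1", "b,2"], ["c,3"]]):
    with open(os.path.join(work_dir, f"{i}.part"), "w", newline="") as f:
        f.write("name,value\n" + "\n".join(rows) + "\n")

merged_path = os.path.join(work_dir, "merged.csv")
merge_csv_parts(os.path.join(work_dir, "*.part"), merged_path)

with open(merged_path, newline="") as f:
    merged_text = f.read()
print(merged_text)
```

The merge is sequential by nature, which is why it is best kept out of the parallel part of the workflow.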

__Custom Code__

For your custom code, emulating the patterns used in the Dask internal collections -- writing to a shared location with a separate file for each partition's output -- is a good default design.
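A sketch of that default design: each task writes its own partition to a separate file and returns only the small path string. A local thread pool stands in for the cluster here so the example runs anywhere. The function name `write_partition`, the file naming scheme, and the JSON format are assumptions for illustration.

```python
import json
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def write_partition(records, index, out_dir):
    """Write one partition to its own file; return the path, not the data."""
    path = os.path.join(out_dir, f"part-{index:05d}.json")
    with open(path, "w") as f:
        json.dump(records, f)
    return path

partitions = [[1, 2, 3], [4, 5], [6]]
out_dir = tempfile.mkdtemp()  # stand-in for a location all workers can reach

# A thread pool plays the role of the cluster in this sketch.
with ThreadPoolExecutor() as pool:
    paths = list(
        pool.map(
            write_partition,
            partitions,
            range(len(partitions)),
            [out_dir] * len(partitions),
        )
    )

print(sorted(os.listdir(out_dir)))
# ['part-00000.json', 'part-00001.json', 'part-00002.json']
```

With Dask, the same function could be passed to `client.map` or wrapped in `delayed`, with `out_dir` pointing at storage that every worker can reach.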

## Resilience and Debugging

### User code failure

In the case of user code raising an exception, that exception is either re-raised on the local process, or is available for inspection (or re-raising) but will not cause the worker, independent tasks, or the Dask cluster to fail.

In [None]:
from dask import delayed

@delayed
def calculate(n):
    return n/0

try:
    calculate(10).compute()
except Exception as e:
    print(e)

### Worker (process) failure

Loss of a worker -- but not the nanny -- will result in the nanny restarting the worker, as we've seen.

### Worker node/container/nanny/network failure

In these situations, the scheduler will try to reschedule work on other workers, which will often allow a computation to succeed.

> To get the best throughput, however, we may want to replace lost workers with new ones. The best way to ensure that new workers are created is to use the cluster's `.adapt` method. Even if we are not looking for a fully dynamic, scaling cluster, `adapt` can be used with `minimum` and `maximum` values to ensure that our worker pool remains within desired bounds.

Even if workers are restored (or not needed), there are some failure cases to watch out for.
* If results depend on impure functions (e.g., a random value) then you may get a different result
* If functions rely on side effects (e.g., looking at some OS/FS/container state value), then results are unpredictable
* If the worker failed due to a bad function, for example a function that causes a segmentation fault, then that bad function will repeatedly be called on other workers.
    * This function will be marked as “bad” after it kills a fixed number of workers (defaults to three).
* Data sent out by user code to the workers via a call to `scatter()` (instead of being created from a Dask task graph via other Dask functions) may be irrecoverable by Dask (although user code may be able to supply it again).
    * One way to "harden" the data availability in that scenario is via `Client.replicate` or to manually put it in resilient storage

#### Avoiding "worker cruft" in long running clusters

In a very long running cluster, we might occasionally build up state that endangers a worker. For example, the local scratch disk may fill up.

Although it's an uncommon scenario, there are a couple of tools we can use in this situation:

`client.restart()` will trigger a restart of all workers in a "clean" state -- this is also handy for troubleshooting or ensuring the cluster is in a known state before benchmarking new code.

We might also programmatically retire workers in a long running cluster. The worker parameters enabling this, documented at https://distributed.dask.org/en/latest/worker.html, are
* `lifetime`
* `lifetime_stagger`
* `lifetime_restart`

### Scheduler failure

In a standard configuration, a scheduler failure is not recoverable. A new scheduler can be created, and existing workers may be able to re-connect to it, but in-progress computations (dependency graphs) are lost.

> There is some experimental work on a multi-scheduler configuration which could potentially allow cluster users to continue to submit work. *But dependency graphs and other state in the failed scheduler are not replicated to other schedulers.* In other words, this is a load-balancing pattern that can minimize service interruptions, but it's not currently intended as a high-availability solution.
> 
> More details are in this post: https://coiled.io/blog/dask-in-production-multi-scheduler-architectures/

In a similar way, a "higher-order" cluster manager like a Kubernetes supervisor might also be able to create a new Dask scheduler on failure, but will not be able to restore lost scheduler state.

## Work Stealing

Work stealing is the process whereby the scheduler removes work from one worker and assigns it to another. Work stealing is really about performance rather than resilience. But inasmuch as it can improve the end-to-end effectiveness of Dask applied to your problems over time, and aims to minimize overloading individual workers, it helps in many areas.

### Why would things become imbalanced in the first place?

Task placement is done largely to support data locality, or minimizing the transport of data dependencies. This is generally helpful, but can mean assigning a lot of tasks, or long-running tasks, to a few workers that have critical data.

When the imbalance is big enough, it makes sense to re-assign tasks to an idle worker, rather than wait for the "star workers" to crank through all of their assigned work.

Work stealing is largely automatic (detailed at https://distributed.dask.org/en/latest/work-stealing.html) but it's another situation where we can help by ensuring our data is available as widely as possible -- e.g., by storing large intermediate results in fast shared storage. Whether that (which will also incur costs) is worth it depends a lot on the details of the job.

# Debugging

Debugging at-scale compute jobs is challenging, for a variety of reasons.

* Analyzing parallel computations is hard in general
* Scale adds additional sources of unpredictability
    * On a single machine, we worry about state and timing
    * With multiple machines we add network issues, storage issues, and additional hardware variables
* Problematic data records (the ones that cause a failure) may be a tiny fraction of a huge dataset, making them expensive to find
* System-level failures (e.g., memory issues) may be dependent on
    * prior jobs or other simultaneous jobs
    * how those other jobs are/were distributed
    * the physical location of replicas of data used in a job
    
For repeated, batch type jobs, a common source of unpredictable bugs in production is the change in data selected (and its distribution).
* Tuning for Monday's job may not hold up for Friday's data
* If your storage and Dask workers are colocated, locations of needed replicas in local distributed storage (e.g., HDFS) may be different relative to placement of Dask workers across runs

## "Easy Level": application logic errors

The easiest errors to debug are errors in specific application logic. These could be errors in our custom functions or even bugs in Dask code.

Because these are deterministic and less dependent on the data and environment, we can try to reproduce these errors and then fix them.

We've seen that a failed Future makes its exception available.

In [None]:
try:
    client.submit(lambda x:x/0, 5).result()
except Exception as e:
    print(e)

The exception and traceback are also available directly (i.e., without raising the error)

In [None]:
f = client.submit(lambda x:x/0, 5)
f.exception()

In [None]:
type(f.traceback())

Failed future errors can be programmatically raised for local debugging. Notice that the traceback here is slightly different, which may be helpful for some errors.

In [None]:
try:
    client.recreate_error_locally(f)
except Exception as e:
    print(e)

For application-code errors that don't require full-scale operations to reproduce, we can also try to re-create them in smaller, easier to debug settings.

* `.compute(scheduler='single-threaded')`
* `LocalCluster`
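As a small sketch of the first option, the graph below runs entirely in the calling thread, so an exception surfaces with an ordinary traceback and tools like `pdb` can be used directly. The functions `parse` and `total` are made-up examples.

```python
from dask import delayed

@delayed
def parse(value):
    return int(value)

@delayed
def total(values):
    return sum(values)

# A healthy graph, executed in the calling thread.
good = total([parse(v) for v in ["1", "2", "3"]])
result = good.compute(scheduler="single-threaded")
print(result)  # 6

# A graph with one bad record: the original exception is raised locally.
bad = total([parse(v) for v in ["1", "oops"]])
try:
    bad.compute(scheduler="single-threaded")
except ValueError as exc:
    caught = str(exc)
    print(caught)
```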

## Distributed troubleshooting

For "big-data, big-cluster" error cases -- where you cannot simply reproduce the error in a limited environment -- we will need to narrow the problem down by a combination of 
* filtering data (to locate problematic records or dataset size thresholds)
* inspecting worker reports and dashboards
* reading worker logs
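One pattern that can help with the first item is to wrap the per-record function so that failures are captured and returned instead of raised. The job then runs to completion and the problematic records can be collected afterward. This is a sketch under that assumption: `capture_failures` is a hypothetical helper, and `int` stands in for real processing logic.

```python
def capture_failures(func):
    """Wrap func so exceptions come back as data instead of being raised."""
    def wrapper(record):
        try:
            return ("ok", record, func(record))
        except Exception as exc:
            return ("error", record, repr(exc))
    return wrapper

safe_int = capture_failures(int)

records = ["1", "2", "x", "4", "y"]
outcomes = [safe_int(r) for r in records]

bad_records = [record for status, record, _ in outcomes if status == "error"]
print(bad_records)  # ['x', 'y']
```

On a cluster, the wrapped function could be applied with `client.map` or inside a partition-level function, so that only the (usually small) set of failures needs to be brought back for inspection.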

From the main Dask dashboard, the __Info__ tab provides a list of workers along with links to each worker's
* memory and call stack data
* logs
* realtime (animated) dashboard

These tools can give further insight into what is going wrong, particularly when we're seeing a resource exhaustion scenario where our job fails or takes indefinitely long to make progress.

Finally, for running arbitrary code on workers -- outside of the scheduler/task framework, typically for inspecting or mutating the worker environment -- we have `client.run`.

`client.run` allows us to run a function on all workers, or on a specific list of workers.

In [None]:
import os

client.run(os.cpu_count)

There's a shortcut for accessing the worker itself. Instead of using `get_worker()`, you can create a dummy argument called `dask_worker` which will be populated by the worker instance.

In [None]:
from dask import delayed

delayed([1,2,3]).persist()

In [None]:
def get_data(dask_worker):
    # Dask fills in `dask_worker` with the Worker instance running this function
    return list(dask_worker.data.keys())

In [None]:
client.run(get_data)

... and don't forget, before assuming things about the current config, to take a look via the `dask.config` APIs!
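For example, a quick look at the memory thresholds and the spill-directory fallback discussed earlier might go like this. It is a sketch that reads the configuration of the current process only.

```python
import dask
import distributed  # noqa: F401  (importing registers distributed's config defaults)

# The worker memory thresholds, as fractions of the memory limit.
memory_settings = dask.config.get("distributed.worker.memory")
for name in ("target", "spill", "pause", "terminate"):
    print(name, memory_settings[name])

# The fallback location for spilled data; None means it has not been set.
print(dask.config.get("temporary-directory", default=None))
```

Keep in mind that this shows the client's view. Workers read their own configuration, so running the same lookup on them (for instance through `client.run`) is the way to confirm what they are actually using.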

# Wrap-up

It would be disingenuous to pretend that any tool will make distributed, big-data debugging simple... and even more disingenuous to pretend we won't encounter any problems at all.

But, relative to the power and breadth of functionality, Dask offers an extremely strong suite of tools for seeing inside your living, breathing cluster while it's running, and performing diagnostic inspections when things don't go right.

More importantly, the best tool for avoiding distributed data headaches is attention to best practices in design.

> As we've highlighted earlier, there are a variety of patterns for distributing work and data which may not be obvious or necessary in local or small-scale computation, but become critical defining factors for successful large-scale operations.

Dask helps us by doing a solid job of "making it easier to do the right thing and harder to do the wrong thing" and if we supplement that with a thoughtful design, code as simple as possible, and sensible patterns, we can avoid, rather than confront, most hassles... and that will let us make more progress on the end goal for which we are processing data in the first place.