# Thinking Distributed

We've seen a lot of the user-facing functionality of Dask, and today we've got a new set of goals to work on. 

We need to answer questions about how Dask works, and what it will require in terms of support, tuning, and debugging.

The mayor and CTO are skeptical -- they've seen big complex systems spiral out of control before -- so we need to get crystal clear on the nuts and bolts of deploying and operating Dask.

<img src='images/noc.jpg'>

## Why is distributed computing harder than local computing?

Fundamentally, there are a few challenges that appear in distributed computing which don't appear in local computing (e.g., on a laptop or a single server). 

__No single OS abstraction over faults and timing problems__

We often imagine that distributed computing is hard because things fail -- network connections, nodes, containers, etc.

* That's only partially true
* Those things do fail, but similar failures occur on individual machines
    * RAM can be flaky, hard drives or SSDs develop bad or unusable sectors, CPUs glitch under strain.

On a local machine, the operating system hides these failures from us almost all of the time. There's no free lunch: we almost always pay the price in lower performance, when the OS clocks a CPU down for thermal reasons or struggles with bad storage blocks. But we rarely notice.

There are also timing and synchronization problems -- e.g., different cores may want to use the same cache memory or contend for a hardware device -- but, again, we don't see those problems (except inasmuch as things run slower than optimal).

__The OS provides an abstraction that makes the local machine *appear* to operate perfectly until hitting extreme conditions.__

In a distributed environment, the challenges are similar:
* Distributed RAM or state
* Contention for shared resources
* Synchronization of threads/cores for producing results
* Communications (network <-> bus) usage
* Costs of moving data
* Hardware failures

They appear bigger because there is no single OS to hide all of that from our user-level code.

__Many visions of a distributed OS__

Creating a distributed compute system that provides the seamless, transparent illusion that we are working locally has been a goal of computer science nearly since the beginning.

Many systems have been created but none are perfect.

One might consider __Kubernetes__ or even __Dask__ to be a sort of distributed OS.

* In the Kubernetes model, deploying an Operator is supposed to be a bit like installing an application or service on a \*nix or Windows machine
* Dask provides user-level abstractions that automagically handle a lot of the details of distributed computing

Although these two examples operate at different levels of abstraction, they have some things in common:
* Neither one allows transparently distributing code designed for a single server (although they get close in their own ways)
* Neither one is trivial to operate (although they have a lot of simplicity in proportion to their scope)

__Final thought on challenges: operating at the limits__

One interesting element which unifies both distributed and local computing is the fact that these challenges are non-linear, in the sense that they become critically expensive as we approach our hardware's limits.

In most day-to-day work, we are not pushing our laptops to their limits -- and in many cases the OS won't even let us try. But those limits exist: ask any gamer who overclocks their system how they know when to stop overclocking!

Similarly, mundane distributed systems, like web applications and email, are nearly perfectly solved today, as evidenced by the fact that, worldwide, our cultural, political, scientific, and financial systems rely on web pages and web services that run with many "9s" of uptime.

But, like PC overclockers, we -- firms, governments, researchers -- in the "big data" world are continuously operating at the very edge of what's possible with our compute clusters. So in some sense it's no wonder we contend with out-of-memory errors or other failures far more frequently than we would if we operated web publishing applications.

<img src='images/trade.png'>

## Resources

Thinking about distributed computation is thinking about key resources:
* Memory
* Compute cores
* Time
* Fast storage space
* Slow storage space
* System simplicity (vs. complexity)
* Ease of understanding for humans / explaining to other humans

When we have lots and lots of those resources, most problems aren't hard. But that doesn't often happen.

Why?

* Economic/efficiency reasons: we want to pay only for resources that we are using, and use the resources we are paying for
    * If we don't need as much of a resource, we either don't buy it, or it gets allocated to someone else
* We don't know how much we'll need for a given computation goal
    * Costs and tradeoffs can be opaque, even with solid knowledge of the code and the system
   
*Typically, we are discovering our resource limits, the performance impacts of those limits, and our alternative options all at once, in an iterative process of building and tuning a system. Complex tradeoffs continuously arise.*

### Examples of Tradeoffs and Resources

__Example #1: too much data to fit in memory__

I have a problem that *could* run in a reasonable amount of time on a single computer, but I can't fit all of the data in memory.

* Calculating a value for each record (already in memory) takes 10 milliseconds
* On my computer, I have 100,000 records that I can load into memory
    * the job takes 10 * 1e-3 * 100 * 1e3 = 1000 seconds, or ~17 minutes
* I want to run 200,000 records, and ~33 minutes of runtime is fine
* __But__ I can't fit 200,000 records in memory
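
The arithmetic above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the per-record cost comes from the example, and the function name `runtime_seconds` is made up for illustration.

```python
# Back-of-the-envelope runtime estimate for Example #1.
SECONDS_PER_RECORD = 10e-3  # 10 milliseconds per record, as stated in the example


def runtime_seconds(num_records: int) -> float:
    """Total compute time, assuming every record is already in memory."""
    return num_records * SECONDS_PER_RECORD


for n in (100_000, 200_000):
    secs = runtime_seconds(n)
    print(f"{n:,} records: {secs:,.0f} s = {secs / 60:.1f} min")
```

Note that this counts compute time only. It says nothing yet about the cost of getting records into memory, which is where the harder questions start.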

So I look at Dask, which can help swap data in and out of memory as needed.
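
The underlying idea, working through a dataset in memory-sized pieces, can be sketched in plain Python. This is not Dask's API or implementation; `batches`, `score`, and `total_score` are hypothetical names, and `score` is a trivial stand-in for the real 10 ms calculation.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def batches(records: Iterable[int], batch_size: int) -> Iterator[List[int]]:
    """Yield successive lists of at most batch_size records."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def score(record: int) -> int:
    """Hypothetical stand-in for the real per-record calculation."""
    return record * 2


def total_score(records: Iterable[int], batch_size: int) -> int:
    """Process records one batch at a time; only one batch is held in memory."""
    total = 0
    for batch in batches(records, batch_size):
        total += sum(score(r) for r in batch)
    return total


# 200,000 records processed in four batches of 50,000.
print(total_score(range(200_000), batch_size=50_000))
```

Each batch still has to come from somewhere, and in a real job that "somewhere" is storage rather than a `range`.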

But that incurs disk access ... how much does that cost?
* it depends on how big each "record" is ... it could be a lot if the records are large or complex
* it depends on where the "disk" is: in my laptop? on a departmental server? in Amazon?
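
A rough estimate makes the dependence on record size and location concrete. The record size and the three bandwidth figures below are assumptions chosen purely for illustration, not measurements, and the model ignores latency, caching, and parallel reads.

```python
# Rough time to read a dataset once over a single link at a steady rate.
# Record size and bandwidths are ASSUMED values for illustration only.


def read_seconds(num_records: int, bytes_per_record: int, bytes_per_second: float) -> float:
    """Seconds needed to move every record at the given sustained rate."""
    return num_records * bytes_per_record / bytes_per_second


NUM_RECORDS = 200_000          # from the example
BYTES_PER_RECORD = 1_000_000   # assumption: 1 MB per record

ASSUMED_BYTES_PER_SECOND = {
    "laptop SSD": 500e6,            # assumption
    "departmental server": 100e6,   # assumption
    "remote cloud storage": 10e6,   # assumption
}

for location, rate in ASSUMED_BYTES_PER_SECOND.items():
    secs = read_seconds(NUM_RECORDS, BYTES_PER_RECORD, rate)
    print(f"{location}: {secs / 60:.1f} min")
```

Depending on which assumption is closest to reality, the read time can be small next to the compute time or can dwarf it, which is why "where is the data?" has to be answered before choosing an approach.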

Suppose data is too far away for acceptable performance
* can I cache it locally?
* how long does that take?
* if I cache it locally, how much "value" do I get from that; e.g., can I use that same fast local copy for other work?

__Example #2: too much compute to run on one node__

I have a problem where all of the data, code, etc., can fit in one machine, but it takes too long. 

* I'm running an optimization or a complex machine learning task, and it takes 200 minutes per trial
* I'd like to run this on 10 or more machines and get the time down below 20 minutes

So I look at Dask and a cluster

But that requires a parallelizable algorithm...
* how much of my current algorithm is parallelizable (Amdahl's Law)?
* can I (or my teammates, lab assistants, grad students) actually perform the parallelization?
* if we do it, will the result be usable (understandable, publishable, deployable, etc.)?
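
Amdahl's Law puts a ceiling on the first of these questions: if a fraction `p` of the work can be parallelized across `n` workers, the best possible speedup is `1 / ((1 - p) + p / n)`. A minimal sketch, using the 200-minute trial and 10 machines from the example (the parallel fractions tried are arbitrary illustrative values):

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Upper bound on speedup when only part of the work can run in parallel."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


TRIAL_MINUTES = 200  # from the example
WORKERS = 10         # from the example

for p in (0.5, 0.9, 0.95, 1.0):  # illustrative parallel fractions
    speedup = amdahl_speedup(p, WORKERS)
    print(f"p={p:.2f}: speedup {speedup:.2f}x, trial takes {TRIAL_MINUTES / speedup:.1f} min")
```

Even a perfectly parallel algorithm (`p = 1.0`) only reaches 20 minutes on 10 machines, and that is before any communication overhead. Getting below 20 minutes therefore needs more than 10 machines, and at `p = 0.9` no number of machines gets there, since the serial 10% alone takes 20 minutes.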

Suppose I can easily parallelize my code. Yay!
* how much network communication needs to occur between nodes?
    * how fast is my network? where are my nodes?
* how much state needs to be synchronized and/or moved between nodes?
    * if synchronization is needed, how much overhead does this incur?
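
One way to start reasoning about these questions is a toy cost model in which every synchronization round costs a fixed latency plus the time to ship some state over the network. The model itself and every number fed into it are assumptions for illustration; it does not describe how Dask schedules or communicates.

```python
def parallel_minutes(
    compute_minutes: float,
    workers: int,
    sync_rounds: int,
    state_bytes: float,
    bytes_per_second: float,
    latency_seconds: float,
) -> float:
    """Toy model: perfectly divided compute plus a fixed cost per sync round."""
    compute = compute_minutes / workers
    seconds_per_round = latency_seconds + state_bytes / bytes_per_second
    return compute + sync_rounds * seconds_per_round / 60


# 200-minute trial on 10 workers (from the example); the rest is assumed.
for rounds in (0, 1_000, 10_000):
    minutes = parallel_minutes(
        compute_minutes=200,
        workers=10,
        sync_rounds=rounds,       # assumption
        state_bytes=100e6,        # assumption: 100 MB exchanged per round
        bytes_per_second=1e9,     # assumption: about 1 GB/s between nodes
        latency_seconds=0.001,    # assumption: 1 ms per round
    )
    print(f"{rounds:>6} sync rounds: {minutes:.1f} min")
```

In this model the synchronization term does not shrink as workers are added, so it eventually dominates the runtime however many machines are available.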
    
__Example #3: too much data and too much compute for one node__

I have a problem where the data is too large to run locally, and the calculation would likely take too long even if the data were local.

* I'm performing a query that involves joins on multiple large-scale tables (perhaps 10-100TB each)

Dask has a straightforward API (based on known data parallelism and relational principles)

I just need to make a few decisions...
* what size "chunk" of data is the right size to tackle the problem?
* how do I access the data? at this scale, just moving chunks of data around gets very expensive
* where do I store scratch/temporary data? do I have enough fast scratch space?
    * how fast is my fast scratch space? how slow is the space I have to fall back to?
    * I can always push data to more distant storage, but that will slow everything down further 
* what is the biggest temporary chunk I need to store in memory? 
    * can I make my cluster nodes large enough to accommodate this and stay within budget?
    * how bad are the results if this data has to get swapped out?
    * how predictable is this behavior?
* if I break my data up into enough pieces then
    * network traffic gets much larger
    * use of temp space gets even bigger
* are there manual optimizations I can do to make this job more tractable (e.g., filter-join-union)?
    * How complex would that be? 
    * Is it something repeatable?
    * How often do we need to run these jobs?
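
The tension between chunk size and number of pieces shows up in simple arithmetic. The sketch below takes a table at the low end of the 10-100TB range and counts pieces for a few candidate chunk sizes; the candidate sizes are arbitrary illustrative values, not recommendations.

```python
TB = 10**12
MB = 10**6


def piece_count(table_bytes: int, chunk_bytes: int) -> int:
    """Number of chunks needed to cover the table (integer ceiling division)."""
    return -(-table_bytes // chunk_bytes)


TABLE_BYTES = 10 * TB  # low end of the range in the example

for chunk_mb in (100, 1_000, 10_000):  # illustrative chunk sizes
    pieces = piece_count(TABLE_BYTES, chunk_mb * MB)
    print(f"{chunk_mb:>6,} MB chunks -> {pieces:>7,} pieces per table")
```

Smaller chunks fit comfortably in memory but multiply the number of pieces to track, move, and spill. Larger chunks reduce the bookkeeping but raise the peak memory each node needs.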

## Dask and Distributed Computing

__Dask has tools to support all of these use cases and an understandable model for managing the computation, but it cannot magically solve all of these scenarios for you.__

Dask provides
* a friendly Python interface
* great visibility into the cluster and computation via dashboards and performance reports

Users will need to get at least passingly familiar with
* networks
* cores
* node/container sizes
* storage options

Administrators will need to
* describe networks for users
* configure networks (virtual/routing)
* manage
    * CPUs and GPUs
    * memory
    * NVMe, SSD, spinning disks, network EBS-style storage, blob (S3) style storage, etc.
    * network capacity
* this may involve procurement and physical operations, or planning/managing cloud infrastructure

## Our Goal Today

__Luckily, our job isn't to solve all of these multidimensional challenges today, or with any single tool or answer.__

Our goal is to understand clearly the ...
* major moving parts of Dask
* "knobs" that are available to adjust its behavior
    * Configuration
    * APIs
    * Helper tools
* options for deployment
* "care and feeding" requirements
* top troubleshooting techniques
* principal best practices

We'll also try to include some general Q&A (e.g., an opportunity to discuss use cases) and, optionally, a demo use case involving bulk scoring with a machine-learning model.