# Introduction

__Welcome!__

In a short period of time, we are going to take a comprehensive journey from using Python to scaling and operating large Python clusters with Dask. 

Our work will include... 
* querying data
* transforming data for reporting or machine learning purposes
* running typical machine learning jobs
* creating custom data science applications like simulations
* getting familiar with the nuts and bolts of standing up Dask clusters
* putting down that wrench and running Dask the easy way
* understanding what really happens inside the machinery
* learning all about those futuristic dashboards that help us run things
* absorbing best practices, hints, and tips that we can use and share with our teams

## Instructor and Admin Details

### Adam Breindel

<img src='images/med-head.jpg' width=250>

__LinkedIn__ - https://www.linkedin.com/in/adbreind

__Email__ - adam@coiled.io

__Twitter__ - <tt>@adbreind</tt>

* 20+ years building systems for startups and large enterprises
* 10+ years teaching front- and back-end technology

__Fun large-scale data projects...__
* Streaming neural net + decision tree fraud scoring
* Realtime & offline analytics for banking
* Music synchronization and licensing for networked jukeboxes

__Industries__
-   Finance / Insurance
-   Travel, Media / Entertainment
-   Energy, Government
-   Advertising/Social Media, & more

### Class Logistics

* Main schedule (dates/times)
* Breaks

We totally understand how challenging it is to try and focus/absorb a ton of new stuff for hours at a time in front of a screen, especially since your "regular" job may not stop to give you the time off.

__What am I expected to know already?__

Maybe some Python and a few parts of the SciPy/PyData stack. *But if we cover an area you are not familiar with, please don't hesitate to ask about it!* __PyData is huge and no one knows all of it!__ Everyone's work is different, so their expertise and backgrounds are different. This is a __good__ thing and there are __no bad questions__.

### Materials

Everything we do is in these notebooks (and accompanying data, etc.) -- no PowerPoint or PDFs!

In addition to having hands-on, runnable code throughout, we'll also have a handful of slightly longer labs (usually around 10-20 minutes) so that you can try some mini-projects on your own.

For the class we'll use cloud-based versions, so no need to install!

__You May Want to Get/Keep the Materials for Review Later__

You can download a .zip file with the full contents (including an "environment" file that lists all dependencies). I'll supply the URL in the live chat. That zip file will be up permanently, so you don't need to rush; you can get it any time.

*Running the Environment Locally*

The easiest way to get everything set up locally (and also keep everything encapsulated, so it won't mess with your other Python projects) is to download and install Anaconda (or Miniconda) from https://www.anaconda.com/distribution/

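Anaconda lets you easily manage multiple, isolated Python environments, each with its own dependencies.

Once you have Anaconda set up, create a new conda environment for this class from the supplied environment file: from the root of the course materials folder, run `conda env create --file binder/environment.yml`, then switch to the new environment with `conda activate <env-name>`, where `<env-name>` is the name given in that file.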

If you're in a hurry, there's a Conda cheat sheet with basic details at https://docs.conda.io/projects/conda/en/4.6.0/_downloads/52a95608c49671267e40c689e0bc00ca/conda-cheatsheet.pdf

Now that your conda environment is set up, and you've switched to it, you probably want to install the Dask JupyterLab extension, described here: https://docs.coiled.io/user_guide/jupyter.html#dask-jupyterlab-extension

Once you're in the root of the course materials folder, you can run `jupyter lab` at the command line to launch a browser that should come up very much like this one!

*Running the environment in the cloud with Coiled Cloud*

We'll discuss this further in class, but we would love for you to work with these materials -- and your own compute challenges -- on our Coiled Cloud. We want to make this easy for you and free!

### Class Tooling

Although our focus is on Dask, we'll be using JupyterLab (this notebook environment) as our "front-end" interface. We'll explain more details as we go, but here are a couple of minimal "survival guide" style tips:

* To run code in a notebook cell, select the cell with keyboard or mouse and then hit CTRL+ENTER
    * There are other ways to run code, and you're welcome to use them, but this is the simplest way to get started
* To insert a new cell, either
    * Hit ESC and then B
        * Hint: the ESC is only necessary if you're in "editing mode" in a cell, otherwise you can skip it
    * or click the '+' toolbar icon at the top of the notebook

__Try It!__

Insert a cell, enter a very serious computation like `3+4` and then execute it.

## Cascadia City

In this class, we'll organize our activities around real data and a realistic (though fictional) scenario. We've been recruited by the data and planning office of Cascadia City, somewhere in the Pacific Northwest.

<img src='images/forest.jpg'>

We will work on a number of proof-of-concept projects to see how Dask can help the city analyze data and solve problems.

Since Cascadia City may grow to be a bit like Seattle, Washington, we'll use a number of public datasets from Seattle to facilitate our work, as well as some fictional data, including
* Library loans
* Pet registrations
* Fire risk and emergency response
* Land use data
* Imagery

__In the first session(s)__ we'll wear our data analyst and data science hats, and tackle user-facing use cases: we'll learn to use Dask's APIs to solve problems for the city.

__In later session(s)__ we'll put on our ops and support hats, and make sure we can really understand how Dask works, how we can deploy it for the rest of the team, and how to debug and troubleshoot so we can help them (and ourselves).

## Dask: Scaling Python Simply

Dask is a distributed compute system for Python which scales efficiently from a single laptop up to thousands of servers. Dask is developed in ongoing collaboration with the PyData community so that it is easy to learn, integrate, and operate. Dask leverages regular Python code to scale the work you already do using the skills you already have.

### Why scale?

In a nutshell,
* In the 1900s, computer processors got faster: they ran more and more instructions per second
* But, in the 2000s, due to a variety of engineering limitations, we can no longer acquire strictly faster processors. Instead we use more processors in collaboration
* Many computations are faster if we can hold a dataset in local memory: while individual computers with huge memory do exist, it's usually easier and cheaper to use a collection of computers and "pool" their memory to store our data

### Why Python?

There are lots of wonderful languages, and we don't get "religious" about any of them. However,
* There is a huge and popular ecosystem of data and science tooling in Python, and if that works for you, you've come to the right place
* Python is easy and fast to read and write, so it supports a high-productivity workflow ... provided we can compute fast enough

### Why simple?

If you are an academic, professional, or hobbyist computer science researcher, you may want to investigate extremely complex systems that can do unusual and clever things.

But, for most users, dealing with the complexity of our computing tools is a headwind that we would like to avoid. Especially if that complexity seems hidden for a while ... and then suddenly overwhelms us when we need to deal with an edge case, or debug, or engage in performance tuning.

### Where are the docs?

We'll provide more links to documentation as we go, but here's a quick list you can refer to:
* Main project page https://dask.org/
* Core documentation https://docs.dask.org/en/latest/
* Distributed (scheduler) https://distributed.dask.org/en/latest/
* Machine learning https://ml.dask.org/
* Deployment tools
    * Kubernetes https://kubernetes.dask.org/en/latest/
    * AWS or Azure https://cloudprovider.dask.org/en/latest/
    * YARN https://yarn.dask.org/en/latest/

## What is it like to use Dask?

There are 3 main ways that people use Dask, and you can use any combination or all of them.

We'll take a quick test drive and see each of the 3 approaches:
* One-liners (or sometimes "zero-liners") where a tool already has Dask integration built in
* Dask large-scale data structures like Dask Dataframe: a scalable Pandas dataframe
* Parallelizing custom computation: use the Dask engine to power your own code

Let's connect to our hosted Coiled cluster:

```python
import coiled
from dask.distributed import Client

cluster = coiled.Cluster(name="training-cluster")
client = Client(cluster)

client
```

Congratulations, you've got your first Dask cluster.

But how can we be sure it's alive and has the right specs?

__Workers Dashboard__

Let's take a look at the Workers dashboard panel
* Click the Dask logo in the JupyterLab side toolbar
* Click the &#x1F50D; icon to the right of the text box
* Click the Workers button

You should get a tab with a live, animated chart that looks something like this:

<img src='images/workers.png' width=700>

You can drag/position/snap that in JupyterLab, so that it's visible while you're coding.

What are these "workers"? Just regular Python processes!
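
The details shown in the Workers panel can also be cross-checked from code. The sketch below is a minimal example under stated assumptions: it starts a small stand-in `LocalCluster` with two single-threaded workers (run as threads inside the current process purely so that it works anywhere) and reads worker details from `Client.scheduler_info()`. In class, the same last few lines could be run against the `client` that is already connected to the hosted cluster.

```python
from dask.distributed import Client, LocalCluster

# Stand-in cluster (assumption): two single-threaded workers inside this process.
# In class, reuse the `client` already connected to the hosted cluster instead.
cluster = LocalCluster(
    n_workers=2,
    threads_per_worker=1,
    processes=False,         # run workers as threads so the sketch is self-contained
    dashboard_address=None,  # skip the dashboard for this quick check
)
client = Client(cluster)
client.wait_for_workers(2)

# scheduler_info() returns a dict; its "workers" entry maps address -> details.
workers = client.scheduler_info()["workers"]
worker_threads = [info["nthreads"] for info in workers.values()]
for address, info in workers.items():
    print(address, info["nthreads"], "thread(s),", info["memory_limit"], "bytes limit")

client.close()
cluster.close()
```

If the number of workers or threads printed here does not match what the dashboard shows, that is a good first hint that the cluster did not come up with the specs you asked for.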

OK, let's try a "one-liner" ML application using Dask. We'll run an example from TPOT, an AutoML tool, to classify the "digits" dataset from Scikit-Learn (a set of very low-resolution handwritten digits).

```python
import tpot
from tpot import TPOTClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()

X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target, test_size=0.5)
```

```python
tp = TPOTClassifier(
    generations=2,
    population_size=10,
    cv=2,
    n_jobs=-1,
    random_state=0,
    verbosity=0,
    config_dict=tpot.config.classifier_config_dict_light,
    use_dask=True,
)
tp.fit(X_train, y_train)

# quick look at test-set accuracy
sum(tp.predict(X_test) == y_test)/len(y_test)
```

Notice that the only Dask code we explicitly wrote there was the keyword argument `use_dask=True`.

Next, let's take a very quick look at one of the Dask data structures -- a parallel dataframe.

We'll look at some records from the Seattle library system.

```python
import dask.dataframe as ddf

loans = ddf.read_csv('s3://coiled-training/data/checkouts-small.csv', storage_options={"anon": True})

loans.head()
```

```python
loans['Checkouts'].sum().compute()
```

For the last stop on our quick preview of Dask, let's parallelize some custom code.

```python
import random

def roll_die(sides):
    return random.randint(1, sides)
```

Local (regular) Python to roll 4d6:

```python
results = map(roll_die, [6] * 4)

print(list(results))
```

Using our Dask cluster to roll 4d6 in parallel:

```python
import dask

roll_die = dask.delayed(roll_die)
```

```python
cluster_work = map(roll_die, [6] * 4)

dask.compute(*cluster_work)
```
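
Calling a delayed function does not run it. It returns a `Delayed` placeholder, and placeholders can be passed into further delayed calls to build a larger graph of work. The following sketch is illustrative only: it uses a hypothetical, deterministic helper named `double` (not part of the class materials) so that the result is predictable, and it runs on whichever scheduler is active (Dask's local threaded scheduler if no `Client` has been created).

```python
import dask

@dask.delayed
def double(x):
    # Hypothetical helper used only for this illustration.
    return 2 * x

# Each call returns a Delayed object immediately; no work has happened yet.
parts = [double(n) for n in [1, 2, 3, 4]]

# Delayed objects can be passed to other delayed calls to extend the graph.
total = dask.delayed(sum)(parts)

# compute() runs the whole graph; independent tasks can run in parallel.
result = total.compute()
print(type(parts[0]).__name__, result)
```

Building one graph and computing it once, rather than computing each piece separately, gives the scheduler the chance to run the independent `double` calls at the same time.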

### Remote (or cloud) clusters vs. local cluster

In class, we're using a hosted Coiled cluster. But creating a Dask cluster locally, or on another cluster manager (like Kubernetes) or a public cloud (like AWS) isn't very different.

A local cluster (e.g., for working only on your laptop) looks like this

```python
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=4, threads_per_worker=1, memory_limit='2GiB')
client = Client(cluster)
```

One way to start a cluster on Kubernetes looks like

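```python
from dask_kubernetes import KubeCluster  # classic API; the import path varies by dask-kubernetes release
cluster = KubeCluster.from_yaml('worker-spec.yml')
client = Client(cluster)
```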

and one way to start a cluster on AWS Fargate (container service) looks like

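```python
from dask_cloudprovider.aws import FargateCluster  # older releases: from dask_cloudprovider import FargateCluster
cluster = FargateCluster(image="<hub-user>/<repo-name>[:<tag>]")
client = Client(cluster)
```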

We'll look at more options later.
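
Whichever cluster class is used, the `Client` API on top of it stays the same. As a rough sketch (assuming an in-process local cluster so that it runs without any cloud account), both objects can be used as context managers so that resources are released when the work finishes. The `square` function is a hypothetical example task, not something from the class materials.

```python
from dask.distributed import Client, LocalCluster

def square(x):
    # Hypothetical task for illustration.
    return x * x

# Context managers shut down the client and cluster when the block exits.
with LocalCluster(n_workers=2, threads_per_worker=1,
                  processes=False, dashboard_address=None) as cluster:
    with Client(cluster) as client:
        # submit() sends one function call to the cluster and returns a Future.
        future = client.submit(square, 7)
        answer = future.result()

print(answer)
```

Swapping `LocalCluster(...)` for another cluster class should leave the `submit` and `result` lines unchanged, which is the point of the comparison above.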

## Some more Dask Community resources

* Dask issues and source code https://github.com/dask
* StackOverflow https://stackoverflow.com/questions/tagged/dask
* Dask GitHub Discussions https://github.com/dask/dask/discussions
* Gitter https://gitter.im/dask/dask
* Slack
  * Dask https://dask.slack.com/
  * Coiled Community https://coiled-users.slack.com