<img src="https://raw.githubusercontent.com/dask/dask/main/docs/source/images/dask_horizontal.svg"
     width="60%"
     alt="Dask logo" />

# Time for a Test Drive!

You've spent some time walking around the Dascar lot, hearing about all the awesome features and specs...

That's enough talk. Let's jump into this racecar and see what it can do...

We'll test drive:

1. Dask DataFrames for faster & scalable pandas
2. Dask Arrays for faster & scalable NumPy
3. Dask-ML for faster & scalable scikit-learn
4. Coiled for cluster spin-up

![Race car](images/race-car.png)

## Dask DataFrames

The pandas car...with the Dask engine!

In [None]:
import dask.dataframe as dd

In [None]:
%run ../prep_data.py -d flights

In [None]:
import os

files = os.path.join('../data', 'nycflights', '*.csv')
files

In [None]:
df = dd.read_csv(files,
                 parse_dates={'Date': [0, 1, 2]},
                 dtype={"TailNum": str,
                        "CRSElapsedTime": float,
                        "Cancelled": bool})

In [None]:
df.head()

In [None]:
%%time
df.groupby("Origin")["DepDelay"].mean().compute()

### A slight difference with pandas
Notice the `.compute()` call. It is necessary because Dask uses **lazy evaluation**: operations like `groupby` and `mean` only build up a graph of tasks, and nothing actually runs until you ask for the result.

If you haven't heard about lazy evaluation before, check out [the Beginner's Guide to Distributed Computing](https://towardsdatascience.com/the-beginners-guide-to-distributed-computing-6d6833796318).

In [None]:
df

## Dask Arrays

The NumPy car...with Dask engine superpowers!

In [None]:
import dask.array as da

In [None]:
array = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))

In [None]:
array

In [None]:
array[:10,:5]

In [None]:
array[:10,:5].compute()

In [None]:
%%time
array.sum(axis=1).compute()

## Dask-ML

The scikit-learn car with...you guessed it: Dask rocket fuel!

In [None]:
from dask_ml.linear_model import LogisticRegression
from dask_ml.datasets import make_classification

In [None]:
X, y = make_classification(n_samples=1_000, chunks=50)

In [None]:
X

In [None]:
y

In [None]:
lr = LogisticRegression()

In [None]:
%%time
lr.fit(X, y)

In [None]:
%%time
predictions = lr.predict(X).compute()

In [None]:
lr.score(X,y).compute()

# Digging Deeper

Dask's lower-level APIs give you even more flexibility and control over what to parallelize in your custom Python code, and how.

## Parallelize Python Code with `dask.delayed`

In [None]:
from time import sleep

def inc(x):
    """Increments x by one"""
    sleep(1)
    return x + 1

def add(x=0, y=0, z=0):
    """Adds x and y and z"""
    sleep(1)
    return x + y + z

In [None]:
%%time

x = inc(1) # takes 1 second
y = inc(2) # takes 1 second
z = add(x, y) # takes 1 second

In [None]:
z

In [None]:
from dask import delayed

In [None]:
%%time

a = delayed(inc)(1)
b = delayed(inc)(2)
c = delayed(add)(a, b)

In [None]:
c

In [None]:
a.visualize()

In [None]:
b.visualize()

In [None]:
c.visualize()

In [None]:
%%time
c.compute()

In [None]:
d = delayed(inc)(3)

In [None]:
c = delayed(add)(a, b, d)

In [None]:
c.visualize()

In [None]:
%%time
c.compute()
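The same pattern extends to loops. The sketch below is one possible way to do it, not the only one: it builds one lazy task per input and then combines them. `slow_inc` is a hypothetical stand-in for `inc` with a shorter sleep so the cell finishes quickly.

```python
from time import sleep, perf_counter
from dask import delayed

def slow_inc(x):
    """Hypothetical stand-in for inc: increments x after a short pause."""
    sleep(0.2)
    return x + 1

# Build one lazy task per input; no work has happened yet.
tasks = [delayed(slow_inc)(i) for i in range(4)]

# Combine the lazy results with the built-in sum, still lazily.
total = delayed(sum)(tasks)

start = perf_counter()
result = total.compute()  # independent tasks may run in parallel
elapsed = perf_counter() - start
print(result, f"{elapsed:.2f}s")
```

Run serially, the four calls would take about 0.8 seconds. How much faster the Dask version is depends on the scheduler and the number of cores available.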

Task graphs can get...complicated:

<img src="https://raw.githubusercontent.com/coiled/pydata-global-dask/master/images/grid_search_schedule.gif"
     width="95%"
     alt="Grid search schedule" />

## Dask Cluster on Coiled

To launch your own Coiled clusters:
1. Create an account at [cloud.coiled.io](https://cloud.coiled.io)
2. Open a terminal
3. Create a new conda env and activate it
4. Run `conda install -c conda-forge coiled-runtime`
5. Run `coiled login`

You'll then be asked to log in to the Coiled web interface. Normally you'd navigate to https://cloud.coiled.io/profile, where you can create and manage API tokens. This requires setting up some cloud credentials. To bypass that for this tutorial, we'll use a test account that's already set up.

```
Please login to https://cloud.coiled.io/profile to get your token
Token:
```

Copy the following token (removing the "LONDON" in the middle) and press Enter:

`aea6c94125e64d8f839e9c7719537ca4-c48ca9434221c4d39b65b9266901c3956065a6cd`
    
This token will be destroyed immediately after this tutorial. To continue using Coiled after the tutorial, connect your Coiled account to your AWS/GCP cloud by following the steps [here](https://docs.coiled.io/user_guide/backends.html).

In [None]:
import coiled

In [None]:
# coiled.create_software_environment(
#     account="pydata-london",
#     conda="../binder/environment.yml",
#     name="dask-tutorial",
# )

In [None]:
# create a unique identifier for your cluster
import random
your_name = "INSERT-YOUR-NAME-HERE" 
unique_id = your_name + str(random.randint(100,200))

# spin up the cluster
cluster = coiled.Cluster(
    name=f"dask-tutorial-{unique_id}", 
    n_workers=20, 
    worker_memory='16GiB',
    software="pydata-london/dask-tutorial",
    scheduler_options={'idle_timeout':'2 hours'}, # default is 20min
    shutdown_on_close=False,
)

In [None]:
from distributed import Client

client = Client(cluster)

In [None]:
import dask.dataframe as dd

In [None]:
df = dd.read_csv(
    "s3://nyc-tlc/csv_backup/yellow_tripdata_2019-*.csv",
    parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"],
    dtype={
        "payment_type": "UInt8",
        "VendorID": "UInt8",
        "passenger_count": "UInt8",
        "RatecodeID": "UInt8",
        "store_and_fwd_flag": "category",
        "PULocationID": "UInt16",
        "DOLocationID": "UInt16",
    },
    storage_options={"anon": True},
    blocksize="16 MiB",
)

In [None]:
df

In [None]:
%%time
df.groupby("passenger_count").tip_amount.mean().compute()

Traceback (most recent call last):
  File "/Users/rpelgrim/mambaforge/envs/dask-tutorial/lib/python3.9/site-packages/distributed/comm/tcp.py", line 439, in connect
    stream = await self.client.connect(
  File "/Users/rpelgrim/mambaforge/envs/dask-tutorial/lib/python3.9/site-packages/tornado/tcpclient.py", line 275, in connect
    af, addr, stream = await connector.start(connect_timeout=timeout)
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/rpelgrim/mambaforge/envs/dask-tutorial/lib/python3.9/asyncio/tasks.py", line 492, in wait_for
    fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/rpelgrim/mambaforge/envs/dask-tutorial/lib/python3.9/site-packages/distributed/comm/core.py", line 289, in connect
    comm = await asyncio.wait_for(
  File "/Users/rpelgrim/mambaf