# Use External Dask Cluster from Local Environment

In this example, you'll learn how to create and interact with a Dask cluster from your local environment using the Saturn Cloud service. This allows you to skip interacting with the Saturn Cloud UI almost entirely, if you want to.

While we're using Jupyter locally to demonstrate, you can apply this technique for scripting or other kinds of ML workflows. 

<img src="dask-cluster.png" width = 500px>

As this diagram illustrates, the pieces in the gray box constitute the cluster, and that's what will be hosted on Saturn Cloud. Instead of the pink box (the Client) being a Jupyter instance also on Saturn Cloud, this will be your local machine.


> This tutorial does not go into great detail about the underlying concepts of Dask, but we have [reference material for those who need more information](https://www.saturncloud.io/docs/reference/dask_concepts/).

## Create Connection to Saturn Cloud

***

The API Token used below (kept in `config.json`) is your user token, which you can retrieve at `https://app.community.saturnenterprise.io/api/user/token` (or fill in the prefix that represents your enterprise account URL).

> **Protect your token, as it allows access to your account!**

***

In [1]:
# Load token
import json

with open('config.json') as f:
    data = json.load(f)
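If you'd rather not keep the token in a file next to your notebook, one option is to check an environment variable first and fall back to `config.json`. This is only a sketch: the variable name `SATURN_API_TOKEN` and the helper `load_api_token` are made up for illustration, and the only thing taken from this tutorial is that `config.json` holds an `api_token` key.

```python
import json
import os
from pathlib import Path


def load_api_token(config_path="config.json", env_var="SATURN_API_TOKEN"):
    """Return the API token from the environment if set, else from a JSON file.

    `env_var` is a hypothetical name chosen for this sketch. The JSON file is
    expected to look like {"api_token": "..."}.
    """
    token = os.environ.get(env_var)
    if token:
        return token
    with Path(config_path).open() as f:
        return json.load(f)["api_token"]


# Usage (assumes config.json exists or the environment variable is set):
# api_token = load_api_token()
```

Either way, keep the file or variable out of version control.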

In [2]:
# Connect to Saturn Cloud
from saturn_client import SaturnConnection

saturn_connection = SaturnConnection(
    url='https://app.community.saturnenterprise.io', 
    api_token=data['api_token']
)

In [3]:
saturn_connection

<saturn_client.core.SaturnConnection at 0x110b48d90>

## Create Project (if needed)

If you haven't set up a project inside Saturn Cloud, you can do that programmatically from here. If you have already set up the project, you need to know the `project_id`, which you can grab from the project URL.

In the screenshot below, the `project_id` is `1133af4131124f3bb13bda367eba2b52`. Yours will be different, so substitute your own value in the cells that follow.

Note that running the commented-out cell below always creates a new project, even if a project with that name already exists.
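If you copy the whole project URL from your browser, a small helper can pull the ID out for you. This sketch assumes the ID is the 32-character hexadecimal string seen in the examples in this tutorial. The URL path in the example is hypothetical, so check it against what your own browser shows.

```python
import re

# A run of exactly 32 hex characters, not embedded in a longer hex run.
_PROJECT_ID = re.compile(r"(?<![0-9a-f])[0-9a-f]{32}(?![0-9a-f])")


def project_id_from_url(url):
    """Return the first 32-character hex ID found in a project URL."""
    match = _PROJECT_ID.search(url.lower())
    if match is None:
        raise ValueError(f"no project ID found in {url!r}")
    return match.group(0)


# Hypothetical URL shape, for illustration only.
example_url = (
    "https://app.community.saturnenterprise.io"
    "/dash/projects/1133af4131124f3bb13bda367eba2b52"
)
print(project_id_from_url(example_url))
```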

<img src="project_id.png" width=750 alt="screenshot of project ID">

In [4]:
# project = saturn_connection.create_project(
#      name="external-demo",
#      image_uri='saturncloud/saturn:2021.02.22',
#  )
# project_id = project['id']
# project_id

In [5]:
project_id = '9cb3f1c325ee4f4f9a81193fac40a2ca'

## Connect to Project

Now, you will create an External Connection to this project, allowing you to interact with it from this notebook. Your user token is again required.

In [6]:
from dask_saturn.external import ExternalConnection
from dask_saturn import SaturnCluster
import dask_saturn
from dask.distributed import Client, progress

conn = ExternalConnection(
    project_id=project_id,
    base_url='https://app.community.saturnenterprise.io',
    saturn_token=data['api_token']
)
conn

<dask_saturn.external.ExternalConnection at 0x1126eb6d0>

## Set Up Cluster

Finally, you are ready to set up a cluster in this project! You'll see info messages logged here until the cluster has started and is ready to use.

If the project already has a cluster, this same code starts it up instead of creating a new one. You can also change its size using `cluster.scale()`. For more details, we have [documentation about managing clusters](https://www.saturncloud.io/docs/getting-started/create_cluster/).

In [7]:
cluster = SaturnCluster(
    external_connection=conn,
    n_workers=4,
    worker_size='8xlarge',
    scheduler_size='2xlarge',
    nthreads=32,
    worker_is_spot=False)


INFO:dask-saturn:Starting cluster. Status: pending
INFO:dask-saturn:Starting cluster. Status: pending
INFO:dask-saturn:Starting cluster. Status: pending
INFO:dask-saturn:Starting cluster. Status: pending
INFO:dask-saturn:Starting cluster. Status: pending
INFO:dask-saturn:Starting cluster. Status: pending
INFO:dask-saturn:Cluster is ready
INFO:dask-saturn:Registering default plugins
INFO:dask-saturn:{}
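Since you pay for every worker, you may want a guard around resizing. The sketch below wraps `cluster.scale()` so a typo can't request more workers than you intended. The helper name `scale_with_cap` and the default cap of 8 are arbitrary choices for this example, and it assumes `scale()` takes the desired number of workers.

```python
def scale_with_cap(cluster, requested_workers, max_workers=8):
    """Resize a running cluster, never asking for more than max_workers.

    Works with any object that has a .scale(n) method, such as the
    `cluster` created above. Returns the worker count actually requested.
    """
    if requested_workers < 0:
        raise ValueError("requested_workers must be non-negative")
    target = min(requested_workers, max_workers)
    cluster.scale(target)
    return target


# Usage with the cluster from the previous cell:
# scale_with_cap(cluster, 6)     # asks for 6 workers
# scale_with_cap(cluster, 600)   # capped at 8
```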


*(I have Python 3.8 locally, while the cluster is using 3.7, and the system will warn me, but this isn't going to be a problem here. This is the sort of thing you might encounter yourself when mixing local and Saturn environments.)*

After this point, you can use this cluster as you might use any other. Here, I will load data from my local environment, convert it to a Dask distributed data object, and manipulate it with my cluster.

## Create Client Object

This lets us connect from our local environment to the new cluster. Displaying the client object gives us a link to the Dask Dashboard for that cluster, where we can watch how the cluster is behaving.

In [8]:
client = Client(cluster)
client.wait_for_workers(4)
client

Client
  Scheduler: tls://d-steph-external-demo-8692d7c816bf47e2a2d34baba0dde817.community.saturnenterprise.io:8786
  Dashboard: https://d-steph-external-demo-8692d7c816bf47e2a2d34baba0dde817.community.saturnenterprise.io

Cluster
  Workers: 4
  Cores: 128
  Memory: 1.02 TB



+---------+---------------+----------------+----------------+
| Package | client        | scheduler      | workers        |
+---------+---------------+----------------+----------------+
| numpy   | 1.19.5        | 1.20.1         | 1.20.1         |
| python  | 3.8.6.final.0 | 3.7.10.final.0 | 3.7.10.final.0 |
+---------+---------------+----------------+----------------+


## Load Data

At this point, we can load our dataset, which for me is a set of just over 60 CSV files in an S3 bucket. The full dataset has more than 12 million rows of purchase records and 23 columns. I am going to load just one file from this set for the demo.

I am loading directly into Dask. However, if you have a flat file in a local directory, you can just as easily load it into pandas here and then convert it to a Dask dataframe.

In [None]:
%%time
import os
import pandas as pd
import dask.dataframe as dd


In [None]:
import s3fs
s3 = s3fs.S3FileSystem(anon=True)
s3fpath = 's3://saturn-public-data/ia_data/ia_10.csv'

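iowa = dd.read_csv(
    s3fpath,
    parse_dates=['Date'],
    engine='python',
    dtype={'Zip Code': 'object'},
    # Skip malformed rows silently. These two flags were deprecated in
    # pandas 1.3 and removed in 2.0; on newer versions use
    # on_bad_lines='skip' instead.
    error_bad_lines=False,
    warn_bad_lines=False,
    storage_options={'anon': True},
    assume_missing=True
)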

# Comment out below if using multiple files
iowa = iowa.repartition(npartitions = 4)

In [None]:
%%time
from dask.distributed import wait

iowa = iowa.persist()
_ = wait(iowa)
iowa.columns

Runtime with 12 million rows: CPU times: user 138 ms, sys: 16 ms, total: 154 ms
Wall time: 1min 8s

## Run Analyses

To demonstrate work on the cluster, I'll run a couple of analyses of the kind a business might want.

To aggregate across a dataframe efficiently with Dask, first **set the index of the dataframe**. This lets Dask organize the data that is partitioned across the cluster while still keeping it distributed.

> This is sometimes a slow task, but it only needs to be done once.

In [None]:
%%time

iowa = iowa.set_index("Date")
iowa = iowa.persist()
_ = wait(iowa)


Runtime with 12 million rows: CPU times: user 549 ms, sys: 52.6 ms, total: 601 ms
Wall time: 4min 37s

### Create a Rolling Average

From here, we can treat the dataframe very much like a pandas dataframe, but it remains distributed.

We'll calculate a new series, the 30-day rolling average of items sold (bottles), then shape it into a dataframe.

In [None]:
%%time

bottles_sold_roll = iowa['Bottles Sold'].rolling('30D').mean()
bottles_sold_roll = bottles_sold_roll.to_frame(name="bottles_sold_roll")
bottles_sold_roll = bottles_sold_roll.persist()

Runtime with 12 million rows: 
CPU times: user 32.1 ms, sys: 1.45 ms, total: 33.5 ms
Wall time: 32.5 ms

In [None]:
%%time

bottles_sold_roll.head()

Runtime with 12 million rows: 
CPU times: user 7.78 ms, sys: 1.86 ms, total: 9.64 ms
Wall time: 1.38 s
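If the time-based window is unfamiliar, this small pandas sketch on invented numbers shows how a `'30D'` rolling mean behaves. Each value is averaged with whatever other rows fall in the 30 days up to and including its own date, so sparse dates get smaller windows.

```python
import pandas as pd

# Invented sample: three sale dates with bottles sold on each.
sales = pd.Series(
    [10, 20, 60],
    index=pd.to_datetime(["2021-01-01", "2021-01-15", "2021-02-20"]),
    name="Bottles Sold",
)

rolling_mean = sales.rolling("30D").mean()
print(rolling_mean.tolist())  # [10.0, 15.0, 60.0]
```

The last value is 60.0 because no earlier sale falls within 30 days of 2021-02-20.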

### Group and Summarize

For a second example of calculations over the dataset on the cluster, I'll group by store and date, and calculate the store-level daily sales in dollars.

In [None]:
%%time

iowa['Sale (Dollars)'] = iowa['Sale (Dollars)'].str.lstrip('$').astype('float')

Runtime with 12 million rows: 
CPU times: user 8.34 ms, sys: 150 µs, total: 8.49 ms
Wall time: 8.45 ms

In [None]:
%%time

sum_store_sales = iowa.groupby(['Date', "Store Number"])["Sale (Dollars)"].sum()
sum_store_sales = sum_store_sales.to_frame(name="sum_store_sales")
sum_store_sales = sum_store_sales.persist()

Runtime with 12 million rows: 
CPU times: user 29.1 ms, sys: 1.64 ms, total: 30.7 ms
Wall time: 29.7 ms

In [None]:
%%time

sum_store_sales.head()

Runtime with 12 million rows: 
CPU times: user 58.7 ms, sys: 8.11 ms, total: 66.8 ms
Wall time: 40.1 s
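The same two steps, cleaning the dollar strings and then grouping, look like this in plain pandas on a few invented rows. It can be handy to check the logic this way on a small sample before running it on the cluster.

```python
import pandas as pd

# Invented sample rows with the same column names as the dataset.
df = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2021-01-04", "2021-01-04", "2021-01-04", "2021-01-05"]
    ),
    "Store Number": [2649, 2649, 3000, 2649],
    "Sale (Dollars)": ["$10.50", "$4.50", "$20.00", "$7.25"],
})

# Strip the leading dollar sign and convert to a numeric type.
df["Sale (Dollars)"] = df["Sale (Dollars)"].str.lstrip("$").astype("float")

# Total sales per store per day.
daily = (
    df.groupby(["Date", "Store Number"])["Sale (Dollars)"]
    .sum()
    .to_frame(name="sum_store_sales")
)
print(daily)
```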

## Combine Dataframes

From here, if you want to, you can join those new columns back onto your existing data using the indices.

In [None]:
%%time

iowa_new = dd.concat([iowa, bottles_sold_roll], axis=1)
iowa_new = iowa_new.persist()
_ = wait(iowa_new)

Runtime with 12 million rows:
CPU times: user 128 ms, sys: 11.2 ms, total: 139 ms
Wall time: 29.7 s

In [None]:
%%time

iowa_final = iowa_new.merge(sum_store_sales, how="left",
                            on=['Date', "Store Number"])
iowa_final = iowa_final.persist()
_ = wait(iowa_final)

Runtime with 12 million rows:
CPU times: user 72.6 ms, sys: 6.27 ms, total: 78.8 ms
Wall time: 6.52 s

## View Data

If you examine this object, you see the structure of the dataframe (its columns and dtypes) but not the contents. This is a consequence of its distributed, lazily evaluated nature.

In [None]:
%%time

iowa_final

Runtime with 12 million rows:
CPU times: user 2 µs, sys: 1 µs, total: 3 µs
Wall time: 5.25 µs

However, if we check the head of this object, we can see the actual values. This may take time, because part of the dataframe must be computed to show the values.

In [None]:
%%time
iowa_final.head()

Runtime with 12 million rows:
CPU times: user 8.63 ms, sys: 1.94 ms, total: 10.6 ms
Wall time: 110 ms

In [None]:
%%time

iowa_final[iowa_final['Store Number'] == 2649].head()

In [None]:
len(iowa_final)

## Return to pandas

At this point, you can use this dataset for whatever next steps you have, such as passing it to a machine learning workflow.

If you need to use the data in a way that is not Dask-compatible, and the data is small enough, you can return it to a pandas dataframe with this command. Because this runs all the pending computations and consolidates the data into the Client environment (here, your local machine), it can be slow.

In [None]:
%%time

iowa_pd = iowa_final.compute()
type(iowa_pd)

## Housekeeping

Because we are not working inside the UI, we want to make sure that we shut down any resources when we are done; otherwise, you can incur unwanted costs.

To shut down the cluster entirely:

In [None]:
client.close()

In [None]:
cluster.close()

In [None]:
#client.restart()
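In a script, where nobody is around to run the cleanup cells by hand, you may want the shutdown to happen even if the analysis fails partway. One way to sketch that is a small context manager that closes the client and then the cluster, in the same order as above. The name `managed_cluster` is made up for this example, and it relies only on both objects having a `close()` method.

```python
from contextlib import contextmanager


@contextmanager
def managed_cluster(cluster, client):
    """Yield the client, then close client and cluster even on errors."""
    try:
        yield client
    finally:
        try:
            client.close()
        finally:
            # Runs even if closing the client raised an exception.
            cluster.close()


# Usage in a script:
# with managed_cluster(cluster, Client(cluster)) as client:
#     ...  # run your analysis here
```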