# Practical Example - Distributed Data

This example runs a simple backtest on a large dataset (~29GB compressed) using Dask.

We will distribute the dataset over the Dask cluster, run the analysis, and get the results.

We will show how to do the following:

* Connect to the Dask cluster.
* Load the data from a shared file system.
* Use Dask dataframes to store the data.
* Run the analysis.


## Imports

In [None]:
import dask.dataframe as dd
import numpy as np
from dask.diagnostics import ProgressBar
from dask.distributed import Client, wait
from dask_saturn import SaturnCluster

## Starting the Dask Cluster

In [None]:
n_workers = 5
cluster = SaturnCluster(n_workers=n_workers)
client = Client(cluster)
client.wait_for_workers(n_workers=n_workers)
client.restart()

## Read in the File
This file is approximately 28GB compressed. It is stored in Parquet format, but any other file type that Dask supports (e.g., CSV, JSON) would also work.

In [None]:
ddf = dd.read_parquet(
    "/home/jovyan/shared/nathan/poc-gsa/datasets/stocks/stock_data.pq"
)

## Persist the file
Persisting is not strictly necessary, but it is useful when you run several computations on the same data. `.persist()` loads the data into the Dask workers' memory. Without it, every call to `.compute()` in this sequence re-reads the file.

`wait()` blocks until the persistence is done. This avoids some hard-to-diagnose errors.

In [None]:
ddf = ddf.persist()
_ = wait(ddf)
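The effect of persisting can be illustrated on a toy task graph. The sketch below is not part of the original notebook: it uses `dask.delayed` rather than a dataframe, a hypothetical `load` function that counts how often it runs, and the synchronous scheduler so that it works without a cluster.

```python
import dask

calls = {"n": 0}  # counts how many times the "file" is loaded


@dask.delayed
def load():
    # Hypothetical stand-in for an expensive read from disk.
    calls["n"] += 1
    return list(range(5))


data = load()
total = dask.delayed(sum)(data)

# Without persisting, each compute re-runs the load step.
first = total.compute(scheduler="synchronous")
second = total.compute(scheduler="synchronous")
loads_without_persist = calls["n"]

# Persisting runs the load once and keeps the result in memory.
persisted = data.persist(scheduler="synchronous")
total_persisted = dask.delayed(sum)(persisted)
third = total_persisted.compute(scheduler="synchronous")
fourth = total_persisted.compute(scheduler="synchronous")
loads_after_persist = calls["n"] - loads_without_persist

print(loads_without_persist, loads_after_persist)
```

On a distributed cluster the persisted data lives in worker memory instead of the local process, but the idea is the same: later computations reuse the loaded data.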

## Conduct the Backtest
This is a simple moving average crossover strategy. Nothing special here; this should look very similar to pandas code.

The main difference is the `meta=(name, dtype)` argument. Because Dask is lazy and does not inspect the data until `.compute()` is called, it can sometimes infer the wrong output type for a custom function. Specifying the output name and dtype directly is usually a good idea for this reason.

In [None]:
ddf["signal"] = (
    ddf["ask_close"].rolling(5 * 60).mean() - ddf["ask_close"].rolling(20 * 60).mean()
)

# Map the sign of the signal from {-1, 0, 1} to a position in {0, 0.5, 1}.
ddf["position"] = (ddf["signal"].apply(np.sign, meta=("signal", "float64")) + 1) / 2

ddf["return"] = ddf["position"].shift(1) * ddf["ask_close"].apply(
    np.log, meta=("ask_close", "float64")
).diff(1)

ddf["total"] = ddf["return"].cumsum().apply(np.exp, meta=("return", "float64"))

ddf_results = ddf[["total"]]
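For comparison, the same strategy can be sketched in plain pandas on a handful of synthetic prices. The prices and the short window lengths `fast` and `slow` below are assumptions chosen so the result is easy to check by hand. They are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic prices (made up): flat at 100, then rising by 1 per step.
prices = pd.Series(
    [100.0] * 5 + [101.0, 102.0, 103.0, 104.0, 105.0], name="ask_close"
)
df = prices.to_frame()

# Hypothetical short windows; the notebook uses 5 * 60 and 20 * 60.
fast, slow = 2, 4

# Fast moving average minus slow moving average.
df["signal"] = (
    df["ask_close"].rolling(fast).mean() - df["ask_close"].rolling(slow).mean()
)

# sign(signal) is in {-1, 0, 1}, so the position is in {0, 0.5, 1}.
df["position"] = (np.sign(df["signal"]) + 1) / 2

# Log return of each step, weighted by the position held from the previous step.
df["return"] = df["position"].shift(1) * np.log(df["ask_close"]).diff(1)

# Cumulative growth factor of the strategy.
df["total"] = np.exp(df["return"].cumsum())

print(df["total"].tail(1))
```

Note that a signal of exactly zero yields a position of 0.5 rather than 0 or 1. This follows directly from the `(sign + 1) / 2` mapping.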

## Compute
Lastly, we trigger the computation with the `.tail()` method. This takes the result from Dask worker memory and brings it back to the client machine. We are only returning one value, so there should be no memory issues here, but be mindful of how much data you bring back to your local machine so that you do not crash the kernel.

In [None]:
total_returns = ddf["total"].tail(1)

print(total_returns)