# workflows

Complex machine learning applications often require multi-stage pipelines (e.g., data loading, transforming, training, testing, iterating). [**Workflows**](https://spell.ml/docs/workflow_overview/) in Spell allow you to manage these pipelines as a sequence of Spell runs, and are a lightweight alternative to tools like [Airflow](https://airflow.apache.org/) and [Luigi](https://github.com/spotify/luigi) for managing your model training pipelines.

Workflows can be launched using either the Spell CLI or the Spell Python API. In this tutorial we demonstrate both approaches by example.

## understanding workflows

Every workflow consists of one *master run* and one or more *worker runs*. The master run is responsible for control flow: that is, determining which worker runs get executed, when, and under what conditions. The worker runs then do all of the work required. For example:

![](https://i.imgur.com/W5Ugs0S.png)

In this diagram the master run coordinates the sequential execution of three worker runs. More complex workflows may require more complicated control flow.
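To make the division of labor concrete, here is a minimal plain-Python analogy of the control flow in the diagram. It does not use the Spell API: `run_worker` and `master` are hypothetical stand-ins for "launch a worker run and wait for it" and "the workflow script", and the stage names are made up for illustration.

```python
# Plain-Python analogy of a master run driving three sequential worker runs.
# `run_worker` is a hypothetical stand-in, not a Spell API call.

def run_worker(stage):
    """Pretend to execute one worker stage and report whether it succeeded."""
    print(f"running stage: {stage}")
    return True


def master(stages, run_stage=run_worker):
    """Run the stages in order, stopping at the first one that fails."""
    finished = []
    for stage in stages:
        if not run_stage(stage):
            raise RuntimeError(f"stage {stage!r} failed")
        finished.append(stage)
    return finished


completed = master(["load_data", "train", "test"])
print(completed)
```

The point of the sketch is that the master holds only the sequencing and failure logic, while each stage does its work independently.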

## understanding the workflow script

The **workflow script** is what gets executed on the master run: a Python script that uses the Spell Python API to define the worker runs and the control flow logic surrounding them. Here is a simple example:

In [None]:
%%writefile simple.py
import spell.client
client = spell.client.from_environment()

print(client.active_workflow)

r1 = client.runs.new(command="echo 'Hello World!' > foo.txt")
r1.wait_status(*client.runs.FINAL)
r1.refresh()
if r1.status != client.runs.COMPLETE:
    raise OSError(f"failed at run {r1.id}")

r2 = client.runs.new(
    command="cat /mnt/foo.txt",
    attached_resources={f"runs/{r1.id}/foo.txt": "/mnt/foo.txt"}
)
r2.wait_status(*client.runs.FINAL)
r2.refresh()
if r2.status != client.runs.COMPLETE:
    raise OSError(f"failed at run {r2.id}")

print("Finished workflow!")

Let's walk through this script step-by-step:

```python
import spell.client
client = spell.client.from_environment()
```

This initializes the client object. If you are not familiar with our Python API, check out the [Python API Reference](http://spell.run/docs/python) to learn more.


```python
print(client.active_workflow)
```

You can use this attribute to determine which workflow the script is currently executing in. If the script is not being run from inside a workflow, it is set to `None`.

```python
r1 = client.runs.new(command="echo 'Hello World!' > foo.txt")
```

This next block of code executes a new run, one which creates a file containing `Hello World!` on disk. [This file automatically gets saved to SpellFS.](https://spell.ml/docs/run_overview/#saving-resources)

```python
r1.wait_status(*client.runs.FINAL)
r1.refresh()
if r1.status != client.runs.COMPLETE:
    raise OSError(f"failed at run {r1.id}")
```

We can only proceed to the next stage of the workflow when the first stage completes successfully. This next bit of code is a control flow block that achieves this.

Every run transitions through a sequence of states as part of its execution: `machine_requested`, `running`, `pushing`, and so on. Runs eventually transition to a so-called **final state**: the state that the run is assigned at the end of its execution. There are four different possible final states, the most important of which is `COMPLETE`. A run which terminates in the `COMPLETE` state is one which has successfully run all of its code and pushed all of its outputs to SpellFS.

The `wait_status` method blocks execution until the run API reports that the run has reached a final state. We then `refresh` the information on the run object (this has to be done manually because it requires a network roundtrip) and check whether the `r1.status` field reports that the run is `COMPLETE`. We only proceed with the rest of the script if it is. If it is not, for example because the run reached a failing final state (`FAILED`, `STOPPED`, or `INTERRUPTED`), we raise an error instead.
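Because every stage repeats the same wait, refresh, and check sequence, it can be convenient to wrap it in a helper. The following is a minimal sketch under stated assumptions: `wait_until_complete` and `FakeRun` are hypothetical names introduced here. `FakeRun` only imitates the `id`, `status`, `wait_status`, and `refresh` members used in this tutorial so that the snippet runs without a Spell account, and the status strings are assumed for illustration. In a real workflow script you would instead pass a run returned by `client.runs.new`, together with `client.runs.FINAL` and `client.runs.COMPLETE`.

```python
class FakeRun:
    """Hypothetical stand-in mimicking the run members used in this tutorial."""

    def __init__(self, run_id, final_status):
        self.id = run_id
        self.status = "running"
        self._final_status = final_status

    def wait_status(self, *statuses):
        # A real run blocks here until the API reports one of `statuses`.
        pass

    def refresh(self):
        # A real run re-fetches its state over the network.
        self.status = self._final_status


def wait_until_complete(run, final_states, complete_state):
    """Block until `run` is final; raise unless it ended in `complete_state`."""
    run.wait_status(*final_states)
    run.refresh()
    if run.status != complete_state:
        raise OSError(f"failed at run {run.id} with status {run.status}")
    return run


# Assumed status strings, for illustration only.
FINAL = ("complete", "failed", "stopped", "interrupted")

ok = wait_until_complete(FakeRun(1, "complete"), FINAL, "complete")
print(ok.status)

try:
    wait_until_complete(FakeRun(2, "failed"), FINAL, "complete")
except OSError as err:
    print(err)
```

Passing the final and complete states in as arguments keeps the helper independent of the client object, which also makes it easy to exercise without launching real runs.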

```python
r2 = client.runs.new(
    command="cat /mnt/foo.txt",
    attached_resources={f"runs/{r1.id}/foo.txt": "/mnt/foo.txt"}
)
r2.wait_status(*client.runs.FINAL)
r2.refresh()
if r2.status != client.runs.COMPLETE:
    raise OSError(f"failed at run {r2.id}")
```

The next code block creates another Spell run. This time instead of writing `Hello World!` to disk, we mount the `foo.txt` file we created in `r1` into the run. We then `cat` it (print it out to `stdout`), which will cause it to show up in the run logs.
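In `attached_resources`, each key is a path to a resource (here, an output of an earlier run) and each value is the mount point inside the new run. When several outputs need to be mounted, a small helper can keep that mapping readable. This is only a sketch: `mount_run_output` is a hypothetical helper, not part of the Spell API, and the run ID and the second file name below are made up. It simply builds the same dictionary shape used in the script.

```python
def mount_run_output(run_id, path, mount_root="/mnt"):
    """Build an `attached_resources` entry for one output of an earlier run."""
    relative = path.lstrip("/")
    source = f"runs/{run_id}/{relative}"
    target = f"{mount_root.rstrip('/')}/{relative}"
    return {source: target}


# Hypothetical run ID and file names, for illustration only.
resources = {}
resources.update(mount_run_output(351, "foo.txt"))
resources.update(mount_run_output(351, "logs/train.log"))
print(resources)
```

The resulting dictionary could then be passed as the `attached_resources` argument when creating the next run.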

## executing the workflow script

You can execute the workflow script using the Spell CLI:

In [31]:
!spell workflow "python simple.py"

✨ Preparing uncommitted changes…
Enumerating objects: 9, done.
Counting objects: 100% (9/9), done.
Delta compression using up to 12 threads
Compressing objects: 100% (5/5), done.
Writing objects: 100% (5/5), 649 bytes | 649.00 KiB/s, done.
Total 5 (delta 4), reused 0 (delta 0)
To git.spell.run:aleksey/e6cee8710721a8ef6f3d2924713ac7d351c972ca.git
 * [new branch]      HEAD -> br_9beb42bead69bba7ca10038c6207ac35601c371b
💫 Casting workflow #14…
✨ Following workflow at run 350.
✨ Stop viewing logs with ^C
✨ Building… done
✨ Run is running
✨ Machine_Requested… done

We can verify that this workflow executed successfully by checking the run logs of the last worker run:

In [43]:
!spell logs 352

✨ Machine_Requested… done
✨ Building… done
✨ Mounting… done
✨ Run is running
Hello World!
✨ Saving… done
✨ Pushing… done
🎉 Total run time: 11.525986s
🎉 Run 352 complete

## a more complex example

As with any run, the code environment in a worker run can be initialized from a GitHub repository using the `--github-url` flag.

However, with more complex pipelines it is sometimes useful to make the exact model code used a runtime variable. To support this use case, workflows additionally support initializing the code environment of a worker run from a local `git` repository, which is attached to the workflow using the `--repo` flag.

The following example demonstrates how this feature works. This workflow downloads the CIFAR10 dataset in one run and saves it to disk. In a second run, it mounts the data downloaded by the first run and trains a model on it.

Note the use of the `commit_label` argument in the second call to `client.runs.new`; this tells the run to initialize its code environment from the repository with the label `cnn-cifar10`. It is the responsibility of the user to set this value accordingly.

In [2]:
%%writefile workflow.py
import spell.client

client = spell.client.from_environment()


# Helper function. Throws a ValueError if the run failed.
def raise_if_failed(run):
    if run.status in [
        client.runs.FAILED,
        client.runs.BUILD_FAILED,
        client.runs.MOUNT_FAILED,
    ]:
        raise ValueError(f"Run #{run.id} failed with status `{run.status}`.")
    if run.user_exit_code != 0:
        raise ValueError(
            f"Run #{run.id} finished with nonzero exit code {run.user_exit_code}."
        )


# The first run downloads the training dataset
cmd = """
import torchvision
transform_train = torchvision.transforms.Compose([
    torchvision.transforms.RandomHorizontalFlip(),
    torchvision.transforms.ToTensor(),
    torchvision.transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010))
])
torchvision.datasets.CIFAR10("/spell/cifar10/", train=True, transform=transform_train, download=True)
"""
r1 = client.runs.new(command=f"python -c '{cmd}'")
print(f"Waiting for run {r1.id} to complete")
r1.wait_status(*client.runs.FINAL)
r1.refresh()
raise_if_failed(r1)

# The second run trains a model on this dataset
r2 = client.runs.new(
    machine_type="t4",
    command="python models/train_basic.py",
    attached_resources={f"runs/{r1.id}/cifar10": "/mnt/cifar10/"},
    commit_label="cnn-cifar10",
)
print(f"Waiting for run {r2.id} to complete")
r2.wait_status(*client.runs.FINAL)
r2.refresh()
raise_if_failed(r2)

Overwriting workflow.py


To run this workflow we will need the following model code:

In [1]:
!git clone https://github.com/spellml/cnn-cifar10.git

Cloning into 'cnn-cifar10'...
remote: Enumerating objects: 159, done.
remote: Counting objects: 100% (159/159), done.
remote: Compressing objects: 100% (103/103), done.
remote: Total 159 (delta 69), reused 126 (delta 39), pack-reused 0
Receiving objects: 100% (159/159), 544.17 KiB | 1.81 MiB/s, done.
Resolving deltas: 100% (69/69), done.


Finally, when we execute this workflow, we parameterize the repo label using the `--repo` flag:

In [2]:
!spell workflow create \
    --repo cnn-cifar10=cnn-cifar10 \
    "python workflow.py"

See also the `with-metrics` directory for another example.