<div>
<center><img src="Flux-logo.svg" width="400"/></center>
</div>

# Flux RADIUSS Tutorial on AWS

> What is Flux Framework? 🤔️
 
Flux is a flexible framework for resource management, built for your site. The framework consists of a suite of projects, tools, and libraries which may be used to build site-custom resource managers for High Performance Computing centers. Flux is a next-generation resource manager and scheduler with many transformative capabilities like hierarchical scheduling and resource management (you can think of it as "fractal scheduling") and directed-graph based resource representations.

> I'm ready! How do I do this tutorial? 😁️

To step through examples in this notebook you need to execute cells. To run a cell, press Shift+Enter on your keyboard. If you prefer, you can also paste the shell commands in the JupyterLab terminal and execute them there.
Let's get started! To provide some brief, added background on Flux and a bit more motivation for our tutorial, "Shift+Enter" the cell below to watch our YouTube video!

In [None]:
%%html
<iframe width="640" height="360" 
    src="https://www.youtube.com/embed/YIwt51dyXOE" 
    title="YouTube video player" 
    frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" 
    allowfullscreen>
</iframe>

# Getting started with Flux

The code and examples that this tutorial is based on can be found at [flux-framework/Tutorials](https://github.com/flux-framework/Tutorials/tree/master/2023-RADIUSS-AWS). You can also find the examples one level up in the flux-workflow-examples directory in this JupyterLab instance.

## Resources

> Looking for other resources? We got you covered! 🤓️

 - [https://flux-framework.org/](https://flux-framework.org/) Flux Framework portal for projects, releases, and publication.
 - [Flux Documentation](https://flux-framework.readthedocs.io/en/latest/).
 - [Flux Framework Cheat Sheet](https://flux-framework.org/cheat-sheet/)
 - [Flux Glossary of Terms](https://flux-framework.readthedocs.io/en/latest/glossary.html)
 - [Flux Comics](https://flux-framework.readthedocs.io/en/latest/comics/fluxonomicon.html) come and meet FluxBird - the pink bird who knows things!
 - [Flux Learning Guide](https://flux-framework.readthedocs.io/en/latest/guides/learning_guide.html) learn about what Flux does, how it works, and real research applications 
 - [Getting Started with Flux and Go](https://converged-computing.github.io/flux-go/)
 - [Getting Started with Flux in C](https://converged-computing.github.io/flux-c-examples/) *looking for contributors*

To read the Flux manpages and get help, run `flux help`. To get documentation on a subcommand, run, e.g. `flux help config`.  Here is an example of running `flux help` right from the notebook. Yes, did you know we are running in a Flux Instance right now?

In [None]:
!flux help

Did you know you can also get help for a specific command? For example, let's run `flux help jobs` to get information on a sub-command:

In [None]:
!flux help jobs

### You can run any of the commands and examples that follow in the JupyterLab terminal. You can find the terminal in the JupyterLab launcher.
If you do `File -> New -> Terminal` you can open a raw terminal to play with Flux. You'll see a prompt like this: 

`ƒ(s=4,d=0) fluxuser@6e0f43fd90eb:~$`

`s=4` indicates the number of running Flux brokers, `d=0` indicates the Flux hierarchy depth. `@6e0f43fd90eb` references the host, which is a Docker container for our tutorial.
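
If you ever want to read those two numbers programmatically, the prompt prefix is easy to pick apart. The following is a minimal sketch, assuming the prompt always has the `ƒ(s=N,d=M)` shape shown above; `parse_flux_prompt` is a hypothetical helper written for this tutorial, not part of Flux.

```python
import re

# Pattern for the tutorial prompt prefix, e.g. "ƒ(s=4,d=0)".
PROMPT_RE = re.compile(r"ƒ\(s=(?P<size>\d+),d=(?P<depth>\d+)\)")


def parse_flux_prompt(prompt):
    """Return (size, depth) parsed from a prompt string, or None if absent."""
    match = PROMPT_RE.search(prompt)
    if match is None:
        return None
    return int(match.group("size")), int(match.group("depth"))


size, depth = parse_flux_prompt("ƒ(s=4,d=0) fluxuser@6e0f43fd90eb:~$")
print(f"brokers={size} depth={depth}")
```

Inside a running instance, `flux getattr size` is the more direct way to get the broker count.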

# Creating Flux Instances

A Flux instance is a fully functional set of services which manage compute resources under its domain with the capability to launch jobs on those resources. A Flux instance may be running as the default resource manager on a cluster, a job in a resource manager such as Slurm, LSF, or Flux itself, or as a test instance launched locally.

When run as a job in another resource manager, Flux is started like an MPI program, e.g., `srun [OPTIONS] flux start [SCRIPT]`. Flux is unique in that a test instance which mimics a multi-node instance can be started locally with simply `flux start --test-size=N`. This offers users a way to learn and test interfaces and commands without access to an HPC cluster.

To start a Flux session with 4 brokers in your container, run:

In [None]:
!flux start --test-size=4 flux getattr size

The output indicates the number of brokers started successfully.

## Flux uptime
Flux provides an `uptime` utility that displays properties of the current Flux instance: its state, how long it has been running, the instance owner, the instance depth (depth in the Flux hierarchy), the instance size (number of brokers), and whether scheduling is disabled or stopped.

In [None]:
!flux uptime

# Submitting Jobs to Flux
## Submission CLI
### `flux`: the Job Submission Tool

To submit jobs to Flux, you can use the `flux` `submit`, `run`, `bulksubmit`, `batch`, and `alloc` commands.  The `flux submit` command submits a job to Flux and prints out the jobid. 

In [None]:
!flux submit hostname

`submit` supports common options like `--nodes`, `--ntasks`, and `--cores-per-task`. These options also have short equivalents (`-N`, `-n`, and `-c`, respectively). `--cores-per-task=1` is the default.

In [None]:
!flux submit -N1 -n2 sleep inf
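
If you drive the CLI from a Python script rather than a notebook cell, it can help to build the argument list explicitly. This is a small sketch, not an official interface: `build_submit_argv` is a hypothetical helper that only assembles the short options described above, and it does not submit anything by itself.

```python
import shlex


def build_submit_argv(command, nodes=None, ntasks=None, cores_per_task=None):
    """Build an argv list for `flux submit` using the short options."""
    argv = ["flux", "submit"]
    if nodes is not None:
        argv.append(f"-N{nodes}")
    if ntasks is not None:
        argv.append(f"-n{ntasks}")
    if cores_per_task is not None:
        argv.append(f"-c{cores_per_task}")
    return argv + list(command)


argv = build_submit_argv(["sleep", "inf"], nodes=1, ntasks=2)
print(shlex.join(argv))  # flux submit -N1 -n2 sleep inf
```

Inside a Flux instance, the list could then be handed to `subprocess.run(argv, capture_output=True, text=True)`, whose `stdout` would hold the printed jobid.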

In [None]:
!flux submit --help

The `flux run` command submits a job to Flux (similar to `flux submit`) but then attaches to the job with `flux job attach`, printing the job's stdout/stderr to the terminal and exiting with the same exit code as the job:

In [None]:
!flux run hostname

The output from the previous command is the hostname (a container ID string in this case). If the job exits with a non-zero exit code this will be reported by `flux job attach` (occurs implicitly with `flux run`). For example, execute the following:

In [None]:
!flux run /bin/false
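
Because `flux run` exits with the job's exit code, a calling script can check the result the same way it would for any other command. Here is a minimal sketch of that pattern. To keep it runnable without Flux, it uses a child Python process as a stand-in for the failing job; `run_and_get_exit_code` is a hypothetical helper name.

```python
import subprocess
import sys


def run_and_get_exit_code(argv):
    """Run a command, let its output pass through, and return its exit code."""
    completed = subprocess.run(argv)
    return completed.returncode


# Stand-in for a failing job: a child Python process that exits with code 1.
# With Flux available, argv could instead be ["flux", "run", "/bin/false"].
code = run_and_get_exit_code([sys.executable, "-c", "raise SystemExit(1)"])
print(f"exit code: {code}")
```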

A job submitted with `run` can be canceled with two rapid `Ctrl-C`s in succession, or a user can detach from the job with `Ctrl-C Ctrl-Z`. The user can then re-attach to the job by using `flux job attach JOBID`.

`flux submit` and `flux run` also support many other useful flags:

In [None]:
!flux run -n4 --label-io --time-limit=5s --env-remove=LD_LIBRARY_PATH hostname
!flux run --help

The `flux bulksubmit` command enqueues jobs based on a set of inputs which are substituted on the command line, similar to `xargs` and the GNU `parallel` utility, except the jobs have access to the resources of an entire Flux instance instead of only the local system.

In [None]:
!flux bulksubmit --watch --wait echo {} ::: foo bar baz
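
To make the substitution concrete, here is a rough Python sketch of what happens to the command template for a single input list, as in the cell above. It only illustrates the `{}` replacement; `expand_bulk_commands` is a hypothetical helper and does not cover the other input forms that `flux bulksubmit` may accept.

```python
def expand_bulk_commands(template, inputs):
    """Substitute each input for "{}" in a command template (single input list only)."""
    return [[arg.replace("{}", item) for arg in template] for item in inputs]


# One command per input, mirroring `echo {} ::: foo bar baz`.
for argv in expand_bulk_commands(["echo", "{}"], ["foo", "bar", "baz"]):
    print(" ".join(argv))
```

Each resulting command becomes its own job, which is why the jobs can land anywhere in the instance rather than only on the local system.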

The `--cc` option to `submit` makes repeated submission even easier. Use `flux submit --cc=IDSET` to submit one copy of the job per id in `IDSET`:

In [None]:
!flux submit --cc=1-10 --watch hostname

Try it in the JupyterLab terminal with a progress bar and jobs/s rate report: `flux submit --cc=1-100 --watch --progress --jps hostname`

Note that `--wait` is implied by `--watch`.
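
An `IDSET` such as `1-10` is a compact way of writing a set of integers. The sketch below shows how the simple forms used in this tutorial (comma-separated ids and `first-last` ranges) expand. It is an illustration only: `expand_idset` is a hypothetical helper, and Flux's own idset parser is the authority on what is accepted.

```python
def expand_idset(idset):
    """Expand a simple idset such as "1-10" or "0,2,5-7" into a sorted list of ints."""
    ids = set()
    for part in idset.strip("[]").split(","):
        if "-" in part:
            first, last = part.split("-")
            ids.update(range(int(first), int(last) + 1))
        else:
            ids.add(int(part))
    return sorted(ids)


print(len(expand_idset("1-10")))  # number of job copies for --cc=1-10
print(expand_idset("0,2,5-7"))
```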

Of course, Flux can launch more than just single-node, single-core jobs.  We can submit multiple heterogeneous jobs and Flux will co-schedule the jobs while also ensuring no oversubscription of resources (e.g., cores).

Note: in this tutorial, we cannot assume that the host you are running on has multiple cores, thus the examples below only vary the number of nodes per job.  Varying the `cores-per-task` is also possible on Flux when the underlying hardware supports it (e.g., a multi-core node).

In [None]:
!flux submit --nodes=2 --ntasks=2 --cores-per-task=1 --job-name simulation sleep inf
!flux submit --nodes=1 --ntasks=1 --cores-per-task=1 --job-name analysis sleep inf
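
The idea of co-scheduling without oversubscription can be pictured with a toy model. The sketch below is not how Flux's schedulers are implemented; it is a deliberately simple in-order allocator over whole nodes, and the third job (`extra`) is made up to show what happens when a request does not fit.

```python
def allocate(free_nodes, jobs):
    """Assign whole nodes to jobs in order; a job that does not fit stays pending."""
    free = list(free_nodes)
    running, pending = {}, []
    for name, nnodes in jobs:
        if nnodes <= len(free):
            # Hand out the first nnodes free nodes and remove them from the pool.
            running[name], free = free[:nnodes], free[nnodes:]
        else:
            pending.append(name)
    return running, pending


running, pending = allocate(
    [0, 1, 2, 3], [("simulation", 2), ("analysis", 1), ("extra", 2)]
)
print(running)
print(pending)
```

With four nodes, `simulation` and `analysis` run side by side, while `extra` waits because only one node is left.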

### ⭐️ New Command Alert! `flux watch` ⭐️

Wouldn't it be cool to submit a job and then watch it? Well, yeah! We can do this now with flux watch. Let's run a fun example, and then watch the output. We have sleeps in here interspersed with echos only to show you the live action! 🥞️
Also note a nice trick - you can always use `flux job last` to get the last JOBID.

In [None]:
!flux submit ../flux-workflow-examples/job-watch/job-watch.sh
!flux watch $(flux job last)

### Listing job properties with `flux jobs`

We can now list the jobs in the queue with `flux jobs` and we should see both jobs that we just submitted. Jobs that are instances are colored blue in output, red jobs are failed jobs, and green jobs are those that completed successfully. Note that the JupyterLab notebook may not display these colors. You will be able to see them in the terminal.

In [None]:
!flux jobs

Since those jobs won't ever exit (and we didn't specify a time limit), let's kill them off now and free up the resources.

In [None]:
!flux job killall -f
!flux jobs

We can use the `flux batch` command to easily create nested Flux instances.  When `flux batch` is invoked, Flux automatically creates a nested instance that spans the resources allocated to the job, and then runs the batch script passed to `flux batch` on rank 0 of the nested instance. "Rank" refers to the rank of the Tree-Based Overlay Network (TBON) used by the [Flux brokers](https://flux-framework.readthedocs.io/projects/flux-core/en/latest/man1/flux-broker.html).

While a batch script is expected to launch parallel jobs using `flux run` or `flux submit` at this level, nothing prevents the script from further batching other sub-batch-jobs using the `flux batch` interface, if desired.

Note: Flux also provides `flux alloc`, an interactive version of `flux batch`, but demonstrating it in a Jupyter notebook is difficult due to the lack of a pseudo-terminal.

In [None]:
!flux batch --nslots=2 --cores-per-slot=1 --nodes=2 ./sleep_batch.sh
!flux batch --nslots=2 --cores-per-slot=1 --nodes=2 ./sleep_batch.sh

The contents of `sleep_batch.sh`:

``` bash 
    #!/bin/bash
  
    echo "Starting my batch job"
    echo "Print the resources allocated to this batch job"
    flux resource list

    echo "Use sleep to emulate a parallel program"
    echo "Run the program at a total of 2 processes each requiring"
    echo "1 core. These processes are equally spread across 2 nodes."
    flux run -N 2 -n 2 sleep 30
    flux run -N 2 -n 2 sleep 30
```

In [None]:
# Here we are submitting a job that generates output, and asking to write it to /tmp/cheese.txt
!flux submit --out /tmp/cheese.txt echo "Sweet dreams 🌚️ are made of cheese, who am I to diss a brie? 🧀️"

# This will show us JOBIDs
!flux jobs

# You could copy a JOBID from above and paste it in the line below to examine the job's resources and output
# or get the last jobid with "flux job last" (this is what we will do here)
# JOBID="ƒFoRYVpt7"

# Note here we are using flux job last to see the last one
# The "R" here asks for the resource spec
!flux job info $(flux job last) R

# When we attach it will direct us to our output file
!flux job attach $(flux job last)

# And we can look at the output file to see our expected output!
!cat /tmp/cheese.txt

To list jobs in all states, including those that have completed, run `flux jobs -a`:

In [None]:
!flux jobs -a

To restrict the output to failed jobs (i.e., jobs that exit with a nonzero exit code, time out, or are canceled or killed), run:

In [None]:
!flux jobs -f failed

# Flux Process and Job Utilities
## Flux top
Flux provides a full-featured version of `top` for nested Flux instances and jobs. In the JupyterLab terminal, invoke `flux top` to see the "sleep" jobs. If they have already completed you can resubmit them. 

We recommend not running `flux top` in the notebook as it is not designed to display output from a command that runs continuously.

## Flux pstree
Just as `flux top` mirrors `top`, `flux pstree` mirrors `pstree` and displays job hierarchies. Try it out in the JupyterLab terminal or here in the notebook.

## Flux proxy
### Interacting with a job hierarchy with `flux proxy`
#### Example 1
`flux proxy` routes messages to and from a Flux instance. We can use `flux proxy` to connect to a running Flux instance and then submit more nested jobs inside it. You may want to edit `sleep_batch.sh` with the JupyterLab text editor (double click the file in the window on the left) to sleep for `60` or `120` seconds. Then from the JupyterLab terminal, run: 

`flux batch --nslots=2 --cores-per-slot=1 --nodes=2 ./sleep_batch.sh` # outputs JOBID

`flux pstree -x`

`flux proxy JOBID` # this connects you to the Flux instance corresponding to JOBID above

`flux uptime` # Note the depth is now 1 and the size is 2: we're one level deeper in a Flux hierarchy and we have only 2 brokers now.

`flux resource list` # This instance has 2 "nodes" and 2 cores allocated to it

`flux top`

#### Example 2
Contents of `sleeps.sh`:

``` bash 
#!/bin/bash

flux submit -N1 sleep 30
flux submit -N1 sleep 30
flux submit -N1 sleep 30
flux submit -N1 sleep 30
flux queue drain
```
Note the `flux queue drain` command, which waits for the sub-jobs to complete. From the JupyterLab terminal, run:

`flux batch -N2 sleeps.sh` # outputs JOBID

`flux proxy JOBID`

`flux jobs`

#### Example 3
Here's an example of submitting jobs within a nested instance. You can run this example here in the notebook.

In [None]:
!cat sub_job1.sh
!cat sub_job2.sh

In [None]:
!flux batch -N1 ./sub_job1.sh

Here is how to try flux pstree, which normally can show jobs in an instance, but it has limited functionality given we are in a notebook! So instead of just running the single command, let's add "-a" to indicate "show me ALL jobs."
More complex jobs and in a different environment would have deeper nesting. You can [see examples here](https://flux-framework.readthedocs.io/en/latest/jobs/hierarchies.html?h=pstree#flux-pstree-command).


In [None]:
!flux pstree -a

## Submission API
Flux also provides first-class Python bindings which can be used to submit jobs programmatically. The following script shows this with the `flux.job.submit()` call:

In [None]:
import os
import json
import flux
from flux.job import JobspecV1
from flux.job.JobID import JobID

In [None]:
f = flux.Flux() # connect to the running Flux instance
compute_jobreq = JobspecV1.from_command(
    command=["./compute.py", "120"], num_tasks=1, num_nodes=1, cores_per_task=1
) # construct a jobspec
compute_jobreq.cwd = os.path.expanduser("../flux-workflow-examples/job-submit-api/") # set the CWD
print(JobID(flux.job.submit(f,compute_jobreq)).f58) # submit and print out the jobid (in f58 format)

In [None]:
!flux jobs -a | grep compute
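
Submitting is usually only half of the story: a script often wants to block until the job finishes. The sketch below shows one way to do that with the Python bindings, using `flux.job.submit(..., waitable=True)` followed by `flux.job.wait()`. The import is guarded so the cell also loads where the bindings are missing; `submit_and_wait` is a hypothetical helper name, and the single-task, single-core request is just an assumption for the example.

```python
try:
    import flux
    import flux.job
    from flux.job import JobspecV1
except ImportError:  # the Flux Python bindings are not installed here
    flux = None


def submit_and_wait(command):
    """Submit `command` as a one-task job, wait for it, and return (jobid, success)."""
    if flux is None:
        raise RuntimeError("Flux Python bindings are not available")
    handle = flux.Flux()  # connect to the enclosing Flux instance
    jobspec = JobspecV1.from_command(command=command, num_tasks=1, cores_per_task=1)
    # waitable=True is required for flux.job.wait() to work on this job.
    jobid = flux.job.submit(handle, jobspec, waitable=True)
    result = flux.job.wait(handle, jobid)
    return jobid, result.success
```

Inside the tutorial's Flux instance, `submit_and_wait(["hostname"])` should return the jobid together with `True` once the job has completed successfully.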

Under the hood, the `Jobspec` class is creating a YAML document that ultimately gets serialized as JSON and sent to Flux for ingestion, validation, queueing, scheduling, and eventually execution.  We can dump the raw JSON jobspec that is submitted, where we can see the exact resources requested and the task set to be executed on those resources.

In [None]:
print(compute_jobreq.dumps(indent=2))
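
If you cannot run the cell above, the following hand-written dictionary approximates what a version 1 jobspec looks like for one task on one node with one core. Treat it as a sketch of the general layout (resources, tasks, attributes, version) rather than the exact output of `dumps()`: field details can vary between Flux versions, and `"duration": 0` is assumed here to mean that no time limit was requested. `count_cores` is a hypothetical helper.

```python
import json

# Hand-written approximation of a version 1 jobspec; not captured from Flux.
jobspec = {
    "resources": [
        {
            "type": "node",
            "count": 1,
            "with": [
                {
                    "type": "slot",
                    "count": 1,
                    "label": "task",
                    "with": [{"type": "core", "count": 1}],
                }
            ],
        }
    ],
    "tasks": [
        {"command": ["./compute.py", "120"], "slot": "task", "count": {"per_slot": 1}}
    ],
    "attributes": {"system": {"duration": 0}},
    "version": 1,
}


def count_cores(resources, multiplier=1):
    """Multiply counts down the resource tree and sum the resulting core totals."""
    total = 0
    for res in resources:
        count = multiplier * res["count"]
        if res["type"] == "core":
            total += count
        total += count_cores(res.get("with", []), count)
    return total


print(json.dumps(jobspec, indent=2))
print("total cores requested:", count_cores(jobspec["resources"]))
```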

We can then replicate our previous example of submitting multiple heterogeneous jobs and testing that Flux co-schedules them.

In [None]:
compute_jobreq = JobspecV1.from_command(
    command=["./compute.py", "120"], num_tasks=4, num_nodes=2, cores_per_task=2
)
compute_jobreq.cwd = os.path.expanduser("../flux-workflow-examples/job-submit-api/")
print(JobID(flux.job.submit(f, compute_jobreq)))

io_jobreq = JobspecV1.from_command(
    command=["./io-forwarding.py", "120"], num_tasks=1, num_nodes=1, cores_per_task=1
)
io_jobreq.cwd = os.path.expanduser("../flux-workflow-examples/job-submit-api/")
print(JobID(flux.job.submit(f, io_jobreq)))

In [None]:
!flux jobs -a | grep compute

We can use the `FluxExecutor` class to submit large numbers of jobs to Flux. This method uses Python's `concurrent.futures` interface.  Example snippet from `~/flux-workflow-examples/async-bulk-job-submit/bulksubmit_executor.py`:

``` python 
with FluxExecutor() as executor:
    compute_jobspec = JobspecV1.from_command(args.command)
    futures = [executor.submit(compute_jobspec) for _ in range(args.njobs)]
    # wait for the jobid for each job, as a proxy for the job being submitted
    for fut in futures:
        fut.jobid()
    # all jobs submitted - print timings

In [None]:
# Submit a FluxExecutor based script.
%run ../flux-workflow-examples/async-bulk-job-submit/bulksubmit_executor.py -n200 /bin/sleep 0
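
If the `concurrent.futures` pattern is new to you, here is the same submit-then-collect shape using only the standard library. This is a stand-in, not Flux code: `ThreadPoolExecutor` takes the place of `FluxExecutor`, and `fake_job` is a made-up function that returns a string instead of running a job.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_job(index):
    """Stand-in for a job; returns a made-up result instead of running anything."""
    return f"job-{index} done"


with ThreadPoolExecutor(max_workers=4) as executor:
    # submit() returns immediately with a future for each piece of work
    futures = [executor.submit(fake_job, i) for i in range(5)]
    # result() blocks until the corresponding work has finished
    results = [fut.result() for fut in futures]

print(results)
```

The key design point carries over: submission returns futures right away, so many jobs can be in flight before any result is collected.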

# Diving Deeper Into Flux's Internals

Flux uses [hwloc](https://github.com/open-mpi/hwloc) to detect the resources on each node and then to populate its resource graph.

You can access the topology information that Flux collects with the `flux resource` subcommand:

In [None]:
!flux resource list

Flux can also bootstrap its resource graph from static input files, as in the case of a multi-user system instance set up by site administrators.  [More information on Flux's static resource configuration files](https://flux-framework.readthedocs.io/en/latest/adminguide.html#resource-configuration).  Flux provides a more standard interface for listing available resources that works regardless of the resource input source: `flux resource`.

In [None]:
# To view status of resources
!flux resource status

Flux has a command for controlling the queue within the `job-manager`: `flux queue`.  This includes disabling job submission, re-enabling it, waiting for the queue to become idle or empty, and checking the queue status:

In [None]:
!flux queue disable "maintenance outage"
!flux queue enable
!flux queue -h

Each Flux instance has a set of attributes that are set at startup that affect the operation of Flux, such as `rank`, `size`, and `local-uri` (the Unix socket usable for communicating with Flux).  Many of these attributes can be modified at runtime, such as `log-stderr-level` (1 logs only critical messages to stderr while 7 logs everything, including debug messages).

In [None]:
!flux getattr rank
!flux getattr size
!flux getattr local-uri
!flux setattr log-stderr-level 3
!flux lsattr -v
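
The numeric log levels are easier to remember with names attached. The sketch below assumes `log-stderr-level` follows the usual syslog(3) severity numbering, where lower numbers are more severe and a threshold admits everything at or below it. `levels_logged` is a hypothetical helper for illustration.

```python
# Severity names in syslog(3) numbering, from most severe (0) to least severe (7).
SYSLOG_LEVELS = ["emerg", "alert", "crit", "err", "warning", "notice", "info", "debug"]


def levels_logged(threshold):
    """Return the severity names at or below `threshold` (lower numbers are more severe)."""
    return SYSLOG_LEVELS[: threshold + 1]


print(levels_logged(3))  # what a threshold of 3 would let through
print(levels_logged(7))  # everything, including debug messages
```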

Services within a Flux instance are implemented by modules. To query and manage broker modules, use `flux module`.  Modules that we have already directly interacted with in this tutorial include `resource` (via `flux resource`), `job-ingest` (via `flux` and the Python API), `job-list` (via `flux jobs`), and `job-manager` (via `flux queue`), and we will interact with the `kvs` module in a few cells. For the most part, services are implemented by modules of the same name (e.g., `kvs` implements the `kvs` service and thus the `kvs.lookup` RPC).  In some circumstances, where multiple implementations for a service exist, a module of a different name implements a given service (e.g., in this instance, `sched-fluxion-qmanager` provides the `sched` service and thus `sched.alloc`, but in another instance `sched-simple` might provide the `sched` service).

In [None]:
!flux module list

We can actually unload the Fluxion modules (the scheduler modules from flux-sched) and replace them with `sched-simple` (the scheduler that comes built into flux-core) as a demonstration of this functionality:

In [None]:
!flux module unload sched-fluxion-qmanager
!flux module unload sched-fluxion-resource
!flux module load sched-simple
!flux module list

We can now reload the Fluxion scheduler, but this time, let's pass some extra arguments to specialize our Flux instance.  In particular, let's populate our resource graph with nodes, sockets, and cores and limit the scheduling depth to 4.

In [None]:
!flux dmesg -C
!flux module unload sched-simple
!flux module load sched-fluxion-resource load-allowlist=node,socket,core
!flux module load sched-fluxion-qmanager queue-params=queue-depth=4
!flux module list
!flux dmesg | grep queue-depth

The key-value store (KVS) is a core component of a Flux instance. The `flux kvs` command provides a utility to list and manipulate values of the KVS. Modules of Flux use the KVS to persistently store information and retrieve it later on (potentially after a restart of Flux).  One example of KVS use by Flux is the `resource` module, which stores the resource set `R` of the current Flux instance:

In [None]:
!flux kvs ls 
!flux kvs ls resource
!flux kvs get resource.R | jq
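
Once you have `R` as JSON, it is straightforward to summarize it in Python. The sketch below works on a made-up sample in what is assumed to be the version 1 `R` layout (an `execution.R_lite` list with `rank` and `children` idsets). The sample values are invented for illustration, and `idset_size` and `summarize_R` are hypothetical helpers.

```python
import json

# Made-up R document for illustration only; not captured from a real instance.
SAMPLE_R = """
{"version": 1,
 "execution": {"R_lite": [{"rank": "0-3", "children": {"core": "0-1"}}],
               "starttime": 0.0, "expiration": 0.0,
               "nodelist": ["node[0-3]"]}}
"""


def idset_size(idset):
    """Count the ids in a simple idset such as "0-3" or "0,2-3"."""
    size = 0
    for part in idset.split(","):
        first, _, last = part.partition("-")
        size += int(last or first) - int(first) + 1
    return size


def summarize_R(R):
    """Return (number of ranks, total number of cores) described by R."""
    nranks = ncores = 0
    for entry in R["execution"]["R_lite"]:
        ranks = idset_size(entry["rank"])
        nranks += ranks
        ncores += ranks * idset_size(entry["children"]["core"])
    return nranks, ncores


nranks, ncores = summarize_R(json.loads(SAMPLE_R))
print(f"ranks={nranks} cores={ncores}")
```

In a live instance, the output of `flux kvs get resource.R` could be passed to `json.loads` in place of `SAMPLE_R`.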

Flux provides a built-in mechanism for executing commands on nodes without requiring a job or resource allocation: `flux exec`.  `flux exec` is typically used by system administrators to execute administrative commands and load/unload modules across multiple ranks simultaneously.

In [None]:
!flux exec -r 2 flux getattr rank # only execute on rank 2

In [None]:
!flux exec flux getattr rank # execute on all ranks

# This concludes the notebook tutorial. 😭️😄️

Don't worry, you'll have more opportunities for using Flux! We hope you reach out to us on any of our [project repositories](https://flux-framework.org) and ask any questions that you have. We'd love your contribution to code, documentation, or just saying hello! 👋️ If you have feedback on the tutorial, please let us know so we can improve it for next year. 

> But what do I do now?

Feel free to experiment more with Flux here, or (for more freedom) in the terminal. You can try more of the examples in the flux-workflow-examples directory one level up in the window to the left. If you're using a shared system like the one on the RADIUSS AWS tutorial, please be mindful of other users and don't run compute-intensive workloads. If you're running the tutorial in a job on an HPC cluster... compute away! ⚾️

> Where can I learn to set this up on my own?

If you're interested in installing Flux on your cluster, take a look at the [system instance instructions](https://flux-framework.readthedocs.io/en/latest/adminguide.html). If you are interested in running Flux on Kubernetes, check out the [Flux Operator](https://github.com/flux-framework/flux-operator). 

![https://flux-framework.org/flux-operator/_static/images/flux-operator.png](https://flux-framework.org/flux-operator/_static/images/flux-operator.png)

> See you next year! 👋️😎️