get through steps of providing command for final burst
For the time being, while we are developing, it is nice to have all the configs
generated along with a start script that we can run manually. This commit
adds all of these steps; next I need to move it onto an allocation and test
how to actually turn on the other hosts.

Signed-off-by: vsoch <vsoch@users.noreply.github.com>
vsoch committed Jul 14, 2023
1 parent 34126a6 commit 1fc934f
Showing 7 changed files with 258 additions and 63 deletions.
1 change: 0 additions & 1 deletion README.md
@@ -7,7 +7,6 @@ in the context of simply starting a set of nodes that are alongside one another
in an allocation.

For instructions, see the [main flux-burst repository](https://github.com/converged-computing/flux-burst).
Tutorials are available under the [flux operator](https://github.com/flux-framework/flux-operator/tree/main/examples/experimental/bursting)

![https://raw.githubusercontent.com/converged-computing/flux-burst/main/docs/assets/img/logo.png](https://raw.githubusercontent.com/converged-computing/flux-burst/main/docs/assets/img/logo.png)

102 changes: 100 additions & 2 deletions example/README.md
@@ -13,6 +13,9 @@ lead broker. We are choosing this design because likely a local burst will need
to do (and automate) both steps.

```bash
# Ensure the installed executable is on your path
export PATH=$HOME/.local/bin:$PATH

# If you are using a custom flux install
$ python burst-slurm-allocation.py --config-dir ./configs --flux-root /path/to/flux/root --network-device eno1

@@ -22,6 +25,101 @@
$ python burst-slurm-allocation.py --config-dir ./configs --network-device eno1
# Development with one node (e.g., DevContainer)
$ python3 burst-slurm-allocation.py --config-dir ./configs --network-device eno1 --hostnames $(hostname)
```
```
🌳️ Flux root set to /usr
🦩️ Writing flux config to /workspaces/flux-burst-local/example/configs/system/system.toml
🌀️ Done! Use the following command to start your Flux instance and burst!
It is also written to /workspaces/flux-burst-local/example/configs/start.sh
/usr/bin/flux start --broker-opts --config /workspaces/flux-burst-local/example/configs -Stbon.fanout=256 -Srundir=/workspaces/flux-burst-local/example/configs/run -Sstatedir=/workspaces/flux-burst-local/example/configs/run -Slocal-uri=local:///workspaces/flux-burst-local/example/configs/run/local -Slog-stderr-level=7 -Slog-stderr-mode=local /home/vscode/.local/bin/flux-burst-local --config-dir /workspaces/flux-burst-local/example/configs --flux-root /usr
```
The script above sets up the configs and gives you a command that uses them
to start a Flux instance and then run a more standard flux-burst plugin flow (with the same
local configs) so you can burst to local instances. Note that the flux root should have lib, bin, libexec, etc. in it; it's the `--prefix`
you chose for the install. Here is what the generated tree looks like under configs:

```bash
tree ./configs
```
```console
$ tree example/configs/
example/configs/
├── curve.cert
├── R
├── run
│   └── content.sqlite
├── start.sh
└── system
└── system.toml

2 directories, 5 files
```
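The `system.toml` under `system` uses flux's bootstrap TOML format. The sketch below is only illustrative of the kind of content that ends up there; the keys come from flux-config-bootstrap, but the values (port, interface, hostnames) are assumptions for this walkthrough rather than the exact generated file:

```toml
# Illustrative only: real values come from the generator and your nodes
[bootstrap]
curve_cert = "/workspaces/flux-burst-local/example/configs/curve.cert"
default_port = 8050
default_bind = "tcp://eno1:%p"
default_connect = "tcp://%h:%p"
hosts = [
    { host = "c35948d1ed31" },
]
```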
The sockets (e.g., "local") will be generated under `run`. Here is what startup looks like before the
secondary brokers have been started:

```console
$ bash configs/start.sh
broker.debug[0]: insmod connector-local
broker.info[0]: start: none->join 0.6186ms
broker.info[0]: parent-none: join->init 0.034561ms
connector-local.debug[0]: allow-guest-user=false
connector-local.debug[0]: allow-root-owner=false
broker.debug[0]: insmod barrier
broker.debug[0]: insmod content-sqlite
content-sqlite.debug[0]: /workspaces/flux-burst-local/example/configs/run/content.sqlite (0 objects) journal_mode=WAL synchronous=NORMAL
broker.debug[0]: content backing store: enabled content-sqlite
broker.debug[0]: insmod kvs
broker.debug[0]: insmod kvs-watch
broker.debug[0]: insmod resource
resource.debug[0]: reslog_cb: resource-init event posted
resource.debug[0]: reslog_cb: resource-define event posted
broker.debug[0]: insmod cron
cron.info[0]: synchronizing cron tasks to event heartbeat.pulse
broker.debug[0]: insmod job-manager
job-manager.debug[0]: jobtap plugin .history registered method job-manager.history.get
job-manager.info[0]: restart: 0 jobs
job-manager.info[0]: restart: 0 running jobs
job-manager.info[0]: restart: checkpoint.job-manager not found
job-manager.debug[0]: restart: max_jobid=ƒ1
job-manager.debug[0]: duration-validator: updated expiration to 0.00
broker.debug[0]: insmod job-info
broker.debug[0]: insmod job-list
job-list.debug[0]: job_state_init_from_kvs: read 0 jobs
broker.debug[0]: insmod job-ingest
job-ingest.debug[0]: configuring validator with plugins=(null), args=(null) (enabled)
job-ingest.debug[0]: fluid ts=1ms
broker.debug[0]: insmod job-exec
job-exec.debug[0]: using default shell path /usr/libexec/flux/flux-shell
broker.debug[0]: insmod heartbeat
broker.info[0]: rc1.0: running /etc/flux/rc1.d/01-sched-fluxion
broker.debug[0]: insmod sched-fluxion-resource
sched-fluxion-resource.info[0]: version 0.27.0-38-ge0b49993
sched-fluxion-resource.debug[0]: mod_main: resource module starting
sched-fluxion-resource.warning[0]: create_reader: allowlist unsupported
sched-fluxion-resource.debug[0]: resource graph datastore loaded with rv1exec reader
sched-fluxion-resource.info[0]: populate_resource_db: loaded resources from core's resource.acquire
sched-fluxion-resource.debug[0]: resource status changed (rankset=[all] status=DOWN)
sched-fluxion-resource.debug[0]: mod_main: resource graph database loaded
broker.debug[0]: insmod sched-fluxion-qmanager
sched-fluxion-qmanager.info[0]: version 0.27.0-38-ge0b49993
sched-fluxion-qmanager.debug[0]: service_register
sched-fluxion-qmanager.debug[0]: enforced policy (queue=default): fcfs
sched-fluxion-qmanager.debug[0]: effective queue params (queue=default): default
sched-fluxion-qmanager.debug[0]: effective policy params (queue=default): default
sched-fluxion-qmanager.debug[0]: handshaking with sched-fluxion-resource completed
job-manager.debug[0]: scheduler: hello
job-manager.debug[0]: scheduler: ready unlimited
sched-fluxion-qmanager.debug[0]: handshaking with job-manager completed
broker.info[0]: rc1.0: running /etc/flux/rc1.d/02-cron
broker.info[0]: rc1.0: /etc/flux/rc1 Exited (rc=0) 0.4s
broker.info[0]: rc1-success: init->quorum 0.3982s
broker.debug[0]: groups: broker.online=0
broker.info[0]: online: c35948d1ed31 (ranks 0)
broker.info[0]: quorum-full: quorum->run 0.100979s
resource.debug[0]: reslog_cb: online event posted
sched-fluxion-resource.debug[0]: resource status changed (rankset=[0] status=UP)
TODO START OTHER WORKERS
...
```
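Once the lead broker is up, you can talk to it from another shell through the local socket under `run`. A minimal sketch, assuming the same config directory as above (the URI matches the `-Slocal-uri` option in the generated start command):

```bash
# Point flux commands at the lead broker started by start.sh
export FLUX_URI=local:///workspaces/flux-burst-local/example/configs/run/local

# Sanity checks against the running instance
flux getattr size
flux resource list
```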

I wasn't able to get an allocation, so I'll continue developing this tomorrow.
31 changes: 3 additions & 28 deletions example/burst-slurm-allocation.py
@@ -58,36 +58,11 @@ def main():
    # {'gke': <module 'fluxburst_gke' from '/home/flux/.local/lib/python3.8/site-packages/fluxburst_gke/__init__.py'>}

    # Load our plugin and provide the dataclass to it!
    # Unlike other plugins, the local one handles setting up the flux instance
    # (and then issuing the burst). This could change (e.g., if we have already
    # generated configs or started the cluster).
    client.load("local", params)

    # Sanity check loaded
    print(f"flux-burst client is loaded with plugins for: {client.choices}")

    # We are using the default algorithms to filter the job queue and select jobs.
    # If we weren't, we would add them via:
    # client.set_ordering()
    # client.set_selector()

    # Here is how we can see the jobs that are contenders to burst!
    # client.select_jobs()

    # Now let's run the burst! The active plugins will determine if they
    # are able to schedule a job, and if so, will do the work needed to
    # burst. Unmatched jobs (those we weren't able to schedule) are
    # returned, maybe to do something with? Note that the default mock
    # generates an N=4 job. For compute engine that will be 3 compute
    # nodes and 1 login node.
    unmatched = client.run_burst()
    assert not unmatched
    plugin = client.plugins["compute_engine"]
    print(
        f"Terraform configs and working directory are found at {plugin.params.terraform_dir}"
    )
    input("Press Enter when you are ready to destroy...")

    # Get a handle to the plugin so we can cleanup!
    plugin.cleanup()


if __name__ == "__main__":
    main()
3 changes: 2 additions & 1 deletion fluxburst_local/__init__.py
@@ -17,9 +17,10 @@ def init(dataclass, **kwargs):
    this means starting another flux instance with the resources.
    If SLURM we assume we are inside a SLURM allocation.
    """
    from .plugin import FluxBurstSlurm, SlurmBurstParameters
    from .plugin import FluxBurstLocal, FluxBurstSlurm, SlurmBurstParameters

    if isinstance(dataclass, SlurmBurstParameters):
        # Set variables from slurm
        FluxBurstSlurm.setup(dataclass)
        return FluxBurstSlurm(dataclass, **kwargs)
    return FluxBurstLocal(dataclass, **kwargs)
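A hedged sketch of how a caller exercises this dispatch; only parameters that already appear elsewhere in this commit are used, and the values are illustrative:

```python
from fluxburst_local import init
from fluxburst_local.plugin import BurstParameters

# Not a SlurmBurstParameters instance, so init() falls through to FluxBurstLocal
params = BurstParameters(
    flux_root="/usr",        # the --prefix of the flux install
    config_dir="./configs",  # where configs were (or will be) generated
    regenerate=False,        # reuse configs that already exist
)
plugin = init(params)
```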
75 changes: 75 additions & 0 deletions fluxburst_local/flux.py
@@ -0,0 +1,75 @@
#!/usr/bin/env python

# This is a script called by flux-burst-local, and it's assumed that files
# are generated in the directed config directory, and we've started the
# main broker and can now burst.

import argparse

from fluxburst.client import FluxBurst

# How we provide custom parameters to a flux-burst plugin
from fluxburst_local.plugin import BurstParameters


def get_parser():
    parser = argparse.ArgumentParser(
        description="Flux Local Broker Start",
        formatter_class=argparse.RawTextHelpFormatter,
    )
    parser.add_argument("--config-dir", help="Configuration directory for flux")
    parser.add_argument(
        "--flux-root", help="Flux root (should correspond with broker running Flux)"
    )
    return parser


def main():
    parser = get_parser()
    args, _ = parser.parse_known_args()

    # Create the dataclass for the plugin config
    # We use a dataclass because it does implicit validation of required params, etc.
    params = BurstParameters(
        flux_root=args.flux_root,
        config_dir=args.config_dir,
        # This says to not re-generate our configs!
        regenerate=False,
    )
    assert params
    client = FluxBurst()

    # For debugging, here is a way to see plugins available
    # import fluxburst.plugins as plugins
    # print(plugins.burstable_plugins)
    print("TODO START OTHER WORKERS")

    # Load our plugin and provide the dataclass to it!
    # client.load("local", params)

    # Sanity check loaded
    client = FluxBurst()
    print(f"flux-burst client is loaded with plugins for: {client.choices}")

    # We are using the default algorithms to filter the job queue and select jobs.
    # If we weren't, we would add them via:
    # client.set_ordering()
    # client.set_selector()

    # Here is how we can see the jobs that are contenders to burst!
    # client.select_jobs()

    # Now let's run the burst! The active plugins will determine if they
    # are able to schedule a job, and if so, will do the work needed to
    # burst. Unmatched jobs (those we weren't able to schedule) are
    # returned, maybe to do something with? Note that the default mock
    # generates an N=4 job. For compute engine that will be 3 compute
    # nodes and 1 login node.
    unmatched = client.run_burst()
    assert not unmatched
    plugin = client.plugins["local"]
    print(plugin)


if __name__ == "__main__":
    main()
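Since the module keeps an `if __name__ == "__main__"` guard, it can also be run directly while iterating, instead of going through the `flux-burst-local` entry point that the generated start command uses (both take the same `--config-dir` and `--flux-root` flags). A minimal sketch, assuming the package is installed and the lead broker from the example above is already running (e.g., with `FLUX_URI` pointed at its local socket):

```bash
python -m fluxburst_local.flux \
    --config-dir /workspaces/flux-burst-local/example/configs \
    --flux-root /usr
```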
