New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: adding flux executor #1810
Conversation
eb0dff2
to
f132f1d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome, flux looks great! I am eager to try it on our local cluster here. I have some comments below.
snakemake/executors/flux.py
Outdated
|
||
def format_job_exec(self, job): | ||
""" | ||
We don't want a python -m to run. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not? Is there any downside? If this is really needed to be different, rather make this specific part a separate overwritable method in the AbstractExecutor, instead of copying the entire method from there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure, but in my example the python -m didn't work, but running snakemake did. I think the issue I was running into is having a different flux install (provided in the container) with the one that works with snakemake (via miniconda). I couldn't get them working from the same one - so python -m won't work (the python does not know about snakemake as it's not installed there) but calling the executable directly does. This is likely an artifact of the container I'm using (already built by the flux crew) but if I added python -m it won't work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mhm, I think we should try to understand that. So, how is that with flux. Do jobs always run inside of a container, or only in case of cloud mode?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Flux is a resource manager similar to Slurm, PBS, Torque, etc in that it is designed to run across a set of resources, as on a cluster for example. However, it can also be started as a job within one of those resource managers (including itself), to give full RM capabilities to the job. It can also be started standalone on a node, VM, or container to help manage local resources.
I guess to answer more specifically, Flux is not typically used inside a container, but nothing prevents the user from doing that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By default flux propagates the entire environment captured where flux run
or similar was invoked, because that tends to be what a user wants, but it provides a variety of ways to cur that down or override it as well. It's situational enough that I would think we'd probably want to do what a snakemake user would expect, whatever that is.
To answer the earlier question, flux doesn't run jobs in containers, though that could be supported. It sounds like there were two python environments installed, and the one with flux bindings is different from the one with snakemake installed. We could look at making it easier to install the flux binding components into the same environment, or snakemake into the one flux is using to get around the python -m
problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@trws that would be fantastic - is this something I could help with? I don't remember the exact issues, but I couldn't get snakemake installed into the default Python in the container.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, you probably could, we should chat about that separately. It would take some re-packaging of the flux python binding component so it can build separately from the autotools-based build of flux to make it convenient. That's something we have several customers asking for, so if you're up for it that would be really easy to support.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep - would love to help with that! Will ping you tomorrow (and leave me a message in chat if you want to show me stuff sooner).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was there any progress with this? I still would like to get rid of this repetition of superclass code.
Would be nice to set up a flux instance in some github action before the actual snakemake tests. |
Signed-off-by: vsoch <vsoch@users.noreply.github.com>
Signed-off-by: vsoch <vsoch@users.noreply.github.com>
Signed-off-by: vsoch <vsoch@users.noreply.github.com>
Signed-off-by: vsoch <vsoch@users.noreply.github.com>
Signed-off-by: vsoch <vsoch@users.noreply.github.com>
Signed-off-by: vsoch <vsoch@users.noreply.github.com>
…ime in seconds Signed-off-by: vsoch <vsoch@users.noreply.github.com>
15317ab
to
251b178
Compare
try being the root user and switching to fluxuser Signed-off-by: vsoch <vsoch@users.noreply.github.com>
Signed-off-by: vsoch <vsoch@users.noreply.github.com>
98a2b9a
to
0a14a16
Compare
Signed-off-by: vsoch <vsoch@users.noreply.github.com>
Signed-off-by: vsoch <vsoch@users.noreply.github.com>
woooo @johanneskoester 38th time is the charm!! I got the flux workflow running in the test!! 🥳 |
Sorry for using all the CI cycles 😆 |
Nice! |
So, @vsoch anything still missing here or is it ready for a final review? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, sorry it took me so long to get around to this. A few minor comments below, feel free to ignore if you think they aren't worthwhile.
if not flux: | ||
sys.exit( | ||
"Cannot import flux. Is a cluster available to you with Python bindings?" | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know about snakemake
but usually sys.exit
is used (if at all) in extreme scenarios. I think it might be more 'polite' to throw an ImportError
, either directly or by running import flux
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree - it should raise a workflow error - fixing shortly!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is updated to a WorkflowError - I think it could also be an ImportError, but it does reflect an issue with the workflow running. (posting in correct spot)
Signed-off-by: vsoch <vsoch@users.noreply.github.com>
Signed-off-by: vsoch <vsoch@users.noreply.github.com>
Sonarbot be like "stop pushing @vsoch I was already running your tests!" |
I think Windows was a transient error - triggered a re-run to test that. |
@johanneskoester I think this is ready to go, @trws said yesterday it looked good. |
Signed-off-by: vsoch <vsoch@users.noreply.github.com>
okay @jameshcorbett you were right that we don't need the import there - I'm not sure if it was something about the job formatted, but it's gone now so moving to the top works! I also removed a huge chunk of code that we needed before futures (but no longer need). I've tested the updates locally and all is working. |
oh no snakemake why you doing these random failures!! I'll try a re-run a bit later. |
@johanneskoester we are updated and green!!! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some final stuff below. Sorry for the delay!
Signed-off-by: vsoch <vsoch@users.noreply.github.com>
snakemake/executors/flux.py
Outdated
|
||
def format_job_exec(self, job): | ||
""" | ||
We don't want a python -m to run. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was there any progress with this? I still would like to get rid of this repetition of superclass code.
@johanneskoester I don't feel super strongly about this - if you just want it moved to the init I'd be happy to do that! |
Oh wait, this was different: def format_job_exec(self, job):
"""
We don't want a python -m to run.
"""
prefix = self.get_job_exec_prefix(job)
if prefix:
prefix += " &&"
suffix = self.get_job_exec_suffix(job)
if suffix:
suffix = f"&& {suffix}"
return join_cli_args(
[
prefix,
self.get_envvar_declarations(),
"snakemake",
format_cli_arg("--snakefile", self.get_snakefile()),
self.get_job_args(job),
self.general_args,
suffix,
]
) I couldn't get this to work without this function because we can't be sure that the Flux Python is installed to the same python install as snakemake. With the way Flux is installed now, it's more likely to be installed to a system python (and the user will be using their own Python environment). If we do python -m then (from my testing) the job is going to fail. By just calling snakemake directly, we don't require the python where the flux is installed to be the same as where the snakemake binary is. I tried to get this working the other way and failed multiple times. |
I don't really get this. So there is the flux python bindings, which apparently you can install in the same env as snakemake (and have to, since otherwise your executor code would not even work). But the job which is executed by flux, is just called from the environment where flux itself (like their core system, whatever) is installed into? And there is no way to tell flux about the execution env the job shall run in? Or configure that somehow? Also see here. Snakemake does not simply call any python by default, but rather wants to reuse the python executable of the host process by default. Of course, this only works in a shared filesystem like case (hence for k8s or GLS we use a different approach, but there we actually control the container the job is executed in). Not if flux spawns jobs in separate containers. So we really need to get clear how jobs are executed by flux. As far as I understood the docs so far, it is (at least for now) indeed acting very similar to slurm, based on a shared FS and so on. Then, just using the superclass method should just work. So, is the problem indeed just the test setup with the container? Then, we could probably just install flux into the github actions, thereby more closely mimicking a real setup? |
The only way I could get this to work (in the CI and the container) is to install snakemake to the same python as flux, having them separate never worked. I'll prepare you an example that shows the error (or proves me wrong and it works) and then maybe you can make a suggestion as to what to try next.
I've never tried this, but I'd advocate to use the container so 1. a user can reproduce, and 2. we can always be sure to use the latest flux container provided (and not fall out of sync / require updating two things). More ideally we'd have a GitHub action that installs flux (akin to the container) and then we can specify the Python to install to, but I haven't made that yet. |
I think this might be an issue with the Python install - let me test something out and see if it resolves the issue - I forgot that these containers have python3 (but not python). Be back in a bit! |
and python3->python link is required! Signed-off-by: vsoch <vsoch@users.noreply.github.com>
Okay just pushed a change without it - I think the underlying issue was not having python exposed (and I didn't interpret the error message correctly the first time). I'm also thinking it would be hugely useful to have native flux install actions - going to work on that soon in case we want to use here instead of the container! |
Signed-off-by: vsoch <vsoch@users.noreply.github.com>
Signed-off-by: vsoch <vsoch@users.noreply.github.com>
Kudos, SonarCloud Quality Gate passed! |
Thanks @johanneskoester ! And apologies if the docs aren't clear - it's good feedback indeed. |
🤖 I have created a release *beep* *boop* --- ## [7.15.0](v7.14.2...v7.15.0) (2022-10-04) ### Features * adding flux executor ([#1810](#1810)) ([40d2bd0](40d2bd0)) ### Bug Fixes * Add back logging of run directives ([#1883](#1883)) ([a65559c](a65559c)) ### Documentation * fix grammar in the intro ([#1859](#1859)) ([774bc6a](774bc6a)) * fix typo ([#1843](#1843)) ([6572ad9](6572ad9)) --- This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please). Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Not that I know of, if it covers what snakemake needs I'm good.
|
Signed-off-by: vsoch vsoch@users.noreply.github.com
Description
This will add the Flux Executor to snakemake. Flux is a very programmatic job manager (maybe that's too limited a term) but it can be run in the context of HPC or cloud! I've written a containerized tutorial so reviewers can test it out. Also pinging @trws and @grondo to take a look.
@johanneskoester since this requires a new executor backend, let me know how you'd like to test. We do have a container so we could try a strategy that uses that.
QC
docs/
) is updated to reflect the changes or this is not necessary (e.g. if the change does neither modify the language nor the behavior or functionalities of Snakemake).