Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: adding flux executor #1810

Merged
merged 33 commits into from Oct 4, 2022
Merged

Conversation

vsoch
Copy link
Contributor

@vsoch vsoch commented Aug 12, 2022

Signed-off-by: vsoch vsoch@users.noreply.github.com

Description

This will add the Flux Executor to snakemake. Flux is a very programmatic job manager (maybe that's too limited a term) but it can be run in the context of HPC or cloud! I've written a containerized tutorial so reviewers can test it out. Also pinging @trws and @grondo to take a look.

@johanneskoester since this requires a new executor backend, let me know how you'd like to test. We do have a container so we could try a strategy that uses that.

QC

  • The PR contains a test case for the changes or the changes are already covered by an existing test case.
  • The documentation (docs/) is updated to reflect the changes or this is not necessary (e.g. if the change does neither modify the language nor the behavior or functionalities of Snakemake).

@vsoch vsoch changed the title Adding flux executor feat: adding flux executor Aug 12, 2022
@vsoch vsoch force-pushed the add/flux-executor branch 3 times, most recently from eb0dff2 to f132f1d Compare August 13, 2022 02:58
Copy link
Contributor

@johanneskoester johanneskoester left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, flux looks great! I am eager to try it on our local cluster here. I have some comments below.

docs/executor_tutorial/flux.rst Outdated Show resolved Hide resolved
snakemake/executors/flux.py Outdated Show resolved Hide resolved
snakemake/executors/flux.py Outdated Show resolved Hide resolved
snakemake/executors/flux.py Outdated Show resolved Hide resolved
snakemake/executors/flux.py Outdated Show resolved Hide resolved
snakemake/executors/flux.py Outdated Show resolved Hide resolved

def format_job_exec(self, job):
"""
We don't want a python -m to run.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not? Is there any downside? If this is really needed to be different, rather make this specific part a separate overwritable method in the AbstractExecutor, instead of copying the entire method from there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure, but in my example the python -m didn't work, but running snakemake did. I think the issue I was running into is having a different flux install (provided in the container) with the one that works with snakemake (via miniconda). I couldn't get them working from the same one - so python -m won't work (the python does not know about snakemake as it's not installed there) but calling the executable directly does. This is likely an artifact of the container I'm using (already built by the flux crew) but if I added python -m it won't work.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mhm, I think we should try to understand that. So, how is that with flux. Do jobs always run inside of a container, or only in case of cloud mode?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Flux isn't typically in a container I don't think - this is just the dummy example that I'm testing/developing with! Ping @trws and @grondo I think he can add more insight (I haven't yet used flux outside of this single container demo environment).

Copy link

@grondo grondo Aug 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Flux is a resource manager similar to Slurm, PBS, Torque, etc in that it is designed to run across a set of resources, as on a cluster for example. However, it can also be started as a job within one of those resource managers (including itself), to give full RM capabilities to the job. It can also be started standalone on a node, VM, or container to help manage local resources.

I guess to answer more specifically, Flux is not typically used inside a container, but nothing prevents the user from doing that.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By default flux propagates the entire environment captured where flux run or similar was invoked, because that tends to be what a user wants, but it provides a variety of ways to cur that down or override it as well. It's situational enough that I would think we'd probably want to do what a snakemake user would expect, whatever that is.

To answer the earlier question, flux doesn't run jobs in containers, though that could be supported. It sounds like there were two python environments installed, and the one with flux bindings is different from the one with snakemake installed. We could look at making it easier to install the flux binding components into the same environment, or snakemake into the one flux is using to get around the python -m problem.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@trws that would be fantastic - is this something I could help with? I don't remember the exact issues, but I couldn't get snakemake installed into the default Python in the container.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, you probably could, we should chat about that separately. It would take some re-packaging of the flux python binding component so it can build separately from the autotools-based build of flux to make it convenient. That's something we have several customers asking for, so if you're up for it that would be really easy to support.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep - would love to help with that! Will ping you tomorrow (and leave me a message in chat if you want to show me stuff sooner).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was there any progress with this? I still would like to get rid of this repetition of superclass code.

snakemake/executors/flux.py Outdated Show resolved Hide resolved
snakemake/executors/flux.py Outdated Show resolved Hide resolved
snakemake/executors/flux.py Outdated Show resolved Hide resolved
@johanneskoester
Copy link
Contributor

@johanneskoester since this requires a new executor backend, let me know how you'd like to test. We do have a container so we could try a strategy that uses that.

Would be nice to set up a flux instance in some github action before the actual snakemake tests.

Signed-off-by: vsoch <vsoch@users.noreply.github.com>
Signed-off-by: vsoch <vsoch@users.noreply.github.com>
Signed-off-by: vsoch <vsoch@users.noreply.github.com>
Signed-off-by: vsoch <vsoch@users.noreply.github.com>
Signed-off-by: vsoch <vsoch@users.noreply.github.com>
Signed-off-by: vsoch <vsoch@users.noreply.github.com>
…ime in seconds

Signed-off-by: vsoch <vsoch@users.noreply.github.com>
@vsoch
Copy link
Contributor Author

vsoch commented Aug 19, 2022

I'm going to need the wisdom of the flux folks here - @grondo when you have time can you share what a flux start start.sh script might look like? @trws mentioned this could be something to try so it doesn't go into interactive mode! So far my GHA tricks to allow it haven't worked.

try being the root user and switching to fluxuser

Signed-off-by: vsoch <vsoch@users.noreply.github.com>
Signed-off-by: vsoch <vsoch@users.noreply.github.com>
Signed-off-by: vsoch <vsoch@users.noreply.github.com>
Signed-off-by: vsoch <vsoch@users.noreply.github.com>
@vsoch
Copy link
Contributor Author

vsoch commented Aug 19, 2022

woooo @johanneskoester 38th time is the charm!! I got the flux workflow running in the test!! 🥳

@vsoch
Copy link
Contributor Author

vsoch commented Aug 19, 2022

Sorry for using all the CI cycles 😆

@johanneskoester
Copy link
Contributor

Nice!

@johanneskoester
Copy link
Contributor

So, @vsoch anything still missing here or is it ready for a final review?

@vsoch
Copy link
Contributor Author

vsoch commented Aug 27, 2022

Ping @grondo and @trws see above - is this ready or anything missing?

Copy link

@jameshcorbett jameshcorbett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, sorry it took me so long to get around to this. A few minor comments below, feel free to ignore if you think they aren't worthwhile.

examples/flux/Dockerfile Outdated Show resolved Hide resolved
snakemake/executors/flux.py Outdated Show resolved Hide resolved
if not flux:
sys.exit(
"Cannot import flux. Is a cluster available to you with Python bindings?"
)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know about snakemake but usually sys.exit is used (if at all) in extreme scenarios. I think it might be more 'polite' to throw an ImportError, either directly or by running import flux.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree - it should raise a workflow error - fixing shortly!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is updated to a WorkflowError - I think it could also be an ImportError, but it does reflect an issue with the workflow running. (posting in correct spot)

snakemake/executors/flux.py Outdated Show resolved Hide resolved
snakemake/executors/flux.py Show resolved Hide resolved
Signed-off-by: vsoch <vsoch@users.noreply.github.com>
Signed-off-by: vsoch <vsoch@users.noreply.github.com>
@vsoch
Copy link
Contributor Author

vsoch commented Sep 6, 2022

Sonarbot be like "stop pushing @vsoch I was already running your tests!"

@vsoch
Copy link
Contributor Author

vsoch commented Sep 6, 2022

I think Windows was a transient error - triggered a re-run to test that.

@vsoch
Copy link
Contributor Author

vsoch commented Sep 17, 2022

@johanneskoester I think this is ready to go, @trws said yesterday it looked good.

vsoch and others added 2 commits September 19, 2022 14:51
Signed-off-by: vsoch <vsoch@users.noreply.github.com>
@vsoch
Copy link
Contributor Author

vsoch commented Sep 19, 2022

okay @jameshcorbett you were right that we don't need the import there - I'm not sure if it was something about the job formatted, but it's gone now so moving to the top works!

I also removed a huge chunk of code that we needed before futures (but no longer need). I've tested the updates locally and all is working.

@vsoch
Copy link
Contributor Author

vsoch commented Sep 19, 2022

oh no snakemake why you doing these random failures!! I'll try a re-run a bit later.

@vsoch
Copy link
Contributor Author

vsoch commented Sep 20, 2022

@johanneskoester we are updated and green!!!

Copy link
Contributor

@johanneskoester johanneskoester left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some final stuff below. Sorry for the delay!

snakemake/__init__.py Show resolved Hide resolved
snakemake/executors/flux.py Outdated Show resolved Hide resolved
snakemake/executors/flux.py Outdated Show resolved Hide resolved
Signed-off-by: vsoch <vsoch@users.noreply.github.com>

def format_job_exec(self, job):
"""
We don't want a python -m to run.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was there any progress with this? I still would like to get rid of this repetition of superclass code.

.github/workflows/test-flux.yaml Show resolved Hide resolved
@vsoch
Copy link
Contributor Author

vsoch commented Oct 3, 2022

Was there any progress with this? I still would like to get rid of this repetition of superclass code.

@johanneskoester I don't feel super strongly about this - if you just want it moved to the init I'd be happy to do that!

@vsoch
Copy link
Contributor Author

vsoch commented Oct 3, 2022

Oh wait, this was different:

    def format_job_exec(self, job):
        """
        We don't want a python -m to run.
        """
        prefix = self.get_job_exec_prefix(job)
        if prefix:
            prefix += " &&"
        suffix = self.get_job_exec_suffix(job)
        if suffix:
            suffix = f"&& {suffix}"
        return join_cli_args(
            [
                prefix,
                self.get_envvar_declarations(),
                "snakemake",
                format_cli_arg("--snakefile", self.get_snakefile()),
                self.get_job_args(job),
                self.general_args,
                suffix,
            ]
        )

I couldn't get this to work without this function because we can't be sure that the Flux Python is installed to the same python install as snakemake. With the way Flux is installed now, it's more likely to be installed to a system python (and the user will be using their own Python environment). If we do python -m then (from my testing) the job is going to fail. By just calling snakemake directly, we don't require the python where the flux is installed to be the same as where the snakemake binary is.

I tried to get this working the other way and failed multiple times.

@johanneskoester
Copy link
Contributor

johanneskoester commented Oct 3, 2022

Oh wait, this was different:

    def format_job_exec(self, job):
        """
        We don't want a python -m to run.
        """
        prefix = self.get_job_exec_prefix(job)
        if prefix:
            prefix += " &&"
        suffix = self.get_job_exec_suffix(job)
        if suffix:
            suffix = f"&& {suffix}"
        return join_cli_args(
            [
                prefix,
                self.get_envvar_declarations(),
                "snakemake",
                format_cli_arg("--snakefile", self.get_snakefile()),
                self.get_job_args(job),
                self.general_args,
                suffix,
            ]
        )

I couldn't get this to work without this function because we can't be sure that the Flux Python is installed to the same python install as snakemake. With the way Flux is installed now, it's more likely to be installed to a system python (and the user will be using their own Python environment). If we do python -m then (from my testing) the job is going to fail. By just calling snakemake directly, we don't require the python where the flux is installed to be the same as where the snakemake binary is.

I tried to get this working the other way and failed multiple times.

I don't really get this. So there is the flux python bindings, which apparently you can install in the same env as snakemake (and have to, since otherwise your executor code would not even work). But the job which is executed by flux, is just called from the environment where flux itself (like their core system, whatever) is installed into? And there is no way to tell flux about the execution env the job shall run in? Or configure that somehow?

Also see here. Snakemake does not simply call any python by default, but rather wants to reuse the python executable of the host process by default. Of course, this only works in a shared filesystem like case (hence for k8s or GLS we use a different approach, but there we actually control the container the job is executed in). Not if flux spawns jobs in separate containers. So we really need to get clear how jobs are executed by flux. As far as I understood the docs so far, it is (at least for now) indeed acting very similar to slurm, based on a shared FS and so on. Then, just using the superclass method should just work.

So, is the problem indeed just the test setup with the container? Then, we could probably just install flux into the github actions, thereby more closely mimicking a real setup?

@vsoch
Copy link
Contributor Author

vsoch commented Oct 3, 2022

I don't really get this. So there is the flux python bindings, which apparently you can install in the same env as snakemake (and have to, since otherwise your executor code would not even work). But the job which is executed by flux, is just called from the environment where flux itself (like their core system, whatever) is installed into? And there is no way to tell flux about the execution env the job shall run in? Or configure that somehow?

The only way I could get this to work (in the CI and the container) is to install snakemake to the same python as flux, having them separate never worked. I'll prepare you an example that shows the error (or proves me wrong and it works) and then maybe you can make a suggestion as to what to try next.

So, is the problem indeed just the test setup with the container? Then, we could probably just install flux into the github actions, thereby more closely mimicking a real setup?

I've never tried this, but I'd advocate to use the container so 1. a user can reproduce, and 2. we can always be sure to use the latest flux container provided (and not fall out of sync / require updating two things).

More ideally we'd have a GitHub action that installs flux (akin to the container) and then we can specify the Python to install to, but I haven't made that yet.

@vsoch
Copy link
Contributor Author

vsoch commented Oct 3, 2022

I think this might be an issue with the Python install - let me test something out and see if it resolves the issue - I forgot that these containers have python3 (but not python). Be back in a bit!

and python3->python link is required!

Signed-off-by: vsoch <vsoch@users.noreply.github.com>
@vsoch
Copy link
Contributor Author

vsoch commented Oct 3, 2022

Okay just pushed a change without it - I think the underlying issue was not having python exposed (and I didn't interpret the error message correctly the first time). I'm also thinking it would be hugely useful to have native flux install actions - going to work on that soon in case we want to use here instead of the container!

Signed-off-by: vsoch <vsoch@users.noreply.github.com>
Signed-off-by: vsoch <vsoch@users.noreply.github.com>
@sonarcloud
Copy link

sonarcloud bot commented Oct 3, 2022

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 0 Code Smells

No Coverage information No Coverage information
No Duplication information No Duplication information

@johanneskoester johanneskoester merged commit 40d2bd0 into snakemake:main Oct 4, 2022
@vsoch vsoch deleted the add/flux-executor branch October 4, 2022 20:19
@vsoch
Copy link
Contributor Author

vsoch commented Oct 4, 2022

Thanks @johanneskoester ! And apologies if the docs aren't clear - it's good feedback indeed.

johanneskoester pushed a commit that referenced this pull request Oct 4, 2022
🤖 I have created a release *beep* *boop*
---


##
[7.15.0](v7.14.2...v7.15.0)
(2022-10-04)


### Features

* adding flux executor
([#1810](#1810))
([40d2bd0](40d2bd0))


### Bug Fixes

* Add back logging of run directives
([#1883](#1883))
([a65559c](a65559c))


### Documentation

* fix grammar in the intro
([#1859](#1859))
([774bc6a](774bc6a))
* fix typo ([#1843](#1843))
([6572ad9](6572ad9))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
@trws
Copy link

trws commented Oct 11, 2022 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants