feat: support for external executor plugins (#2305)
Hi @johanneskoester! 👋 As we chatted about in a thread somewhere, I
think it would be really powerful to allow installing (and
discovering) external plugins for Snakemake. Specifically for the Flux
Operator, I have easily three designs I'm testing, and it's not really
appropriate to add them all directly to Snakemake - but I believe
developers should be empowered to flexibly add, remove, and test them
out.

This pull request is a first-try demo of how Snakemake could support
external executor plugins. I say "first try" because it's the first time
I've experimented with plugins, and I tried to choose a design that
optimizes for simplicity and flexibility without requiring external
packages or specific features of setuptools or similar (that are likely
to change). The basic design uses pkgutil to discover
`snakemake_executor_*` plugins, then exposes them to the client (to
add arguments) and to the scheduler, which selects one via `--executor`.
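As a rough illustration, the discovery step under this design could look something like the sketch below (the prefix, module, and function names are my own assumptions for illustration, not the exact code in this PR):

```python
import importlib
import pkgutil

# Hypothetical sketch: scan installed top-level modules for the
# snakemake_executor_ prefix and import each match, keyed by the short
# name the user would pass to --executor (e.g. "flux").
PLUGIN_PREFIX = "snakemake_executor_"

def discover_executor_plugins():
    plugins = {}
    for module_info in pkgutil.iter_modules():
        if module_info.name.startswith(PLUGIN_PREFIX):
            short_name = module_info.name[len(PLUGIN_PREFIX):]
            plugins[short_name] = importlib.import_module(module_info.name)
    return plugins
```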

I've written up a full tutorial and the basic design in this early
prototype, which is essentially the current Flux integration as a plugin:
https://github.com/snakemake/snakemake-executor-flux. The user would
basically do:

```bash
# Assuming this was released on pypi (it's not yet)
$ pip install snakemake-executor-flux

# Run the workflow using the flux custom executor
$ snakemake --jobs 1 --executor flux
```
I've designed it so that plugins are validated only when chosen, and
each plugin can add to or otherwise customize the parser and then (after
parsing) further tweak the args if it was selected. Then, in scheduler.py,
we simply check whether the user selected a plugin and, if so, call its
main executor (and local_executor) classes.
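To make that concrete, a plugin module could expose hooks roughly like the following (the hook and class names here are illustrative assumptions, not the final interface):

```python
import argparse

# Hypothetical shape of a plugin module such as snakemake_executor_flux.
def add_args(parser: argparse.ArgumentParser) -> None:
    """Register plugin-specific command line options on the Snakemake parser."""
    parser.add_argument("--flux-queue", help="Queue to submit Flux jobs to.")

class Executor:
    """Instantiated by the scheduler when the user runs with --executor flux."""

    def __init__(self, workflow, dag, **kwargs):
        self.workflow = workflow
        self.dag = dag

    def run(self, job):
        ...  # translate the Snakemake job into a Flux submission
```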

The one hard piece is having a flexible way to pass forward all those
custom arguments. The current Snakemake design hard-codes a boolean for
basically every executor (e.g., `--flux` or `--slurm`), and while we
don't want to blow that up, I'm worried that, moving forward, passing
all these custom namespaced arguments through the init, workflow, and
scheduler/dag is going to get very messy. So the approach here is a
suggested way to handle the expanding space of additional executors: pass
forward the full args, and allow the plugins to customize the parser
before or after parsing. If we were to, for example, turn the current
executors into plugins (something I expect we might want to do for the
Google Life Sciences API, which is going to be deprecated in favor of
Batch), we could write out a more hardened spec - some configuration
class that is passed from the argument parser through the executor and
through execution (instead of all the one-off arguments).
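For example, one possible shape for such a configuration class would be a small dataclass built once from the parsed arguments and handed through workflow, scheduler, and executor (the class and field names below are hypothetical):

```python
from dataclasses import dataclass, field
from typing import Any, Dict

@dataclass
class ExecutorSettings:
    # Hypothetical settings object replacing many one-off keyword arguments.
    name: str                                             # e.g. "flux"
    jobs: int = 1
    extra: Dict[str, Any] = field(default_factory=dict)   # plugin-specific options

def settings_from_args(args) -> ExecutorSettings:
    # Collect whatever the plugin's parser hooks added, keyed by option name.
    return ExecutorSettings(name=args.executor, jobs=args.jobs, extra=vars(args))
```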

Anyway - this is just a first shot, and I'm hoping to start some
discussion! This is a totally separate thing from the TBA work with Google
Batch - it's something I've wanted to try for a while, since I'd like to
add more executors and have seen the executor space exploding.
:laughing: I haven't written tests or updated any docs yet pending our
discussion!

### QC

* [ ] The PR contains a test case for the changes or the changes are
already covered by an existing test case.
* [ ] The documentation (`docs/`) is updated to reflect the changes or
this is not necessary (e.g. if the change does not modify the
language, behavior, or functionalities of Snakemake).

---------

Signed-off-by: vsoch <vsoch@users.noreply.github.com>
Co-authored-by: vsoch <vsoch@users.noreply.github.com>
Co-authored-by: Johannes Köster <johannes.koester@tu-dortmund.de>
Co-authored-by: Johannes Köster <johannes.koester@uni-due.de>
4 people committed Aug 7, 2023
1 parent 1c5d154 commit c9eaa4e
Showing 51 changed files with 3,834 additions and 5,224 deletions.
10 changes: 8 additions & 2 deletions .github/workflows/main.yml
@@ -43,8 +43,7 @@ jobs:

testing:
runs-on: ubuntu-latest
# TODO comment in once slurm works again
# needs: formatting
needs: formatting
services:
mysql:
image: mysql:8.0
@@ -169,8 +168,12 @@ jobs:
shell: bash -el {0}
run: |
conda config --set channel_priority strict
mamba env update -q -n snakemake --file test-environment.yml
# TODO remove and add as regular dependency once released
pip install git+https://github.com/snakemake/snakemake-interface-executor-plugins.git
# additionally add singularity
# TODO remove version constraint: needed because 3.8.7 fails with missing libz:
@@ -310,6 +313,9 @@ jobs:
run: |
conda config --set channel_priority strict
mamba env update -q --file test-environment.yml
# TODO remove and add as regular dependency once released
pip install git+https://github.com/snakemake/snakemake-interface-executor-plugins.git
- name: Run tests
env:
CI: true
2 changes: 2 additions & 0 deletions .github/workflows/test-flux.yaml
@@ -40,6 +40,8 @@ jobs:
run: |
conda config --set channel_priority strict
mamba install python>=3.9 pip
# TODO remove and add as regular dependency once released
pip install git+https://github.com/snakemake/snakemake-interface-executor-plugins.git
pip install .
- name: Start Flux and Test Workflow
2 changes: 0 additions & 2 deletions docs/project_info/contributing.rst
@@ -57,7 +57,6 @@ Below you find a skeleton
quiet=False,
printshellcmds=False,
latency_wait=3,
cluster_config=None,
local_input=None,
restart_times=None,
exec_job=None,
@@ -70,7 +69,6 @@ Below you find a skeleton
quiet=quiet,
printshellcmds=printshellcmds,
latency_wait=latency_wait,
cluster_config=cluster_config,
local_input=local_input,
restart_times=restart_times,
assume_shared_fs=False, # if your executor relies on a shared file system, set this to True
94 changes: 9 additions & 85 deletions docs/snakefiles/configuration.rst
@@ -206,91 +206,6 @@ Validating PEPs

Using the ``pepschema`` directive leads to an automatic parsing of the provided schema *and* PEP validation with the PEP validation tool -- `eido <http://eido.databio.org>`_. Eido schemas extend `JSON Schema <https://json-schema.org>`_ vocabulary to accommodate the powerful PEP features. Follow the `How to write a PEP schema <http://eido.databio.org/en/latest/writing-a-schema>`_ guide to learn more.

.. _snakefiles-cluster_configuration:

----------------------------------
Cluster Configuration (deprecated)
----------------------------------

While still being possible, **cluster configuration has been deprecated** by the introduction of :ref:`profiles`.

Snakemake supports a separate configuration file for execution on a cluster.
A cluster config file allows you to specify cluster submission parameters outside the Snakefile.
The cluster config is a JSON- or YAML-formatted file that contains objects that match names of rules in the Snakefile.
The parameters in the cluster config are then accessed by the ``cluster.*`` wildcard when you are submitting jobs.
Note that a workflow shall never depend on a cluster configuration, because this would limit its portability.
Therefore, it is also not intended to access the cluster configuration from **within** the workflow.

For example, say that you have the following Snakefile:

.. code-block:: python

    rule all:
        input: "input1.txt", "input2.txt"

    rule compute1:
        output: "input1.txt"
        shell: "touch input1.txt"

    rule compute2:
        output: "input2.txt"
        shell: "touch input2.txt"

This Snakefile can then be configured by a corresponding cluster config, say "cluster.json":


.. code-block:: json

    {
        "__default__" :
        {
            "account" : "my account",
            "time" : "00:15:00",
            "n" : 1,
            "partition" : "core"
        },
        "compute1" :
        {
            "time" : "00:20:00"
        }
    }

Any string in the cluster configuration can be formatted in the same way as shell commands, e.g. ``{rule}.{wildcards.sample}`` is formatted to ``a.xy`` if the rule name is ``a`` and the wildcard value is ``xy``.
Here ``__default__`` is a special object that specifies default parameters; these will be inherited by the other configuration objects. The ``compute1`` object here changes the ``time`` parameter but keeps the other parameters from ``__default__``. The rule ``compute2`` does not have any configuration and will therefore use the default configuration. You can then run the Snakefile with the following command on a SLURM system.

.. code-block:: console

    $ snakemake -j 999 --cluster-config cluster.json --cluster "sbatch -A {cluster.account} -p {cluster.partition} -n {cluster.n} -t {cluster.time}"

For cluster systems using LSF/BSUB, a cluster config may look like this:

.. code-block:: json

    {
        "__default__" :
        {
            "queue" : "medium_priority",
            "nCPUs" : "16",
            "memory" : 20000,
            "resources" : "\"select[mem>20000] rusage[mem=20000] span[hosts=1]\"",
            "name" : "JOBNAME.{rule}.{wildcards}",
            "output" : "logs/cluster/{rule}.{wildcards}.out",
            "error" : "logs/cluster/{rule}.{wildcards}.err"
        },
        "trimming_PE" :
        {
            "memory" : 30000,
            "resources" : "\"select[mem>30000] rusage[mem=30000] span[hosts=1]\"",
        }
    }

The advantage of this setup is that it is already pretty general by exploiting the wildcard possibilities that Snakemake provides via ``{rule}`` and ``{wildcards}``. So job names, output and error files all have reasonable and trackable default names; only the directories (``logs/cluster``) and job names (``JOBNAME``) have to be adjusted accordingly.
If a rule named ``bamCoverage`` is executed with the wildcard ``basename = sample1``, for example, the output and error files will be ``bamCoverage.basename=sample1.out`` and ``bamCoverage.basename=sample1.err``, respectively.


---------------------------
Configure Working Directory
---------------------------
@@ -302,3 +217,12 @@ All paths in the snakefile are interpreted relative to the directory snakemake i
workdir: "path/to/workdir"
Usually, it is preferred to only set the working directory via the command line, because the above directive limits the portability of Snakemake workflows.


.. _snakefiles-cluster_configuration:

---------------------------------------------
Cluster Configuration (not supported anymore)
---------------------------------------------

The previously supported cluster configuration has been replaced by configuration profiles (see :ref:`profiles`).
5 changes: 3 additions & 2 deletions docs/tutorial/additional_features.rst
@@ -77,8 +77,9 @@ For this, Snakemake provides the ``include`` directive to include another Snakef

.. code:: python
include: "path/to/other.snakefile"
include: "path/to/other.smk"
As can be seen, the default file extension for snakefiles other than the main Snakefile is ``.smk``.
Alternatively, Snakemake allows you to **define sub-workflows**.
A sub-workflow refers to a working directory with a complete Snakemake workflow.
Output files of that sub-workflow can be used in the current Snakefile.
@@ -235,7 +236,7 @@ The **DRMAA support** can be activated by invoking Snakemake as follows:
$ snakemake --drmaa --jobs 100
If available, **DRMAA is preferable over the generic cluster modes** because it provides better control and error handling.
To support additional cluster specific parametrization, a Snakefile can be complemented by a :ref:`snakefiles-cluster_configuration` file.
To support additional cluster specific parametrization, a Snakefile can be complemented by a workflow specific profile (see :ref:`profiles`).

Using --cluster-status
::::::::::::::::::::::
5 changes: 3 additions & 2 deletions setup.cfg
@@ -50,6 +50,7 @@ install_requires =
requests
reretry
smart_open >=3.0
snakemake-interface-executor-plugins
stopit
tabulate
throttler
@@ -78,8 +79,8 @@ reports = pygments

[options.entry_points]
console_scripts =
snakemake = snakemake:main
snakemake-bash-completion = snakemake:bash_completion
snakemake = snakemake.cli:main
snakemake-bash-completion = snakemake.cli:bash_completion

[options.packages.find]
include = snakemake, snakemake.*
