feat: support for external executor plugins (#2305)
Hi @johanneskoester! 👋 As we chatted about in a thread somewhere, I
think it would be really powerful to allow installing (and
discovering) external plugins for Snakemake. Specifically for the Flux
Operator, I have easily three designs I'm testing, and it's not really
appropriate to add them all directly to Snakemake - but I believe
developers should be empowered to flexibly add, remove, and test them
out.

This pull request is a first-try demo of how Snakemake could support
external executor plugins. I say "first try" because it's the first time
I've experimented with plugins, and I tried to choose a design that
optimizes for simplicity and flexibility without requiring external
packages or specific features of setuptools or similar (that are likely
to change). The basic design uses pkgutil to discover
`snakemake_executor_*` plugins, then exposes them to the client (to
add arguments) and to the scheduler, which selects one via `--executor`.
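As a rough illustration, the discovery step under this design could look something like the sketch below (the prefix, module, and function names are my own assumptions for illustration, not the exact code in this PR):

```python
import importlib
import pkgutil

# Hypothetical sketch: scan installed top-level modules for the
# snakemake_executor_ prefix and import each match, keyed by the short
# name the user would pass to --executor (e.g. "flux").
PLUGIN_PREFIX = "snakemake_executor_"

def discover_executor_plugins():
    plugins = {}
    for module_info in pkgutil.iter_modules():
        if module_info.name.startswith(PLUGIN_PREFIX):
            short_name = module_info.name[len(PLUGIN_PREFIX):]
            plugins[short_name] = importlib.import_module(module_info.name)
    return plugins
```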

I've written up a full tutorial and the basic design in this early
prototype, which is essentially the current Flux integration as a plugin:
https://github.com/snakemake/snakemake-executor-flux. The user would
basically do:

```bash
# Assuming this was released on pypi (it's not yet)
$ pip install snakemake-executor-flux

# Run the workflow using the flux custom executor
$ snakemake --jobs 1 --executor flux
```
I've designed it so that plugins are validated only when chosen, and
each plugin can add to or otherwise customize the parser and then (after
parsing) further tweak the args if it was selected. Then, in scheduler.py,
we simply check whether the user selected a plugin and, if so, call its
main executor (and local_executor) classes.
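To make that concrete, a plugin module could expose hooks roughly like the following (the hook and class names here are illustrative assumptions, not the final interface):

```python
import argparse

# Hypothetical shape of a plugin module such as snakemake_executor_flux.
def add_args(parser: argparse.ArgumentParser) -> None:
    """Register plugin-specific command line options on the Snakemake parser."""
    parser.add_argument("--flux-queue", help="Queue to submit Flux jobs to.")

class Executor:
    """Instantiated by the scheduler when the user runs with --executor flux."""

    def __init__(self, workflow, dag, **kwargs):
        self.workflow = workflow
        self.dag = dag

    def run(self, job):
        ...  # translate the Snakemake job into a Flux submission
```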

The one hard piece is having a flexible way to pass forward all those
custom arguments. The current Snakemake design hard-codes a boolean for
basically every executor (e.g., `--flux` or `--slurm`), and while we
don't want to blow that up, I'm worried that, moving forward, passing
all these custom namespaced arguments through the init, workflow, and
scheduler/dag is going to get very messy. So the approach here is a
suggested way to handle the expanding space of additional executors: pass
forward the full args, and allow the plugins to customize the parser
before or after parsing. If we were to, for example, turn the current
executors into plugins (something I expect we might want to do for the
Google Life Sciences API, which is going to be deprecated in favor of
Batch), we could write out a more hardened spec - some configuration
class that is passed from the argument parser through the executor and
through execution (instead of all the one-off arguments).
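For example, one possible shape for such a configuration class would be a small dataclass built once from the parsed arguments and handed through workflow, scheduler, and executor (the class and field names below are hypothetical):

```python
from dataclasses import dataclass, field
from typing import Any, Dict

@dataclass
class ExecutorSettings:
    # Hypothetical settings object replacing many one-off keyword arguments.
    name: str                                             # e.g. "flux"
    jobs: int = 1
    extra: Dict[str, Any] = field(default_factory=dict)   # plugin-specific options

def settings_from_args(args) -> ExecutorSettings:
    # Collect whatever the plugin's parser hooks added, keyed by option name.
    return ExecutorSettings(name=args.executor, jobs=args.jobs, extra=vars(args))
```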

Anyway - this is just a first shot, and I'm hoping to start some
discussion! This is a totally separate thing from the TBA work with Google
Batch - it's something I've wanted to try for a while, since I'd like to
add more executors and have seen the executor space exploding.
:laughing: I haven't written tests or updated any docs yet pending our
discussion!

### QC

* [ ] The PR contains a test case for the changes or the changes are
already covered by an existing test case.
* [ ] The documentation (`docs/`) is updated to reflect the changes or
this is not necessary (e.g. if the change does not modify the
language, behavior, or functionalities of Snakemake).

---------

Signed-off-by: vsoch <vsoch@users.noreply.github.com>
Co-authored-by: vsoch <vsoch@users.noreply.github.com>
Co-authored-by: Johannes Köster <johannes.koester@tu-dortmund.de>
Co-authored-by: Johannes Köster <johannes.koester@uni-due.de>
4 people committed Aug 7, 2023
1 parent 1c5d154 commit c9eaa4e
Showing 51 changed files with 3,834 additions and 5,224 deletions.
10 changes: 8 additions & 2 deletions .github/workflows/main.yml
@@ -43,8 +43,7 @@ jobs:

testing:
runs-on: ubuntu-latest
# TODO comment in once slurm works again
# needs: formatting
needs: formatting
services:
mysql:
image: mysql:8.0
@@ -169,8 +168,12 @@ jobs:
shell: bash -el {0}
run: |
conda config --set channel_priority strict
mamba env update -q -n snakemake --file test-environment.yml
# TODO remove and add as regular dependency once released
pip install git+https://github.com/snakemake/snakemake-interface-executor-plugins.git
# additionally add singularity
# TODO remove version constraint: needed because 3.8.7 fails with missing libz:
@@ -310,6 +313,9 @@ jobs:
run: |
conda config --set channel_priority strict
mamba env update -q --file test-environment.yml
# TODO remove and add as regular dependency once released
pip install git+https://github.com/snakemake/snakemake-interface-executor-plugins.git
- name: Run tests
env:
CI: true
2 changes: 2 additions & 0 deletions .github/workflows/test-flux.yaml
@@ -40,6 +40,8 @@ jobs:
run: |
conda config --set channel_priority strict
mamba install python>=3.9 pip
# TODO remove and add as regular dependency once released
pip install git+https://github.com/snakemake/snakemake-interface-executor-plugins.git
pip install .
- name: Start Flux and Test Workflow
2 changes: 0 additions & 2 deletions docs/project_info/contributing.rst
@@ -57,7 +57,6 @@ Below you find a skeleton
quiet=False,
printshellcmds=False,
latency_wait=3,
cluster_config=None,
local_input=None,
restart_times=None,
exec_job=None,
@@ -70,7 +69,6 @@ Below you find a skeleton
quiet=quiet,
printshellcmds=printshellcmds,
latency_wait=latency_wait,
cluster_config=cluster_config,
local_input=local_input,
restart_times=restart_times,
assume_shared_fs=False, # if your executor relies on a shared file system, set this to True
94 changes: 9 additions & 85 deletions docs/snakefiles/configuration.rst
@@ -206,91 +206,6 @@ Validating PEPs

Using the ``pepschema`` directive leads to an automatic parsing of the provided schema *and* PEP validation with the PEP validation tool -- `eido <http://eido.databio.org>`_. Eido schemas extend `JSON Schema <https://json-schema.org>`_ vocabulary to accommodate the powerful PEP features. Follow the `How to write a PEP schema <http://eido.databio.org/en/latest/writing-a-schema>`_ guide to learn more.

.. _snakefiles-cluster_configuration:

----------------------------------
Cluster Configuration (deprecated)
----------------------------------

While still being possible, **cluster configuration has been deprecated** by the introduction of :ref:`profiles`.

Snakemake supports a separate configuration file for execution on a cluster.
A cluster config file allows you to specify cluster submission parameters outside the Snakefile.
The cluster config is a JSON- or YAML-formatted file that contains objects that match names of rules in the Snakefile.
The parameters in the cluster config are then accessed by the ``cluster.*`` wildcard when you are submitting jobs.
Note that a workflow shall never depend on a cluster configuration, because this would limit its portability.
Therefore, it is also not intended to access the cluster configuration from **within** the workflow.

For example, say that you have the following Snakefile:

.. code-block:: python

    rule all:
        input: "input1.txt", "input2.txt"

    rule compute1:
        output: "input1.txt"
        shell: "touch input1.txt"

    rule compute2:
        output: "input2.txt"
        shell: "touch input2.txt"

This Snakefile can then be configured by a corresponding cluster config, say "cluster.json":


.. code-block:: json

    {
        "__default__" :
        {
            "account" : "my account",
            "time" : "00:15:00",
            "n" : 1,
            "partition" : "core"
        },
        "compute1" :
        {
            "time" : "00:20:00"
        }
    }

Any string in the cluster configuration can be formatted in the same way as shell commands, e.g. ``{rule}.{wildcards.sample}`` is formatted to ``a.xy`` if the rule name is ``a`` and the wildcard value is ``xy``.
Here ``__default__`` is a special object that specifies default parameters; these will be inherited by the other configuration objects. The ``compute1`` object here changes the ``time`` parameter but keeps the other parameters from ``__default__``. The rule ``compute2`` does not have any configuration and will therefore use the default configuration. You can then run the Snakefile with the following command on a SLURM system.

.. code-block:: console

    $ snakemake -j 999 --cluster-config cluster.json --cluster "sbatch -A {cluster.account} -p {cluster.partition} -n {cluster.n} -t {cluster.time}"

For cluster systems using LSF/BSUB, a cluster config may look like this:

.. code-block:: json

    {
        "__default__" :
        {
            "queue" : "medium_priority",
            "nCPUs" : "16",
            "memory" : 20000,
            "resources" : "\"select[mem>20000] rusage[mem=20000] span[hosts=1]\"",
            "name" : "JOBNAME.{rule}.{wildcards}",
            "output" : "logs/cluster/{rule}.{wildcards}.out",
            "error" : "logs/cluster/{rule}.{wildcards}.err"
        },
        "trimming_PE" :
        {
            "memory" : 30000,
            "resources" : "\"select[mem>30000] rusage[mem=30000] span[hosts=1]\"",
        }
    }

The advantage of this setup is that it is already pretty general by exploiting the wildcard possibilities that Snakemake provides via ``{rule}`` and ``{wildcards}``. So job names, output and error files all have reasonable and trackable default names; only the directories (``logs/cluster``) and job names (``JOBNAME``) have to be adjusted accordingly.
If a rule named ``bamCoverage`` is executed with the wildcard ``basename = sample1``, for example, the output and error files will be ``bamCoverage.basename=sample1.out`` and ``bamCoverage.basename=sample1.err``, respectively.


---------------------------
Configure Working Directory
---------------------------
@@ -302,3 +217,12 @@ All paths in the snakefile are interpreted relative to the directory snakemake i
workdir: "path/to/workdir"
Usually, it is preferred to only set the working directory via the command line, because the above directive limits the portability of Snakemake workflows.


.. _snakefiles-cluster_configuration:

---------------------------------------------
Cluster Configuration (not supported anymore)
---------------------------------------------

The previously supported cluster configuration has been replaced by configuration profiles (see :ref:`profiles`).
5 changes: 3 additions & 2 deletions docs/tutorial/additional_features.rst
@@ -77,8 +77,9 @@ For this, Snakemake provides the ``include`` directive to include another Snakef

.. code:: python
include: "path/to/other.snakefile"
include: "path/to/other.smk"
As can be seen, the default file extension for snakefiles other than the main Snakefile is ``.smk``.
Alternatively, Snakemake allows you to **define sub-workflows**.
A sub-workflow refers to a working directory with a complete Snakemake workflow.
Output files of that sub-workflow can be used in the current Snakefile.
@@ -235,7 +236,7 @@ The **DRMAA support** can be activated by invoking Snakemake as follows:
$ snakemake --drmaa --jobs 100
If available, **DRMAA is preferable over the generic cluster modes** because it provides better control and error handling.
To support additional cluster specific parametrization, a Snakefile can be complemented by a :ref:`snakefiles-cluster_configuration` file.
To support additional cluster specific parametrization, a Snakefile can be complemented by a workflow specific profile (see :ref:`profiles`).

Using --cluster-status
::::::::::::::::::::::
5 changes: 3 additions & 2 deletions setup.cfg
@@ -50,6 +50,7 @@ install_requires =
requests
reretry
smart_open >=3.0
snakemake-interface-executor-plugins
stopit
tabulate
throttler
@@ -78,8 +79,8 @@ reports = pygments

[options.entry_points]
console_scripts =
snakemake = snakemake:main
snakemake-bash-completion = snakemake:bash_completion
snakemake = snakemake.cli:main
snakemake-bash-completion = snakemake.cli:bash_completion

[options.packages.find]
include = snakemake, snakemake.*
