docs: checkpoint documentation (#1562)
* first draft at modifying the  documentation

* reworked, fix typo, link to directory article

* Update docs/snakefiles/rules.rst

* Update docs/snakefiles/rules.rst

Co-authored-by: Johannes Köster <johannes.koester@uni-due.de>
gregdenay and johanneskoester committed Apr 25, 2022
1 parent 1adb144 commit 4cbfb4786a729a0c899a0a3e0427c1c1f0796c15
Showing 1 changed file with 61 additions and 23 deletions.
@@ -999,6 +999,8 @@ Further, an output file marked as ``temp`` is deleted after all rules that use i
shell:
"somecommand {input} {output}"
.. _snakefiles-directory_output:

Directories as outputs
----------------------

@@ -1865,8 +1867,62 @@ Instead, the output file will be opened, and depending on its contents either ``
This way, the DAG becomes conditional on some produced data.

It is also possible to use checkpoints for cases where the output files are unknown before execution.
A typical example is a clustering process with an unknown number of clusters, where each cluster shall be saved into a separate file.
Consider the following example:
Consider the following example where an arbitrary number of files is generated by a rule before being aggregated:

.. code-block:: python

    # needed for os.path.join in the input function below
    import os

    # a target rule to define the desired final output
    rule all:
        input:
            "aggregated.txt"


    # the checkpoint that shall trigger re-evaluation of the DAG
    # a number of files is created in a defined directory
    checkpoint somestep:
        output:
            directory("my_directory/")
        shell:
            "mkdir my_directory/; "
            "for i in 1 2 3; do touch my_directory/$i.txt; done"


    # input function for rule aggregate, returns paths to all files produced by the checkpoint 'somestep'
    def aggregate_input(wildcards):
        checkpoint_output = checkpoints.somestep.get(**wildcards).output[0]
        return expand("my_directory/{i}.txt",
                      i=glob_wildcards(os.path.join(checkpoint_output, "{i}.txt")).i)


    rule aggregate:
        input:
            aggregate_input
        output:
            "aggregated.txt"
        shell:
            "cat {input} > {output}"

Because the number of output files is unknown beforehand, the checkpoint only defines an output :ref:`directory <snakefiles-directory_output>`.
This time, instead of explicitly writing

.. code-block:: python

    checkpoints.clustering.get(sample=wildcards.sample).output[0]

we use the shorthand

.. code-block:: python

    checkpoints.clustering.get(**wildcards).output[0]

which automatically unpacks the wildcards as keyword arguments (this is standard Python argument unpacking).
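
The equivalence of the two spellings can be checked in plain Python. The sketch below is only an illustration: ``get`` is a hypothetical stand-in rather than Snakemake's checkpoint API, and a plain ``dict`` is assumed in place of the real wildcards object.

```python
# Hypothetical stand-in for a checkpoint's get() method; NOT Snakemake's API.
def get(**wildcard_values):
    """Return the keyword arguments that were passed in, for inspection."""
    return wildcard_values


# A plain dict assumed here in place of the wildcards object of a job.
wildcards = {"sample": "a"}

explicit = get(sample=wildcards["sample"])  # each wildcard named by hand
shorthand = get(**wildcards)                # all wildcards unpacked at once

print(explicit == shorthand)
```

The shorthand keeps working unchanged if further wildcards are added later, whereas the explicit form has to be extended by hand.
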
If the checkpoint has not yet been executed, accessing ``checkpoints.clustering.get(**wildcards)`` ensures that Snakemake records the checkpoint as a direct dependency of the rule ``aggregate``.
Upon completion of the checkpoint, the input function is re-evaluated, and the code beyond its first line is executed.
Here, we retrieve the values of the wildcard ``i`` based on all files named ``{i}.txt`` in the output directory of the checkpoint.
Because the wildcard ``i`` is evaluated only after completion of the checkpoint, it is necessary to use ``directory`` to declare its output, instead of using the full wildcard patterns as output.
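
What the re-evaluated input function does with the directory can be approximated outside of Snakemake. The following is a rough plain-Python sketch under stated assumptions: ``discover_ids`` is a hypothetical helper that only mimics the role ``glob_wildcards`` plays here, and a temporary directory simulates the checkpoint output.

```python
import os
import re
import tempfile


def discover_ids(directory):
    """Return the sorted values of i for all files named {i}.txt in directory.

    Hypothetical helper; it only mimics the role of glob_wildcards here.
    """
    pattern = re.compile(r"^(?P<i>.+)\.txt$")
    ids = []
    for name in os.listdir(directory):
        match = pattern.match(name)
        if match:
            ids.append(match.group("i"))
    return sorted(ids)


with tempfile.TemporaryDirectory() as tmp:
    # Simulate the checkpoint: an output directory holding three files.
    checkpoint_output = os.path.join(tmp, "my_directory")
    os.mkdir(checkpoint_output)
    for i in (1, 2, 3):
        open(os.path.join(checkpoint_output, f"{i}.txt"), "w").close()

    # Simulate the re-evaluated input function: the file names are known
    # only now, after the directory has been filled.
    ids = discover_ids(checkpoint_output)
    inputs = [f"my_directory/{i}.txt" for i in ids]

print(inputs)
```

The list of inputs cannot be computed before the directory exists and is populated, which is why the checkpoint declares the directory rather than individual files.
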

A more practical example building on the previous one is a clustering process with an unknown number of clusters for different samples, where each cluster shall be saved into a separate file.
In this example the clusters are being processed by an intermediate rule before being aggregated:

.. code-block:: python
@@ -1914,27 +1970,9 @@ Consider the following example:
shell:
"cat {input} > {output}"
Here, our checkpoint simulates a clustering.
We pretend that the number of clusters is unknown beforehand.
Hence, the checkpoint only defines an output ``directory``.
The rule ``aggregate`` again uses the ``checkpoints`` object to retrieve the output of the checkpoint.
This time, instead of explicitly writing

.. code-block:: python

    checkpoints.clustering.get(sample=wildcards.sample).output[0]

we use the shorthand

.. code-block:: python

    checkpoints.clustering.get(**wildcards).output[0]

which automatically unpacks the wildcards as keyword arguments (this is standard Python argument unpacking).
If the checkpoint has not yet been executed, accessing ``checkpoints.clustering.get(**wildcards)`` ensures that Snakemake records the checkpoint as a direct dependency of the rule ``aggregate``.
Upon completion of the checkpoint, the input function is re-evaluated, and the code beyond its first line is executed.
Here, we retrieve the values of the wildcard ``i`` based on all files named ``{i}.txt`` in the output directory of the checkpoint.
These values are then used to expand the pattern ``"post/{sample}/{i}.txt"``, such that the rule ``intermediate`` is executed for each of the determined clusters.
Here a new directory will be created for each sample by the checkpoint.
After completion of the checkpoint, the ``aggregate_input`` function is re-evaluated as previously.
This time, the values of the wildcard ``i`` are used to expand the pattern ``"post/{sample}/{i}.txt"``, such that the rule ``intermediate`` is executed for each of the determined clusters.
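
The per-sample expansion can be illustrated in plain Python. This is a minimal sketch, not Snakemake's ``expand``: ``expand_pattern`` is a hypothetical helper, and the cluster ids per sample are assumed values standing in for whatever the checkpoint would actually produce.

```python
def expand_pattern(pattern, sample, ids):
    """Fill one sample and each cluster id into a path pattern (hypothetical helper)."""
    return [pattern.format(sample=sample, i=i) for i in ids]


# Assumed cluster ids per sample, as they might be discovered after the
# checkpoint has run; the numbers differ per sample on purpose.
clusters = {"a": ["1", "2", "3"], "b": ["1", "2"]}

requested = {
    sample: expand_pattern("post/{sample}/{i}.txt", sample, ids)
    for sample, ids in clusters.items()
}

for sample, paths in requested.items():
    print(sample, paths)
```

Each requested path corresponds to one job of an intermediate rule, so the number of such jobs follows the number of clusters found for each sample.
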


.. _snakefiles-rule-inheritance:
