Announcement: Change in rerun behavior in Snakemake 7.8 #1694

Open · johanneskoester opened this issue May 31, 2022 · 33 comments

@johanneskoester

Snakemake 7.8 changes its rerun behavior. Previously, deciding whether to rerun a job relied purely on file modification times.
Now, provenance information is also considered: parameter changes, code changes, software environment changes, and changes to the set of input files of a job.

This is intentional: it provides a higher degree of safety and reproducibility, ensuring that no changes in the workflow definition are missed and that the results on disk always represent the state of the codebase.

However, it has the downside that a job may sometimes be rerun purely because of cosmetic changes, such as removed whitespace or an added comment, which can happen from time to time during development.

If you don't want such reruns, you have the following options, which Snakemake also displays at the end of a dry run in such cases:

  • If you prefer that only modification time is used to determine whether a job shall be executed, use the command line option '--rerun-triggers mtime' (also see --help).
  • If you are sure that a change for a certain output file (say, <outfile>) won't change the result (e.g. because you just changed the formatting of a script or environment definition), you can also wipe its metadata to skip such a trigger via snakemake --cleanup-metadata <outfile>.
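
In practice these look as follows (results/plot.pdf is a made-up output name for illustration):

```sh
# Relax rerun decisions to modification times only, for this invocation:
snakemake --rerun-triggers mtime -n

# Or forget the recorded provenance of a specific output, so a cosmetic
# change no longer triggers its rerun:
snakemake --cleanup-metadata results/plot.pdf
```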
@chrarnold

Thanks for the post Johannes, this helps. One additional question: say I modify a rule and want to clean up the metadata for multiple files, either the output file for x samples, or x output files resulting from one run of the rule. How do I clean up the metadata when there is more than one file? Can wildcards be provided? Some guidance here would be good.

@johanneskoester

At the moment, one has to provide all of the output files explicitly, but you can use shell wildcards (*) for that purpose.
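
For example (the path pattern is illustrative):

```sh
# The shell expands the glob; Snakemake receives all matching files as arguments
snakemake --cleanup-metadata results/*/aligned.bam
```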

@jonas-eschle

jonas-eschle commented Jun 7, 2022

First of all, thanks a lot @johanneskoester for this awesome package! I very much appreciate all the work you put into it.

I am, however, a bit critical of this change: what exactly will now trigger a rerun? Code changes, but which ones? How can it recognize which code was used in general, or does it depend on the directive, say script? I didn't find this information; could you maybe point me to some resources? I am also a bit worried that it seems to have been introduced as opt-out in a minor release, breaking current behavior.

@corneliusroemer

--touch doesn't seem to work when there's a code change. I could get rid of "Reason: Params have changed since last execution" by running --touch, but now I get "reason: code has changed since last execution".

However, --rerun-triggers mtime worked as suggested above. Now the only issue is I have to remember and type that string 👀

@johanneskoester

@corneliusroemer you can always write your own profile that sets --rerun-triggers mtime if you prefer that behavior.
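
As a sketch (the profile name mtime-only is made up), a profile is just a directory containing a config.yaml whose keys mirror the command-line flags:

```yaml
# ~/.config/snakemake/mtime-only/config.yaml
rerun-triggers: mtime
```

Invoking snakemake --profile mtime-only then applies the option without retyping it.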

@johanneskoester

> --touch doesn't seem to work when there's a code change. I could get rid of "Reason: Params have changed since last execution" by running --touch, but now I get "reason: code has changed since last execution".

To mark output files as "I don't care about those changes", it is recommended to use the --cleanup-metadata approach for them, as outlined above and in the message that Snakemake prints at the end of a dry run that contains such triggers.

@jonas-eschle

To elaborate, I just discovered that versions before 7.8 do not accept the --rerun-triggers mtime argument. This is actually a bit of a problem: it means that 7.7 and 7.8 are incompatible and behave differently.
In general, I would suggest leaving such incompatible, API-breaking changes for major releases if they are not urgently needed or clearly the better choice.

Here, the consistency and predictability of the behavior is probably more important.

Actually, I am also somewhat surprised by the "code changes": AFAIR, Snakemake stopped tracing code and left that to version control instead (which makes sense). But now it seems to partially re-track changes.

UPDATE (as our posts just crossed):

> you can always write your own profile that sets --rerun-triggers mtime if you prefer that behavior.

It still is an incompatible, breaking change. And it is not easy to circumvent, as there is no default profile, so one would need to be introduced. Most importantly, it seems to be undocumented (I didn't find anything), although it seems crucial to know what exactly can trigger a rerun.

@johanneskoester

> First of all, thanks a lot @johanneskoester for this awesome package! I very much appreciate all the work you put into it.

> I am, however, a bit critical of this change: what exactly will now trigger a rerun? Code changes, but which ones? How can it recognize which code was used in general, or does it depend on the directive, say script? I didn't find this information; could you maybe point me to some resources? I am also a bit worried that it seems to have been introduced as opt-out in a minor release, breaking current behavior.

Hi @jonas-eschle. First of all, yes, I hear you: it would have been better to make this a major version bump. Technically, it was not a breaking change as I have considered breaking changes so far (the language did not change, no feature was removed, and workflows still work exactly the same). Rather, I thought of this more as a fix for an incomplete feature (Snakemake now correctly recognizes all the things it missed before and that required people to manually enforce re-execution). But from a user perspective it can be somewhat breaking in certain cases.

Regarding the details: a code change means that the hash of the shell command or run block has changed, or, in the case of script, that the script's modification date has changed. This of course means that even cosmetic changes to the script/run/shell section of a rule trigger a rerun. It took me a long time to finally make this decision, but in the end I thought it better than silently ignoring any of these changes. I would rather be conservative and ensure that results are guaranteed to be consistent with the current state of the codebase on disk than silently ignore such changes, which makes reliable results harder to achieve for beginners and requires quite deep knowledge of Snakemake to handle reruns in such cases.

In the end, the behavior can always be configured (via --profile), but I thought that the conservative approach is a good default for beginners. During development, it can of course be a good idea to just use the mtime trigger, when you know what you are doing.

@melund

melund commented Jul 5, 2022

@johanneskoester I like the change. I previously often added the scripts as inputs in the rule to trigger reruns.

Just an idea: for the run: and script: (Python) blocks, maybe we could hash the abstract syntax tree (AST) so that only actual code changes trigger reruns, similar to how the black formatter guarantees that reformatting never changes your code functionally.
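
As a sketch of that idea (not anything Snakemake currently implements): hashing the dumped AST ignores comments and whitespace, so only functional edits change the digest.

```python
# Sketch only: fingerprint a Python source file's AST instead of its raw text.
# ast.parse() drops comments, and ast.dump() normalizes formatting, so
# cosmetic edits leave the digest unchanged.
import ast
import hashlib

def code_fingerprint(source: str) -> str:
    tree = ast.parse(source)
    return hashlib.sha256(ast.dump(tree).encode()).hexdigest()

# Cosmetic change: same fingerprint.
assert code_fingerprint("x = 1  # comment") == code_fingerprint("x=1")
# Functional change: different fingerprint.
assert code_fingerprint("x = 1") != code_fingerprint("x = 2")
```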

@pvandyken

pvandyken commented Jul 13, 2022

> Thanks for the post Johannes, this helps. One additional question: say I modify a rule and want to clean up the metadata for multiple files, either the output file for x samples, or x output files resulting from one run of the rule. How do I clean up the metadata when there is more than one file? Can wildcards be provided? Some guidance here would be good.

@chrarnold Perhaps you've figured this out already, but I use the following to clean up metadata following a code change:

```sh
snakemake --cleanup-metadata $(snakemake --list-code-changes -q all)
```

This also works with --list-input-changes and --list-params-changes. Note that you need -q all so that only the file names are printed. Also, this prints EVERY file with code changes, so it is not selective for, say, one specific rule.
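
If you do need to be selective, one hypothetical refinement is to filter the listed paths before passing them on (the results/align/ prefix is made up):

```sh
snakemake --cleanup-metadata $(snakemake --list-code-changes -q all | grep '^results/align/')
```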

@ning-y

ning-y commented Aug 4, 2022

Just to document a problem I've run into, and a fix:

With projects that started on earlier versions of Snakemake, I've experienced some issues using later versions with this new rerun behavior (tested on 7.12.0). Snakemake would consistently and repeatedly, but falsely, identify certain output files as having changed since the last run (e.g. params have changed, or inputs have changed), even though the very last thing I did was run that exact same Snakemake command.

I figured this was due to some incompatibility in the .snakemake metadata between versions. Deleting or renaming the .snakemake directory, so that it would be regenerated, fixed these errors for me. The rerun behavior is now consistent and logical, and for what it's worth I think this newer, stricter behavior is much better than the old mtime-based one.
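
The workaround described above, as shell commands (renaming rather than deleting keeps a backup in case you need to roll back):

```sh
mv .snakemake .snakemake.bak
snakemake -n   # dry run to inspect what would rerun; metadata is rebuilt on the next real run
```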

@mkiyer

mkiyer commented Aug 24, 2022

I agree with ning-y above. I think I'm seeing a bug in the way rerun behavior is determined. Before I go any further: thank you for this wonderful tool! I continue to be impressed with it. You have built it into quite a large apparatus!

My current workflow always reruns. For some reason, the tool detects parameter changes every single time it is rerun. In fact, downstream rules pick up parameter changes during the same run, which causes the tool to rerun all of their upstream rules. Thus, the same rules get rerun over and over again. I am trying to track this down and fix it.

Here is the relevant section of my workflow (GATK4 variant calling). Could the problem have something to do with parameters defined as functions?

```
rule gatk_mark_duplicates_spark:
    conda:
        "envs/gatk4.yaml"
    input:
        rules.star.output.bam
    output:
        bam = "runs/{run_id}/gatk/aln.markdup.bam",
        bai = "runs/{run_id}/gatk/aln.markdup.bam.bai",
        metrics = "runs/{run_id}/gatk/markdup_metrics.txt"
    params:
        java_options = get_java_resource_options,
        spark_executor_cores = lambda wildcards, threads: threads - 1,
        spark_executor_instances = lambda wildcards, threads: threads - 1,
        spark_executor_memory = lambda wildcards, resources, threads: round((resources.mem_mb - 2000.0) / threads)
    threads: 8
    resources:
        mem_mb = 32000
    shell:
        "gatk MarkDuplicatesSpark "
        '--java-options "{params.java_options}" '
        "-I {input} "
        "-O {output.bam} "
        "-M {output.metrics} "
        "--treat-unsorted-as-querygroup-ordered "
        "--remove-all-duplicates "
        "--conf 'spark.executor.instances={params.spark_executor_instances}' "
        "--conf 'spark.executor.cores={params.spark_executor_cores}' "
        "--conf 'spark.executor.memory={params.spark_executor_memory}M' "
        "--conf 'spark.local.dir={resources.tmpdir}'"

rule gatk_split_n_cigar_reads:
    conda:
        "envs/gatk4.yaml"
    input:
        bam = rules.gatk_mark_duplicates_spark.output.bam,
        genome_fasta = get_ref_path('genome_fasta')
    output:
        bam = "runs/{run_id}/gatk/aln.markdup.splitncigar.bam"
    params:
        java_options = get_java_resource_options
    threads: 3
    resources:
        mem_mb = 20000
    shell:
        "gatk SplitNCigarReads "
        '--java-options "{params.java_options}" '
        "-R {input.genome_fasta} "
        "-I {input.bam} "
        "-O {output.bam}"

rule gatk_bqsr:
    conda:
        "envs/gatk4.yaml"
    input:
        bam = rules.gatk_split_n_cigar_reads.output.bam,
        genome_fasta = get_ref_path('genome_fasta'),
        dbsnp_vcf = config['gatk']['dbsnp_vcf'],
        indels_vcf = config['gatk']['indels_vcf']
    output:
        bqsr_table = "runs/{run_id}/gatk/bqsr_recal.table"
    params:
        java_options = get_java_resource_options
    threads: 3
    resources:
        mem_mb = 16000
    shell:
        "gatk BaseRecalibrator "
        '--java-options "{params.java_options}" '
        "-R {input.genome_fasta} "
        "-I {input.bam} "
        "-O {output.bqsr_table} "
        "--use-original-qualities "
        "-known-sites {input.dbsnp_vcf} "
        "-known-sites {input.indels_vcf}"

rule gatk_apply_bqsr:
    conda:
        "envs/gatk4.yaml"
    input:
        bam = rules.gatk_split_n_cigar_reads.output.bam,
        bqsr_table = rules.gatk_bqsr.output.bqsr_table,
        genome_fasta = get_ref_path('genome_fasta')
    output:
        bam = "runs/{run_id}/gatk/aln.bqsr.bam"
    params:
        java_options = get_java_resource_options
    threads: 3
    resources:
        mem_mb = 16000
    shell:
        "gatk ApplyBQSR "
        '--java-options "{params.java_options}" '
        "-R {input.genome_fasta} "
        "-I {input.bam} "
        "--bqsr-recal-file {input.bqsr_table} "
        "-O {output.bam}"

rule gatk_analyze_covariates:
    conda:
        "envs/gatk4.yaml"
    input:
        bqsr_table = rules.gatk_bqsr.output.bqsr_table
    output:
        pdf = "runs/{run_id}/gatk/analyze_covariates.pdf",
        csv = "runs/{run_id}/gatk/analyze_covariates.csv"
    params:
        java_options = get_java_resource_options
    threads: 3
    resources:
        mem_mb = 16000
    shell:
        "gatk AnalyzeCovariates "
        '--java-options "{params.java_options}" '
        "-bqsr {input.bqsr_table} "
        "-plots {output.pdf} "
        "-csv {output.csv}"

rule gatk_haplotype_caller:
    conda:
        "envs/gatk4.yaml"
    input:
        genome_fasta = get_ref_path('genome_fasta'),
        dbsnp_vcf = config['gatk']['dbsnp_vcf'],
        wgs_calling_intervals = config['gatk']['wgs_calling_intervals'],
        bam = rules.gatk_apply_bqsr.output.bam
    output:
        vcf = "runs/{run_id}/gatk/variants.vcf.gz"
    params:
        java_options = get_java_resource_options,
        stand_call_conf = config['gatk']['haplotype_caller_stand_call_conf']
    threads: 4
    resources:
        mem_mb = 16000
    shell:
        "gatk HaplotypeCaller "
        '--java-options "{params.java_options}" '
        "-R {input.genome_fasta} "
        "-I {input.bam} "
        "-O {output.vcf} "
        "-L {input.wgs_calling_intervals} "
        "-dont-use-soft-clipped-bases "
        "-stand-call-conf {params.stand_call_conf} "
        "--dbsnp {input.dbsnp_vcf}"

rule gatk_variant_filtration:
    conda:
        "envs/gatk4.yaml"
    input:
        genome_fasta = get_ref_path('genome_fasta'),
        vcf = rules.gatk_haplotype_caller.output.vcf
    output:
        vcf = "runs/{run_id}/gatk/variants.filtered.vcf.gz"
    params:
        java_options = get_java_resource_options
    threads: 3
    resources:
        mem_mb = 16000
    shell:
        "gatk VariantFiltration "
        '--java-options "{params.java_options}" '
        "--R {input.genome_fasta} "
        "--V {input.vcf} "
        "-O {output.vcf} "
        "--window 35 "
        "--cluster 3 "
        '--filter-name "FS" '
        '--filter-expression "FS > 30.0" '
        '--filter-name "QD" '
        '--filter-expression "QD < 2.0"'

rule gatk_funcotator:
    conda:
        "envs/gatk4.yaml"
    input:
        genome_fasta = get_ref_path('genome_fasta'),
        vcf = rules.gatk_variant_filtration.output.vcf
    output:
        maf = "runs/{run_id}/gatk/variants.filtered.funcotator.maf.gz"
    params:
        java_options = get_java_resource_options,
        data_source_dir = config['gatk']['funcotator_data_sources']
    threads: 3
    resources:
        mem_mb = 8000
    shell:
        "gatk Funcotator "
        '--java-options "{params.java_options}" '
        "--R {input.genome_fasta} "
        "--V {input.vcf} "
        "-O {output.maf} "
        "--output-file-format MAF "
        "--data-sources-path {params.data_source_dir} "
        "--ref-version hg38"
```

@ning-y

ning-y commented Aug 24, 2022

Thanks for adding on, @mkiyer. As an update: deleting the .snakemake directory fixes that instance of the "always re-run" bug. But even with a freshly generated .snakemake directory, on Snakemake 7.12.0, the "always re-run" bug sometimes transiently reappears. If I ask for rule A, Snakemake always tries to rerun a set of dependencies {B, C, D, ...}, even right after rule A has been run successfully, and even after using --touch on dependencies {B, C, D, ...}.

Nonetheless, I still agree with the theory behind this change to rerun behaviour. Is there a way we can report informative and helpful debug logs to the development team?

@ning-y

ning-y commented Sep 2, 2022

I'd just like to direct further discussion about unexpected re-executions to this later issue, which to my knowledge is the first to present a reprex: #1818.

@jonas-eschle

jonas-eschle commented Dec 4, 2022

I have thought about this change more thoroughly.

> Rather, I thought of this more as a fix for an incomplete feature (Snakemake now correctly recognizes all the things it missed before and that required people to manually enforce re-execution). But from a user perspective it can be somewhat breaking in certain cases.

The big question that I have not really found answered anywhere is: what is actually the scope of Snakemake? I had always taken it to be a mature pipeline tool with predictable behavior, but IMHO it is becoming a much less reliable and less stable package in terms of behavior. So what is the actual goal of the package? Was automatically recognizing changes (of what, exactly?) always a goal?

What I mean by unpredictable is this: ask whether Snakemake automatically reruns if your code changes, and there is no yes or no, only a "maybe". For example, yes if your script changes, but not (AFAIU) if a function it imports changes. Yes if you merely fix a typo in a print statement, but only if the script is invoked via script, not via shell. Or something like that. My understanding was that since it is impossible to know for sure what code actually changed and how it affects things, that is the very reason -R exists: it lets the user state which rules are invalidated. If I change something in my code, it may or may not be detected by Snakemake (depending on where exactly the change is); -R lets me specify what is affected.

I see the temptation to build a "framework that just does everything perfectly". But that's unrealistic. It may solve a few nice cases for a few beginner users, but for most users it leads to unpredictable behavior (because I always need to think about whether it will detect the change or not). This has already been invented, though: it's VCS like Git. A user can (and should) use it to detect changes and judge whether they affect the output.

The change above mostly targets a very specific audience: one that runs workflows on the order of minutes, not days or weeks. Imagine your workflow runs for a week. Should reruns really be triggered by any change in the script by default? Don't you think this is better left to the user?

In short, this change looks to me like backup software that automatically backs up some of the files some of the time, so you can never be sure which files are actually backed up. The solution is to make it reliable and tell it what to back up and when.

> Regarding the details: a code change means that the hash of the shell command or run block has changed, or, in the case of script, that the script's modification date has changed. This of course means that even cosmetic changes to the script/run/shell section of a rule trigger a rerun. It took me a long time to finally make this decision, but in the end I thought it better than silently ignoring any of these changes. I would rather be conservative and ensure that results are guaranteed to be consistent with the current state of the codebase on disk than silently ignore such changes, which makes reliable results harder to achieve for beginners and requires quite deep knowledge of Snakemake to handle reruns in such cases.

This is what I mostly disagree with: the "easier for beginners" part. IMHO, what is good for beginners is not "magic tools" that sometimes work and sometimes don't; there is no way to ever understand those. For beginners, it is much more useful to have consistent behavior. No magic "sometimes it reruns, sometimes it doesn't": that's confusing.

What are the "internals" of snakemake that a beginner needs to unerstand? I think it's easier without: any input file that is changed, will trigger rerun. Any code that is changed, won't. Simple. Let's compare that to the behavior now: how would you explain to a beginner when his workflow is rerun and when it isn't? (I mean this as a very serious point - most probably, this is causing a lot of harm because people now maybe rely on the rerun while it doesn't do it for all of it).

> In the end, the behavior can always be configured (via --profile), but I thought that the conservative approach is a good default for beginners. During development, it can of course be a good idea to just use the mtime trigger, when you know what you are doing.

I disagree: straightforward beats magic for beginners, IMHO.

I do deeply care about Snakemake: I use it myself, we use it at our experiment at CERN, and we even teach it to new students. But more and more I find myself stuck with weird bugs and unpredictable behavior caused by trying to "magically" make code work for an absolute beginner (such as #308). It's not going to work. There is no way around understanding basic principles; otherwise there are so many more ways for people to create bugs.

@ning-y

ning-y commented Dec 4, 2022

@jonas-eschle

The eager re-run behavior is not necessarily bad for large, complex pipelines

I use Snakemake for large genomics pipelines, many of which take more than a week to run from start to finish. I support this change in the default rerun behavior because it encourages reproducibility. Precisely because I work with large pipelines, it is easy for me to forget (or to fear that I have forgotten) to rerun all downstream rules after a minor yet important change.

When Snakemake does not rerun rules after code changes, it is trading reproducibility for convenience. This is not necessarily a bad thing, but I think it is an explicit trade-off that must be made by the user, not the software. That is why it makes sense to me that eager reruns are the default but can be disabled easily, in an opt-in manner, via config options.

The eager re-run behavior is not necessarily bad for beginners

This is something close to my heart. Beginners are not only in classrooms; they are in laboratories, doing funded research. What is the cost of having eager reruns for students in a classroom? They have to deepen their understanding of workflow tools, or perhaps add a config file disabling it, or watch their toy examples run for a few more minutes.

If a researcher working on publishable data is a beginner and does not know to rerun all their downstream rules, or cannot keep track of which ones to rerun, then the cost is irreproducible results. I have seen this happen in work that was meant to be published.

The eager re-run behavior is neither unpredictable nor magic

The reasons for rerunning rules are clearly stated in Snakemake's console output, both at the rule level and again as a summary at the end. I think this was a very thoughtful change.

AFAIK Snakemake has never claimed to be mature software. I do not think it is even that old. It is also managed and developed by the open-source scientific community, so I think even the timeline to maturity is pretty far in the future.

Discussions on issues pages are biased toward... issues

I feel a lot of thought has gone into this change, and on this particular issue there is more disagreement than agreement. It should be noted that users who actively support this change self-select against posting anything, because they don't have an issue with it (and scientists are busy!).

@jonas-eschle

jonas-eschle commented Dec 4, 2022

@ning-y many thanks for these insights! Maybe to make my overall point a bit clearer: I'd love to see a tool that automatically reruns (i.e. invalidates) when your scripts have changed. I would claim that this is nearly impossible, and that the way Snakemake works now is definitely not there (let me be more specific about this at the end).

> [...] Especially because I work with large pipelines, it is easy for me to forget (or to fear that I have forgotten) to rerun all downstream rules after a minor yet important change.

I don't quite get this point: do you forget to run the downstream rules, i.e. do you run -R [rules where the code has changed] (which I don't understand; Snakemake will run them for you?), or do you forget to set -R correctly?

> When Snakemake does not rerun rules after code changes, it is trading reproducibility for convenience. This is not necessarily a bad thing, but I think it is an explicit trade-off that must be made by the user, not the software. That is why it makes sense to me that eager reruns are the default but can be disabled easily, in an opt-in manner, via config options.

I fully agree on the opt-in; nothing stands in the way of providing this option. It's still a faulty option, though: sometimes it works, sometimes it won't.

> This is something close to my heart. Beginners are not only in classrooms; they are in laboratories, doing funded research. What is the cost of having eager reruns for students in a classroom? They have to deepen their understanding of workflow tools, or perhaps add a config file disabling it, or watch their toy examples run for a few more minutes.

I agree that they have to deepen their understanding of workflow tools. There is no free lunch; no tool runs correctly without some understanding. You need to familiarize yourself with when a rule is rerun and when it isn't. My point here is that it is now a lot harder for a beginner to understand when that actually happens (see below).

> If a researcher working on publishable data is a beginner and does not know to rerun all their downstream rules, or cannot keep track of which ones to rerun, then the cost is irreproducible results. I have seen this happen in work that was meant to be published.

I fully agree on this point as well. I just do not think that Snakemake solves it. It tries, and maybe does more harm than good by catching it only sometimes (again, see below).

> The eager re-run behavior is neither unpredictable nor magic

> The reasons for rerunning rules are clearly stated in Snakemake's console output, both at the rule level and again as a summary at the end. I think this was a very thoughtful change.

Here is where we disagree: I do think it is highly unpredictable. Let's go through a few examples:

  • Changing the code under shell will trigger a rerun.
  • Changing a script that is called from a shell command won't.
  • Changing a script that is called via script will.
  • Changing a file containing a function that is imported into a script called via script won't.
  • Adding a global statement in a Snakefile that (a) affects some rules (i.e. modifies a variable) or (b) has no side effects (such as debugging printouts): will this rerun?

... and that's just the rules for code changes!

My main point here is that it is highly unpredictable. Maybe you disagree, and maybe I just fail to see the deeper logic, but could you explain the rules in simple words? If that's not possible, it's a sign that it is perhaps overcomplicated and therefore unpredictable.

The solution to this problem seems straightforward: add any dependency that should trigger a rerun to the input section, as sketched below. That solves all of the problems at once (you could even add the AST of a file as an input and check whether it changed ;)), makes all of the explanations above obsolete, and reduces the complexity to a single rule: rerun when an input file has changed.
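
A minimal sketch of that convention (rule name, paths, and script are made up for illustration): the script itself is listed as an input, so editing it triggers reruns through the ordinary mtime mechanism.

```
rule analyze:
    input:
        data="data/{sample}.csv",
        script="scripts/analyze.py",  # listing the script as an input makes edits to it trigger reruns
    output:
        "results/{sample}.txt"
    shell:
        "python {input.script} {input.data} > {output}"
```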

This seems much clearer to me, for advanced users as well as for beginners. What is gained by adding this "magic" instead of adding the input files explicitly?

What does it do badly?

One could argue that even if this doesn't catch all cases, catching some is better than none. I don't think so: it gives a false impression of well-functioning automatism. See the script example: if you run it under script, all is fine. Later, you move it to a shell command because you want to add something beforehand, and now you need to remember to rerun it yourself.

I am afraid that this is a lot harder to spot and introduces non-reproducible experiments that are not even noticed by the author. Can you blame someone for this?

> AFAIK Snakemake has never claimed to be mature software. I do not think it is even that old. It is also managed and developed by the open-source scientific community, so I think even the timeline to maturity is pretty far in the future.

By "mature" I meant that the scope is established and fundamentals such as the overall rerun rules won't just change. I am fully aware of OSS, especially the scientific OSS, and it is clear that maturity may be defined differently.

> Discussions on issues pages are biased toward... issues

> I feel a lot of thought has gone into this change, and on this particular issue there is more disagreement than agreement. It should be noted that users who actively support this change self-select against posting anything, because they don't have an issue with it (and scientists are busy!).

I am going to make a pretty strong claim here, feel free to disagree: the people who are happy with this are the ones who do not (yet) fully understand its failure modes. It adds a lot of complexity: every user now has to fully understand all of the code-rerun rules (or disable them), i.e. all the examples mentioned above.

It's not a "I do everything right" mechanism. It's a "sometimes right, sometimes not" mechanism".

@ning-y

ning-y commented Dec 4, 2022

> I don't quite get this point: do you forget to run the downstream rules, i.e. do you run -R [rules where the code has changed] (which I don't understand; Snakemake will run them for you?), or do you forget to set -R correctly?

I did use -R, but -R was opt-in and had to be populated manually. Over time I would second-guess whether I had -R'd correctly for all the changes I had made, so I would have to run the whole thing from scratch, e.g. before going on vacation, or before spending a week focusing on a different project.

> I fully agree on the opt-in; nothing stands in the way of providing this option. It's still a faulty option, though: sometimes it works, sometimes it won't.

I'm glad we have this common ground! I do not think it is faulty, and I will address this below.

> I agree that they have to deepen their understanding of workflow tools. There is no free lunch; no tool runs correctly without some understanding. You need to familiarize yourself with when a rule is rerun and when it isn't. My point here is that it is now a lot harder for a beginner to understand when that actually happens (see below)

> Here is where we disagree: I do think it is highly unpredictable. Let's go through a few examples:

> • Changing the code under shell will trigger a rerun.
> • Changing a script that is called from a shell command won't.
> • Changing a script that is called via script will.
> • Changing a file containing a function that is imported into a script called via script won't.
> • Adding a global statement in a Snakefile that (a) affects some rules (i.e. modifies a variable) or (b) has no side effects (such as debugging printouts): will this rerun?

> ... and that's just the rules for code changes!

Although it requires more effort to understand these behaviors, it is not impossible. So these are not unpredictable behaviors, but seemingly unpredictable behaviors that can be understood. A beginner can disable them, and the documentation and this announcement explain clearly how. The benefit of keeping this behavior despite its apparent unpredictability is added stringency in the reproducibility of a workflow, and I think that is reasonable.

As you wrote earlier, these are also reasonable gaps in the detection of code changes. Someone graduates from "beginner" once they can add scripts outside the script directive as inputs, and then the rerun behavior is consistent.

Changing a global variable in the Snakefile re-runs a rule if it changes the shell directive (detected as a code change) or params (detected as a param change).

> My main point here is that it is highly unpredictable. Maybe you disagree, and maybe I just fail to see the deeper logic, but could you explain the rules in simple words? If that's not possible, it's a sign that it is perhaps overcomplicated and therefore unpredictable.

Alas it is a sign that it is midnight where I live and I am rushing some work. :(

> The solution to this problem seems straightforward: add any dependency that should trigger a rerun to the input section. That solves all of the problems at once (you could even add the AST of a file as an input and check whether it changed ;)), makes all of the explanations above obsolete, and reduces the complexity to a single rule: rerun when an input file has changed.

Your solution, to me, is to move every script in my script directives and every conda environment YAML file in my conda directives into the input directives. Mine, to you, is to add rerun-triggers: "mtime" to a project-wide config.yaml file. I think "straightforward" is subjective.

> This seems much clearer to me, for advanced users as well as for beginners. What is gained by adding this "magic" instead of adding the input files explicitly?

It encourages reproducible workflows while still allowing users to explicitly opt out. The explicit opt-out lets the user take responsibility for the reproducibility of their workflows.

> One could argue that even if this doesn't catch all cases, catching some is better than none. I don't think so: it gives a false impression of well-functioning automatism. See the script example: if you run it under script, all is fine. Later, you move it to a shell command because you want to add something beforehand, and now you need to remember to rerun it yourself.

Don't get me wrong, I understand the need to run scripts via a shell command; I do so to increase the memory limits on R sessions. I think the frequency of such wrong impressions is much lower than you describe. Beginners will assume well-functioning software no matter what: I was there when I started using Snakemake before this rerun behavior change. Non-beginners are always distrustful of it: I am going on vacation soon, and you can be very sure I will be rerunning everything from scratch. But during active development, I know this behavior catches necessary reruns much better than I humanly can, and I appreciate that.

> I am afraid that this is a lot harder to spot and introduces non-reproducible experiments that are not even noticed by the author. Can you blame someone for this?

I really think the alternative of picking -R manually, or remembering to add every script to the input directive would be more prone to reproducibility errors.

> I am going to make a pretty strong claim here, feel free to disagree: the people who are happy with this are the ones who do not (yet) fully understand its failure modes. It adds a lot of complexity: every user now has to fully understand all of the code-rerun rules (or disable them), i.e. all the examples mentioned above.

I disagree! It's a pretty bad faith take too. Feels like you calling me a big dummy :( Hurts me feelings

@jonas-eschle

> [...] A beginner can disable them, and the documentation and this announcement explain clearly how.

... but isn't this the whole point of the discussion? The reasoning is that the feature was introduced because it's easier for beginners, but now you claim that beginners should disable it? So which is it: is it a good feature for beginners or not?

> The benefit of keeping this behavior despite its apparent unpredictability is added stringency in the reproducibility of a workflow, and I think that is reasonable.

... and the added unpredictability makes it harder for beginners to understand the process.

> As you wrote earlier, these are also reasonable gaps in the detection of code changes. Someone graduates from "beginner" once they can add scripts outside the script directive as inputs, and then the rerun behavior is consistent.

And before? What do they do before that? They may or may not get the correct behavior. That's the problem: it is sold as an improvement for beginners, right?

> Changing a global variable in the Snakefile re-runs a rule if it changes the shell directive (detected as a code change) or params (detected as a param change).

This is again only half of it: yes, it sometimes catches the change. But what if I have a function that uses this global? Then it won't rerun, as it can't possibly detect the change, AFAIU.

As a more advanced user who (hypothetically) understands exactly what Snakemake does, I could infer whether it reruns. But a beginner? Hardly any chance. And again, this change is "for beginners"; that's why it's opt-out, not opt-in.

> Alas it is a sign that it is midnight where I live and I am rushing some work. :(

Sure ;) But maybe you'll have time on another occasion to explain it nicely. I would have a very hard time explaining it to a beginner, actually even to an advanced user (and justifying it).

> Your solution, to me, is to move every script in my script directives and every conda environment YAML file in my conda directives into the input directives. Mine, to you, is to add rerun-triggers: "mtime" to a project-wide config.yaml file. I think "straightforward" is subjective.

There is a difference between our solutions: while mine requires more explicit work, it actually works. Let's be very clear:

the current, magic behavior works sometimes and sometimes not

(can we agree on this one?)

So in some cases you will need to add dependencies to the input anyway. Sometimes. Like when it's a function imported into the script. Or when you're using shell. Or...

"mtime"to a project-wideconfig.yaml` file. I think straight-forward is subjective.

There is no such thing, AFAIK (but please point me to it): a project-wide config does not exist. There is no default profile in Snakemake. Hence, this is not even a working solution.

> It encourages reproducible workflows while still allowing users to explicitly opt out. The explicit opt-out lets the user take responsibility for the reproducibility of their workflows.

No, I don't think it encourages anything. The message saying "your code changed, maybe rerun" does "encourage". The behavior just runs, sometimes, and gives a false impression, since it reruns sometimes and sometimes it doesn't.

> Don't get me wrong, I understand the need to run scripts via a shell command; I do so to increase the memory limits on R sessions. I think the frequency of such wrong impressions is much lower than you describe. Beginners will assume well-functioning software no matter what: I was there when I started using Snakemake before this rerun behavior change. Non-beginners are always distrustful of it: I am going on vacation soon, and you can be very sure I will be rerunning everything from scratch. But during active development, I know this behavior catches necessary reruns much better than I humanly can, and I appreciate that.

I fully agree. But the new behavior does not change that: you will still need to use -R on some of the rules. So you will still do exactly the same thing, i.e. sometimes just rerun the whole workflow.
I think you are doing exactly the right thing. But wouldn't you do exactly the same thing with or without the new behavior?

> I really think the alternative of picking -R manually, or remembering to add every script to the input directive, would be more prone to reproducibility errors.

And exactly this is the danger: the argument that picking -R manually is the alternative is the very problem I am talking about. Picking -R automatically is not what it does now; it picks it sometimes. So you will still sometimes need the manual -R.

But as you say, you seem to imply that this is not the case and that it "does it automatically now". Which is exactly the problematic belief.

It's not an alternative; it's additional. Not everything (again: imports, shell, ...) is covered by this.

> I disagree! It's a pretty bad faith take too. Feels like you calling me a big dummy :( Hurts me feelings

That wasn't meant to hurt, and I surely didn't mean "dummy"; I said "not fully understand" (and gosh, how many things do we all not fully understand ;)). But your statement above, considering manual -R invocation as the alternative, is pretty much what I meant by that: it's not an alternative. Thinking that it is does more harm than good, IMHO.

@ning-y

ning-y commented Dec 12, 2022

@jonas-eschle Sorry for the late reply! I've been a bit swamped lately. Let me try to organize our discussion into different sections.

The ease of use, goodness of a feature, and heterogeneity of "beginners"

I do not take responsibility for claims made by the developers, so I cannot speak on their behalf; nor am I interested in selling Snakemake. I am just trying to present a point of view that, as far as I know, is not yet represented in this conversation.

I do not remember making the claim that this default is easy for beginners, but I do think that it is good for beginners. What is good is not necessarily easy. The goodness of this feature, in my opinion, is that it protects beginners from irreproducible results. I agree that this is less easy than not having this default, because it introduces some complexity. I think this complexity is mitigated by the well-thought-out console output printing the reasons for reruns.

I admit that this trade-off between preventing irreproducibility and ease of use is not suited to all beginners, because beginners are a heterogeneous group of users with different use cases. Nonetheless, I think the safer mode should be the default mode.

An analogy: a lab coat protects scientists from contaminants, but it is less easy to work with a lab coat than without one. There are scenarios in the lab where you might opt not to wear a lab coat (for example, when you are just restocking the shelves), but by default beginners are asked to wear one. I think this is because we want them to be protected by default, and because they might not make well-informed decisions about the risk of an activity and whether it requires a lab coat. Likewise, rerunning by default in Snakemake spares the beginner from making mistakes about when to rerun rules. They are free to opt out, just as we are free to remove our lab coats; but for safety it is not the default.

I think you and I fundamentally disagree on what is good for beginners. I think safety is good, because in biology beginners are often running computational analyses to produce results intended for publication. Perhaps you teach classes on Snakemake, so you prioritize ease of use, and I think that is reasonable too. I will avoid fence-sitting, however, and say that safety is more important than ease of use in this context, where ease of use can be restored with a one-line configuration file.

Global variables affecting a function

> This is again only half of it: yes, it sometimes catches the change. But what if I have a function that uses this global? Then it won't rerun, as it can't possibly detect the change, AFAIU.

If the change in the value of the global variable changes the return value of an input or params function, it will rerun. For example, the following always reruns:

```
import datetime

GLOBAL_TIME = datetime.datetime.now()

rule record_time:
    output: txt = "recorded-time.txt"
    params: the_time = lambda _: GLOBAL_TIME
    shell: "echo {params.the_time} > {output.txt}"
```

And in the console output it explains clearly why it re-runs:

```
[Mon Dec 12 23:26:58 2022]
rule record_time:
    output: recorded-time.txt
    jobid: 0
    reason: Params have changed since last execution
    resources: tmpdir=/tmp

[Mon Dec 12 23:26:58 2022]
Finished job 0.
1 of 1 steps (100%) done
```

> Sure ;) But maybe you'll have time on another occasion to explain it nicely. I would have a very hard time explaining it to a beginner, actually even to an advanced user (and justifying it).

To a beginner, I would explain it thus: a rule reruns if the evaluated value of its directives has changed. The console message will explain exactly why it reruns, and you can use the --dry-run flag to check whether something will rerun before committing to an actual run.

On inconsistency of behavior

I do not see your example with global variables as an example of inconsistent behavior, because it is in line with the explanation I would give a beginner (as above). Could you provide some other examples?

The user- and project-wide config files

> There is no such thing, AFAIK (but please point me to it): a project-wide config does not exist. There is no default profile in Snakemake. Hence, this is not even a working solution.

My mistake! Snakemake calls them profiles: https://snakemake.readthedocs.io/en/stable/executing/cli.html#profiles. You can follow the instructions to set up a user-wide profile which disables this default rerun behavior. Would you consider this a working solution?

The project-wide config file is config.yaml in the project directory.
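
For concreteness, a sketch of setting this up user-wide (the profile name mtime-only is made up; the SNAKEMAKE_PROFILE environment variable is supported in recent Snakemake versions):

```sh
mkdir -p ~/.config/snakemake/mtime-only
echo 'rerun-triggers: mtime' > ~/.config/snakemake/mtime-only/config.yaml

# Pass it explicitly per invocation...
snakemake --profile mtime-only -n
# ...or select it once for all invocations (recent Snakemake versions):
export SNAKEMAKE_PROFILE=mtime-only
```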

Again, on inconsistency of behavior

I think this is your point of objection in most of my last reply. Again, it would be good if you could provide some examples. Now, I am not claiming that Snakemake has no bugs, but I think the behavior is consistent enough to be good after #1818 was fixed (before #1818 I might have agreed with you).

@Kevin-Brockers

How do --rerun-triggers mtime and --rerun-incomplete get resolved?

Are files still recomputed when they are marked incomplete and both flags are set?

@wangjiawen2013

Hi,
I am still confused about Snakemake's rerun behavior.

The Snakemake tutorial (https://snakemake.readthedocs.io/en/stable/tutorial/basics.html) says that "Snakemake only re-runs jobs if one of the input files is newer than one of the output files or one of the input files will be updated by another job". I tried some of my own code and datasets, and that is not the case!

Then I read more about how Snakemake reruns at https://carpentries-incubator.github.io/snakemake-novice-bioinformatics/04-the_dag/index.html. I tried some of my own code and datasets again, and that is not the case either! Snakemake reruns without following the description!

So, what is the underlying rerun mechanism? Or am I misunderstanding it?

@wangjiawen2013

wangjiawen2013 commented Aug 1, 2023

And here is a toy example I wrote (test.smk):

```
rule all:
    input:
        "/home/wangjw/data/work/flask/b.txt"

rule copy1:
    input:
        file = directory("/home/wangjw/data/work/flask")
    output:
        file = "data1/a.txt"
    shell:
        "cp {input.file}/a.txt {output.file}"

rule copy2:
    input:
        file = "data1/a.txt"
    output:
        file = "/home/wangjw/data/work/flask/b.txt"
    shell:
        "cp {input.file} {output.file}"
```

The a.txt was created beforehand and is located in the directory "/home/wangjw/data/work/flask". Snakemake showed "The flag 'directory' used in rule copy1 is only valid for outputs, not inputs." and ran again each time I executed "snakemake -j 1 -s test.smk", although the outputs already exist! Is this behavior intentional, or a bug?

@johanneskoester

Hi @wangjiawen2013, thanks for reporting. What happens here is the following: you write into /home/wangjw/data/work/flask with rule copy2. This changes the modification date of the directory /home/wangjw/data/work/flask, which is an input of rule copy1. Hence, Snakemake wants to rerun the workflow the next time you invoke it.
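
The fix implied by that explanation is to depend on the specific file rather than its parent directory, so that writes to sibling files do not update the input's mtime. A sketch, reusing the paths from the example above:

```
rule copy1:
    input:
        file = "/home/wangjw/data/work/flask/a.txt"  # the file itself, not directory(...)
    output:
        file = "data1/a.txt"
    shell:
        "cp {input.file} {output.file}"
```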

@moritzschaefer

moritzschaefer commented Sep 7, 2023

Quick question regarding the default rerun-trigger on code changes:

I am using version control, and quickly switching back and forth between branches can change the modification times of Snakemake-called scripts. E.g., I only run my very expensive pipeline on 'main' but experiment with script modifications on 'develop'. On develop, I freely edit my scripts. When I switch back to 'main' to run my expensive pipeline, everything reruns, because the modification times of some scripts changed even though their content is the same.

What's the intended workflow here? Should I keep another copy of my repository in which I do not switch back and forth between branches?

Would it be possible to compute hashes of all involved scripts, rather than relying on their modification times, to decide whether rules/files should be rerun?

Generally, I share @jonas-eschle's feeling about the change: I understand that in theory it is the 'correct' behavior, guaranteeing that the pipeline output files are consistent with all inputs (i.e. data AND code). However, in practice it unfortunately causes more problems than it solves, at least for me.

@LukaP-BB

@moritzschaefer The probably correct workflow here is to separate your development process from production.
So: one folder where you basically have main + develop and stay on develop. Once you have validated your changes, merge develop into main and pull in the other folder.
Running production workflows from the same folder where you test is risky and can lead to bad surprises.

@jonas-eschle

@LukaP-BB for a larger production setting, a company, that is probably the way to go. But for a typical scientific workflow that would be huge, unnecessary overhead, as there is no such thing as "production", really. Why would that be required in the first place? To work around the shortcomings of a tool like Snakemake that introduced a "feature" which makes it necessary?

Git is a great tool, an amazing tool, and it makes it extremely easy to switch branches and run different things. Great. Let's not destroy that! (A workflow with CI and such of course makes sense for production, but maybe you're not at that scale.)

> Would it be possible to compute hashes of all involved scripts, rather than relying on their modification times, to decide whether rules/files should be rerun?

@moritzschaefer isn't this ironic to ask? If only there were a tool that could spot differences well... Ah, there is: Git!
Let's not replace Git. Please, Snakemake, don't try to. But maybe use it instead of reinventing the wheel, if that is really needed? (Or yes, use hashes. Obviously. Why would you not?) But then again, what about compiled files that are not spotted, etc.? All partially broken...

I am aware that it's sometimes hard to say "no" to a feature request. But that is what differentiates great packages from mediocre ones. This discussion about a (I'll bet on it) "once asked by a complete beginner" feature wouldn't matter so much if Snakemake weren't broken on so many levels, with serious bugs, such as rule inheritance being completely broken, not addressed at all.

@Zepeng-Mu

How can I make rerun decisions depend only on the presence of output files? I remember this being the behavior when I first used Snakemake.

@Zepeng-Mu

Actually, the ancient flag works for me.
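
For readers landing here: ancient() marks an input whose modification time should be ignored, so the rule only runs when the output is missing (file names below are illustrative):

```
rule summarize:
    input:
        ancient("data/reference.fa")  # mtime of this input never triggers a rerun
    output:
        "results/summary.txt"
    shell:
        "wc -l {input} > {output}"
```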

@wangjiawen2013

It is said that "Snakemake only re-runs jobs if one of the input files is newer than one of the output files or one of the input files will be updated by another job (https://snakemake.readthedocs.io/en/stable/tutorial/basics.html)", however, I find that the re-run rule becomes confusing when setting all the outputs of each rule in rule all, given that there are dependencies among some rules (and thus there maybe redundance in rule all):

```
rule all:
    input:
        "rule1/rule1.txt",
        "rule2/rule2.txt",
        "rule3/rule3.txt",
        "rule4/rule4.txt",
        "rule5/rule5.txt",
        "rule6/rule6.txt"
```

We can put all outputs in rule all and then run snakemake to get all of them, or we can specify the target files by hand when running snakemake, in which case we get only the specified outputs (and can get all of them by repeating this protocol). Are the rerun rules the same for these two methods?

@johanneskoester

> How can I make rerun decisions depend only on the presence of output files? I remember this being the behavior when I first used Snakemake.

Use a profile and set rerun-triggers: [mtime] in there. This restores the former default. Or use it at the command line via --rerun-triggers mtime.

@Zepeng-Mu

> How can I make rerun decisions depend only on the presence of output files? I remember this being the behavior when I first used Snakemake.

> Use a profile and set rerun-triggers: [mtime] in there. This restores the former default. Or use it at the command line via --rerun-triggers mtime.

Thanks. I realized I had accidentally updated my input files but not the output files, so I used the ancient function. It seems to work for me.

@jonas-eschle

> Use a profile and set rerun-triggers: [mtime] in there. This restores the former default.

@johanneskoester thanks for the explanation, but I think it currently does not: #2011

As I see it, there is currently no way to restore the old default? Is this just a bug in the code, or intended behavior?
