Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

snakefmt comments parsing has unintended side-effects #85

Closed
dlaehnemann opened this issue Nov 18, 2020 · 31 comments
Closed

snakefmt comments parsing has unintended side-effects #85

dlaehnemann opened this issue Nov 18, 2020 · 31 comments
Assignees
Labels
bug Something isn't working

Comments

@dlaehnemann
Copy link

I just tried to get snakefmt --check to work for snakemake-wrappers: snakemake/snakemake-wrappers#270

However, the wrappers sport a large variety of comments in the Snakefiles, with different arrangements that each make sense for different settings. snakefmt rearranges them all in a standardized way, but in the process messes up their arrangement, oftentimes severing the clear association between some rule keyword and the comment. Here's the commit with which I tried to revert and fix those wrong adjustments -- it provides a good sample of how people may want to use comments in Snakefiles:
snakemake/snakemake-wrappers@c82e430

@johanneskoester
Copy link
Contributor

Ok, I'll deactivate snakefmt on snakemake-wrappers again for now.

@mbhall88
Copy link
Member

Ok, so what you're saying is you would like

rule bedtools_sort_bed:
    input:
        in_file="a.bed", # an optional sort file can be set as genomefile by the variable genome or
         # as fasta index file by the variable faidx
        genome="dummy.genome",

to be left as is and not turned into

rule bedtools_sort_bed:
    input:
        in_file="a.bed",
        # an optional sort file can be set as genomefile by the variable genome or
        # as fasta index file by the variable faidx
        genome="dummy.genome",

This is also brings up something that has been low-key bugging me for a while. PEP8 specifies that inline comments should be separated from the statement by at least two spaces.

For example, both black and snakefmt correctly diff this following

-foo = "bar" # a comment
+foo = "bar"  # a comment

(i.e. the extra space between the statement and the comment). But snakefmt doesn't do the same inside a rule

# test.smk
rule foo:
    input:
        bar="file.txt", # comment
        # another comment
$ snakefmt -l 10 --diff test.smk
  rule foo:
      input:
          bar="file.txt", # comment
-         # another comment
+          # another comment
? +

This also makes me realise @dlaehnemann that I am having a little difficulty recreating your issue.

Could you please provide a concrete example we can use for testing purposes showing actual and expected results for a small snakefile?

@mbhall88 mbhall88 added the bug Something isn't working label Nov 19, 2020
@johanneskoester
Copy link
Contributor

Here we go.

Correct:

rule bamtools_filter:
    input:
        "{sample}.bam",
    output:
        "filtered/{sample}.bam",
    params:
        # optional parameters
        tags = [ "NM:<4", "MQ:>=10" ],  # list of key:value pair strings
        min_size="-2000",
        max_size="2000",
        min_length="10",
        max_length="20",
        # to add more optional parameters (see bamtools filter --help):
        additional_params='-mapQuality ">=0" -isMapped "true"',
    log:
        "logs/bamtools/filtered/{sample}.log",
    wrapper:
        "master/bio/bamtools/filter"

snakefmt:

rule bamtools_filter:
    input:
        "{sample}.bam",
    output:
        "filtered/{sample}.bam",
    params:
        tags=["NM:<4", "MQ:>=10"], # optional parameters
         # list of key:value pair strings
        min_size="-2000",
        max_size="2000",
        min_length="10",
        max_length="20", # to add more optional parameters (see bamtools filter --help):
        additional_params='-mapQuality ">=0" -isMapped "true"',
    log:
        "logs/bamtools/filtered/{sample}.log",
    wrapper:
        "master/bio/bamtools/filter"

as can be seen, the single line comments that should refer to the lines below are moved to the lines above, which changes their semantics.

@dlaehnemann
Copy link
Author

Yeah, sorry for the unspecific dump, yesterday. I had spent hours fixing up that pull request, so simply wanted to note the issue without spending more time on it. So I am also sorry for the brisk tone of that issue...

@johanneskoester provided the most important recurring issue with the comment formatting, thanks for this more constructive approach!
I'll go through the commit again, now, and provide a few more comment use cases which I think make sense and should not be changed.

@dlaehnemann
Copy link
Author

For this one, I am actually not 100% sure if this is feasible with a linter, as it goes against the PEP you mentioned above. But I nevertheless want to document it here:

Wanted layout:

rule chm_eval_kit:                           # [hide]
    output:                                  # [hide]
        directory("resources/chm-eval-kit"), # [hide]
    params:                                  # [hide]
        tag="v0.5",                          # [hide]
        version="20180222",                  # [hide]
    log:                                     # [hide]
        "logs/chm-eval-kit.log",             # [hide]
    cache: True                              # [hide]
    wrapper:                                 # [hide]
        "master/bio/benchmark/chm-eval-kit"  # [hide]
                                             # [hide]

current snakefmt

rule chm_eval_kit: # [hide]
    output: # [hide]
        directory("resources/chm-eval-kit"), # [hide]
    params: # [hide]
        tag="v0.5", # [hide]
        version="20180222", # [hide]
    log: # [hide]
        "logs/chm-eval-kit.log", # [hide]
    cache: True # [hide]
    wrapper: # [hide]
        "master/bio/benchmark/chm-eval-kit" # [hide]
         # [hide]

@dlaehnemann
Copy link
Author

Here is an example, where current snakefmt even breaks the rule syntax (by commenting out the thread: entry's value when pulling up the second row with a comment):

wanted layout

    threads:  # Samtools takes additional threads through its option -@
        8     # This value - 1 will be sent to -@

current snakefmt

    threads: # Samtools takes additional threads through its option -@ 8 # This value - 1 will be sent to -@

If reformatting this would be wanted, an alternative formatting could be:

    threads: 8 # Samtools takes additional threads through its option -@; This value - 1 will be sent to -@

@dlaehnemann
Copy link
Author

Here, snakefmt thinks that the comments with a completely different indentation level do not belong to the indentation level they are at (and the resources spec that follows), but to the preceding inline comment.

wanted format

    params:
        extra="", # optional
        java_opts="", # optional
    # optional specification of memory usage of the JVM that snakemake will respect with global
    # resource restrictions (https://snakemake.readthedocs.io/en/latest/snakefiles/rules.html#resources)
    # and which can be used to request RAM during cluster job submission as `{resources.mem_mb}`:
    # https://snakemake.readthedocs.io/en/latest/executing/cluster.html#job-properties
    resources:
        mem_mb=1024,

current snakefmt

    params:
        extra="", # optional
        java_opts="", # optional
         # optional specification of memory usage of the JVM that snakemake will respect with global
         # resource restrictions (https://snakemake.readthedocs.io/en/latest/snakefiles/rules.html#resources)
         # and which can be used to request RAM during cluster job submission as `{resources.mem_mb}`:
         # https://snakemake.readthedocs.io/en/latest/executing/cluster.html#job-properties
    resources:
        mem_mb=1024,

@dlaehnemann
Copy link
Author

Similar to the previous example, but with commented out extra params, which is a good way to showcase possible further extra params in an example rule for a Snakemake wrapper. Here, snakefmt adds an extra whitespace that does not make sense for the use-case.

wanted format

    params:
        extra="", # optional
        java_opts="", # optional
        #spark_runner="",  # optional, local by default
        #spark_master="",  # optional
        #spark_extra="", # optional

current snakefmt

    params:
        extra="", # optional
        java_opts="", # optional
         #spark_runner="",  # optional, local by default
         #spark_master="",  # optional
         #spark_extra="", # optional

@dlaehnemann dlaehnemann changed the title snakefmt rearranges comments, often messing them up snakefmt comments parsing has unintended side-effects Nov 19, 2020
@dlaehnemann
Copy link
Author

@mbhall88 I think this list now gives a representative example for all the comment use cases I had to fix manually after running snakefmt on the snakemake-wrappers to avoid errors and wrong associations. I hope these are usable as test cases in fine-tuning snakefmt, even though I reduced them to the rule parts that are necessary to showcase the respective issue. But the rest of the rule could always be something generic.

@bricoletc
Copy link
Collaborator

Thank you for the enormous detective work @dlaehnemann and for the example @johanneskoester . We can now work with these concrete examples. we had clearly not given comment treatment a hard enough look, let's work on that now!

@dlaehnemann
Copy link
Author

Good to know, that these unexpected three hours yesterday were useful after all! :D
And I guess snakemake wrappers was simply a very large and diverse collection of all kinds of comment use cases out there...

And thanks for providing this, this will be incredibly useful for the snakemake wrappers and snakemake workflows projects, catching lots of unwanted stuff automatically in pull requests! So time invested here is definitely worth it in the long run!

@mbhall88
Copy link
Member

Thanks so much for this!! I'm amazed we've gotten this far without realising this.

I guess the best approach is to leave comments as they are - unless they have less than 2-space from inline statement, in which case they will just have spacing added.

@dlaehnemann
Copy link
Author

Yeah, to start with, that is probably the easiest and cleanest solution.

Should we at some point have any strong opinions about how comments should be formatted, we should use the above test cases to check that it keeps (most of) them as intended. But any such solution will probably have to reason about comment indentations (as in the # [hide] comments example and about alignment of comments with other content (as in the Java memory management comment example).

@mbhall88
Copy link
Member

Should we at some point have any strong opinions about how comments should be formatted, we should use the above test cases to check that it keeps (most of) them as intended. But any such solution will probably have to reason about comment indentations (as in the # [hide] comments example and about alignment of comments with other content (as in the Java memory management comment example).

I think we should just try and stick to what PEP8/black specifies? Which would appease both of the examples you mention I think.

@bricoletc
Copy link
Collaborator

For this one, I am actually not 100% sure if this is feasible with a linter, as it goes against the PEP you mentioned above. But I nevertheless want to document it here:

Wanted layout:

rule chm_eval_kit:                           # [hide]
    output:                                  # [hide]
        directory("resources/chm-eval-kit"), # [hide]
    params:                                  # [hide]
        tag="v0.5",                          # [hide]
        version="20180222",                  # [hide]
    log:                                     # [hide]
        "logs/chm-eval-kit.log",             # [hide]
    cache: True                              # [hide]
    wrapper:                                 # [hide]
        "master/bio/benchmark/chm-eval-kit"  # [hide]
                                             # [hide]

current snakefmt

rule chm_eval_kit: # [hide]
    output: # [hide]
        directory("resources/chm-eval-kit"), # [hide]
    params: # [hide]
        tag="v0.5", # [hide]
        version="20180222", # [hide]
    log: # [hide]
        "logs/chm-eval-kit.log", # [hide]
    cache: True # [hide]
    wrapper: # [hide]
        "master/bio/benchmark/chm-eval-kit" # [hide]
         # [hide]

Coming back to this, I don't think it is trivial to leave the comment alignment.
The problem is that we rely on python's tokenize library to process the text, and the tokenizer detects comments but ignores the preceding whitespace.

In fact for this situation black reformat like snakefmt does:

class p:          # [hide]
    elem1 = 2     # [hide]
    elem2 = 222   # [hide]

becomes

class p:  # [hide]
    elem1 = 2  # [hide]
    elem2 = 222  # [hide]

I would argue we should probably stick with that

@dlaehnemann
Copy link
Author

Jup, that's what I figured and why I added the caveat when posting this. It's a shame, because the alignment does make it a lot more readable, but for the formatter to see this and then keep it that way does sound to complicated to implement quickly. So I agree that this is not worth investing lots of energy.

Maybe one workaround for such examples would be to include black's comment-based tag that turns off formatting for a certain region (# fmt off and # fmt on):
https://github.com/psf/black#the-black-code-style

We will just have to remember to include this in the respective snakemake-wrappers before applying snakefmt again. Let me know once you're done with updating the comments handling, and I'd be glad to work on a new snakemake-wrappers PR to implement snakefmt over there, and will try to check up on all my lessons learned from last time around... :D

@bricoletc
Copy link
Collaborator

Adding #fmt off/on flags is an interesting idea! It would need to be a new snakefmt feature as we have our own parser to deal with snakemake syntax. Would you like to raise that as an issue @dlaehnemann ? As it is your idea; else I can.

I'm working on all the other failing test cases you have introduced, will let you know when those are done :) its so helpful to have you testing this out on all these real-world files.

@dlaehnemann
Copy link
Author

dlaehnemann commented Dec 1, 2020

Adding #fmt off/on flags is an interesting idea! It would need to be a new snakefmt feature as we have our own parser to deal with snakemake syntax. Would you like to raise that as an issue @dlaehnemann ? As it is your idea; else I can.

Done, see issue #86 .

I'm working on all the other failing test cases you have introduced, will let you know when those are done :) its so helpful to have you testing this out on all these real-world files.

Sounds like a fun game of whack-a-mole. Enjoy, I guess?

bricoletc added a commit to bricoletc/snakefmt that referenced this issue Dec 1, 2020
* Add PEP8 2-space inline comment spacing, with unit test
* Remove leading space in multiple comments in parameters (snakemake#85)
* Add 4 failing test cases taken from snakemake#85
@bricoletc
Copy link
Collaborator

Here is an example, where current snakefmt even breaks the rule syntax (by commenting out the thread: entry's value when pulling up the second row with a comment):

wanted layout

    threads:  # Samtools takes additional threads through its option -@
        8     # This value - 1 will be sent to -@

current snakefmt

    threads: # Samtools takes additional threads through its option -@ 8 # This value - 1 will be sent to -@

If reformatting this would be wanted, an alternative formatting could be:

    threads: 8 # Samtools takes additional threads through its option -@; This value - 1 will be sent to -@

I'd like to propose an alternative solution:

Given:

rule all:
    # Comm0
    # Comm1
    threads: # Comm2
        8  # Comm3

include: # Comm1
    t

Produce:

rule all:
    # Comm0
    # Comm1
    # Comm2
    threads: 8  # Comm3


# Comm1
include: t

Given the inline comment documents the keyword (eg here # Comm2 documents threads) I think it's reasonable to put it above for keywords that get formatted inline, like include and threads. One caveat is # Comm0 and # Comm1 here could be documenting rule all not threads, but then the user can merge # Comm2 and #Comm3 themselves; I'd rather not force comment merging.

Would you agree @dlaehnemann ?

@dlaehnemann
Copy link
Author

I definitely agree, that comment merging is difficult and thus I expect it to be error-prone. But moving each comment up one line could be similarly problematic. As you are pointing out, Comm0 and Comm1 could be referring to the rule, and then Comm2 would be erroneously aligned with them. Similarly, Comm2 and Comm3 could be intended as a block comment with continuous text and breaking their alignment could lead to confusion. So with the given variety of commenting styles out there, reasoning about these in a formatter and trying to systematically change them seems difficult.

An alternative could be to clearly define an intended Snakemake commenting style and document this everywhere (docs of Snakemake, snakemake-wrappers, snakemake-workflows, etc.). We would then have to go through existing projects once, to implement this style there. But even then, I think having the formatter change comment layout (beyond what black would do) is probably going to be problematic.

@bricoletc
Copy link
Collaborator

bricoletc commented Dec 3, 2020

Similarly, Comm2 and Comm3 could be intended as a block comment with continuous text and breaking their alignment could lead to confusion.

Could you expand on what you mean by this @dlaehnemann ?
And I think I see your points, but for this specific problem what do you reckon we should do, merge the comments?

Perhaps @johanneskoester has a guiding opinion?

@bricoletc
Copy link
Collaborator

Also I'm not sure there exists a valid Python statement where you have

statement : #comment 
    statement #comment

(Possibly because Python statements like class or if, that are followed by :, expect a newline before next statement)
So not sure what black would do anyway...seems like this could be uncharted territory.

@dlaehnemann
Copy link
Author

Yeah, that was not really clear. Here's an example:

    threads:  # this is a very, very, very long and
        8  # useless block comment about threads

I also came up with much more extreme case, but I think we won't be able to define a universally useful behaviour for all the comment types out there.

But I double-checked on the general python PEP 8 on comments, and maybe we could simply follow the very basic guidance there:
https://www.python.org/dev/peps/pep-0008/#comments

This would mean that:

  • block comments are always assumed to precede whatever they are describing
  • inline comments (i.e. a # comment after some code in the same line) are always assumed to describe that line and should have at least two spaces after the code (this does imply, that we could leave comments with more indentation wherever they are...)

And then the formatter could either emit warnings, if it finds funny stuff that doesn't seem to conform to this, or changes this according to the above assumptions. So e.g.:

1    threads:
2        8
3        # some weirdly indented comment, could be about the 8 above or the shell directive below
4    shell:
5        "do_something --arg1 test"

This would throw Warning: Block comment in line 3 not properly aligned with following 'shell' keyword, which it is expected to describe. For python comment formatting, see: https://www.python.org/dev/peps/pep-0008/#comments. Or, in conforming to the PEP 8, the formatter would adjust the indentation to that of shell:.

@bricoletc
Copy link
Collaborator

I see what you mean. For the example on the threads block comment, I think we have to assume that what comes after the 8 is the inline comment associated with threads, and place the other comment above threads. If there are yet more comments above threads, they are assumed to be also block comments describing threads. I'd happily issue a warning that such a reorganisation is happening.

For your second example I agree issuing a warning would also make sense- i wouldn't be keen on changing the comment indent though. given how the parser is set up its a lot of work to output all the information (eg it is expected to describe shell), I could at least add a warning referring to PEP8!

@dlaehnemann
Copy link
Author

So for a keyword like threads: (with the colon) followed directly by a comment, you would assume that this belongs above the keyword? I guess that should usually make sense, but it does mean changing what the user specified as an inline comment to a block comment. So yes, maybe issue a warning whenever that happens.

My second example, with the weird indentation, is actually a safe bet to adjust the indentation, based on PEP 8. Maybe also issue a warning that this was done. But no need to include too much info beyond the comment itself. If a line number can be issued, that's helpful, but otherwise the text of the comment itself should do -- this at least enables a quick search by the user.

One change I would actually suggest, is to only adjust the spacing of inline comments, if they are below the two whitespaces distance from the code. This way, aligned inline comments across multiple lines would stay untouched and the formatting would nevertheless conform to PEP 8. Also, this would remove the need of introducing the on/off tags suggested in issue #86 .

@bricoletc
Copy link
Collaborator

bricoletc commented Dec 3, 2020

So for a keyword like threads: (with the colon) followed directly by a comment, you would assume that this belongs above the keyword?

I would only place it above the keyword if it occurs before the value: so

threads: 8  # comment

The inline comment stays, but

threads: #comment 
    8

The comment goes above, with a warning

My second example, with the weird indentation, is actually a safe bet to adjust the indentation, based on PEP 8.

I would prefer issuing a warning only; PEP says Block comments generally apply to some (or all) code that follows them, ; generally applies seems like a soft requirement.
In addition, black does this:

Before

class p:
    # Comment
    elem1 = 2
    # Woopsy
# Woopsy2
class a:
    pass
class p:
    # Comment
    elem1 = 2
    # Woopsy


# Woopsy2
class a:
    pass

One change I would actually suggest, is to only adjust the spacing of inline comments,

Again I would prefer to stick to what black does there. Do you think that people often line up comments like the [hide] example you spotted (what does that hide mean actually? :p)

Sorry for being pedantic. Also at some level this comes down to opinions..

@bricoletc bricoletc self-assigned this Dec 4, 2020
@bricoletc
Copy link
Collaborator

Branch dev now has new functionality which I believe addresses all the breaking examples provided in this issue.

@dlaehnemann would you like to run tip of dev on the snakemake-wrappers files and check if things look good now? If that's too traumatic given the issues that came up last time, I can give it a go, though you might spot more clearly what's improved since last time. Amyway, I think progress has definitely been made!

@dlaehnemann
Copy link
Author

Will do, I'll try it out ASAP. And simply report back here, if anything fishy still turns up, instead of fixing for hours... ;)

bricoletc added a commit to bricoletc/snakefmt that referenced this issue Dec 14, 2020
* Add PEP8 2-space inline comment spacing, with unit test
* Remove leading space in multiple comments in parameters (snakemake#85)
* Add 4 failing test cases taken from snakemake#85
@bricoletc
Copy link
Collaborator

Hey @dlaehnemann any news on this? I've run it through a few of the wrappers and was broadly satisfied. I could push out a new release and rework any incoming issues as they come later on?

@dlaehnemann
Copy link
Author

Sorry @bricoletc , this got lost on the todo list. But I should find time at the end of the week.

If you're generally satisfied with the results on some wrappers, a release would be nice -- this way we could switch snakefmt on for the whole snakemake wrappers repo immediately, if changing the formatting works as we want it. Otherwise, we'll just have to iterate to another release. So a release would be welcome!

@bricoletc
Copy link
Collaborator

bricoletc commented Dec 15, 2020

No problem! OK, release 0.2.5 is made 🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

4 participants