Support fixme's in docstrings #9744

Open · wants to merge 13 commits into base: main

Conversation

badsketch (Contributor)

Type of Changes

Type
πŸ› Bug fix
βœ“ ✨ New feature
πŸ”¨ Refactoring
πŸ“œ Docs

Description

Closes #9255

Previous PR discussion here: #9281

  • now an enhancement of the existing fixme check rather than a new message
  • check-fixme-in-docstring is the setting that enables it; it defaults to false
  • also addresses the suggestions to improve the existing description: issue9255 - Detect FIXME words in docstring #9281 (comment). I kind of like how the message is the TODO itself. As for the "Used when a warning note..." text, is that for a tooltip? I believe almost all checker messages have something like this, right?

Appreciate any feedback!

elif self.linter.config.check_fixme_in_docstring and self._is_docstring_comment(token_info):
    docstring_lines = token_info.string.split("\n")
    for line_no, line in enumerate(docstring_lines):
        comment_text = line.removeprefix('"""').lstrip().removesuffix('"""')  # trim '"""' and whitespace
Contributor Author

removeprefix() and removesuffix() are new in python 3.9. Dumb question, but how do I tell what version this project supports?

Contributor Author

After seeing 3.8 test suites fail, I'm guessing I need to make this compatible 😅
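
For reference, a minimal sketch of a 3.8-compatible fallback (the helper name is hypothetical, not from this PR):

    # Hypothetical helper, not from this PR: str.removeprefix() and
    # str.removesuffix() only exist on Python 3.9+, so on 3.8 the same
    # trimming can be done with startswith()/endswith() and slicing.
    def strip_triple_quotes(line: str) -> str:
        if line.startswith('"""'):
            line = line[3:]
        line = line.lstrip()
        if line.endswith('"""'):
            line = line[:-3]
        return line

    print(strip_triple_quotes('"""TODO: fix me"""'))  # -> TODO: fix me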

Collaborator

Do we really need this? Can't we just put the full docstring in the message for now?

Contributor Author

For a case like

"""
TODO msg1
TODO msg2
"""

this PR will create two TODO lint messages. If we put the full docstring for both, it might be confusing or overly wordy, no?

@badsketch (Contributor Author)

When running pylint -h, I get

Miscellaneous:
  BaseChecker for encoding issues.

  --notes <comma separated values>
                        List of note tags to take in consideration, separated by a comma. (default:
                        ('FIXME', 'XXX', 'TODO'))
  --notes-rgx <regexp>  Regular expression of note tags to take in consideration. (default: )
  --check-fixme-in-docstring <y or n>
                        Whether or not to search for fixme's in docstrings. (default: False)

Thoughts on updating the docstring to "Checker for encoding issues and fixme notes" instead of "BaseChecker for encoding issues."?

@DanielNoord (Collaborator)

Thoughts on updating the docstring to "Checker for encoding issues and fixme notes" instead of "BaseChecker for encoding issues."?

Fine with me!

def _is_docstring_comment(self, token_info: tokenize.TokenInfo) -> bool:
    return (
        token_info.type == tokenize.STRING
        and token_info.line.lstrip().startswith('"""')
    )
Collaborator

Note that a docstring can also start with '''. I'm wondering if this should live in this tokeniser checker as I think it is actually quite hard to recognise docstrings on tokens alone.

Have you considered doing it as a checker for nodes.Module, nodes.ClassDef, etc? Then you can just check if the regex is in the .doc attribute.

Contributor Author

Note that a docstring can also start with '''

totally forgot about this, thanks!

Have you considered doing it as a checker for nodes.Module, nodes.ClassDef, etc? Then you can just check if the regex is in the .doc attribute.

I was considering the possibility that docstrings could appear outside of nodes.Module and nodes.ClassDef, so I figured it's best to use the existing token stream to watch for all occurrences. However, you could argue it's not good Python practice (?) in the first place to have docstrings outside modules/classes/methods. Agreed the PR doesn't use the safest heuristic to determine if it's a docstring.

So it looks like we could do

  1. update _is_docstring_comment() to also check startswith("'''")
  2. refactor to use nodes.Module, nodes.ClassDef, nodes.FunctionDef and tighten the scope of docstring fixmes (see the sketch below)
  3. update tokenizer with a new docstring token similar to how we have a token.COMMENT type for a comment fixme. (Haven't looked too deep into this, will probably be higher effort)

Totally down to change it to 2, but would users claim false negatives when they try to create docstring fixme's outside of module/classes/function defs? Perhaps it would help if there were a lint message that recommends against docstrings outside of those places. Do we have something like that already?
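
For illustration, a rough sketch of what option 2 could look like as a node visitor. This is my assumption of the shape, not code from this PR, and it assumes modern astroid exposes the docstring as node.doc_node (the .doc string attribute mentioned above is the older spelling):

    # A sketch of option 2: visit the node types that can carry a docstring
    # and run the notes regex over each of its lines.
    import re

    from astroid import nodes

    class DocstringFixmeSketch:
        def __init__(self, pattern: re.Pattern) -> None:
            self._fixme_pattern = pattern

        def _check_docstring(self, node: nodes.NodeNG) -> None:
            doc_node = getattr(node, "doc_node", None)
            if doc_node is None:
                return
            for offset, line in enumerate(doc_node.value.splitlines()):
                if self._fixme_pattern.search(line):
                    # A real checker would call self.add_message() here.
                    print(f"fixme at line {doc_node.lineno + offset}: {line.strip()}")

        visit_module = _check_docstring
        visit_classdef = _check_docstring
        visit_functiondef = _check_docstring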

Collaborator

@Pierre-Sassoulas Opinion? I think trying option 1 for now might be fine, we can always refactor to 2 later on. I just thought I would raise the question to see if it was consciously ignored.

Member

I think the decision should be taken consciously. I would have thought that the node visitor implementation would be cleaner/terser, but maybe using the tokenizer is faster? I expected fewer changes to be needed for the docstring fixme check, since we already have something working for comments. Did we use the tokenizer for comments? I did not look very deep into this.

Contributor Author

Yeah, we already use the tokenizer for comments. If I understand correctly, the tokenizer is for cases where there's no defined node for the check (e.g., a comment can appear "anywhere" in the code, so we check all tokens for that appearance). I tried to follow that logic with docstrings, hence I piggyback on the tokenizer to examine occurrences of """ and '''. If we decide we only wish to support docstrings in classes, functions, and methods, then I could use a node visitor on those 3 node types.

It may be easiest to go with option 1 at this point since it'd be straightforward to make the change in the PR. And as Daniel mentioned, we could refactor to 2 in the future. Thoughts?

Member

Docstrings are not supposed to be everywhere in a module (they'd be useless statements otherwise), so a node approach would work. But if comments require the tokenizer, then for consistency of approach let's go with 1).

codecov bot commented Jun 30, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 95.80%. Comparing base (a5a77f6) to head (eea95d2).
Report is 44 commits behind head on main.

@@            Coverage Diff             @@
##             main    #9744      +/-   ##
==========================================
- Coverage   95.84%   95.80%   -0.05%     
==========================================
  Files         174      174              
  Lines       18878    18895      +17     
==========================================
+ Hits        18094    18102       +8     
- Misses        784      793       +9     
Files                      Coverage Δ
pylint/checkers/misc.py    90.41% <100.00%> (+1.88%) ⬆️

... and 20 files with indirect coverage changes

@DanielNoord (Collaborator) left a comment

Awesome you got this to work!

Comment on lines 121 to 122
@set_config(check_fixme_in_docstring=True)
def test_docstring_with_message(self) -> None:
Collaborator

@Pierre-Sassoulas Do you want these unit tests? I would be fine with having only the functional tests and removing these. They feel like duplicates.

Member

I agree with you; I use functional tests only almost all the time (terser/clearer once you know about functional tests).

Contributor Author

Gotcha, that makes sense! Reverted changes to this file.

self._comment_fixme_pattern = re.compile(comment_regex, re.I)
if self.linter.config.check_fixme_in_docstring:
    docstring_regex = rf"((\"\"\")|(\'\'\'))\s*({notes})(?=(:|\s|\Z))"
    self._docstring_fixme_pattern = re.compile(docstring_regex, re.I)
Collaborator

I think we need to always set these two pattern attributes so that we don't get an AttributeError in unexpected ways. In the old version we also did this.

Contributor Author

Gotcha, I've removed the conditionals here and just let process_tokens() handle the logic of what to parse when the check-fixme-in-docstring setting is enabled.

Comment on lines 171 to 178
if line.startswith(('"""', "'''")):
    line = line[3:]
line = line.lstrip()
if line.endswith(('"""', "'''")):
    line = line[:-3]
if self._docstring_fixme_pattern.search(
    '"""' + line.lower()
) or self._docstring_fixme_pattern.search("'''" + line.lower()):
Collaborator

Now that you have fixed the pattern, is this really necessary? I would prefer a pattern where we don't have to make a lot of complicated changes to the lines themselves.

Contributor Author

True, it was getting pretty messy. I've decided to rework the logic to use regex capture groups for both docstrings and comments (since for comments we were also sort of changing the line and then stitching it back together to examine with regex).
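
As a hypothetical illustration of the capture-group idea (the pattern and names below are mine, not necessarily the PR's):

    import re

    notes = "FIXME|XXX|TODO"
    # The group captures the note keyword plus its message directly, so no
    # manual quote-trimming or line-stitching is needed beforehand.
    single_line_docstring_rgx = re.compile(
        rf"^\s*(?:\"\"\"|''')\s*((?:{notes})\b.*?)\s*(?:\"\"\"|''')\s*$", re.I
    )

    match = single_line_docstring_rgx.match('"""TODO: handle the edge case"""')
    if match:
        print(match.group(1))  # -> TODO: handle the edge case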

@badsketch (Contributor Author)

badsketch commented Jul 8, 2024

Apologies for the glacial pace of this PR 😬, really appreciate all the feedback! Some updates:

  • previously we were taking the token line and using string manipulation to get the fixme message, then rebuilding it so the regex could find it. I've since changed the logic to use just regex with capture groups instead. I've made this change for both docstrings and comments
  • to do this, I also had to split the docstring regex pattern into single-line and multi-line variants. I tried it with just a single regex, but it got really ugly, and I think the logic is much cleaner/more intuitive with two. There are three regex patterns in total now.
  • correct me if I'm wrong, but we were previously using re.search(). It seems that if we're only matching at the beginning of a string, we can use re.match(), which performs better (see the small demonstration below)

Let me know your thoughts!
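
A small demonstration of the re.match() point above (my own example, not code from the PR):

    import re

    pattern = re.compile(r"#\s*(?:TODO|FIXME|XXX)", re.I)
    # match() only tries position 0, so it never scans the rest of the line;
    # search() keeps scanning until it finds the pattern or hits the end.
    print(bool(pattern.match("# TODO: fix this")))      # True
    print(bool(pattern.match("x = 1  # TODO later")))   # False: not at line start
    print(bool(pattern.search("x = 1  # TODO later")))  # True: found anywhere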

@badsketch (Contributor Author)

Those primer tests are awesome! They thankfully caught something I missed when refactoring the comment regex pattern. I was no longer supporting cases like:

#     # TODO: something something

which is kind of curious, as I wonder whether that should've been supported in the first place. Regardless, I've updated it so it's supported and added that as an additional functional test.

@badsketch (Contributor Author)

In my last few messages, I mentioned updating how the message was extracted in comment-based fixme's so we wouldn't have to use string manipulation (198fd93, 1ce441c). I've decided to revert that change because it was causing a lot of primer test failures. I think this is due to comment-based fixme's being a little inconsistent in their current state:

#   TODO: msg1                                        
#   # TODO: msg2
# something # TODO: msg3
# something TODO: msg4

results in:

test.py:1:1: W0511: TODO: msg1 (fixme)
test.py:2:1: W0511: # TODO: msg2 (fixme)
test.py:3:1: W0511: something # TODO: msg3 (fixme)

I wasn't able to replicate what we're doing today using regex capture groups, so in favor of not causing any disruptions, I decided to go back to the current method.

If we're open to standardizing some of this behavior in the future and allowing changes to the primer tests, I'm also down to discuss!

@DanielNoord (Collaborator)

@badsketch With the danger of introducing more back and forth: would you be willing to create a proposal for this standardization and apply it to this PR? We can then see the test output and determine whether we are okay with it. That makes such standardization a lot easier to discuss (by having some examples).

@badsketch (Contributor Author)

@DanielNoord Sure thing! Is there a formal way of making these proposals for pylint features, and do I submit it somewhere? Or do you mean I should try to consider all the usage out there and come back with a more detailed comment?

@DanielNoord (Collaborator)

No, I meant: just write the code as you would want and make the tests pass. If the test output is acceptable I would be okay with accepting your proposed code.

I don't think we need a full proposal, just the code as you would propose to write it and then being able to see what changes that would give to the test output.

Contributor

🤖 Effect of this PR on checked open source code: 🤖

Effect on astroid:
The following messages are no longer emitted:

  1. fixme:
    # TODO: This should return an Uninferable as this would raise
    https://github.com/pylint-dev/astroid/blob/6db3a60553ff538a936d5dda23d67a3924a57f45/astroid/brain/brain_dataclasses.py#L186

Effect on black:
The following messages are no longer emitted:

  1. fixme:
    #assert ilabel not in first # XXX failed on <> ... !=
    https://github.com/psf/black/blob/721dff549362f54930ecc038218dcc40e599a875/src/blib2to3/pgen2/pgen.py#L80

Effect on music21:
The following messages are no longer emitted:

  1. fixme:
    # TODO: this file does not import correctly due to first/second
    https://github.com/cuthbertLab/music21/blob/204e9d0b9eec2f2d6ff8d8d3b13c41f912050604/music21/test/test_repeat.py#L475
  2. fixme:
    # TODO: use the linter, reference DOESN'T have to be passed in
    https://github.com/cuthbertLab/music21/blob/204e9d0b9eec2f2d6ff8d8d3b13c41f912050604/music21/alpha/analysis/hasher.py#L485
  3. fixme:
    # TODO: attach \noBeam to note if it is the last note
    https://github.com/cuthbertLab/music21/blob/204e9d0b9eec2f2d6ff8d8d3b13c41f912050604/music21/lily/translate.py#L1174
  4. fixme:
    # TODO: Something with 4.2 Repetitions; not in hum2xml
    https://github.com/cuthbertLab/music21/blob/204e9d0b9eec2f2d6ff8d8d3b13c41f912050604/music21/humdrum/spineParser.py#L2602
  5. fixme:
    # TODO: Find out what timeBase means; not in hum2xml
    https://github.com/cuthbertLab/music21/blob/204e9d0b9eec2f2d6ff8d8d3b13c41f912050604/music21/humdrum/spineParser.py#L2606
  6. fixme:
    # TODO: make staff numbers relevant; not in hum2xml
    https://github.com/cuthbertLab/music21/blob/204e9d0b9eec2f2d6ff8d8d3b13c41f912050604/music21/humdrum/spineParser.py#L2610
  7. fixme:
    # TODO: Turn back on when a smaller work is found...
    https://github.com/cuthbertLab/music21/blob/204e9d0b9eec2f2d6ff8d8d3b13c41f912050604/music21/musedata/translate.py#L584
  8. fixme:
    # TODO: column 17 self.src[16] defines the graphic note type
    https://github.com/cuthbertLab/music21/blob/204e9d0b9eec2f2d6ff8d8d3b13c41f912050604/music21/musedata/__init__.py#L339
  9. fixme:
    # TODO: write
    https://github.com/cuthbertLab/music21/blob/204e9d0b9eec2f2d6ff8d8d3b13c41f912050604/music21/stream/base.py#L13034
  10. fixme:
    # TODO: write
    https://github.com/cuthbertLab/music21/blob/204e9d0b9eec2f2d6ff8d8d3b13c41f912050604/music21/stream/base.py#L13038

Effect on pytest:
The following messages are no longer emitted:

  1. fixme:
    path.strpath # XXX svn?
    https://github.com/pytest-dev/pytest/blob/16cdacc5a373984bc22c300d76e633c3a44bcd35/src/_pytest/_py/path.py#L193
  2. fixme:
    # XXX
    https://github.com/pytest-dev/pytest/blob/16cdacc5a373984bc22c300d76e633c3a44bcd35/src/_pytest/_code/code.py#L913

Effect on pandas:
The following messages are no longer emitted:

  1. fixme:
    e.g. Sparse[bool, False] # TODO: no test cases get here
    https://github.com/pandas-dev/pandas/blob/61e209e4e9b628e997a648e12e24ac47fa3e1e26/pandas/core/algorithms.py#L158

Effect on sentry:
The following messages are no longer emitted:

  1. fixme:
    type: ignore[assignment] # XXX: clobbers Serializer.fields
    https://github.com/getsentry/sentry/blob/2a3aacf5227abdc6026b2a0d4c19649b7fd665ca/src/sentry/discover/endpoints/serializers.py#L33
  2. fixme:
    type: ignore[assignment] # XXX: clobbers Serializer.fields
    https://github.com/getsentry/sentry/blob/2a3aacf5227abdc6026b2a0d4c19649b7fd665ca/src/sentry/discover/endpoints/serializers.py#L156
  3. fixme:
    type: ignore[assignment] # XXX: clobbering Serializer.fields
    https://github.com/getsentry/sentry/blob/2a3aacf5227abdc6026b2a0d4c19649b7fd665ca/src/sentry/api/serializers/rest_framework/dashboard.py#L143

This comment was generated for commit eea95d2

@badsketch (Contributor Author)

@DanielNoord
Gotcha, I've updated the functional tests for both comment and docstring fixme's to serve as my final proposal. The latest primer test results also align with what I've proposed. They boil down to:

  • comment fixme's must start with a #, then any number of spaces (and only spaces), then the fixme keyword, followed by a message
# TODO valid
#              TODO valid
# invalid TODO msg
  • single-line docstring fixme's must start with three single/double quotes, then any number of spaces (and only spaces), then the fixme keyword, followed by a message
'''TODO valid'''
""" TODO valid """
''' invalid TODO msg '''
  • multi-line docstring fixme's sit within a docstring block and must be at the beginning of the line; any amount of indentation before them is okay
'''
TODO valid
invalid TODO msg
'''
def foo():
      """
      TODO valid
      """

The majority of the changed primer tests are comment fixme's that no longer emit. They appear to be fixme's that are part of a larger section of code that was commented out, suggesting the fixme is no longer valid. I think it makes sense for those to no longer be emitted.
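
For concreteness, the multi-line rule above could be expressed roughly like this (my sketch, not necessarily the PR's exact pattern):

    import re

    # Inside a docstring block, a note only counts when nothing but
    # indentation precedes it on its own line.
    multiline_docstring_rgx = re.compile(r"^\s*((?:FIXME|XXX|TODO)\b.*)", re.I)

    for line in ["      TODO valid", "invalid TODO msg"]:
        m = multiline_docstring_rgx.match(line)
        print(m.group(1) if m else "no match")  # -> "TODO valid", then "no match"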

@DanielNoord (Collaborator) left a comment

LGTM!

@Pierre-Sassoulas Do you want to give this a review as well?

@Pierre-Sassoulas (Member) left a comment

Thank you for working on this @badsketch !

I'm wondering why the regex contains rf"((\"\"\")|(\'\'\'))" when we also do token_info.line.lstrip().startswith(('"""', "'''")) somewhere else? My intuition would be that we could just search for the user-defined "TODO" regex in comments and docstrings? Is this because we want the option not to raise for docstrings (for semver/compat with the old behavior), so we need to distinguish the two internally? I haven't benchmarked this, but I think this kind of regex executed on each token can have a huge performance impact.

@@ -0,0 +1,3 @@
Add ability for `fixme` check to also search through docstrings.
Member

Suggested change
Add ability for `fixme` check to also search through docstrings.
The `fixme` check can now search through docstrings as well as comments, by using ``check-fixme-in-docstring = true`` in the ``[tool.pylint.miscellaneous]`` section.

@badsketch (Contributor Author)

@Pierre-Sassoulas
That's a fair point. I think I can optimize it a little by making the ((\"\"\")|(\'\'\')) groups non-capturing. Would that suffice?

My intuition would be that we could just search for the user-defined "TODO" regex in comments and docstrings? Is this because we want the option not to raise for docstrings (for semver/compat with the old behavior), so we need to distinguish the two internally?

Hm, I'm a little unclear on what you mean. We're using 3 regexes (comment, single-line docstring, multi-line docstring) for clarity, since it's hard to capture all of it in one regex, and also so that we don't need to execute the more expensive docstring regex if the option is switched off. Not sure if I answered your question :/

To your point above, we could first check whether a TODO is present at all, so we don't bother running the regex if the keyword isn't there:

    def _is_multiline_docstring(self, token_info: tokenize.TokenInfo) -> bool:
        return (
            token_info.type == tokenize.STRING
            and (token_info.line.lstrip().startswith(('"""', "'''")))
+           and 'TODO' in token_info.line
            and "\n" in token_info.line.rstrip()
        )

What might complicate this is when notes-rgx is enabled, because then we'd be executing a regex anyway:

    def _is_multiline_docstring(self, token_info: tokenize.TokenInfo) -> bool:
        return (
            token_info.type == tokenize.STRING
            and (token_info.line.lstrip().startswith(('"""', "'''")))
+           and re.search(rf"TODO|XXX|FIXME|{notes_rgx}", token_info.line)
            and "\n" in token_info.line.rstrip()
        )

I haven't benchmarked this, but I think this kind of regex executed on each token can have a huge performance impact.

I'd be willing to run a profile to make a more convincing argument. Would following this guide https://pylint.readthedocs.io/en/stable/development_guide/contributor_guide/profiling.html be what you had in mind?
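
If a quick sanity check is wanted before full profiling, a micro-benchmark like this could give a rough per-search cost (my sketch, not code from the PR):

    import re
    import timeit

    pattern = re.compile(r"(?:\"\"\"|''')\s*(?:FIXME|XXX|TODO)(?=[:\s]|\Z)", re.I)
    line = '"""TODO: benchmark me"""'

    # Average the cost of running the docstring pattern over one token line.
    n = 100_000
    per_call = timeit.timeit(lambda: pattern.search(line), number=n) / n
    print(f"{per_call * 1e6:.2f} microseconds per search")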

Labels: Enhancement ✨ Improvement to a component

Successfully merging this pull request may close these issues: W0511: Doesn't detect TODO in docstrings