Support fixme's in docstrings #9744

Open · wants to merge 13 commits into base: main

Conversation

badsketch (Contributor)

Type of Changes

Type
πŸ› Bug fix
βœ“ ✨ New feature
πŸ”¨ Refactoring
πŸ“œ Docs

Description

Closes #9255

Previous PR discussion here: #9281

  • now an enhancement of the existing fixme check rather than a new message
  • check-fixme-in-docstring is the setting that enables it; it defaults to false
  • also addresses the suggestions to improve the existing description: issue9255 - Detect FIXME words in docstring #9281 (comment). I kind of like how the message is the TODO itself. As for the "Used when a warning note..." text, is that for a tooltip? I believe almost all checker messages have something like this, right?

Appreciate any feedback!

elif self.linter.config.check_fixme_in_docstring and self._is_docstring_comment(token_info):
    docstring_lines = token_info.string.split("\n")
    for line_no, line in enumerate(docstring_lines):
        comment_text = line.removeprefix('"""').lstrip().removesuffix('"""')  # trim '"""' and whitespace
Contributor Author

removeprefix() and removesuffix() are new in python 3.9. Dumb question, but how do I tell what version this project supports?

Contributor Author

After seeing 3.8 test suites fail, I'm guessing I need to make this compatible 😅
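
For reference, a minimal sketch of a 3.8-compatible fallback (the helper name is hypothetical, not from this PR):

    # Hypothetical helper, not from this PR: str.removeprefix() and
    # str.removesuffix() only exist on Python 3.9+, so on 3.8 the same
    # trimming can be done with startswith()/endswith() and slicing.
    def strip_triple_quotes(line: str) -> str:
        if line.startswith('"""'):
            line = line[3:]
        line = line.lstrip()
        if line.endswith('"""'):
            line = line[:-3]
        return line

    print(strip_triple_quotes('"""TODO: fix me"""'))  # -> TODO: fix me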

Collaborator

Do we really need this? Can't we just put the full docstring in the message for now?

Contributor Author

For a case like

"""
TODO msg1
TODO msg2
"""

this PR will create two TODO lint messages. If we put the full docstring for both, it might be confusing or overly wordy, no?

@badsketch (Contributor Author)

When running pylint -h, I get

Miscellaneous:
  BaseChecker for encoding issues.

  --notes <comma separated values>
                        List of note tags to take in consideration, separated by a comma. (default:
                        ('FIXME', 'XXX', 'TODO'))
  --notes-rgx <regexp>  Regular expression of note tags to take in consideration. (default: )
  --check-fixme-in-docstring <y or n>
                        Whether or not to search for fixme's in docstrings. (default: False)

Thoughts on updating the docstring to "Checker for encoding issues and fixme notes" instead of "BaseChecker for encoding issues."?

@DanielNoord (Collaborator)

Thoughts on updating the docstring to "Checker for encoding issues and fixme notes" instead of "BaseChecker for encoding issues."?

Fine with me!

def _is_docstring_comment(self, token_info: tokenize.TokenInfo) -> bool:
    return (
        token_info.type == tokenize.STRING
        and token_info.line.lstrip().startswith('"""')
    )
Collaborator

Note that a docstring can also start with '''. I'm wondering if this should live in this tokeniser checker as I think it is actually quite hard to recognise docstrings on tokens alone.

Have you considered doing it as a checker for nodes.Module, nodes.ClassDef, etc? Then you can just check if the regex is in the .doc attribute.

Contributor Author

Note that a docstring can also start with '''

totally forgot about this, thanks!

Have you considered doing it as a checker for nodes.Module, nodes.ClassDef, etc? Then you can just check if the regex is in the .doc attribute.

I was considering the possibility that docstrings could appear outside of nodes.Module and nodes.ClassDef, so I figured it's best to use the existing token stream to watch for all occurrences. However, you could argue it's not good Python practice (?) in the first place to have docstrings outside modules/classes/methods. Agreed the PR doesn't use the safest heuristic to determine if it's a docstring.

So it looks like we could do

  1. update _is_docstring_comment() to also check startswith("'''")
  2. refactor to use nodes.Module, nodes.ClassDef, nodes.FunctionDef and tighten the scope of docstring fixmes (see the sketch below)
  3. update tokenizer with a new docstring token similar to how we have a token.COMMENT type for a comment fixme. (Haven't looked too deep into this, will probably be higher effort)

Totally down to change it to 2, but would users claim false negatives when they try to create docstring fixme's outside of module/classes/function defs? Perhaps it would help if there were a lint message that recommends against docstrings outside of those places. Do we have something like that already?
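
For illustration, a rough sketch of what option 2 could look like as a node visitor. This is my assumption of the shape, not code from this PR, and it assumes modern astroid exposes the docstring as node.doc_node (the .doc string attribute mentioned above is the older spelling):

    # A sketch of option 2: visit the node types that can carry a docstring
    # and run the notes regex over each of its lines.
    import re

    from astroid import nodes

    class DocstringFixmeSketch:
        def __init__(self, pattern: re.Pattern) -> None:
            self._fixme_pattern = pattern

        def _check_docstring(self, node: nodes.NodeNG) -> None:
            doc_node = getattr(node, "doc_node", None)
            if doc_node is None:
                return
            for offset, line in enumerate(doc_node.value.splitlines()):
                if self._fixme_pattern.search(line):
                    # A real checker would call self.add_message() here.
                    print(f"fixme at line {doc_node.lineno + offset}: {line.strip()}")

        visit_module = _check_docstring
        visit_classdef = _check_docstring
        visit_functiondef = _check_docstring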

Collaborator

@Pierre-Sassoulas Opinion? I think trying option 1 for now might be fine, we can always refactor to 2 later on. I just thought I would raise the question to see if it was consciously ignored.

Member

I think the decision should be taken consciously. I would have thought that the node visitor implementation would be cleaner/terser, but maybe using the tokenizer is faster? I expected fewer changes to be needed for the docstring fixme check, since we already have something working for comments. Did we use the tokenizer for comments? I did not look very deep into this.

Contributor Author

Yeah, we already use the tokenizer for comments. If I understand correctly, the tokenizer is for cases where there's no defined node for the check (e.g., a comment can appear "anywhere" in the code, so we check all tokens for that appearance). I tried to follow that logic with docstrings, hence I piggyback on the tokenizer to examine occurrences of """ and '''. If we decide we only wish to support docstrings in classes, functions, and methods, then I could use a node visitor on those 3 node types.

It may be easiest to go with option 1 at this point since it'd be straightforward to make the change in the PR. And as Daniel mentioned, we could refactor to 2 in the future. Thoughts?

Member

Docstrings are not supposed to be everywhere in a module (they'd be useless statements otherwise), so a node approach would work. But if comments require the tokenizer, then for consistency of approach let's go with 1).

codecov bot commented Jun 30, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 95.80%. Comparing base (a5a77f6) to head (eea95d2).
Report is 44 commits behind head on main.

@@            Coverage Diff             @@
##             main    #9744      +/-   ##
==========================================
- Coverage   95.84%   95.80%   -0.05%     
==========================================
  Files         174      174              
  Lines       18878    18895      +17     
==========================================
+ Hits        18094    18102       +8     
- Misses        784      793       +9     
Files                      Coverage Δ
pylint/checkers/misc.py    90.41% <100.00%> (+1.88%) ⬆️

... and 20 files with indirect coverage changes

@DanielNoord (Collaborator) left a comment

Awesome you got this to work!

Comment on lines 121 to 122
@set_config(check_fixme_in_docstring=True)
def test_docstring_with_message(self) -> None:
Collaborator

@Pierre-Sassoulas Do you want these unit tests? I would be fine with having only the functional tests and removing these. They feel like duplicates.

Member

I agree with you; I use functional tests only almost all the time (terser/clearer once you know about functional tests).

Contributor Author

Gotcha, that makes sense! Reverted changes to this file.

self._comment_fixme_pattern = re.compile(comment_regex, re.I)
if self.linter.config.check_fixme_in_docstring:
    docstring_regex = rf"((\"\"\")|(\'\'\'))\s*({notes})(?=(:|\s|\Z))"
    self._docstring_fixme_pattern = re.compile(docstring_regex, re.I)
Collaborator

I think we need to always set these two pattern attributes so that we don't get an AttributeError in unexpected ways. In the old version we also did this.

Contributor Author

Gotcha, I've removed the conditionals here and just let process_tokens() handle the logic of what to parse when the check-fixme-in-docstring setting is enabled.

Comment on lines 171 to 178
if line.startswith(('"""', "'''")):
    line = line[3:]
line = line.lstrip()
if line.endswith(('"""', "'''")):
    line = line[:-3]
if self._docstring_fixme_pattern.search(
    '"""' + line.lower()
) or self._docstring_fixme_pattern.search("'''" + line.lower()):
Collaborator

Now that you have fixed the pattern, is this really necessary? I would prefer a pattern where we don't have to make a lot of complicated changes to the lines themselves.

Contributor Author

True, it was getting pretty messy. I've decided to rework the logic to use regex capture groups for both docstrings and comments (since for comments we were also sort of changing the line and then stitching it back together to examine with regex).
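
As a hypothetical illustration of the capture-group idea (the pattern and names below are mine, not necessarily the PR's):

    import re

    notes = "FIXME|XXX|TODO"
    # The group captures the note keyword plus its message directly, so no
    # manual quote-trimming or line-stitching is needed beforehand.
    single_line_docstring_rgx = re.compile(
        rf"^\s*(?:\"\"\"|''')\s*((?:{notes})\b.*?)\s*(?:\"\"\"|''')\s*$", re.I
    )

    match = single_line_docstring_rgx.match('"""TODO: handle the edge case"""')
    if match:
        print(match.group(1))  # -> TODO: handle the edge case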

@badsketch (Contributor Author)

badsketch commented Jul 8, 2024

Apologies for the glacial pace of this PR 😬, really appreciate all the feedback! Some updates:

  • previously we were taking the token line and using string manipulation to get the fixme message, then rebuilding it so the regex could find it. I've since changed the logic to use just regex with capture groups instead. I've made this change for both docstrings and comments
  • to do this, I also had to split the docstring regex pattern into single-line and multi-line variants. I tried it with just a single regex, but it got really ugly, and I think the logic is much cleaner/more intuitive with two. There are three regex patterns in total now.
  • correct me if I'm wrong, but we were previously using re.search(). It seems that if we're only matching at the beginning of a string, we can use re.match(), which performs better (see the small demonstration below)

Let me know your thoughts!
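
A small demonstration of the re.match() point above (my own example, not code from the PR):

    import re

    pattern = re.compile(r"#\s*(?:TODO|FIXME|XXX)", re.I)
    # match() only tries position 0, so it never scans the rest of the line;
    # search() keeps scanning until it finds the pattern or hits the end.
    print(bool(pattern.match("# TODO: fix this")))      # True
    print(bool(pattern.match("x = 1  # TODO later")))   # False: not at line start
    print(bool(pattern.search("x = 1  # TODO later")))  # True: found anywhere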

@badsketch (Contributor Author)

Those primer tests are awesome! They thankfully caught something I missed when refactoring the comment regex pattern. I was no longer supporting cases like:

#     # TODO: something something

which is kind of curious, as I wonder whether that should've been supported in the first place. Regardless, I've updated it so it's supported and added that as an additional functional test.

@badsketch (Contributor Author)

In my last few messages, I mentioned updating how the message was extracted in comment-based fixme's so we wouldn't have to use string manipulation (198fd93, 1ce441c). I've decided to revert that change because it was causing a lot of primer test failures. I think this is due to comment-based fixme's being a little inconsistent in their current state:

#   TODO: msg1                                        
#   # TODO: msg2
# something # TODO: msg3
# something TODO: msg4

results in:

test.py:1:1: W0511: TODO: msg1 (fixme)
test.py:2:1: W0511: # TODO: msg2 (fixme)
test.py:3:1: W0511: something # TODO: msg3 (fixme)

I wasn't able to replicate what we're doing today using regex capture groups, so in favor of not causing any disruptions, I decided to go back to the current method.

If we're open to standardizing some of this behavior in the future and allowing changes to the primer tests, I'm also down to discuss!

@DanielNoord (Collaborator)

@badsketch With the danger of introducing more back and forth: would you be willing to create a proposal for this standardization and apply it to this PR? We can then see the test output and determine whether we are okay with it. That makes such standardization a lot easier to discuss (by having some examples).

@badsketch (Contributor Author)

@DanielNoord Sure thing! Is there a formal way of making these proposals for pylint features, and do I submit it somewhere? Or do you mean I should try to consider all the usage out there and come back with a more detailed comment?

@DanielNoord (Collaborator)

No, I meant: just write the code as you would want and make the tests pass. If the test output is acceptable I would be okay with accepting your proposed code.

I don't think we need a full proposal, just the code as you would propose to write it and then being able to see what changes that would give to the test output.

Contributor

🤖 Effect of this PR on checked open source code: 🤖

Effect on astroid:
The following messages are no longer emitted:

  1. fixme:
    # TODO: This should return an Uninferable as this would raise
    https://github.com/pylint-dev/astroid/blob/6db3a60553ff538a936d5dda23d67a3924a57f45/astroid/brain/brain_dataclasses.py#L186

Effect on black:
The following messages are no longer emitted:

  1. fixme:
    #assert ilabel not in first # XXX failed on <> ... !=
    https://github.com/psf/black/blob/721dff549362f54930ecc038218dcc40e599a875/src/blib2to3/pgen2/pgen.py#L80

Effect on music21:
The following messages are no longer emitted:

  1. fixme:
    # TODO: this file does not import correctly due to first/second
    https://github.com/cuthbertLab/music21/blob/204e9d0b9eec2f2d6ff8d8d3b13c41f912050604/music21/test/test_repeat.py#L475
  2. fixme:
    # TODO: use the linter, reference DOESN'T have to be passed in
    https://github.com/cuthbertLab/music21/blob/204e9d0b9eec2f2d6ff8d8d3b13c41f912050604/music21/alpha/analysis/hasher.py#L485
  3. fixme:
    # TODO: attach \noBeam to note if it is the last note
    https://github.com/cuthbertLab/music21/blob/204e9d0b9eec2f2d6ff8d8d3b13c41f912050604/music21/lily/translate.py#L1174
  4. fixme:
    # TODO: Something with 4.2 Repetitions; not in hum2xml
    https://github.com/cuthbertLab/music21/blob/204e9d0b9eec2f2d6ff8d8d3b13c41f912050604/music21/humdrum/spineParser.py#L2602
  5. fixme:
    # TODO: Find out what timeBase means; not in hum2xml
    https://github.com/cuthbertLab/music21/blob/204e9d0b9eec2f2d6ff8d8d3b13c41f912050604/music21/humdrum/spineParser.py#L2606
  6. fixme:
    # TODO: make staff numbers relevant; not in hum2xml
    https://github.com/cuthbertLab/music21/blob/204e9d0b9eec2f2d6ff8d8d3b13c41f912050604/music21/humdrum/spineParser.py#L2610
  7. fixme:
    # TODO: Turn back on when a smaller work is found...
    https://github.com/cuthbertLab/music21/blob/204e9d0b9eec2f2d6ff8d8d3b13c41f912050604/music21/musedata/translate.py#L584
  8. fixme:
    # TODO: column 17 self.src[16] defines the graphic note type
    https://github.com/cuthbertLab/music21/blob/204e9d0b9eec2f2d6ff8d8d3b13c41f912050604/music21/musedata/__init__.py#L339
  9. fixme:
    # TODO: write
    https://github.com/cuthbertLab/music21/blob/204e9d0b9eec2f2d6ff8d8d3b13c41f912050604/music21/stream/base.py#L13034
  10. fixme:
    # TODO: write
    https://github.com/cuthbertLab/music21/blob/204e9d0b9eec2f2d6ff8d8d3b13c41f912050604/music21/stream/base.py#L13038

Effect on pytest:
The following messages are no longer emitted:

  1. fixme:
    path.strpath # XXX svn?
    https://github.com/pytest-dev/pytest/blob/16cdacc5a373984bc22c300d76e633c3a44bcd35/src/_pytest/_py/path.py#L193
  2. fixme:
    # XXX
    https://github.com/pytest-dev/pytest/blob/16cdacc5a373984bc22c300d76e633c3a44bcd35/src/_pytest/_code/code.py#L913

Effect on pandas:
The following messages are no longer emitted:

  1. fixme:
    e.g. Sparse[bool, False] # TODO: no test cases get here
    https://github.com/pandas-dev/pandas/blob/61e209e4e9b628e997a648e12e24ac47fa3e1e26/pandas/core/algorithms.py#L158

Effect on sentry:
The following messages are no longer emitted:

  1. fixme:
    type: ignore[assignment] # XXX: clobbers Serializer.fields
    https://github.com/getsentry/sentry/blob/2a3aacf5227abdc6026b2a0d4c19649b7fd665ca/src/sentry/discover/endpoints/serializers.py#L33
  2. fixme:
    type: ignore[assignment] # XXX: clobbers Serializer.fields
    https://github.com/getsentry/sentry/blob/2a3aacf5227abdc6026b2a0d4c19649b7fd665ca/src/sentry/discover/endpoints/serializers.py#L156
  3. fixme:
    type: ignore[assignment] # XXX: clobbering Serializer.fields
    https://github.com/getsentry/sentry/blob/2a3aacf5227abdc6026b2a0d4c19649b7fd665ca/src/sentry/api/serializers/rest_framework/dashboard.py#L143

This comment was generated for commit eea95d2

@badsketch (Contributor Author)

@DanielNoord
Gotcha, I've updated the functional tests for both comment and docstring fixme's to serve as my final proposal. The latest primer test results also align with what I've proposed. They boil down to:

  • comment fixme's must start with a #, then any number of spaces (and only spaces), then the fixme keyword, followed by a message
# TODO valid
#              TODO valid
# invalid TODO msg
  • single-line docstring fixme's must start with three single/double quotes, then any number of spaces (and only spaces), then the fixme keyword, followed by a message
'''TODO valid'''
""" TODO valid """
''' invalid TODO msg '''
  • multi-line docstring fixme's sit within a docstring block and must be at the beginning of the line; any amount of indentation before them is okay
'''
TODO valid
invalid TODO msg
'''
def foo():
      """
      TODO valid
      """

The majority of the changed primer tests are comment fixme's that no longer emit. They appear to be fixme's that are part of a larger section of code that was commented out, suggesting the fixme is no longer valid. I think it makes sense for those to no longer be emitted.
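
For concreteness, the multi-line rule above could be expressed roughly like this (my sketch, not necessarily the PR's exact pattern):

    import re

    # Inside a docstring block, a note only counts when nothing but
    # indentation precedes it on its own line.
    multiline_docstring_rgx = re.compile(r"^\s*((?:FIXME|XXX|TODO)\b.*)", re.I)

    for line in ["      TODO valid", "invalid TODO msg"]:
        m = multiline_docstring_rgx.match(line)
        print(m.group(1) if m else "no match")  # -> "TODO valid", then "no match"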

@DanielNoord (Collaborator) left a comment

LGTM!

@Pierre-Sassoulas Do you want to give this a review as well?

@Pierre-Sassoulas (Member) left a comment

Thank you for working on this @badsketch !

I'm wondering why the regex contains rf"((\"\"\")|(\'\'\'))" when we also do token_info.line.lstrip().startswith(('"""', "'''")) somewhere else? My intuition would be that we could just search for the user-defined "TODO" regex in comments and docstrings? Is this because we want the option not to raise for docstrings (for semver/compat with the old behavior), so we need to distinguish the two internally? I haven't benchmarked this, but I think this kind of regex executed on each token can have a huge performance impact.

@@ -0,0 +1,3 @@
Add ability for `fixme` check to also search through docstrings.
Member

Suggested change
Add ability for `fixme` check to also search through docstrings.
The `fixme` check can now search through docstrings as well as comments, by using ``check-fixme-in-docstring = true`` in the ``[tool.pylint.miscellaneous]`` section.

@badsketch (Contributor Author)

@Pierre-Sassoulas
That's a fair point. I think I can optimize it a little by making the ((\"\"\")|(\'\'\')) groups non-capturing. Would that suffice?

My intuition would be that we could just search for the user-defined "TODO" regex in comments and docstrings? Is this because we want the option not to raise for docstrings (for semver/compat with the old behavior), so we need to distinguish the two internally?

Hm, I'm a little unclear on what you mean. We're using 3 regexes (comment, single-line docstring, multi-line docstring) for clarity, since it's hard to capture all of it in one regex, and also so that we don't need to execute the more expensive docstring regex if the option is switched off. Not sure if I answered your question :/

To your point above, we could first check whether a TODO is present at all, so we don't bother running the regex if the keyword isn't there:

    def _is_multiline_docstring(self, token_info: tokenize.TokenInfo) -> bool:
        return (
            token_info.type == tokenize.STRING
            and (token_info.line.lstrip().startswith(('"""', "'''")))
+           and 'TODO' in token_info.line
            and "\n" in token_info.line.rstrip()
        )

What might complicate this is when notes-rgx is enabled, because then we'd be executing a regex anyway:

    def _is_multiline_docstring(self, token_info: tokenize.TokenInfo) -> bool:
        return (
            token_info.type == tokenize.STRING
            and (token_info.line.lstrip().startswith(('"""', "'''")))
+           and re.search(rf"TODO|XXX|FIXME|{notes_rgx}", token_info.line)
            and "\n" in token_info.line.rstrip()
        )

I haven't benchmarked this, but I think this kind of regex executed on each token can have a huge performance impact.

I'd be willing to run a profile to make a more convincing argument. Would following this guide https://pylint.readthedocs.io/en/stable/development_guide/contributor_guide/profiling.html be what you had in mind?
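
If a quick sanity check is wanted before full profiling, a micro-benchmark like this could give a rough per-search cost (my sketch, not code from the PR):

    import re
    import timeit

    pattern = re.compile(r"(?:\"\"\"|''')\s*(?:FIXME|XXX|TODO)(?=[:\s]|\Z)", re.I)
    line = '"""TODO: benchmark me"""'

    # Average the cost of running the docstring pattern over one token line.
    n = 100_000
    per_call = timeit.timeit(lambda: pattern.search(line), number=n) / n
    print(f"{per_call * 1e6:.2f} microseconds per search")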

Labels: Enhancement ✨ Improvement to a component

Successfully merging this pull request may close these issues: W0511: Doesn't detect TODO in docstrings