New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[MNT] create build tool to check invalid backticks #6088
Conversation
python build_tools/check_backtick.py Here is an output snapshot
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
Later, we could run this from GHA, or via pytest
.
pytest
is possibly easier, i.e., just failing a test if the PR adds invalid use of backticks in docstring.
FYI @yarnabrina, I would appreciate your eys on this, too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Posted few questions/doubts/comments.
build_tools/check_backticks.py
Outdated
def find_py_files(folder_path): | ||
"""Find all Python files in a given folder path.""" | ||
py_files = [] | ||
for root, _, files in os.walk(folder_path): | ||
for file in files: | ||
if file.endswith(".py"): | ||
py_files.append(os.path.join(root, file)) | ||
return py_files |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is probably equivalent to the following:
import glob
glob.glob(f"{folder_path}/**/*.py", recursive=True)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, that was a really great idea
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice! if you have time, you might like to check
https://github.com/sktime/skbase/blob/main/skbase/lookup/_lookup.py
the object retrieval utility used in all_estimators
and testing, perhaps there are some things to simplify there too!
No high prio ofc.
build_tools/check_backticks.py
Outdated
|
||
docstrings = {} | ||
for node in ast.walk(tree): | ||
if isinstance(node, (ast.FunctionDef, ast.ClassDef, ast.Module)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've never used ast
module myself, so apologies if it's an obvious question. What does ast.Module
represent? Is it for the docstring for the file itself?
Also, does it cover enums too? I don't know what all possibilities are there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A node of type ast.Module
represents the file itself and is a root node of the tree.
Enums
are basically python classes that inherit enum.Enum
class, so that is covered as well
the updated code deals this differently
build_tools/check_backticks.py
Outdated
and isinstance(node.body[0], ast.Expr) | ||
and isinstance(node.body[0].value, ast.Str) | ||
): | ||
lineno = 0 if isinstance(node, ast.Module) else node.lineno |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does this line do? Is it checking for first line in the body? What happens if there are empty lines before function docstrings (check 2024 release of black) or other comments for module docstrings (copyright notices etc.)?
Again apologies if ast
is smart enough to distinguish codes from comments or empty lines, I don't know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ast
is actually smart enough
the updated code deals this differently
build_tools/check_backticks.py
Outdated
def find_invalid_backtick_text(docstring): | ||
"""Find invalid backtick text in a docstring.""" | ||
# remove all the double-backticks | ||
doc = docstring.replace("``", "") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are existing double backticks being removed?
Also, if there are triple backticks (```
) or more in code, what happens to those?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
double-backticks are fine, problem occurs only with single-backticks. I am removing them so they don't interfere with the regex expressions. I am trying to keep the regex expressions as simple as possible and doing it in steps because they can become very complicated and hard to debug
the updated code deals this differently and handles ```
build_tools/check_backticks.py
Outdated
doc = docstring.replace("``", "") | ||
|
||
all_backtick_text = re.findall(r"`.*?`", doc) | ||
valid_backtick_text = re.findall(r":.*?:(`.*?`)", doc) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't follow this at all. Why text within single backticks after text of any length between a pair of colon (:
) is considered valid?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
:math:`d(x, y):= (x-y)^2`
expression like these are valid cases because they have special representation in documentation generated by rst
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks extremely specific.
Your code however matches for any :
, not just ones after :math:
. Will it make sense to check for that (and any other specific cases you want to allow), perhaps with positive look behind?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Its not specific, there are multiple cases like
:attr:`means_`
:data:`sys.stderr`
:term:`Glossary <random_state>`
:ref:`User Guide <cross_validation>`
:math:`i_k \in \{1, \dots, m\}, j_k \in \{1, \dots, n\}`
and there can exist many more
following changes have been made 1. use glob to find python files 2. improve ast node selection method 3. fix the line number on docstrings 4. handle triple-backticks 5. add useful comments where needed
build_tools/check_backticks.py
Outdated
if text in invalid_backtick_text: | ||
continue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is just for unique texts right? Let's skip this and make invalid_backtick_text
a set and use add
below.
build_tools/check_backticks.py
Outdated
|
||
def main(): | ||
"""Execute the main function of the script.""" | ||
import sktime |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This assumes sktime
is already installed. I don't think we can assume that for pre-commit
like cases.
build_tools/check_backticks.py
Outdated
results = {} | ||
|
||
# list all the python files in the project | ||
py_files = find_py_files(sktime.__path__[0]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above. What about accepting a path as CLI argument, and defaulting to ./sktime
by default?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this actually improves the time from 2~3
seconds to 1~2
seconds as we are saving time by not importing sktime
build_tools/check_backticks.py
Outdated
|
||
# print the lines along with the invalid backticks text | ||
if len(results) > 0: | ||
print(f"Total Files with invalid backticks: {len(results)}") # noqa |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is this noqa
for? Is flake8
or some other linter complaining about presence of print
? In that case, skip a specific error code only instead of skipping all.
build_tools/check_backticks.py
Outdated
print(f"Total Files with invalid backticks: {len(results)}") # noqa | ||
for filename, result in results.items(): | ||
for lineno, errors in result.items(): | ||
print(f"{filename}:{lineno} {' '.join(errors)}") # noqa |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above.
build_tools/check_backticks.py
Outdated
for lineno, errors in result.items(): | ||
print(f"{filename}:{lineno} {' '.join(errors)}") # noqa | ||
else: | ||
print("No invalid backticks found") # noqa |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added some comments, but approved from my side for now.
build_tools/check_backticks.py
Outdated
# remove all the code blocks to avoid interference | ||
# they are allowed to use backticks in any way | ||
docstring = re.sub(r"```.*?```", "", docstring, flags=re.DOTALL) | ||
|
||
# remove all double-backticks to avoid interference | ||
# we are looking only for invalid single-backtick | ||
docstring = re.sub(r"``.*?``", "", docstring) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These two probably can be merged together, if 2 or more backticks are acceptable. Something like r"(`{2,}).+?\1"
, but it's not tested.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did this for the sake of simplicity.
The regex expressions can be merged but they become complex and unreadable when we try to avoid this situation: "regex 2 basically covers all the cases of regex 1 because it is a sub-part of it and thus the complete code-block is not substituted"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think I understood what you meant, but if backreference of captured group seems complex, you can keep as it is. It was only a suggestion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was not using the correct flags
, so I figured that out
"""Find invalid backtick text in a docstring.""" | ||
# remove all multiple backticks to avoid interference | ||
# we are looking only for invalid single-backtick | ||
docstring = re.sub(r"`{2,}.*?`{2,}", "", docstring, flags=re.DOTALL) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think I agree with this pattern. It matches ```abc``
as well for example, where backticks are not matched both side. Is that acceptable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's an example of what I meant: https://regex101.com/r/CqxGSL/1
I'm not sure if it's foolproof or not, but hopefully you get the idea.
If this is complex and causing problems, I am perfectly happy with your previous version of separate expressions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think I agree with this pattern. It matches
```abc
`` as well for example, where backticks are not matched both side. Is that acceptable?
I think that should be acceptable, because we basically want to remove all multiple-backtick to avoid interference when searching invalid single-backtick
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will it not lead to unmatched backticks which you'll replace with double backticks later?
I don't have this branch locally, will it be possible for you to run on a dummy file with such unmatched examples and share the script outputs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't this script sufficient for detection though? I thought that was the primary purpose for use as pre-commit tool.
That is, if the script modifies the file, then there's something wrong with the backticks.
It would not fully fix the unmatched backticks, but it would modify the file, therefore purpose satisfied?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the goal is to detect possible (but not necessarily) invalid use of single backticks, I am mostly okay with current script. @geetu040 needs to add some changes to change the exit code, but otherwise it seems fine.
But then it's not suitable for pre-commit, as it'll fail every time unless you fix the issues. You can make it a manual stage hook and not run by default, but that'd limit the usefulness of the script.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd argue it has its use then still, like the "create changelog" script. The latter is also buggy, but (until at least AI takes over here), it reduces the time to a complete changelog substantially.
all_backtick_text = re.findall(r"`.*?`", docstring, flags=re.DOTALL) | ||
# expressions like :math:`d(x, y):= (x-y)^2` are valid cases | ||
valid_backtick_text = re.findall(r":.*?:(`.*?`)", docstring, flags=re.DOTALL) | ||
|
||
# find all the invalid backtick code snippets | ||
invalid_backtick_text = set() | ||
for text in all_backtick_text: | ||
if text in valid_backtick_text: | ||
continue | ||
# rst hyperlinks are valid cases | ||
if re.match(r"`.*?<http.*?>`", text, flags=re.DOTALL): | ||
continue | ||
invalid_backtick_text.add(text) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yarnabrina I basically want to remove all these lines with this regex expression, would you approve of it?
(?<!:)(`.*?`)(?!_)
Hers the demo: https://regex101.com/r/QZa3b3/1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, but I need some clarification. Why does import sktime
need to be in double backticks? It's not a reference to an existing object, so won't link it anyway. Does it significantly alter their appearance in UI for better?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this doesnot work with newline character in single-backtick, if flag is set to Dot matches newline
it finds the text between 2 invalid backtick texts
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fkiraly this is a link you may be interested in
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
too far down the rabbit hole ... 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am reading that the single-backticks are used for multiple purposes like
- hyperlinks
- rst roles
- parameters
- variable, function, class names
- and more ...
And their use can differ and depends on the intent. Although there are some single-backtick texts that should have been in double-backtick to be rendered properly in the API reference, almost all of them fixed through this PR #6023, yet creating a script that can point them out would be pretty charlatan.
I think we should re-consider the idea of creating a build tool that tests for all invalid backticks because a script (non-intelligent) cannot elicit the invalid single-backticks based on the intent of use.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, but I need some clarification. Why does
import sktime
need to be in double backticks? It's not a reference to an existing object, so won't link it anyway. Does it significantly alter their appearance in UI for better?
Yes, actually UI, because in the API reference, single-back text is turned to itallic and double-backtick becomes code-highlighted
As said here: #5984 (review)
Btw, some docstrings use single-apostrophe (grave) by error instead of double-apostrope (double-grave) to render code snippets. If you encounter these, it would also be great to replace them.
The confusion is due to markdown havin single-grave for code snippets, but rst has double-grave.
For an example, you can see the difference in rendering of same documentation here:
Reference Issues/PRs
This PR is related to this work here: #6023
What does this implement/fix? Explain your changes.
It creates a python script in
build_tools
that can be run to list out all the files and incorrect uses ofbackticks
in the docstrings.What should a reviewer concentrate their feedback on?
find_invalid_backtick_text
Any other comments?
If this tool is accepted we can use this to fix all the remaining invalid
backtick
uses and can later be converted to a pre-commit tool.PR checklist
For all contributions
See here for further details on the algorithm maintainer role.