Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MNT] create build tool to check invalid backticks #6088

Merged
merged 8 commits into from Mar 20, 2024

Conversation

geetu040
Copy link
Contributor

@geetu040 geetu040 commented Mar 8, 2024

Reference Issues/PRs

This PR is related to this work here: #6023

What does this implement/fix? Explain your changes.

It creates a python script in build_tools that can be run to list out all the files and incorrect uses of backticks in the docstrings.

What should a reviewer concentrate their feedback on?

  • checkout the regex expressions used in function: find_invalid_backtick_text
  • I am not sure how the output should look like, I have attached a snapshot below in comments.

Any other comments?

If this tool is accepted we can use this to fix all the remaining invalid backtick uses and can later be converted to a pre-commit tool.

PR checklist

For all contributions
  • Optionally, I've added myself and possibly others to the CODEOWNERS file - do this if you want to become the owner or maintainer of an estimator you added.
    See here for further details on the algorithm maintainer role.
  • The PR title starts with either [ENH], [MNT], [DOC], or [BUG]. [BUG] - bugfix, [MNT] - CI, test framework, [ENH] - adding or improving code, [DOC] - writing or improving documentation or docstrings.

@geetu040
Copy link
Contributor Author

geetu040 commented Mar 8, 2024

python build_tools/check_backtick.py

Here is an output snapshot

Total Files with invalid backticks: 94
/home/geetu/work/os/sktime/sktime/benchmarking/forecasting.py:12 `sktime.datasets`
/home/geetu/work/os/sktime/sktime/benchmarking/forecasting.py:79 `sktime.datasets`
/home/geetu/work/os/sktime/sktime/benchmarking/evaluation.py:226 `<https://en.wikipedia.org/wiki/Sign_test>`
/home/geetu/work/os/sktime/sktime/benchmarking/evaluation.py:268 `<http://en.wikipedia.org/wiki/Wilcoxon_rank-sum_test>`
/home/geetu/work/os/sktime/sktime/benchmarking/evaluation.py:403 `friedman_test` `Nemenyi test <https://en.wikipedia.org/wiki/Nemenyi_test>`
/home/geetu/work/os/sktime/sktime/transformations/_delegate.py:63 `transform` `X` `-output` `Series` `Primitives` `pd.DataFrame` `Panel` `transform-output` `pd-multiindex`
/home/geetu/work/os/sktime/sktime/transformations/base.py:230 `other` `NotImplemented` `self` `sktime`
/home/geetu/work/os/sktime/sktime/transformations/base.py:260 `other` `NotImplemented` `self` `sktime`
/home/geetu/work/os/sktime/sktime/transformations/base.py:285 `other` `NotImplemented`
/home/geetu/work/os/sktime/sktime/transformations/base.py:307 `other` `NotImplemented` `self` `sktime`
/home/geetu/work/os/sktime/sktime/transformations/base.py:332 `other` `NotImplemented` `self` `sktime`
/home/geetu/work/os/sktime/sktime/transformations/base.py:357 `InvertTransform` `self`

fkiraly
fkiraly previously approved these changes Mar 8, 2024
Copy link
Collaborator

@fkiraly fkiraly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

Later, we could run this from GHA, or via pytest.
pytest is possibly easier, i.e., just failing a test if the PR adds invalid use of backticks in docstring.

FYI @yarnabrina, I would appreciate your eys on this, too.

Copy link
Collaborator

@yarnabrina yarnabrina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Posted few questions/doubts/comments.

Comment on lines 13 to 20
def find_py_files(folder_path):
"""Find all Python files in a given folder path."""
py_files = []
for root, _, files in os.walk(folder_path):
for file in files:
if file.endswith(".py"):
py_files.append(os.path.join(root, file))
return py_files
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably equivalent to the following:

import glob

glob.glob(f"{folder_path}/**/*.py", recursive=True)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, that was a really great idea

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice! if you have time, you might like to check
https://github.com/sktime/skbase/blob/main/skbase/lookup/_lookup.py

the object retrieval utility used in all_estimators and testing, perhaps there are some things to simplify there too!

No high prio ofc.


docstrings = {}
for node in ast.walk(tree):
if isinstance(node, (ast.FunctionDef, ast.ClassDef, ast.Module)):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've never used ast module myself, so apologies if it's an obvious question. What does ast.Module represent? Is it for the docstring for the file itself?

Also, does it cover enums too? I don't know what all possibilities are there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A node of type ast.Module represents the file itself and is a root node of the tree.

Enums are basically python classes that inherit enum.Enum class, so that is covered as well

the updated code deals this differently

and isinstance(node.body[0], ast.Expr)
and isinstance(node.body[0].value, ast.Str)
):
lineno = 0 if isinstance(node, ast.Module) else node.lineno
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this line do? Is it checking for first line in the body? What happens if there are empty lines before function docstrings (check 2024 release of black) or other comments for module docstrings (copyright notices etc.)?

Again apologies if ast is smart enough to distinguish codes from comments or empty lines, I don't know.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ast is actually smart enough

the updated code deals this differently

def find_invalid_backtick_text(docstring):
"""Find invalid backtick text in a docstring."""
# remove all the double-backticks
doc = docstring.replace("``", "")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are existing double backticks being removed?

Also, if there are triple backticks (```) or more in code, what happens to those?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

double-backticks are fine, problem occurs only with single-backticks. I am removing them so they don't interfere with the regex expressions. I am trying to keep the regex expressions as simple as possible and doing it in steps because they can become very complicated and hard to debug

the updated code deals this differently and handles ```

doc = docstring.replace("``", "")

all_backtick_text = re.findall(r"`.*?`", doc)
valid_backtick_text = re.findall(r":.*?:(`.*?`)", doc)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't follow this at all. Why text within single backticks after text of any length between a pair of colon (:) is considered valid?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:math:`d(x, y):= (x-y)^2`

expression like these are valid cases because they have special representation in documentation generated by rst

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks extremely specific.

Your code however matches for any :, not just ones after :math:. Will it make sense to check for that (and any other specific cases you want to allow), perhaps with positive look behind?

Copy link
Contributor Author

@geetu040 geetu040 Mar 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its not specific, there are multiple cases like

:attr:`means_`
:data:`sys.stderr`
:term:`Glossary <random_state>`
:ref:`User Guide <cross_validation>`
:math:`i_k \in \{1, \dots, m\}, j_k \in \{1, \dots, n\}`

and there can exist many more

build_tools/check_backticks.py Show resolved Hide resolved
following changes have been made
1. use glob to find python files
2. improve ast node selection method
3. fix the line number on docstrings
4. handle triple-backticks
5. add useful comments where needed
Comment on lines 63 to 64
if text in invalid_backtick_text:
continue
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just for unique texts right? Let's skip this and make invalid_backtick_text a set and use add below.


def main():
"""Execute the main function of the script."""
import sktime
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This assumes sktime is already installed. I don't think we can assume that for pre-commit like cases.

results = {}

# list all the python files in the project
py_files = find_py_files(sktime.__path__[0])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above. What about accepting a path as CLI argument, and defaulting to ./sktime by default?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this actually improves the time from 2~3 seconds to 1~2 seconds as we are saving time by not importing sktime


# print the lines along with the invalid backticks text
if len(results) > 0:
print(f"Total Files with invalid backticks: {len(results)}") # noqa
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this noqa for? Is flake8 or some other linter complaining about presence of print? In that case, skip a specific error code only instead of skipping all.

print(f"Total Files with invalid backticks: {len(results)}") # noqa
for filename, result in results.items():
for lineno, errors in result.items():
print(f"{filename}:{lineno} {' '.join(errors)}") # noqa
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above.

for lineno, errors in result.items():
print(f"{filename}:{lineno} {' '.join(errors)}") # noqa
else:
print("No invalid backticks found") # noqa
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above.

yarnabrina
yarnabrina previously approved these changes Mar 12, 2024
Copy link
Collaborator

@yarnabrina yarnabrina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some comments, but approved from my side for now.

Comment on lines 47 to 53
# remove all the code blocks to avoid interference
# they are allowed to use backticks in any way
docstring = re.sub(r"```.*?```", "", docstring, flags=re.DOTALL)

# remove all double-backticks to avoid interference
# we are looking only for invalid single-backtick
docstring = re.sub(r"``.*?``", "", docstring)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two probably can be merged together, if 2 or more backticks are acceptable. Something like r"(`{2,}).+?\1", but it's not tested.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did this for the sake of simplicity.

The regex expressions can be merged but they become complex and unreadable when we try to avoid this situation: "regex 2 basically covers all the cases of regex 1 because it is a sub-part of it and thus the complete code-block is not substituted"

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think I understood what you meant, but if backreference of captured group seems complex, you can keep as it is. It was only a suggestion.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was not using the correct flags, so I figured that out

build_tools/check_backticks.py Show resolved Hide resolved
"""Find invalid backtick text in a docstring."""
# remove all multiple backticks to avoid interference
# we are looking only for invalid single-backtick
docstring = re.sub(r"`{2,}.*?`{2,}", "", docstring, flags=re.DOTALL)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think I agree with this pattern. It matches ```abc`` as well for example, where backticks are not matched both side. Is that acceptable?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's an example of what I meant: https://regex101.com/r/CqxGSL/1

I'm not sure if it's foolproof or not, but hopefully you get the idea.

If this is complex and causing problems, I am perfectly happy with your previous version of separate expressions.

Copy link
Contributor Author

@geetu040 geetu040 Mar 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think I agree with this pattern. It matches ```abc `` as well for example, where backticks are not matched both side. Is that acceptable?

I think that should be acceptable, because we basically want to remove all multiple-backtick to avoid interference when searching invalid single-backtick

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will it not lead to unmatched backticks which you'll replace with double backticks later?

I don't have this branch locally, will it be possible for you to run on a dummy file with such unmatched examples and share the script outputs?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this script sufficient for detection though? I thought that was the primary purpose for use as pre-commit tool.

That is, if the script modifies the file, then there's something wrong with the backticks.

It would not fully fix the unmatched backticks, but it would modify the file, therefore purpose satisfied?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the goal is to detect possible (but not necessarily) invalid use of single backticks, I am mostly okay with current script. @geetu040 needs to add some changes to change the exit code, but otherwise it seems fine.

But then it's not suitable for pre-commit, as it'll fail every time unless you fix the issues. You can make it a manual stage hook and not run by default, but that'd limit the usefulness of the script.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd argue it has its use then still, like the "create changelog" script. The latter is also buggy, but (until at least AI takes over here), it reduces the time to a complete changelog substantially.

Comment on lines +51 to +63
all_backtick_text = re.findall(r"`.*?`", docstring, flags=re.DOTALL)
# expressions like :math:`d(x, y):= (x-y)^2` are valid cases
valid_backtick_text = re.findall(r":.*?:(`.*?`)", docstring, flags=re.DOTALL)

# find all the invalid backtick code snippets
invalid_backtick_text = set()
for text in all_backtick_text:
if text in valid_backtick_text:
continue
# rst hyperlinks are valid cases
if re.match(r"`.*?<http.*?>`", text, flags=re.DOTALL):
continue
invalid_backtick_text.add(text)
Copy link
Contributor Author

@geetu040 geetu040 Mar 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yarnabrina I basically want to remove all these lines with this regex expression, would you approve of it?

(?<!:)(`.*?`)(?!_)

Hers the demo: https://regex101.com/r/QZa3b3/1

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, but I need some clarification. Why does import sktime need to be in double backticks? It's not a reference to an existing object, so won't link it anyway. Does it significantly alter their appearance in UI for better?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this doesnot work with newline character in single-backtick, if flag is set to Dot matches newline it finds the text between 2 invalid backtick texts

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fkiraly this is a link you may be interested in

numpy/numpydoc#525

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

too far down the rabbit hole ... 😄

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am reading that the single-backticks are used for multiple purposes like

  1. hyperlinks
  2. rst roles
  3. parameters
  4. variable, function, class names
  5. and more ...

And their use can differ and depends on the intent. Although there are some single-backtick texts that should have been in double-backtick to be rendered properly in the API reference, almost all of them fixed through this PR #6023, yet creating a script that can point them out would be pretty charlatan.

I think we should re-consider the idea of creating a build tool that tests for all invalid backticks because a script (non-intelligent) cannot elicit the invalid single-backticks based on the intent of use.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, but I need some clarification. Why does import sktime need to be in double backticks? It's not a reference to an existing object, so won't link it anyway. Does it significantly alter their appearance in UI for better?

Yes, actually UI, because in the API reference, single-back text is turned to itallic and double-backtick becomes code-highlighted

As said here: #5984 (review)

Btw, some docstrings use single-apostrophe (grave) by error instead of double-apostrope (double-grave) to render code snippets. If you encounter these, it would also be great to replace them.

The confusion is due to markdown havin single-grave for code snippets, but rst has double-grave.

For an example, you can see the difference in rendering of same documentation here:

@fkiraly fkiraly added documentation Documentation & tutorials maintenance Continuous integration, unit testing & package distribution enhancement Adding new functionality labels Mar 16, 2024
@fkiraly fkiraly merged commit d1f79ac into sktime:main Mar 20, 2024
94 of 98 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Documentation & tutorials enhancement Adding new functionality maintenance Continuous integration, unit testing & package distribution
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants