[MNT] create build tool to check invalid backticks #6088

geetu040 · 2024-03-08T17:55:01Z

Reference Issues/PRs

This PR is related to this work here: #6023

What does this implement/fix? Explain your changes.

It creates a python script in build_tools that can be run to list out all the files and incorrect uses of backticks in the docstrings.

What should a reviewer concentrate their feedback on?

checkout the regex expressions used in function: find_invalid_backtick_text
I am not sure how the output should look like, I have attached a snapshot below in comments.

Any other comments?

If this tool is accepted we can use this to fix all the remaining invalid backtick uses and can later be converted to a pre-commit tool.

PR checklist

For all contributions

Optionally, I've added myself and possibly others to the CODEOWNERS file - do this if you want to become the owner or maintainer of an estimator you added.
See here for further details on the algorithm maintainer role.
The PR title starts with either [ENH], [MNT], [DOC], or [BUG]. [BUG] - bugfix, [MNT] - CI, test framework, [ENH] - adding or improving code, [DOC] - writing or improving documentation or docstrings.

geetu040 · 2024-03-08T17:55:24Z

python build_tools/check_backtick.py

Here is an output snapshot

Total Files with invalid backticks: 94
/home/geetu/work/os/sktime/sktime/benchmarking/forecasting.py:12 `sktime.datasets`
/home/geetu/work/os/sktime/sktime/benchmarking/forecasting.py:79 `sktime.datasets`
/home/geetu/work/os/sktime/sktime/benchmarking/evaluation.py:226 `<https://en.wikipedia.org/wiki/Sign_test>`
/home/geetu/work/os/sktime/sktime/benchmarking/evaluation.py:268 `<http://en.wikipedia.org/wiki/Wilcoxon_rank-sum_test>`
/home/geetu/work/os/sktime/sktime/benchmarking/evaluation.py:403 `friedman_test` `Nemenyi test <https://en.wikipedia.org/wiki/Nemenyi_test>`
/home/geetu/work/os/sktime/sktime/transformations/_delegate.py:63 `transform` `X` `-output` `Series` `Primitives` `pd.DataFrame` `Panel` `transform-output` `pd-multiindex`
/home/geetu/work/os/sktime/sktime/transformations/base.py:230 `other` `NotImplemented` `self` `sktime`
/home/geetu/work/os/sktime/sktime/transformations/base.py:260 `other` `NotImplemented` `self` `sktime`
/home/geetu/work/os/sktime/sktime/transformations/base.py:285 `other` `NotImplemented`
/home/geetu/work/os/sktime/sktime/transformations/base.py:307 `other` `NotImplemented` `self` `sktime`
/home/geetu/work/os/sktime/sktime/transformations/base.py:332 `other` `NotImplemented` `self` `sktime`
/home/geetu/work/os/sktime/sktime/transformations/base.py:357 `InvertTransform` `self`

fkiraly

Nice!

Later, we could run this from GHA, or via pytest.
pytest is possibly easier, i.e., just failing a test if the PR adds invalid use of backticks in docstring.

FYI @yarnabrina, I would appreciate your eys on this, too.

yarnabrina

Posted few questions/doubts/comments.

yarnabrina · 2024-03-09T12:34:04Z

build_tools/check_backticks.py

+def find_py_files(folder_path):
+    """Find all Python files in a given folder path."""
+    py_files = []
+    for root, _, files in os.walk(folder_path):
+        for file in files:
+            if file.endswith(".py"):
+                py_files.append(os.path.join(root, file))
+    return py_files


This is probably equivalent to the following:

import glob glob.glob(f"{folder_path}/**/*.py", recursive=True)

Thanks, that was a really great idea

nice! if you have time, you might like to check
https://github.com/sktime/skbase/blob/main/skbase/lookup/_lookup.py

the object retrieval utility used in all_estimators and testing, perhaps there are some things to simplify there too!

No high prio ofc.

yarnabrina · 2024-03-09T12:42:04Z

build_tools/check_backticks.py

+
+    docstrings = {}
+    for node in ast.walk(tree):
+        if isinstance(node, (ast.FunctionDef, ast.ClassDef, ast.Module)):


I've never used ast module myself, so apologies if it's an obvious question. What does ast.Module represent? Is it for the docstring for the file itself?

Also, does it cover enums too? I don't know what all possibilities are there.

A node of type ast.Module represents the file itself and is a root node of the tree.

Enums are basically python classes that inherit enum.Enum class, so that is covered as well

the updated code deals this differently

yarnabrina · 2024-03-09T12:44:29Z

build_tools/check_backticks.py

+                and isinstance(node.body[0], ast.Expr)
+                and isinstance(node.body[0].value, ast.Str)
+            ):
+                lineno = 0 if isinstance(node, ast.Module) else node.lineno


What does this line do? Is it checking for first line in the body? What happens if there are empty lines before function docstrings (check 2024 release of black) or other comments for module docstrings (copyright notices etc.)?

Again apologies if ast is smart enough to distinguish codes from comments or empty lines, I don't know.

ast is actually smart enough

the updated code deals this differently

yarnabrina · 2024-03-09T12:48:59Z

build_tools/check_backticks.py

+def find_invalid_backtick_text(docstring):
+    """Find invalid backtick text in a docstring."""
+    # remove all the double-backticks
+    doc = docstring.replace("``", "")


Why are existing double backticks being removed?

Also, if there are triple backticks (```) or more in code, what happens to those?

double-backticks are fine, problem occurs only with single-backticks. I am removing them so they don't interfere with the regex expressions. I am trying to keep the regex expressions as simple as possible and doing it in steps because they can become very complicated and hard to debug

the updated code deals this differently and handles ```

yarnabrina · 2024-03-09T12:58:06Z

build_tools/check_backticks.py

+    doc = docstring.replace("``", "")
+
+    all_backtick_text = re.findall(r"`.*?`", doc)
+    valid_backtick_text = re.findall(r":.*?:(`.*?`)", doc)


I don't follow this at all. Why text within single backticks after text of any length between a pair of colon (:) is considered valid?

:math:`d(x, y):= (x-y)^2`

expression like these are valid cases because they have special representation in documentation generated by rst

This looks extremely specific.

Your code however matches for any :, not just ones after :math:. Will it make sense to check for that (and any other specific cases you want to allow), perhaps with positive look behind?

Its not specific, there are multiple cases like

:attr:`means_` :data:`sys.stderr` :term:`Glossary <random_state>` :ref:`User Guide <cross_validation>` :math:`i_k \in \{1, \dots, m\}, j_k \in \{1, \dots, n\}`

and there can exist many more

build_tools/check_backticks.py

following changes have been made 1. use glob to find python files 2. improve ast node selection method 3. fix the line number on docstrings 4. handle triple-backticks 5. add useful comments where needed

yarnabrina · 2024-03-10T14:46:51Z

build_tools/check_backticks.py

+        if text in invalid_backtick_text:
+            continue


This is just for unique texts right? Let's skip this and make invalid_backtick_text a set and use add below.

yarnabrina · 2024-03-10T14:48:11Z

build_tools/check_backticks.py

+
+def main():
+    """Execute the main function of the script."""
+    import sktime


This assumes sktime is already installed. I don't think we can assume that for pre-commit like cases.

yarnabrina · 2024-03-10T14:49:11Z

build_tools/check_backticks.py

+    results = {}
+
+    # list all the python files in the project
+    py_files = find_py_files(sktime.__path__[0])


Same as above. What about accepting a path as CLI argument, and defaulting to ./sktime by default?

this actually improves the time from 2~3 seconds to 1~2 seconds as we are saving time by not importing sktime

yarnabrina · 2024-03-10T14:50:50Z

build_tools/check_backticks.py

+
+    # print the lines along with the invalid backticks text
+    if len(results) > 0:
+        print(f"Total Files with invalid backticks: {len(results)}")  # noqa


What is this noqa for? Is flake8 or some other linter complaining about presence of print? In that case, skip a specific error code only instead of skipping all.

yarnabrina · 2024-03-10T14:51:00Z

build_tools/check_backticks.py

+        print(f"Total Files with invalid backticks: {len(results)}")  # noqa
+        for filename, result in results.items():
+            for lineno, errors in result.items():
+                print(f"{filename}:{lineno} {' '.join(errors)}")  # noqa


Same as above.

yarnabrina · 2024-03-10T14:51:09Z

build_tools/check_backticks.py

+            for lineno, errors in result.items():
+                print(f"{filename}:{lineno} {' '.join(errors)}")  # noqa
+    else:
+        print("No invalid backticks found")  # noqa


Same as above.

yarnabrina

Added some comments, but approved from my side for now.

yarnabrina · 2024-03-12T16:43:26Z

build_tools/check_backticks.py

+    # remove all the code blocks to avoid interference
+    # they are allowed to use backticks in any way
+    docstring = re.sub(r"```.*?```", "", docstring, flags=re.DOTALL)
+
+    # remove all double-backticks to avoid interference
+    # we are looking only for invalid single-backtick
+    docstring = re.sub(r"``.*?``", "", docstring)


These two probably can be merged together, if 2 or more backticks are acceptable. Something like r"(`{2,}).+?\1", but it's not tested.

I did this for the sake of simplicity.

The regex expressions can be merged but they become complex and unreadable when we try to avoid this situation: "regex 2 basically covers all the cases of regex 1 because it is a sub-part of it and thus the complete code-block is not substituted"

I don't think I understood what you meant, but if backreference of captured group seems complex, you can keep as it is. It was only a suggestion.

I was not using the correct flags, so I figured that out

build_tools/check_backticks.py

yarnabrina · 2024-03-13T02:52:29Z

build_tools/check_backticks.py

+    """Find invalid backtick text in a docstring."""
+    # remove all multiple backticks to avoid interference
+    # we are looking only for invalid single-backtick
+    docstring = re.sub(r"`{2,}.*?`{2,}", "", docstring, flags=re.DOTALL)


I don't think I agree with this pattern. It matches ```abc`` as well for example, where backticks are not matched both side. Is that acceptable?

Here's an example of what I meant: https://regex101.com/r/CqxGSL/1

I'm not sure if it's foolproof or not, but hopefully you get the idea.

If this is complex and causing problems, I am perfectly happy with your previous version of separate expressions.

I don't think I agree with this pattern. It matches ```abc `` as well for example, where backticks are not matched both side. Is that acceptable?

I think that should be acceptable, because we basically want to remove all multiple-backtick to avoid interference when searching invalid single-backtick

Will it not lead to unmatched backticks which you'll replace with double backticks later?

I don't have this branch locally, will it be possible for you to run on a dummy file with such unmatched examples and share the script outputs?

Isn't this script sufficient for detection though? I thought that was the primary purpose for use as pre-commit tool.

That is, if the script modifies the file, then there's something wrong with the backticks.

It would not fully fix the unmatched backticks, but it would modify the file, therefore purpose satisfied?

If the goal is to detect possible (but not necessarily) invalid use of single backticks, I am mostly okay with current script. @geetu040 needs to add some changes to change the exit code, but otherwise it seems fine.

But then it's not suitable for pre-commit, as it'll fail every time unless you fix the issues. You can make it a manual stage hook and not run by default, but that'd limit the usefulness of the script.

I'd argue it has its use then still, like the "create changelog" script. The latter is also buggy, but (until at least AI takes over here), it reduces the time to a complete changelog substantially.

geetu040 · 2024-03-13T04:06:29Z

build_tools/check_backticks.py

+    all_backtick_text = re.findall(r"`.*?`", docstring, flags=re.DOTALL)
+    # expressions like :math:`d(x, y):= (x-y)^2` are valid cases
+    valid_backtick_text = re.findall(r":.*?:(`.*?`)", docstring, flags=re.DOTALL)
+
+    # find all the invalid backtick code snippets
+    invalid_backtick_text = set()
+    for text in all_backtick_text:
+        if text in valid_backtick_text:
+            continue
+        # rst hyperlinks are valid cases
+        if re.match(r"`.*?<http.*?>`", text, flags=re.DOTALL):
+            continue
+        invalid_backtick_text.add(text)


@yarnabrina I basically want to remove all these lines with this regex expression, would you approve of it?

(?<!:)(`.*?`)(?!_)

Hers the demo: https://regex101.com/r/QZa3b3/1

Sorry, but I need some clarification. Why does import sktime need to be in double backticks? It's not a reference to an existing object, so won't link it anyway. Does it significantly alter their appearance in UI for better?

this doesnot work with newline character in single-backtick, if flag is set to Dot matches newline it finds the text between 2 invalid backtick texts

@fkiraly this is a link you may be interested in

numpy/numpydoc#525

too far down the rabbit hole ... 😄

I am reading that the single-backticks are used for multiple purposes like

hyperlinks

rst roles

parameters

variable, function, class names

and more ...

And their use can differ and depends on the intent. Although there are some single-backtick texts that should have been in double-backtick to be rendered properly in the API reference, almost all of them fixed through this PR #6023, yet creating a script that can point them out would be pretty charlatan.

I think we should re-consider the idea of creating a build tool that tests for all invalid backticks because a script (non-intelligent) cannot elicit the invalid single-backticks based on the intent of use.

Sorry, but I need some clarification. Why does import sktime need to be in double backticks? It's not a reference to an existing object, so won't link it anyway. Does it significantly alter their appearance in UI for better?

Yes, actually UI, because in the API reference, single-back text is turned to itallic and double-backtick becomes code-highlighted

As said here: #5984 (review)

Btw, some docstrings use single-apostrophe (grave) by error instead of double-apostrope (double-grave) to render code snippets. If you encounter these, it would also be great to replace them.

The confusion is due to markdown havin single-grave for code snippets, but rst has double-grave.

For an example, you can see the difference in rendering of same documentation here:

use of single-backtick

use of double-backtick

[MNT] create build tool to check invalid backticks

cf74136

geetu040 requested review from achieveordie, benHeid, fkiraly and yarnabrina as code owners March 8, 2024 17:55

fkiraly previously approved these changes Mar 8, 2024

View reviewed changes

yarnabrina requested changes Mar 9, 2024

View reviewed changes

[MNT] fix minor issues; improve existing code; add comments

039fe0c

following changes have been made 1. use glob to find python files 2. improve ast node selection method 3. fix the line number on docstrings 4. handle triple-backticks 5. add useful comments where needed

geetu040 dismissed fkiraly’s stale review via 039fe0c March 9, 2024 17:48

geetu040 requested a review from yarnabrina March 10, 2024 04:34

yarnabrina requested changes Mar 10, 2024

View reviewed changes

geetu040 added 3 commits March 10, 2024 21:28

accept folder_path as CLI argument which defaults to "./sktime"

657f7de

use set instead of list for invalid_backtick_text

583b822

flake8: skip specific error code instead of all for print statements

18740e4

geetu040 requested a review from yarnabrina March 10, 2024 16:41

yarnabrina previously approved these changes Mar 12, 2024

View reviewed changes

geetu040 added 3 commits March 13, 2024 00:26

merges multiple backtick regex

2d67ce2

adds check for rst hyperlinks

85b9c75

fixes mutated hyperlinks in sktime#6023

ac9bb24

geetu040 dismissed yarnabrina’s stale review via ac9bb24 March 12, 2024 20:50

yarnabrina reviewed Mar 13, 2024

View reviewed changes

geetu040 commented Mar 13, 2024

View reviewed changes

fkiraly added documentation Documentation & tutorials maintenance Continuous integration, unit testing & package distribution enhancement Adding new functionality labels Mar 16, 2024

fkiraly merged commit d1f79ac into sktime:main Mar 20, 2024
94 of 98 checks passed

[MNT] create build tool to check invalid backticks #6088

[MNT] create build tool to check invalid backticks #6088

Conversation

geetu040 commented Mar 8, 2024

Reference Issues/PRs

What does this implement/fix? Explain your changes.

What should a reviewer concentrate their feedback on?

Any other comments?

PR checklist

For all contributions

geetu040 commented Mar 8, 2024

fkiraly left a comment

Choose a reason for hiding this comment

yarnabrina left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

geetu040 Mar 10, 2024 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

yarnabrina left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

geetu040 Mar 13, 2024 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

geetu040 Mar 13, 2024 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

geetu040 Mar 10, 2024 •

edited

geetu040 Mar 13, 2024 •

edited

geetu040 Mar 13, 2024 •

edited