Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PR: eliminate call_for_nodes #633

Closed
wants to merge 9 commits into from
Closed

PR: eliminate call_for_nodes #633

wants to merge 9 commits into from

Conversation

edreamleo
Copy link
Contributor

@edreamleo edreamleo commented Jan 2, 2023

A preliminary PR for PR #632.

Background

rope.base.ast.call_for_nodes looks fishy to me:

  • It defines neither a depth first nor a breadth first traversal, but a strange partial mix.
    I know of no simple description of what it does.
  • It is used in two search-related contexts, but in each case only a single node is returned.
  • Almost any change to call_for_nodes causes multiple unit tests to fail.

@lieryan Is there a simple explanation for what call_for_nodes does?

Summary of changes

  • Replace calls to call_for_nodes where the recursive kwarg is (implicitly) false.
    See patched_ast and _PatchingASTWalker._handle.
  • Remove the recursive kwarg from call_for_nodes.
  • Remove the call to call_for_nodes in _ASTMatcher.find_matches.
  • Remove the call to call_for_nodes in _ResultChecker._find_node.
  • Remove call_for_nodes itself.

Why call_for_nodes should not be a utility

I understand neither what _ASTMatcher.find_matches and _ResultChecker._find_node are supposed to do nor how they do it. Imo, these strange traversals should be presented adjacent to the search code. Duplicating these traversals is better than creating a strange helper.

@edreamleo edreamleo marked this pull request as draft January 2, 2023 11:10
@edreamleo edreamleo marked this pull request as ready for review January 2, 2023 16:40
@lieryan
Copy link
Member

lieryan commented Jan 3, 2023

I don't know what makes you think that call_for_nodes is "strange", but it is just a simple, textbook pre-order depth-first search/traversal for ast nodes.

It's implemented as recursive function i.e. using implicit stack, and it has a mechanism for early exit once it found a match.

@lieryan
Copy link
Member

lieryan commented Jan 3, 2023

_ASTMatcher.find_matches is basically the core workhorse for similarfinder, which also powers restructuring and a few other refactorings.

It's basically an AST-based code search mechanism, where Rope constructs a tree of ast nodes to match and _ASTMatcher tries to find code that matches that AST pattern.

_ResultChecker._find_node is just a helper for testing patchedast. It's for testing the sorted_children that was constructed by patchedast.py. It basically just try to find a node with a certain type, and then check that its sorted_children have the expected structure.

The output of patchedast.py itself (i.e. sorted_children) is used by other refactoring code, mainly extract and restructuring refactoring.

sorted_children have very similar structure to parso's node's children. There are going to be a bunch other potential issues for replacing stdlib ast with parso, but if we do that, than we may essentially get the sorted_children for free from parso.

@edreamleo
Copy link
Contributor Author

edreamleo commented Jan 3, 2023

@lieryan > call_for_nodes is just a simple, textbook pre-order DFS/traversal

This would be the textbook traversal:

def traverse(node, callback):
    callback(node)
    for child in ast.iter_child_nodes(node):
        traverse(child, callback)

call_for_nodes is an incomplete traversal:

def call_for_nodes(node, callback):
    result = callback(node)
    if not result:
        for child in ast.iter_child_nodes(node):
            call_for_nodes(child, callback)

call_for_nodes skips all children if the parent returns a result. Otherwise, call_for_nodes visits all children, even if one of the children's callback returns a result!

If more than one child returns a result, what happens depends on how the caller uses call_for_nodes. That's very strange, imo.

Rope uses call_for_nodes in two places. Let's look at each in turn.

_ASTMatcher.find_matches in similarfinder.py

def find_matches(self):
    if self.matches is None:
        self.matches = []
        ast.call_for_nodes(self.body, self._check_node)
    return self.matches

Only RawSimilarFinder._get_matched_asts instantiates _ASTMatcher.

_ASTMatcher.body is the same as RawSimilarFinder.ast.
I'll assume that _ASTMatcher.body could be any ast.AST node, but I might be mistaken.

Anyway, ASTMatcher._check_node appends nodes (via several helpers) to self.matches. But call_for_nodes isn't complete, so not all parts of self.body's subtree will contribute to self.matches!

Perhaps the code works (or almost works?) because the partial traversal happens when comparing two trees which are always (usually?) similar enough that the partial traversals match. But this is hand waving. I'm totally confused.

_ResultChecker._find_node in patchedasttest.py

def _find_node(self, text):
    goal = text
    if not isinstance(text, (tuple, list)):
        goal = [text]

    << define class Search >>

    search = Search()
    ast.call_for_nodes(self.ast, search)
    return search.result

class Search:
    result = None

    def __call__(self, node):
        for text in goal:
            if sys.version_info >= (3, 8) and text in [
                "Num",
                "Str",
                "NameConstant",
                "Ellipsis",
            ]:
                text = "Constant"
            if str(node).startswith(text):
                self.result = node
                break
            if node.__class__.__name__.startswith(text):
                self.result = node
                break
        return self.result is not None

Once again, the incomplete nature of call_for_node seems to make it difficult to characterize exactly which nodes get searched.

But Aha. Search.__call__ always returns True once an assignment is made. This means that the traversal stops at the first search! So at last I understand the code.

Using this Aha, perhaps_ResultChecker._find_node could be simplified. Something like this (not tested):

def _find_node(self, text):
    goal = text
    if not isinstance(text, (tuple, list)):
        goal = [text]

    def match(node):
        for text in goal:
            if sys.version_info >= (3, 8) and text in [
                "Num", "Str", "NameConstant", "Ellipsis",
            ]:
                text = "Constant"
            if str(node).startswith(text):
                raise Found(node)
            if node.__class__.__name__.startswith(text):
                raise Found(node)

    def find(node):
        match(node)
        for child in ast.iter_child_nodes(node):
            find(child)

    try:
        find(self.ast)
        return None
    except Found as node:
        return node

Summary

I don't understand why _ASTMatcher.find_matches works. If it (and its unit tests) work, they would seem to work almost by accident. If there are invariants involved, they are well hidden.

I think I do understand why _ResultChecker._find_node works. It might be possible to simplify it. Imo, call_for_nodes does Rope's code no favors.

@edreamleo edreamleo marked this pull request as draft January 3, 2023 14:09
@edreamleo
Copy link
Contributor Author

@lieryan I am going to mark this as a draft while I play with simplifying _ResultChecker._find_node.

@lieryan
Copy link
Member

lieryan commented Jan 3, 2023

Think of it like searching in a regular string:

s = "abc def ghi def"
s.index("def")

Once you found the location of "def" inside s, if the callback returned True, then it indicates that there's no need to search further inside the substring "def" to find more matches. This is the early exit part of call_for_nodes.

call_for_nodes would, however, continue searching for another match in a different part of the s, it doesn't just return the first match. In this case, it'd would continue and found both instances of "def".

Strictly speaking, _ResultChecker actually only return the last match, because earlier matches would gets overwritten, this is generally sufficient because most test cases only expects to have a single match anyway.

For the purpose extract refactoring and restructuring, they don't really have a well-defined meaning if the matched call nodes internally have additional matches. For similar finder, the current implementation only returns the outer-most match, you could potentially generate additional matches if the matched nodes have grandchildren that also matches the search pattern, but it may not necessarily be desirable to enable that by default.

@edreamleo
Copy link
Contributor Author

@lieryan Thanks for your comments. Yes, I see you are correct--the last match governs. However, because the traversal is partial, exactly what the last match will be can't be described succinctly.

@lieryan
Copy link
Member

lieryan commented Jan 3, 2023

For the purpose of _ResultChecker, tests that have multiple matches probably should've been considered a test-bug. There's really no benefit of processing multiple matches in this context, as it would just makes the test more confusing. _ResultChecker probably could've asserted that there's exactly one match (and we can rewrite the two tests that actually have more than one match).

The alternative is to make _ResultChecker's Search.result be a list of results instead of just a singular result, this would allow check_children and check_region to process multiple matches as result could accumulate more than one results. IMO, that seems to be unnecessarily complicating the test cases.

@lieryan
Copy link
Member

lieryan commented Jan 3, 2023

However, because the traversal is partial, exactly what the last match will be can't be described succinctly.

Actually, because it's a pre-order traversal, the last match is well defined. If you are searching for:

(a + b) + (c + (d + e))

with pattern:

${var1} + ${var2}

Then the match would be the outer-most last match: (c + (d + e)). Since pre-order traversal calls the callback for the visited nodes first before recursing to their children, if the callback wants to early-exit, then call_for_nodes wouldn't even recurse to (d + e) part.

Note that returning the last match isn't inherent to call_for_nodes. That's just how _ResultChecker have implemented its Search class. call_for_nodes just provides a mechanism for early exit; if the caller wants to get all matches, then it could defined a callback that never returns True. It's just turn into regular pre order traversal in that case.

@edreamleo
Copy link
Contributor Author

I am going to close this PR.

@edreamleo edreamleo closed this Jan 3, 2023
@edreamleo
Copy link
Contributor Author

@lieryan Thanks for your comment. Imo it would be good to mention the concept of "outermost-last-match" in the docstring for call_for_nodes. In fact, since this function is used only in matching contexts, it might be good to rename the function.

@lieryan
Copy link
Member

lieryan commented Jan 4, 2023

It's not correct to document the "outermost-last-match" in call_for_nodes, because it is really just a general purpose pre-order traversal algorithm, you can use it to handle multiple matches in whatever ways you choose depending on how you write the callback. Case in point, _ASTMatcher actually does handle multiple matches.

As previously mentioned, the "outermost-last-match" is only true for _ResultChecker only because _ResultChecker overwrites earlier matches. If _ResultChecker accumulates multiple matches instead in a list, then it would've collected all the matches. If it doesn't return True on match, then it would've recursively checked submatches. It's not for call_for_nodes to decide how the callback handle multiple matches.

If you want to document that it's "outermost-last-match" behavior, it would be in _ResultChecker, not in call_for_nodes. But this is a helper code for rope's own tests, so it's going to be mainly as internal documentation/developer note, not user/library documentation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants