Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

smarter generator check for combined return/yield statements #4721

Merged
merged 6 commits into from Aug 20, 2020

Conversation

soid
Copy link
Contributor

@soid soid commented Aug 11, 2020

Resolves #4720

soid added 2 commits Aug 11, 2020
…turn functions from the inner closures as they are not part of the generators
@codecov
Copy link

codecov bot commented Aug 11, 2020

Codecov Report

Merging #4721 into master will increase coverage by 0.49%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #4721      +/-   ##
==========================================
+ Coverage   86.70%   87.20%   +0.49%     
==========================================
  Files         160      160              
  Lines        9703     9815     +112     
  Branches     1424     1447      +23     
==========================================
+ Hits         8413     8559     +146     
+ Misses       1026      995      -31     
+ Partials      264      261       -3     
Impacted Files Coverage Δ
scrapy/utils/misc.py 95.96% <100.00%> (+0.43%) ⬆️
scrapy/core/downloader/webclient.py 94.48% <0.00%> (-3.48%) ⬇️
scrapy/core/scraper.py 87.11% <0.00%> (-0.86%) ⬇️
scrapy/settings/default_settings.py 98.74% <0.00%> (+<0.01%) ⬆️
scrapy/core/engine.py 87.50% <0.00%> (+0.05%) ⬆️
scrapy/utils/python.py 85.79% <0.00%> (+0.08%) ⬆️
scrapy/utils/log.py 89.24% <0.00%> (+0.35%) ⬆️
scrapy/utils/conf.py 93.69% <0.00%> (+0.55%) ⬆️
scrapy/downloadermiddlewares/httpcache.py 93.97% <0.00%> (+0.82%) ⬆️
scrapy/utils/reactor.py 81.35% <0.00%> (+1.35%) ⬆️
... and 6 more

@Gallaecio Gallaecio requested a review from wRAR Aug 12, 2020
Copy link
Member

@Gallaecio Gallaecio left a comment

Looks great to me!

scrapy/utils/misc.py Outdated Show resolved Hide resolved
@elacuesta
Copy link
Member

elacuesta commented Aug 13, 2020

Thanks for this update. I considered the visitor approach in the original PR (#3869), but decided to go with ast.walk because of performance concerns. Is there a chance you could add a stress test to see if visiting all nodes impacts performance somehow?

@soid
Copy link
Contributor Author

soid commented Aug 13, 2020

@elacuesta good point, I briefly considered the importance of stopping, but mistakenly decided it would be too rare case with never executed code after return statement. In fact, return statement can be inside a condition, so it make sense to stop right away. Another source of slower performance could be the recursion used by the visitor. Previous false positives are expected to become slower – if before the function stopped after a return in a nested function, now it may run longer since it will traverse the generator function and not the nested functions.

So I rewrote the solution - essentially, I copied the ast.walk() function and made it skip nested function definition subtrees. It should be even faster since now it won't traverse nested functions.

I ran some micro-benchmarking and the performance looks comparable:

       new: walk_callable  new: visitor    current
count           20.000000     20.000000  20.000000
mean             5.188111      5.562837   5.215105
std              0.041617      0.064930   0.063022
min              5.130406      5.510167   5.130689
25%              5.155440      5.529315   5.181950
50%              5.177555      5.548093   5.200611
75%              5.223487      5.565969   5.234239
max              5.274709      5.805158   5.383350

(it is for 10K calls per sample, with caching removed, the code is here: soid/scrapy@f4f68db)
Considering that the function is cached I think it's insignificant differences.
Please let me know if you want me to run scrapy-bench as well.

Copy link
Member

@elacuesta elacuesta left a comment

Thanks again for the work. I ran a small benchmark, which tests a function with many (useless, I must admit) return lines and a generator that includes said function within its definition. The first one is probably an unrealistic case, but it works well to determine the efficiency of the different approaches IMHO. I believe your new walk_callable approach works better than the current and the visitor ones.

I left two (mostly aesthetic) suggestions, the PR already has one approval so if you could address those I think we should be in a position to merge.


Results:

function
           current  new: walk_callable  new: visitor
count  1000.000000         1000.000000   1000.000000
mean      0.616343            0.550850      4.033952
std       0.071496            0.030051      0.102731
min       0.534779            0.529762      3.906079
25%       0.575640            0.537758      3.970737
50%       0.591948            0.542232      4.006572
75%       0.633615            0.551045      4.063345
max       1.446230            0.867754      5.040451

generator
           current  new: walk_callable  new: visitor
count  1000.000000         1000.000000   1000.000000
mean      0.581419            0.003076      0.003181
std       0.017486            0.000506      0.000542
min       0.560853            0.002855      0.002884
25%       0.570481            0.002873      0.002909
50%       0.575656            0.002901      0.002963
75%       0.586719            0.003006      0.003212
max       0.698389            0.008358      0.007826

And the code:

import ast
import timeit
from textwrap import indent

import pandas
from scrapy.utils.misc import walk_callable


# a function with many 'return' statements
nested = "def nested():\n" + "    return 1\n" * 20_000

# a generator which contains the above function within its definition
gen_str = f"""
def gen():
{indent(nested, "    ")}
    yield 1
"""


def returns_none(return_node):
    value = return_node.value
    return value is None or isinstance(value, ast.NameConstant) and value.value is None


class GeneratorNodeVisitor(ast.NodeVisitor):
    """Visits all AST nodes of a generator, but ignores functions defined inside the generator.
    If the generator code includes a return statement then sets includes_return_statement to True.
    """
    def __init__(self):
        self.read_func_header = False
        self.includes_return_statement = False

    def visit_FunctionDef(self, node):
        if not self.read_func_header:
            # Visit the first FunctionDef and ignore all nested definitions
            # because they are not part of the examined generator
            self.read_func_header = True
            ast.NodeVisitor.generic_visit(self, node)

    def visit_Return(self, node):
        if not returns_none(node):
            self.includes_return_statement = True


def current(tree):
    for node in ast.walk(tree):
        if isinstance(node, ast.Return) and not returns_none(node):
            return True


def new_walk_callable(tree):
    for node in walk_callable(tree):
        if isinstance(node, ast.Return) and not returns_none(node):
            return True


def new_visitor(tree):
    ast_visitor = GeneratorNodeVisitor()
    ast_visitor.visit(tree)
    return ast_visitor.includes_return_statement


if __name__ == "__main__":
    reps = 1000
    number = 200

    print("function")
    tree = ast.parse(nested)
    current_results = timeit.Timer(lambda: current(tree)).repeat(reps, number=number)
    new_walk_callable_results = timeit.Timer(lambda: new_walk_callable(tree)).repeat(reps, number=number)
    new_visitor_results = timeit.Timer(lambda: new_visitor(tree)).repeat(reps, number=number)
    print(pandas.DataFrame({
        'current': current_results,
        'new: walk_callable': new_walk_callable_results,
        'new: visitor': new_visitor_results,
    }).describe())

    print()
    print("generator")
    tree = ast.parse(gen_str)
    current_results = timeit.Timer(lambda: current(tree)).repeat(reps, number=number)
    new_walk_callable_results = timeit.Timer(lambda: new_walk_callable(tree)).repeat(reps, number=number)
    new_visitor_results = timeit.Timer(lambda: new_visitor(tree)).repeat(reps, number=number)
    print(pandas.DataFrame({
        'current': current_results,
        'new: walk_callable': new_walk_callable_results,
        'new: visitor': new_visitor_results,
    }).describe())

scrapy/utils/misc.py Outdated Show resolved Hide resolved
scrapy/utils/misc.py Outdated Show resolved Hide resolved
soid and others added 2 commits Aug 20, 2020
Co-authored-by: Eugenio Lacuesta <1731933+elacuesta@users.noreply.github.com>
Copy link
Member

@elacuesta elacuesta left a comment

Many thanks!

@elacuesta elacuesta merged commit d68aab9 into scrapy:master Aug 20, 2020
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Smarter generator check for combined yield/return statements: ignore nested functions
4 participants