Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

is_generator_with_return_value raises IndentationError with a flush left doc string #4477

Closed
mdaniel opened this issue Apr 8, 2020 · 4 comments · Fixed by #4935
Closed
Labels

Comments

@mdaniel
Copy link

mdaniel commented Apr 8, 2020

Description

Code that is accepted by the python interpreter raises when fed through textwrap.dedent

Steps to Reproduce

  1. Create is_generator_bug.py with the content below (which I simplified from the is_generator_with_return_value method body
  2. Run python is_generator_bug.py
  3. Observe the kaboom
import ast
import inspect
from textwrap import dedent
class Bob:
    def doit(self):
        """
this line is flush left
        """
        if True:
            yield 1234

if __name__ == '__main__':
    b = Bob()
    c = b.doit
    if inspect.isgeneratorfunction(c):
        tree = ast.parse(dedent(inspect.getsource(c)))

Expected behavior: [What you expect to happen]

No Error

Actual behavior: [What actually happens]

$ python3.7 is_generator_bug.py
Traceback (most recent call last):
  File "is_generator_bug.py", line 16, in <module>
    tree = ast.parse(dedent(inspect.getsource(c)))
  File "/usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/ast.py", line 35, in parse
    return compile(source, filename, mode, PyCF_ONLY_AST)
  File "<unknown>", line 1
    def doit(self):
    ^
IndentationError: unexpected indent

Reproduces how often: [What percentage of the time does it reproduce?]

100%

Versions

Scrapy       : 2.0.1
lxml         : 4.5.0.0
libxml2      : 2.9.10
cssselect    : 1.1.0
parsel       : 1.5.2
w3lib        : 1.21.0
Twisted      : 20.3.0
Python       : 3.7.7 (default, Mar 11 2020, 23:30:22) - [Clang 10.0.0 (clang-1000.11.45.5)]
pyOpenSSL    : 19.1.0 (OpenSSL 1.1.1d  10 Sep 2019)
cryptography : 2.8
Platform     : Darwin-17.7.0-x86_64-i386-64bit

Additional context

@elacuesta
Copy link
Member

I can reproduce:

import scrapy


class GeneratorSpider(scrapy.Spider):
    name = "generator"
    start_urls = ["https://example.org"]

    def parse(self, response):
        """
docstring
        """
        yield {"url": response.url}
2020-04-08 10:28:07 [scrapy.core.scraper] ERROR: Spider error processing <GET https://example.org> (referer: None)
Traceback (most recent call last):
  File "/.../scrapy/venv-scrapy/lib/python3.6/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
  File "/.../scrapy/scrapy/core/downloader/middleware.py", line 42, in process_request
    defer.returnValue((yield download_func(request=request, spider=spider)))
  File "/.../scrapy/venv-scrapy/lib/python3.6/site-packages/twisted/internet/defer.py", line 1362, in returnValue
    raise _DefGen_Return(val)
twisted.internet.defer._DefGen_Return: <200 https://example.org>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/.../scrapy/scrapy/utils/defer.py", line 55, in mustbe_deferred
    result = f(*args, **kw)
  File "/.../scrapy/scrapy/core/spidermw.py", line 60, in process_spider_input
    return scrape_func(response, request, spider)
  File "/.../scrapy/scrapy/core/scraper.py", line 148, in call_spider
    warn_on_generator_with_return_value(spider, callback)
  File "/.../scrapy/scrapy/utils/misc.py", line 202, in warn_on_generator_with_return_value
    if is_generator_with_return_value(callable):
  File "/.../scrapy/scrapy/utils/misc.py", line 187, in is_generator_with_return_value
    tree = ast.parse(dedent(inspect.getsource(callable)))
  File "/usr/lib/python3.6/ast.py", line 35, in parse
    return compile(source, filename, mode, PyCF_ONLY_AST)
  File "<unknown>", line 1
    def parse(self, response):
    ^
IndentationError: unexpected indent
Scrapy       : 2.0.1
lxml         : 4.5.0.0
libxml2      : 2.9.10
cssselect    : 1.1.0
parsel       : 1.5.2
w3lib        : 1.21.0
Twisted      : 19.10.0
Python       : 3.6.9 (default, Nov  7 2019, 10:44:02) - [GCC 8.3.0]
pyOpenSSL    : 19.1.0 (OpenSSL 1.1.1d  10 Sep 2019)
cryptography : 2.8
Platform     : Linux-4.15.0-91-generic-x86_64-with-LinuxMint-19.1-tessa

@elacuesta elacuesta added the bug label Apr 8, 2020
@elacuesta
Copy link
Member

Seems to me like the following should work fine for docstrings:

diff --git scrapy/utils/misc.py scrapy/utils/misc.py
index 52cfba20..fa3d31cc 100644
--- scrapy/utils/misc.py
+++ scrapy/utils/misc.py
@@ -184,7 +184,10 @@ def is_generator_with_return_value(callable):
         return value is None or isinstance(value, ast.NameConstant) and value.value is None
 
     if inspect.isgeneratorfunction(callable):
-        tree = ast.parse(dedent(inspect.getsource(callable)))
+        code = inspect.getsource(callable)
+        if callable.__doc__:
+            code = code.replace(callable.__doc__, "")
+        tree = ast.parse(dedent(code))
         for node in ast.walk(tree):
             if isinstance(node, ast.Return) and not returns_none(node):
                 _generator_callbacks_cache[callable] = True

Unfortunately, I found that multiline strings are also a problem. This snippet also raises the exception:

import scrapy


class GeneratorSpider(scrapy.Spider):
    name = "generator"
    start_urls = ["https://example.org"]

    def parse(self, response):
        url = """
docstring
        """
        yield {"url": url}

The following works, but I don't really like it:

diff --git scrapy/utils/misc.py scrapy/utils/misc.py
index 52cfba20..881e6c07 100644
--- scrapy/utils/misc.py
+++ scrapy/utils/misc.py
@@ -184,7 +184,8 @@ def is_generator_with_return_value(callable):
         return value is None or isinstance(value, ast.NameConstant) and value.value is None
 
     if inspect.isgeneratorfunction(callable):
-        tree = ast.parse(dedent(inspect.getsource(callable)))
+        code = "class _:\n" + dedent(inspect.getsource(callable))
+        tree = ast.parse(code)
         for node in ast.walk(tree):
             if isinstance(node, ast.Return) and not returns_none(node):
                 _generator_callbacks_cache[callable] = True

Thoughts?

@mdaniel
Copy link
Author

mdaniel commented Apr 8, 2020

It appears based on some lightweight testing that just removing all the leading indentation from the def cures it:

    if inspect.isgeneratorfunction(callable):
        code = inspect.getsource(callable)
        code = re.sub(r"^\s+", "", code)
        tree = ast.parse(code)

since it turns this:

    def parse(self, response):
        url = """
docstring
        """
        yield {"url": url}

into

def parse(self, response):
        url = """
docstring
        """
        yield {"url": url}

@Gallaecio
Copy link
Member

@elacuesta I think your workaround is quite smart. Although I wonder if maybe it could fail in some corner cases, such as callbacks that are not actually spider methods but external functions. Still, if it solves more problems than it causes, it may be worth it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants