Fixes issue with calling fetch in scrapy shell. #5748


Closed
wants to merge 5 commits

Conversation

alexpdev
Contributor

@alexpdev alexpdev commented Dec 8, 2022

Reference Issue: Fixes #5740, #5742

You can recreate the issue with the following script:

```python
import asyncio
import threading
from twisted.internet import asyncioreactor
from scrapy.utils.defer import deferred_from_coro
from scrapy.utils.reactor import get_asyncio_event_loop_policy

async def test_coro():
    pass

def test_deferred_from_coro():
    return deferred_from_coro(test_coro())

def trigger_warning_message():
    event_loop = get_asyncio_event_loop_policy().new_event_loop()
    asyncio.set_event_loop(event_loop)
    asyncioreactor.install()
    thread = threading.Thread(target=test_deferred_from_coro)
    thread.start()
    thread.join()

trigger_warning_message()
```

I was able to recreate this issue using scrapy shell and fetch on both Windows and Linux. However, it only occurs inside a project with the TWISTED_REACTOR setting set to AsyncioSelectorReactor.

The issue is caused when get_asyncio_event_loop_policy().get_event_loop() is called from a thread that has no event loop, which raises an exception.
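The fix guards against that case by creating and setting a fresh event loop when the calling thread has none. A minimal standalone sketch of the pattern (the helper name `get_or_create_event_loop` is illustrative, not the PR's actual code, which lives in `scrapy/utils/defer.py` and goes through Scrapy's event loop policy):

```python
import asyncio
import threading

def get_or_create_event_loop():
    """Return the current thread's event loop, creating and
    registering a new one if the thread has none set."""
    try:
        # In a thread with no event loop set, this raises RuntimeError.
        return asyncio.get_event_loop()
    except RuntimeError:
        loop = asyncio.new_event_loop()
        asyncio.set_event_loop(loop)
        return loop

# Without the guard, asyncio.get_event_loop() in this worker thread
# would raise, which is exactly what happens in the fetch thread.
result = {}
worker = threading.Thread(
    target=lambda: result.update(loop=get_or_create_event_loop())
)
worker.start()
worker.join()
```

After the call, `result["loop"]` holds a usable event loop for that thread.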

@codecov

codecov bot commented Dec 8, 2022

Codecov Report

Merging #5748 (b67efd2) into master (fe60c12) will decrease coverage by 4.13%.
The diff coverage is 100.00%.

```diff
@@            Coverage Diff             @@
##           master    #5748      +/-   ##
==========================================
- Coverage   88.91%   84.78%   -4.14%
==========================================
  Files         162      162
  Lines       10963    10968       +5
  Branches     1794     1794
==========================================
- Hits         9748     9299     -449
- Misses        932     1398     +466
+ Partials      283      271      -12
```

| Impacted Files | Coverage Δ |
| --- | --- |
| scrapy/utils/defer.py | 97.41% <100.00%> (+0.08%) ⬆️ |
| scrapy/pipelines/images.py | 26.47% <0.00%> (-70.59%) ⬇️ |
| scrapy/core/http2/stream.py | 26.58% <0.00%> (-64.74%) ⬇️ |
| scrapy/core/http2/agent.py | 36.14% <0.00%> (-60.25%) ⬇️ |
| scrapy/core/downloader/handlers/http2.py | 45.07% <0.00%> (-54.93%) ⬇️ |
| scrapy/core/http2/protocol.py | 34.17% <0.00%> (-49.25%) ⬇️ |
| scrapy/robotstxt.py | 75.30% <0.00%> (-22.23%) ⬇️ |
| scrapy/utils/test.py | 51.35% <0.00%> (-14.87%) ⬇️ |
| scrapy/core/downloader/contextfactory.py | 75.92% <0.00%> (-11.12%) ⬇️ |
| scrapy/pipelines/media.py | 92.90% <0.00%> (-5.68%) ⬇️ |

... and 3 more

@alexpdev alexpdev closed this Dec 8, 2022
@alexpdev alexpdev reopened this Dec 9, 2022
@alexpdev
Contributor Author

It passes all tests locally on both Linux and Windows, so I am not certain why it's failing here.

@wRAR
Member

wRAR commented Dec 11, 2022

Can you please also add a test to test_command_shell.py that checks that there are no exceptions printed? I assume it's possible to write one that fails with the current Scrapy on Windows.

@alexpdev
Contributor Author

> Can you please also add a test to test_command_shell.py that checks that there are no exceptions printed? I assume it's possible to write one that fails with the current Scrapy on Windows.

Sure thing.

Member

@wRAR wRAR left a comment

Thanks!

Comment on lines +275 to +277

```python
# create and set new event loop for this thread
event_loop = policy.new_event_loop()
asyncio.set_event_loop(event_loop)
```

Member

@Gallaecio Gallaecio Dec 12, 2022

I don’t think this accounts for the ASYNCIO_EVENT_LOOP setting.

See the code where we set the event loop initially:

```python
def install_reactor(reactor_path, event_loop_path=None):
    """Installs the :mod:`~twisted.internet.reactor` with the specified
    import path. Also installs the asyncio event loop with the specified import
    path if the asyncio reactor is enabled"""
    reactor_class = load_object(reactor_path)
    if reactor_class is asyncioreactor.AsyncioSelectorReactor:
        with suppress(error.ReactorAlreadyInstalledError):
            policy = get_asyncio_event_loop_policy()
            if event_loop_path is not None:
                event_loop_class = load_object(event_loop_path)
                event_loop = event_loop_class()
                asyncio.set_event_loop(event_loop)
            else:
                event_loop = policy.get_event_loop()
```

I don’t see a clear way to address this, so I suggest:

  • We document, in both the documentation of the setting (ASYNCIO_EVENT_LOOP) and that of the command (scrapy shell), that they are not compatible with each other.
  • We create a task to make them compatible at some point in the future.
  • We include a test here that demonstrates how ASYNCIO_EVENT_LOOP is not respected by scrapy shell. Hopefully the existing test for ASYNCIO_EVENT_LOOP helps with that. Funnily enough, that test cannot run on Windows, as we use a Linux-specific alternative loop in those tests.

Contributor Author

@alexpdev alexpdev Dec 12, 2022

You are correct, I didn't even think about that. I made a note in the settings docs for ASYNCIO_EVENT_LOOP.

Where should the test go for showing that the setting isn't respected?

Member

> Where should the test go for showing that the setting isn't respected?

I was hoping there would be a way to test this, but after looking into it, I am not sure how to do it myself 😓. Our ASYNCIO_EVENT_LOOP tests are based on logs, but for the same reason it is not easy to use ASYNCIO_EVENT_LOOP in the first place, we do not log a message about whether or not it is being used.

I think we can skip the testing part.

@Gallaecio
Member

Gallaecio commented Dec 13, 2022

I wonder if we could go a different way here: rather than create the loop if missing when needed, create it as we create the thread that causes the issue in the first place.

```python
self._start_crawler_thread()
shell = Shell(crawler, update_vars=self.update_vars, code=opts.code)
shell.start(url=url, redirect=not opts.no_redirect)
```

```python
def _start_crawler_thread(self):
    t = Thread(target=self.crawler_process.start,
               kwargs={'stop_after_crawl': False, 'install_signal_handlers': False})
    t.daemon = True
    t.start()
```

For example, what if we modify CrawlerProcess.start, so that it takes care of initializing the loop if needed? From there we should have access to settings, i.e. be able to respect ASYNCIO_EVENT_LOOP.
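A rough sketch of that idea: install the event loop inside the new thread before handing control to the crawler. The helper name `run_in_thread_with_loop` and the `event_loop_class` parameter are hypothetical; in real Scrapy the class would come from the ASYNCIO_EVENT_LOOP setting via `load_object`, and the logic would live in or around `CrawlerProcess.start`:

```python
import asyncio
import threading

def run_in_thread_with_loop(target, event_loop_class=None):
    """Start `target` in a daemon thread, first installing an asyncio
    event loop in that thread so later get_event_loop() calls succeed.
    `event_loop_class` stands in for what the ASYNCIO_EVENT_LOOP
    setting would provide."""
    def runner():
        # Honor a custom loop class if one was configured,
        # otherwise fall back to the platform default.
        loop = event_loop_class() if event_loop_class else asyncio.new_event_loop()
        asyncio.set_event_loop(loop)
        target()
    t = threading.Thread(target=runner, daemon=True)
    t.start()
    return t

# The thread body can now rely on an event loop being present.
seen = {}
t = run_in_thread_with_loop(lambda: seen.update(loop=asyncio.get_event_loop()))
t.join()
```

Because the loop is set up where the thread is created, the code has access to settings at that point, which is what makes respecting ASYNCIO_EVENT_LOOP feasible in this approach.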

@alexpdev
Contributor Author

> I wonder if we could go a different way here: rather than create the loop if missing when needed, create it as we create the thread that causes the issue in the first place.
>
> ```python
> self._start_crawler_thread()
> shell = Shell(crawler, update_vars=self.update_vars, code=opts.code)
> shell.start(url=url, redirect=not opts.no_redirect)
>
> def _start_crawler_thread(self):
>     t = Thread(target=self.crawler_process.start,
>                kwargs={'stop_after_crawl': False, 'install_signal_handlers': False})
>     t.daemon = True
>     t.start()
> ```
>
> For example, what if we modify CrawlerProcess.start, so that it takes care of initializing the loop if needed? From there we should have access to settings, i.e. be able to respect ASYNCIO_EVENT_LOOP.

I had the same thought when you first made the observation, but I thought the thread call that causes the issue is the one started by the fetch command, in the shell module:

response, spider = threads.blockingCallFromThread(

@alexpdev
Contributor Author

alexpdev commented Dec 14, 2022

I opened #5760, which implements the alternative approach of explicitly setting the event loop when starting the new thread created by the scrapy shell fetch command. That PR factors in the ASYNCIO_EVENT_LOOP setting, so I believe it may be the better approach. Please ignore #5759; I already closed it.

@alexpdev
Contributor Author

I'm going to close this PR. Thanks!

@alexpdev alexpdev closed this Dec 14, 2022
@alexpdev alexpdev deleted the fix_scrapy_shell_fetch branch January 18, 2023 05:59
Successfully merging this pull request may close these issues.

asyncio exception in scrapy shell
3 participants