
Requests scheduled when idle never go through the spider middlewares #542

Closed
nside opened this issue Jan 17, 2014 · 7 comments

Comments


nside commented Jan 17, 2014

When my spider enters the idle state I schedule a request through the engine like this:

self.crawler.engine.schedule(req, self)

and I expect that request to go through the spider middlewares. But after some investigation it looks like only requests returned by Spider.start_requests and those returned from callbacks are processed by these middlewares.

Maybe I'm scheduling the request the wrong way, but if so, the method shouldn't be public (i.e. I'd prefix schedule with _).

rmax (Contributor) commented Jan 17, 2014

The spider middlewares are meant to process the spider output/input. By using the engine directly you are returning nothing from the spider and thus bypassing the spider middlewares.

I suppose you have a spider middleware; could you make it an extension or a downloader middleware instead?

dangra (Member) commented Jan 17, 2014

@darkrho I disagree: requests scheduled with engine.schedule() will be downloaded, and their responses should be processed by the process_spider_input middleware hook. That is how the engine handles start requests and redirections.

@nside: it's more convenient to call engine.crawl() instead; it triggers the download immediately after scheduling the request.

nside (Author) commented Jan 17, 2014

@dangra crawl() fixed it for me. Still, I'd expect every scheduled request to get the same treatment, whatever its "entry point" into the pipeline. Feel free to close if you disagree.

dangra (Member) commented Jan 17, 2014

@nside: what Scrapy version are you using?

Historically, there were three engine entrypoints for requests: .download(), .schedule() and .crawl().

There used to be a big difference between .schedule() and .crawl(), but now (>0.20) the only difference is that .crawl() triggers the loop that fetches requests from the scheduler immediately after scheduling the request.

.schedule() used to be closer to .download(), skipping the spider middlewares and the request's callback, but now it is very similar to .crawl() and is kept only for backward compatibility.

That said, the engine API is not documented and can't be considered stable.
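The distinction described above can be sketched with a toy model (plain Python, not Scrapy internals): .schedule() only enqueues a request, .crawl() enqueues it and immediately triggers the loop that drains the queue through the spider side, and .download() bypasses the spider side entirely and hands the "response" straight back. All names here are illustrative.

```python
from collections import deque


class ToyEngine:
    """Toy model of the three engine entrypoints (not Scrapy code)."""

    def __init__(self, callback):
        self.queue = deque()      # stands in for the scheduler
        self.callback = callback  # stands in for the spider/middleware side
        self.processed = []

    def _fetch_loop(self):
        # Drain the scheduler: "download" each request and pass the
        # response through the spider side via its callback.
        while self.queue:
            req = self.queue.popleft()
            self.processed.append(self.callback(req))

    def schedule(self, req):
        # Legacy: only enqueues; nothing happens until the loop next runs.
        self.queue.append(req)

    def crawl(self, req):
        # schedule() plus an immediate pass of the fetch loop.
        self.schedule(req)
        self._fetch_loop()

    def download(self, req):
        # Skip the spider side and hand the "response" back directly.
        return f"response:{req}"


engine = ToyEngine(callback=lambda req: f"parsed:{req}")
engine.schedule("a")                 # queued, but not processed yet
assert engine.processed == []
engine.crawl("b")                    # triggers the loop, draining "a" too
assert engine.processed == ["parsed:a", "parsed:b"]
assert engine.download("c") == "response:c"
```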

nside (Author) commented Jan 17, 2014

I'm on 0.21 (dev). These are good subtleties to know!

nside closed this as completed Jan 17, 2014

dangra (Member) commented Jan 17, 2014

TL;DR:

  • Use engine.crawl() if you expect your requests to be injected into the spider<->downloader pipeline.
  • Use engine.download() if you want a response back; only downloader middlewares are applied.
  • Don't use .schedule(); it's legacy and should emit a warning.

rmax (Contributor) commented Jan 17, 2014

Nice TL;DR. I think it should be somewhere in the docs, because there are a few examples in the wild using .schedule:
https://www.google.com/search?q=%22crawler.engine.schedule%22
