CrawlSpider improvements #781

Closed
dangra opened this issue Jul 2, 2014 · 6 comments

Comments

@dangra
Member

dangra commented Jul 2, 2014

A ticket to kick off the discussion on CrawlSpider enhancements (if any).

ideas:

  • Simplify rule definitions by using an implicit link extractor instantiated from LxmlLinkExtractor.
  • ... what else?
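
A minimal sketch of what the first idea could look like. This does not use Scrapy's classes: `SimpleLinkExtractor` and this `Rule` are hypothetical stand-ins, and the `allow` keyword on the rule is an assumption made for illustration only. The point is that the rule builds its own extractor when none is passed.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SimpleLinkExtractor:
    """Stand-in extractor: keeps href values that match `allow`."""
    allow: str = r".*"

    def extract_links(self, html: str) -> List[str]:
        hrefs = re.findall(r'href="([^"]+)"', html)
        return [href for href in hrefs if re.search(self.allow, href)]


@dataclass
class Rule:
    """Hypothetical rule whose link extractor is optional."""
    link_extractor: Optional[SimpleLinkExtractor] = None
    callback: Optional[str] = None
    allow: str = r".*"

    def __post_init__(self):
        # No extractor given: build one implicitly from the rule's own arguments.
        if self.link_extractor is None:
            self.link_extractor = SimpleLinkExtractor(allow=self.allow)
```

With this shape, `Rule(allow=r"/item/", callback="parse_item")` and `Rule(SimpleLinkExtractor(allow=r"/item/"), callback="parse_item")` would behave the same way.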
@pablohoffman
Member

@nyov
Contributor

nyov commented Aug 27, 2014

Regarding "overriding parse", my proposal is in #732 (though it is a bit more generic than CrawlSpider).
In essence, I believe in decoupling the Scraper's entry point from the user-facing parse.

+------------------+-----------------+---------------------+
| Scraper (caller) | Spider (callee) | UserSpider(Spider)  |
+------------------+-----------------+---------------------+
| call_spider() ---> init()/_parse() |                     |
|                  | |`-> important()|                     |
|                  | `--> parse()  <-- parse() w/o super() |
+------------------+-----------------+---------------------+
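
The diagram can be sketched in plain Python roughly as follows. The names `call_spider`, `_parse`, `important` and `parse` come from the diagram; the bodies are assumptions made only to show the control flow, not Scrapy's implementation.

```python
class Spider:
    def _parse(self, response):
        """Framework-facing entry point: the only method the caller uses."""
        self.important(response)
        return list(self.parse(response))

    def important(self, response):
        # Bookkeeping that has to run for every response.
        self.seen = getattr(self, "seen", 0) + 1

    def parse(self, response):
        """User-facing hook; safe to override without calling super()."""
        return []


class UserSpider(Spider):
    def parse(self, response):
        # No super() call here, yet important() still runs via _parse().
        yield {"length": len(response)}


def call_spider(spider, response):
    """Plays the role of the Scraper (the caller)."""
    return spider._parse(response)
```

Because the caller never touches `parse` directly, a user override cannot accidentally skip the base-class logic.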

Another thing I've been thinking about is designing spiders around mixin classes instead. Maybe that doesn't belong here, but the outcome would be that Spider "ideas" are self-contained and combinable, something like MySpider(Spider, Crawl, Init, Csv).

With InitSpider this is probably already possible to some extent, but currently it's not guaranteed that the classes would work together.
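
One way the mixin idea could be made to compose is cooperative `super()` calls. The sketch below is an illustration only: `Init`, `Crawl` and `Csv` are hypothetical mixins, not existing Scrapy classes, and `setup` is an invented hook name. Note that in Python the base class usually goes last in the bases list so that it terminates the `super()` chain, hence `MySpider(Crawl, Init, Csv, Spider)` here.

```python
import csv
import io
import re


class Spider:
    """Base class: ends the cooperative chain for each hook."""

    def setup(self):
        return []

    def parse(self, response):
        return []


class Init:
    """Contributes a setup step."""

    def setup(self):
        return ["init"] + super().setup()


class Crawl:
    """Contributes follow-up URLs found in the response text."""

    def parse(self, response):
        urls = re.findall(r"https?://[^\s,]+", response)
        return [{"follow": url} for url in urls] + super().parse(response)


class Csv:
    """Contributes one item per CSV row."""

    def parse(self, response):
        rows = csv.reader(io.StringIO(response))
        return [{"row": row} for row in rows] + super().parse(response)


class MySpider(Crawl, Init, Csv, Spider):
    pass
```

Each mixin only adds its own part and passes the call along, so any subset of them can be combined as long as `Spider` stays last.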

@redapple
Contributor

I would also add something related to #929: passing the response to process_request in _requests_to_follow, so that one can tweak the generated requests with some context. (If there is currently another way to do this than overriding _requests_to_follow, I'd be happy for it to be documented :)

Oh, and adding an errback to rules could also be handy at times: see http://stackoverflow.com/a/35870000/2572383
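
To make both requests concrete, here is a rough sketch in plain Python. It deliberately avoids Scrapy's own classes: `Request`, `Rule` and `requests_to_follow` are simplified stand-ins, the response is just a dict, and `add_referer` is a made-up example hook. It only shows the shape of the idea, where the hook receives the originating response and the rule carries an errback.

```python
class Request:
    """Stand-in request object (not scrapy.Request)."""

    def __init__(self, url, callback=None, errback=None):
        self.url = url
        self.callback = callback
        self.errback = errback
        self.meta = {}


class Rule:
    """Stand-in rule carrying an errback and a process_request hook."""

    def __init__(self, extract, callback=None, errback=None, process_request=None):
        self.extract = extract
        self.callback = callback
        self.errback = errback
        self.process_request = process_request


def requests_to_follow(rules, response):
    """Build requests for each rule, handing the response to the hook."""
    for rule in rules:
        for url in rule.extract(response):
            request = Request(url, callback=rule.callback, errback=rule.errback)
            if rule.process_request is not None:
                # The hook sees the originating response and may drop the request.
                request = rule.process_request(request, response)
            if request is not None:
                yield request


def add_referer(request, response):
    """Example hook: skip PDFs, tag everything else with the source URL."""
    if request.url.endswith(".pdf"):
        return None
    request.meta["referer"] = response["url"]
    return request
```

A rule would then be declared as something like `Rule(extract, callback="parse_item", errback=on_error, process_request=add_referer)`, where `extract` and `on_error` are user-supplied callables.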

@guillaumedsde

I would also add something related to #929: passing the response to process_request in _requests_to_follow, so that one can tweak the generated requests with some context. (If there is currently another way to do this than overriding _requests_to_follow, I'd be happy for it to be documented :)

Oh, and adding an errback to rules could also be handy at times: see http://stackoverflow.com/a/35870000/2572383

is this a planned feature?

@elacuesta
Member

is this a planned feature?

@guillaumedsde Please see #3682 (included in version 1.7.1) and #4000 (in progress).

@elacuesta
Member

Seems to me like everything has been addressed 👌
