Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make "xpath" and "css" Selector methods available on response #554

Closed
kmike opened this issue Jan 22, 2014 · 10 comments
Closed

Make "xpath" and "css" Selector methods available on response #554

kmike opened this issue Jan 22, 2014 · 10 comments

Comments

@kmike
Copy link
Member

@kmike kmike commented Jan 22, 2014

I understand that this is a compromise: HTTP response shouldn't be tied to lxml. But combined with #548 and #494, it allows to reduce the ceremony even further. Instead of this:

import urlparse
from scrapy.spider impoty Spider
from scrapy.selector import Selector
from scrapy.http import Request

class MySpider(Spider):
    # ...
    def parse(self, response):
        sel = Selector(response)
        for href in sel.xpath('//a/@href').extract():
            url = urlparse.urljoin(response.url, href)
            yield Request(url)

we'll be able to write this:

import scrapy

class MySpider(scrapy.Spider):
    # ...
    def parse(self, response):
        for href in response.xpath('//a/@href').extract():
            yield scrapy.Request(url)

The latter variant is 2x shorter, and readability is still fine (I'd say it is improved).
I think that increased internal complexity is justified by better API here.

In browser js document knows about css and xpath selectors, and they are doing just fine. The distinction between "response" and "document" is not very clear - document is the response js code works on. It doesn't provide raw http data though. But our TextResponse also doesn't work only on raw http data - it checks some headers an provides the decoded body.

Also check ItemLoader class - it can be initialized with either response or selector; this tells us Selector and Response already can act similar.

Implementation may involve renaming existing Response to something like RawResponse or HttpResponse, or adding xpath and css methods only for TextResponse. Selector could be created only on demand internally. We may also ditch LxmlDocument cache and store parsed tree as a response attribute.

@dangra
Copy link
Member

@dangra dangra commented Jan 22, 2014

Hmm, let's push further. If we remove the requirement to inherit from scrapy.spider.Spider and use the response as a request factory we ends up with:

class MySpider(object):
    # ...
    def parse(self, response):
        for href in response.xpath('//a/@href').extract():
            yield response.newrequest(href)

No need to import scrapy at all

I'm scared about this path but worth looking into it as the api clearly low the barrier.

@dangra
Copy link
Member

@dangra dangra commented Jan 22, 2014

quick note: Internal selector should be public, it will be used to access the complete (current and future) Selector API like .re() and namespaces.

@redapple
Copy link
Contributor

@redapple redapple commented Jan 27, 2014

could it simply be a selector attribute for Response objects that would instantiate the Selector on first access? (like for other @property in scrapy)

@dangra
Copy link
Member

@dangra dangra commented Jan 27, 2014

@redapple: I thought that was the implicit idea, and the ticket was more like adding the shortcuts attached to the response.

@redapple
Copy link
Contributor

@redapple redapple commented Jan 27, 2014

Ah ok. Well then I don't feel a strong need for the .xpath() and .css() shortcuts on response if one already has response.sel(ector).xpath()/.css()

@dangra
Copy link
Member

@dangra dangra commented Feb 3, 2014

Ah ok. Well then I don't feel a strong need for the .xpath() and .css() shortcuts on response if one already has response.sel(ector).xpath()/.css()

indeed.

@nramirezuy
Copy link
Contributor

@nramirezuy nramirezuy commented Feb 3, 2014

What if I don't want to parse the responses because I'm just doing raw storing ?

@redapple
Copy link
Contributor

@redapple redapple commented Feb 3, 2014

@nramirezuy , would that be handled by a @property decorator on selector attribute? i.e. parsing only if response.selector is accessed

@dangra
Copy link
Member

@dangra dangra commented Feb 3, 2014

@nramirezuy , would that be handled by a @Property decorator on selector attribute? i.e. parsing only if response.selector is accessed

We definitively wants it to be lazily evaluated.

@kmike
Copy link
Member Author

@kmike kmike commented Apr 24, 2014

Fixed by #690.

@kmike kmike closed this Apr 24, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
4 participants
You can’t perform that action at this time.