Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added docs response selector shortcuts #690

Merged
merged 8 commits into from Apr 24, 2014

Conversation

@dangra
Copy link
Member

@dangra dangra commented Apr 14, 2014

implementation of #554 on top of #686

it update docs and tutorial.

@dangra
Copy link
Member Author

@dangra dangra commented Apr 14, 2014

@kmike ^^

@dangra dangra mentioned this pull request Apr 14, 2014
1 of 3 tasks complete
for sel in response.xpath('//ul/li'):
title = sel.xpath('a/text()').extract()
link = sel.xpath('a/@href').extract()
desc = sel.xpath('text()').extract()
print title, link, desc

Notice we import our Selector class from scrapy.selector and instantiate a

This comment has been minimized.

@kmike

kmike Apr 15, 2014
Member

we don't import our Selector class

[<Selector (text) xpath=//title/text()>]

As you can see, the ``.xpath()`` method returns an
Querying responses using XPath and CSS is so common that responses includes two
convenient shortcuts: `response.xpath()` and `response.css()`::

This comment has been minimized.

@kmike

kmike Apr 15, 2014
Member

I think we should use either double quotes or :meth:.

@kmike
Copy link
Member

@kmike kmike commented Apr 15, 2014

We should also mention new shortcuts here: http://doc.scrapy.org/en/latest/topics/request-response.html#textresponse-objects. Also, it will have to be coordinated with #684 because these PRs touch the same docs. Other than that the PR looks good!

I'm on fence about "undocumenting" sel in scrapy shell, but using response.selector and response.xpath/css is indeed more clear and easier to remember.

@nyov
Copy link
Contributor

@nyov nyov commented Apr 15, 2014

hmm, I'm not sure I like the undocumenting, or even removing sel from the shell (either).
sel in the shell is already a Selector(response) instance and therefor even faster to type than response.selector ...and response.xpath still (gleefully ignoring ipython tab completion here).
I'd prefer documenting both approaches, or at least a mention of the old possibility.

@kmike
Copy link
Member

@kmike kmike commented Apr 15, 2014

If we keep sel, should it be an alias to response.selector or should it be a new Selector (as it is now)?

@nyov
Copy link
Contributor

@nyov nyov commented Apr 15, 2014

argh, always with the hard questions! :)
the sysadmin in me says a new one for resiliency, but then I don't see where one might fail and the other not?

idk :P

@nramirezuy
Copy link
Contributor

@nramirezuy nramirezuy commented Apr 15, 2014

@kmike Keep sel as shortcut of response.selector, should be a bit more efficient since it's lazy.

@kmike
Copy link
Member

@kmike kmike commented Apr 15, 2014

@nramirezuy this is tricky because sel = response.selector will eagerly create the selector. I don't think efficiency is a problem because even if there are many Selectors for the same response the body should be parsed only once because of the LxmlDocument cache.

@dangra
Copy link
Member Author

@dangra dangra commented Apr 22, 2014

I'm more on the remove-sel-shortcut side even if it means more typing, an advantage is that selector is lazily evaluated and non text-responses doesn't error on lxml DOM parsing, and after this pull request it is also consistent with how selection methods are called by spider code.

@dangra
Copy link
Member Author

@dangra dangra commented Apr 22, 2014

about #684, I suppose the only solution is that first PR that get merged will impose a rebase of the other.

@dangra
Copy link
Member Author

@dangra dangra commented Apr 24, 2014

about #684, I suppose the only solution is that first PR that get merged will impose a rebase of the other.

I rebased this PR on top of #684

dangra added 2 commits Apr 15, 2014
Use response.xpath(), response.css() or response.selector instead.
@dangra
Copy link
Member Author

@dangra dangra commented Apr 24, 2014

We should also mention new shortcuts here: http://doc.scrapy.org/en/latest/topics/request-response.html#textresponse-objects.

@kmike: done by b4593c2 (review welcome)

# Using CSS query
print sel.css('p')
# Nesting queries
print sel.xpath('//div[@foo="bar"]').css('span#bold')

This comment has been minimized.

@kmike

kmike Apr 24, 2014
Member

This was the only place where it was mentioned xpath and css selectors can be chained.

This comment has been minimized.

@dangra

dangra Apr 24, 2014
Author Member

Ammended by 18412d7

@kmike
Copy link
Member

@kmike kmike commented Apr 24, 2014

I'm -0 about removing sel shortcut: not having it is cleaner and allows to avoid some of the issues, but response is longer to type (autocompletion doesn't help much because there are %reset and %reset_selective magics), and, more importantly, people are accustomed to sel shortcut. We'll be breaking people habits with this change, and it always causes displeasure. Breaking bad habits worths it, but in this case the change won't result in a more maintainable/readable user code or in other benefits for users. I think changing Scrapy Shell docs to emphasize response.xpath() and response.css() is enough.

@kmike
Copy link
Member

@kmike kmike commented Apr 24, 2014

The PR looks good. I don't have strong feelings regarding sel; maybe we should get more opinions before making a decision.

@dangra
Copy link
Member Author

@dangra dangra commented Apr 24, 2014

An intermediate option is to left sel "undocummented" but use a proxy that lazily access the selector

diff --git a/scrapy/shell.py b/scrapy/shell.py
index 03024af..18bdb58 100644
--- a/scrapy/shell.py
+++ b/scrapy/shell.py
@@ -98,6 +98,7 @@ class Shell(object):
         self.vars['spider'] = spider
         self.vars['request'] = request
         self.vars['response'] = response
+        self.vars['sel'] = _SelectorProxy(response)
         if self.inthread:
             self.vars['fetch'] = self.fetch
         self.vars['view'] = open_in_browser
@@ -155,3 +156,14 @@ def _request_deferred(request):

     request.callback, request.errback = d.callback, d.errback
     return d
+
+
+class _SelectorProxy(object):
+
+    def __init__(self, response):
+        self._proxiedresponse = response
+
+    def __getattribute__(self, name):
+        if name == '_proxiedresponse':
+            return object.__getattribute__(self, '_proxiedresponse')
+        return getattr(self._proxiedresponse.selector, name)
@nyov
Copy link
Contributor

@nyov nyov commented Apr 24, 2014

You have my vote regardless, though usually we have some deprecation warning before an api change.
I don't see the shell as a part where memory-consumption is critical though, as it's likely only used on the developer machine, so I wouldn't see this as a strike against keeping sel even if it can't be lazy-loaded or linked to response.selector.

@dangra
Copy link
Member Author

@dangra dangra commented Apr 24, 2014

the problem with sel is not the memory or cpu it takes to parse the tree, instead it is the error raised by unparseable responses that prevents the start of the shell console, preventing further debugging of the response.

@dangra
Copy link
Member Author

@dangra dangra commented Apr 24, 2014

I kind of dislike sel if we have another shortcut that is not that much extra typing and it is the same idiom we promote to use in spiders.

but it's true the removal may cause displeasure in current users, although I hope most of them will get used to the new shortcuts quickly.

I don't like the proxy, but if we are going to leave sel around, I vote to do it trough the proxy and left it undocummented (and may be include deprecation warning?)

@nyov
Copy link
Contributor

@nyov nyov commented Apr 24, 2014

okay, I didn't know about it raising an error. I only remembered the selector simply missing when I had a text response in the past, but I can see the exception now when fetching some binary content, and that's a strong point for removing sel, or at least the need for lazy-loading.

I kind of dislike sel if we have another shortcut that is not that much extra typing and it is the same idiom we promote to use in spiders.

On the flipside importing selector directly is still viable in spider code and thus having a line sel.xpath() in the code, which then can't be easily copy/pasted into the shell any longer.

Though as stated I have no opinion either way, so just writing it to help others make their mind up.

@dangra
Copy link
Member Author

@dangra dangra commented Apr 24, 2014

I added the proxy with the deprecation warning 1e7ddc8, does it fit everyone expectations? :)

@kmike
Copy link
Member

@kmike kmike commented Apr 24, 2014

It looks fine!

dangra added a commit that referenced this pull request Apr 24, 2014
Added docs response selector shortcuts
@dangra dangra merged commit 9acf2be into scrapy:master Apr 24, 2014
1 check was pending
1 check was pending
continuous-integration/travis-ci The Travis CI build is in progress
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

4 participants