Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add "unified" SelectorItemLoader (supports .add_css() and .add_xpath()) #461

Merged
merged 1 commit into from Nov 22, 2013

Conversation

@redapple
Copy link
Contributor

@redapple redapple commented Nov 16, 2013

With the new unified Selector API, supporting CSS selectors at XPathItemLoader level is handy.

Feel free to suggest another name for the ItemLoader subclass.

Needs tests to check if SelectorItemLoader supports .add_xpath() and the other XPath methods from XPathItemLoader

@kmike
Copy link
Member

@kmike kmike commented Nov 16, 2013

An idea: bake all the things (add_xpath, add_css, etc) directly into ItemLoader. Raise an exception if these methods are called without response or selector provided in constructor. The goal: one less thing to care about for users, API-wise. What do you think? If not having add_css and alike methods is important, we could rename existing ItemLoader to something like BaseItemLoader.

It seems that there is no reason for SelectorItemLoader and XPathItemLoader to coexist if SelectorItemLoader supports .add_xpath(). So a more conservative idea is to merge these classes and make XPathItemLoader deprecated alias of SelectorItemLoader.

@redapple
Copy link
Contributor Author

@redapple redapple commented Nov 16, 2013

Thanks @kmike , I was thinking the same actually.
What do others think?

@pablo, @dangra any thoughts?

@dangra
Copy link
Member

@dangra dangra commented Nov 19, 2013

nice! I +1 to single ItemLoader class

@redapple
Copy link
Contributor Author

@redapple redapple commented Nov 20, 2013

I updated the loader doc page but it may need some rewording.

@dangra
dangra reviewed Nov 20, 2013
View changes
docs/topics/loaders.rst Outdated
l = XPathItemLoader(Product(), some_xpath_selector)
l.add_xpath('name', xpath1) # (1)
l.add_xpath('name', xpath2) # (2)
l = ItemLoader(Product(), some_xpath_selector)

This comment has been minimized.

@dangra

dangra Nov 20, 2013
Member

s/some_xpath_selector/some_selector/

This comment has been minimized.

@redapple

redapple Nov 20, 2013
Author Contributor

well spotted. corrected.

@dangra
Copy link
Member

@dangra dangra commented Nov 20, 2013

I'm not sure about keeping 'BaseItemLoader' class around, at least do not use it in documentation examples.
Let's just focus everything around ItemLoader.

@redapple
Copy link
Contributor Author

@redapple redapple commented Nov 20, 2013

Sure. I also forgot to do as @kmike said though, raising exception only when .add_css() or .add_xpath() are called, not at instantiation (backward compatibility)

@kmike
Copy link
Member

@kmike kmike commented Nov 20, 2013

+1 for having a single ItemLoader class (without BaseItemLoader).

@redapple
redapple reviewed Nov 20, 2013
View changes
docs/topics/loaders.rst Outdated

Return a new Item Loader for populating the given Item. If no item is
given, one is instantiated automatically using the class in
:attr:`default_item_class`.

When instantiated with a `selector` or a `response` parameters

This comment has been minimized.

@redapple

redapple Nov 20, 2013
Author Contributor

ah man... typo error, should be "parameter"...

@kmike
kmike reviewed Nov 20, 2013
View changes
docs/topics/loaders.rst Outdated
the :class:`ItemLoader` class provides convenient mechanisms for extracting
data from web pages using :ref:`selectors <topics-selectors>`.

:param selector: The selector to extract data from, when using the

This comment has been minimized.

@kmike

kmike Nov 20, 2013
Member

I think it is better to also document item parameter here

@kmike
kmike reviewed Nov 20, 2013
View changes
docs/topics/loaders.rst Outdated
l.add_xpath('name', '//div[@class="product_name"]')
l.add_xpath('name', '//div[@class="product_title"]')
l.add_css('name', 'div.product_title')

This comment has been minimized.

@kmike

kmike Nov 20, 2013
Member

This makes an example a bit less clear: now it is not obvious if there could be 2 add_xpath or add_css calls with the same name or not.

@kmike
kmike reviewed Nov 20, 2013
View changes
docs/topics/loaders.rst Outdated
l.add_xpath('name', xpath2) # (2)
l = ItemLoader(Product(), some_selector)
l.add_xpath('name', xpath) # (1)
l.add_css('name', css) # (2)

This comment has been minimized.

@kmike

kmike Nov 20, 2013
Member

I think it is better to illustrate that with two add_css or two add_xpath calls, or maybe 3 calls (e.g. 2xCSS + 1xXPath).

@redapple
Copy link
Contributor Author

@redapple redapple commented Nov 21, 2013

Moved things around a bit in the loaders doc


:param response: The response used to construct the selector using the
:attr:`default_selector_class`, unless the selector argument is given,
in which case this argument is ignored.

This comment has been minimized.

@kmike

kmike Nov 21, 2013
Member

By the way, what's the reason exception is not raised in this case?

This comment has been minimized.

@redapple

redapple Nov 21, 2013
Author Contributor

Why ignore response if selector is given and not raise an exception? I don't really know.

But as it's legacy behavior, it's probably safer to keep it that way, although I doubt users pass both selector and response, but who knows...

This comment has been minimized.

@dangra

dangra Nov 21, 2013
Member

I think it is possible that someone is using it because even in this case response becomes part of the context

This comment has been minimized.

@kmike

kmike Nov 21, 2013
Member

this makes sense

@kmike
Copy link
Member

@kmike kmike commented Nov 21, 2013

The PR looks good to me. A nice addition to scrapy.

@dangra
dangra reviewed Nov 21, 2013
View changes
scrapy/contrib/loader/__init__.py Outdated
values = self._get_values(xpath, **kw)
self.replace_value(field_name, values, *processors, **kw)

def get_xpath(self, xpath, *processors, **kw):
self._check_selector_method()
values = self._get_values(xpath, **kw)
return self.get_value(values, *processors, **kw)

def _get_values(self, xpaths, **kw):

This comment has been minimized.

@dangra

dangra Nov 21, 2013
Member

I am tempted to suggest moving self._check_selector_method() check here. :)

This comment has been minimized.

@dangra

dangra Nov 21, 2013
Member

_get_values is asking for a renaming to _get_xpathvalues to be consistent with _get_cssvalues. Although, I am tempted to deprecate it as people extending XpathItemLoader may be using it.

@dangra
dangra reviewed Nov 21, 2013
View changes
scrapy/contrib/loader/__init__.py Outdated
values = self._get_cssvalues(css_selector, **kw)
return self.get_value(values, *processors, **kw)

def _get_cssvalues(self, css_selectors, **kw):

This comment has been minimized.

@dangra

dangra Nov 21, 2013
Member

and check for selector instance also here of course.

@dangra
dangra reviewed Nov 21, 2013
View changes
scrapy/tests/test_contrib_loader.py Outdated
self.assertRaises(RuntimeError, XPathItemLoader)
l = TestItemLoader()
self.assertRaises(RuntimeError, l.add_xpath, 'url', '//a/@href')
self.assertRaises(RuntimeError, l.add_css, 'name', '#name::text')

This comment has been minimized.

@dangra

dangra Nov 21, 2013
Member

RuntimeError coverage needs to be improved for all the methods that may raise it

@dangra
Copy link
Member

@dangra dangra commented Nov 21, 2013

The PR looks good to me. A nice addition to scrapy.

Indeed.

Before merging I'd like to see complete RuntimeError exception coverage.

The other suggested change to _check_selector_method() is a implementation detail I can live with.

@redapple
Copy link
Contributor Author

@redapple redapple commented Nov 21, 2013

if you're ok with the changes, I'll rebase and update the PR

@dangra
Copy link
Member

@dangra dangra commented Nov 21, 2013

Looks perfect to me.

Deprecate XPathItemLoader (now an alias to the new ItemLoader)
dangra added a commit that referenced this pull request Nov 22, 2013
Add "unified" SelectorItemLoader (supports .add_css() and .add_xpath())
@dangra dangra merged commit 36c8da2 into scrapy:master Nov 22, 2013
1 check passed
1 check passed
default The Travis CI build passed
Details
@redapple redapple deleted the redapple:selectorloader branch Nov 22, 2013
@pablohoffman
Copy link
Member

@pablohoffman pablohoffman commented Nov 25, 2013

Nice to see this one merged!

@redapple next time ping me as @pablohoffman ;)

@redapple
Copy link
Contributor Author

@redapple redapple commented Nov 25, 2013

oh yeah! sh... :) Sorry for that (ah... autocompletion... some even built services around smartphone autocompletion mishaps ;)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

4 participants
You can’t perform that action at this time.