Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add support for a nested loaders #1467

Merged
merged 5 commits into from Sep 17, 2015
Merged

add support for a nested loaders #1467

merged 5 commits into from Sep 17, 2015

Conversation

@dacjames
Copy link
Contributor

@dacjames dacjames commented Aug 29, 2015

Adds a nested_loader() method to ItemLoader. This is useful when you're selecting several related fields from a sub section of the document. Because the nested loader shares the item and values with it's parent, everything works as expected.

For example, here is a snippet of scraping code using the new functionality:

loader = MyLoader(item=MyItem(), response=response)
loader.add_xpath('images', "//div[@class = 'event_title_image']/img/@src")
header_loader = loader.nested_loader(xpath="//div[@id = 'event_header']")
header_loader.add_xpath('event_name', "//span[@class = 'summary']/text()")
header_loader.add_xpath('start_date', "//meta[@itemprop = 'startDate']/@content")
header_loader.add_xpath('end_date', "//meta[@itemprop = 'endDate']/@content")
loader.load_item()

You can achieve something similar by externally creating the nested loader, but you have to wire it up manually and remember to call load_item() on the nested loader.

loader = MyLoader(item=MyItem(), response=response)
loader.add_xpath('images', "//div[@class = 'event_title_image']/img/@src")

# boilerplate that must be updated if MyLoader class is changed
header_loader = MyLoader(
    item=loader.item, 
    response=loader.response, 
    selector=loader.selector.xpath("//div[@id = 'event_header']"),
)
header_loader.add_xpath('event_name', "//span[@class = 'summary']/text()")
header_loader.add_xpath('start_date', "//meta[@itemprop = 'startDate']/@content")
header_loader.add_xpath('end_date', "//meta[@itemprop = 'endDate']/@content")
header_loader.load_item()  # easy to forget
loader.load_item()

Tests have been added to ensure everything works correctly, including the ordering of values and that replace_* calls work correctly.

@codecov-io
Copy link

@codecov-io codecov-io commented Aug 29, 2015

Current coverage is 82.38%

Merging #1467 into master will increase coverage by +0.05% as of 46ad0cb

Powered by Codecov. Updated on successful CI builds.

selector = self.selector.css(css)

subloader = self.__class__(
item=self.item, selector=selector, parent=self

This comment has been minimized.

@kmike

kmike Aug 29, 2015
Member

what about context?

This comment has been minimized.

@dacjames

dacjames Aug 29, 2015
Author Contributor

I wasn't sure how to handle this since I haven't used contexts very much. Would it be better for the nested loader to share the parent's context or have a new context that extends from the parent?

This comment has been minimized.

@Digenis

Digenis Aug 29, 2015
Member

@nramirezuy has a relevant gist in #858

@kmike
Copy link
Member

@kmike kmike commented Aug 29, 2015

I like this feature, +1 to add it. Implementation looks fine.
Could you please add some docs?

@dacjames
Copy link
Contributor Author

@dacjames dacjames commented Aug 29, 2015

How about that?

@kmike
Copy link
Member

@kmike kmike commented Aug 31, 2015

@dacjames looks good!
I don't use contexts either, sorry.

@dangra
Copy link
Member

@dangra dangra commented Sep 4, 2015

@nramirezuy you are the all-mighty ItemLoaders reviewer and evangelist, what do you think?

@nramirezuy
Copy link
Contributor

@nramirezuy nramirezuy commented Sep 4, 2015

@dangra Well there are two things I can think of; I don't like right now:

  • Sharing the item.
  • Single method for xpath and css.

For some reason I would like having indentation; kinda makes the blocks more clear and define a start and end to that child.

I would also like to remove response and selector from the context, and provide a way to move the loader in the meta. Sometimes makes it hard to implement processors; when you have the data in two different pages.

EDIT: Also why creating a new ItemLoader instance when we are just looking to substitute the Selector instance inside it.

@dacjames
Copy link
Contributor Author

@dacjames dacjames commented Sep 4, 2015

@nramirezuy I split the function into two calls, hopefully eliminating concern number 2. I agree on that front, this is better.

I'm very confused by the objection to sharing the item. That's kind of the entire purpose of this feature, to be able to populate the same item with a loader specialized for a particular subset of the page.

Creating a new ItemLoader instance is desirable because modifying the existing ItemLoader leads to unexpected behavior and I like being able to use both loaders at the same time.

... will fix the coverage issue shortly.

@dacjames dacjames force-pushed the dacjames:master branch from 23df0c4 to 311d5cd Sep 4, 2015
@dacjames
Copy link
Contributor Author

@dacjames dacjames commented Sep 4, 2015

Rebased on scrapy master to remove the coverage error.

@kmike
Copy link
Member

@kmike kmike commented Sep 4, 2015

@nramirezuy do you propose to change syntax to this?

loader = MyLoader(item=MyItem(), response=response)
loader.add_xpath('images', "//div[@class = 'event_title_image']/img/@src")
with loader.nested_xpath("//div[@id = 'event_header']"):
    loader.add_xpath('event_name', "//span[@class = 'summary']/text()")
    loader.add_xpath('start_date', "//meta[@itemprop = 'startDate']/@content")
    loader.add_xpath('end_date', "//meta[@itemprop = 'endDate']/@content")
loader.load_item()
@dacjames
Copy link
Contributor Author

@dacjames dacjames commented Sep 4, 2015

@kmike That syntax would be worse for my use case. I create two nested loaders, one for each section of the page that have partially redundant information, then go through each data point of interest and try to pull from both locations (and from the original loader in one case).

With the proposed syntax, there does not appear to be any way to reference the top-level, un-nested loader.

@kmike
Copy link
Member

@kmike kmike commented Sep 4, 2015

@dacjames could you show a code example?

@dacjames
Copy link
Contributor Author

@dacjames dacjames commented Sep 4, 2015

Sure, I need to filter out some private info, but I'll post an example
later this evening.

On Fri, Sep 4, 2015 at 3:09 PM, Mikhail Korobov notifications@github.com
wrote:

@dacjames https://github.com/dacjames could you show a code example?


Reply to this email directly or view it on GitHub
#1467 (comment).

@dacjames
Copy link
Contributor Author

@dacjames dacjames commented Sep 5, 2015

My code looks basically like this:

loader = CustomItemLoader(item=CustomItem(), response=response)
header_loader = loader.nested_xpath("//div[@id = 'event_header']")
summary_loader = loader.nested_xpath("//div[@id = 'summary']")

loader.add_value('src_url', response.url)

header_loader.add_xpath('item_name', "//span[@class = 'summary']/text()")
summary_loader.add_xpath('item_name', "//a[re:match(@id, 'name_')]/text()")
loader.add_xpath('item_name', "//title/text()")

header_loader.add_xpath('start_date', "//meta[@itemprop = 'startDate']/@content")
summary_loader.add_xpath('start_date', "//a[re:match(@id, 'startDate_')]/text()")

header_loader.add_xpath('end_date', "//meta[@itemprop = 'endDate']/@content")
summary_loader.add_xpath('end_date', "//a[re:match(@id, 'endDate_')]/text()")
# ...

It could be converted to context manager format, something like:

loader = CustomItemLoader(item=CustomItem(), response=response)

loader.add_value('src_url', response.url)
loader.add_xpath('item_name', "//title/text()")

with loader.nested_xpath("//div[@id = 'event_header']"):
    loader.add_xpath('item_name', "//span[@class = 'summary']/text()")
    loader.add_xpath('start_date', "//meta[@itemprop = 'startDate']/@content")
    loader.add_xpath('end_date', "//meta[@itemprop = 'endDate']/@content")

with loader.nested_xpath("//div[@id = 'summary']"):
    loader.add_xpath('item_name', "//a[re:match(@id, 'name_')]/text()")
    loader.add_xpath('start_date', "//a[re:match(@id, 'startDate_')]/text()")
    loader.add_xpath('end_date', "//a[re:match(@id, 'endDate_')]/text()")

The behavior of this code is less obvious to me because the similar fields are not grouped together. More troubling, if you use the with ... as ... form, you can have an easy to miss bug (I think, haven't tested this code).

with loader.nested_xpath("//div[@id = 'summary']") as summary_loader:
    loader.add_xpath('item_name', "//title/text()")  # BUG: evaluated in nested scope
    summary_loader.add_xpath('item_name', "//a[re:match(@id, 'name_')]/text()")
    summary_loader.add_xpath('start_date', "//a[re:match(@id, 'startDate_')]/text()")
    summary_loader.add_xpath('end_date', "//a[re:match(@id, 'endDate_')]/text()")
@nramirezuy
Copy link
Contributor

@nramirezuy nramirezuy commented Sep 7, 2015

@dacjames having multiple instances of the same loader is an issue; first of all you gotta call load_item for each loader. Then the order matters, because it will override the field on the item.

@kmike
Copy link
Member

@kmike kmike commented Sep 13, 2015

@nramirezuy in @dacjames's implementation there is no need to call load_item() for each loader (it is a point of this PR), and an order matters just like it matters for loader.add_xpath calls with a single loader. Or did I get your comment wrong?

@dacjames
Copy link
Contributor Author

@dacjames dacjames commented Sep 14, 2015

@nramirezuy a nested loader is a distinct instance of ItemLoader that shares some state with its parent. Specificaly, the _values and items properties are shared. I landed on this implementation because it naturally preserves the expected ordering of add_* calls.

@nramirezuy
Copy link
Contributor

@nramirezuy nramirezuy commented Sep 14, 2015

@kmike @dacjames Sorry I got that wrong.

But I still think this is the way to go:

il = Itemloader(response=response)
il.add_xpath('item_name', "//title/text()")
with il.xpath("//div[@id = 'summary']") as selector:
    il.add_xpath('item_name', "//a[re:match(@id, 'name_')]/text()", selector=selector)
    il.add_xpath('start_date', "//a[re:match(@id, 'startDate_')]/text()", selector=selector)
    il.add_xpath('end_date', "//a[re:match(@id, 'endDate_')]/text()", selector=selector)

If you want you can:

il = Itemloader(response=response)
summary_selector = il.xpath("//div[@id = 'summary']")
il.add_xpath('item_name', "//title/text()")
il.add_xpath('item_name', "//a[re:match(@id, 'name_')]/text()", selector=summary_selector)
il.add_xpath('start_date', "//a[re:match(@id, 'startDate_')]/text()", selector=summary_selector)
il.add_xpath('end_date', "//a[re:match(@id, 'endDate_')]/text()", selector=summary_selector)
@kmike
Copy link
Member

@kmike kmike commented Sep 14, 2015

@nramirezuy what's an adavntage of writing selector=selector in all these .add_xpath calls, why do you prefer it?

@nramirezuy
Copy link
Contributor

@nramirezuy nramirezuy commented Sep 14, 2015

Because is what you wanna do in the ItemLoader, use a different selector for instance. Also gives you the possibility to give a selector from a different source.
I think ideally ItemLoader shouldn't be attached to a Selector instance. This will allow us to move the ItemLoader in request.meta.

Something like this:

il = ItemLoader(item=item or {})
il.xpath('//title/text()', selector=response.selector)
with il(response.selector or response):
    il.xpath('//title/text()')
@dacjames
Copy link
Contributor Author

@dacjames dacjames commented Sep 14, 2015

@nramirezuy this PR doesn't preclude setting the selector per your example. It just provides a nice shortcut to avoid having to write the all the duplicate selector=loader.selector.xpath('/some/subquery') arguments.

I personally don't like the context manager approach because using it incorrectly is too easy.

il = ItemLoader(item=item)
# ...
with il(response.selector) as il2:
    assert il is il2
    # surprising! mutating objects as their own context managers is quite unusual
@nramirezuy
Copy link
Contributor

@nramirezuy nramirezuy commented Sep 14, 2015

@dacjames what am I missing?

>>> class A(object):
...     selector = None
...     @contextmanager
...     def __call__(self, selector):
...         self.selector = selector
...         yield self
...         self.selector = None
... 
>>> a = A()
>>> with a('a') as x:
...     assert a is x
...     a.selector
...     x.selector
... 
'a'
'a'
>>> a.selector
>>> x.selector
@dacjames
Copy link
Contributor Author

@dacjames dacjames commented Sep 14, 2015

I just don't like the idea of context managers that mutate themselves.

with a('a') as x:
    # I expect to be able to access a here and NOT have it mutated
    assert a.selector is None  # False!  Very surprising to me
    assert x.selector == 'a'  # True! expected.

Since I bound the context manager to x, I would expect the changes to only affect x, when in fact they mutate a for the duration of the with block.

In this instance, I would think I could use the outer loader within the with block if I bind with as, but in fact there is no way to access the unnested loader within the with block.

@nramirezuy
Copy link
Contributor

@nramirezuy nramirezuy commented Sep 15, 2015

I can't see the issue.

>>> f = open('tmp_file', 'w')
>>> with f as x:
...     assert f is x # True

The only difference with the loader is that we never close it, so we can keep using it.

@dacjames
Copy link
Contributor Author

@dacjames dacjames commented Sep 15, 2015

In that example, unlike your suggestion, with f does not mutate f at all. f inside the with block is exactly the same state as f outside the block.

I just don't see what advantage the context manager solution provides. It's the same amount of code and enforces a certain structure of usage. With nested loaders, I can group code how I want (see comment above), pass the nested loader to a utility function, and it's all around less magical.

@nramirezuy
Copy link
Contributor

@nramirezuy nramirezuy commented Sep 15, 2015

I think changing state inside a context is completely normal.

>>> f = open('tmp_file', 'w')
>>> f.closed
False
>>> with f:
...     f.closed
... 
False
>>> f.closed
True
>>> import threading
>>> lock = threading.Lock()
>>> lock.locked()
False
>>> with lock:
...     lock.locked()
... 
True
>>> lock.locked()
False

Well thats why you have a less magical way of doing things:

il = ItemLoader()
summary_selector = il.xpath("//div[@id = 'summary']")
il.add_xpath('item_name', "//title/text()")
il.add_xpath('item_name', ".//a[re:match(@id, 'name_')]/text()", selector=summary_selector)
il.add_xpath('start_date', ".//a[re:match(@id, 'startDate_')]/text()", selector=summary_selector)
il.add_xpath('end_date', ".//a[re:match(@id, 'endDate_')]/text()", selector=summary_selector)

or
"Magical"

il = ItemLoader()
il.add_xpath('item_name', "//title/text()", selector=response.selector)
with il(response.xpath("//div[@id = 'summary']")):
    il.add_xpath('item_name', ".//a[re:match(@id, 'name_')]/text()")
    il.add_xpath('start_date', ".//a[re:match(@id, 'startDate_')]/text()")
    il.add_xpath('end_date', ".//a[re:match(@id, 'endDate_')]/text()")
@dacjames
Copy link
Contributor Author

@dacjames dacjames commented Sep 15, 2015

threading.Lock() is an interesting counter example, though still quite different from your usage. The lock object itself functions as the context manager, so it being mutated is expected. In your case, the loader is functioning both as a regular object and as the context manager, which is bad, IMO. To make the difference more clear this code would be analogous to threading.Lock():

loader = ItemLoader()
with loader:
    # do something

This code that you are suggesting is not:

loader = ItemLoader()
subloader = loader('//dev/foobar')
with subloader:
    # do something 

I think the nested loader provides for more readable code (mutation is always bad if it can be avoided), but you seem incorrigible so I don't see value in debating the topic further.

@dangra
Copy link
Member

@dangra dangra commented Sep 16, 2015

The functionality and code looks good to me, even if we eventually settle on a syntax sugar around contextmanagers it will be based on the changes included in this PR.

Code is clean, tested and covered; docs needs to be updated because of the split to nested_xpath/nested_css methods.

+1 to merge after fixing docs.

@dacjames
Copy link
Contributor Author

@dacjames dacjames commented Sep 16, 2015

Updated the docs for the method split.

dangra added a commit that referenced this pull request Sep 17, 2015
add support for a nested loaders
@dangra dangra merged commit 3c596dc into scrapy:master Sep 17, 2015
1 of 2 checks passed
1 of 2 checks passed
codecov/patch 95.65% of diff hit (target 100.00%)
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

6 participants