
Linkextractors and ItemLoader Unified API #578

Open
nramirezuy opened this issue Feb 3, 2014 · 8 comments

Comments

@nramirezuy (Contributor)

There are lots of linkextractors with different flavors, but we don't really need linkextractors; we just need the filters (or processors) and a good way to handle them.

What is the difference between using extractors and this approach?

>>> from scrapy.contrib.linkfilters import *
>>> from scrapy.contrib.loader import ItemLoader
>>> from scrapy.selector import Selector
>>> from scrapy.http import HtmlResponse
>>> response = HtmlResponse('http://scrapinghub.com', body='<html><body><div><a href="./index">Home</a><a href="./index">Index</a></div><div><a href="./faq">FAQ</a></div></body></html>')
>>> loader = ItemLoader(response=response)
>>> list(loader.get_xpath('.//a/@href', Unique()))
[u'./index', u'./faq']
>>> sel = Selector(response)
>>> loader = ItemLoader(response=response)
>>> list(loader.get_xpath('(.//div)[2]/a/@href', Unique()))
[u'./faq']
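The `Unique()` processor in the snippet above comes from the proposed `scrapy.contrib.linkfilters` module, which does not exist in Scrapy; a minimal sketch of what such a processor might look like (the class is hypothetical):

```python
# Hypothetical sketch of a "Unique" processor as used in the REPL example
# above: deduplicate extracted values while preserving document order.
class Unique:
    def __call__(self, values):
        seen = set()
        for value in values:
            if value not in seen:
                seen.add(value)
                yield value

print(list(Unique()(['./index', './index', './faq'])))  # ['./index', './faq']
```

Making it a class rather than a plain function fits the proposal's point below that some processors need instantiation data or internal state.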

My proposal has several points:

  • Move all the pipeline logic from ItemLoader into a Loader class. (Not strictly needed, but cleaner.)
  • Rule, instead of using a linkextractor, should receive a callable that takes a Response as argument and returns an iterable of URLs.
  • Death or revival of Link: we should add support for instantiating Request with a Link object instead of a URL.
@kmike (Member)

kmike commented Apr 15, 2014

I think the task of extracting links is sufficiently different from the task of populating items, and using ItemLoader with a processor shouldn't be the blessed way to extract links, e.g. for navigation.

Can't comment on the other points (link extractors, rules, ..) as they are mostly used with CrawlSpider, which I'm not very familiar with. It seems there is an unnecessary overlap between #331 and link extractors; if link extractors are important, they are probably best implemented on top of Selectors, and #331 should be implemented as a convenient shortcut for link extractors.

@nramirezuy (Contributor, Author)

I'm not thinking of using ItemLoader to implement it; I'm thinking of breaking ItemLoader apart and keeping the "Loader" part of it. This "Loader" could be used to build absolute URLs for Requests, or to populate the Item object directly. I haven't thought much about the subject yet, but there is a lot of functionality in ItemLoader that could be reused.

Did you see #346? Those are our processors. These processors are common tasks that we could do with comprehensions, but then we have to think them through each time and will most likely introduce bugs. Like the "simple" task of taking the first element of an .extract() from a Selector.
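The "take the first element" task is a good example of the bug being referred to: `results[0]` raises `IndexError` when the selector matched nothing. A minimal sketch of the kind of processor discussed in #346 (hypothetical class, not a reference to Scrapy's actual implementation):

```python
# Sketch of a TakeFirst-style processor: return the first non-empty value,
# or None, instead of crashing on an empty extract() result.
class TakeFirst:
    def __call__(self, values):
        for value in values:
            if value is not None and value != '':
                return value
        return None  # empty input: no exception, just no value

print(TakeFirst()([]))           # None, where results[0] would raise IndexError
print(TakeFirst()(['', 'faq']))  # 'faq'
```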

@kmike (Member)

kmike commented Apr 15, 2014

Could you please elaborate on the "Loader" part? What would it load, how would it differ from Selector, and why do we need both Loader and Selector?

Yes, I've seen #346. I think these processors are useful, but we shouldn't dump a lot of new classes on users who want to do something simple. These classes help to structure the code, but people shouldn't have to learn them to do their tasks. Also, this is a style question - some people would prefer take_first(normalize(sel.xpath("//a/@href"))) over .some_method("//a/@href", Normalize(), TakeFirst()), some will use libraries like fn.py for syntactic sugar for function composition, etc.

I think this is not our battle. It is good to have such common processors, but IMHO they shouldn't be required to get started, and we should equally support people who use their existing skills to write e.g. a list comprehension instead of learning our function composition framework.

Investing in learning these processors is justified when you're writing a lot of spiders and this is your primary task, but for many people writing a spider is not a primary task - they want to quickly get some data for their work or research, and they have lots of other things to do and to learn.

@kmike (Member)

kmike commented Apr 15, 2014

I also haven't thought much about this, but if your goal is to split ItemLoader into two parts:

  • one is for applying and combining processors (a small framework for composing processing functions and a common set of such functions useful for scraping),
  • and the second is for populating items,

then +1. Devil is in details, as usual :)

@kmike (Member)

kmike commented Apr 15, 2014

See also @nyov's comments: #568 (comment) and #568 (comment). @nramirezuy - are you proposing something similar?

@nramirezuy (Contributor, Author)

@kmike We aren't going to remove the Selector class; if people want to write list comprehensions, they still can. They can also forget about Loader and use the processors directly.
The main reason they are classes instead of functions is that some of them need instantiation data or hold state.

The Loader is mainly the pipeline plus its context; I can think of nothing else right now. So Loader will be a pipeline that can work with a context. Right now ItemLoader uses two static pipelines per field (in and out) and one optional pipeline before "in".
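The "pipeline plus context" idea can be sketched in a few lines. Everything here is hypothetical (this is not Scrapy's Loader); the point is only that processors run in order and can optionally receive a shared context, mirroring how ItemLoader threads its loader context through input/output processors.

```python
import inspect
from urllib.parse import urljoin

# Sketch of a Loader-style pipeline: run processors in order, passing the
# shared context only to processors whose signature declares it.
class Pipeline:
    def __init__(self, *processors, context=None):
        self.processors = processors
        self.context = context or {}

    def __call__(self, values):
        for processor in self.processors:
            params = inspect.signature(processor).parameters
            if 'context' in params:
                values = processor(values, context=self.context)
            else:
                values = processor(values)
        return values

def strip_all(values):
    return [v.strip() for v in values]

def absolutize(values, context):
    return [urljoin(context['base_url'], v) for v in values]

pipeline = Pipeline(strip_all, absolutize,
                    context={'base_url': 'http://scrapinghub.com/'})
print(pipeline(['  ./index ', './faq']))
# ['http://scrapinghub.com/index', 'http://scrapinghub.com/faq']
```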

> This would give us a separation of concerns here:
> Selector handles Selector(Lists)
> Extractor handles extraction with processors
> LinkExtractors are special Extractor cases (processors)
> ItemLoaders get much more simple without the selector baggage and possibly reduced to a single 'output processor' chain?
> Haven't quite thought this through, but it looks neat in concept ;)

Selector will not change.
Extractor/Loader handles the processor chain and the context of the chain.
LinkExtractor may not need to exist, since it is just an Extractor using certain processors; there is nothing special that warrants a dedicated class.
ItemLoader will not change its API or functionality, just be cleaned up to use Loader.

@nyov (Contributor)

nyov commented Apr 15, 2014

+1 to uplifting the Link object.
@kmike, wouldn't that be a nice place to keep some state about a URL and implement 'have-absolute-urls' in spiders? It could have the logic to figure out its absolute URL based on context at instantiation time, and could also be a place to attach other meta info about a link.
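An "uplifted" Link like this could be sketched as a small value object. This is hypothetical (Scrapy's actual Link class has a different, simpler signature: url, text, fragment, nofollow); the sketch only illustrates resolving the absolute URL from context at instantiation time and carrying extra metadata.

```python
from dataclasses import dataclass, field
from urllib.parse import urljoin

# Sketch of an uplifted Link: knows its base URL, so it can always
# answer with an absolute URL, and has room for arbitrary metadata.
@dataclass
class Link:
    href: str
    base_url: str
    text: str = ''
    meta: dict = field(default_factory=dict)

    @property
    def absolute_url(self):
        return urljoin(self.base_url, self.href)

link = Link('./faq', base_url='http://scrapinghub.com/', text='FAQ')
print(link.absolute_url)  # http://scrapinghub.com/faq
```

A Request could then accept such an object in place of a bare URL string, which is the "death or revival of Link" point from the original proposal.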


What I meant in the quoted comment, IIRC, is that the 'output processor' could be relegated back to the Item with this logic.
We once had a DefaultsItem, and the magic is still there. If this final loading/populating-the-item stage gets simplified and generalized enough, it could be put into the Item class and then customized by subclassing Item where necessary (drop None values or not? bring back DefaultsItem?). I think that's an intuitive way to customize item population.

Then, all that people would need to understand is the Extractor and its processor chains.
And they would be a drop-in addition to the spider, instead of the current need to rewrite logic for ItemLoader.
I think this could also scale better in both dimensions: why limit it to an 'in' and an 'out' stage? Need another stage, add another Extractor to the chain.
And within an Extractor, you can chain Processors just as now.
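The "add a stage by adding an Extractor" idea above can be sketched as composition of small stages. All names here are hypothetical, illustrative only:

```python
# Sketch: each Extractor wraps a chain of processors, and Extractors
# themselves compose, so stages are open-ended rather than fixed to
# "in" and "out".
class Extractor:
    def __init__(self, *processors):
        self.processors = processors

    def __call__(self, values):
        for processor in self.processors:
            values = processor(values)
        return values

def drop_anchors(values):
    return [v for v in values if not v.startswith('#')]

def dedupe(values):
    seen = set()
    return [v for v in values if not (v in seen or seen.add(v))]

# Two stages chained; adding a third stage is just another Extractor.
stage_in = Extractor(drop_anchors)
stage_out = Extractor(dedupe)
print(stage_out(stage_in(['./faq', '#top', './faq', './index'])))
# ['./faq', './index']
```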

From an architecture overview, this would look like Extractor == another middleware and Processor == ItemPipeline, I think; sorry for the uglified image.

[attached image: scrapy_architecture2, a sketch of the proposed architecture]

@Gallaecio (Member)

I’m unsure whether this makes more sense in Scrapy or in https://github.com/scrapy/itemloaders now. If anyone has a better grasp and you think it makes sense to move the issue to https://github.com/scrapy/itemloaders, let me know.
