
Linkextractors and ItemLoader Unified API #578

Open
nramirezuy opened this issue Feb 3, 2014 · 8 comments

Comments

@nramirezuy (Contributor)

There are lots of linkextractors with different flavors, but we don't really need linkextractors; we just need the filters (or processors) and a good way to handle them.

What is the difference between using extractors and this approach?

>>> from scrapy.contrib.linkfilters import *
>>> from scrapy.contrib.loader import ItemLoader
>>> from scrapy.selector import Selector
>>> from scrapy.http import HtmlResponse
>>> response = HtmlResponse('http://scrapinghub.com', body='<html><body><div><a href="./index">Home</a><a href="./index">Index</a></div><div><a href="./faq">FAQ</a></div></body></html>')
>>> loader = ItemLoader(response=response)
>>> list(loader.get_xpath('.//a/@href', Unique()))
[u'./index', u'./faq']
>>> sel = Selector(response)
>>> loader = ItemLoader(response=response)
>>> list(loader.get_xpath('(.//div)[2]/a/@href', Unique()))
[u'./faq']
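The `Unique()` processor in the snippet above comes from the proposed `scrapy.contrib.linkfilters` module, which does not exist in Scrapy; a minimal sketch of what such a processor might look like (the class is hypothetical):

```python
# Hypothetical sketch of a "Unique" processor as used in the REPL example
# above: deduplicate extracted values while preserving document order.
class Unique:
    def __call__(self, values):
        seen = set()
        for value in values:
            if value not in seen:
                seen.add(value)
                yield value

print(list(Unique()(['./index', './index', './faq'])))  # ['./index', './faq']
```

Making it a class rather than a plain function fits the proposal's point below that some processors need instantiation data or internal state.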

My proposal has several points:

  • Move all the pipeline logic from ItemLoader into a Loader class. (Not strictly needed, but cleaner.)
  • Rule, instead of using a linkextractor, should receive a callable that takes a Response as argument and returns an iterable of URLs.
  • Death or revival of Link: we should add support for instantiating Request with a Link object instead of a URL.
@kmike (Member)

kmike commented Apr 15, 2014

I think the task of extracting links is sufficiently different from the task of populating items, and using ItemLoader with a processor shouldn't be the blessed way to extract links, e.g. for navigation.

Can't comment on the other points (link extractors, rules, ..) as they are mostly used with CrawlSpider, which I'm not very familiar with. It seems there is an unnecessary overlap between #331 and link extractors; if link extractors are important, they are probably best implemented on top of Selectors, and #331 should be implemented as a convenient shortcut for link extractors.

@nramirezuy (Contributor, Author)

I'm not thinking of using ItemLoader to implement it; I'm thinking of breaking ItemLoader apart and keeping the "Loader" part of it. This "Loader" could be used to build absolute URLs for Requests, or to populate the Item object directly. I haven't thought much about the subject yet, but there is a lot of functionality in ItemLoader that could be reused.

Did you see #346? Those are our processors. These processors are common tasks that we could do with comprehensions, but then we have to think them through each time and will most likely introduce bugs. Like the "simple" task of taking the first element of an .extract() from a Selector.
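The "take the first element" task is a good example of the bug being referred to: `results[0]` raises `IndexError` when the selector matched nothing. A minimal sketch of the kind of processor discussed in #346 (hypothetical class, not a reference to Scrapy's actual implementation):

```python
# Sketch of a TakeFirst-style processor: return the first non-empty value,
# or None, instead of crashing on an empty extract() result.
class TakeFirst:
    def __call__(self, values):
        for value in values:
            if value is not None and value != '':
                return value
        return None  # empty input: no exception, just no value

print(TakeFirst()([]))           # None, where results[0] would raise IndexError
print(TakeFirst()(['', 'faq']))  # 'faq'
```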

@kmike (Member)

kmike commented Apr 15, 2014

Could you please elaborate on the "Loader" part? What would it load, how would it differ from Selector, and why do we need both Loader and Selector?

Yes, I've seen #346. I think these processors are useful, but we shouldn't dump a lot of new classes on users who want to do something simple. These classes help to structure the code, but people shouldn't have to learn them to do their tasks. Also, this is a style question - some people would prefer take_first(normalize(sel.xpath("//a/@href"))) over .some_method("//a/@href", Normalize(), TakeFirst()), some will use libraries like fn.py for syntactic sugar for function composition, etc.

I think this is not our battle. It is good to have such common processors, but IMHO they shouldn't be required to get started, and we should equally support people who use their existing skills to write e.g. a list comprehension instead of learning our function composition framework.

Investing in learning these processors is justified when you're writing a lot of spiders and this is your primary task, but for many people writing a spider is not a primary task - they want to quickly get some data for their work or research, and they have lots of other things to do and to learn.

@kmike (Member)

kmike commented Apr 15, 2014

I also haven't thought much about this, but if your goal is to split ItemLoader into two parts:

  • one is for applying and combining processors (a small framework for composing processing functions and a common set of such functions useful for scraping),
  • and the second is for populating items,

then +1. Devil is in details, as usual :)

@kmike (Member)

kmike commented Apr 15, 2014

See also @nyov's comments: #568 (comment) and #568 (comment). @nramirezuy - are you proposing something similar?

@nramirezuy (Contributor, Author)

@kmike We aren't going to remove the Selector class; if people want to write list comprehensions, they still can. They can also forget about Loader and use the processors directly.
The main reason they are classes instead of functions is that some of them need instantiation data or hold state.

The Loader is mainly the pipeline plus its context; I can think of nothing else right now. So Loader will be a pipeline that can work with a context. Right now ItemLoader uses two static pipelines per field (in and out) and one optional pipeline before "in".
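The "pipeline plus context" idea can be sketched in a few lines. Everything here is hypothetical (this is not Scrapy's Loader); the point is only that processors run in order and can optionally receive a shared context, mirroring how ItemLoader threads its loader context through input/output processors.

```python
import inspect
from urllib.parse import urljoin

# Sketch of a Loader-style pipeline: run processors in order, passing the
# shared context only to processors whose signature declares it.
class Pipeline:
    def __init__(self, *processors, context=None):
        self.processors = processors
        self.context = context or {}

    def __call__(self, values):
        for processor in self.processors:
            params = inspect.signature(processor).parameters
            if 'context' in params:
                values = processor(values, context=self.context)
            else:
                values = processor(values)
        return values

def strip_all(values):
    return [v.strip() for v in values]

def absolutize(values, context):
    return [urljoin(context['base_url'], v) for v in values]

pipeline = Pipeline(strip_all, absolutize,
                    context={'base_url': 'http://scrapinghub.com/'})
print(pipeline(['  ./index ', './faq']))
# ['http://scrapinghub.com/index', 'http://scrapinghub.com/faq']
```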

> This would give us a separation of concerns here:
> Selector handles Selector(Lists)
> Extractor handles extraction with processors
> LinkExtractors are special Extractor cases (processors)
> ItemLoaders get much more simple without the selector baggage and possibly reduced to a single 'output processor' chain?
> Haven't quite thought this through, but it looks neat in concept ;)

Selector will not change.
Extractor/Loader handles the processor chain and the context of the chain.
LinkExtractor may not need to exist, since it is just an Extractor using certain processors; there is nothing special that warrants a dedicated class.
ItemLoader will not change its API or functionality, just be cleaned up to use Loader.

@nyov (Contributor)

nyov commented Apr 15, 2014

+1 to uplifting the Link object.
@kmike, wouldn't that be a nice place to keep some state about a URL and implement 'have-absolute-urls' in spiders? It could have the logic to figure out its absolute URL based on context at instantiation time, and could also be a place to attach other meta info about a link.
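An "uplifted" Link like this could be sketched as a small value object. This is hypothetical (Scrapy's actual Link class has a different, simpler signature: url, text, fragment, nofollow); the sketch only illustrates resolving the absolute URL from context at instantiation time and carrying extra metadata.

```python
from dataclasses import dataclass, field
from urllib.parse import urljoin

# Sketch of an uplifted Link: knows its base URL, so it can always
# answer with an absolute URL, and has room for arbitrary metadata.
@dataclass
class Link:
    href: str
    base_url: str
    text: str = ''
    meta: dict = field(default_factory=dict)

    @property
    def absolute_url(self):
        return urljoin(self.base_url, self.href)

link = Link('./faq', base_url='http://scrapinghub.com/', text='FAQ')
print(link.absolute_url)  # http://scrapinghub.com/faq
```

A Request could then accept such an object in place of a bare URL string, which is the "death or revival of Link" point from the original proposal.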


What I meant in the quoted comment, IIRC, is that the 'output processor' could be relegated back to the Item with this logic.
We once had a DefaultsItem, and the magic is still there. If this final loading/populating-the-item stage gets simplified and generalized enough, it could be put into the Item class and then customized by subclassing Item where necessary (drop None values or not? bring back DefaultsItem?). I think that's an intuitive way to customize item population.

Then, all that people would need to understand is the Extractor and its processor chains.
And they would be a drop-in addition to the spider, instead of the current need to rewrite logic for ItemLoader.
I think this could also scale better in both dimensions: why limit it to an 'in' and an 'out' stage? Need another stage, add another Extractor to the chain.
And within an Extractor, you can chain Processors just as now.
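The "add a stage by adding an Extractor" idea above can be sketched as composition of small stages. All names here are hypothetical, illustrative only:

```python
# Sketch: each Extractor wraps a chain of processors, and Extractors
# themselves compose, so stages are open-ended rather than fixed to
# "in" and "out".
class Extractor:
    def __init__(self, *processors):
        self.processors = processors

    def __call__(self, values):
        for processor in self.processors:
            values = processor(values)
        return values

def drop_anchors(values):
    return [v for v in values if not v.startswith('#')]

def dedupe(values):
    seen = set()
    return [v for v in values if not (v in seen or seen.add(v))]

# Two stages chained; adding a third stage is just another Extractor.
stage_in = Extractor(drop_anchors)
stage_out = Extractor(dedupe)
print(stage_out(stage_in(['./faq', '#top', './faq', './index'])))
# ['./faq', './index']
```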

From an architecture overview, this would look like Extractor == another middleware and Processor == ItemPipeline, I think; sorry for the uglified image.

[attached image: scrapy_architecture2, a sketch of the proposed architecture]

@Gallaecio (Member)

I’m unsure whether this makes more sense in Scrapy or in https://github.com/scrapy/itemloaders now. If anyone has a better grasp and you think it makes sense to move the issue to https://github.com/scrapy/itemloaders, let me know.
