Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fallback parser rules in ItemLoader - discussion for spider maintenance #3795

Closed
BurnzZ opened this issue May 26, 2019 · 8 comments
Closed

Comments

@BurnzZ
Copy link
Member

BurnzZ commented May 26, 2019

Related to issue #3771


I'm stoked with the idea of having the ItemLoader support fallback parsers in any API possible as Scrapy needs to provide convenient ways for developers to keep up with the site changes. However, some sites perform layout changes more often than others, and some of the fallback parser rules gets obsolete real fast, posing a problem in the spiders' long term maintenance.

With this, the main challenge would be determining if a given fallback css/xpath rule in the parser is safe to remove (meaning that it hasn't been encountered anywhere during a crawl). We could confirm this via looking at the distribution of how many times a fallback xpath/css rule was used for the full spider job.

I'd like to discuss the idea of:

  1. how should this information be better presented?
  2. where might we put this info on, via the logs? via stats?
  3. should this feature be put into the ItemLoader class itself? or should it be subclassed for better backward compatibility (as this might pose to have an effect on performance)?

and lastly, should this feature be even worthy of being implemented in Scrapy itself? or should it be implemented on a separate repo as a Scrapy plugin?

Cheers!

@stav
Copy link
Contributor

stav commented May 26, 2019

I like this idea. It seems perhaps itemLoader.load_item() should write to stats.

@peonone
Copy link

peonone commented May 27, 2019

Cool idea, it will be helpful to simplify the case a few expressions are needed, and also detecting the outdated expression.
I prefer to have it as a Scrapy Plugin, and use stats rather than logs for the hit count.

@akshayphilar
Copy link

Agree with @peonone. This is a potential Scrapy plugin and while the stats seem like the logical place for the selector distribution to appear, there is a good enough reason to also log this information, as doing so will not entail too much effort.

@kasun
Copy link
Contributor

kasun commented May 27, 2019

Definitely useful feature though I'm not sure whether this is directly related to #3771
I think it should go to stats, and prefer if this is builtin to scrapy itself.

@ejulio
Copy link
Contributor

ejulio commented May 27, 2019

I think this is a good idea, specially by stats increment.
I'd prefer it as a new library/plugin that would require only one line change.

from scrapy.loader import ItemLoader

class MyLoader(ItemLoader):
    pass

to

from scrapy_plugin import ItemLoader

class MyLoader(ItemLoader):
    pass

This would make it easier to enable the loader wherever we want without too many code changes.

Also, I think we'd need so sort of label for selectors, otherwise we'd end up with stats like scrapy_plugin/not_match/#id .class::text or scrapy_plugin/match/#id .class::text.
Maybe this is the best option, though we might write quite extensive selectors some times.

Also, should we consider (handle it differently) subselectors and nested loaders?

@BurnzZ
Copy link
Member Author

BurnzZ commented Jun 19, 2019

Hi everyone! As discussed, we'll start this one out as a separate Scrapy plugin. I've begun the development in https://github.com/BurnzZ/scrapy-loader-upkeep with the bare minimum working components for the Stats API. Cheers!

@kmike
Copy link
Member

kmike commented Jun 26, 2019

FTR, we wanted to move ItemLoaders in a separate repo in past, and even created it: https://github.com/scrapy/scrapy-itemloader. There are two reasons: first is that it is not essential to core scrapy, and second - separate package would allow a separate release cycle, separate list of maintainers, etc. We haven't moved it back then, because ItemLoader requires Scrapy, and Scrapy requires ItemLoader. Maybe we should revive this work.

@BurnzZ
Copy link
Member Author

BurnzZ commented Oct 19, 2019

Let's close this issue as this feature has been established in https://github.com/BurnzZ/scrapy-loader-upkeep.

Moreover, we're thinking over the idea of separating Scrapy's ItemLoader as another package in #4005.

@BurnzZ BurnzZ closed this as completed Oct 19, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

8 participants