Fallback parser rules in ItemLoader - discussion for spider maintenance #3795

BurnzZ · 2019-05-26T06:06:24Z

Related to issue #3771

I'm stoked about the idea of having the ItemLoader support fallback parsers through whatever API is possible, as Scrapy needs to provide convenient ways for developers to keep up with site changes. However, some sites change their layout more often than others, and some fallback parser rules become obsolete quickly, which poses a problem for the spiders' long-term maintenance.

With this, the main challenge is determining whether a given fallback css/xpath rule in the parser is safe to remove (meaning that it was never used during a crawl). We could confirm this by looking at the distribution of how many times each fallback xpath/css rule was used over the full spider job.
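To make the "distribution of rule usage" idea concrete, here is a minimal sketch using only the standard library. It is not Scrapy code: the rule labels, the dict-based "pages", and the helper names `extract_with_fallback` and `unused_rules` are all hypothetical stand-ins for real css/xpath rules and responses.

```python
from collections import Counter

# Hypothetical stand-ins for css/xpath rules: each takes a "page"
# (a plain dict here) and returns a value, or None when it does not match.
PRICE_RULES = [
    ("current_layout", lambda page: page.get("price")),
    ("2018_layout", lambda page: page.get("old_price")),
    ("legacy_layout", lambda page: page.get("cost")),
]


def extract_with_fallback(rules, page, hits):
    """Try the rules in order and count which one produced the value."""
    for label, rule in rules:
        value = rule(page)
        if value is not None:
            hits[label] += 1
            return value
    hits["no_match"] += 1
    return None


def unused_rules(rules, hits):
    """Rules that never matched during the job are candidates for removal."""
    return [label for label, _ in rules if hits[label] == 0]


hits = Counter()
pages = [{"price": "10"}, {"price": "12"}, {"old_price": "9"}, {}]
values = [extract_with_fallback(PRICE_RULES, page, hits) for page in pages]
removable = unused_rules(PRICE_RULES, hits)
```

After a full job, `hits` holds the per-rule distribution and `removable` lists the rules that were never used, which is the signal needed to decide what can be deleted.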

I'd like to discuss the idea of:

How should this information be best presented?
Where should we put this info: in the logs? In the stats?
Should this feature be put into the ItemLoader class itself? Or should it be a subclass for better backward compatibility (as this might affect performance)?

And lastly, is this feature even worth implementing in Scrapy itself? Or should it be implemented in a separate repo as a Scrapy plugin?

Cheers!

stav · 2019-05-26T13:05:53Z

I like this idea. Perhaps ItemLoader.load_item() should write to stats.
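A rough sketch of what "load_item() writes to stats" could look like, kept free of Scrapy imports so it runs on its own. `FakeStats` is a stand-in that only models an `inc_value` method, and `CountingLoader` with its `add_fallbacks` method is a hypothetical class, not the real ItemLoader API.

```python
class FakeStats:
    """Stand-in for a stats collector; only inc_value is modelled."""

    def __init__(self):
        self.values = {}

    def inc_value(self, key, count=1):
        self.values[key] = self.values.get(key, 0) + count


class CountingLoader:
    """Hypothetical loader that buffers rule hits and flushes them in load_item()."""

    def __init__(self, stats):
        self.stats = stats
        self._item = {}
        self._pending = []

    def add_fallbacks(self, field, candidates):
        # candidates: the value each fallback rule produced, in priority order
        for position, value in enumerate(candidates):
            if value is not None:
                self._item[field] = value
                self._pending.append(f"loader/{field}/rule_{position}")
                return
        self._pending.append(f"loader/{field}/missing")

    def load_item(self):
        # Stats are written once per item, at the moment the item is built.
        for key in self._pending:
            self.stats.inc_value(key)
        self._pending.clear()
        return dict(self._item)


stats = FakeStats()
loader = CountingLoader(stats)
loader.add_fallbacks("title", [None, "Widget"])  # second rule matched
loader.add_fallbacks("price", [None, None])      # no rule matched
item = loader.load_item()
```

Flushing in `load_item()` means a loader that is created but never turned into an item leaves no trace in the stats, which may or may not be the desired behaviour.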

peonone · 2019-05-27T02:35:15Z

Cool idea. It would help simplify the case where several expressions are needed, and also help detect outdated expressions.
I'd prefer to have it as a Scrapy plugin, and to use stats rather than logs for the hit count.

akshayphilar · 2019-05-27T06:05:36Z

Agree with @peonone. This is a potential Scrapy plugin and while the stats seem like the logical place for the selector distribution to appear, there is a good enough reason to also log this information, as doing so will not entail too much effort.

kasun · 2019-05-27T06:36:32Z

Definitely a useful feature, though I'm not sure whether this is directly related to #3771.
I think it should go to stats, and I'd prefer it built into Scrapy itself.

ejulio · 2019-05-27T12:28:56Z

I think this is a good idea, especially with stats increments.
I'd prefer it as a new library/plugin that would require only a one-line change.

from scrapy.loader import ItemLoader

class MyLoader(ItemLoader):
    pass

to

from scrapy_plugin import ItemLoader

class MyLoader(ItemLoader):
    pass

This would make it easier to enable the loader wherever we want without too many code changes.

Also, I think we'd need some sort of label for selectors, otherwise we'd end up with stats like scrapy_plugin/not_match/#id .class::text or scrapy_plugin/match/#id .class::text.
Maybe this is the best option, though we might write quite long selectors sometimes.
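As an illustration of the label idea, a stat key could be built from an optional label and fall back to the raw selector when none is given. The `stat_key` helper and its parameters are hypothetical; only the `scrapy_plugin/match/...` key shape comes from the examples above.

```python
def stat_key(selector, matched, label=None, prefix="scrapy_plugin"):
    """Build a stat key, preferring a short label over the raw selector."""
    outcome = "match" if matched else "not_match"
    name = label if label else selector
    return f"{prefix}/{outcome}/{name}"


# Without a label, the raw selector ends up in the key.
raw_key = stat_key("#id .class::text", matched=True)

# With a label, a long selector stays out of the stats.
labelled_key = stat_key(
    "//div[@id='main']//span[contains(@class, 'price')]/text()",
    matched=False,
    label="price_2019",
)
```

Labels keep the keys short and stable even when the selector text itself is edited, at the cost of having to name every rule.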

Also, should we handle subselectors and nested loaders differently?

BurnzZ · 2019-06-19T13:28:59Z

Hi everyone! As discussed, we'll start this one out as a separate Scrapy plugin. I've begun the development in https://github.com/BurnzZ/scrapy-loader-upkeep with the bare minimum working components for the Stats API. Cheers!

kmike · 2019-06-26T13:59:30Z

FTR, we wanted to move ItemLoaders to a separate repo in the past, and even created it: https://github.com/scrapy/scrapy-itemloader. There are two reasons: first, it is not essential to core Scrapy; second, a separate package would allow a separate release cycle, a separate list of maintainers, etc. We didn't move it back then because ItemLoader requires Scrapy, and Scrapy requires ItemLoader. Maybe we should revive this work.

BurnzZ · 2019-10-19T12:08:51Z

Let's close this issue as this feature has been established in https://github.com/BurnzZ/scrapy-loader-upkeep.

Moreover, we're thinking over the idea of separating Scrapy's ItemLoader as another package in #4005.

Gallaecio added discuss enhancement labels May 27, 2019

BurnzZ closed this as completed Oct 19, 2019
