Fallback parser rules in ItemLoader - discussion for spider maintenance #3795
Comments
I like this idea. It seems perhaps
Cool idea. It will help simplify cases where several expressions are needed, and also help detect outdated expressions.
Agree with @peonone. This is a potential Scrapy plugin, and while the stats seem like the logical place for the selector distribution to appear, there is good enough reason to also log this information, as doing so will not take much effort.
Definitely a useful feature, though I'm not sure whether this is directly related to #3771.
I think this is a good idea, especially via stats increments.
to
This would make it easier to enable the loader wherever we want without too many code changes. Also, I think we'd need some sort of label for selectors, otherwise we'd end up with stats like
Also, should we handle subselectors and nested loaders differently?
Hi everyone! As discussed, we'll start this one out as a separate Scrapy plugin. I've begun the development in https://github.com/BurnzZ/scrapy-loader-upkeep with the bare minimum working components for the Stats API. Cheers!
FTR, we wanted to move ItemLoaders into a separate repo in the past, and even created it: https://github.com/scrapy/scrapy-itemloader. There are two reasons: first, it is not essential to core Scrapy, and second, a separate package would allow a separate release cycle, a separate list of maintainers, etc. We didn't move it back then, because ItemLoader requires Scrapy, and Scrapy requires ItemLoader. Maybe we should revive this work.
Let's close this issue as this feature has been established in https://github.com/BurnzZ/scrapy-loader-upkeep. Moreover, we're thinking over the idea of separating Scrapy's
Related to issue #3771
I'm stoked about the idea of having the ItemLoader support fallback parsers in any API possible, as Scrapy needs to provide convenient ways for developers to keep up with site changes. However, some sites change their layout more often than others, and some of the fallback parser rules become obsolete very quickly, which poses a problem for the spiders' long-term maintenance.
With this, the main challenge would be determining whether a given fallback css/xpath rule in the parser is safe to remove (meaning that it hasn't been encountered anywhere during a crawl). We could confirm this by looking at the distribution of how many times each fallback xpath/css rule was used over the full spider job.
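To make the "distribution of rule usage" idea concrete, here is a minimal sketch. It is not Scrapy's API and not the plugin's implementation: it uses the standard library's `xml.etree.ElementTree` as a stand-in for Scrapy selectors, and the helper `first_match`, the stat key format `field/rule_N`, and the sample pages and rules are all hypothetical.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def first_match(root, rules, field, stats):
    """Try each rule in order and record which position produced a value.

    `stats` is a plain Counter standing in for a crawl-wide stats collector.
    """
    for position, rule in enumerate(rules):
        nodes = root.findall(rule)
        if nodes:
            stats[f"{field}/rule_{position}"] += 1
            return nodes[0].text
    # No rule matched on this page.
    stats[f"{field}/missing"] += 1
    return None


# Hypothetical pages: two use the current layout, one uses an older layout.
pages = [
    "<html><body><h1 class='title'>A</h1></body></html>",
    "<html><body><h1 class='title'>B</h1></body></html>",
    "<html><body><div id='name'>C</div></body></html>",
]

# Hypothetical rules: the primary rule first, then two fallbacks.
title_rules = [
    ".//h1[@class='title']",
    ".//div[@id='name']",
    ".//span[@class='old-title']",
]

stats = Counter()
titles = [
    first_match(ET.fromstring(page), title_rules, "title", stats)
    for page in pages
]

# Rules that never matched during the whole "job" are candidates for removal.
unused = [
    rule
    for position, rule in enumerate(title_rules)
    if stats[f"title/rule_{position}"] == 0
]
```

After the run, `stats` holds one counter per rule position that matched, and `unused` lists the rules that were never hit, which is the signal a maintainer would look at before deleting a fallback.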
I'd like to discuss the idea of:
- reporting how often each rule was used: via logs? via stats?
- implementing this in the ItemLoader class itself? Or should it be subclassed for better backward compatibility (as this might have an effect on performance)?
- and lastly, should this feature even be worth implementing in Scrapy itself? Or should it be implemented in a separate repo as a Scrapy plugin?
Cheers!