Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

move item loaders to a separate library #4005

Closed
kmike opened this issue Sep 7, 2019 · 8 comments · Fixed by #4516
Closed

move item loaders to a separate library #4005

kmike opened this issue Sep 7, 2019 · 8 comments · Fixed by #4516

Comments

@kmike
Copy link
Member

kmike commented Sep 7, 2019

Summary

Move item loaders to a separate repository, similar to parsel.

Motivation

There are two main reasons for doing so:

  1. If item loaders are decoupled from Scrapy, they can be used without Scrapy. They would still support CSS/XPath selectors, but use parsel to populate dicts. For example, this would allow to use item loaders in PageObjects from https://github.com/scrapinghub/scrapy-po, and allow reusing them without Scrapy.
  2. It would allow to develop item loaders separately, make it easier to have a different set of maintainers, allow a separate release cycle. Currently they are under-maintained in Scrapy.

On a first sight, most of the item loaders code is not tied to Scrapy already.

Previous attempts

We tried to move itemloaders in past, by copying the current implementation. Decision was not to go with this, as Scrapy will have to depend on scrapy-itemloaders, and scrapy-itemloaders will have to depend on Scrapy. In this proposal Scrapy depends on a new package, but package can be used without Scrapy - it depends on parsel only.

How to do it

  1. re-create https://github.com/scrapy/scrapy-itemloader repo - import current code (with git history). Or maybe use a different name for a repository - just "itemloaders"? We can rename the repo.
  2. remove Scrapy-specific parts:
    • "response" argument, response object in loader context
    • scrapy Item support. It can be re-added for attr.s or dataclass-based items, when Scrapy gets support for them.
  3. implement scrapy.loaders on top of a new library - inherit, add response and item support.
  4. Provide backwards compatibility shims for scrapy.loader.processors

Other considerations

Optional scrapy.Item support in the library itself can be fine.

@BurnzZ
Copy link
Member

BurnzZ commented Oct 7, 2019

Hi @kmike , if all goes well on this plan, I'd love to merge my https://github.com/BurnzZ/scrapy-loader-upkeep into the separate repo as it would make sense to combine them there. We'll probably need to refactor it a bit but I'm happy to work on it nonetheless.

@kmike
Copy link
Member Author

kmike commented Oct 7, 2019

Hi @BurnzZ! AFAIK nobody started to work on extracting itemloaders repo yet, and there are no concrete plans for this. So I wonder if you can help with it as well, as a first step :) Ideally, it should involve some git voodoo to preserve history, like we did it for parsel (see #906).

@BurnzZ
Copy link
Member

BurnzZ commented Oct 9, 2019

Ahh yes preserving the history would bit of a challenge. Thanks for linking it up @kmike! I'll read that up and see what we can do for this. :)

@ejulio
Copy link
Contributor

ejulio commented Apr 16, 2020

Started this extraction here https://github.com/ejulio/scrapy-itemloaders
So far, just kept git history http://savorywatt.com/2015/01/25/move-files-and-folders-between-git-repos-using-patches/

  • Keep teim and Field in scrapy, just move loaders and processors
  • Keep backwards compatibility (include an ItemLoader in scrapy as a reference to this external library)

@kmike
Copy link
Member Author

kmike commented Apr 16, 2020

@ejulio it seems history before move from scrapy.contrib.loader to scrapy.loader is not preserved. Do you think there is a wayto keep it?

@ejulio
Copy link
Contributor

ejulio commented Apr 16, 2020

@kmike , I guess not.
As it's not part of the file history in scrapy itself.
Scrapy history https://github.com/scrapy/scrapy/commits/master/scrapy/loader/__init__.py
New history https://github.com/ejulio/scrapy-itemloaders/commits/master/scrapy/loader/__init__.py

I'll check if it is possible..

@kmike
Copy link
Member Author

kmike commented Apr 16, 2020

You can get the history for previous folder e.g. here https://github.com/scrapy/scrapy/commits/569156be190fb7e86108c29001d6233a6698a510/scrapy/contrib/loader. But yeah, I understand this can be complicated.

@ejulio
Copy link
Contributor

ejulio commented Apr 17, 2020

@kmike , I managed to keep both commit histories https://github.com/ejulio/scrapy-itemloaders/commits/master

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants