Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add 'item_classes' key to FEEDS #4576

Closed
wants to merge 23 commits into from
Closed

Conversation

longkyle
Copy link

@longkyle longkyle commented May 14, 2020

Fixes #4575

I think it would be nice to be able to filter which items get exported to each URI in FEEDS.

This update allows for an 'item_classes' key (optional) to be added to FEEDS. If 'item_classes' is not provided, it defaults to None which includes all item classes. This behavior is meant to mimic the behavior of the 'fields' key.

from my_project.items import ProductItem, AnotherItem

FEEDS = {
        'product_items.json': {
            'format': 'json',
            'item_classes': ProductItem                      <-- scrapy.Item object
        },
        'other_items.json': {
            'format': 'json',
            'item_classes': (ProductItem, AnotherItem)       <-- set of scrapy.Item objects
        },
        'all_items.json': {
            'format': 'json'
        }
    }

Copy link
Member

@elacuesta elacuesta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea sounds good, in fact I think I saw a question about it some time ago. I'm not sure I like the name though, I think it's not clear enough. Would item_classes be better?

scrapy/exporters.py Outdated Show resolved Hide resolved
scrapy/exporters.py Outdated Show resolved Hide resolved
tests/test_feedexport.py Outdated Show resolved Hide resolved
scrapy/exporters.py Outdated Show resolved Hide resolved
@longkyle
Copy link
Author

@elacuesta Thanks for taking the time to review my code!
I think you're right about naming the key, item_classes thanks for the suggestion! I'll update with your other suggestions as well.

Thanks again!

@longkyle longkyle requested a review from elacuesta May 14, 2020 22:32
scrapy/exporters.py Outdated Show resolved Hide resolved
@elacuesta
Copy link
Member

Many thanks for the updates. Could you take a look at the Travis failures?

@longkyle
Copy link
Author

@elacuesta I've spent quite a bit of time looking into the travis errors and am not sure what's going on, unfortunately. The tests pass for me locally but seem to fail unpredictably in travis. Might you be able to take a quick look and just see if anything stands out? Basically, the order of fields seem to get saved in different orders depending on the test/environment?

Attached is an image showing that if I correct for the travis test, it fails locally, and vice versa.
Screen Shot 2020-05-14 at 6 26 37 PM

@Gallaecio
Copy link
Member

Yes, that’s because we still support Python 3.5, where the key order of dictionaries is not kept.
You could change the tests to check that the values are both bytes, then load them with json.loads, and then check that they match.

@elacuesta
Copy link
Member

Indeed. Another thing you could do would be to use items with only one field, as the test_export_multiple_configs and test_export_indentation methods do, for instance.

@longkyle longkyle changed the title Add 'items' key to FEEDS Add 'item_classes' key to FEEDS May 15, 2020
@codecov
Copy link

codecov bot commented May 15, 2020

Codecov Report

Merging #4576 into master will increase coverage by 0.19%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #4576      +/-   ##
==========================================
+ Coverage   84.55%   84.75%   +0.19%     
==========================================
  Files         164      163       -1     
  Lines        9923     9974      +51     
  Branches     1475     1487      +12     
==========================================
+ Hits         8390     8453      +63     
+ Misses       1266     1254      -12     
  Partials      267      267              
Impacted Files Coverage Δ
scrapy/exporters.py 100.00% <100.00%> (ø)
scrapy/extensions/feedexport.py 84.68% <100.00%> (+0.20%) ⬆️
scrapy/utils/conf.py 93.13% <100.00%> (+0.06%) ⬆️
scrapy/__init__.py 84.21% <0.00%> (-1.51%) ⬇️
scrapy/downloadermiddlewares/cookies.py 95.74% <0.00%> (-1.14%) ⬇️
scrapy/spiders/crawl.py 92.39% <0.00%> (-0.09%) ⬇️
scrapy/shell.py 68.21% <0.00%> (ø)
scrapy/cmdline.py 67.74% <0.00%> (ø)
scrapy/robotstxt.py 97.53% <0.00%> (ø)
scrapy/selector/__init__.py 100.00% <0.00%> (ø)
... and 10 more

@Gallaecio
Copy link
Member

Gallaecio commented May 15, 2020

To avoid having to add those 2 lines to every exporter, I think we should consider applying the option through _FeedSlot instead. See #4434.

There is a slot.exporter.export_item line in feedexport.py. You could implement a method of _FeedSlot that starts with those two lines that check the item class, and then call slot.exporter.export_item, and you could replace the current call to slot.exporter.export_item with a call to that new slot.export_item method.

scrapy/exporters.py Outdated Show resolved Hide resolved
longkyle and others added 2 commits May 15, 2020 13:38
Co-authored-by: Eugenio Lacuesta <1731933+elacuesta@users.noreply.github.com>
@elacuesta
Copy link
Member

@longkyle Seeing the message for f97b521, I'd like to mention that you can trigger a new build by closing and re-opening the PR. Additionally, I'd recommend you to run the tests locally to spot any failures before pushing, it's usually much more comfortable and fast than waiting for the Travis build. See this section for some guidelines on how to run tests.

@elacuesta elacuesta closed this May 15, 2020
@elacuesta elacuesta reopened this May 15, 2020
@elacuesta
Copy link
Member

For the record, the last build error is unrelated to this PR 👍

@longkyle
Copy link
Author

@elacuesta Thanks for the tips! I had to rerun one of them because travis-ci said that everything passed but still said "pending" on github. It hung there for a few hours. If that last build error is unrelated then we should be good to go. (although I pushed another commit right before I read your tips face palm)

Anyway, thanks!

@elacuesta
Copy link
Member

Implementation is looking great, but I think we need to cover this more extensively in the docs. It's a useful feature and I think it's a bit hidden, only mentioned once in the allowed keys for the FEEDS setting and without any context. Given that it has no fallback setting, I think a small paragraph after the FEEDS docs could be enough, with a versionadded directive (possibly 2.2 if we include it in the next release, but we can always update later if that's not the case).

Co-authored-by: Adrián Chaves <adrian@chaves.io>
docs/topics/feed-exports.rst Outdated Show resolved Hide resolved
docs/topics/feed-exports.rst Outdated Show resolved Hide resolved
longkyle and others added 2 commits May 20, 2020 14:07
Co-authored-by: Adrián Chaves <adrian@chaves.io>
Co-authored-by: Adrián Chaves <adrian@chaves.io>
@StasDeep
Copy link
Contributor

StasDeep commented Jun 1, 2020

Hello! Just saw this and I like the idea, but I am now thinking maybe it makes more sense to have a filter behavior in the FeedExporter rather than in ItemExporter? It just sounds a bit off to me to make ItemExporter to decide whether an item should be exported or not.

@Gallaecio
Copy link
Member

Sounds good to me.

I don’t think we need to implement support for complex filters in this pull request, though, unless @longkyle wishes to do it. I think here we could simply replace item_classes with filter, and make the necessary changes to move the logic out of the item exporters into the feed exporter.

That would leave room for implementing support for a more complex filter in a later pull request, where we can extend this filter to, in addition of Item subclasses and sequences of them, allow for a filtering component (e.g. the import path of a class supporting from_crawler and from_settings and implementing a filtering method).

@@ -288,8 +294,7 @@ def close_spider(self, spider):
def item_scraped(self, item, spider):
for slot in self.slots:
slot.start_exporting()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this may start export without a need, if an item would be filtered out.

Let's say you have a single FEEDS configuration in settings.py, which defines 2 export locations, split by 2 item classes. There are 2 spiders; one spider emits items of type A, the other emits items of type B. While running a spider, empty (content-wise) file can be created for the item type which spider doesn't emit.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don’t see how we can avoid this. I think in this scenario FEEDS would need to be redefined on each spider, if creating an empty file is an issue.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inside slot.export_item() (called in a next line) there is a if self.exporter._should_export_item(item): check. My point was that it could make sense to call slot.start_exporting() only after this check is successful at least once. It won't fix all scenarios where an empty file can appear, but, if I'm not mistaken, this should fix the problem I described.

@kmike
Copy link
Member

kmike commented Sep 2, 2020

I think that's a great feature, with many potential use cases. It plays nicely with attrs and dataclass items as well.

@codekoriko
Copy link

Works great 👍 however the FEED_STORE_EMPTY settings has no more effect. Feeds without any items in them still have their file created.

@ejulio
Copy link
Contributor

ejulio commented Jan 25, 2021

@longkyle do you intend to keep the work here?
I'd add that we could rename item_classes to match and then it can be: a class, a tuple of classes, a matching function for any complex behavior from the user perspective.
Also if the previous comment about FEED_STORE_EMPTY holds, this is a regression that we need to fix and leave it explicit in the docs as it can break some stuff

@longkyle
Copy link
Author

@ejulio I was initially excited about contributing this feature to the community. It was approved and then just sat here without getting folded into the new releases. Unfortunately, I don't have the time to pick this back up at this point. If anybody else would like to pickup the torch, I'd be happy to see this feature get implemented!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add 'item_classes' key to FEEDS
7 participants