Add 'item_classes' key to FEEDS #4576
Merge branch 'master' of https://github.com/scrapy/scrapy
The idea sounds good; in fact, I think I saw a question about it some time ago. I'm not sure I like the name though, I think it's not clear enough. Would item_classes be better?
@elacuesta Thanks for taking the time to review my code! Thanks again!
Many thanks for the updates. Could you take a look at the Travis failures?
Co-authored-by: Eugenio Lacuesta <1731933+elacuesta@users.noreply.github.com>
Merge branch 'master' of https://github.com/longkyle/scrapy
@elacuesta I've spent quite a bit of time looking into the Travis errors and am not sure what's going on, unfortunately. The tests pass for me locally but seem to fail unpredictably in Travis. Might you be able to take a quick look and just see if anything stands out? Basically, the fields seem to get saved in a different order depending on the test/environment. Attached is an image showing that if I correct for the Travis test, it fails locally, and vice versa.
Yes, that’s because we still support Python 3.5, where the key order of dictionaries is not kept.
Indeed. Another thing you could do would be to use items with only one field, as the
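One way to make such a test independent of dictionary key order is to compare parsed rows instead of raw output strings. The following is only a minimal sketch, assuming the feed under test is CSV; `rows_from_csv` and the sample data are hypothetical and are not taken from the Scrapy test suite.

```python
import csv
import io


def rows_from_csv(text):
    """Parse CSV text into a list of plain dicts (hypothetical test helper).

    Plain dicts compare equal regardless of key order, so the result does
    not depend on the order in which the exporter wrote the columns.
    """
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


# The same single item, exported with the two possible column orders.
output_a = "foo,egg\r\nbar1,spam1\r\n"
output_b = "egg,foo\r\nspam1,bar1\r\n"

same_rows = rows_from_csv(output_a) == rows_from_csv(output_b)
```

The raw strings differ, but the parsed rows are equal, so an assertion written against the parsed rows behaves the same on every supported Python version.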
Codecov Report
@@            Coverage Diff             @@
##           master    #4576      +/-   ##
==========================================
+ Coverage   84.55%   84.75%   +0.19%
==========================================
  Files         164      163       -1
  Lines        9923     9974      +51
  Branches     1475     1487      +12
==========================================
+ Hits         8390     8453      +63
+ Misses       1266     1254      -12
  Partials      267      267
To avoid having to add those 2 lines to every exporter, I think we should consider applying the option through There is a
Co-authored-by: Eugenio Lacuesta <1731933+elacuesta@users.noreply.github.com>
@longkyle Seeing the message for f97b521, I'd like to mention that you can trigger a new build by closing and re-opening the PR. Additionally, I'd recommend running the tests locally to spot any failures before pushing; it's usually much more comfortable and faster than waiting for the Travis build. See this section for some guidelines on how to run tests.
For the record, the last build error is unrelated to this PR 👍
@elacuesta Thanks for the tips! I had to rerun one of them because Travis CI said that everything passed, but the status still showed "pending" on GitHub. It hung there for a few hours. If that last build error is unrelated then we should be good to go. (Although I pushed another commit right before I read your tips, facepalm.) Anyway, thanks!
Implementation is looking great, but I think we need to cover this more extensively in the docs. It's a useful feature and I think it's a bit hidden, only mentioned once in the allowed keys for the
Co-authored-by: Adrián Chaves <adrian@chaves.io>
Hello! Just saw this and I like the idea, but I am now thinking maybe it makes more sense to have the filter behavior in the FeedExporter rather than in ItemExporter? It just sounds a bit off to me to make ItemExporter decide whether an item should be exported or not.
Sounds good to me. I don’t think we need to implement support for complex filters in this pull request, though, unless @longkyle wishes to do it. I think here we could simply replace That would leave room for implementing support for a more complex filter in a later pull request, where we can extend this
@@ -288,8 +294,7 @@ def close_spider(self, spider):
     def item_scraped(self, item, spider):
         for slot in self.slots:
             slot.start_exporting()
I think this may start exporting needlessly if the item ends up being filtered out.
Let's say you have a single FEEDS configuration in settings.py, which defines 2 export locations, split by 2 item classes. There are 2 spiders; one spider emits items of type A, the other emits items of type B. While running either spider, an empty (content-wise) file can be created for the item type that the spider doesn't emit.
I don’t see how we can avoid this. I think in this scenario FEEDS would need to be redefined on each spider, if creating an empty file is an issue.
Inside slot.export_item() (called on the next line) there is an if self.exporter._should_export_item(item): check. My point was that it could make sense to call slot.start_exporting() only after this check has succeeded at least once. It won't fix all scenarios where an empty file can appear, but, if I'm not mistaken, this should fix the problem I described.
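The suggestion of starting the export only once an item passes the check could look roughly like the sketch below. It is not the code from this PR: `ListExporter` and `Slot` are simplified stand-ins written for illustration, and only the method names `start_exporting`, `export_item` and `_should_export_item` come from the discussion.

```python
class ListExporter:
    """Stand-in exporter that records calls (not Scrapy's ItemExporter)."""

    def __init__(self, item_classes=None):
        # None means "accept every item class".
        self.item_classes = item_classes
        self.started = False
        self.items = []

    def _should_export_item(self, item):
        return self.item_classes is None or isinstance(item, self.item_classes)

    def start_exporting(self):
        # In a real exporter this is where output would first be written.
        self.started = True

    def export_item(self, item):
        self.items.append(item)


class Slot:
    """Simplified slot that starts exporting lazily."""

    def __init__(self, exporter):
        self.exporter = exporter
        self._exporting = False

    def start_exporting(self):
        if not self._exporting:
            self.exporter.start_exporting()
            self._exporting = True

    def export_item(self, item):
        # Start the export only after an item has passed the filter.
        if self.exporter._should_export_item(item):
            self.start_exporting()
            self.exporter.export_item(item)


class ItemA(dict):
    """Hypothetical item type emitted by the first spider."""


class ItemB(dict):
    """Hypothetical item type emitted by the second spider."""


slot_for_a = Slot(ListExporter(item_classes=(ItemA,)))
slot_for_a.export_item(ItemB())  # filtered out: the export is never started
started_after_b = slot_for_a.exporter.started
slot_for_a.export_item(ItemA())  # accepted: the export starts now
started_after_a = slot_for_a.exporter.started
```

With this ordering, the caller no longer needs to call slot.start_exporting() before slot.export_item(), and a slot whose filter rejects every item never starts its exporter.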
I think that's a great feature, with many potential use cases. It plays nicely with attrs and dataclass items as well.
Works great 👍 However, the FEED_STORE_EMPTY setting no longer has any effect. Feeds without any items in them still have their file created.
@longkyle do you intend to keep working on this?
@ejulio I was initially excited about contributing this feature to the community. It was approved and then just sat here without getting folded into the new releases. Unfortunately, I don't have the time to pick this back up at this point. If anybody else would like to pick up the torch, I'd be happy to see this feature get implemented!
Fixes #4575
I think it would be nice to be able to filter which items get exported to each URI in FEEDS.
This update allows for an optional 'item_classes' key to be added to FEEDS. If 'item_classes' is not provided, it defaults to None, which includes all item classes. This behavior is meant to mimic the behavior of the 'fields' key.
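As an illustration of the intended semantics, here is a minimal sketch of a FEEDS setting using the new key. The feed URIs, the `ProductItem` and `ReviewItem` classes and the `accepts` helper are hypothetical; passing the classes themselves as a tuple is an assumption made for this example, and the exact accepted value format is whatever the implementation defines.

```python
class ProductItem(dict):
    """Hypothetical item class, for illustration only."""


class ReviewItem(dict):
    """Hypothetical item class, for illustration only."""


FEEDS = {
    # Only ProductItem instances are meant to end up in this feed.
    "products.json": {
        "format": "json",
        "item_classes": (ProductItem,),
    },
    # No 'item_classes' key: treated as None, so every item is exported.
    "everything.jl": {
        "format": "jsonlines",
    },
}


def accepts(feed_options, item):
    """Illustrative helper (not part of Scrapy) mirroring the described default."""
    item_classes = feed_options.get("item_classes")
    if item_classes is None:
        return True
    return isinstance(item, tuple(item_classes))
```

Under these assumptions, a ReviewItem would be written to everything.jl but not to products.json, while a ProductItem would be written to both.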