Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix FeedExporter not to export empty file #5847

Merged
merged 8 commits into from
Jul 22, 2023

Conversation

namelessGonbai
Copy link
Contributor

@namelessGonbai namelessGonbai commented Mar 8, 2023

Fixes #872, fixes #5633

  • Fix FeedExporter not to export empty file
  • Change default value of FEED_STORE_EMPTY

There are several problems with this patch.

  1. There is no additional test.

The explanation for the FeedExporter fix is within the scope of what is already done in the existing tests, so we believe that the fixes to the existing tests are sufficient for testing.
However, I have no idea how to test for changes in defaults, even though I think the magnitude of the effect is clearly significant and needs to be tested.

  1. Meaning of test_export_no_items_multiple_feeds()

This test from #4621.
This test seems to be testing whether a file descriptor leak has been resolved when FEED_STORE_EMPTY=False.
FEED_STORE_EMPTY=False will no longer open the file if the item is missing and this test is no longer meaningful.
This test needs to be fixed (or deleted) because I made a fix that twisted its original meaning to pass the test.

+ Fix FeedExporter not to export empty file
+ Change default value of FEED_STORE_EMPTY
@namelessGonbai
Copy link
Contributor Author

I forgot to mention this as well. Another issue with this patch is whether changing the default values is acceptable.

@codecov
Copy link

codecov bot commented Mar 8, 2023

Codecov Report

Merging #5847 (8b18677) into master (9e0bfc4) will increase coverage by 0.06%.
The diff coverage is 97.27%.

❗ Current head 8b18677 differs from pull request most recent head e71d6d6. Consider uploading reports for the commit e71d6d6 to get more accurate results

@@            Coverage Diff             @@
##           master    #5847      +/-   ##
==========================================
+ Coverage   88.81%   88.88%   +0.06%     
==========================================
  Files         162      162              
  Lines       11186    11253      +67     
  Branches     1819     1827       +8     
==========================================
+ Hits         9935    10002      +67     
  Misses        965      965              
  Partials      286      286              
Impacted Files Coverage Δ
scrapy/__init__.py 84.21% <0.00%> (ø)
scrapy/downloadermiddlewares/httpcache.py 93.97% <ø> (ø)
scrapy/cmdline.py 67.96% <50.00%> (ø)
scrapy/downloadermiddlewares/decompression.py 100.00% <100.00%> (ø)
scrapy/downloadermiddlewares/retry.py 100.00% <100.00%> (ø)
scrapy/extensions/feedexport.py 95.63% <100.00%> (+<0.01%) ⬆️
scrapy/http/headers.py 98.52% <100.00%> (ø)
scrapy/pipelines/files.py 72.00% <100.00%> (ø)
scrapy/settings/default_settings.py 98.77% <100.00%> (+<0.01%) ⬆️
scrapy/utils/datatypes.py 100.00% <100.00%> (ø)
... and 2 more

... and 3 files with indirect coverage changes

@namelessGonbai
Copy link
Contributor Author

I forgot to state this explicitly.
I would like a review of the problems I listed above.

Copy link
Member

@Gallaecio Gallaecio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no idea how to test for changes in defaults, even though I think the magnitude of the effect is clearly significant and needs to be tested.

I don’t think we need a specific test for defaults, I think your test changes already cover the default value change as far as we need.

This test needs to be fixed (or deleted) because I made a fix that twisted its original meaning to pass the test.

I agree, and I think changing expectations is the way to go:

--- a/tests/test_feedexport.py
+++ b/tests/test_feedexport.py
@@ -1057,13 +1057,13 @@ class FeedExportTest(FeedExportTestBase):
                 self._random_temp_filename(): {"format": "csv"},
             },
             "FEED_STORAGES": {"file": LogOnStoreFileStorage},
+            "FEED_STORE_EMPTY": False,
         }
 
         with LogCapture() as log:
             yield self.exported_no_data(settings)
 
-        print(log)
-        self.assertEqual(str(log).count("Storage.store is called"), 3)
+        self.assertEqual(str(log).count("Storage.store is called"), 0)
 
     @defer.inlineCallbacks
     def test_export_multiple_item_classes(self):

Co-authored-by: Adrián Chaves <adrian@chaves.io>
@namelessGonbai
Copy link
Contributor Author

Error. What happened.
The error only occurred with test_batch_path_differ in python 3.8.

@namelessGonbai
Copy link
Contributor Author

Findings.
The test had a potential bug.

settings = {
"FEEDS": {
self._random_temp_filename()
/ "%(batch_time)s": {
"format": "json",
},
},
"FEED_EXPORT_BATCH_ITEM_COUNT": 1,
}

This will overwrite the same file when the file write is fast enough and the batch_time matches.

@namelessGonbai
Copy link
Contributor Author

But if so, isn't this error not limited to this test? Theoretically, if only %(batch_time)s are used, it could occur in practical use as well?
Isn't this at least a larger problem that is outside the scope of this patch?

@namelessGonbai
Copy link
Contributor Author

I can rewrite the whole test to %(batch_id)d if you need to fix it at least, and I'm sure it's the type of error that disappears when you re-run the runner in the first place, though.

@namelessGonbai namelessGonbai changed the title [WIP]Fix FeedExporter not to export empty file Fix FeedExporter not to export empty file Mar 16, 2023
@Gallaecio
Copy link
Member

I think I have seen that test fail before, I would not worry about it here, fixing it is out of the scope of this change.

@namelessGonbai
Copy link
Contributor Author

No merge or additional review yet, any problems?

@wRAR wRAR added this to the Scrapy 2.10 milestone May 4, 2023
@@ -277,18 +277,21 @@ def _store_in_thread(self, file):
class FeedSlot:
def __init__(
self,
file,
exporter,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This now changes a public class, as we made FeedSlot public recently. @kmike @Gallaecio thoughts?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only backward-compatible option I can think of at the moment is to keep the current class unchanged, deprecate it, and create a new class with the new API with a new class name (BatchSlot?).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We decided that as this only changes the __init__ signature and the current usage of this class is to receive it in the feed_slot_closed signal handler, it's fine to just change this.

@Gallaecio Gallaecio requested a review from kmike June 23, 2023 07:21
@kmike
Copy link
Member

kmike commented Jul 22, 2023

Thanks @namelessGonbai, @wRAR and @Gallaecio!

@kmike kmike merged commit af2aa4b into scrapy:master Jul 22, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants