Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FileFeedStorage creates empty file when no items are scraped #872

Closed
gbirke opened this issue Sep 2, 2014 · 15 comments · Fixed by #5847
Closed

FileFeedStorage creates empty file when no items are scraped #872

gbirke opened this issue Sep 2, 2014 · 15 comments · Fixed by #5847

Comments

@gbirke
Copy link

gbirke commented Sep 2, 2014

When no items are scraped, the corresponding file is created none the less, because it is created in by the storage.open call in FeedExporter.open_spider. This behavior ignores the the setting of FEED_STORE_EMPTY when using file export.

My proposal for this would be to add a cleanup method to the IFeedStorage interface. Then FeedExporter.close_spider can call that method before returning in case slot.itemcount is zero and self.store_empty is False. cleanup could also be called internally from the store methods of the IFeedStorage implementations.

@michael-yin
Copy link

@gbirke I have met the same problem, there are so many empty files when you save data to files, and your proposal sounds like a good idea to me. IFeedStorage need to be added a new method called in close_spider of FeedExporter

@dfockler
Copy link

I'm having a similar issue, where the JSONItemExporter will write out the opening [ when calling it's start_exporting method but if there are no items scraped, it's finish_exporting method never gets called, and the closing ] is never added, so any resources consuming that JSON file will throw errors because of the invalid JSON format.

gbirke added a commit to gbirke/scrapy that referenced this issue Jan 10, 2015
FileFeedStorage left empty files when no items were scraped. This patch
adds a cleanup method to the IFeedStorage interface that will be called
by FeedExporter when no items were scraped.

Fixes scrapy#872
redapple added a commit to redapple/scrapy that referenced this issue Sep 16, 2016
@namelessGonbai
Copy link
Contributor

The bug has not been resolved yet. I don't believe the resolution proposed in the PR #2258 was sufficient. Although it may fix the issue of broken JSON, the problem of FileFeedStorage creating empty files remains.
This is because FileFeedStorage.open open the actual file within it, which generates the file even if there is no writing happening.
In my opinion, one of the following is necessary:

  1. Only open the file when the first item is received, similar to how start_exporting works.
  2. Make FileFeedStorage use a TemporaryFile like BlockingFeedStorage does.

@namelessGonbai
Copy link
Contributor

So I hope this issue will be reopened.

@namelessGonbai
Copy link
Contributor

#916 (comment)

The problem with removing file is that it doesn't play nicely with the default behavior of appending to the result.

While removing the file may cause problems with appending, I think delaying the direct opening of the file would be able to work well with the appending behavior and prevent the creation of an empty file.

@Gallaecio Gallaecio reopened this Feb 24, 2023
@namelessGonbai
Copy link
Contributor

Since there doesn't seem to be much demand for the patch, I am thinking of making my own patch. However, I am not sure which of the two solutions suggested earlier is better.
Creating a temporary file and writing to it seems like unnecessary overhead. On the other hand, opening the file for the first time at the moment of writing and trying to write to it would require a major change in logic and look bad.

@Gallaecio
Copy link
Member

My gut says postponing file creation until there is an item to write is the way to go, however ugly it might be. But my opinion may change depending on how ugly things can get 😅

@namelessGonbai
Copy link
Contributor

Let's see how the return value of IFeedStorage.open() is used.

file = storage.open(spider)
if "postprocessing" in feed_options:
file = PostProcessingManager(
feed_options["postprocessing"], file, feed_options
)
exporter = self._get_exporter(
file=file,
format=feed_options["format"],
fields_to_export=feed_options["fields"],
encoding=feed_options["encoding"],
indent=feed_options["indent"],
**feed_options["item_export_kwargs"],
)
slot = _FeedSlot(
file=file,
exporter=exporter,
storage=storage,
uri=uri,
format=feed_options["format"],
store_empty=feed_options["store_empty"],
batch_id=batch_id,
uri_template=uri_template,
filter=self.filters[uri_template],
)

Here is the first problem.
The _FeedSlot is a private class within feedexporter, so it is easy to fix. ItemExporter, on the other hand, is a class outside of the module. I don't yet know the inner workings of scrapy, so I don't know how much modification to the source code would be required to change the interface of the ItemExporter.

@namelessGonbai
Copy link
Contributor

Then it seems like a simple answer to let _FeedSlot manage whether the file and ItemExporter are loaded.
The _FeedSlot is given the necessary settings to open the file and initialize the ItemExporter in the constructor, and then processes them when the file or ItemExporter is first requested.

@namelessGonbai
Copy link
Contributor

I have finished creating the fix and am checking for conflicts with the test, but in doing so I am not sure if this should be done as a "fix".
Is this something that should be a destructive change to FEED_STORE_EMPTY as a bug...? Or should I create a new option to not export files when empty feeds?

At least the docs.

FEED_STORE_EMPTY
Default: False
Whether to export empty feeds (i.e. feeds with no items).

I feel this part is one of the reasons for this confusion.
What does "Whether" mean?
Write out an empty feed or nothing?
Does "Nothing" mean no file is created? Does it mean an empty file is created?

@Gallaecio
Copy link
Member

To me the documentation reads clearly enough: an empty feed is a feed with no items, and FEED_STORE_EMPTY=False should cause such feeds to not be exported.

That doesn’t mean we cannot improve it, e.g. by explicitly saying that a file without items is not even created. But it means that it feels like considering the creation of a file a bug when FEED_STORE_EMPTY=False seems OK to me.

@namelessGonbai
Copy link
Contributor

If the current behavior is a bug, then I feel it is more necessary to fix the behavior of other features as well.
This fix results in the command line option -O or --overwrite-output no longer creating an empty file, which causes the test to fail.
It is easy to modify the test to a test case such that the item exists, but I feel that is not testing an edge case.
When fixing this Issue as a bug, it feels more natural to implicitly set FEED_STORE_EMPTY=True when --output or --overwrite-output, or to set the default for FEED_STORE_EMPTY to True. What would you suggest?

@Gallaecio
Copy link
Member

Gallaecio commented Feb 27, 2023

It feels more natural to implicitly set FEED_STORE_EMPTY=True when --output or --overwrite-output, or to set the default for FEED_STORE_EMPTY to True. What would you suggest?

That’s a very good question, and I am not entirely sure of what would be best. I think it may make sense to do that, but only for -O, not for -o, and in any case it should be possible to override with -s FEED_STORE_EMPTY=False.

Changing the global default may not be good, since it would be a backward incompatible case… unless there is no single scenario where this bug does not happen, in which case FEED_STORE_EMPTY=True has been the de-facto default at least since the bug was introduced, and making it the actual default as we fix the bug would be less backward-incompatible.

@kmike @wRAR Any thoughts?

@namelessGonbai
Copy link
Contributor

unless there is no single scenario where this bug does not happen, in which case FEED_STORE_EMPTY=True has been the de-facto default at least since the bug was introduced,

A modest change in behavior will occur.
Whereas before a completely empty file was generated, that modification will now generate a formatted empty representation, e.g., [] for json, xml declaration and empty elements for xml, etc.
However, my personal opinion is that empty files were never really expected in the first place.

@namelessGonbai
Copy link
Contributor

Apparently, I can indirectly solve #5633 and directly solve the issue mentioned in this PR while solving this issue.
The creation of empty files during batch testing is essentially the same problem as this issue, and could be solved at the same time by adding an exception to match the batch process.
Since the empty file creation will no longer occur, there is no need for Windows exception handling related to this behavior.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
6 participants