FileFeedStorage creates empty file when no items are scraped #872
Comments
@gbirke I have run into the same problem: saving data to files leaves a lot of empty files behind. Your proposal sounds like a good idea to me.
I'm having a similar issue, where the
FileFeedStorage left empty files when no items were scraped. This patch adds a cleanup method to the IFeedStorage interface that will be called by FeedExporter when no items were scraped. Fixes scrapy#872
The bug has not been resolved yet. I don't believe the resolution proposed in PR #2258 was sufficient. Although it may fix the issue of broken JSON, the problem of FileFeedStorage creating empty files remains.
So I hope this issue will be reopened.
While removing the file may cause problems with appending, I think delaying the opening of the file would work well with the appending behavior and prevent the creation of an empty file.
Since there doesn't seem to be much demand for the patch, I am thinking of making my own. However, I am not sure which of the two solutions suggested earlier is better.
My gut says postponing file creation until there is an item to write is the way to go, however ugly it might be. But my opinion may change depending on how ugly things can get 😅
Let's see how the return value of IFeedStorage.open() is used (scrapy/scrapy/extensions/feedexport.py, lines 426 to 450 in 8fbebfa).
Here is the first problem. _FeedSlot is a private class within the feedexport module, so it is easy to change. ItemExporter, on the other hand, is a class outside of the module. I don't yet know the inner workings of Scrapy, so I don't know how much of the source code would have to be modified to change the interface of ItemExporter.
So the simple answer seems to be to let _FeedSlot manage whether the file and the ItemExporter are loaded.
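To make the idea concrete, here is a rough sketch of a slot that postpones opening the file until the first item arrives. This is not the actual _FeedSlot or the patch discussed here: LazyFeedSlot, write_item and close are hypothetical names, and a plain text file stands in for the storage and exporter.

```python
import os
import tempfile


class LazyFeedSlot:
    """Hypothetical stand-in for a feed slot that opens its file lazily."""

    def __init__(self, path):
        self.path = path
        self.file = None  # not opened until the first item arrives
        self.itemcount = 0

    def write_item(self, line):
        # Open in append mode only once there is something to write,
        # so a run without items never touches the filesystem.
        if self.file is None:
            self.file = open(self.path, "a", encoding="utf-8")
        self.file.write(line + "\n")
        self.itemcount += 1

    def close(self):
        # Nothing to close if no item was ever written.
        if self.file is not None:
            self.file.close()
            self.file = None


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "items.jsonl")
        slot = LazyFeedSlot(path)
        slot.close()
        print(os.path.exists(path))  # no items were written, so no file
```

Because the file is opened in append mode only on the first write, an existing feed is left untouched when a run produces no items, which is the appending concern raised above.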
I have finished writing the fix and am checking for conflicts with the tests, but in doing so I have become unsure whether this should be treated as a "fix". At least the docs.
I feel this part is one of the reasons for this confusion.
To me the documentation reads clearly enough: an empty feed is a feed with no items, and That doesn’t mean we cannot improve it, e.g. by explicitly saying that a file without items is not even created. But it means that it feels like considering the creation of a file a bug when
If the current behavior is a bug, then I feel it is more necessary to fix the behavior of other features as well.
That’s a very good question, and I am not entirely sure of what would be best. I think it may make sense to do that, but only for Changing the global default may not be good, since it would be a backward incompatible case… unless there is no single scenario where this bug does not happen, in which case
A modest change in behavior will occur.
Apparently, I can indirectly solve #5633 and directly solve the issue mentioned in this PR while solving this issue.
When no items are scraped, the corresponding file is created nonetheless, because it is created by the storage.open call in FeedExporter.open_spider. This behavior ignores the setting of FEED_STORE_EMPTY when using file export.
My proposal for this would be to add a cleanup method to the IFeedStorage interface. Then FeedExporter.close_spider can call that method before returning in case slot.itemcount is zero and self.store_empty is False. cleanup could also be called internally from the store methods of the IFeedStorage implementations.
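A rough sketch of the cleanup idea, written against stand-in classes rather than Scrapy's real ones: FileStorageSketch and close_feed are hypothetical names, and only the itemcount / store_empty check is taken from the description above. Removing the file only when it is still empty is my own assumption, so that a feed opened for appending keeps its earlier content.

```python
import os
import tempfile


class FileStorageSketch:
    """Hypothetical file storage with the proposed cleanup method."""

    def __init__(self, path):
        self.path = path

    def open(self):
        # Opening in append mode creates the file immediately,
        # which is how the empty file comes to exist.
        return open(self.path, "ab")

    def store(self, file):
        file.close()

    def cleanup(self, file):
        file.close()
        # Assumption: delete only a zero-byte file, so data appended
        # by earlier runs is never lost.
        if os.path.exists(self.path) and os.path.getsize(self.path) == 0:
            os.remove(self.path)


def close_feed(storage, file, itemcount, store_empty):
    """Mimics the check proposed for close_spider; returns True if stored."""
    if itemcount == 0 and not store_empty:
        storage.cleanup(file)
        return False
    storage.store(file)
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "feed.json")
        storage = FileStorageSketch(path)
        handle = storage.open()
        close_feed(storage, handle, itemcount=0, store_empty=False)
        print(os.path.exists(path))  # the empty file has been removed
```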