S3 CsvItemExporter read of closed file error #5043 #5705
Conversation
Thank you, if this is indeed the issue, this is a great breakthrough! Any chance you can include a test that fails without your change and passes with it? Could this be affecting other exports? It sounds like maybe it is a more general problem that we need to fix in the code that calls exporters rather than in the exporter classes themselves. That said, I did not look into it in detail enough recently to be sure.
Codecov Report
@@ Coverage Diff @@
## master #5705 +/- ##
==========================================
- Coverage 88.94% 88.93% -0.02%
==========================================
Files 162 162
Lines 11002 11004 +2
Branches 1798 1798
==========================================
Hits 9786 9786
- Misses 937 938 +1
- Partials 279 280 +1
I don't think that this is a problem in general, since the JSON exporter is confirmed to work as expected. Although I haven't tested the others, none of them uses an IO stream wrapper; they write directly to the file. Regarding the tests, I am not certain that I will know how to do it, as I have never worked with Twisted or tox. I can take a brief look into that as well, but I cannot promise anything.
Makes sense. I was going to suggest covering this in the docs about creating custom item exporters, but it turns out we have none 🤦.
I must say I am not certain either how to test this. I am surprised it is not caught by existing tests. If you have the time and energy, please have a look. Otherwise, we can try to find the time to look at it ourselves, or try to get other contributors to chip in.
scrapy/exporters.py
Outdated
# Detaching stream in order to avoid file closing.
# The file will be closed with slot.storage.store
# https://github.com/scrapy/scrapy/issues/5043
self.stream.detach()
Does this need to happen on finish exporting, or can it happen on __init__, right after initializing self.stream?
I haven't actually tried that, but I doubt that it will work. From the documentation:
After the raw stream has been detached, the buffer is in an unusable state.
and thus it seems to me that it would make sense to detach the stream in finish_exporting.
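To illustrate why, here is a minimal sketch independent of Scrapy (the file name is made up): once detach() has been called, the wrapper can no longer be written to, so detaching in __init__ would break every subsequent export_item() write.
import io

raw = open("items.csv", "wb")          # hypothetical stand-in for the feed file
stream = io.TextIOWrapper(raw, line_buffering=True)
stream.detach()                        # separates the wrapper from the binary file
try:
    stream.write("a,b,c\n")
except ValueError as exc:
    print(exc)                         # CPython: "underlying buffer has been detached"
raw.write(b"a,b,c\n")                  # the binary file itself is still open and writable
raw.close()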
Maybe it is a race condition: in the tests we are closing file before the exporter class gets destroyed?
I am pretty certain that this is the case, since while testing I managed to avoid the problem without detaching the input stream by sleeping feedexport for a few seconds after the slot is closed, but before a new batch is started:
scrapy/scrapy/extensions/feedexport.py
Lines 437 to 443 in c56caaf
self._close_slot(slot, spider)
slots.append(self._start_new_batch(
    batch_id=slot.batch_id + 1,
    uri=slot.uri_template % uri_params,
    feed_options=self.feeds[slot.uri_template],
    spider=spider,
    uri_template=slot.uri_template,
We might need to add a sleep somewhere in the code in order to get the tests to pass?
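As a rough sketch of the debugging hack described above (not a proposed fix; time.sleep blocks the Twisted reactor and is only useful to confirm the race), the delay would sit between closing the old slot and starting the new batch, presumably giving the asynchronous storage store() call time to read the file before the exporter's stream wrapper is garbage-collected:
import time

# fragment of the batch-rotation code shown above, debugging only
self._close_slot(slot, spider)
time.sleep(2)                          # crude delay; never ship this
slots.append(self._start_new_batch(
    batch_id=slot.batch_id + 1,
    uri=slot.uri_template % uri_params,
    feed_options=self.feeds[slot.uri_template],
    spider=spider,
    uri_template=slot.uri_template,
))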
I tried a few things regarding the test and didn't get anywhere. I think the problem might be that Stubber is not actually sending requests, and thus the test never reaches the part where a closed CSV file is to be read...
No worries, we will try and find the time to work on that test ourselves. Great work!
Thanks!
Fixes #5043
After some debugging I figured out that this is happening because CsvItemExporter uses a TextIOWrapper stream for the creation of the CSVs:
scrapy/scrapy/exporters.py
Lines 209 to 223 in 8004075
class CsvItemExporter(BaseItemExporter):
According to the io documentation, TextIOWrapper inherits from io.IOBase, which by default calls the close() method on the stream on object destruction (__del__).
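A minimal sketch of that behaviour, independent of Scrapy (the file names are made up): when the wrapper is garbage-collected, its close() also closes the wrapped binary file, while detach() keeps the file alive.
import io

raw = open("items.csv", "wb")          # hypothetical stand-in for the feed storage file
wrapper = io.TextIOWrapper(raw, line_buffering=True)
wrapper.write("a,b,c\n")
del wrapper                            # io.IOBase.__del__ -> close() -> also closes raw (CPython)
print(raw.closed)                      # True: a later read by the storage backend fails

raw2 = open("items2.csv", "wb")
wrapper2 = io.TextIOWrapper(raw2, line_buffering=True)
wrapper2.write("a,b,c\n")
wrapper2.flush()
wrapper2.detach()                      # separate the binary file from the text wrapper
del wrapper2
print(raw2.closed)                     # False: the storage code can still read and close the file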
A simple fix would be to call the self.stream.detach() method on finish exporting.
Ps. this is my first open source contribution :)