S3 CsvItemExporter read of closed file error #5043 #5705


Merged
merged 4 commits into scrapy:master on Jun 13, 2023
Conversation

@srki24 commented Nov 4, 2022

Fixes #5043

After some debugging, I figured out that this happens because CsvItemExporter uses a TextIOWrapper stream to create the CSVs:
scrapy/scrapy/exporters.py, lines 209 to 223 at 8004075:

    class CsvItemExporter(BaseItemExporter):

        def __init__(self, file, include_headers_line=True, join_multivalued=',', errors=None, **kwargs):
            super().__init__(dont_fail=True, **kwargs)
            if not self.encoding:
                self.encoding = 'utf-8'
            self.include_headers_line = include_headers_line
            self.stream = io.TextIOWrapper(
                file,
                line_buffering=False,
                write_through=True,
                encoding=self.encoding,
                newline='',  # Windows needs this https://github.com/scrapy/scrapy/issues/3034
                errors=errors,
            )

According to the io documentation, TextIOWrapper inherits from io.IOBase, which by default calls the close() method on the stream when the object is destroyed (__del__()).

A simple fix would be to call the self.stream.detach() method when exporting finishes.
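The failure mode and the fix can both be reproduced with the standard library alone. In this minimal sketch, BytesIO stands in for the file object that the feed storage opens; nothing here is Scrapy-specific:

```python
import io

# When the wrapper is garbage-collected, io.IOBase.__del__ calls
# close(), which also closes the wrapped binary stream.
raw = io.BytesIO()
wrapper = io.TextIOWrapper(raw, encoding="utf-8", write_through=True)
wrapper.write("a,b\n1,2\n")
del wrapper          # wrapper.__del__() -> close() -> raw is closed too
print(raw.closed)    # True

# Calling detach() first severs the link, so the underlying stream
# survives the wrapper's destruction and its contents stay readable.
raw2 = io.BytesIO()
wrapper2 = io.TextIOWrapper(raw2, encoding="utf-8", write_through=True)
wrapper2.write("a,b\n1,2\n")
wrapper2.detach()
del wrapper2
print(raw2.closed)       # False
print(raw2.getvalue())   # b'a,b\n1,2\n'
```

Note that `write_through=True` (which the exporter also uses) means the text already reached the underlying buffer before the detach, so no data is lost.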

P.S. This is my first open source contribution :)

@Gallaecio (Member)

Thank you! If this is indeed the issue, this is a great breakthrough!

Any chance you can include a test that fails without your change and passes with it?

Could this be affecting other exporters? It sounds like it may be a more general problem that we need to fix in the code that calls the exporters rather than in the exporter classes themselves. That said, I have not looked into it in enough detail recently to be sure.

codecov bot commented Nov 4, 2022

Codecov Report

Merging #5705 (426f3eb) into master (c34ca4a) will decrease coverage by 0.02%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #5705      +/-   ##
==========================================
- Coverage   88.94%   88.93%   -0.02%     
==========================================
  Files         162      162              
  Lines       11002    11004       +2     
  Branches     1798     1798              
==========================================
  Hits         9786     9786              
- Misses        937      938       +1     
- Partials      279      280       +1     
Impacted Files                       Coverage Δ
scrapy/exporters.py                  100.00% <100.00%> (ø)
scrapy/core/downloader/__init__.py   90.97% <0.00%> (-1.51%) ⬇️

@srki24 (Author) commented Nov 6, 2022

I don't think this is a problem in general, since the JSON exporter is confirmed to work as expected. Although I haven't tested the others, none of them uses an IO stream; they write directly to the files.

Regarding the tests, I am not certain I will know how to do it, as I have never worked with Twisted or tox.

I can take a brief look into that as well, but I cannot promise anything.

@Gallaecio (Member)

> [None of the other exporters] uses an IO stream; they write directly to the files.

Makes sense.

I was going to suggest covering this in the docs about creating custom item exporters, but it turns out we have none 🤦 .

> Regarding the tests, I am not certain I will know how to do it, as I have never worked with Twisted or tox.
>
> I can take a brief look into that as well, but I cannot promise anything.

I must say I am not certain either how to test this. I am surprised it is not caught by the existing tests.

If you have the time and energy, please have a look at BatchDeliveriesTest in tests/test_feedexport.py, and see if you can figure out why the existing tests did not pick up on this issue. Maybe it is a race condition: in the tests, are we closing the file before the exporter object gets destroyed?

Otherwise, we can try to find the time to look at it ourselves, or try to get other contributors to chip in.

    # Detaching the stream to avoid closing the file.
    # The file will be closed by slot.storage.store.
    # https://github.com/scrapy/scrapy/issues/5043
    self.stream.detach()
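For illustration, here is a stripped-down, runnable sketch of how the detach fits into the exporter. The class and method names mirror scrapy/exporters.py, but this is not the real class, and a BytesIO stands in for the file that the feed slot's storage would later close and read:

```python
import io

class CsvItemExporterSketch:
    """Hypothetical miniature of CsvItemExporter, for demonstration only."""

    def __init__(self, file, encoding="utf-8", errors=None):
        self.stream = io.TextIOWrapper(
            file, write_through=True, encoding=encoding,
            newline="", errors=errors,
        )

    def finish_exporting(self):
        # Detaching the stream to avoid closing the file.
        # The file will be closed by slot.storage.store.
        # https://github.com/scrapy/scrapy/issues/5043
        self.stream.detach()

f = io.BytesIO()
exporter = CsvItemExporterSketch(f)
exporter.stream.write("a,b\n")
exporter.finish_exporting()
del exporter         # the wrapper's destruction no longer closes f
print(f.closed)      # False
print(f.getvalue())  # b'a,b\n'
```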
Member

Does this need to happen on finish exporting, or can it happen in __init__, right after initializing self.stream?

Author

> Does this need to happen on finish exporting, or can it happen in __init__, right after initializing self.stream?

I haven't actually tried that, but I doubt it would work. From the documentation:

> After the raw stream has been detached, the buffer is in an unusable state.

So it seems to me that it makes sense to detach the stream in finish_exporting.
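The documentation's warning is easy to confirm with a stdlib-only check (an illustrative sketch): once detached, any further write on the wrapper raises ValueError, which is why detaching in __init__ would break the exporter before it wrote anything.

```python
import io

raw = io.BytesIO()
wrapper = io.TextIOWrapper(raw, encoding="utf-8")
wrapper.detach()     # buffer is now in an unusable state

error = None
try:
    wrapper.write("a,b\n")  # would be every subsequent export_item() call
except ValueError as exc:
    error = exc
print("write after detach failed:", error)
```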

> Maybe it is a race condition, that in the tests we are closing file before the exporter class gets destroyed?

I am pretty certain that this is the case: while testing, I managed to avoid the problem without detaching the input stream by sleeping the feed export for a few seconds after the slot is closed but before a new batch is started:

    self._close_slot(slot, spider)
    slots.append(self._start_new_batch(
        batch_id=slot.batch_id + 1,
        uri=slot.uri_template % uri_params,
        feed_options=self.feeds[slot.uri_template],
        spider=spider,
        uri_template=slot.uri_template,
    ))

We might need to add a sleep somewhere in order to get the tests to pass?
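The ordering the race depends on can be shown in miniature (a stdlib-only sketch; BytesIO stands in for the batch file): if the wrapper is destroyed before the stored bytes are read back, the read hits a closed file, which is exactly the "read of closed file" error from #5043.

```python
import io

buf = io.BytesIO()
wrapper = io.TextIOWrapper(buf, encoding="utf-8", write_through=True)
wrapper.write("a,b\n")
del wrapper   # exporter destroyed first: the underlying file is closed

error = None
try:
    buf.getvalue()   # the "storage" tries to read the exported bytes back
except ValueError as exc:
    error = exc
print("read of closed file:", error)
```

If the read happens before the wrapper is destroyed (the ordering a sleep would force), it succeeds, which would explain why the existing tests never hit the bug.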

@srki24 (Author) commented Nov 22, 2022

I tried a few things regarding the tests and didn't get anywhere.

I think the problem might be that Stubber does not actually send requests, and thus the test never reaches the part where a closed CSV file would be read...
In short, I give up, since I have no idea how to build a successful test for this.

@Gallaecio (Member)

No worries, we will try to find the time to work on that test ourselves. Great work!

@wRAR (Member) commented Jun 13, 2023

Thanks!

@wRAR wRAR merged commit 9e0bfc4 into scrapy:master Jun 13, 2023
@srki24 srki24 deleted the issue5043-feed_export branch June 18, 2023 19:42
Successfully merging this pull request may close these issues.

S3 CsvItemExporter read of closed file error
3 participants