Postprocessing feeds do not work for S3 feed storage #5500
I was able to fix this ad hoc for my project by monkey patching everything with the following code:

    from scrapy.extensions.postprocessing import PostProcessingManager, GzipPlugin

    def read(self, *args, **kwargs):
        # Delegate reads to the underlying storage file
        return self.file.read(*args, **kwargs)

    def seek(self, *args, **kwargs):
        # seek is only called when uploading the finished file
        if hasattr(self.head_plugin, "gzipfile") and not self.head_plugin.gzipfile.closed:
            self.head_plugin.gzipfile.flush()
            # It should be safe to close at this point
            self.head_plugin.gzipfile.close()
        return self.file.seek(*args, **kwargs)

    def close(self):
        # The gzip stream is already closed by PostProcessingManager.seek
        self.file.close()

    PostProcessingManager.read = read
    PostProcessingManager.seek = seek
    GzipPlugin.close = close

However, this code assumes that only GzipPlugin is used and that seek is only called right before writing the file to S3.
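Why the patched seek closes the gzip stream before rewinding can be illustrated with the standard library alone. The following is a minimal sketch, not Scrapy code: the function name gzip_roundtrip and the sample payload are made up for illustration, and the temporary file merely stands in for the one the feed storage holds.

```python
import gzip
import tempfile
import zlib


def gzip_roundtrip(payload: bytes):
    """Write payload through GzipFile into a temp file and report
    (readable_before_close, data_after_close)."""
    with tempfile.NamedTemporaryFile() as tmp:
        gz = gzip.GzipFile(fileobj=tmp, mode="wb")
        gz.write(payload)

        # Before close(): the deflate stream is unterminated and the
        # gzip trailer (CRC32 and length) has not been written yet.
        tmp.seek(0)
        try:
            gzip.decompress(tmp.read())
            readable_before_close = True
        except (EOFError, OSError, zlib.error):
            readable_before_close = False

        # close() flushes the compressor and writes the trailer, but it
        # leaves the file object that was passed in as fileobj open.
        gz.close()
        assert not tmp.closed

        tmp.seek(0)
        data_after_close = gzip.decompress(tmp.read())
    return readable_before_close, data_after_close


sample = b'{"id": 1}\n' * 1000  # hypothetical feed content
readable, restored = gzip_roundtrip(sample)
print("readable before close:", readable)
print("round trip intact after close:", restored == sample)
```

Since closing a GzipFile does not close the file object it wraps, the underlying temporary file can still be rewound and read afterwards, which is what the workaround relies on.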
wctjerry added a commit to wctjerry/crawl-etl-language-exchangers that referenced this issue on Aug 20, 2022:
…is uncompitable with S3 storage: scrapy/scrapy#5500
Description
Example settings:
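The settings from the original report are not preserved here. As a rough sketch only, a configuration of the following shape combines S3 feed storage with the gzip postprocessing plugin; the bucket name and object path are hypothetical placeholders, and AWS credentials are assumed to be configured elsewhere.

```python
# settings.py (sketch): bucket name and path are hypothetical placeholders
FEEDS = {
    "s3://example-bucket/exports/items.jsonl.gz": {
        "format": "jsonlines",
        # Listing a postprocessing plugin makes the feed go through
        # PostProcessingManager before it reaches the storage file.
        "postprocessing": ["scrapy.extensions.postprocessing.GzipPlugin"],
    },
}
```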
This causes an exception:
Apparently scrapy.extensions.postprocessing.PostProcessingManager doesn't fully implement the file protocol. Adding this method to the class:
Causes an exception in a different place:
Apparently boto expects a read() method to be present as well (here).
Tried to add a read() method to scrapy.extensions.postprocessing.PostProcessingManager as well, but I only received an incomplete file. I think this is because gzip.GzipFile uses some buffering, so it only saves the full file when close() is called on it. Since S3FeedStorage internally uses tempfile.NamedTemporaryFile, closing it causes the file to disappear right after creation.
PostProcessingManager needs to be refactored so it can handle BlockingFeedStorage correctly.
Versions
Additional context
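The disappearing file mentioned in the description follows from how tempfile.NamedTemporaryFile behaves by default: the file is deleted as soon as it is closed. A minimal standard-library sketch (the prefix and the written bytes are arbitrary, and this is not the storage code itself):

```python
import os
import tempfile

# With the default delete=True, the file is removed when it is closed.
tmp = tempfile.NamedTemporaryFile(prefix="feed-")
tmp.write(b"exported items")
path = tmp.name

existed_while_open = os.path.exists(path)
tmp.close()
exists_after_close = os.path.exists(path)

print("exists while open:", existed_while_open)
print("exists after close:", exists_after_close)
```

This is why anything that closes the underlying file in order to flush the gzip stream leaves nothing behind to upload.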