New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FilesPipeline extracted and separated from ImagesPipeline #370
Conversation
It has been extracted from ImagesPipelines. ImagesPipeline is built on top of FilesPipeline and consist only with convert image and thumbnail generation logic.
…8 changes in MediaPipeline
@redapple and @whodatninja worked on the same problem in #250, what do you think guys? I like this patch keeps compatibility for ImagesPipeline API and add tests for FilesPipelines. #250 also adds some extra settings and functionality to ImagesPipeline that are better covered by an extra patch. |
with open(absolute_path, 'w') as f: | ||
f.write(buf.getvalue()) | ||
|
||
def stat_image(self, key, info): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could be called stat_file
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
spider.crawler.stats.inc_value('file_count', spider=spider) | ||
spider.crawler.stats.inc_value('file_status_count/%s' % status, spider=spider) | ||
|
||
### Overradiable Interface |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Small type: Overridable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yup, fixed
def get_media_requests(self, item, info): | ||
return [Request(x) for x in item.get('file_urls', [])] | ||
|
||
def file_key(self, url): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not really sure if my comment should be part of this change, but there was a question the other day on StackOverflow about changing the filename (for ImagesPipeline) based not only on URL, but also based on some info from the originating item. One of my ideas was to additionally pass the request or response to image_key()
. It's helpful in case you add meta
to Request
in get_media_requests()
My answer to http://stackoverflow.com/questions/18081997/scrapy-customize-image-pipeline-with-renaming-defualt-image-name/18083143 suggests something else though
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, I would like to treat those changes rather as a refactorization that extracts FilesPipeline and keep the interface.
I think new functionalities is a separated topic for a separated pull request, probably built on this one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure.
return checksum | ||
|
||
def get_images(self, response, request, info): | ||
key = self.image_key(request.url) | ||
key = self.file_key(request.url) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you need to keep the call to image_key
for people having overriden it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yup, fixed within last commit
@@ -34,7 +35,7 @@ def tearDown(self): | |||
rmtree(self.tempdir) | |||
|
|||
def test_image_path(self): | |||
image_path = self.pipeline.image_key | |||
image_path = self.pipeline.file_key |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here also, I would keep self.pipeline.image_key
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I thought about it, and maybe it is better to keep file_key. Let me explain:
ImagesPipeline inherits from FilesPipeline and FilesPipeline internals requires file_key.
If someone, by accident removes "def file_key" from ImagesPipeline then image_key will never be used and results will be wrong.
So, here, by using file_key, you test also a pointer from file_key to image_key in ImagesPipeline.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see.
everyone involved satisfied with this patch? |
Looks ok to me. |
Looks good to me. |
FilesPipeline extracted and separated from ImagesPipeline
Has anyone considered changing file_key(), image_key() and thumb_key() methods to accept Request instead of url? This will allow to pass some context from an item by overriding get_media_requests() and using Request.meta. To do that right now it is necessary to override a lot of methods (especially media_to_download(), media_downloaded() and file_downloaded() in files.py which call file_key() and are quite monolithic): http://stackoverflow.com/questions/12956653/scrapy-create-folder-structure-out-of-downloaded-images-based-on-the-url-from |
@max-arnold I had commented in this thread some time ago around those ideas A few ideas I'd like to see were in #250 |
This issue pop ups from time to time, and I think it was never addressed because nobody took the task og implementing it in a backward compatible way. We considered it in the past but only with the Request object because in +1 to merge something like this if it has tests and it's backwards compatible. |
Is something like this looks fine in terms of backward compatibility (same with image_key and thumb_key)?
|
@max-arnold : no, that isn't enough because subclasses defining its own file_key() expects an url and won't handle a Request. we need a new method to wrap |
I'm feeling bad for naming the new methods, what do you think about this names:
|
the new methods signature will be:
|
Ok, I need some time to think and prepare a pull request. |
thx |
@redapple: join the party! what do you think about the new methods and its signatures? are you OK with this approach too? |
See #490 |
FilesPipeline functionality is:
ImagesPipeline functionality is: