FilesPipeline extracted and separated from ImagesPipeline #370

loucash · 2013-08-16T16:56:49Z

ImagesPipeline interface has not changed
ImagesPipeline has been built on top of FilesPipeline
tests for FilesPipeline and ImagesPipeline have been separated
new tests for FilesPipeline: file_key generation and file expiration testing

FilesPipeline functionality is:

file downloading
minimizing network transfers and file processing by doing a stat of the files and determining whether a file is new, up to date or expired.
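
The stat-based check described above can be sketched roughly as follows. This is only an illustration, assuming a local filesystem store; the helper name `classify_file` and the 90-day `EXPIRES_DAYS` value are hypothetical and not taken from the patch.

```python
import os
import time

EXPIRES_DAYS = 90  # assumed expiry window, for illustration only


def classify_file(path, expires_days=EXPIRES_DAYS, now=None):
    """Return 'new', 'uptodate' or 'expired' for a stored file.

    Hypothetical helper: it stats the file instead of downloading it again.
    """
    now = time.time() if now is None else now
    try:
        mtime = os.stat(path).st_mtime
    except OSError:
        return 'new'  # nothing stored yet, so it has to be downloaded
    age_days = (now - mtime) / (60 * 60 * 24)
    return 'expired' if age_days > expires_days else 'uptodate'
```

Only files classified as new or expired would need to be requested again; an up-to-date file can be skipped without any network transfer.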

ImagesPipeline functionality is:

converting images to JPG
image thumbnail generation logic

FilesPipeline has been extracted from ImagesPipeline. ImagesPipeline is built on top of FilesPipeline and consists only of the image conversion and thumbnail generation logic.

…8 changes in MediaPipeline

dangra · 2013-08-21T16:58:15Z

@redapple and @whodatninja worked on the same problem in #250, what do you think guys?

I like that this patch keeps compatibility with the ImagesPipeline API and adds tests for FilesPipeline.

#250 also adds some extra settings and functionality to ImagesPipeline that are better covered by an extra patch.

redapple · 2013-08-22T10:22:22Z

scrapy/contrib/pipeline/files.py

+        with open(absolute_path, 'w') as f:
+            f.write(buf.getvalue())
+
+    def stat_image(self, key, info):


could be called stat_file

redapple · 2013-08-22T12:27:36Z

scrapy/contrib/pipeline/files.py

+        spider.crawler.stats.inc_value('file_count', spider=spider)
+        spider.crawler.stats.inc_value('file_status_count/%s' % status, spider=spider)
+
+    ### Overradiable Interface


Small typo: Overridable

redapple · 2013-08-22T12:35:29Z

scrapy/contrib/pipeline/files.py

+    def get_media_requests(self, item, info):
+        return [Request(x) for x in item.get('file_urls', [])]
+
+    def file_key(self, url):


Not really sure if my comment should be part of this change, but there was a question the other day on StackOverflow about changing the filename (for ImagesPipeline) based not only on the URL, but also on some info from the originating item. One of my ideas was to additionally pass the request or response to image_key(). That would help when you add meta to the Request in get_media_requests().
My answer to http://stackoverflow.com/questions/18081997/scrapy-customize-image-pipeline-with-renaming-defualt-image-name/18083143 suggests something else though

Well, I would rather treat these changes as a refactoring that extracts FilesPipeline and keeps the interface.

I think new functionality is a separate topic for a separate pull request, probably built on this one.

redapple · 2013-08-22T13:11:42Z

scrapy/contrib/pipeline/images.py

        return checksum

    def get_images(self, response, request, info):
-        key = self.image_key(request.url)
+        key = self.file_key(request.url)


I think you need to keep the call to image_key for people who have overridden it

yup, fixed in the last commit

redapple · 2013-08-22T13:17:03Z

scrapy/tests/test_pipeline_images.py

@@ -34,7 +35,7 @@ def tearDown(self):
        rmtree(self.tempdir)

    def test_image_path(self):
-        image_path = self.pipeline.image_key
+        image_path = self.pipeline.file_key


here also, I would keep self.pipeline.image_key

Yes, I thought about it, and maybe it is better to keep file_key. Let me explain:
ImagesPipeline inherits from FilesPipeline, and the FilesPipeline internals require file_key.

If someone accidentally removes "def file_key" from ImagesPipeline, then image_key will never be used and the results will be wrong.

So here, by using file_key, you also test the pointer from file_key to image_key in ImagesPipeline.
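
The pointer from file_key to image_key discussed here can be illustrated with a minimal sketch. The classes below are simplified stand-ins, not the code from the patch; `store_key_for` and `CustomImages` are hypothetical names used only to show how the internals reach a user's image_key override.

```python
import hashlib


class FilesPipeline:
    """Stand-in for the real class; only the key logic is sketched."""

    def file_key(self, url):
        media_guid = hashlib.sha1(url.encode('utf-8')).hexdigest()
        return 'full/%s' % media_guid

    def store_key_for(self, url):
        # Hypothetical internal caller: the internals only know file_key().
        return self.file_key(url)


class ImagesPipeline(FilesPipeline):
    def image_key(self, url):
        image_guid = hashlib.sha1(url.encode('utf-8')).hexdigest()
        return 'full/%s.jpg' % image_guid

    def file_key(self, url):
        # Backward-compatible pointer: delegate to image_key() so that
        # subclasses overriding image_key() keep working.
        return self.image_key(url)


class CustomImages(ImagesPipeline):
    """Hypothetical user subclass written against the old interface."""

    def image_key(self, url):
        return 'custom/' + url.rsplit('/', 1)[-1]
```

If the `file_key` method were removed from this `ImagesPipeline`, `store_key_for` would silently fall back to the FilesPipeline key and ignore `CustomImages.image_key`, which is the regression a test going through file_key would catch.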

dangra · 2013-08-26T23:13:23Z

everyone involved satisfied with this patch?

redapple · 2013-08-27T07:31:44Z

Looks ok to me.

loucash · 2013-08-27T07:59:13Z

Looks good to me.

…working

FilesPipeline extracted and separated from ImagesPipeline

max-arnold · 2013-12-17T12:10:07Z

Has anyone considered changing the file_key(), image_key() and thumb_key() methods to accept a Request instead of a url? This would allow passing some context from an item by overriding get_media_requests() and using Request.meta.

To do that right now it is necessary to override a lot of methods (especially media_to_download(), media_downloaded() and file_downloaded() in files.py which call file_key() and are quite monolithic):

http://stackoverflow.com/questions/12956653/scrapy-create-folder-structure-out-of-downloaded-images-based-on-the-url-from
http://stackoverflow.com/questions/18081997/scrapy-customize-image-pipeline-with-renaming-defualt-image-name

redapple · 2013-12-17T12:14:20Z

@max-arnold I had commented in this thread some time ago around those ideas
#370 (diff)

A few ideas I'd like to see were in #250

dangra · 2013-12-17T12:33:33Z

This issue pops up from time to time, and I think it was never addressed because nobody took on the task of implementing it in a backward compatible way.

We considered it in the past, but only with the Request object, because the Response is not available in the media_to_download hook.

+1 to merge something like this if it has tests and it's backwards compatible.

max-arnold · 2013-12-17T13:01:17Z

Does something like this look fine in terms of backward compatibility (same with image_key and thumb_key)?

    def file_key(self, request):
        if isinstance(request, Request):
            url = request.url
        else:
            url = request
            log.warning('Passing url to file_key() is deprecated, please use Request instance instead')

        media_guid = hashlib.sha1(url).hexdigest()
        media_ext = os.path.splitext(url)[1]
        return 'full/%s%s' % (media_guid, media_ext)

dangra · 2013-12-17T13:13:52Z

@max-arnold: no, that isn't enough, because subclasses defining their own file_key() expect a url and won't handle a Request.

we need a new method to wrap file_key() and others.
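
To make the problem concrete, here is a small sketch of how an existing override would break if the internals started passing a Request to file_key(). `Request` is a minimal stand-in for scrapy's class, and `UserPipeline` and `call_with_request` are hypothetical names used only for this illustration.

```python
class Request:
    """Minimal stand-in for scrapy's Request (only .url is modelled)."""

    def __init__(self, url):
        self.url = url


class UserPipeline:
    """Hypothetical pre-existing subclass written against the old API."""

    def file_key(self, url):
        # Treats its argument as a plain string.
        return 'by-name/' + url.rsplit('/', 1)[-1]


def call_with_request(pipeline, request):
    # What would happen if file_key() began receiving Request objects.
    try:
        return pipeline.file_key(request)
    except AttributeError as exc:
        return 'broken: %s' % exc
```

The isinstance check in the base class never runs for such a subclass, so the override fails on the first string operation it applies to the Request.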

dangra · 2013-12-17T13:20:07Z

I'm bad at naming new methods; what do you think about these names:

file_key -> file_path
image_key -> file_path # note: using the same name as for file_key is on purpose
thumb_key -> thumbnail_path

dangra · 2013-12-17T13:23:36Z

the new methods signature will be:

def file_path(self, request, response=None, info=None):
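
A rough sketch of how such a wrapper could stay backward compatible is shown below. It is not the implementation from any pull request: `Request` is a minimal stand-in for scrapy's class, `LegacyPipeline` and `PerItemPipeline` are hypothetical subclasses, and the `'folder'` meta key is invented for the example.

```python
import hashlib
import os


class Request:
    """Minimal stand-in for scrapy's Request (url and meta only)."""

    def __init__(self, url, meta=None):
        self.url = url
        self.meta = meta or {}


class FilesPipeline:
    def file_key(self, url):
        # Legacy hook: still receives a plain url string.
        media_guid = hashlib.sha1(url.encode('utf-8')).hexdigest()
        media_ext = os.path.splitext(url)[1]
        return 'full/%s%s' % (media_guid, media_ext)

    def file_path(self, request, response=None, info=None):
        # New hook: receives the Request and falls back to the legacy
        # hook, so existing file_key() overrides still take effect.
        return self.file_key(request.url)


class LegacyPipeline(FilesPipeline):
    """Old-style subclass: overrides file_key() and keeps working."""

    def file_key(self, url):
        return 'legacy/' + os.path.basename(url)


class PerItemPipeline(FilesPipeline):
    """New-style subclass: uses context carried in Request.meta."""

    def file_path(self, request, response=None, info=None):
        folder = request.meta.get('folder', 'full')  # assumed meta key
        return '%s/%s' % (folder, os.path.basename(request.url))
```

With this shape, the internals would call only file_path(); old subclasses are reached through the default fallback, while new subclasses can read whatever get_media_requests() placed in Request.meta.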

max-arnold · 2013-12-17T13:26:09Z

Ok, I need some time to think and prepare a pull request.

dangra · 2013-12-17T13:28:53Z

thx

dangra · 2013-12-17T13:29:39Z

@redapple: join the party! what do you think about the new methods and their signatures? are you OK with this approach too?

max-arnold · 2013-12-17T15:36:39Z

See #490

loucash added 3 commits August 16, 2013 17:02

FilesPipeline, which enables downloading any files.

76ce8c5

It has been extracted from ImagesPipeline. ImagesPipeline is built on top of FilesPipeline and consists only of the image conversion and thumbnail generation logic.

Test reorganization and new tests for Files and Images Pipelines, PEP…

45ff6ec

…8 changes in MediaPipeline

typo

96077cc