Navigation Menu

Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FilesPipeline extracted and separated from ImagesPipeline #370

Merged
merged 8 commits into from Sep 3, 2013

Conversation

loucash
Copy link
Contributor

@loucash loucash commented Aug 16, 2013

  • ImagesPipeline interface has not changes
  • ImagesPipeline has been built on top of FilesPipeline
  • tests for FilesPipeline and ImagesPipeline has been separated
  • new tests for FilesPipeline: file_key generation and file expiration testing

FilesPipeline functionality is:

  • file downloading
  • minimize network transfers and file processing, doing stat of the files and determining if file is new, uptodate or expired.

ImagesPipeline functionality is:

  • converting images to JPG
  • image thumbnail generation logic

It has been extracted from ImagesPipelines.
ImagesPipeline is built on top of FilesPipeline and consist only with convert image and thumbnail generation logic.
@dangra
Copy link
Member

dangra commented Aug 21, 2013

@redapple and @whodatninja worked on the same problem in #250, what do you think guys?

I like this patch keeps compatibility for ImagesPipeline API and add tests for FilesPipelines.

#250 also adds some extra settings and functionality to ImagesPipeline that are better covered by an extra patch.

with open(absolute_path, 'w') as f:
f.write(buf.getvalue())

def stat_image(self, key, info):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could be called stat_file

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

spider.crawler.stats.inc_value('file_count', spider=spider)
spider.crawler.stats.inc_value('file_status_count/%s' % status, spider=spider)

### Overradiable Interface
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small type: Overridable

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup, fixed

def get_media_requests(self, item, info):
return [Request(x) for x in item.get('file_urls', [])]

def file_key(self, url):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really sure if my comment should be part of this change, but there was a question the other day on StackOverflow about changing the filename (for ImagesPipeline) based not only on URL, but also based on some info from the originating item. One of my ideas was to additionally pass the request or response to image_key(). It's helpful in case you add meta to Request in get_media_requests()
My answer to http://stackoverflow.com/questions/18081997/scrapy-customize-image-pipeline-with-renaming-defualt-image-name/18083143 suggests something else though

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, I would like to treat those changes rather as a refactorization that extracts FilesPipeline and keep the interface.

I think new functionalities is a separated topic for a separated pull request, probably built on this one.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure.

return checksum

def get_images(self, response, request, info):
key = self.image_key(request.url)
key = self.file_key(request.url)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you need to keep the call to image_key for people having overriden it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup, fixed within last commit

@@ -34,7 +35,7 @@ def tearDown(self):
rmtree(self.tempdir)

def test_image_path(self):
image_path = self.pipeline.image_key
image_path = self.pipeline.file_key
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here also, I would keep self.pipeline.image_key

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I thought about it, and maybe it is better to keep file_key. Let me explain:
ImagesPipeline inherits from FilesPipeline and FilesPipeline internals requires file_key.

If someone, by accident removes "def file_key" from ImagesPipeline then image_key will never be used and results will be wrong.

So, here, by using file_key, you test also a pointer from file_key to image_key in ImagesPipeline.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see.

@dangra
Copy link
Member

dangra commented Aug 26, 2013

everyone involved satisfied with this patch?

@redapple
Copy link
Contributor

Looks ok to me.

@loucash
Copy link
Contributor Author

loucash commented Aug 27, 2013

Looks good to me.

dangra added a commit that referenced this pull request Sep 3, 2013
FilesPipeline extracted and separated from ImagesPipeline
@dangra dangra merged commit f54d5c5 into scrapy:master Sep 3, 2013
@max-arnold
Copy link
Contributor

Has anyone considered changing file_key(), image_key() and thumb_key() methods to accept Request instead of url? This will allow to pass some context from an item by overriding get_media_requests() and using Request.meta.

To do that right now it is necessary to override a lot of methods (especially media_to_download(), media_downloaded() and file_downloaded() in files.py which call file_key() and are quite monolithic):

http://stackoverflow.com/questions/12956653/scrapy-create-folder-structure-out-of-downloaded-images-based-on-the-url-from
http://stackoverflow.com/questions/18081997/scrapy-customize-image-pipeline-with-renaming-defualt-image-name

@redapple
Copy link
Contributor

@max-arnold I had commented in this thread some time ago around those ideas
#370 (diff)

A few ideas I'd like to see were in #250

@dangra
Copy link
Member

dangra commented Dec 17, 2013

This issue pop ups from time to time, and I think it was never addressed because nobody took the task og implementing it in a backward compatible way.

We considered it in the past but only with the Request object because in media_to_download hook the Response is not available.

+1 to merge something like this if it has tests and it's backwards compatible.

@max-arnold
Copy link
Contributor

Is something like this looks fine in terms of backward compatibility (same with image_key and thumb_key)?

    def file_key(self, request):
        if isinstance(request, Request):
            url = request.url
        else:
            url = request
            log.warning('Passing url to file_key() is deprecated, please use Request instance instead')

        media_guid = hashlib.sha1(url).hexdigest()
        media_ext = os.path.splitext(url)[1]
        return 'full/%s%s' % (media_guid, media_ext)

@dangra
Copy link
Member

dangra commented Dec 17, 2013

@max-arnold : no, that isn't enough because subclasses defining its own file_key() expects an url and won't handle a Request.

we need a new method to wrap file_key() and others.

@dangra
Copy link
Member

dangra commented Dec 17, 2013

I'm feeling bad for naming the new methods, what do you think about this names:

  • file_key -> file_path
  • image_key -> file_path # note: same name than file_key is on purpose
  • thumbs_key -> thumbnail_path

@dangra
Copy link
Member

dangra commented Dec 17, 2013

the new methods signature will be:

def file_path(request, response=None, info=None):

@max-arnold
Copy link
Contributor

Ok, I need some time to think and prepare a pull request.

@dangra
Copy link
Member

dangra commented Dec 17, 2013

thx

@dangra
Copy link
Member

dangra commented Dec 17, 2013

@redapple: join the party! what do you think about the new methods and its signatures? are you OK with this approach too?

@max-arnold
Copy link
Contributor

See #490

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants