Add support for HTTP response objects to Image.open() #1151

Merged
merged 1 commit into from Mar 28, 2015

Projects

None yet

6 participants

@mfitzp
Contributor
mfitzp commented Mar 26, 2015

HTTP response objects returned from urllib2.urlopen(url) or
requests.get(url, stream=True).raw are 'file-like' but do not
support .seek() operations. As a result PIL is unable to
open them as images, requiring a wrap in cStringIO or BytesIO.

This commit adds this functionality to Image.open() by way of
an .seek(0) check and catch on exception
AttributeError or io.UnsupportedOperation. If this is caught
we attempt to wrap the object using io.BytesIO (which will
only work on buffer-file-like objects).

This allows opening of files using both urllib2 and requests, e.g.

Image.open(urllib2.urlopen(url))
Image.open(requests.get(url, stream=True).raw)

This PR addresses feature request: #1148

@mfitzp mfitzp Add support for HTTP response objects to Image.open()
HTTP response objects returned from `urllib2.urlopen(url)` or
`requests.get(url, stream=True).raw` are 'file-like' but do not
support `.seek()` operations. As a result PIL is unable to
open them as images, requiring a wrap in `cStringIO` or `BytesIO`.

This commit adds this functionality to `Image.open()` by way of
an `.seek(0)` check and catch on exception
`AttributeError` or `io.UnsupportedOperation`. If this is caught
we attempt to wrap the object using `io.BytesIO` (which will
only work on buffer-file-like objects).

This allows opening of files using both `urllib2` and `requests`, e.g.

    Image.open(urllib2.urlopen(url))
    Image.open(requests.get(url, stream=True).raw)
735d342
@mfitzp
Contributor
mfitzp commented Mar 26, 2015

This PR uses io.BytesIO as this is supported from Python 2.6+ (which PIL currently targets)

@aclark4life
Member

Wow nice, thanks

@mfitzp mfitzp referenced this pull request in matplotlib/matplotlib Mar 26, 2015
Closed

Simplify handling of remote JPGs #4252

@aclark4life
Member

Looks harmless to me, any thoughts @wiredfool @hugovk ?

@hugovk
Member
hugovk commented Mar 28, 2015

Looks ok. How about adding the example code to the documentation somewhere?

It's probably not worth the effort mocking HTTP to add unit tests.

@aclark4life aclark4life merged commit aaa26f3 into python-pillow:master Mar 28, 2015

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
@aclark4life aclark4life added a commit that referenced this pull request Mar 28, 2015
@aclark4life aclark4life Document #1151 [ci skip] b3ea34f
@aclark4life
Member

@hugovk Thanks, done. Added to release notes for 2.8.0. Thanks again @mfitzp! I'm excited about this 🍻 Any security implications here @wiredfool? I don't particularly want to be patching this feature in a future hotfix 😉

@mjpieters
Contributor

Great feature!

One note: if the response uses content-encoding (compression, either gzip or deflate) then this'll still fail as both the urllib2 and requests raw file object will produce compressed data in that case. Using Content-Encoding on images is rather non-sensical as most images are already compressed, but it can still happen.

For requests the work-around is to set the decode_content attribute on the raw object to True:

response = requests.get(url, stream=True)
response.raw.decode_content = True
image = Image.open(response.raw)

Handy information if the new functionality is being documented somewhere!

@aclark4life
Member

Thanks @mjpieters, added to release notes in 174d9ac

@apollo13

As metioned on the ml: If the file in question is an actual file on the disk, the seek call will add an extra syscall, not sure if that is wanted. Also it changes the semantics of how Pillow works. For instance, I do know that I had code somewhere where I purposely seeked a bit into the file to get over some additional metadata information which was added by another system…

@mfitzp
Contributor
mfitzp commented Mar 30, 2015

@apollo13 There is already another seek(0) just after this point before moving into the factory functions for creating images:

fp.seek(0)
im = factory(fp, filename)

So if you called open() with a file object not at zero, it will end up back there before the image data is loaded. I think the only difference would be where the 16-byte prefix is read from?

Are there any non-syscall ways to test for a seekable file? It's the fp.seek(0) that fails for not-quite-real file objects and so requires them to be wrapped (which is why it seemed like a good test).

@wiredfool
Member

+1 in general.

I don't think that this is a potential security issue, as we're just wrapping one file like object in another one. It's essentially an automatic version of what is generally recommended for processing already in memory items. We're not doing dangerous things like interpreting string objects as images or trying images as paths.

I think that there might be 2 potential performance issues:

  • The exta seek. Minor, but it's going to exist on every image open. It also may be optimized out if the interpreter knows that the fp is at position 0, or it may not and may force a resync of the file object to it's backing fp.
  • We're going to be creating an extra copy of the entire file in memory before doing anything with the data (including deciding if it's a valid image). If what we're reading in is not actually an image, our current method reads about 16 bytes, this one will potentially duplicate gigs of junk in memory before deciding that it's not a valid image. It doesn't look like there's a way around this with the io.* classes in 2.7 at least.
@mfitzp
Contributor
mfitzp commented Mar 30, 2015

Looking at the extra seek issue the io.IOBase class actually has a .seekable() check that returns False if the IO object is non-seekable. This allows us to test (on classes derived from io.IOBase without actually initiating a seek. This is implemented in both requests and in Python 3 urllib, but unfortunately is not implemented on urllib2 objects. However we can use that absence as an alternative test.

In regard to the issue of loading the entire file (duplicating in memory) before we know if it's possible to even open it (rather than testing the first 16 bytes first) this should also be 'fixable' by wrapping the file just before it enters the factory. For example we can set a flag requires_iobytes_wrapper following the initial check and test for it later.

The alternative code then would look something like this:

if mode != "r":
    raise ValueError("bad mode %r" % mode)

if isPath(fp):
    filename = fp
    fp = builtins.open(fp, "rb")
else:
    filename = ""

try:
    requires_iobytes_wrapper = not fp.seekable()
except:
    requires_iobytes_wrapper = True

prefix = fp.read(16)

preinit()

for i in ID:
    try:
        factory, accept = OPEN[i]
        if not accept or accept(prefix):

            if requires_iobytes_wrapper:
                fp = io.BytesIO(fp.read())

            fp.seek(0)
            im = factory(fp, filename)
            _decompression_bomb_check(im.size)
            return im
    except (SyntaxError, IndexError, TypeError):
        # import traceback
        # traceback.print_exc()
        pass

If there is a preference for taking this approach re-open the PR / I'm happy to submit a new commit.

@aclark4life
Member

Is it possible to reopen the PR? If so, I don't see how e.g. http://stackoverflow.com/a/12838195/185820. Otherwise, yes seems like a good idea to add those things.

@mfitzp mfitzp deleted the mfitzp:open-http-image branch Mar 30, 2015
@mfitzp
Contributor
mfitzp commented Mar 30, 2015

Just realised a fatal flaw in this approach. The prefix read of 16 bytes means that is chopped off in the subsequent read into io.BytesIO and the factory fails. Should be possible to work around this though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment