
Support content-length checking #1938

Closed
jaraco opened this Issue Mar 4, 2014 · 8 comments

@jaraco
Contributor

jaraco commented Mar 4, 2014

I have an application that wants to do some content-length validation based on the Content-Length reported by the server. Currently, if the response has more or fewer bytes than indicated by Content-Length, there's no error, and because of decoding it's not possible to determine the length of the payload actually read. Consider:

#!/usr/bin/env python3
import requests
resp = requests.get('http://en.wikipedia.org/wiki/Snowman', stream=True)
lines = list(resp.iter_lines(decode_unicode=True))
# this will fail if the content contains multi-byte characters
assert sum(map(len, lines)) == int(resp.headers['Content-Length'])

Of course, it would be possible if decoding of content is disabled, but then the reader is responsible for decoding. It would be preferable if there were a hook or attribute to enable the user to do the byte checking without forgoing decoding support.
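To see why the decoded length can't stand in for the byte count, note that a string's length is measured in characters while Content-Length is measured in bytes; any non-ASCII character breaks the equivalence:

```python
# A decoded string's length is counted in characters, not bytes,
# so it cannot be compared against Content-Length (a byte count).
snowman = "\u2603"  # SNOWMAN, a single character
assert len(snowman) == 1
assert len(snowman.encode("utf-8")) == 3  # but three bytes on the wire
```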

Two suggestions:

  • Have requests supply a 'bytes_read' on the Response object, which will be updated on any read operation.
  • Provide a hook in the request allowing the user to supply a custom Response subclass for customizing response handling behavior.
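The first suggestion amounts to threading a byte counter through the read path. A minimal, library-independent sketch of that idea (the names here are hypothetical, not part of any requests API):

```python
def counting_chunks(chunks, counts):
    """Yield chunks unchanged while tallying their byte lengths
    into counts['bytes_read'] (a mutable dict used as the counter)."""
    for chunk in chunks:
        counts['bytes_read'] += len(chunk)
        yield chunk

counts = {'bytes_read': 0}
data = b''.join(counting_chunks([b'abc', b'defgh'], counts))
assert data == b'abcdefgh'
assert counts['bytes_read'] == 8
```

A `bytes_read` attribute on the Response object would be this counter, updated internally on every read.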
@Lukasa


Member

Lukasa commented Mar 4, 2014

Hi there! Thanks for raising this issue!

I assume your application logic is not the same as your example, because if it were I'd simply say you should avoid using stream=True. That way we store the data for you and you can access it at your leisure using Response.content and Response.text.

The bigger problem you're going to have is that we transparently decompress gzipped and deflated data. This means that if you're served such a response your content length checking will fail. I think the key question we have to ask is: what are you trying to achieve with your Content-Length checking? If we can answer that, we'll be able to see whether you're trying to do something that Requests can make easier for you.
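The decompression point can be seen without touching the network: the byte count a server declares describes the compressed body, while the caller is handed decompressed data. A stdlib-only illustration:

```python
import gzip

payload = b"hello world " * 100      # what the caller ultimately sees
compressed = gzip.compress(payload)  # what actually travels over the wire

# A server's Content-Length would describe the compressed body,
# so comparing it to the decompressed length must fail.
assert len(compressed) != len(payload)
assert gzip.decompress(compressed) == payload
```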

@jaraco


Contributor

jaraco commented Mar 5, 2014

My assumption was that by sending a Content-Length, the server is providing some indication of what to expect. If I can make that assumption, then I'd like to validate that expectation at whatever point makes sense. I'd like to be able to do this validation regardless of whether the content is streamed or eagerly loaded, and regardless of whether it was encoded, gzipped, or deflated. I presume the content length has a meaningful value somewhere in the stack; it's at that point I want to capture the actual length.

I'm porting from a urllib2 implementation that assumed bytes on the response and handled all of the decoding aspects. I'm looking forward to leveraging the requests functionality for handling that more richly. I'll grab a real world example from our code of what we are trying to accomplish.

@sigmavirus24


Member

sigmavirus24 commented Mar 5, 2014

The short answer is that we are not going to extend the existing API unless it is either:

  1. Necessary
  2. Likely to be used by more than 90% of the users

It is important to note that the decoding of content takes place in urllib3, if I remember correctly. You also have access to the urllib3 response object as resp.raw, but that's not meant to be used by the typical user of requests. The right place to add this, regardless, is in urllib3: it could keep track of the exact number of bytes read (before decoding) and store that for you. That said, this discussion will certainly be easier after you provide an example for us to look at.

@Lukasa


Member

Lukasa commented Mar 5, 2014

There are two forms of decoding here: decoding from compressed data to bytes, and then decoding from bytes to unicode. The first is done in urllib3, the second in Requests. Because the first is done in urllib3, Requests never sees the gzipped bytes. This means the only way to achieve what you want in requests is to reimplement the convenience methods that urllib3 already has for decompressing data. That just seems silly, so I recommend following @sigmavirus24's advice and getting this into urllib3.

@jaraco


Contributor

jaraco commented Mar 5, 2014

In our older code which used httplib2, we had this class:

class ContentLengthValidator(object):
    """
    Verify the length of content in a line-buffered stream. Used to wrap
    any iterable of sized items. The validator will raise a ValueError
    if the iterator is consumed and the declared length doesn't match,
    or if the consumed length exceeds the declared length.

    >>> sample = ['abc', 'bcd', 'def']
    >>> v = ContentLengthValidator(iter(sample), 9)
    >>> list(v)
    ['abc', 'bcd', 'def']

    >>> v = ContentLengthValidator(iter(sample), 8)
    >>> list(v)
    Traceback (most recent call last):
    ...
    ValueError: Expected 8 bytes, received 9

    >>> import itertools
    >>> v = ContentLengthValidator(itertools.cycle(sample), 1000)
    >>> list(v)
    Traceback (most recent call last):
    ...
    ValueError: Expected 1000 bytes, received 1002
    """
    def __init__(self, stream, declared_length):
        self.stream = stream
        self.declared_length = declared_length
        self.length_read = 0

    def __iter__(self):
        return self

    def __next__(self):
        try:
            data = next(self.stream)
        except StopIteration:
            if self.length_read != self.declared_length:
                self.__fail()
            raise
        self.length_read += len(data)
        if self.length_read > self.declared_length:
            self.__fail()
        return data

    next = __next__  # Python 2 compatibility

    def __fail(self):
        msg = "Expected %(declared_length)d bytes, received %(length_read)d"
        raise ValueError(msg % vars(self))

It would be used to wrap an urllib2 response like so:

resp = urllib2.urlopen(url)
if 'Content-Length' in resp.headers:
    declared = int(resp.headers['Content-Length'])
    resp = ContentLengthValidator(resp, declared)
# load resp as CSV
reader = csv.DictReader(resp, ...)

The wrapper would serve to count the bytes as they passed through the iteration and then raise a ValueError at the end if they did not match the expectation.

In an attempt to update the code for Python 3 support, we desired to use requests to provide higher-level handling (including support for caching through cachecontrol), so I modified the ContentLengthValidator to be Requests specific:

class ContentLengthValidator(object):
    """
    Verify the length of content in a Requests response. The validator will
    raise a ValueError if the number of bytes read doesn't match the
    declared length.
    """
    def __init__(self, resp):
        self.resp = resp
        self.iter_content_orig = self.resp.iter_content
        self.resp.iter_content = self.count_bytes

    def count_bytes(self, *args, **kwargs):
        self.bytes_read = 0
        gen = self.iter_content_orig(*args, **kwargs)
        for res in gen:
            self.bytes_read += len(res)
            yield res

    def iter_lines(self):
        for line in self.resp.iter_lines():
            yield line
        self._check_complete()

    def _check_complete(self):
        if 'content-length' not in self.resp.headers:
            return
        target_length = int(self.resp.headers['content-length'])
        if target_length == self.bytes_read:
            return
        msg = "Expected {target_length} bytes, received {bytes_read}"
        raise ValueError(msg.format(
            target_length=target_length, bytes_read=self.bytes_read))

The wrapper replaces 'iter_content' on the Response object with another iterator that counts the bytes as they're read. This technique is working for us now, but as you can see from trying to read it, it's clumsy. Furthermore, as you've pointed out, it will probably fail if gzip is used. That's why it's important for the requests library to support this functionality: the subtle nuances of validating Content-Length are hard, and if it can be done right once in requests, no one else has to get it wrong.

I'd much rather be able to do something like:

resp = requests.get(url, stream=True)
data = list(resp)
assert resp.bytes_received == int(resp.headers.get('content-length', resp.bytes_received))
@jaraco


Contributor

jaraco commented Mar 5, 2014

Looking into urllib3, it looks like the urllib3 Response object has a _fp_bytes_read attribute which may be just what I need.
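That attribute can be exercised without a network by wrapping an in-memory body in a urllib3 response (a sketch only; `_fp_bytes_read` is a private attribute and could change between urllib3 versions):

```python
import io
from urllib3.response import HTTPResponse

# Wrap an in-memory "socket" body; preload_content=False mirrors stream=True.
body = io.BytesIO(b"x" * 1024)
resp = HTTPResponse(body=body, preload_content=False)
resp.read()

# _fp_bytes_read counts raw bytes pulled from the file pointer,
# before any content decoding is applied.
assert resp._fp_bytes_read == 1024
```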

@jaraco


Contributor

jaraco commented Mar 5, 2014

Given the limited usefulness of validating the length and the fact that the raw length is readily available on the raw object, I'm able to achieve my goals with something like this:

resp = requests.get(url, stream=True)
data = list(resp)
bytes_received = resp.raw._fp_bytes_read
assert bytes_received == int(resp.headers.get('content-length', bytes_received))

And that test should be more accurate than the one we previously had. Thanks for entertaining the issue.

@jaraco jaraco closed this Mar 5, 2014

@Lukasa


Member

Lukasa commented Mar 5, 2014

I'm glad you were able to find something that suits your use-case. Let us know if you find it's particularly troublesome: I'm sure the right place for the fix is in urllib3, but we can help you get it there if you find you need it. =)

Thanks again!
