
Add `enforce_content_length` for responses #949

Merged
merged 2 commits into urllib3:master from nateprewitt:723_strict_content_length on Aug 29, 2016

4 participants
@nateprewitt
Contributor

nateprewitt commented Aug 17, 2016

So here's a pass at #723. This is kind of a weird edge case, but it's particularly prominent in the default configuration of Requests. Most calls performed by urllib3 will raise an IncompleteRead error from httplib when the number of bytes in the body doesn't match the Content-Length header.

The Skinny

httplib raises IncompleteRead appropriately everywhere except on incrementally read data. Incremental reads are exactly how Requests primarily uses urllib3: calling urlopen with preload_content=False and then reading with iter_content(). Retrieving data this way hits the flaw in httplib. I've added a flag to enable this enforcement, so as not to break existing stream(amt) and read(amt) calls. In the next major release, I'd advise removing the flag to make all read operations uniform by default.

Notes:

  • My unfamiliarity with the testing harness is definitely showing in test_strict_content_length but the test does prove the changes are working correctly. Tornado won't allow you to send uneven data, so this was the only other solution I could come up with. Any suggestions on alternative methods of simulating this problem would be appreciated.
  • I implemented length as a property to match the attribute nature of httplib.HTTPResponse.length. I realize a plain int that we modify may be preferred to a property, but I felt that approach would be more likely to break, since we'd have to update the int everywhere IO might happen in the code.
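The property approach described above can be sketched as follows (a minimal illustration with hypothetical names, not urllib3's actual attributes): the remaining length is derived on demand from one byte counter, so only the read path has to be instrumented.

```python
class ResponseSketch:
    """Toy sketch: derive remaining length from a single read counter,
    rather than mutating an int in every IO path."""

    def __init__(self, content_length):
        self._initial_length = content_length  # None when undeclared
        self._fp_bytes_read = 0

    def _record_read(self, data):
        # The only place that needs updating when bytes are consumed.
        self._fp_bytes_read += len(data)

    @property
    def length(self):
        if self._initial_length is None:
            return None
        return max(0, self._initial_length - self._fp_bytes_read)
```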
@Lukasa


Contributor

Lukasa commented Aug 17, 2016

Nate: while you're working on a fix for this I'll hold off from reviewing. No reason beyond the fact that I'm trying to manage my workload!

@nateprewitt


Contributor

nateprewitt commented Aug 17, 2016

Totally! I think I've got the fix worked out, but doing some final testing. No rush on this at all.

@nateprewitt nateprewitt force-pushed the nateprewitt:723_strict_content_length branch 5 times, most recently from 6fc439a to a92dfda Aug 17, 2016

@nateprewitt


Contributor

nateprewitt commented Aug 17, 2016

Keep hitting socket overlap, can someone kick the tests when they get a chance? Thanks!

@Lukasa


Contributor

Lukasa commented Aug 18, 2016

Restarted.

@nateprewitt nateprewitt force-pushed the nateprewitt:723_strict_content_length branch 4 times, most recently from 09addb7 to 1632b96 Aug 22, 2016

"""
CONTENT_DECODERS = ['gzip', 'deflate']
REDIRECT_STATUSES = [301, 302, 303, 307, 308]
def __init__(self, body='', headers=None, status=0, version=0, reason=None,
strict=0, preload_content=True, decode_content=True,
original_response=None, pool=None, connection=None, retries=None):
original_response=None, pool=None, connection=None,
strict_content_length=False, retries=None, request_method=None):


@nateprewitt

nateprewitt Aug 23, 2016

Contributor

Should request_method be replaced with **response_kw so we're not defining minimal contract params?


@Lukasa

Lukasa Aug 24, 2016

Contributor

I don't think so. However, please put strict_content_length after retries.

@@ -327,9 +360,15 @@ def read(self, amt=None, decode_content=None, cache_content=False):
                 # no harm in redundantly calling close.
                 self._fp.close()
                 flush_decoder = True
+                if (self.strict_content_length and self.length not in (0, None)):
+                    # httplib requires an iterator/string instead of a count of bytes
+                    data = [0 for x in range(self._fp_bytes_read)]


@nateprewitt

nateprewitt Aug 23, 2016

Contributor

This is awkward. This error is used everywhere else this issue occurs, though, so it feels off to implement another error just to avoid the need for this iterator.


@nateprewitt

nateprewitt Aug 23, 2016

Contributor

Added a custom exception in 2015f77; it has slightly different semantics (taking an int instead of an iterator as the first arg). This seems like it might cause confusion, but it avoids creating a possibly huge list on large streams just to reuse the httplib exception. I'm +.5 on this and open to alternatives I'm missing.

@nateprewitt nateprewitt force-pushed the nateprewitt:723_strict_content_length branch 2 times, most recently from 7bf3882 to 5c8becc Aug 23, 2016

@nateprewitt


Contributor

nateprewitt commented Aug 23, 2016

Alright, I think this is ready for a glance whenever you have a spare moment @Lukasa.

This still has some rough edges, so I left a few comments inline, as well as my initial comment in the opening post.

Thanks!

@nateprewitt nateprewitt force-pushed the nateprewitt:723_strict_content_length branch 3 times, most recently from 5d087f6 to 2015f77 Aug 23, 2016

    def test_length_when_chunked(self):
        headers = {'content-length': '5',
                   'transfer-encoding': 'chunked'}


@Lukasa

Lukasa Aug 24, 2016

Contributor

This combination of headers is forbidden by RFC 7230 Section 3.3.2:

A sender MUST NOT send a Content-Length header field in any message that contains a Transfer-Encoding header field.


@nateprewitt

nateprewitt Aug 24, 2016

Contributor

I added this test because the initial httplib length logic checks to make sure things aren't chunked. I view this test in the same vein as receiving a Content-Length of "foo". It shouldn't happen but do we want to actually return the content length in the event urllib3 happens to receive both? I'd say no, the length should be None because it can't be determined.

Alternatively, we could raise an exception here but I'm not sure if that would buy us anything useful other than aborting the operation.


@Lukasa

Lukasa Aug 24, 2016

Contributor

Yeah, this boils down to "how do we want to deal with this"? The options are as follows:

  1. Raise an exception explaining what went wrong. This fails fast and clearly.
  2. Fall back to no content-length. That means we'll treat the body as chunked. If it's not, we'll fail fast (IncompleteRead, usually). If it is, everything works.

I think that's probably ok in this case, but I'd like a comment explaining the rationale.
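Option 2 could look roughly like the helper below (an illustrative sketch with a hypothetical function name, not urllib3's actual _init_length): when both headers appear, log a warning, ignore Content-Length, and report the length as undeterminable so the body is treated as chunked.

```python
import logging

log = logging.getLogger(__name__)


def declared_length(headers):
    """Sketch: resolve the declared body length from response headers.

    RFC 7230 s3.3.2 forbids sending Content-Length alongside
    Transfer-Encoding; if both appear, fall back to chunked handling
    (length unknown) and warn. Assumes lowercase header keys.
    """
    transfer_encoding = headers.get("transfer-encoding", "")
    length = headers.get("content-length")
    if "chunked" in transfer_encoding.lower():
        if length is not None:
            log.warning("Received response with both Transfer-Encoding: "
                        "chunked and Content-Length; ignoring Content-Length.")
        return None
    return int(length) if length is not None else None
```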


@Lukasa

Lukasa Aug 24, 2016

Contributor

Possibly also a log at warning level to explain what we're doing.

resp = HTTPResponse(fp, headers=headers, preload_content=False)
data = resp.stream(2)
next(data)
assert resp.length == 3


@Lukasa

Lukasa Aug 24, 2016

Contributor

Given this behaviour, would it be more apt to call this field length_remaining?


@nateprewitt

nateprewitt Aug 24, 2016

Contributor

I'm fine with renaming it, I was just trying to keep this in sync with httplib.HTTPResponse.length so you can call length on either object and get back the same result.


@Lukasa

Lukasa Aug 24, 2016

Contributor

Hrrrrm. I don't think there's any requirement for compatibility in this case. =)

assert len(data) == 0
test_get(self)
test_head(self)


@Lukasa

Lukasa Aug 24, 2016

Contributor

For the sake of clarity, I think I'd advise splitting these into two separate tests if possible.


@nateprewitt

nateprewitt Aug 24, 2016

Contributor

I'd definitely like to do that, I wasn't sure if there was a way to reuse the socket_handler without having to rewrite it for each test? I can if needed, I just didn't want to add redundant bloat if it wasn't needed.


@Lukasa

Lukasa Aug 24, 2016

Contributor

Sure, you can hoist the function up to e.g. module scope.

@@ -193,6 +193,14 @@ class ResponseNotChunked(ProtocolError, ValueError):
     pass


+class IncompleteRead(HTTPError):


@Lukasa

Lukasa Aug 24, 2016

Contributor

This should probably inherit from httplib's IncompleteRead error class, which we already sometimes raise.
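A sketch of that suggestion (illustrative, and using http.client, the Python 3 name for httplib): subclass the stdlib exception but store byte counts rather than the partial-body bytes, overriding __repr__ so the stdlib's len()-based formatting isn't hit.

```python
from http.client import IncompleteRead as HTTPLibIncompleteRead


class IncompleteRead(HTTPLibIncompleteRead):
    """Sketch: same class name and inheritance as discussed, but with
    int semantics -- `partial` is a count of bytes read, `expected` a
    count of bytes still owed. Callers catching the stdlib exception
    still catch this one."""

    def __init__(self, partial, expected):
        # The stdlib base stores these on self.partial / self.expected.
        super(IncompleteRead, self).__init__(partial, expected)

    def __repr__(self):
        # Override: the stdlib __repr__ calls len(self.partial), which
        # would fail now that partial is an int.
        return ('IncompleteRead(%i bytes read, %i more expected)'
                % (self.partial, self.expected))
```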


@nateprewitt

nateprewitt Aug 24, 2016

Contributor

Ok, great.

                or request_method == 'HEAD'):
            length = 0

        return length


@Lukasa

Lukasa Aug 24, 2016

Contributor

For now this is fine. In the future, I wonder if we shouldn't be more aggressive about refusing to serve responses with bogus content lengths.


@nateprewitt

nateprewitt Aug 24, 2016

Contributor

I'm ok with raising exceptions in the event of a bad header, but I think it should be done uniformly for the issues you brought up in the other comments. Returning None would only apply for Transfer-Encoding: chunked at that point. I guess as long as we leave the strict_content_length flag in place, this wouldn't be breaking.


@nateprewitt

nateprewitt Aug 24, 2016

Contributor

We should also definitely change the name, probably to your proposed length_remaining so people don't confuse it with length in httplib.


@Lukasa

Lukasa Aug 24, 2016

Contributor

I think in the first instance we should be tolerant of these mistakes, and add functionality gradually. Let's build this up out of small incremental steps. =)


@nateprewitt

nateprewitt Nov 10, 2016

Contributor

@Lukasa are you seeing the more aggressive content-length validation as part of a urllib3 2.0/#someday feature, or something that would be integrated before a major release (i.e. in 1.20+)? It seems like a breaking change, especially if people rely on the lenient behaviour for testing purposes. I wanted to record a few of the changes that were discussed here but deferred to later PRs.


@nateprewitt

nateprewitt Nov 10, 2016

Contributor

Nevermind, disregard this. The gain probably doesn't outweigh the loss. This would be breaking, it doesn't really provide significant benefit to the user, and will cripple some of the "do anything" functionality choices Kenneth wanted to preserve in Requests.

        length = self.headers.get('content-length')

        if length is not None and not self.chunked:
            try:
                length = int(length)


@Lukasa

Lukasa Aug 24, 2016

Contributor

This is not quite good enough, as it leads us to subtly disregard RFC 7230 section 3.3.2:

If a message is received that has multiple Content-Length header fields with field-values consisting of the same decimal value, or a single Content-Length header field with a field value containing a list of identical decimal values (e.g., "Content-Length: 42, 42"), indicating that duplicate Content-Length header fields have been generated or combined by an upstream message processor, then the recipient MUST either reject the message as invalid or replace the duplicated field-values with a single valid Content-Length field containing that decimal value prior to determining the message body length or forwarding the message.

We should aim to handle this case either by rejecting all invalid content lengths, or by handling this case specially. Doing neither is pretty bad though. =(


@nateprewitt

nateprewitt Aug 24, 2016

Contributor

By rejecting, do you mean raising an exception in that event instead of setting length to None?

If we want to handle the case (which I think would be better than the exception), something like:

single_length = length.split(',')[0].strip()
length = int(single_length)

should work, right?


@Lukasa

Lukasa Aug 24, 2016

Contributor

Yes, rejecting would be raising an exception. =)

As to that approach, no, not quite. That also violates RFC 7230 which only wants us to accept the header if the entire list is identical. That is, we want more like this:

lengths = {val.strip() for val in header_val.split(',')}
if len(lengths) > 1:
    raise StupidHeaderBlockError("What are you thinking?")
length = int(lengths.pop())
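As a runnable version of that check (names are illustrative; the exception class here is a stand-in, and note a set must be popped rather than indexed):

```python
def parse_content_length(header_val):
    """Sketch of the RFC 7230 s3.3.2 handling discussed above: accept a
    duplicated Content-Length only when every listed value is identical;
    otherwise reject the message as invalid."""
    lengths = {val.strip() for val in header_val.split(',')}
    if len(lengths) > 1:
        raise ValueError("Conflicting Content-Length values: %r" % header_val)
    # Sets don't support indexing, so pop the single remaining value.
    return int(lengths.pop())
```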
@Lukasa


Contributor

Lukasa commented Aug 24, 2016

Cool, this is a really good patch so far! I've added some notes here for strictness and other things which I think are fairly important: let me know if you have thoughts!

@nateprewitt nateprewitt force-pushed the nateprewitt:723_strict_content_length branch 2 times, most recently from 1b05327 to d442647 Aug 25, 2016

fp = BytesIO(b'12345')
resp = HTTPResponse(fp, headers=headers, preload_content=False)
assert resp.length_remaining == 5


@sigmavirus24

sigmavirus24 Aug 27, 2016

Member

assertEqual

@@ -327,9 +389,19 @@ def read(self, amt=None, decode_content=None, cache_content=False):
# no harm in redundantly calling close.
self._fp.close()
flush_decoder = True
if (self.strict_content_length and
self.length_remaining not in (0, None)):


@sigmavirus24

sigmavirus24 Aug 27, 2016

Member

This doesn't look to follow the style of the rest of the project and should be indented more. I suspect you're also following PEP8 with a line length of 79. That is not what this project follows.

@sigmavirus24


Member

sigmavirus24 commented Aug 27, 2016

So I'm looking at our response class and we only initialize a decoder when the content-encoding specifies gzip or deflate. So we shouldn't always have a decoder to lean on. Why then is all of the logic in the DeflateDecoder class? Finally, a bit of a yak to shave: I'd prefer this flag be "enforce_content_length" rather than "strict_content_length". It's more descriptive.

@nateprewitt


Contributor

nateprewitt commented Aug 27, 2016

@sigmavirus24 could you clarify this statement really quickly?

So I'm looking at our response class and we only initialize a decoder when the content-encoding specifies gzip or deflate. So we shouldn't always have a decoder to lean on. Why then is all of the logic in the DeflateDecoder class?

I'm not sure what logic you're seeing added on the DeflateDecoder class. All of the major logic changes should either be in _init_length or read which are both part of HTTPResponse.

@sigmavirus24


Member

sigmavirus24 commented Aug 27, 2016

Sorry for the confusion, Nate. I didn't expand the diff enough and it looked as if you were adding to the DeflateDecoder class. Feel free to ignore that comment.

@nateprewitt


Contributor

nateprewitt commented Aug 27, 2016

Thanks for the feedback Ian! Things should be updated.

@sigmavirus24


Member

sigmavirus24 commented Aug 27, 2016

@nateprewitt it looks like someone updated urllib3 to require your branch to be consistently rebased on top of master (easily one of GitHub's most annoying mistakes/misfeatures as it ties into other things people generally want). Can you rebase this as well please?

@nateprewitt nateprewitt force-pushed the nateprewitt:723_strict_content_length branch from 2bc4abe to 5b351aa Aug 28, 2016

@nateprewitt


Contributor

nateprewitt commented Aug 28, 2016

Ok cool, things are rebased onto current master. All of @sigmavirus24's proposed changes should be in place. Once @haikuginger gives the thumbs up, I'll squash things down to a more manageable commit list and update CHANGES.

@sigmavirus24 sigmavirus24 changed the title from strict_content_length enforcement to Add `enforce_content_length` for responses Aug 28, 2016

"chunked.")
return None
if length is not None:


@haikuginger

haikuginger Aug 28, 2016

Contributor

This conditional is entirely supplemental to the one above. Possibly do elif length is not None?

@@ -327,9 +389,19 @@ def read(self, amt=None, decode_content=None, cache_content=False):
# no harm in redundantly calling close.
self._fp.close()
flush_decoder = True
if (self.enforce_content_length and
self.length_remaining not in (0, None)):


@haikuginger

haikuginger Aug 28, 2016

Contributor

This is still kinda counter to urllib3's style. I would break the second statement out into a method (self.has_remaining_data, perhaps?) so we can fit everything on one line easily without supplemental parentheses. Also, as @sigmavirus24 mentioned, we go with a 100-character line here, rather than the 79-character line recommended by PEP-8, so I'm not sure if you even need to do any refactoring to fit it into one line.


@nateprewitt

nateprewitt Aug 28, 2016

Contributor

Cool, it's a length of 95 expanded, so I went with that approach. Adding the check as another method seems a bit excessive if we're only going to use it here.

@haikuginger


Contributor

haikuginger commented Aug 28, 2016

Couple minor nits left.

@nateprewitt nateprewitt force-pushed the nateprewitt:723_strict_content_length branch 2 times, most recently from 84b6911 to 7c6c226 Aug 28, 2016

nateprewitt added some commits Aug 28, 2016

@nateprewitt nateprewitt force-pushed the nateprewitt:723_strict_content_length branch from 7c6c226 to 0a2a2dc Aug 28, 2016

@nateprewitt


Contributor

nateprewitt commented Aug 28, 2016

Alright, @Lukasa, @sigmavirus24, @haikuginger, I think everything has been addressed and I squashed the commits down into two separate feature commits. One for the length_remaining attribute and then building on that to implement enforce_content_length.

@Lukasa


Contributor

Lukasa commented Aug 28, 2016

Cool, I still like this. @haikuginger @sigmavirus24, are you two happy?

@sigmavirus24


Member

sigmavirus24 commented Aug 29, 2016

I've skimmed through (haven't downloaded it and played with it) but it 👀-only looks good.

@haikuginger


Contributor

haikuginger commented Aug 29, 2016

I'm on the same page as @sigmavirus24; haven't played with it, but it looks good.

@Lukasa


Contributor

Lukasa commented Aug 29, 2016

Ok, cool, I'm happy with this then. Thanks for the great work @nateprewitt!

@Lukasa


Contributor

Lukasa commented Aug 29, 2016

And thanks so much for the reviews @sigmavirus24 and @haikuginger, fantastic team job all around.

@Lukasa Lukasa merged commit 65b8c52 into urllib3:master Aug 29, 2016

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
@haikuginger


Contributor

haikuginger commented Aug 29, 2016

Thanks for the excellent work, @nateprewitt! Way to drive this PR through. 🍻
