Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[#9086] Trim whitespace in gzip encoder's accept-encoding check #747

Merged
merged 1 commit into from
Jul 1, 2018

Conversation

hawkowl
Copy link
Member

@hawkowl hawkowl commented Mar 19, 2017

@@ -525,7 +525,8 @@ def encoderForRequest(self, request):
"""
acceptHeaders = request.requestHeaders.getRawHeaders(
'accept-encoding', [])
supported = ','.join(acceptHeaders).split(',')
supported = [encoding.strip()
for encoding in ','.join(acceptHeaders).split(',')]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This join/split dance has to allocate roughly sum((len(h) for h in acceptHeaders)) + len(acceptHeaders) - 1 and then just throws it away.

In practice this is limited to something like 8kb because the total twisted request size is 16k with a maximum of 500 headers, so it's not like a single request is likely to exhaust all your memory by sending the maximum number of Accept-Encoding headers. But it's still a lot of malloc and memcpy.

Some simple loops are probably better?

for header in acceptHeaders:
    for encoding in header.split(','):
        if 'gzip' in encoding.strip():
           encoding = ...
           break

Of course, split and strip are also going to do a lot of copying, so maybe a regexp instead?

Something along the lines of

gzipRe = re.compile(r'[\s,]?gzip[\s,]?')

and

for header in acceptHeaders:
    if gzipRe.match('header'):
        encoding = ...
        break

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for simple loops as they are easier to read :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that regexp is quite right:

>>> import re
>>> gzipRe = re.compile(br'[\s,]?gzip[\s,]?')
>>> gzipRe.match(b'gzip')
<_sre.SRE_Match object; span=(0, 4), match=b'gzip'>
>>> gzipRe.match(b'x-gzip')
>>> gzipRe.match(b'gzipzzz')  # should not match
<_sre.SRE_Match object; span=(0, 4), match=b'gzip'>
>>> bool(gzipRe.match(b'deflate, gzip'))  # match() is not right
False

The regexp should use ^ and $ to allow matches at the beginning and end of the string. search() should be used to allow non-prefix matches:

>>> gzipRe2 = re.compile(br'(:?^|[\s,])gzip(:?$|[\s,])')
>>> bool(gzipRe2.search(b'gzip'))
True
>>> bool(gzipRe2.search(b'deflate, gzip'))
True
>>> bool(gzipRe2.search(b'deflate, gzip,gzip'))
True
>>> bool(gzipRe2.search(b'deflate, gzip,br'))
True
>>> bool(gzipRe2.search(b'deflate,gzip,br'))
True
>>> bool(gzipRe2.search(b'x-gzip,br'))
False
>>> bool(gzipRe2.search(b'deflate,br'))
False

I would keep everything as bytes so that we don't need to decode the header value.

@codecov
Copy link

codecov bot commented Mar 20, 2017

Codecov Report

Merging #747 into trunk will decrease coverage by 0.39%.
The diff coverage is 56.25%.

@@            Coverage Diff            @@
##            trunk     #747     +/-   ##
=========================================
- Coverage   91.85%   91.46%   -0.4%     
=========================================
  Files         844      844             
  Lines      150573   150584     +11     
  Branches    13148    13148             
=========================================
- Hits       138307   137729    -578     
- Misses      10171    10718    +547     
- Partials     2095     2137     +42

@glyph
Copy link
Member

glyph commented Apr 26, 2017

ubuntu16.04-py2.7-newstyle-coverage is failing. is this a bug in incremental?

@hawkowl
Copy link
Member Author

hawkowl commented Apr 26, 2017

cc @glyph it seems to be fine now :) i think it was a git hiccup?

request = server.Request(self.channel, False)
request.gotLength(0)
request.requestHeaders.setRawHeaders(b"Accept-Encoding",
[b"deflate, gzip"])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find the current test code hard to read... why we use gzip and not deflate?

I know that we only support .gz and .bz2 ... but the test should not make this assumption,

We might later add support for deflate, and then this test might fail ... and then you might not know why

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is a test of GzipEncoderFactory I don't think that this is particularly a problem. (I'd also wager we won't ever add support for deflate, since it's basically supported by a subset of the clients that gzip is, and historically clients are buggy. Brotli—br—would be a more useful addition.)

Copy link
Contributor

@twm twm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. I agree with @dreid that it would be better to avoid join/split in header processing here. Whether to use a regexp or not is up to you. I think that we have already demonstrated the potential hazards there. :)
  2. I'd also like to avoid decoding the header values—bytes all the way! This will make Twisted more robust against odd inputs.

Thanks!

@@ -525,7 +525,8 @@ def encoderForRequest(self, request):
"""
acceptHeaders = request.requestHeaders.getRawHeaders(
'accept-encoding', [])
supported = ','.join(acceptHeaders).split(',')
supported = [encoding.strip()
for encoding in ','.join(acceptHeaders).split(',')]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that regexp is quite right:

>>> import re
>>> gzipRe = re.compile(br'[\s,]?gzip[\s,]?')
>>> gzipRe.match(b'gzip')
<_sre.SRE_Match object; span=(0, 4), match=b'gzip'>
>>> gzipRe.match(b'x-gzip')
>>> gzipRe.match(b'gzipzzz')  # should not match
<_sre.SRE_Match object; span=(0, 4), match=b'gzip'>
>>> bool(gzipRe.match(b'deflate, gzip'))  # match() is not right
False

The regexp should use ^ and $ to allow matches at the beginning and end of the string. search() should be used to allow non-prefix matches:

>>> gzipRe2 = re.compile(br'(:?^|[\s,])gzip(:?$|[\s,])')
>>> bool(gzipRe2.search(b'gzip'))
True
>>> bool(gzipRe2.search(b'deflate, gzip'))
True
>>> bool(gzipRe2.search(b'deflate, gzip,gzip'))
True
>>> bool(gzipRe2.search(b'deflate, gzip,br'))
True
>>> bool(gzipRe2.search(b'deflate,gzip,br'))
True
>>> bool(gzipRe2.search(b'x-gzip,br'))
False
>>> bool(gzipRe2.search(b'deflate,br'))
False

I would keep everything as bytes so that we don't need to decode the header value.

request = server.Request(self.channel, False)
request.gotLength(0)
request.requestHeaders.setRawHeaders(b"Accept-Encoding",
[b"deflate, gzip"])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is a test of GzipEncoderFactory I don't think that this is particularly a problem. (I'd also wager we won't ever add support for deflate, since it's basically supported by a subset of the clients that gzip is, and historically clients are buggy. Brotli—br—would be a more useful addition.)

@hawkowl hawkowl force-pushed the 9086-trim-accept-encoding branch from 3a8c0bd to 2fd2c96 Compare July 1, 2018 12:28
@hawkowl hawkowl merged commit 2c52915 into trunk Jul 1, 2018
@hawkowl hawkowl deleted the 9086-trim-accept-encoding branch July 1, 2018 13:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants