[#9086] Trim whitespace in gzip encoder's accept-encoding check #747
src/twisted/web/server.py
Outdated
@@ -525,7 +525,8 @@ def encoderForRequest(self, request):
         """
         acceptHeaders = request.requestHeaders.getRawHeaders(
             'accept-encoding', [])
-        supported = ','.join(acceptHeaders).split(',')
+        supported = [encoding.strip()
+                     for encoding in ','.join(acceptHeaders).split(',')]
This join/split dance has to allocate roughly sum(len(h) for h in acceptHeaders) + len(acceptHeaders) - 1 bytes for the joined string and then just throws it away.
In practice this is limited to something like 8 kB, because the total Twisted request size is 16k with a maximum of 500 headers, so a single request is unlikely to exhaust all your memory by sending the maximum number of Accept-Encoding headers. But it's still a lot of malloc and memcpy.
Some simple loops are probably better?

for header in acceptHeaders:
    for encoding in header.split(','):
        if 'gzip' in encoding.strip():
            encoding = ...
            break

Of course, split and strip are also going to do a lot of copying, so maybe a regexp instead?
Something along the lines of

gzipRe = re.compile(r'[\s,]?gzip[\s,]?')

and

for header in acceptHeaders:
    if gzipRe.match(header):
        encoding = ...
        break
+1 for simple loops as they are easier to read :)
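As a minimal sketch of the loop approach discussed above, the check can be written without joining the header values first. The function name `accepts_gzip` is hypothetical and not part of Twisted. The sketch assumes the raw header values are bytes, and it compares each token for equality rather than using the substring test from the suggestion, so a token such as `x-gzip` is not accepted.

```python
def accepts_gzip(accept_headers):
    """Return True if any Accept-Encoding value lists the token b"gzip".

    accept_headers is assumed to be a list of raw header values as bytes.
    accepts_gzip is a hypothetical helper, not a Twisted API.
    """
    for header in accept_headers:
        for encoding in header.split(b","):
            # strip() drops optional whitespace around each token, so
            # b"deflate, gzip" and b"deflate,gzip" behave the same.
            if encoding.strip() == b"gzip":
                return True
    return False


print(accepts_gzip([b"deflate, gzip"]))
print(accepts_gzip([b"x-gzip,br"]))
```

Because the comparison is exact, this sketch does not recognise a token that carries parameters, such as `gzip;q=1.0`.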
I don't think that regexp is quite right:
>>> import re
>>> gzipRe = re.compile(br'[\s,]?gzip[\s,]?')
>>> gzipRe.match(b'gzip')
<_sre.SRE_Match object; span=(0, 4), match=b'gzip'>
>>> gzipRe.match(b'x-gzip')
>>> gzipRe.match(b'gzipzzz') # should not match
<_sre.SRE_Match object; span=(0, 4), match=b'gzip'>
>>> bool(gzipRe.match(b'deflate, gzip')) # match() is not right
False

The regexp should use ^ and $ to allow matches at the beginning and end of the string. search() should be used to allow non-prefix matches:
>>> gzipRe2 = re.compile(br'(?:^|[\s,])gzip(?:$|[\s,])')
>>> bool(gzipRe2.search(b'gzip'))
True
>>> bool(gzipRe2.search(b'deflate, gzip'))
True
>>> bool(gzipRe2.search(b'deflate, gzip,gzip'))
True
>>> bool(gzipRe2.search(b'deflate, gzip,br'))
True
>>> bool(gzipRe2.search(b'deflate,gzip,br'))
True
>>> bool(gzipRe2.search(b'x-gzip,br'))
False
>>> bool(gzipRe2.search(b'deflate,br'))
False

I would keep everything as bytes so that we don't need to decode the header value.
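The regexp variant described above can be sketched as follows, keeping everything as bytes. The names `GZIP_RE` and `accepts_gzip_re` are hypothetical, and the pattern uses non-capturing groups written as `(?:...)`. This is one possible reading of the suggestion, not the code that was merged.

```python
import re

# The token must be bounded by the start/end of the value or by a
# separator (whitespace or comma), so b"x-gzip" and b"gzipzzz" are rejected.
GZIP_RE = re.compile(br"(?:^|[\s,])gzip(?:$|[\s,])")


def accepts_gzip_re(accept_headers):
    """Return True if any raw Accept-Encoding value (bytes) lists gzip.

    accepts_gzip_re is a hypothetical helper, not a Twisted API.
    """
    # search() scans the whole value; match() would only test the prefix.
    return any(GZIP_RE.search(header) for header in accept_headers)


print(accepts_gzip_re([b"deflate, gzip"]))
print(GZIP_RE.match(b"deflate, gzip"))
```

The second call illustrates why match() is the wrong method here: it only looks for gzip at the start of the value.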
Codecov Report
@@            Coverage Diff            @@
##            trunk     #747    +/-   ##
=========================================
- Coverage   91.85%   91.46%   -0.4%
=========================================
  Files         844      844
  Lines      150573   150584    +11
  Branches    13148    13148
=========================================
- Hits       138307   137729   -578
- Misses      10171    10718   +547
- Partials     2095     2137    +42

ubuntu16.04-py2.7-newstyle-coverage is failing. Is this a bug in incremental?

cc @glyph: it seems to be fine now :) I think it was a git hiccup?

request = server.Request(self.channel, False)
request.gotLength(0)
request.requestHeaders.setRawHeaders(b"Accept-Encoding",
                                     [b"deflate, gzip"])
I find the current test code hard to read... why do we use gzip and not deflate?
I know that we only support .gz and .bz2, but the test should not make this assumption.
We might later add support for deflate, and then this test might fail, and you might not know why.
Since this is a test of GzipEncoderFactory I don't think that this is particularly a problem. (I'd also wager we won't ever add support for deflate, since it's basically supported by a subset of the clients that gzip is, and historically clients are buggy. Brotli—br—would be a more useful addition.)
- I agree with @dreid that it would be better to avoid join/split in header processing here. Whether to use a regexp or not is up to you. I think that we have already demonstrated the potential hazards there. :)
- I'd also like to avoid decoding the header values: bytes all the way! This will make Twisted more robust against odd inputs.
Thanks!
3a8c0bd to 2fd2c96
https://twistedmatrix.com/trac/ticket/9086