
HPACK encoder doesn't handle duplicate headers appropriately #50

Closed
Lukasa opened this issue Apr 22, 2014 · 10 comments

Lukasa commented Apr 22, 2014

As seen in http2jp/hpack-test-case#13, there's a very specific edge case where the HPACK encoder mishandles repeated headers. Specifically, if a header set contains two identical headers (same key and value) whose entry is already in the reference set, we won't emit any representations for them, so the output contains only a single instance of that repeated header.

The nicest fix here is likely to be fixing up #36: repeated identical headers would then be concatenated together and would look different to the HPACK internals.

Lukasa added the Bug label Apr 22, 2014

Lukasa commented Apr 22, 2014

Right, the actual HPACK logic is easy. What's substantially harder is representing the multiple header values on the far side. My current solution is a dictionary whose values are lists (sketched below), but this causes huge pain in the tests.
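
A minimal sketch of that representation (the helper name is illustrative, not hyper's actual API):

```python
from collections import defaultdict

def group_headers(pairs):
    """Collect decoded (name, value) pairs into a dict whose values
    are lists, so duplicate values for one name aren't lost."""
    grouped = defaultdict(list)
    for name, value in pairs:
        grouped[name].append(value)
    return dict(grouped)

# Two identical set-cookie headers survive the conversion:
assert group_headers([("set-cookie", "a=1"), ("set-cookie", "a=1")]) == {
    "set-cookie": ["a=1", "a=1"]
}
```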

That's probably ok, I'll see if the tests can be fixed tomorrow.


Lukasa commented Apr 24, 2014

Actually, this isn't simple.

We need to handle two separate situations:

  1. Headers with multiple values concatenated together using NULL bytes (illustrated just below).
  2. Identical headers sent multiple times.
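
For illustration, situation 1 just means one name carrying a NULL-joined value:

```python
# Situation 1: multiple values for one header name carried as a
# single NULL-joined string (per the #36 approach).
values = ["a=1", "b=2"]
joined = "\x00".join(values)           # what travels through HPACK
assert joined.split("\x00") == values  # recoverable on the far side
```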

The HPACK decoder uses Python sets internally, which allow for efficient differencing. Unfortunately, sets don't let us easily keep multiple identical headers in the same header set. It looks like this may require a more dramatic rearchitecting of the HPACK implementation.
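
Situation 2 is exactly where the set-based representation falls down:

```python
# Python sets silently collapse duplicates, so a set-based header
# set can't represent the same header sent twice:
header_set = {("set-cookie", "a=1"), ("set-cookie", "a=1")}
assert len(header_set) == 1  # the second instance is gone
```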


Lukasa commented Apr 25, 2014

Yeah, the rewrite here is pretty big, but the impact of this bug is low, so I won't block releases behind it.

EDIT: This got discussed at length on the WG list. The short version is that hyper's HPACK implementation is totally wrong because of some subtleties in the spec.


Lukasa commented May 12, 2014

So, given that this is a full redesign, may as well do my design out here. Open development, right?

Specific notes:

  • A header set is an unordered collection of key-value pairs, potentially containing duplicates.
  • The header table is a fixed-size ordered collection of key-value pairs, potentially containing duplicates.
  • The reference set is an unordered collection of references to elements in the header table, never containing duplicates.
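
Roughly, those three collections map onto Python like this (a hedged sketch; the names are illustrative and size-based eviction is elided):

```python
from collections import deque

# Header set: order-free, duplicates allowed.
header_set = [("set-cookie", "a=1"), ("set-cookie", "a=1")]

# Header table: ordered, duplicates allowed; new entries go in at the
# front and old entries fall off the back when the table's octet
# budget is exceeded (eviction logic elided here).
header_table = deque()

# Reference set: unordered, no duplicates; holds references to
# header_table entries rather than the entries themselves.
reference_set = set()
```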

Specific requirements:

  • Encoder will need to accept an iterable of tuples (the only Pythonic way to represent a header set).
  • Decoder will need to return an iterable of tuples (for exactly the same reason).

Notes:

  • Implementing the reference set in Python is a little awkward. If I want it to be a set, it needs to be a set of values that can be used to find the element in the header table. These can't be names or weakrefs, because they'd have the same hash-collision problem that affects the current implementation. Indices into the header table are a terrible idea, because we'd need to increment/decrement them all each time the header table changes size, which is hugely expensive. What I need is a 'reference' type that hashes uniquely per header-table entry, with references to the same entry comparing equal. Easy enough to do, though it's pretty nasty.
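
For concreteness, the collision problem that rules out any value-based key, shown here with plain tuples:

```python
# Two equal (name, value) tuples are indistinguishable by hash or
# equality, so a plain set can't tell apart two distinct header-table
# entries that happen to hold the same header:
a = ("set-cookie", "x=1")
b = ("set-cookie", "x=1")
assert a == b and hash(a) == hash(b)  # can't coexist in one set
```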


Lukasa commented May 12, 2014

Let's first consider decoding. The algorithm is as follows:

First, prepare a header set to be emitted. This set is initially empty.

Next, determine the representation of the element. This is one of: "indexed", "literal added to header table", "literal not added to header table".

If the representation is "indexed", check whether it's in the reference set (does anything in the reference set compare equal to the reference?):

  • If it is, remove it from the reference set and do not emit the header.
  • If it is not and it references the static table: 1) emit the header; 2) insert the static entry at the start of the header table; 3) add a reference to the reference set.
  • If it is not and it references the header table: 1) emit the header; 2) add a reference to the reference set.

If the representation is "literal not added to the header table", only emit the header.

If the representation is "literal added to the header table": 1) emit the header; 2) insert the header at the beginning of the header table; 3) add a reference to the reference set.

Finally, everything in the reference set that has not been emitted as part of this process gets added to the header set.

This last step is the trickiest bit. The core problem is that we need a way to work out whether a header in the reference set has been emitted yet. Fundamentally, any header that was added to the reference set during this procedure and remains in it at the end was emitted, and none of the others were. So we need a way to keep track of which ones we emitted.

Some options for this:

  1. Have the header set be a combination of literal key-value pairs and 'reference set references', then have the 'reference set references' be converted to literals upon completing processing. This is tricky, as it's possible the 'reference set references' point at things that are no longer in the header table, so they'd need to keep copies. This is finicky.
  2. Have the reference set be marked as to whether they were added in this round of processing. Painful performance characteristics here because one way or another we need to touch every reference every round of processing.
  3. Have a temporary set (an actual set) of references that were added to the set in this round of processing. This can be differenced against the reference set to find the rest. Probably the nicest, and currently what I'm leaning towards (sketched below).
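
A hedged sketch of option 3 (`process_representation`, `state.reference_set`, and `ref.entry` are hypothetical names, just to show the shape):

```python
def decode_header_block(state, representations):
    """Option 3: track references touched during this block in a
    temporary set, then difference it against the full reference set
    to find entries that still need emitting."""
    emitted = []
    touched = set()  # references added or re-used in this block

    for rep in representations:
        # Hypothetical helper applying the per-representation rules
        # above; returns (header_or_None, reference_or_None).
        header, ref = process_representation(state, rep)
        if header is not None:
            emitted.append(header)
        if ref is not None:
            touched.add(ref)

    # Anything still referenced but untouched this block was never
    # emitted above, so it joins the header set now.
    for ref in state.reference_set - touched:
        emitted.append(ref.entry)

    return emitted
```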


Lukasa commented May 12, 2014

Alright, so how does this 'reference' object work?

The constraints are as follows:

  • A 'reference' must evaluate equal to any other reference to a specific entry in the header table.
  • A 'reference' must evaluate not equal to a reference to a different entry in the header table, even if that entry compares equal to the entry the other reference points at (what a terrible sentence).
  • For use with sets, hashing must follow the same rules.

The obvious implementation here is to use object identity on the tuples.
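
A sketch of that reference type (the class name and shape are my own illustration):

```python
class Reference:
    """Wraps a header-table entry; equality and hashing go through the
    entry's object identity, so two references to the same tuple object
    are equal, while references to distinct-but-equal tuples are not."""
    def __init__(self, entry):
        self.entry = entry  # the (name, value) tuple object in the table

    def __eq__(self, other):
        return isinstance(other, Reference) and self.entry is other.entry

    def __hash__(self):
        return id(self.entry)

# Distinct-but-equal table entries yield unequal references:
a = ("set-cookie", "a=1")
b = (a[0], a[1])  # equal contents, distinct object
assert a == b and a is not b
assert Reference(a) == Reference(a)
assert Reference(a) != Reference(b)
```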


Lukasa commented May 18, 2014

Ok, so the decoder is done and seems like it works.

Encoder time. This is a little harder. The big problem is repeated headers. If we have a unique header (i.e. name-value pair), then either it's in the reference set and we don't encode it, or it's not and we do encode it (and add it to the reference set). If we have a header that's repeated, each instance after the first needs to be encoded as two representations: one removing it from the reference set and one re-adding it (which emits it again).

Some examples:

  • If header a is present four times in the header set and is not present in the reference set, it needs to be encoded 7 times: once to originally put it in the reference set, then three remove/re-add pairs (1 + 3×2 = 7).
  • If header a is present four times in the header set and is present in the reference set, it needs to be encoded 8 times: four remove/re-add pairs (4×2 = 8).

Encoding this in a way that doesn't special-case is tricky, and I just don't think I can do it in a way that I really like. The problem is that searching the header set for all repeated headers every single time seems wildly inefficient. Is it? Not sure.


Lukasa commented May 23, 2014

It's not inefficient if we use the Counter data structure! This is an awesome brainwave. =D
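
A hedged sketch of the counting logic (the `('add', …)`/`('remove', …)` tuples stand in for real indexed representations; `plan_encoding` is an illustrative name, not hyper's API):

```python
from collections import Counter

def plan_encoding(headers, reference_set):
    """Use Counter to find duplicates in O(n), then plan how many
    reference-set toggles each header needs."""
    plan = []
    for header, count in Counter(headers).items():
        if header in reference_set:
            toggle_pairs = count      # every instance is a remove + re-add
        else:
            plan.append(("add", header))  # first instance: plain add
            toggle_pairs = count - 1      # the rest are remove + re-add
        for _ in range(toggle_pairs):
            plan.append(("remove", header))
            plan.append(("add", header))
    return plan

# Matches the arithmetic above: 4 copies not in the reference set
# need 7 representations; 4 copies already referenced need 8.
h = ("set-cookie", "a=1")
assert len(plan_encoding([h] * 4, set())) == 7
assert len(plan_encoding([h] * 4, {h})) == 8
```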

Lukasa added a commit that referenced this issue May 28, 2014
This rewrite was required because certain assumptions in the original
HPACK implementation turned out to be untrue.
Unfortunately, this rewrite adds substantial complexity to much of this
code. See GitHub issue #50 for more details.

Lukasa commented May 28, 2014

I ended up going a totally different route with this, using HPACK's ability to be streamed. It involved a pretty gnarly rewrite, the sum total of which can be seen here.

This passes all my tests, including the fixed-up HPACK integration tests. Time for me to go back and re-open http2jp/hpack-test-case#13 to confirm that hyper now functions appropriately.


Lukasa commented Jun 1, 2014

Definitely fixed.

Lukasa closed this as completed Jun 1, 2014