
HPACK encoder doesn't handle duplicate headers appropriately #50

Closed
Lukasa opened this issue Apr 22, 2014 · 10 comments

Lukasa commented Apr 22, 2014

As seen in http2jp/hpack-test-case#13, there's a very specific edge case where the HPACK encoder mishandles repeated headers. Specifically, if a header set contains two identical headers (same key and value) whose entry is already in the reference set, we won't emit any representations for them, so the output contains only a single instance of that repeated header.

The nicest fix here is likely to be fixing up #36: repeated identical headers would then be concatenated together and would look different to the HPACK internals.

Lukasa added the Bug label Apr 22, 2014

Lukasa commented Apr 22, 2014

Right, the actual HPACK logic is easy. What's substantially harder is representing the multiple header values on the far side. My current solution is a dictionary whose values are lists (sketched below), but this causes huge pain in the tests.
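
A minimal sketch of that representation (the helper name is illustrative, not hyper's actual API):

```python
from collections import defaultdict

def group_headers(pairs):
    """Collect decoded (name, value) pairs into a dict whose values
    are lists, so duplicate values for one name aren't lost."""
    grouped = defaultdict(list)
    for name, value in pairs:
        grouped[name].append(value)
    return dict(grouped)

# Two identical set-cookie headers survive the conversion:
assert group_headers([("set-cookie", "a=1"), ("set-cookie", "a=1")]) == {
    "set-cookie": ["a=1", "a=1"]
}
```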

That's probably ok, I'll see if the tests can be fixed tomorrow.


Lukasa commented Apr 24, 2014

Actually, this isn't simple.

We need to handle two separate situations:

  1. Headers with multiple values concatenated together using NULL bytes (illustrated just below).
  2. Identical headers sent multiple times.
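
For illustration, situation 1 just means one name carrying a NULL-joined value:

```python
# Situation 1: multiple values for one header name carried as a
# single NULL-joined string (per the #36 approach).
values = ["a=1", "b=2"]
joined = "\x00".join(values)           # what travels through HPACK
assert joined.split("\x00") == values  # recoverable on the far side
```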

The HPACK decoder uses Python sets internally, which allow for efficient differencing. Unfortunately, sets don't let us easily keep multiple identical headers in the same header set. It looks like this may require a more dramatic rearchitecting of the HPACK implementation.
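
Situation 2 is exactly where the set-based representation falls down:

```python
# Python sets silently collapse duplicates, so a set-based header
# set can't represent the same header sent twice:
header_set = {("set-cookie", "a=1"), ("set-cookie", "a=1")}
assert len(header_set) == 1  # the second instance is gone
```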


Lukasa commented Apr 25, 2014

Yeah, the rewrite here is pretty big, but the impact of this bug is low, so I won't block releases behind it.

EDIT: This got discussed at length on the WG list. The short version is that hyper's HPACK implementation is totally wrong because of some subtleties in the spec.


Lukasa commented May 12, 2014

So, given that this is a full redesign, may as well do my design out here. Open development, right?

Specific notes:

  • A header set is an unordered collection of key-value pairs, potentially containing duplicates.
  • The header table is a fixed-size ordered collection of key-value pairs, potentially containing duplicates.
  • The reference set is an unordered collection of references to elements in the header table, never containing duplicates.
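
Roughly, those three collections map onto Python like this (a hedged sketch; the names are illustrative and size-based eviction is elided):

```python
from collections import deque

# Header set: order-free, duplicates allowed.
header_set = [("set-cookie", "a=1"), ("set-cookie", "a=1")]

# Header table: ordered, duplicates allowed; new entries go in at the
# front and old entries fall off the back when the table's octet
# budget is exceeded (eviction logic elided here).
header_table = deque()

# Reference set: unordered, no duplicates; holds references to
# header_table entries rather than the entries themselves.
reference_set = set()
```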

Specific requirements:

  • Encoder will need to accept an iterable of tuples (the only Pythonic way to represent a header set).
  • Decoder will need to return an iterable of tuples (for exactly the same reason).

Notes:

  • Implementing the reference set in Python is a little awkward. If I want it to be a set, it needs to be a set of values that can be used to find the element in the header table. These can't be names or weakrefs, because they'd have the same hash-collision problem that affects the current implementation. Indices into the header table are a terrible idea, because we'd need to increment/decrement them all each time the header table changes size, which is hugely expensive. What I need is a 'reference' type that hashes uniquely per header-table entry, with references to the same entry comparing equal. Easy enough to do, though it's pretty nasty.
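
For concreteness, the collision problem that rules out any value-based key, shown here with plain tuples:

```python
# Two equal (name, value) tuples are indistinguishable by hash or
# equality, so a plain set can't tell apart two distinct header-table
# entries that happen to hold the same header:
a = ("set-cookie", "x=1")
b = ("set-cookie", "x=1")
assert a == b and hash(a) == hash(b)  # can't coexist in one set
```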


Lukasa commented May 12, 2014

Let's first consider decoding. The algorithm is as follows:

First, prepare a header set to be emitted. This set is initially empty.

Next, determine the representation of the element. This is one of: "indexed", "literal added to header table", "literal not added to header table".

If the representation is "indexed", check whether it's in the reference set (does anything in the reference set compare equal to the reference?):

  • If it is, remove it from the reference set and do not emit the header.
  • If it is not and it references the static table: 1) emit the header; 2) insert the static entry at the start of the header table; 3) add a reference to the reference set.
  • If it is not and it references the header table: 1) emit the header; 2) add a reference to the reference set.

If the representation is "literal not added to the header table", only emit the header.

If the representation is "literal added to the header table": 1) emit the header; 2) insert the header at the beginning of the header table; 3) add a reference to the reference set.

Finally, everything in the reference set that has not been emitted as part of this process gets added to the header set.

This last step is the trickiest bit. The core problem is that we need a way to work out whether a header in the reference set has been emitted yet. Fundamentally, any header that was added to the reference set during this procedure and remains in it at the end was emitted, and none of the others were. So we need a way to keep track of which ones we emitted.

Some options for this:

  1. Have the header set be a combination of literal key-value pairs and 'reference set references', then have the 'reference set references' be converted to literals upon completing processing. This is tricky, as it's possible the 'reference set references' point at things that are no longer in the header table, so they'd need to keep copies. This is finicky.
  2. Have the reference set be marked as to whether they were added in this round of processing. Painful performance characteristics here because one way or another we need to touch every reference every round of processing.
  3. Have a temporary set (an actual set) of references that were added to the set in this round of processing. This can be differenced against the reference set to find the rest. Probably the nicest, and currently what I'm leaning towards (sketched below).
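
A hedged sketch of option 3 (`process_representation`, `state.reference_set`, and `ref.entry` are hypothetical names, just to show the shape):

```python
def decode_header_block(state, representations):
    """Option 3: track references touched during this block in a
    temporary set, then difference it against the full reference set
    to find entries that still need emitting."""
    emitted = []
    touched = set()  # references added or re-used in this block

    for rep in representations:
        # Hypothetical helper applying the per-representation rules
        # above; returns (header_or_None, reference_or_None).
        header, ref = process_representation(state, rep)
        if header is not None:
            emitted.append(header)
        if ref is not None:
            touched.add(ref)

    # Anything still referenced but untouched this block was never
    # emitted above, so it joins the header set now.
    for ref in state.reference_set - touched:
        emitted.append(ref.entry)

    return emitted
```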


Lukasa commented May 12, 2014

Alright, so how does this 'reference' object work?

The constraints are as follows:

  • A 'reference' must evaluate equal to any other reference to a specific entry in the header table.
  • A 'reference' must evaluate not equal to a reference to a different entry in the header table, even if that entry compares equal to the entry the other reference points at (what a terrible sentence).
  • For use with sets, hashing must follow the same rules.

The obvious implementation here is to use object identity on the tuples.
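
A sketch of that reference type (the class name and shape are my own illustration):

```python
class Reference:
    """Wraps a header-table entry; equality and hashing go through the
    entry's object identity, so two references to the same tuple object
    are equal, while references to distinct-but-equal tuples are not."""
    def __init__(self, entry):
        self.entry = entry  # the (name, value) tuple object in the table

    def __eq__(self, other):
        return isinstance(other, Reference) and self.entry is other.entry

    def __hash__(self):
        return id(self.entry)

# Distinct-but-equal table entries yield unequal references:
a = ("set-cookie", "a=1")
b = (a[0], a[1])  # equal contents, distinct object
assert a == b and a is not b
assert Reference(a) == Reference(a)
assert Reference(a) != Reference(b)
```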


Lukasa commented May 18, 2014

Ok, so the decoder is done and seems like it works.

Encoder time. This is a little harder. The big problem is repeated headers. If we have a unique header (i.e. name-value pair), then either it's in the reference set and we don't encode it, or it's not and we do encode it (and add it to the reference set). If we have a header that's repeated, each instance after the first needs to be encoded as two representations: one removing it from the reference set and one re-adding it (which emits it again).

Some examples:

  • If header a is present four times in the header set and is not present in the reference set, it needs to be encoded 7 times: once to originally put it in the reference set, then three remove/re-add pairs (1 + 3×2 = 7).
  • If header a is present four times in the header set and is present in the reference set, it needs to be encoded 8 times: four remove/re-add pairs (4×2 = 8).

Encoding this in a way that doesn't special-case is tricky, and I just don't think I can do it in a way that I really like. The problem is that searching the header set for all repeated headers every single time seems wildly inefficient. Is it? Not sure.


Lukasa commented May 23, 2014

It's not inefficient if we use the Counter data structure! This is an awesome brainwave. =D
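
A hedged sketch of the counting logic (the `('add', …)`/`('remove', …)` tuples stand in for real indexed representations; `plan_encoding` is an illustrative name, not hyper's API):

```python
from collections import Counter

def plan_encoding(headers, reference_set):
    """Use Counter to find duplicates in O(n), then plan how many
    reference-set toggles each header needs."""
    plan = []
    for header, count in Counter(headers).items():
        if header in reference_set:
            toggle_pairs = count      # every instance is a remove + re-add
        else:
            plan.append(("add", header))  # first instance: plain add
            toggle_pairs = count - 1      # the rest are remove + re-add
        for _ in range(toggle_pairs):
            plan.append(("remove", header))
            plan.append(("add", header))
    return plan

# Matches the arithmetic above: 4 copies not in the reference set
# need 7 representations; 4 copies already referenced need 8.
h = ("set-cookie", "a=1")
assert len(plan_encoding([h] * 4, set())) == 7
assert len(plan_encoding([h] * 4, {h})) == 8
```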

Lukasa added a commit that referenced this issue May 28, 2014
This rewrite was required because certain assumptions in the original
HPACK implementation turned out to be untrue.
Unfortunately, this rewrite adds substantial complexity to much of this
code. See GitHub issue #50 for more details.

Lukasa commented May 28, 2014

I ended up going a totally different route with this, using HPACK's ability to be streamed. It involved a pretty gnarly rewrite, the sum total of which can be seen here.

This passes all my tests, including the fixed-up HPACK integration tests. Time for me to go back and re-open http2jp/hpack-test-case#13 to confirm that hyper now functions appropriately.


Lukasa commented Jun 1, 2014

Definitely fixed.

Lukasa closed this as completed Jun 1, 2014