
Table-less string interning #2217

Closed
kmcallister opened this issue Apr 23, 2014 · 9 comments
@kmcallister (Contributor) commented Apr 23, 2014

The traditional approach for string interning is:

  • Interned strings are pointers
  • To intern a string, look in a hash table mapping strings to pointers
  • If it's not found, allocate a copy of the string owned by the hash table, and insert it

Unfortunately this hash table needs to be shared between all threads that are creating interned strings which might be compared to each other. If we are parsing CSS in parallel and two threads independently intern the same class name, we will incorrectly interpret the class names as not equal.

There's been much discussion (#1153, #1135) about what kind of concurrent hash table we should use for this.

Here is an alternative that does not require such a table:

  • Interned strings are 128 bits
  • To intern a string of 16 bytes or fewer, just store it "as is"
  • To intern a string of more than 16 bytes, hash it using a 128-bit cryptographic hash function and store that. If the hash function is not broken, we can assume collisions don't happen.

To prevent an attacker-chosen immediate string from colliding with a known hash value, set the top two bits of the first hash byte to 10 (binary), which makes that byte a UTF-8 continuation byte — and no valid immediate string can start with a continuation byte. Also use hashing for any string containing a NUL byte, so that immediate strings don't need to store their length.
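A sketch of the scheme above (in Python with stdlib SHA-256 for illustration; Servo would do this in Rust, and the function names here are hypothetical):

```python
import hashlib

def intern(s: bytes) -> bytes:
    """Intern a string to a fixed 128-bit (16-byte) value.

    Sketch, not Servo code: short NUL-free strings are stored inline;
    everything else becomes a truncated SHA-256 digest, tagged so that a
    digest can never collide with an inline string.
    """
    if len(s) <= 16 and b"\x00" not in s:
        # Immediate representation: the bytes themselves, NUL-padded.
        return s.ljust(16, b"\x00")
    digest = bytearray(hashlib.sha256(s).digest()[:16])
    # Force the top two bits of the first byte to 10 (binary). A byte of
    # the form 0b10xxxxxx is a UTF-8 continuation byte, so no valid
    # immediate UTF-8 string can begin this way.
    digest[0] = (digest[0] & 0x3F) | 0x80
    return bytes(digest)

# Equality needs no shared state: two threads interning the same string
# independently produce identical 16-byte values.
assert intern(b"container") == intern(b"container")
assert intern(b"a" * 100) == intern(b"a" * 100)
assert intern(b"short") != intern(b"a" * 100)
```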

So we can intern strings and compare them for equality without any shared state. But there's a third operation we need to support: getting the original characters of an interned string. For this we do need to build a table mapping hash values to strings. But each thread can build this table independently, since they will always get the same result. We can merge them later, in parallel with selector matching etc. Only when we need to read out original strings (for CSSOM?) do we block on construction of the merged table. And we don't need it at all for strings ≤ 16 bytes.
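The per-thread read-back tables could look something like this (again a Python sketch with hypothetical names): since the hash-to-string mapping is deterministic, independently built tables always agree on shared keys and can be merged with a plain union, with no coordination during parsing.

```python
import hashlib

def intern(s: bytes) -> bytes:
    # Same interning rule sketched earlier: inline short NUL-free strings,
    # otherwise a tag-adjusted truncated SHA-256 digest.
    if len(s) <= 16 and b"\x00" not in s:
        return s.ljust(16, b"\x00")
    d = bytearray(hashlib.sha256(s).digest()[:16])
    d[0] = (d[0] & 0x3F) | 0x80
    return bytes(d)

def build_table(strings):
    # Per-thread table: digest -> original bytes, for hashed strings only.
    # Immediate (<= 16-byte) strings carry their own characters and need
    # no table entry.
    return {intern(s): s for s in strings if len(s) > 16 or b"\x00" in s}

def merge_tables(tables):
    # Independently built tables agree on shared keys, so merging is a
    # straightforward union and can run in parallel with other work.
    merged = {}
    for key, original in ((k, v) for t in tables for k, v in t.items()):
        assert merged.get(key, original) == original
        merged[key] = original
    return merged
```

Only reading out original characters blocks on the merge; equality comparison never does.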

So the things to investigate here are:

  • How much does a 128-bit compare cost, vs. 64 bits for a pointer?
  • How much does bloating the DOM cost, in terms of memory usage and cache pressure?
  • Can we make the cryptographic hash fast enough? On my machine, SHA-256 of a 64-byte string takes about 500 ns (openssl speed sha256). Rust already uses SipHash which is supposed to be fast and crypto-strength. But we would need to run it twice to get 128 bits; 64 bits is not enough to assume collisions don't happen.
  • How much does it cost to build and merge the hash → string tables?

We can study some of this using standalone programs before testing this approach in Servo.

I should also note that we can apply the small-string optimization even when the non-optimized representation is a pointer. But in that case we can only inline strings of up to 4 or 8 bytes (the pointer size).

@jruderman (Contributor) commented Apr 24, 2014

Sounds like this would make thread-safety rely on the security of the hash function (after weakening it with the "10" trick) and luck. That makes me uneasy.

@kripken commented Apr 24, 2014

Correct, intentional collisions are very unlikely unless the cryptographic hash is broken. But very rare unintentional collisions are still possible, so unless I'm missing something, this could produce false equality results that are extremely rare and hard to debug?

@pcwalton (Contributor) commented Apr 24, 2014

Generating a collision in a cryptographic hash is a publishable security result, though—unintentional or not. I don't think we need to worry much there.

The "10" trick is unfortunate but I don't think it'll be much of a problem, as it only reduces the strength of the hash by two bits—we should check with crypto folks to be sure, though. (For example, djb thought the weakening of a hash by mod-ing it by the number of hash table buckets was bad enough to warrant the creation of SipHash—there may be gotchas like that.)

> very rare unintentional collisions are still possible

So going by this SO answer [1] the probability of a collision with a 126-bit perfect hash function is about (10^9)/(2^(127)) or about 5.877 * 10^-30. I'm OK with taking that chance :)

Of course, that assumes that the 128-bit hash we're using is a perfect hash when weakened, which of course is not guaranteed.
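For what it's worth, the quoted figure checks out numerically under the stated assumption of a uniformly random truncated hash:

```python
# Chance that one new string collides with any of 10^9 existing interned
# strings under a uniform 127-bit hash, per the SO answer quoted above.
n = 10**9
p = n / 2**127
print(f"{p:.3e}")  # ~5.877e-30
```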

@kripken commented Apr 24, 2014

An unintentional cryptographic hash collision isn't a publishable result; it's just very unlikely. The risk is moving from a model that is 100% sound to one in which random (very unlikely) errors are possible.

@kmcallister (Author) commented Apr 24, 2014

Disclaimer: I am not a cryptographer, etc.

> djb thought the weakening of a hash by mod-ing it by the number of hash table buckets was bad enough to warrant the creation of SipHash

I think this will only affect non-cryptographic hashes, i.e. those feasibly distinguishable from random functions.

It's common in cryptography to assume that hashes like SHA-256 have basically an independent random output for each distinct input. In that case we're on solid ground truncating to 126 (sic) bits.

(In truth, SHA-256 fails as a random oracle because it's vulnerable to length extension attacks. These aren't at all relevant to us, but it means that truncated SHA-256 is actually closer to a random oracle than SHA-256 itself.)

However, I think that brute-force search for a collision on a 128-bit hash is pretty feasible. You can (allegedly) buy a 1.6 TH/s Bitcoin miner for $5,000. At the same rate of SHA-256 evaluation, you would expect to find a collision in its 128-bit truncation within a few months (2^64 evaluations). I'm not sure whether you could actually repurpose this particular hardware, but it still gives a rough estimate of the difficulty.

So we could use, say, a 192-bit hash, which would change 3 months to about 1 billion years. Or we could mix in a per-session random secret before hashing; I think that would prevent this attack (at some cost) but I'll think about it more.
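The rough arithmetic behind those estimates (my own back-of-the-envelope, using the quoted 1.6 TH/s rate):

```python
# A birthday attack on a k-bit hash needs about 2**(k/2) evaluations.
rate = 1.6e12  # hashes/second, the quoted Bitcoin-miner figure

seconds_128 = 2**64 / rate           # 128-bit hash: ~2^64 evaluations
months_128 = seconds_128 / (86400 * 30)

seconds_192 = 2**96 / rate           # 192-bit hash: ~2^96 evaluations
years_192 = seconds_192 / (86400 * 365)

print(round(months_128, 1))   # a few months
print(f"{years_192:.1e}")     # on the order of a billion years
```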

@kmcallister (Author) commented Apr 24, 2014

> The risk is moving from a model which is 100% sound to one in which random (very unlikely) errors are possible.

You'll always have random unlikely errors, e.g. random bit flips in memory. The trick is just to get your algorithmic error rate much lower than those.

(By the way, there is some cool work on exploiting random bit flips to compromise JVM memory safety, or effectively hijack DNS.)

@kripken commented Apr 25, 2014

Fair point. I'm still concerned that we can't actually estimate the risk of collision: we are assuming the hash is perfectly random. Granted, that is covered by the "if you can prove it isn't random, that's publishable" argument, but the hash might well be less than random regardless of the practicality of proving it so.

@kmcallister (Author) commented Apr 25, 2014

Dropping a link to the mailing list discussion for future reference.

@glennw (Member) commented Oct 17, 2014

String interning is implemented now - we can open a new bug if/when we need to revisit any performance issues.

@glennw closed this Oct 17, 2014