Join GitHub today
GitHub is home to over 50 million developers working together to host and review code, manage projects, and build software together.
Sign upTable-less string interning #2217
Comments
|
Sounds like this would make thread-safety rely on the security of the hash function (after weakening it with the "10" trick) and luck. That makes me uneasy. |
|
Intentional collisions are very unlikely unless the cryptographic hash is broken, correct, but very rare unintentional collisions are still possible, so this might lead to false results of equality that are very rare and hard to debug, unless I'm missing something? |
|
Generating a collision in a cryptographic hash is a publishable security result, though—unintentional or not. I don't think we need to worry much there. The "10" trick is unfortunate but I don't think it'll be much of a problem, as it only reduces the strength of the hash by two bits—we should check with crypto folks to be sure, though. (For example, djb thought the weakening of a hash by mod-ing it by the number of hash table buckets was bad enough to warrant the creation of SipHash—there may be gotchas like that.)
So going by this SO answer [1] the probability of a collision with a 126-bit perfect hash function is about (10^9)/(2^(127)) or about 5.877 * 10^-30. I'm OK with taking that chance :) Of course, that assumes that the 128-bit hash we're using is a perfect hash when weakened, which of course is not guaranteed. |
|
An unintentional cryptographic hash collision is not a publishable result, it's just very unlikely? The risk is moving from a model which is 100% sound to one in which random (very unlikely) errors are possible. |
|
Disclaimer: I am not a cryptographer, etc.
I think this will only affect non-cryptographic hashes, i.e. those feasibly distinguishable from random functions. It's common in cryptography to assume that hashes like SHA-256 have basically an independent random output for each distinct input. In that case we're on solid ground truncating to 126 (sic) bits. (In truth, SHA-256 fails as a random oracle because it's vulnerable to length extension attacks. These aren't at all relevant to us, but it means that truncated SHA-256 is actually closer to a random oracle than SHA-256 itself.) However I think that brute-force search for a collision on a 128-bit hash is pretty feasible. You can (allegedly) buy a 1.6 TH/s Bitcoin miner for $5,000. At the same rate of SHA-256 evaluation, you would find a collision in its 128-bit truncation in expected 3 months (2^64 evaluations). I'm not sure whether you could actually repurpose this particular hardware, but it still gives a rough estimate of the difficulty. So we could use, say, a 192-bit hash, which would change 3 months to about 1 billion years. Or we could mix in a per-session random secret before hashing; I think that would prevent this attack (at some cost) but I'll think about it more. |
You'll always have random unlikely errors, e.g. random bit flips in memory. The trick is just to get your algorithmic error rate much lower than those. (By the way, there is some cool work on exploiting random bit flips to compromise JVM memory safety, or effectively hijack DNS.) |
|
Fair point. I am still concerned by the fact that we can't actually estimate the risk of collision - we are assuming it is perfectly random. Granted, that is covered by the "if you can prove it isn't random, that's publishable" statement, but it might well be less than random regardless of the practicality of proving it so. |
|
Dropping a link to the mailing list discussion for future reference. |
|
String interning is implemented now - we can open a new bug if/when we need to revisit any performance issues. |
The traditional approach for string interning is:
Unfortunately this hash table needs to be shared between all threads that are creating interned strings which might be compared to each other. If we are parsing CSS in parallel and two threads independently intern the same class name, we will incorrectly interpret the class names as not equal.
There's been much discussion (#1153, #1135) about what kind of concurrent hash table we should use for this.
Here is an alternative that does not require such a table:
To prevent an attacker-chosen immediate string from colliding with a known hash value, set the top two bits of the first hash byte to
10, which makes the first byte a UTF-8 continuation byte. Also use hashing for any string containingNULL, so that immediate strings don't need to include their length.So we can intern strings and compare them for equality without any shared state. But there's a third operation we need to support: getting the original characters of an interned string. For this we do need to build a table mapping hash values to strings. But each thread can build this table independently, since they will always get the same result. We can merge them later, in parallel with selector matching etc. Only when we need to read out original strings (for CSSOM?) do we block on construction of the merged table. And we don't need it at all for strings ≤ 16 bytes.
So the things to investigate here are:
openssl speed sha256). Rust already uses SipHash which is supposed to be fast and crypto-strength. But we would need to run it twice to get 128 bits; 64 bits is not enough to assume collisions don't happen.We can study some of this using standalone programs before testing this approach in Servo.
I should also note that we can do the small-string optimization even when the non-optimized case is a pointer. But in that case we'll only optimize 4- or 8-byte strings.