Join GitHub today
GitHub is home to over 50 million developers working together to host and review code, manage projects, and build software together.
Sign upString interning #1153
String interning #1153
Comments
|
This is related to #1135. |
|
Today (2013-11-04) Also we could just bind to the C library but it's not that much code and I think a native Rust version would be better. Plus I'm not sure that code exactly matches the document linked above; for example it uses a |
|
If we want something that supports growth, recursive-split-ordering hash tables are pretty great stuff: I had to survey a lot of this literature as part of my dissertation (relaxed-semantics memoization tables are similar to but also entirely different from hash tables) and would be happy to chat with / help whomever is interested in this work. It would be good to chat with People In The Know about the expected usage pattern to help us choose. e.g., known size up-front? many expected hits or few? is there a lot of concurrent access, or are our use cases fairly sequential? Also, in general, looking at what java.utils.concurrent did is never a bad idea: I've met Doug Lea a few times and he's both really on top of the research literature and deeply involved in real-world engineering of concurrent data structures. |
|
If we want to intern arbitrary strings and not leak memory, we need to free unused strings at some point, meaning some kind of reference counting or garbage collection. Given this, do we really want to share interned strings across tasks? |
|
Related: The latter link says that Blink does not share strings between threads because their reference counting is not thread-safe. |
My understanding is that we need to, because we want to create DOM nodes in the script task with interned id, class, etc. and then use those in layout during selector matching. For that alone we wouldn't need atomic refcounting, because we already have measures in place to prevent DOM nodes from disappearing while layout is looking at them. A per-script-task interning table with task-local refcounting from the DOM nodes would suffice. In that case we don't need a concurrent hashtable at all. But we also want to do parallel CSS and HTML parsing, and I think this implies both a concurrent hashtable and atomic refcounting for its elements. Unless there's a separate single-threaded "intern everything" pass, but I'm not sure how much of a performance hit we'd take there. |
|
@metajack also says on IRC that we do have thread-safe reference counting, it’s called Arc. |
|
That's right, it's in libextra. But it should be slower than task-local refcounting; I don't know by how much. |
|
I would like to continue to work on this string interning, #1135, but didn't have a chance to talk to you guys when you were visiting samsung. Patrick briefly told me he had some ideas on how to solve this problem, so, @pcwalton, could you share your ideas, in particular, whether we need to make it thread-safe, and if so, designs that you have in mind? That would be very helpful :) |
|
There were some notes in this thread on the mailing list: |
|
Interning was also discussed at the 2014-02-10 Servo meeting. I'm doing interning for the HTML parser, and would like to coordinate with interning work elsewhere in Servo. For now I plan to intern only strings from a set known at compile time (e.g. tag names). That should be pretty much orthogonal to whatever we end up doing for dynamic interning. |
|
The mailing list discussion about table-less interning contains some other ideas relevant to interning generally. |
|
I started doing some interning in the HTML parser. |
|
string-cache is implemented now. |
Currently we are 23x slower than Gecko on CSS selector matching due primarily, it seems, to the lack of interning. We should have a solution for that.
This blog post describes a data structure that may be useful for interning, thanks to sanxiyn (#1142): http://da-data.blogspot.kr/2013/03/optimistic-cuckoo-hashing-for.html