Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

String interning #1153

Closed
pcwalton opened this issue Oct 30, 2013 · 14 comments
Closed

String interning #1153

pcwalton opened this issue Oct 30, 2013 · 14 comments

Comments

@pcwalton
Copy link
Contributor

@pcwalton pcwalton commented Oct 30, 2013

Currently we are 23x slower than Gecko on CSS selector matching due primarily, it seems, to the lack of interning. We should have a solution for that.

This blog post describes a data structure that may be useful for interning, thanks to sanxiyn (#1142): http://da-data.blogspot.kr/2013/03/optimistic-cuckoo-hashing-for.html

@SimonSapin
Copy link
Member

@SimonSapin SimonSapin commented Oct 30, 2013

This is related to #1135.

@kmcallister
Copy link
Contributor

@kmcallister kmcallister commented Nov 5, 2013

Today (2013-11-04) jason_ in #rust expressed interest in implementing that concurrent hashtable and indicated they might do it by the end of the week.

Also we could just bind to the C library but it's not that much code and I think a native Rust version would be better. Plus I'm not sure that code exactly matches the document linked above; for example it uses a pthreads mutex for writes.

@larsbergstrom
Copy link
Contributor

@larsbergstrom larsbergstrom commented Nov 5, 2013

If we want something that supports growth, recursive-split-ordering hash tables are pretty great stuff:
http://www.cs.tau.ac.il/~afek/SplitOrderListHashSS03.pdf

I had to survey a lot of this literature as part of my dissertation (relaxed-semantics memoization tables are similar to but also entirely different from hash tables) and would be happy to chat with / help whomever is interested in this work. It would be good to chat with People In The Know about the expected usage pattern to help us choose. e.g., known size up-front? many expected hits or few? is there a lot of concurrent access, or are our use cases fairly sequential?

Also, in general, looking at what java.utils.concurrent did is never a bad idea:
http://gee.cs.oswego.edu/cgi-bin/viewcvs.cgi/jsr166/src/main/java/util/concurrent/

I've met Doug Lea a few times and he's both really on top of the research literature and deeply involved in real-world engineering of concurrent data structures.

@SimonSapin
Copy link
Member

@SimonSapin SimonSapin commented Nov 5, 2013

If we want to intern arbitrary strings and not leak memory, we need to free unused strings at some point, meaning some kind of reference counting or garbage collection.

Given this, do we really want to share interned strings across tasks?

@SimonSapin
Copy link
Member

@SimonSapin SimonSapin commented Nov 5, 2013

Related:
#282
https://github.com/mozilla/servo/wiki/Strings
https://docs.google.com/document/d/1kOCUlJdh2WJMJGDf-WoEQhmnjKLaOYRbiHz5TiGJl14/edit?pli=1

The latter link says that Blink does not share strings between threads because their reference counting is not thread-safe.

@kmcallister
Copy link
Contributor

@kmcallister kmcallister commented Nov 5, 2013

Given this, do we really want to share interned strings across tasks?

My understanding is that we need to, because we want to create DOM nodes in the script task with interned id, class, etc. and then use those in layout during selector matching.

For that alone we wouldn't need atomic refcounting, because we already have measures in place to prevent DOM nodes from disappearing while layout is looking at them. A per-script-task interning table with task-local refcounting from the DOM nodes would suffice. In that case we don't need a concurrent hashtable at all.

But we also want to do parallel CSS and HTML parsing, and I think this implies both a concurrent hashtable and atomic refcounting for its elements. Unless there's a separate single-threaded "intern everything" pass, but I'm not sure how much of a performance hit we'd take there.

@SimonSapin
Copy link
Member

@SimonSapin SimonSapin commented Nov 6, 2013

@metajack also says on IRC that we do have thread-safe reference counting, it’s called Arc.

@kmcallister
Copy link
Contributor

@kmcallister kmcallister commented Nov 6, 2013

That's right, it's in libextra.

But it should be slower than task-local refcounting; I don't know by how much.

@ryanhc
Copy link
Contributor

@ryanhc ryanhc commented Nov 20, 2013

I would like to continue to work on this string interning, #1135, but didn't have a chance to talk to you guys when you were visiting samsung. Patrick briefly told me he had some ideas on how to solve this problem, so, @pcwalton, could you share your ideas, in particular, whether we need to make it thread-safe, and if so, designs that you have in mind? That would be very helpful :)

@larsbergstrom
Copy link
Contributor

@larsbergstrom larsbergstrom commented Nov 21, 2013

There were some notes in this thread on the mailing list:
https://groups.google.com/forum/#!topic/mozilla.dev.servo/GeFlCPqgmA0

@kmcallister kmcallister changed the title Intern class and ID names String interning Apr 23, 2014
@kmcallister
Copy link
Contributor

@kmcallister kmcallister commented Apr 23, 2014

Interning was also discussed at the 2014-02-10 Servo meeting.

I'm doing interning for the HTML parser, and would like to coordinate with interning work elsewhere in Servo. For now I plan to intern only strings from a set known at compile time (e.g. tag names). That should be pretty much orthogonal to whatever we end up doing for dynamic interning.

@kmcallister
Copy link
Contributor

@kmcallister kmcallister commented Apr 25, 2014

The mailing list discussion about table-less interning contains some other ideas relevant to interning generally.

@kmcallister
Copy link
Contributor

@kmcallister kmcallister commented Apr 25, 2014

I started doing some interning in the HTML parser.

@glennw
Copy link
Member

@glennw glennw commented Oct 22, 2014

string-cache is implemented now.

@glennw glennw closed this Oct 22, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
6 participants
You can’t perform that action at this time.