
TLS lookups in libsyntax_pos are expensive #59718

Open
nnethercote opened this issue Apr 5, 2019 · 26 comments


@nnethercote (Contributor) commented Apr 5, 2019

#59693 is a nice speed-up for rustc, reducing instruction counts by as much as 12%. #59693 (comment) shows that approximately half the speedup is from avoiding TLS lookups.

So I thought: what else is using TLS lookups? I did some profiling and found that syntax_pos::GLOBALS accounts for most of it. It has three pieces: symbol_interner, hygiene_data, and span_interner. I profiled the places where they are accessed via GLOBALS::with:

rustc:
791545069 counts:
(  1) 499029030 (63.0%, 63.0%):     symbol_interner
(  2) 181386140 (22.9%, 86.0%):     hygiene_data
(  3) 109861627 (13.9%, 99.8%):     span_interner

ripgrep:
5455319 counts:
(  1)  2819190 (51.7%, 51.7%):     symbol_interner
(  2)  2015746 (37.0%, 88.6%):     hygiene_data
(  3)   599975 (11.0%, 99.6%):     span_interner

style-servo:
79839701 counts:
(  1) 36436621 (45.6%, 45.6%):     hygiene_data
(  2) 31539114 (39.5%, 85.1%):     symbol_interner
(  3) 11562409 (14.5%, 99.6%):     span_interner

webrender:
27006839 counts:
(  1) 11021232 (40.8%, 40.8%):     hygiene_data
(  2)  9218693 (34.1%, 74.9%):     symbol_interner
(  3)  6707365 (24.8%, 99.8%):     span_interner

These measurements are from a rustc that didn't have #59693's change applied, which avoids almost all of the span_interner accesses. And those accesses were only 11.0-24.8% of the syntax_pos::GLOBALS accesses. In other words, if we could eliminate most or all of the hygiene_data and symbol_interner accesses, we'd get even bigger wins than what we saw in #59693.

I admit that I don't understand how syntax_pos::GLOBALS works, why the TLS reference is needed for a global value.

One possible idea is to increase the size of Symbol from 4 bytes to 8 bytes, and then store short symbols (7 bytes or less) inline. Some preliminary profiling suggests this could capture roughly half of the symbols. hygiene_data is a harder nut to crack, being a more complicated structure.
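The inline-short-symbols idea can be sketched as follows. The tag bit and field layout below are purely illustrative (not an actual rustc design): one bit of an 8-byte value marks the symbol as inline, a few bits hold the length, and the remaining bytes hold up to 7 characters, so converting such a symbol to a string never touches the interner or TLS.

```rust
// Hypothetical sketch: an 8-byte Symbol that stores strings of up to
// 7 bytes inline, and would fall back to an interner index otherwise.
const INLINE_TAG: u64 = 1 << 63;

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct Symbol(u64);

impl Symbol {
    /// Pack a short string (<= 7 bytes) directly into the u64.
    fn new_inline(s: &str) -> Option<Symbol> {
        let bytes = s.as_bytes();
        if bytes.len() > 7 {
            return None; // too long: would need the interner
        }
        let mut raw = (bytes.len() as u64) << 56; // length in bits 56..59
        for (i, &b) in bytes.iter().enumerate() {
            raw |= (b as u64) << (i * 8); // chars in the low 7 bytes
        }
        Some(Symbol(raw | INLINE_TAG))
    }

    /// Recover the string without touching any interner or TLS.
    fn as_inline_str(&self) -> Option<String> {
        if self.0 & INLINE_TAG == 0 {
            return None; // interned symbol: needs the interner
        }
        let len = ((self.0 >> 56) & 0x7f) as usize; // mask off the tag bit
        let bytes: Vec<u8> = (0..len).map(|i| (self.0 >> (i * 8)) as u8).collect();
        Some(String::from_utf8(bytes).unwrap())
    }
}
```

Inline symbols also compare for equality as plain integers, which is why capturing "roughly half of the symbols" this way could pay off.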

cc @rust-lang/wg-compiler-performance

@eddyb (Member) commented Apr 5, 2019

Could these be sped up by using #[thread_local] directly, and maybe static linking instead of dynamic linking?
cc @alexcrichton

I admit that I don't understand how syntax_pos::GLOBALS works, why the TLS reference is needed for a global value.

If you're asking why it's not a plain static, that's because those should pretty much never be used, as rustc supports multiple instances per process (and e.g. rustdoc uses that to compile doc tests).

@Zoxc (Contributor) commented Apr 5, 2019

#59655 allows you to compare symbols against a predefined list of symbols without doing a TLS lookup and a string comparison. That will hopefully help some.

I'm also working on a PR which removes Symbol usage from symbol names (which tend to be unique and don't benefit from interning).

I'd also like to replace the u32 in Symbol with &'tcx str, but that would require adding a lifetime to the AST.
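The predefined-symbols idea can be sketched as follows. The names and indices are illustrative (the actual mechanism in #59655 may differ): symbols interned first, in a fixed order, get known u32 values, so comparing against a predefined symbol is a plain integer compare with no TLS lookup or string comparison.

```rust
// Hypothetical sketch: keywords get fixed, compile-time-known indices
// because they are pre-interned in a fixed order at startup.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct Symbol(u32);

mod kw {
    use super::Symbol;
    // Illustrative indices, not rustc's actual table.
    pub const FN: Symbol = Symbol(0);
    pub const LET: Symbol = Symbol(1);
}

fn is_fn_keyword(sym: Symbol) -> bool {
    sym == kw::FN // integer comparison; the interner is never touched
}
```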

@eddyb (Member) commented Apr 5, 2019

I think arena-allocating the AST is the way forward anyway, so I wouldn't mind the lifetime tbh.

@mati865 (Contributor) commented Apr 5, 2019

Is it related to #25088?

@eddyb (Member) commented Apr 5, 2019

@mati865 We can figure out by trying to use a #[thread_local] static GLOBALS: Cell<Option<...>> = Cell::new(None); directly.

@alexcrichton (Member) commented Apr 5, 2019

While #[thread_local] can be used to test performance, AFAIK it still doesn't work on MSVC. We do in fact already know that dynamic linking has a hit on performance w.r.t. instruction counts. As to whether that's PLT lookups vs thread-local lookups, I'm not sure. (I'm hoping to revive that once I get access to Windows again.)

@nnethercote (Contributor, Author) commented Apr 5, 2019

If you're asking why it's not a plain static, that's because those should pretty much never be used, as rustc supports multiple instances per process (and e.g. rustdoc uses that to compile doc tests).

I'm asking why a global data structure requires TLS to access it... global data structures and TLS seem entirely orthogonal and incompatible to me. Clearly I'm missing something. What does "multiple instances per process" mean -- instances of what?

@Mark-Simulacrum (Member) commented Apr 6, 2019

Rustdoc will use rustc_driver and a set of other APIs to essentially attempt to call rustc as if it was a function. That spawns a thread (or more, with the parallel compiler enabled); each of those threads receives its own copy of these proto-globals; that means that they aren't necessarily global in the standard sense -- more so rustc-local.

@eddyb (Member) commented Apr 6, 2019

@nnethercote All "globals" in rustc are "thread-local globals" - as in, they're "global" in the sense of "accessible from a function with no arguments" but scoped to a thread.

And by "rustc supports multiple instances" I meant "multiple instances of itself", i.e. multiple rustc invocations running concurrently on disjoint threads without interfering with each other.
(But @Mark-Simulacrum explained it better anyway)

@petrochenkov (Contributor) commented Apr 6, 2019

cc #59749 (Measure upper limit for performance of 32 bit Span)

The same thing can be measured for the symbol interner as well, I guess, to estimate the impact.

@nnethercote (Contributor, Author) commented Apr 6, 2019

That spawns a thread (or more, with the parallel compiler enabled)

So "thread" doesn't actually mean OS thread, but a rustc invocation that contains one or more OS threads, depending on whether rustc is serial or parallel. And GLOBALS isn't properly global, but only global w.r.t. a single rustc invocation.

These names are... well... I now feel more justified about my prior confusion. I've seen the word "session" used in the code, does that match "rustc invocation" as I've used it above?

I still don't understand how, in a parallel rustc, multiple OS threads can access the same TLS. Does each OS thread end up with a reference to the single mutex-protected quasi-global?

How important is the ability to run multiple rustc invocations? @eddyb said it's used for "rustdoc uses that to compile doc tests". Is it used for anything else?

@Mark-Simulacrum (Member) commented Apr 7, 2019

The threads do correspond to OS threads. However, my understanding is that GLOBALS is Session-like (just available earlier in the compilation session). I believe that's your understanding as well.

Yes, sessions are rustc "invocation" specific.

I still don't understand how, in a parallel rustc, multiple OS threads can access the same TLS. Does each OS thread end up with a reference to the single mutex-protected quasi-global?

Yes, the TLS just contains a pointer to the actual "global."

How important is the ability to run multiple rustc invocations? @eddyb said it's used for "rustdoc uses that to compile doc tests". Is it used for anything else?

My understanding is that doc tests would be considerably slower if we didn't have this in-process, multi-invocation style of building tests. I don't think it's used for anything else, beyond perhaps unit tests in a few compiler tests.

I think historically the scoped TLS in the compiler has been used as an implicit context for things like Span, TyCtxt, etc. where there's some associated state that we don't currently thread through manually. I think it's possible that over time we could migrate away from TLS and towards other methods of threading the state through (and/or true globals via e.g. lazy_static) but I am unsure if that's feasible. I think historically it's not really been viable to completely remove (we use it too much, and it may be better than the alternative).
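The pointer-in-TLS arrangement described above can be sketched as follows. This is a minimal illustration, not rustc's actual code (which uses the scoped-tls crate): each OS thread's TLS slot holds only a pointer, set when the thread joins a compiler invocation, and every slot in that invocation points at the same shared structure.

```rust
use std::cell::{Cell, RefCell};

// Stand-in for symbol_interner / hygiene_data / span_interner.
struct Globals {
    counter: RefCell<u32>,
}

thread_local! {
    // The TLS slot stores only a pointer to the shared Globals.
    static GLOBALS: Cell<Option<*const Globals>> = Cell::new(None);
}

/// Install `g` on the current thread for the duration of `f`.
fn set_globals<R>(g: &Globals, f: impl FnOnce() -> R) -> R {
    GLOBALS.with(|slot| slot.set(Some(g as *const Globals)));
    let result = f();
    GLOBALS.with(|slot| slot.set(None));
    result
}

/// Every call pays a TLS lookup just to find the pointer.
fn with_globals<R>(f: impl FnOnce(&Globals) -> R) -> R {
    GLOBALS.with(|slot| {
        let ptr = slot.get().expect("no Globals set on this thread");
        // Sound only while the enclosing `set_globals` frame is alive.
        f(unsafe { &*ptr })
    })
}
```

The per-access TLS lookup in `with_globals` is exactly the cost this issue is about.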

@eddyb (Member) commented Apr 7, 2019

We certainly do not consider "true globals" a reasonable limitation for "rustc as a library" (not to mention they'd need locks in cases where today we can use Cell/RefCell), and likely RLS would be impacted too (at least before we add multi-crate sessions to rustc).

Ideally we'd move to some language-integrated "implicit contexts" but that is nowhere near on the horizon.

@nnethercote (Contributor, Author) commented Apr 29, 2019

A problem with the current Symbol implementation is that if you want to convert a Symbol to a string -- which is common -- you have to access the array within Interner, which involves TLS.

I tried changing Symbol so that instead of an index into the interner, it just held a raw (thin) pointer to the string's chars, which avoids this problem. (This required putting the string length in the arena next to the chars. A fat pointer would have made Symbol 16 bytes, which is much too big.) And I also made the interner truly global (using lazy_static) and immortal -- symbols added are never removed. This makes it simpler because there's no subtle reasoning about lifetimes like the current implementation. (And the distinction between InternedString and LocalInternedString might not be necessary.)

I got it working, but unfortunately it was a clear slowdown of a few percent.
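The thin-pointer experiment can be sketched as follows; the details are illustrative, not the actual patch. Storing the length immediately before the character bytes lets an 8-byte thin pointer recover the full string, where a fat `&str` pointer would make Symbol 16 bytes. The interner here is deliberately immortal (allocations are leaked), mirroring the experiment's truly-global design.

```rust
// Hypothetical sketch: length-prefixed interned strings reachable
// through a thin pointer, with no TLS or interner lookup on read.

/// Allocate `s` with a 4-byte length prefix and return a thin pointer
/// to the first character byte. The allocation is leaked (immortal).
fn intern(s: &str) -> *const u8 {
    let bytes = s.as_bytes();
    assert!(bytes.len() <= u32::MAX as usize);
    let mut buf = Vec::with_capacity(4 + bytes.len());
    buf.extend_from_slice(&(bytes.len() as u32).to_ne_bytes());
    buf.extend_from_slice(bytes);
    let leaked: &'static [u8] = Box::leak(buf.into_boxed_slice());
    // Point past the length prefix, at the first character byte.
    unsafe { leaked.as_ptr().add(4) }
}

/// Recover the string from the thin pointer alone.
fn symbol_str(p: *const u8) -> &'static str {
    unsafe {
        let len_ptr = p.sub(4) as *const [u8; 4];
        let len = u32::from_ne_bytes(*len_ptr) as usize;
        std::str::from_utf8(std::slice::from_raw_parts(p, len)).unwrap()
    }
}
```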

@ishitatsuyuki (Contributor) commented Apr 29, 2019

I got it working, but unfortunately it was a clear slowdown of a few percent.

I don't get which part of your change made this slow. Since it no longer uses TLS, isn't it supposed to be faster? Or did the indirect string length make things slower?

Random note: if we're still struggling with 64-bit pointers, maybe it's time for a Java-like 32-bit mode.

@nnethercote (Contributor, Author) commented Apr 30, 2019

I improved things and now the performance is roughly the same -- some workloads are slightly better, some are slightly worse. A Cachegrind diff suggests that the ones that are worse are mostly because Token is bigger, and that hurts parsing.

@eddyb (Member) commented Apr 30, 2019

I don't think this is going down the right route. IMO, we should be making TLS cheaper (or relying on explicitly passed contexts where possible), rather than adding pointers everywhere.

@nnethercote Doesn't making it global cross-thread mean you now pay for locking where there was none before? I don't think we should be using lazy_static! or static except where impossible otherwise.

Tearing down a compiler instance in a process should not leave around leaked garbage, IMO (and keep in mind that, at least for a while longer, gensyms create fresh symbols, so something like RLS would just keep leaking memory).

@petrochenkov (Contributor) commented Apr 30, 2019

Doesn't making it global cross-thread mean you now pay for locking where there was none before?

As I understand, the locking will be there anyway once #[cfg(parallel_compiler)] is enabled by default, so the comparison would be fair if #[cfg(parallel_compiler)] were used with the "old" scheme.
However, I'm not sure whether @nnethercote did that or not.

Tearing down a compiler instance in a process should not leave around leaked garbage, IMO (and keep in mind that, at least for a while longer, gensyms create fresh symbols, so something like RLS would just keep leaking memory).

Hmm, it looks like it's not possible to link to a Discord message, so I'll copy-paste:

One thing with "one-interner-per-process" is that long running processes with many sessions (not just few like rustc or rustdoc) will need to garbage-collect it sometimes.
Setting a "session barrier" point (with no session being able to "live through" such a point) and simply emptying the interner at that point should be enough though.

So, if we provide some minimal garbage collection interface, RLS will be able to avoid leaks.

lazy_static

The const eval cannot create an empty HashMap yet?
We should teach it to do that sooner!
(As I understand, lazy_static introduces one more unnecessary synchronization on access.)

@eddyb (Member) commented Apr 30, 2019

So, if we provide some minimal garbage collection interface, RLS will be able to avoid leaks.

That works with indices, and !Send encapsulated pointers, but not &'static str or even just cross-thread Symbol access (I can create a detached thread and intern something there).

@nnethercote (Contributor, Author) commented Apr 30, 2019

I don't think this is going down the right route, IMO

I had an idea, I tried it, it didn't improve things, and I'm just reporting my experience in case it's interesting or helpful to others. Don't worry, I haven't filed a PR.

But I don't like the current interner design. I think the InternedString/LocalInternedString split is confusing, and the way TLS is used to pretend that there aren't lifetimes is un-Rust-like.

@eddyb (Member) commented Apr 30, 2019

I wouldn't personally mind an indexing-like system, but propagating the lifetimes through the entirety of rustc is a bit much, so it might get some opposition.
And you still need a context if you want the benefits of integers over pointers.
But at least a lifetime for interning/arena-allocation is becoming more common, with the query system.

IMO Rust could have properly statically checked implicit contexts with efficient access (i.e. not relying on TLS at all), but we're nowhere near a design for that, so we make do with what we've got.

@eddyb (Member) commented Apr 30, 2019

I should mention, as a data point, that the TyCtxt access via TLS in rustc was originally meant to be strictly for printing, as there is no other way to make fmt::{Display,Debug} work out of the box, and that printing was for debug logging and diagnostics, so not normal successful operation.

It was a compromise and maybe we need to keep it at that, relying on whatever contexts we have on hand instead, for most operations.

@nnethercote (Contributor, Author) commented May 2, 2019

To expand on what @eddyb said, here's an alternative design.

  • The Globals are put within ParseSess.
  • Whenever possible, we access the Globals via a tcx or sess, which avoids TLS.
  • We provide TLS access as well, for places where accessing via another structure is difficult.

This is already how TyCtxt works, see ImplicitCtxt.

This would give a combination of speed and convenience.
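The dual-path design above can be sketched as follows. The names and types are illustrative, not rustc's actual code: Globals hang off the session, the fast path reaches them through `sess` with no TLS, and a scoped TLS slot remains only as a fallback for code with no session in hand.

```rust
use std::cell::Cell;

// Placeholder contents standing in for the real Globals.
struct Globals { symbols_interned: Cell<u32> }

struct ParseSess { globals: Globals }

thread_local! {
    static FALLBACK: Cell<Option<*const Globals>> = Cell::new(None);
}

impl ParseSess {
    /// Fast path: a plain field access, no TLS involved.
    fn globals(&self) -> &Globals { &self.globals }
}

/// Install the session's Globals as the TLS fallback while `f` runs.
fn enter_session<R>(sess: &ParseSess, f: impl FnOnce() -> R) -> R {
    FALLBACK.with(|slot| slot.set(Some(sess.globals() as *const Globals)));
    let result = f();
    FALLBACK.with(|slot| slot.set(None));
    result
}

/// Slow path: only for places where no `sess` can be threaded through.
fn with_globals<R>(f: impl FnOnce(&Globals) -> R) -> R {
    FALLBACK.with(|slot| {
        let ptr = slot.get().expect("no session entered");
        // Sound only while the enclosing `enter_session` frame is alive.
        f(unsafe { &*ptr })
    })
}
```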

@nnethercote (Contributor, Author) commented May 29, 2019

So I thought: what else is using TLS lookups? I did some profiling and found that syntax_pos::GLOBALS accounts for most of it. It has three pieces, symbol_interner, hygiene_data, span_interner

The following PRs have reduced symbol_interner accesses by 45-60% across several of the benchmarks in rustc-perf: #60467, #60910, #60973, #61035. This reduced instruction counts by 1 or 2% across various workloads.

@nnethercote (Contributor, Author) commented Jun 11, 2019

Also, the following PRs have reduced hygiene_data accesses by ~50-80% across the benchmarks in rustc-perf: #61253, #61484.

@nnethercote (Contributor, Author) commented Jun 11, 2019

Here are some updated numbers, compared to the old numbers from the first comment above.

ripgrep
OLD 5455319 counts:
(  1)  2819190 (51.7%, 51.7%): symbol_interner
(  2)  2015746 (37.0%, 88.6%): hygiene_data
(  3)   599975 (11.0%, 99.6%): span_interner
NEW 2832882 counts:
(  1)  2090461 (73.8%, 73.8%): symbol_interner
(  2)   690417 (24.4%, 98.2%): hygiene_data
(  3)    33072 ( 1.2%, 99.3%): edition + hygiene_data

style-servo
OLD 79839701 counts:
(  1) 36436621 (45.6%, 45.6%): hygiene_data
(  2) 31539114 (39.5%, 85.1%): symbol_interner
(  3) 11562409 (14.5%, 99.6%): span_interner
NEW 27488566 counts:
(  1) 14725327 (53.6%, 53.6%): symbol_interner
(  2) 11787379 (42.9%, 96.4%): hygiene_data
(  3)   670264 ( 2.4%, 98.9%): edition + hygiene_data

webrender
OLD 27006839 counts:
(  1) 11021232 (40.8%, 40.8%): hygiene_data
(  2)  9218693 (34.1%, 74.9%): symbol_interner
(  3)  6707365 (24.8%, 99.8%): span_interner
NEW 11621430 counts:
(  1)  7081870 (60.9%, 60.9%): symbol_interner
(  2)  4284983 (36.9%, 97.8%): hygiene_data
(  3)   202049 ( 1.7%, 99.5%): edition + hygiene_data

The NEW measurements are for debug builds; the OLD measurements may have been for clean builds, in which case the improvements are even better than they appear. (symbol_interner is used quite a bit during debug code generation; I'm not sure about hygiene_data and edition.)

I got some nice wins from the abovementioned PRs, but nothing as big as #59693. I estimated earlier that "approximately half the speedup is from avoiding TLS lookups", but in hindsight I think that's an overestimate. The span_interner lookups avoided in #59693 all involved a TLS access and a hashtable lookup, while a much smaller fraction of the symbol_interner and hygiene_data lookups involved a hashtable lookup, which probably explains why the speedups from the PRs involving those structures weren't as large.
