Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rewrite native thread-local storage #116123

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

joboet
Copy link
Contributor

@joboet joboet commented Sep 24, 2023

(part of #110897)

The current native thread-local storage implementation has become quite messy, uses indescriptive names and unnecessarily adds code to the macro expansion. This PR tries to fix that by using a new implementation that also allows more layout optimizations and potentially increases performance by eliminating unnecessary TLS accesses.

This does not change the recursive initialization behaviour I described in this comment, so it should be a library-only change. Changing that behaviour should be quite easy now, however.

r? @m-ou-se
@rustbot label +T-libs

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Sep 24, 2023
@rust-log-analyzer

This comment has been minimized.

@rust-log-analyzer

This comment has been minimized.

@rust-log-analyzer

This comment has been minimized.

@joboet
Copy link
Contributor Author

joboet commented Nov 24, 2023

@rustbot author
until I fix CI.

@rustbot rustbot added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Nov 24, 2023
@joboet joboet marked this pull request as draft November 24, 2023 12:34
Comment on lines 100 to 102
fn __init() -> $t {
$init
}
Copy link
Contributor

@m-rph m-rph Dec 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have been impl'ing a lint in clippy to suggest const if possible, and it relies on this structure existing and returning exactly $init. As long as this doesn't change, the lint should keep working.

rust-lang/rust-clippy#12015

@bors
Copy link
Contributor

bors commented Jan 13, 2024

☔ The latest upstream changes (presumably #117285) made this pull request unmergeable. Please resolve the merge conflicts.

@Dylan-DPC
Copy link
Member

@joboet any updates on this? thanks

@joboet
Copy link
Contributor Author

joboet commented Feb 15, 2024

@rustbot label +S-blocked

I want to do some other cleanups before this can be merged.

@rustbot rustbot added the S-blocked Status: Marked as blocked ❌ on something else such as an RFC or other implementation work. label Feb 15, 2024
@rustbot rustbot added A-testsuite Area: The testsuite used to check the correctness of rustc T-bootstrap Relevant to the bootstrap subteam: Rust's build system (x.py and src/bootstrap) labels Mar 12, 2024
@joboet joboet marked this pull request as ready for review March 12, 2024 13:04
@joboet
Copy link
Contributor Author

joboet commented Mar 12, 2024

@rustbot ready

The other improvements proposed by #110897 become much easier once this is merged, so I'm prioritising this one again.

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. S-blocked Status: Marked as blocked ❌ on something else such as an RFC or other implementation work. labels Mar 12, 2024
@rust-log-analyzer

This comment has been minimized.

@joboet
Copy link
Contributor Author

joboet commented Mar 16, 2024

I'm curious as to whether this affects performance. This should reduce the number of TLS accesses, after all.
@bors try @rust-timer queue

@rust-timer

This comment has been minimized.

@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Mar 16, 2024
@joboet
Copy link
Contributor Author

joboet commented Mar 18, 2024

I've split up the implementation into different types for const/lazy initialization, so we are no longer at the mercy of the optimizer to recognize a discriminant update to the state enum without copying the TLS value around. This is probably better from a readability standpoint as well, even though it duplicates some very similar code.

Let's try perf again.
@bors try @rust-timer queue

@rust-timer

This comment was marked as outdated.

@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Mar 18, 2024
bors added a commit to rust-lang-ci/rust that referenced this pull request Mar 18, 2024
Rewrite native thread-local storage

(part of rust-lang#110897)

The current native thread-local storage implementation has become quite messy, uses indescriptive names and unnecessarily adds code to the macro expansion. This PR tries to fix that by using a new implementation that also allows more layout optimizations and potentially increases performance by eliminating unnecessary TLS accesses.

This does not change the recursive initialization behaviour I described in [this comment](rust-lang#110897 (comment)), so it should be a library-only change. Changing that behaviour should be quite easy now, however.

r? `@m-ou-se`
`@rustbot` label +T-libs
@bors

This comment was marked as outdated.

@rust-log-analyzer

This comment has been minimized.

@rust-log-analyzer

This comment has been minimized.

@joboet
Copy link
Contributor Author

joboet commented Mar 18, 2024

@bors try @rust-timer queue

@rust-timer

This comment has been minimized.

bors added a commit to rust-lang-ci/rust that referenced this pull request Mar 18, 2024
Rewrite native thread-local storage

(part of rust-lang#110897)

The current native thread-local storage implementation has become quite messy, uses indescriptive names and unnecessarily adds code to the macro expansion. This PR tries to fix that by using a new implementation that also allows more layout optimizations and potentially increases performance by eliminating unnecessary TLS accesses.

This does not change the recursive initialization behaviour I described in [this comment](rust-lang#110897 (comment)), so it should be a library-only change. Changing that behaviour should be quite easy now, however.

r? `@m-ou-se`
`@rustbot` label +T-libs
@bors
Copy link
Contributor

bors commented Mar 18, 2024

⌛ Trying commit 6bc1647 with merge 765aea1...

@bors
Copy link
Contributor

bors commented Mar 18, 2024

☀️ Try build successful - checks-actions
Build commit: 765aea1 (765aea13f10fbcaea0da13cb3ad88f9d65be842d)

1 similar comment
@bors
Copy link
Contributor

bors commented Mar 18, 2024

☀️ Try build successful - checks-actions
Build commit: 765aea1 (765aea13f10fbcaea0da13cb3ad88f9d65be842d)

@rust-timer

This comment has been minimized.

@rust-timer
Copy link
Collaborator

Finished benchmarking commit (765aea1): comparison URL.

Overall result: ❌✅ regressions and improvements - ACTION NEEDED

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @rustbot label: +perf-regression-triaged along with sufficient written justification. If you cannot justify the regressions please fix the regressions and do another perf run. If the next run shows neutral or positive results, the label will be automatically removed.

@bors rollup=never
@rustbot label: -S-waiting-on-perf +perf-regression

Instruction count

This is a highly reliable metric that was used to determine the overall result at the top of this comment.

mean range count
Regressions ❌
(primary)
0.3% [0.2%, 0.4%] 4
Regressions ❌
(secondary)
1.3% [1.3%, 1.3%] 3
Improvements ✅
(primary)
-0.8% [-2.2%, -0.3%] 6
Improvements ✅
(secondary)
- - 0
All ❌✅ (primary) -0.4% [-2.2%, 0.4%] 10

Max RSS (memory usage)

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

mean range count
Regressions ❌
(primary)
6.4% [0.8%, 18.7%] 4
Regressions ❌
(secondary)
- - 0
Improvements ✅
(primary)
-4.2% [-9.3%, -0.2%] 3
Improvements ✅
(secondary)
- - 0
All ❌✅ (primary) 1.8% [-9.3%, 18.7%] 7

Cycles

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

mean range count
Regressions ❌
(primary)
1.7% [1.6%, 1.7%] 2
Regressions ❌
(secondary)
- - 0
Improvements ✅
(primary)
-1.5% [-2.1%, -0.9%] 2
Improvements ✅
(secondary)
-6.1% [-6.1%, -6.1%] 1
All ❌✅ (primary) 0.1% [-2.1%, 1.7%] 4

Binary size

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

mean range count
Regressions ❌
(primary)
0.3% [0.0%, 0.6%] 33
Regressions ❌
(secondary)
0.1% [0.0%, 1.4%] 38
Improvements ✅
(primary)
-0.2% [-0.5%, -0.0%] 22
Improvements ✅
(secondary)
- - 0
All ❌✅ (primary) 0.1% [-0.5%, 0.6%] 55

Bootstrap: 667.419s -> 668.667s (0.19%)
Artifact size: 312.79 MiB -> 312.84 MiB (0.02%)

@rustbot rustbot removed the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Mar 18, 2024
@joboet
Copy link
Contributor Author

joboet commented Mar 20, 2024

That's definitely better. Still, some more regressions than I'd hoped. My guess is that LLVM misses some optimizations on const initialized locals because the state is now in the same global as the data. I don't know how to best resolve that in the general case, as that separation is the cause of the improvements. In the specific case of proc-macros, I think we can optimize the TLS usage a bit, but that's for another PR.

Given that the average performance has improved, I'd say that the regressions are worth it. What do you think, @m-ou-se?

@saethlin
Copy link
Member

The cachegrind diffs for the regressions are still dominated by __tls_get_addr so I think it's wrong to blame missed optimizations.

@joboet
Copy link
Contributor Author

joboet commented Mar 20, 2024

The cachegrind diffs for the regressions are still dominated by __tls_get_addr so I think it's wrong to blame missed optimizations.

No, it is not wrong.

For const-initialized TLS variables, which is what is used in the regressions, the old implementation has at least two TLS references (on the fast path, which is what counts here). One is for the state and the other for the data. This PR reduces that to just one reference.

I'll interpret an increase in instructions for __tls_get_addr as an increase in the number of calls to that function. As the number of TLS references in the rustc generated IR has reduced, this can only mean that either

  • the number of TLS references that LLVM was able to remove in the old implementation is more than twice that of the new implementation
  • LLVM generates more calls to __tls_get_addr for the new implementation than necessary. This can happen because LLVM sees thread-local references as references to globals, which it doesn't like to cache. This was supposedly fixed for Linux, but still, the way the IR works there's always a risk of introducing more calls.

In either of those cases, the problem is bad optimization.

@bors
Copy link
Contributor

bors commented Apr 6, 2024

☔ The latest upstream changes (presumably #123339) made this pull request unmergeable. Please resolve the merge conflicts.

@bors
Copy link
Contributor

bors commented Apr 27, 2024

☔ The latest upstream changes (presumably #124428) made this pull request unmergeable. Please resolve the merge conflicts.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-testsuite Area: The testsuite used to check the correctness of rustc perf-regression Performance regression. S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-bootstrap Relevant to the bootstrap subteam: Rust's build system (x.py and src/bootstrap) T-libs Relevant to the library team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

10 participants