Revert "unicode_data refactors RUST-147622" #148436

jieyouxu · 2025-11-03T11:49:02Z

This PR reverts RUST-147622 for several reasons:

1. The RUST-147622 PR would format the generated core library code using an arbitrary rustfmt picked up from PATH, which will cause hard-to-debug failures when the rustfmt used to format the generated unicode data code and the rustfmt used to format the in-tree library code produce incompatible formatting.
2. Previously, the unicode-table-generator tests were not run under CI as part of coretests. Since the x86_64-gnu-aux job runs library coretests with miri, the generated tests unfortunately caused an unacceptably large Merge CI time regression from ~2 hours to ~3.5 hours, making it the slowest Merge CI job (and thus the new bottleneck).
3. RUST-147622 also has the unintended effect of causing a diagnostic regression (RUST-148387), though that's mostly an edge case not properly handled by rustc diagnostics.

Given that these are three distinct causes with non-trivial fixes, I'm proposing to revert RUST-147622 to return us to baseline. This is without prejudice against relanding the changes with these issues addressed; the revert is only meant to alleviate the time pressure to address these non-trivial issues.

FYI @Kmeakin @joboet (PR author/review). Note that these issues are very subtle, so you cannot be reasonably expected to know about them beforehand.

This was discussed in:

rustbot · 2025-11-03T11:49:04Z

library/core/src/unicode/unicode_data.rs is generated by the src/tools/unicode-table-generator tool.

If you want to modify unicode_data.rs, please modify the tool then regenerate the library source file via ./x run src/tools/unicode-table-generator instead of editing unicode_data.rs manually.

rustbot · 2025-11-03T11:49:07Z

r? @joboet

rustbot has assigned @joboet.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer


joboet · 2025-11-03T11:55:21Z

I think it's quite simple to fix these issues:

1. We can remove the rustfmt invocation from the generator; a normal ./x fmt run after the generator will take care of formatting.
2. I'd just add a #[cfg(not(miri))] to the tests.
3. This issue is not exclusive to the PR.

How time-critical is this? I can whip up a PR for the two issues today.

Noratrieb · 2025-11-03T11:59:06Z

wrt 3. I'm concerned that an "NFC" refactor caused a change in behavior; that's more alarming than the diagnostics bug itself.

Zalathar · 2025-11-03T12:01:22Z

The fact that this is adding ~30 minutes to every successful merge is IMO a good reason to want to revert the changes as soon as possible, without waiting for a fix-forward.

If it's easy to fix the problems, it's just as easy to fix them in a PR that reapplies the changes.

joboet · 2025-11-03T12:01:50Z

Oh, about the second issue: the tests are marked as #[cfg_attr(miri, ignore)], so the time issue probably stems from something else...

Kmeakin · 2025-11-03T12:02:20Z

For point 2, shouldn't #[cfg_attr(miri, ignore)] cause the tests to not be run on miri?

Zalathar · 2025-11-03T12:19:43Z

2025-11-01T08:50:02.0254739Z test unicode::grapheme_extend ... ignored
2025-11-01T08:50:02.1058441Z test unicode::lowercase ... ignored
2025-11-01T10:15:58.8090642Z test unicode::n ... ok
2025-11-01T10:15:58.8822592Z test unicode::to_lowercase ... ignored
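The gap between the `lowercase` and `n` log lines is what points at `n` as the slow test. A minimal sketch of that arithmetic follows, assuming the two lines are consecutive in the job log and fall on the same calendar day. The helper names `seconds_since_midnight` and `gap_seconds` are hypothetical and not part of any CI tooling.

```rust
/// Parse the time-of-day part of a timestamp such as
/// `2025-11-01T08:50:02.0254739Z` into seconds since midnight.
/// Hypothetical helper, for illustration only.
fn seconds_since_midnight(timestamp: &str) -> Option<f64> {
    // Keep only what sits between the `T` and the trailing `Z`.
    let time = timestamp.split('T').nth(1)?.trim_end_matches('Z');
    let mut parts = time.split(':');
    let hours: f64 = parts.next()?.parse().ok()?;
    let minutes: f64 = parts.next()?.parse().ok()?;
    let seconds: f64 = parts.next()?.parse().ok()?;
    Some(hours * 3600.0 + minutes * 60.0 + seconds)
}

/// Gap in seconds between two timestamps, assuming both are on the same day.
fn gap_seconds(earlier: &str, later: &str) -> Option<f64> {
    Some(seconds_since_midnight(later)? - seconds_since_midnight(earlier)?)
}
```

Calling `gap_seconds("2025-11-01T08:50:02.1058441Z", "2025-11-01T10:15:58.8090642Z")` on the timestamps above gives a gap of roughly 5157 seconds, or about 86 minutes, between `unicode::lowercase` being reported and `unicode::n` finishing.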

rust/library/coretests/tests/unicode.rs

Lines 73 to 78 in f2bae99

    
           #[test] 
        
           fn n() { 
        
               test_boolean_property(test_data::N, unicode_data::n::lookup); 
        
           }

Kmeakin · 2025-11-03T12:22:11Z

Forgot to annotate n with #[cfg_attr(miri, ignore)] 🤦‍♂️
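For readers unfamiliar with the attribute, here is a minimal sketch of how `#[cfg_attr(miri, ignore)]` is applied to a test. The functions `is_ascii_digit_lookup` and `check_every_char` are hypothetical stand-ins for the generated lookup and the exhaustive property check; they are not the real coretests helpers.

```rust
/// Hypothetical stand-in for a generated lookup such as `unicode_data::n::lookup`.
fn is_ascii_digit_lookup(c: char) -> bool {
    c.is_ascii_digit()
}

/// Hypothetical stand-in for an exhaustive property check:
/// counts how many scalar values the lookup accepts.
fn check_every_char(lookup: fn(char) -> bool) -> usize {
    // `char` ranges skip the surrogate gap, so this visits every scalar value.
    ('\0'..=char::MAX).filter(|&c| lookup(c)).count()
}

#[test]
#[cfg_attr(miri, ignore)] // reported as `ignored` under Miri, still runs natively
fn n() {
    assert_eq!(check_every_char(is_ascii_digit_lookup), 10);
}
```

With `#[cfg_attr(miri, ignore)]` the test is still compiled under Miri but shows up as `ignored` in the log, like the other unicode tests above. The `#[cfg(not(miri))]` alternative mentioned earlier would instead remove the test from the Miri build entirely.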

Zalathar · 2025-11-03T12:43:34Z

Let's revert this quickly, to undo the impact on CI times.

Discussion of potential fixes and remaining concerns can happen on the reapply PR.

@bors r+ p=6

bors · 2025-11-03T12:43:36Z

📌 Commit 4aeb297 has been approved by Zalathar

It is now in the queue for this repository.

Zalathar · 2025-11-03T12:46:32Z

Failure in rollup would be awkward, and it would be nice to see clean stats from the revert, so:

@bors rollup=never

joboet · 2025-11-03T12:47:31Z

Fixes are up at #148436.

bors · 2025-11-03T13:23:58Z

⌛ Testing commit 4aeb297 with merge f5711a5...

bors · 2025-11-03T16:28:51Z

☀️ Test successful - checks-actions
Approved by: Zalathar
Pushing f5711a5 to master...

github-actions · 2025-11-03T16:32:05Z

What is this?

This is an experimental post-merge analysis report that shows differences in test outcomes between the merged PR and its parent PR.

Comparing 35ebdf9 (parent) -> f5711a5 (this PR)

Test differences

Show 31 test diffs

Stage 1

unicode::alphabetic: pass -> [missing] (J2)
unicode::case_ignorable: pass -> [missing] (J2)
unicode::cased: pass -> [missing] (J2)
unicode::grapheme_extend: pass -> [missing] (J2)
unicode::lowercase: pass -> [missing] (J2)
unicode::n: pass -> [missing] (J2)
unicode::to_lowercase: pass -> [missing] (J2)
unicode::to_uppercase: pass -> [missing] (J2)
unicode::uppercase: pass -> [missing] (J2)
unicode::white_space: pass -> [missing] (J2)

Stage 2

unicode::alphabetic: ignore -> [missing] (J0)
unicode::case_ignorable: ignore -> [missing] (J0)
unicode::cased: ignore -> [missing] (J0)
unicode::grapheme_extend: ignore -> [missing] (J0)
unicode::lowercase: ignore -> [missing] (J0)
unicode::to_lowercase: ignore -> [missing] (J0)
unicode::to_uppercase: ignore -> [missing] (J0)
unicode::uppercase: ignore -> [missing] (J0)
unicode::white_space: ignore -> [missing] (J0)
unicode::alphabetic: pass -> [missing] (J1)
unicode::case_ignorable: pass -> [missing] (J1)
unicode::cased: pass -> [missing] (J1)
unicode::grapheme_extend: pass -> [missing] (J1)
unicode::lowercase: pass -> [missing] (J1)
unicode::to_lowercase: pass -> [missing] (J1)
unicode::to_uppercase: pass -> [missing] (J1)
unicode::uppercase: pass -> [missing] (J1)
unicode::white_space: pass -> [missing] (J1)
unicode::n: pass -> [missing] (J3)

Additionally, 2 doctest diffs were found. These are ignored, as they are noisy.

Job group index

Test dashboard

Run

cargo run --manifest-path src/ci/citool/Cargo.toml -- \
    test-dashboard f5711a55f5d5e2f942057d0f6d648dd2d8b2c37b --output-dir test-dashboard

And then open test-dashboard/index.html in your browser to see an overview of all executed tests.

Job duration changes

x86_64-gnu-aux: 10859.2s -> 6448.4s (-40.6%)
dist-apple-various: 3500.9s -> 4241.0s (+21.1%)
dist-aarch64-apple: 7218.1s -> 8434.1s (+16.8%)
x86_64-gnu-llvm-20: 2332.7s -> 2635.4s (+13.0%)
x86_64-gnu: 6502.4s -> 7330.3s (+12.7%)
dist-loongarch64-linux: 5014.4s -> 5619.8s (+12.1%)
dist-riscv64-linux: 4696.0s -> 5194.8s (+10.6%)
dist-x86_64-msvc-alt: 8838.4s -> 9739.1s (+10.2%)
dist-various-1: 3844.3s -> 4190.0s (+9.0%)
dist-x86_64-apple: 7351.9s -> 7967.7s (+8.4%)

How to interpret the job duration changes?

Job durations can vary a lot, based on the actual runner instance
that executed the job, system noise, invalidated caches, etc. The table above is provided
mostly for t-infra members, for simpler debugging of potential CI slow-downs.
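The percentages in the table can be reproduced from the raw durations. A small sketch, with hypothetical helper names `percent_change` and `seconds_to_hours`; this is not how the report itself is generated, only the arithmetic behind it.

```rust
/// Relative change from `before` to `after`, in percent.
/// Hypothetical helper, for illustration only.
fn percent_change(before: f64, after: f64) -> f64 {
    (after - before) / before * 100.0
}

/// Convert a duration in seconds to hours.
fn seconds_to_hours(seconds: f64) -> f64 {
    seconds / 3600.0
}
```

For x86_64-gnu-aux, `percent_change(10859.2, 6448.4)` comes out at about -40.6, matching the table. In hours, the job went from roughly 3.0 to roughly 1.8 on these two runs.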

rust-timer · 2025-11-03T17:49:09Z

Finished benchmarking commit (f5711a5): comparison URL.

Overall result: ❌✅ regressions and improvements - please read the text below

Our benchmarks found a performance regression caused by this PR.
This might be an actual regression, but it can also be just noise.

Next Steps:

If the regression was expected or you think it can be justified,
please write a comment with sufficient written justification, and add
@rustbot label: +perf-regression-triaged to it, to mark the regression as triaged.
If you think that you know of a way to resolve the regression, try to create
a new PR with a fix for the regression.
If you do not understand the regression or you think that it is just noise,
you can ask the @rust-lang/wg-compiler-performance working group for help (members of this group
were already notified of this PR).

@rustbot label: +perf-regression
cc @rust-lang/wg-compiler-performance

Instruction count

Our most reliable metric. Used to determine the overall result above. However, even this metric can be noisy.

	mean	range	count
Regressions ❌ (primary)	-	-	0
Regressions ❌ (secondary)	0.2%	[0.1%, 0.3%]	5
Improvements ✅ (primary)	-0.8%	[-1.0%, -0.4%]	3
Improvements ✅ (secondary)	-	-	0
All ❌✅ (primary)	-0.8%	[-1.0%, -0.4%]	3

Max RSS (memory usage)

Results (primary 0.9%)

A less reliable metric. May be of interest, but not used to determine the overall result above.

	mean	range	count
Regressions ❌ (primary)	3.5%	[3.5%, 3.5%]	1
Regressions ❌ (secondary)	-	-	0
Improvements ✅ (primary)	-1.7%	[-1.7%, -1.7%]	1
Improvements ✅ (secondary)	-	-	0
All ❌✅ (primary)	0.9%	[-1.7%, 3.5%]	2

Cycles

This benchmark run did not return any relevant results for this metric.

Binary size

Results (primary 0.1%)

A less reliable metric. May be of interest, but not used to determine the overall result above.

	mean	range	count
Regressions ❌ (primary)	0.1%	[0.1%, 0.2%]	7
Regressions ❌ (secondary)	-	-	0
Improvements ✅ (primary)	-0.1%	[-0.1%, -0.1%]	3
Improvements ✅ (secondary)	-	-	0
All ❌✅ (primary)	0.1%	[-0.1%, 0.2%]	10

Bootstrap: 474.155s -> 474.205s (0.01%)
Artifact size: 390.84 MiB -> 390.90 MiB (0.01%)

Kobzol · 2025-11-03T20:57:49Z

Performance-wise it's a wash, and this is a revert anyway.

@rustbot label: +perf-regression-triaged

tgross35 · 2025-11-04T02:12:44Z

> Fixes are up at #148436.

That's this PR; I assume you meant #148438.

rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-bootstrap Relevant to the bootstrap subteam: Rust's build system (x.py and src/bootstrap) T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Nov 3, 2025

rustbot assigned joboet Nov 3, 2025

jieyouxu force-pushed the revert-unicode-generator branch from 1a4b577 to 4aeb297 Compare November 3, 2025 11:53

bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Nov 3, 2025

joboet mentioned this pull request Nov 3, 2025

reland and fix RUST-147622 #148438

Open

bors added the merged-by-bors This PR was explicitly merged by bors. label Nov 3, 2025

bors merged commit f5711a5 into rust-lang:master Nov 3, 2025
12 checks passed

rustbot added this to the 1.93.0 milestone Nov 3, 2025

bors mentioned this pull request Nov 3, 2025

Remove Cased Unicode table #146180

Open

rustbot added the perf-regression Performance regression. label Nov 3, 2025

bors mentioned this pull request Nov 3, 2025

Move more code out of unicode-table-generator into core #148365

Closed

rustbot added the perf-regression-triaged The performance regression has been triaged. label Nov 3, 2025

Kobzol mentioned this pull request Nov 3, 2025

Rollup of 9 pull requests #148337

Merged

jieyouxu deleted the revert-unicode-generator branch November 3, 2025 23:24
