Skip to content

Conversation

@joboet
Copy link
Member

@joboet joboet commented Nov 3, 2025

#147622 was reverted in #148436 due to some problems. This PR relands #147622 and addresses the issues described in the revert by:

  1. Removing the rustfmt invocation from the generator and telling users to run ./x fmt instead.
  2. Adding a missing #[cfg_attr(miri, ignore)] to the n test.
  3. Renaming the unicode_data::n module to number to reduce collisions.

Only ab82fbe is a new commit, the others are taken unmodified from #147622.

CC @Kmeakin @jieyouxu

@rustbot
Copy link
Collaborator

rustbot commented Nov 3, 2025

library/core/src/unicode/unicode_data.rs is generated by the src/tools/unicode-table-generator tool.

If you want to modify unicode_data.rs, please modify the tool then regenerate the library source file via ./x run src/tools/unicode-table-generator instead of editing unicode_data.rs manually.

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Nov 3, 2025
@rustbot
Copy link
Collaborator

rustbot commented Nov 3, 2025

r? @Mark-Simulacrum

rustbot has assigned @Mark-Simulacrum.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

@Kobzol
Copy link
Member

Kobzol commented Nov 3, 2025

@bors try jobs=x86_64-gnu-aux

rust-bors bot added a commit that referenced this pull request Nov 3, 2025
fix fallout from RUST-147622

try-job: x86_64-gnu-aux
@rust-bors

This comment has been minimized.

@rust-bors
Copy link

rust-bors bot commented Nov 3, 2025

☀️ Try build successful (CI)
Build commit: 66d8c54 (66d8c5408cb9e160c146ede875bce0f74483e738, parent: 35ebdf9ba1414456dfe1cb6a6b13ebae80e99734)

@rust-timer
Copy link
Collaborator

Queued 66d8c54 with parent 35ebdf9, future comparison URL.
There are currently 0 preceding artifacts in the queue.
It will probably take at least ~1.1 hours until the benchmark run finishes.

@bors
Copy link
Collaborator

bors commented Nov 3, 2025

☔ The latest upstream changes (presumably #148436) made this pull request unmergeable. Please resolve the merge conflicts.

Kmeakin and others added 6 commits November 3, 2025 19:32
Instead of `include_str!()`ing `range_search.rs`, just make it a normal
module under `core::unicode`. This means the same source code doesn't
have to be checked in twice, and it plays nicer with IDEs.

Also rename it to `rt` since it includes functions for searching the
bitsets and case conversion tables as well as the range
represesentation.
Remove `#[rustfmt::skip]` from all the generated modules in
`unicode_data.rs`. This means we won't have to worry so much about
getting indetation and formatting right when generating code.

Exempted for now some tables which would be too big when formatted by
`rustfmt`.
This check was made redundant (it will always be true) when we removed
all ASCII characters from the tables
(rust-lang@a8c6694).
To make the final output code easier to see:
* Get rid of the unnecessary line-noise of `.unwrap()`ing calls to
  `write!()` by moving the `.unwrap()` into a macro.
* Join consecutive `write!()` calls using a single multiline format
  string.
* Replace `.push()` and `.push_str(format!())` with `write!()`.
* If after doing all of the above, there is only a single `write!()`
  call in the function, just construct the string directly with
  `format!()`.
Instead of generating a standalone executable to test `unicode_data`,
generate normal tests in `coretests`. This ensures tests are always
generated, and will be run as part of the normal testsuite.

Also change the generated tests to loop over lookup tables, rather than
generating a separate `assert_eq!()` statement for every codepoint. The
old approach produced a massive (20,000 lines plus) file which took
minutes to compile!
@joboet joboet force-pushed the unicode_data_fixes branch from f34f268 to ab82fbe Compare November 3, 2025 18:35
@rustbot rustbot added the T-bootstrap Relevant to the bootstrap subteam: Rust's build system (x.py and src/bootstrap) label Nov 3, 2025
@rustbot
Copy link
Collaborator

rustbot commented Nov 3, 2025

This PR was rebased onto a different master commit. Here's a range-diff highlighting what actually changed.

Rebasing is a normal part of keeping PRs up to date, so no action is needed—this note is just to help reviewers.

@joboet joboet changed the title fix fallout from RUST-147622 reland and fix RUST-147622 Nov 3, 2025
fn run(self, builder: &Builder<'_>) {
let mut cmd = builder.tool_cmd(Tool::UnicodeTableGenerator);
cmd.arg(builder.src.join("library/core/src/unicode/unicode_data.rs"));
cmd.arg(builder.src.join("library/coretests/tests/unicode/test_data.rs"));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please add a short comment noting that these two command-line arguments are output paths for generated code?

That's useful information for bootstrap maintainers who aren't already familiar with unicode-table-generator.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also please note that this actually generates tests under coretests, which gets picked up as library tests.

@Zalathar
Copy link
Member

Zalathar commented Nov 4, 2025

One of the difficulties I had with #147622 was that the PR title and description did not describe the true scope and scale of the changes, which made it much harder to track down the build-duration regression.

Could you please update the title and description of this PR to more accurately describe what's being changed?

@clarfonthey
Copy link
Contributor

If you could also include this commit I added in a separate PR, it would be appreciated. Or, you could move the verification to a coretest, which I am also okay writing as well.

The amount of const-time code added into the generator was a bit much, and it should be kept in libcore instead.

@Zalathar
Copy link
Member

Zalathar commented Nov 4, 2025

I would also suggest that using the word “refactor” for high-impact cross-subsystem changes is, from a maintenance perspective, downright misleading.

@clarfonthey
Copy link
Contributor

clarfonthey commented Nov 4, 2025

I mean, I would argue that the term "refactor" by itself is vague in almost all cases and a general rule to avoid it in favour of being more descriptive is probably best. As in, if you wanted to create a rustbot rule to tell people to consider rewording the PR title whenever the word "refactor" is in it, I would consider that reasonable.

You can describe most code changes as "refactors," and I wouldn't disagree with them necessarily being refactors, but I do think that the term isn't very helpful for understanding what the changes are.

}

#[test]
#[cfg_attr(miri, ignore)]
Copy link
Contributor

@clarfonthey clarfonthey Nov 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These definitely should have a comment explaining why they aren't run on miri, since it really isn't clear looking at the tests alone.

Also, I'm a little confused why this isn't triggering a dead_code lint for test_boolean_property on miri as well…

Comment on lines +1671 to +1683
const {
let mut i = 0;
while i < UPPERCASE_TABLE.len() {
let (_, val) = UPPERCASE_TABLE[i];
if val & (1 << 22) == 0 {
assert!(char::from_u32(val).is_some());
} else {
let index = val & ((1 << 22) - 1);
assert!((index as usize) < UPPERCASE_TABLE_MULTI.len());
}
i += 1;
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this test function using const {} to generate asserts?

Copy link
Member

@workingjubilee workingjubilee Nov 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The accepted idiom for having a large test case that is executed during const-eval, like this, is to simply declare a const _: () =, which should not be nested inside a function for no particular reason. Such a nesting mostly serves the purpose of obfuscating what the function actually does at runtime at best, and potentially having funny results when the code flows through any sort of MIR interpreter at worst (const eval or our friend miri), because the const {} should not matter to the function's actual execution but can be visible in differing ways (that I honestly don't fully understand the nuances of) at various points during compilation to MIR.

Comment on lines +59 to +71
const {{
let mut i = 0;
while i < {case_upper}CASE_TABLE.len() {{
let (_, val) = {case_upper}CASE_TABLE[i];
if val & (1 << 22) == 0 {{
assert!(char::from_u32(val).is_some());
}} else {{
let index = val & ((1 << 22) - 1);
assert!((index as usize) < {case_upper}CASE_TABLE_MULTI.len());
}}
i += 1;
}}
}}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess my beef is with this here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-bootstrap Relevant to the bootstrap subteam: Rust's build system (x.py and src/bootstrap) T-libs Relevant to the library team, which will review and decide on the PR/issue.

Projects

None yet

Development

Successfully merging this pull request may close these issues.