paste: support multi-byte delimiters and GNU escape sequences #10840

ChrisDryden · 2026-02-09T19:42:15Z

This PR is fairly large in scope and includes a few design decisions that probably warrant further discussion. I prototyped several approaches to getting the length of a multi-byte character sequence. Every solution I could think of that called the libc implementation would ultimately need a fair amount of unsafe code and FFI, and would still create compatibility issues on platforms that do not support that API.

Working backwards from the goal, I went through all of the glibc locales to see which ones have multi-byte decoding rules that differ from UTF-8 and how many of those overlap with one another. That led me to discover that there are currently only 5 different rule sets for this calculation, and that you can determine which encoding to use from the environment variables that select the locale. Some locales also appear to be hardcoded to specific encodings; the only two groups I could find when searching all of the glibc locales were "zh_CN" | "zh_SG" and "zh_TW" | "zh_HK".
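As a rough illustration of the environment-variable lookup described above, here is a minimal Rust sketch. It assumes the usual POSIX precedence (LC_ALL, then LC_CTYPE, then LANG, with empty values treated as unset). The names `effective_ctype_locale`, `split_locale`, and `has_implied_encoding` are hypothetical and are not taken from this PR, and the sketch deliberately does not name the encodings or the five rule sets, since they are not listed in this thread.

```rust
/// Pick the locale name that governs character handling, using the
/// POSIX precedence LC_ALL > LC_CTYPE > LANG. Empty values count as unset.
/// `get` abstracts the environment lookup so the function is easy to test.
fn effective_ctype_locale<F: Fn(&str) -> Option<String>>(get: F) -> Option<String> {
    ["LC_ALL", "LC_CTYPE", "LANG"]
        .iter()
        .filter_map(|k| get(*k))
        .find(|v| !v.is_empty())
}

/// Split a locale name such as "zh_CN.GB18030@modifier" into
/// ("zh_CN", Some("GB18030")). The "@modifier" part is dropped.
fn split_locale(name: &str) -> (&str, Option<&str>) {
    let name = name.split('@').next().unwrap_or(name);
    match name.split_once('.') {
        Some((lang, codeset)) => (lang, Some(codeset)),
        None => (name, None),
    }
}

/// Hypothetical helper: true for the locale prefixes the description
/// singles out as implying a specific encoding when no codeset is given.
fn has_implied_encoding(lang: &str) -> bool {
    matches!(lang, "zh_CN" | "zh_SG" | "zh_TW" | "zh_HK")
}
```

With the real environment, the lookup could be driven by `effective_ctype_locale(|k| std::env::var(k).ok())`. An explicit codeset from `split_locale` would normally take priority, with `has_implied_encoding` consulted only when no codeset is present.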

Two new GNU tests relate to this kind of locale handling, paste.pl and multi-byte.sh, and this PR addresses both.

This can help with the other issues in the queue:
#10184
#9712
#7600
#3075
#5831
#3997

ChrisDryden · 2026-02-09T19:47:30Z

Seems like GitHub is down?

oech3 · 2026-02-09T19:53:18Z

I think you can re-run just the one job instead of force-pushing.

oech3 · 2026-02-09T20:10:22Z

util/build-gnu.sh

 grep -rl '\$abs_path_dir_' tests/*/*.sh | xargs -r "${SED}" -i "s|\$abs_path_dir_|${UU_BUILD_DIR//\//\\/}|g"
+# Some tests use $abs_top_builddir/src for shebangs: point them to the uutils build dir.
+grep -rl '\$abs_top_builddir/src' tests/*/*.sh tests/*/*.pl | xargs -r "${SED}" -i "s|\$abs_top_builddir/src|${UU_BUILD_DIR//\//\\/}|g"
+grep -rl '\$ENV{abs_top_builddir}/src' tests/*/*.pl | xargs -r "${SED}" -i "s|\$ENV{abs_top_builddir}/src|${UU_BUILD_DIR//\//\\/}|g"


@pixelb Additional abs_top_builddir

Whoops, that is unrelated, but it changes two skips to passes. I was working on that locally and it got added.

~~But doesn't that mean we are not using the uutils binaries here if we don't sed?~~

According to the logs, I think we deleted the GNU coreutils binaries from that environment, so the test just skips because it's unable to find a binary.

Can we simply symlink our binaries into abs_top_builddir for all tests at once?

github-actions · 2026-02-09T20:17:49Z

GNU testsuite comparison:

GNU test failed: tests/pr/bounded-memory. tests/pr/bounded-memory is passing on 'main'. Maybe you have to rebase?
Congrats! The gnu test tests/env/env-S is no longer failing!
Congrats! The gnu test tests/install/basic-1 is no longer failing!
Congrats! The gnu test tests/paste/multi-byte is no longer failing!
Congrats! The gnu test tests/paste/paste is no longer failing!
Congrats! The gnu test tests/pwd/pwd-long is no longer failing!
Note: The gnu test tests/misc/coreutils is now being skipped but was previously passing.
Congrats! The gnu test tests/env/env is now passing!
Congrats! The gnu test tests/env/env-S-script is now passing!
Congrats! The gnu test tests/tail/tail-n0f is now passing!

github-actions · 2026-02-09T21:53:21Z

GNU testsuite comparison:

Congrats! The gnu test tests/env/env-S is no longer failing!
Congrats! The gnu test tests/install/basic-1 is no longer failing!
Congrats! The gnu test tests/paste/multi-byte is no longer failing!
Congrats! The gnu test tests/paste/paste is no longer failing!
Congrats! The gnu test tests/pwd/pwd-long is no longer failing!
Note: The gnu test tests/misc/coreutils is now being skipped but was previously passing.
Congrats! The gnu test tests/env/env is now passing!
Congrats! The gnu test tests/env/env-S-script is now passing!
Congrats! The gnu test tests/tail/tail-n0f is now passing!

sylvestre · 2026-02-09T21:54:34Z

oh, nice

sylvestre · 2026-02-09T21:58:51Z

src/uucore/src/lib/features/i18n/charmap.rs

+/// Returns 1 for empty, invalid, or incomplete sequences.
+pub fn mb_char_len(bytes: &[u8]) -> usize {
+    if bytes.is_empty() {
+        return 1;


Returning 1 when the input is empty is confusing.
Can you please document why?
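To make the convention under discussion concrete, here is a minimal sketch of a length helper with the same documented contract (1 for empty, invalid, or incomplete input). It is a hypothetical UTF-8-only stand-in named `utf8_char_len`, not the `mb_char_len` from this PR, which handles more encodings. The `split_chars` caller is also an assumption, added to show one plausible reason for never returning 0.

```rust
/// Hypothetical UTF-8-only stand-in for the helper under review.
/// Returns the byte length of the first character, or 1 when the input
/// is empty, invalid, or cut off mid-sequence.
fn utf8_char_len(bytes: &[u8]) -> usize {
    if bytes.is_empty() {
        return 1;
    }
    // Expected sequence length, derived from the lead byte.
    let expected = match bytes[0] {
        0x00..=0x7F => 1,
        0xC2..=0xDF => 2,
        0xE0..=0xEF => 3,
        0xF0..=0xF4 => 4,
        _ => return 1, // stray continuation byte or invalid lead byte
    };
    // Accept the sequence only if all expected bytes are present and valid.
    match bytes.get(..expected) {
        Some(prefix) if std::str::from_utf8(prefix).is_ok() => expected,
        _ => 1,
    }
}

/// Split a byte string into per-character slices. The loop only
/// terminates because the helper never reports a length of 0.
fn split_chars(bytes: &[u8]) -> Vec<&[u8]> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < bytes.len() {
        let len = utf8_char_len(&bytes[i..]);
        out.push(&bytes[i..i + len]);
        i += len;
    }
    out
}
```

A caller that advances by the returned length can never stall if the result is always at least 1. That may be the motivation for the convention, though this thread does not state the actual reason.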

sylvestre · 2026-02-09T21:59:46Z

src/uu/paste/src/paste.rs

-                        translate!("paste-error-delimiter-unescaped-backslash", "delimiters" => delimiters),
-                    ));
+fn parse_delimiters(delimiters: &OsString) -> UResult<Box<[Box<[u8]>]>> {
+    let bytes = uucore::os_string_to_vec(delimiters.clone())?;


Can we use OsStr::as_bytes() on Unix, or borrow, instead of cloning the OsString?
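A minimal sketch of the borrowing approach suggested here, assuming a small wrapper named `os_str_as_bytes` (hypothetical, not a uucore function). On Unix it uses the standard `OsStrExt::as_bytes`; the non-Unix fallback is an assumption that only accepts valid Unicode and returns `None` otherwise.

```rust
use std::ffi::OsStr;

/// Borrow the raw bytes of an OS string without cloning it (Unix).
#[cfg(unix)]
fn os_str_as_bytes(s: &OsStr) -> Option<&[u8]> {
    use std::os::unix::ffi::OsStrExt;
    Some(s.as_bytes())
}

/// Assumed fallback for other platforms: borrow the UTF-8 bytes when the
/// string is valid Unicode, and report failure otherwise.
#[cfg(not(unix))]
fn os_str_as_bytes(s: &OsStr) -> Option<&[u8]> {
    s.to_str().map(str::as_bytes)
}
```

Taking `&OsStr` instead of `&OsString` also lets callers pass either type, since `&OsString` dereferences to `&OsStr`.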

sylvestre · 2026-02-09T22:00:33Z

src/uu/paste/src/paste.rs

+                _ => {
+                    // Unknown escape: strip backslash, use the following character(s)
+                    let len = mb_char_len(&bytes[i..]);
+                    vec.push(Box::from(&bytes[i..i + len]));


Potential panic if mb_char_len returns a length greater than what remains in bytes[i..], no?
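One way to rule out the out-of-bounds slice is to clamp the reported length to the bytes that actually remain. The sketch below is illustrative only: `take_char` and its `reported_len` parameter are hypothetical, and the guard that the PR ended up with may look different.

```rust
/// Return the slice for the character starting at `i`, never reading past
/// the end of `bytes`. `reported_len` is whatever the length helper said.
fn take_char(bytes: &[u8], i: usize, reported_len: usize) -> Option<&[u8]> {
    // `get` returns None instead of panicking when `i` is out of range.
    let rest = bytes.get(i..)?;
    if rest.is_empty() {
        return None;
    }
    // Clamp to [1, rest.len()] so the slice below is always in bounds.
    let len = reported_len.clamp(1, rest.len());
    Some(&rest[..len])
}
```

For example, `take_char(b"ab", 1, 4)` yields the single remaining byte rather than panicking, and an index at or past the end yields `None`.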

github-actions · 2026-02-10T04:26:49Z

GNU testsuite comparison:

Congrats! The gnu test tests/env/env-S is no longer failing!
Congrats! The gnu test tests/install/basic-1 is no longer failing!
Congrats! The gnu test tests/paste/multi-byte is no longer failing!
Congrats! The gnu test tests/paste/paste is no longer failing!
Congrats! The gnu test tests/pr/bounded-memory is no longer failing!
Congrats! The gnu test tests/pwd/pwd-long is no longer failing!
Note: The gnu test tests/misc/coreutils is now being skipped but was previously passing.
Congrats! The gnu test tests/env/env is now passing!
Congrats! The gnu test tests/env/env-S-script is now passing!

ChrisDryden force-pushed the paste-multibyte-delimiters branch from 5865c10 to 321cb08 Compare February 9, 2026 19:43

ChrisDryden force-pushed the paste-multibyte-delimiters branch from 321cb08 to b084aaa Compare February 9, 2026 19:51

ChrisDryden force-pushed the paste-multibyte-delimiters branch from b084aaa to 1650f0e Compare February 9, 2026 20:06

oech3 reviewed Feb 9, 2026

paste: support multi-byte delimiters and GNU escape sequences

143fec3

ChrisDryden force-pushed the paste-multibyte-delimiters branch from 1650f0e to 143fec3 Compare February 9, 2026 21:22

sylvestre reviewed Feb 9, 2026

paste: address review comments - avoid OsString clone and guard mb_char_len

0e25a22

sylvestre merged commit b4a4a38 into uutils:main Feb 10, 2026
157 checks passed

3 participants
