-
-
Notifications
You must be signed in to change notification settings - Fork 1.8k
paste: support multi-byte delimiters and GNU escape sequences #10840
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
paste: support multi-byte delimiters and GNU escape sequences #10840
Conversation
5865c10 to
321cb08
Compare
|
Seems like github is down? |
This comment was marked as outdated.
This comment was marked as outdated.
321cb08 to
b084aaa
Compare
|
I think you can re-run 1 job only instead of force-pushing, |
b084aaa to
1650f0e
Compare
| grep -rl '\$abs_path_dir_' tests/*/*.sh | xargs -r "${SED}" -i "s|\$abs_path_dir_|${UU_BUILD_DIR//\//\\/}|g" | ||
| # Some tests use $abs_top_builddir/src for shebangs: point them to the uutils build dir. | ||
| grep -rl '\$abs_top_builddir/src' tests/*/*.sh tests/*/*.pl | xargs -r "${SED}" -i "s|\$abs_top_builddir/src|${UU_BUILD_DIR//\//\\/}|g" | ||
| grep -rl '\$ENV{abs_top_builddir}/src' tests/*/*.pl | xargs -r "${SED}" -i "s|\$ENV{abs_top_builddir}/src|${UU_BUILD_DIR//\//\\/}|g" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pixelb Additional abs_top_builddir
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whoops that is unrelated but it changes two skips to passes, was working on that locally and it got added
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But does not mean that we are not using uutils binaries at here if we don't sed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to the logs I think we deleted the gnu coreutils binaries from that env so it means that it just skips because its unable to find a binary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we simply symlink our bins to abs_top_builddir for all tests at once?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
GNU testsuite comparison: |
1650f0e to
143fec3
Compare
|
GNU testsuite comparison: |
|
oh, nice |
| /// Returns 1 for empty, invalid, or incomplete sequences. | ||
| pub fn mb_char_len(bytes: &[u8]) -> usize { | ||
| if bytes.is_empty() { | ||
| return 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
returning 1 when empty is confusing
can you please document why ?
src/uu/paste/src/paste.rs
Outdated
| translate!("paste-error-delimiter-unescaped-backslash", "delimiters" => delimiters), | ||
| )); | ||
| fn parse_delimiters(delimiters: &OsString) -> UResult<Box<[Box<[u8]>]>> { | ||
| let bytes = uucore::os_string_to_vec(delimiters.clone())?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we use OsStr::as_bytes() on Unix or borrowing instead of cloning the OsString ?
| _ => { | ||
| // Unknown escape: strip backslash, use the following character(s) | ||
| let len = mb_char_len(&bytes[i..]); | ||
| vec.push(Box::from(&bytes[i..i + len])); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Potential panic if mb_char_len returns non-zero but bytes[i..] is shorter than returned length, no?
|
GNU testsuite comparison: |
This PR is fairly large in scope and has a few design decisions that probably warrant some additional discussion. I have been prototyping many approaches to how to deal with the issue of getting the multi-byte character sequence lengths and all of the solutions I could think of that would call the libc implementation ultimately would take a bunch of unsafe code and require FFI's and still create compatibility issues when it comes to different platforms not supporting that API.
To work backwards from what the goal was, I went through all of the glibc locales to see which ones actually had multi-byte decoding rules that were unique from UTF-8 and how many of those overlapped with one another. That led to to discover that currently there are only 5 different rule sets for this calculation and that you can determine which encoding to use from the env variables for which locale to use. Then it appears that some of the locales are hardcoded to specific encodings and the only two that I could find when searching all of the glibc locales was
"zh_CN" | "zh_SG"and"zh_TW" | "zh_HK"There were two new GNU tests that were related to this type of locale stuff: paste.pl and multi-byte.sh that this PR addresses.
This can help with the other issues in the queue:
#10184
#9712
#7600
#3075
#5831
#3997