Open
Labels: J - Encoding (encoding (UTF-8, UTF-16) related issue), U - wc, reported-canonical
Component
wc -m
Description
GNU wc checks MB_CUR_MAX to determine whether to count bytes or multibyte characters. When MB_CUR_MAX == 1 (C/POSIX locale), it treats each byte as a character.
static bool
wc (int fd, char const *file_x, struct fstatus *fstatus)
{
  // [...]
  /* If in the current locale, chars are equivalent to bytes, we prefer
     counting bytes, because that's easier.  */
! if (MB_CUR_MAX > 1)
    {
      count_bytes = print_bytes;
      count_chars = print_chars;
    }
  else
    {
      count_bytes = print_bytes || print_chars;
      count_chars = false;
    }
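For readers more at home in Rust, the branch above can be restated as a small pure function. This is only a sketch of the GNU logic as quoted: `Counters` and `select_counters` are hypothetical names, and `mb_cur_max` is a plain parameter standing in for C's `MB_CUR_MAX` (how to obtain that value in Rust is left open here).

```rust
/// Which counters actually run, mirroring the GNU `wc` branch quoted above.
/// (Hypothetical type, for illustration only.)
#[derive(Debug, PartialEq, Eq)]
struct Counters {
    count_bytes: bool,
    count_chars: bool,
}

/// `mb_cur_max` stands in for C's `MB_CUR_MAX`.
fn select_counters(mb_cur_max: usize, print_bytes: bool, print_chars: bool) -> Counters {
    if mb_cur_max > 1 {
        // Multibyte locale: bytes and characters are counted separately.
        Counters {
            count_bytes: print_bytes,
            count_chars: print_chars,
        }
    } else {
        // Single-byte locale: a request for characters is served by the
        // byte counter, and the character counter is switched off.
        Counters {
            count_bytes: print_bytes || print_chars,
            count_chars: false,
        }
    }
}
```

With `mb_cur_max == 1`, asking only for characters (`wc -m`) turns on the byte counter, which is why GNU reports a byte count in the C locale.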
However, uutils wc ignores locale and always counts UTF-8 characters using bytecount::num_chars().
pub(crate) fn count_bytes_chars_and_lines_fast<
    // [...]
>(
    handle: &mut R,
) -> (WordCount, Option<io::Error>) {
    let mut total = WordCount::default();
    let buf: &mut [u8] = &mut AlignedBuffer::default().data;
    loop {
        match handle.read(buf) {
            Ok(0) => return (total, None),
            Ok(n) => {
                if COUNT_BYTES {
                    total.bytes += n;
                }
                if COUNT_CHARS {
!                   total.chars += bytecount::num_chars(&buf[..n]);
                }
                if COUNT_LINES {
                    total.lines += bytecount::count(&buf[..n], b'\n');
                }
            }
            Err(ref e) if e.kind() == ErrorKind::Interrupted => (),
            Err(e) => return (total, Some(e)),
        }
    }
}
Test / Reproduction Steps
$ echo -n "한글"|LC_ALL=C wc -m
6
$ echo -n "한글"|LC_ALL=C coreutils wc -m
2
Impact
wc -m produces different output than GNU in C/POSIX locale environments, breaking compatibility for scripts and CI pipelines that rely on locale-dependent character counting.
Recommendations
Check MB_CUR_MAX (or an equivalent locale check in Rust) before counting characters. If MB_CUR_MAX == 1, return the byte count instead of the UTF-8 character count to match GNU behavior.
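One possible shape for such a check, as a rough sketch and not a proposed patch: resolve the effective LC_CTYPE name from the usual environment variables and fall back to counting bytes when it is "C" or "POSIX". All function names below are hypothetical. Treating an unset locale as "C" is an assumption, and so is deciding from the name alone: single-byte locales other than C/POSIX (where MB_CUR_MAX is also 1) are not handled. The multibyte branch counts non-continuation bytes using only the standard library, as a stand-in for the `bytecount::num_chars` call.

```rust
/// Resolve the effective LC_CTYPE locale name using the POSIX precedence
/// LC_ALL > LC_CTYPE > LANG. Empty values count as unset.
/// Assumption: with nothing set, the locale is treated as "C".
fn effective_ctype<'a>(
    lc_all: Option<&'a str>,
    lc_ctype: Option<&'a str>,
    lang: Option<&'a str>,
) -> &'a str {
    [lc_all, lc_ctype, lang]
        .iter()
        .copied()
        .flatten()
        .find(|v| !v.is_empty())
        .unwrap_or("C")
}

/// Assumption: only the names "C" and "POSIX" are treated as single-byte.
fn is_byte_locale(name: &str) -> bool {
    name == "C" || name == "POSIX"
}

/// Count "characters" the way `wc -m` would under this sketch.
fn count_chars(buf: &[u8], byte_locale: bool) -> usize {
    if byte_locale {
        // Single-byte locale: every byte is one character.
        buf.len()
    } else {
        // UTF-8: count every byte that is not a continuation byte (0b10xxxxxx).
        buf.iter().filter(|&&b| (b & 0xC0) != 0x80).count()
    }
}
```

In a real binary the three arguments would come from `std::env::var("LC_ALL").ok()` and friends. For the reproduction input, `count_chars("한글".as_bytes(), true)` is 6 and `count_chars("한글".as_bytes(), false)` is 2, matching the two outputs shown above.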