wc -m returns character count instead of byte count in C/POSIX locale #9712

@sylvestre

Component

wc -m

Description

GNU wc checks MB_CUR_MAX to determine whether to count bytes or multibyte characters. When MB_CUR_MAX == 1 (C/POSIX locale), it treats each byte as a character.

src/wc.c

static bool
wc (int fd, char const *file_x, struct fstatus *fstatus)
{
  // [...]
  /* If in the current locale, chars are equivalent to bytes, we prefer
     counting bytes, because that's easier.  */
!  if (MB_CUR_MAX > 1)
    {
      count_bytes = print_bytes;
      count_chars = print_chars;
    }
  else
    {
      count_bytes = print_bytes || print_chars;
      count_chars = false;
    }

However, uutils wc ignores locale and always counts UTF-8 characters using bytecount::num_chars().

src/uu/wc/src/count_fast.rs

pub(crate) fn count_bytes_chars_and_lines_fast<
// [...]
>(
    handle: &mut R,
) -> (WordCount, Option<io::Error>) {
    let mut total = WordCount::default();
    let buf: &mut [u8] = &mut AlignedBuffer::default().data;
    loop {
        match handle.read(buf) {
            Ok(0) => return (total, None),
            Ok(n) => {
                if COUNT_BYTES {
                    total.bytes += n;
                }
                if COUNT_CHARS {
!                    total.chars += bytecount::num_chars(&buf[..n]);
                }
                if COUNT_LINES {
                    total.lines += bytecount::count(&buf[..n], b'\n');
                }
            }
            Err(ref e) if e.kind() == ErrorKind::Interrupted => (),
            Err(e) => return (total, Some(e)),
        }
    }
}

Test / Reproduction Steps

$ echo -n "한글"|LC_ALL=C wc -m
6
$ echo -n "한글"|LC_ALL=C coreutils wc -m
2

Impact

wc -m produces different output than GNU in C/POSIX locale environments, breaking compatibility for scripts and CI pipelines that rely on locale-dependent character counting.

Recommendations

Check MB_CUR_MAX (or an equivalent locale check in Rust) before counting characters. If MB_CUR_MAX == 1, return the byte count instead of the UTF-8 character count, matching GNU behavior.
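
A minimal sketch of what such a check could look like, using only the standard library. Both `locale_is_utf8` and `count_chars` are hypothetical helpers written for illustration and are not part of uutils. The sketch assumes the locale can be approximated from the values of LC_ALL, LC_CTYPE, and LANG (in that precedence order) rather than by reading MB_CUR_MAX, and it only recognizes UTF-8 codesets, so other multibyte encodings would still be counted as bytes.

```rust
/// Hypothetical helper: returns true when the effective LC_CTYPE locale
/// names a UTF-8 codeset. The first non-empty value among LC_ALL,
/// LC_CTYPE, and LANG wins; if none is set, the "C" locale is assumed.
fn locale_is_utf8(lc_all: Option<&str>, lc_ctype: Option<&str>, lang: Option<&str>) -> bool {
    let effective = [lc_all, lc_ctype, lang]
        .iter()
        .copied()
        .flatten()
        .find(|v| !v.is_empty())
        .unwrap_or("C");

    // Locale names look like `language_TERRITORY.codeset@modifier`.
    // Drop the modifier, then take whatever follows the first dot.
    let without_modifier = effective.split('@').next().unwrap_or("");
    let codeset = without_modifier
        .split_once('.')
        .map(|(_, cs)| cs)
        .unwrap_or("");

    codeset.eq_ignore_ascii_case("UTF-8") || codeset.eq_ignore_ascii_case("UTF8")
}

/// Hypothetical helper: counts "characters" in `buf` the way `wc -m`
/// would under the assumption above.
fn count_chars(buf: &[u8], multibyte: bool) -> usize {
    if multibyte {
        // Count every byte that is not a UTF-8 continuation byte (0b10xxxxxx).
        buf.iter().filter(|&&b| (b & 0xC0) != 0x80).count()
    } else {
        // Single-byte locale: each byte is one character.
        buf.len()
    }
}
```

In a real binary the three arguments would come from `std::env::var("LC_ALL").ok()` and so on. With the input from the reproduction steps, `count_chars("한글".as_bytes(), false)` yields 6 and `count_chars("한글".as_bytes(), true)` yields 2.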
