Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

I don't think MultiByteToWideChar and WideCharToMultiByte need to add 1 to num_bytes #111

Closed
DavisVaughan opened this issue Dec 4, 2023 · 4 comments
Assignees
Labels
bindings Something with the low-level WinAPI bidings enhancement New feature or request

Comments

@DavisVaughan
Copy link

DavisVaughan commented Dec 4, 2023

I believe these +1 additions to num_bytes are not required, and are actually resulting in strings that are too long:
https://rodrigocfd.github.io/winsafe/src/winsafe/kernel/funcs.rs.html#1483
https://rodrigocfd.github.io/winsafe/src/winsafe/kernel/funcs.rs.html#1725


  • MultiByteToWideChar() takes a multi_byte_str: &[u8] and gets its size with multi_byte_str.len()
  • WideCharToMultiByte() takes a wide_char_str: &[u16] and gets its size with wide_char_str.len()

So for the inputs to those functions, the nul terminator should already be included in the length if there is one.


And then for the output, I'm looking at:
https://learn.microsoft.com/en-us/windows/win32/api/stringapiset/nf-stringapiset-multibytetowidechar#parameters
https://learn.microsoft.com/en-us/windows/win32/api/stringapiset/nf-stringapiset-widechartomultibyte#parameters


For MultiByteToWideChar:

For cbMultiByte (only included the "positive integer" case):

Size, in bytes, of the string indicated by the lpMultiByteStr parameter. ... If this parameter is set to a positive integer, the function processes exactly the specified number of bytes. If the provided size does not include a terminating null character, the resulting Unicode string is not null-terminated, and the returned length does not include this character.

For cchWideChar:

Size, in characters, of the buffer indicated by lpWideCharStr. If this value is 0, the function returns the required buffer size, in characters, including any terminating null character, and makes no use of the lpWideCharStr buffer.


For WideCharToMultiByte:

For cchWideChar (only included the "positive integer" case):

Size, in characters, of the string indicated by lpWideCharStr. ... If this parameter is set to a positive integer, the function processes exactly the specified number of characters. If the provided size does not include a terminating null character, the resulting character string is not null-terminated, and the returned length does not include this character.

For cbMultiByte:

Size, in bytes, of the buffer indicated by lpMultiByteStr. If this value is 0, the function returns the required buffer size, in bytes, including any terminating null character, and makes no use of the lpMultiByteStr buffer.


This documentation makes me think that the size returned from these functions when passing a null buffer already includes the nul terminator if one is present in the string, so adding 1 ends up adding extra space.

Indeed that seems to be the case in my testing as well. i.e. roundtripping an x containing something like "hi there" through this helper will result in a string with 2 extra \0 on the end. I manually removed the +1 and things start looking as I'd expect.

pub fn system_to_utf8(x: &[u8], code_page: CP) -> anyhow::Result<String> {
    let flags = MBC::NoValue;

    let x = MultiByteToWideChar(code_page, flags, x)?;

    let flags = unsafe { WC::from_raw(0) };
    let default_char = None;
    let used_default_char = None;

    let x = WideCharToMultiByte(CP::UTF8, flags, &x, default_char, used_default_char)?;

    let x = String::from_utf8(x)?;

    Ok(x)
}
@rodrigocfd rodrigocfd self-assigned this Dec 4, 2023
@rodrigocfd rodrigocfd added enhancement New feature or request bindings Something with the low-level WinAPI bidings labels Dec 4, 2023
@rodrigocfd
Copy link
Owner

I'm aware of the docs, but since these functions are at the core of the library, I was just over precautious of that extra null room... and that's because Win32 functions don't follow a standard: some of them include null in the computation, while others don't. So I chose not to take any chances.

Also: what happens if the input buffer doesn't have a terminating null?

@DavisVaughan
Copy link
Author

DavisVaughan commented Dec 5, 2023

Win32 functions don't follow a standard: some of them include null in the computation, while others don't

IIUC, the null behavior here is very well documented in the docs I copied in above. I definitely understand that the Windows API is the wild west, but I do think the behavior is stable here if I am reading it correctly!

FWIW, we've had to use these functions quite a bit when working on Windows with R and we've never had issues with needing the extra +1
https://github.com/search?q=org%3Ar-lib%20MultiByteToWideChar&type=code

Our main IDE, RStudio, also uses them this way:
https://github.com/rstudio/rstudio/blob/63be0dd74313574df1fc69e763c9824f232b521e/src/cpp/core/StringUtils.cpp#L349-L384

The C++ library, <date>, which was accepted into C++20 also uses it:
https://github.com/HowardHinnant/date/blob/cc4685a21e4a4fdae707ad1233c61bbaff241f93/src/tz.cpp#L201-L228


what happens if the input buffer doesn't have a terminating null

That is also well documented in what i copied in from above

If the provided size does not include a terminating null character, the resulting Unicode string is not null-terminated, and the returned length does not include this character

And in fact that is often the case for the RStudio IDE. On Windows we can get a crazy string like <system><UTF8marker><UTF8><UTF8markerend><system> where:

  • <system> is a chunk of the string that is system coded
  • <UTF8marker><UTF8><UTF8markerend> defines boundaries where <UTF8> is a chunk of the string that can be assumed to already be UTF8 encoded.

(Yes, it is crazy)

We end up having to take a string slice to extract out the <system> bits and use a helper like the one above to convert them to UTF8, and in that case it won't be nul terminated, but that's totally fine. The len() method from Rust tells the windows functions how much of the string to look at, and the result won't be nul terminated either (again, a good thing for us since this is just a piece of the whole string)

@rodrigocfd
Copy link
Owner

Very interesting stuff, thank you. I removed that +1 and I couldn't find any problem, indeed.

@DavisVaughan
Copy link
Author

I really appreciate that! Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bindings Something with the low-level WinAPI bidings enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants