usb: allow UTF16LE characters in string descriptors #6649

IvanSanchez · 2018-03-16T13:44:41Z

Feature request:

I want to be able to write my name (Iván, with á) as UTF-8 string in the Kconfig files as part of CONFIG_USB_DEVICE_PRODUCT, and have it properly converted into a UTF-16-littleendian string in the USB descriptor structures.

~~This is currently blocked by ulfalizer/Kconfiglib#41 (AKA "UTF-8 support in Kconfiglib").~~ This depends on #6716 (AKA "force UTF-8 encoding on Kconfig files")

Note that right now the string descriptors are stored as ascii-7, and undergo a naive conversion into utf16le during the initialization of the USB subsystem, at

zephyr/subsys/usb/usb_descriptor.c

Line 644 in 7f28edc

void ascii7_to_utf16le(int idx_max, int asci_idx_max, u8_t *buf)

. I guess that the elegant solution would be to have C macros converting the UTF8 constants from the Kconfig files into u16 * constants containing non-null-terminated UTF16LE-encoded strings, and using those u16 *s in the descriptors.

Relates to #4661.

The text was updated successfully, but these errors were encountered:

ulfalizer · 2018-03-16T18:29:00Z

Hello,

As a note, Kconfiglib itself has no problem with UTF-8 in string symbols. The problem is an unfortunate default LANG=C locale setting in Docker, forcing an ASCII encoding.

I could force the encoding to UTF-8 via open(..., encoding="utf-8"), but it would cause iffiness for Python 2, requiring ugly workarounds. It would also make Kconfiglib behave in a nonstandard way, as most *nix tools respect the encoding as given in the environment.

All mainstream distributions sanely default to a UTF-8 locale (by setting the related environment variables), so I've never seen this issue until now.

Having something like this in the Dockerfile would address the root cause of the issue:

ENV LANG C.UTF-8

That's a good idea for Unicode support in general, as many other tools besides Python will use the encoding specified in the environment.

A quickfix for Python 3 is to add encoding="utf-8" to all the open() calls.

PEP 538 is related, and specifically talks about Docker. Googling "docker utf-8" indicates that the ASCII default has caused a lot of pain for other tools and languages as well (Java, SQL, Ruby, ...).

SebastianBoe · 2018-03-20T11:18:56Z

Hi @IvanSanchez , AFAICT this is not accurate:

"This is currently blocked by ulfalizer/Kconfiglib#41 (AKA "UTF-8 support in Kconfiglib")."

PR 41 is now abandoned in favour of #6716

If you are not observing any issues with UTF-8 in Kconfig comments, but are observing issues with UTF-8 in Kconfig values, then I would bet that the root cause is not related to PR 41 or #6716, but rather some other mechanism that hasn't been tested with UTF-8 yet.

IvanSanchez · 2018-03-20T11:24:04Z

Sorry about the confusion about ulfalizer/Kconfiglib#41 - I haven't been following that very closely. Will try to make some tests in the upcoming weeks and see if there's something blocking this.

ulfalizer · 2018-03-20T18:24:42Z

Should work fine with a UTF-8 locale, including for string values. I've done some testing locally.

Kconfiglib doesn't interpret text between quotes (except for escapes and the special values "n", "m", and "y"), so you get it for free from Python.

SebastianBoe · 2018-03-21T11:08:50Z

Thanks for the analysis @ulfalizer. I can also report that Kconfiglib is behaving as expected, but another piece of our infrastructure is not dealing with UTF8 values well.

Added a test for having UTF-8 characters in Kconfig values. This ensures that issue zephyrproject-rtos#6649 does not affect any supported platforms and that it does not re-appear in the future. Signed-off-by: Sebastian Bøe <sebastian.boe@nordicsemi.no>

Added a test for having UTF-8 characters in Kconfig values. This ensures that issue #6649 does not affect any supported platforms and that it does not re-appear in the future. Signed-off-by: Sebastian Bøe <sebastian.boe@nordicsemi.no>

SebastianBoe · 2018-03-22T08:49:10Z

Me and @IvanSanchez have been discussing this.

Essentially, the user wants to be able to input strings through some build-time interface,
like Kconfig, and have them stored as rodata in his application with an explicitly given encoding.

He does NOT want to do runtime conversion, e.g. from UTF-8 to UTF-16-little-endian, for performance reasons.

And he does NOT want to use the 'gen_inc_file' to include binary data in his application, because it is inconvenient/bad usability to be saving this UTF string in a dedicated file.

I'm not sure what the cleanest solution would be. Perhaps Kconfig could have an "encoding" attribute for strings.

ulfalizer · 2018-03-22T12:22:31Z

One thing that gets a bit messy is that you might end up with .config files containing multiple different encodings.

Maybe escapes could be used (CONFIG_FOO="I\\x00v\\x00\\xe1\\x00n\\x00", from "Iván".encode("utf-16-le")), though that's a bit annoying to input by hand (Kconfig does escaping too, so that's why you get those double \\).

If an option encoding=... attribute was added, it could handle the conversion on the Kconfig file end, though it wouldn't help when manually editing configuration files.

IvanSanchez · 2018-03-22T12:35:58Z

If an option encoding=... attribute was added, it could handle the conversion on the Kconfig file end, though it wouldn't help when manually editing configuration files.

That would depend on the semantics of the option encoding=, IMHO. Let all strings be in whatever encoding the Kconfig file is, and pass the responsibility of the actual re-encoding to the toolchain.

I'm worried that other implementations of Kconfig parsers might choke on "non-standard" option fields, though. But once again, this might be the path of least resistance.

ulfalizer · 2018-03-22T12:39:09Z

Hmm... I guess those double \\ would get problematic for autoconf.h as well. Maybe option encoding=... could disable \\ escaping so you end up with valid C character escapes, though it's getting slightly hairy at that point.

ulfalizer · 2018-03-22T12:47:20Z

@IvanSanchez
The C tools give a Kconfig:17: warning: ignoring unknown option encoding. So it'd be broken, but at least you might not notice it immediately. ;)

That would depend on the semantics of the option encoding=, IMHO. Let all strings be in whatever encoding the Kconfig file is, and pass the responsibility of the actual re-encoding to the toolchain.

Yeah, I'm not sure whether Kconfig is the best place to handle this either.

SebastianBoe · 2018-03-22T13:00:06Z

Maybe some kind of CPP macro;

ENCODING_CONVERT_FROM_UTF8_TO_UTF_16_LITTLE_ENDIAN(CONFIG_USB_DESCRIPTOR)

which either does the conversion in CPP (If technically possible), or marks the string with an attribute, which is detected and processed in one of the linker passes.

Or we do this in the Kconfig GUI frontend, have the GUI convert to the ascii equivalent and have the underlying tools just see weirdly formatted ascii.

ulfalizer · 2018-03-22T13:02:06Z

Though maybe this could work:

Have .config files use the default encoding (UTF-8, hopefully) and only handle option encoding=... specially for autoconf.h, outputting #define CONFIG_FOO "I\x00v\x00\xe1\x00n\x00" or the like.

You can then easily edit configuration files by hand.

I could add that upstream if it turns out to be the nicest solution. Python 2/3 compatibility and Unicode is always a bit of a pain, but it shouldn't get that bad.

IvanSanchez · 2018-03-22T13:09:02Z

Maybe some kind of CPP macro;

ENCODING_CONVERT_FROM_UTF8_TO_UTF_16_LITTLE_ENDIAN(CONFIG_USB_DESCRIPTOR)

I considered this approach with C macros, but the inability to make the preprocessor loop over characters of a constant string (plus my general lack of experience about C/C++) made me desist. I know libicu has macros for individual codepoints, but not for whole strings.

If I've overlooked something there, please prove me wrong - I'd love to see a UTF re-encoding implementation with pure macros.

IvanSanchez · 2018-03-23T08:23:24Z

I just saw #6762 - maybe that would help solve the encoding issue?

SebastianBoe · 2018-05-03T12:19:29Z

@ulfalizer : Did merging #7296 resolve this issue?

ulfalizer · 2018-05-08T12:57:19Z

@SebastianBoe
Depends... looks like this issue also deals with UTF-16 conversion. If that's handled outside of Kconfiglib, then it's solved at least.

Kconfiglib itself never had a problem with non-ASCII characters. It now also defaults to UTF-8 on Python 3 (see ulfalizer/Kconfiglib@da40c01).

ulfalizer · 2019-03-07T11:28:50Z

Closing for now. Feel free to reopen.

In retrospect, I think the best way to deal with obscure cases like this might be to require the bytes to be input individually or the like ("\x12\x73", etc.)

jfischer-no added Enhancement Changes/Updates/Additions to existing features area: USB Universal Serial Bus labels Mar 16, 2018

ulfalizer mentioned this issue Mar 18, 2018

kconfig: Remove UTF-8 character from author's name #6599

Merged

ulfalizer mentioned this issue Mar 21, 2018

setting locale breaks MacOS X builds #6726

Closed

SebastianBoe mentioned this issue Mar 21, 2018

Support utf8 in kconfig values #6732

Merged

carlescufi added the area: Configuration System label Mar 25, 2018

carlescufi added this to the v1.12.0 milestone Mar 25, 2018

carlescufi assigned ulfalizer Mar 26, 2018

SebastianBoe changed the title ~~usb: allow non-ascii characters in string descriptors~~ usb: allow UTF16LE characters in string descriptors May 8, 2018

nashif removed this from the v1.12.0 milestone Sep 5, 2018

ulfalizer closed this as completed Mar 7, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

usb: allow UTF16LE characters in string descriptors #6649

usb: allow UTF16LE characters in string descriptors #6649

IvanSanchez commented Mar 16, 2018 •

edited

ulfalizer commented Mar 16, 2018 •

edited

SebastianBoe commented Mar 20, 2018

IvanSanchez commented Mar 20, 2018

ulfalizer commented Mar 20, 2018 •

edited

SebastianBoe commented Mar 21, 2018 •

edited

SebastianBoe commented Mar 22, 2018

ulfalizer commented Mar 22, 2018 •

edited

IvanSanchez commented Mar 22, 2018

ulfalizer commented Mar 22, 2018 •

edited

ulfalizer commented Mar 22, 2018

SebastianBoe commented Mar 22, 2018

ulfalizer commented Mar 22, 2018 •

edited

IvanSanchez commented Mar 22, 2018

IvanSanchez commented Mar 23, 2018

SebastianBoe commented May 3, 2018

ulfalizer commented May 8, 2018

ulfalizer commented Mar 7, 2019

usb: allow UTF16LE characters in string descriptors #6649

usb: allow UTF16LE characters in string descriptors #6649

Comments

IvanSanchez commented Mar 16, 2018 • edited

ulfalizer commented Mar 16, 2018 • edited

SebastianBoe commented Mar 20, 2018

IvanSanchez commented Mar 20, 2018

ulfalizer commented Mar 20, 2018 • edited

SebastianBoe commented Mar 21, 2018 • edited

SebastianBoe commented Mar 22, 2018

ulfalizer commented Mar 22, 2018 • edited

IvanSanchez commented Mar 22, 2018

ulfalizer commented Mar 22, 2018 • edited

ulfalizer commented Mar 22, 2018

SebastianBoe commented Mar 22, 2018

ulfalizer commented Mar 22, 2018 • edited

IvanSanchez commented Mar 22, 2018

IvanSanchez commented Mar 23, 2018

SebastianBoe commented May 3, 2018

ulfalizer commented May 8, 2018

ulfalizer commented Mar 7, 2019

IvanSanchez commented Mar 16, 2018 •

edited

ulfalizer commented Mar 16, 2018 •

edited

ulfalizer commented Mar 20, 2018 •

edited

SebastianBoe commented Mar 21, 2018 •

edited

ulfalizer commented Mar 22, 2018 •

edited

ulfalizer commented Mar 22, 2018 •

edited

ulfalizer commented Mar 22, 2018 •

edited