Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

usb: allow UTF16LE characters in string descriptors #6649

Closed
IvanSanchez opened this issue Mar 16, 2018 · 17 comments
Closed

usb: allow UTF16LE characters in string descriptors #6649

IvanSanchez opened this issue Mar 16, 2018 · 17 comments
Assignees
Labels
area: Configuration System area: USB Universal Serial Bus Enhancement Changes/Updates/Additions to existing features

Comments

@IvanSanchez
Copy link
Contributor

IvanSanchez commented Mar 16, 2018

Feature request:

I want to be able to write my name (Iván, with á) as UTF-8 string in the Kconfig files as part of CONFIG_USB_DEVICE_PRODUCT, and have it properly converted into a UTF-16-littleendian string in the USB descriptor structures.

This is currently blocked by ulfalizer/Kconfiglib#41 (AKA "UTF-8 support in Kconfiglib"). This depends on #6716 (AKA "force UTF-8 encoding on Kconfig files")

Note that right now the string descriptors are stored as ascii-7, and undergo a naive conversion into utf16le during the initialization of the USB subsystem, at

void ascii7_to_utf16le(int idx_max, int asci_idx_max, u8_t *buf)
. I guess that the elegant solution would be to have C macros converting the UTF8 constants from the Kconfig files into u16 * constants containing non-null-terminated UTF16LE-encoded strings, and using those u16 *s in the descriptors.

Relates to #4661.

@jfischer-no jfischer-no added Enhancement Changes/Updates/Additions to existing features area: USB Universal Serial Bus labels Mar 16, 2018
@ulfalizer
Copy link
Collaborator

ulfalizer commented Mar 16, 2018

Hello,

As a note, Kconfiglib itself has no problem with UTF-8 in string symbols. The problem is an unfortunate default LANG=C locale setting in Docker, forcing an ASCII encoding.

I could force the encoding to UTF-8 via open(..., encoding="utf-8"), but it would cause iffiness for Python 2, requiring ugly workarounds. It would also make Kconfiglib behave in a nonstandard way, as most *nix tools respect the encoding as given in the environment.

All mainstream distributions sanely default to a UTF-8 locale (by setting the related environment variables), so I've never seen this issue until now.

Having something like this in the Dockerfile would address the root cause of the issue:

ENV LANG C.UTF-8

That's a good idea for Unicode support in general, as many other tools besides Python will use the encoding specified in the environment.

A quickfix for Python 3 is to add encoding="utf-8" to all the open() calls.

PEP 538 is related, and specifically talks about Docker. Googling "docker utf-8" indicates that the ASCII default has caused a lot of pain for other tools and languages as well (Java, SQL, Ruby, ...).

@SebastianBoe
Copy link
Collaborator

Hi @IvanSanchez , AFAICT this is not accurate:

"This is currently blocked by ulfalizer/Kconfiglib#41 (AKA "UTF-8 support in Kconfiglib")."

PR 41 is now abandoned in favour of #6716

If you are not observing any issues with UTF-8 in Kconfig comments, but are observing issues with UTF-8 in Kconfig values, then I would bet that the root cause is not related to PR 41 or #6716, but rather some other mechanism that hasn't been tested with UTF-8 yet.

@IvanSanchez
Copy link
Contributor Author

Sorry about the confusion about ulfalizer/Kconfiglib#41 - I haven't been following that very closely. Will try to make some tests in the upcoming weeks and see if there's something blocking this.

@ulfalizer
Copy link
Collaborator

ulfalizer commented Mar 20, 2018

Should work fine with a UTF-8 locale, including for string values. I've done some testing locally.

Kconfiglib doesn't interpret text between quotes (except for escapes and the special values "n", "m", and "y"), so you get it for free from Python.

@SebastianBoe
Copy link
Collaborator

SebastianBoe commented Mar 21, 2018

Thanks for the analysis @ulfalizer. I can also report that Kconfiglib is behaving as expected, but another piece of our infrastructure is not dealing with UTF8 values well.

SebastianBoe added a commit to SebastianBoe/zephyr that referenced this issue Mar 21, 2018
Added a test for having UTF-8 characters in Kconfig values. This
ensures that issue
zephyrproject-rtos#6649 does not
affect any supported platforms and that it does not re-appear in the
future.

Signed-off-by: Sebastian Bøe <sebastian.boe@nordicsemi.no>
nashif pushed a commit that referenced this issue Mar 21, 2018
Added a test for having UTF-8 characters in Kconfig values. This
ensures that issue
#6649 does not
affect any supported platforms and that it does not re-appear in the
future.

Signed-off-by: Sebastian Bøe <sebastian.boe@nordicsemi.no>
@SebastianBoe
Copy link
Collaborator

Me and @IvanSanchez have been discussing this.

Essentially, the user wants to be able to input strings through some build-time interface,
like Kconfig, and have them stored as rodata in his application with an explicitly given encoding.

He does NOT want to do runtime conversion, e.g. from UTF-8 to UTF-16-little-endian, for performance reasons.

And he does NOT want to use the 'gen_inc_file' to include binary data in his application, because it is inconvenient/bad usability to be saving this UTF string in a dedicated file.

I'm not sure what the cleanest solution would be. Perhaps Kconfig could have an "encoding" attribute for strings.

@ulfalizer
Copy link
Collaborator

ulfalizer commented Mar 22, 2018

One thing that gets a bit messy is that you might end up with .config files containing multiple different encodings.

Maybe escapes could be used (CONFIG_FOO="I\\x00v\\x00\\xe1\\x00n\\x00", from "Iván".encode("utf-16-le")), though that's a bit annoying to input by hand (Kconfig does escaping too, so that's why you get those double \\).

If an option encoding=... attribute was added, it could handle the conversion on the Kconfig file end, though it wouldn't help when manually editing configuration files.

@IvanSanchez
Copy link
Contributor Author

If an option encoding=... attribute was added, it could handle the conversion on the Kconfig file end, though it wouldn't help when manually editing configuration files.

That would depend on the semantics of the option encoding=, IMHO. Let all strings be in whatever encoding the Kconfig file is, and pass the responsibility of the actual re-encoding to the toolchain.

I'm worried that other implementations of Kconfig parsers might choke on "non-standard" option fields, though. But once again, this might be the path of least resistance.

@ulfalizer
Copy link
Collaborator

ulfalizer commented Mar 22, 2018

Hmm... I guess those double \\ would get problematic for autoconf.h as well. Maybe option encoding=... could disable \\ escaping so you end up with valid C character escapes, though it's getting slightly hairy at that point.

@ulfalizer
Copy link
Collaborator

@IvanSanchez
The C tools give a Kconfig:17: warning: ignoring unknown option encoding. So it'd be broken, but at least you might not notice it immediately. ;)

That would depend on the semantics of the option encoding=, IMHO. Let all strings be in whatever encoding the Kconfig file is, and pass the responsibility of the actual re-encoding to the toolchain.

Yeah, I'm not sure whether Kconfig is the best place to handle this either.

@SebastianBoe
Copy link
Collaborator

Maybe some kind of CPP macro;

ENCODING_CONVERT_FROM_UTF8_TO_UTF_16_LITTLE_ENDIAN(CONFIG_USB_DESCRIPTOR)

which either does the conversion in CPP (If technically possible), or marks the string with an attribute, which is detected and processed in one of the linker passes.

Or we do this in the Kconfig GUI frontend, have the GUI convert to the ascii equivalent and have the underlying tools just see weirdly formatted ascii.

@ulfalizer
Copy link
Collaborator

ulfalizer commented Mar 22, 2018

Though maybe this could work:

Have .config files use the default encoding (UTF-8, hopefully) and only handle option encoding=... specially for autoconf.h, outputting #define CONFIG_FOO "I\x00v\x00\xe1\x00n\x00" or the like.

You can then easily edit configuration files by hand.

I could add that upstream if it turns out to be the nicest solution. Python 2/3 compatibility and Unicode is always a bit of a pain, but it shouldn't get that bad.

@IvanSanchez
Copy link
Contributor Author

Maybe some kind of CPP macro;

ENCODING_CONVERT_FROM_UTF8_TO_UTF_16_LITTLE_ENDIAN(CONFIG_USB_DESCRIPTOR)

I considered this approach with C macros, but the inability to make the preprocessor loop over characters of a constant string (plus my general lack of experience about C/C++) made me desist. I know libicu has macros for individual codepoints, but not for whole strings.

If I've overlooked something there, please prove me wrong - I'd love to see a UTF re-encoding implementation with pure macros.

@IvanSanchez
Copy link
Contributor Author

I just saw #6762 - maybe that would help solve the encoding issue?

@SebastianBoe
Copy link
Collaborator

@ulfalizer : Did merging #7296 resolve this issue?

@ulfalizer
Copy link
Collaborator

@SebastianBoe
Depends... looks like this issue also deals with UTF-16 conversion. If that's handled outside of Kconfiglib, then it's solved at least.

Kconfiglib itself never had a problem with non-ASCII characters. It now also defaults to UTF-8 on Python 3 (see ulfalizer/Kconfiglib@da40c01).

@SebastianBoe SebastianBoe changed the title usb: allow non-ascii characters in string descriptors usb: allow UTF16LE characters in string descriptors May 8, 2018
@nashif nashif removed this from the v1.12.0 milestone Sep 5, 2018
@ulfalizer
Copy link
Collaborator

Closing for now. Feel free to reopen.

In retrospect, I think the best way to deal with obscure cases like this might be to require the bytes to be input individually or the like ("\x12\x73", etc.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: Configuration System area: USB Universal Serial Bus Enhancement Changes/Updates/Additions to existing features
Projects
None yet
Development

No branches or pull requests

6 participants