unicode support #42
Conversation
This seems good, but I'm not an expert on JavaScript. Our resident JS expert @alokmenghrajani is on vacation right now, but maybe I can find someone else to review it.
If it is accepted, I will take a stab at tackling #34 (as the necessary logic for my use case relies on proper UTF-8 handling of JWT payloads).
     * https://github.com/google/closure-library
     * @param str string
     * @return Uint8Array
     */
What do you think about calling encodeURIComponent() and walking the resulting string looking for '%' characters? I feel it's less error-prone to leverage code which already exists.
@steike points out that encodeURIComponent gives you some strictness (e.g. disallows \ud800 by itself) which is probably a good thing.
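The approach being suggested can be sketched roughly as follows. This is an illustrative helper under my own name, not code from the PR; the key point is that encodeURIComponent already knows UTF-8 and rejects lone surrogates.

```javascript
// Sketch: string -> UTF-8 bytes by walking the percent-encoded form
// produced by encodeURIComponent. Helper name is mine, not the PR's API.
function stringToUtf8Bytes(str) {
  // encodeURIComponent throws a URIError on lone surrogates like "\ud800",
  // which gives us strictness for free.
  const encoded = encodeURIComponent(str);
  const bytes = [];
  for (let i = 0; i < encoded.length; i++) {
    if (encoded[i] === "%") {
      // "%XX" escape: the two hex digits are one UTF-8 byte
      bytes.push(parseInt(encoded.slice(i + 1, i + 3), 16));
      i += 2;
    } else {
      // unescaped ASCII character maps directly to its code point
      bytes.push(encoded.charCodeAt(i));
    }
  }
  return new Uint8Array(bytes);
}
```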
I guess I simply figured that if Google bothered to implement this in their fundamental JavaScript library underpinning all their core web apps, they must have run into some problem with methods using browser functions. If it matters, my use case is large amounts of Japanese text, so I'm not dealing with "mostly ASCII with an occasional accented character", and one issue might be how much
Unless you can show that this version is significantly faster than calling encodeURIComponent, I would prefer to use encodeURIComponent/escape. Less code == less bugs. You might also want to look at https://developer.mozilla.org/en-US/docs/Web/API/WindowBase64/btoa#Unicode_strings for the base64 part.
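The MDN page referenced above describes the classic escape/encodeURIComponent trick for base64-encoding Unicode strings. A minimal sketch (helper names are mine; escape/unescape are deprecated but still universally available):

```javascript
// Unicode string -> base64 of its UTF-8 bytes, via the MDN technique:
// encodeURIComponent produces %XX escapes of the UTF-8 bytes, and unescape
// collapses each %XX back into a single "binary" character that btoa accepts.
function utf8ToB64(str) {
  return btoa(unescape(encodeURIComponent(str)));
}

// The inverse: base64 -> binary string -> %XX escapes -> Unicode string.
function b64ToUtf8(b64) {
  return decodeURIComponent(escape(atob(b64)));
}
```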
      out[c++] = String.fromCharCode(c1);
    } else if (c1 > 191 && c1 < 224) {
      c2 = bytes[pos++];
      out[c++] = String.fromCharCode((c1 & 31) << 6 | c2 & 63);
I think the 365 constant is incorrect. Looking at https://en.wikipedia.org/wiki/UTF-8, the values we care about are:
192, 224, 240, 248, etc. (The values above 247 don't exist in JavaScript.)
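The lead-byte boundaries listed above (192, 224, 240, 248) can be sketched as a small classifier. This is an illustrative helper of my own, not code from the PR, and it only checks ranges (it does not reject overlong encodings):

```javascript
// Classify a UTF-8 lead byte by sequence length, using the boundaries
// 128, 192, 224, 240, 248 from the UTF-8 spec.
function utf8SequenceLength(lead) {
  if (lead < 0x80) return 1; // 0xxxxxxx: ASCII, one byte
  if (lead < 0xC0) return 0; // 10xxxxxx: continuation byte, not a lead
  if (lead < 0xE0) return 2; // 110xxxxx (192..223): two-byte sequence
  if (lead < 0xF0) return 3; // 1110xxxx (224..239): three-byte sequence
  if (lead < 0xF8) return 4; // 11110xxx (240..247): four-byte sequence
  return 0;                  // 248 and above are invalid in UTF-8
}
```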
The 365 constant seems to have been introduced in this commit.
Yeah, that still doesn't explain why they picked this value. If c1 is a byte, it makes no sense.
The incoming parameter is documented as {Uint8Array|Array<number>}, so perhaps in the latter case c1 isn't guaranteed to be <256.
The Mozilla link you mentioned has a note saying "A better, more faithful and less expensive solution is to use typed arrays to do the conversion". I followed that "typed arrays" link, and the implementation in this PR is intended to be "solution #2" from there.
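As an aside on the typed-array route: modern runtimes now expose TextEncoder/TextDecoder, which perform the same string-to-UTF-8-typed-array conversion natively. These APIs postdate this discussion, so treat this as context rather than a drop-in suggestion for the PR:

```javascript
// Native string <-> UTF-8 Uint8Array conversion in modern environments.
function utf8Encode(str) {
  return new TextEncoder().encode(str); // Uint8Array of UTF-8 bytes
}

function utf8Decode(bytes) {
  return new TextDecoder("utf-8").decode(bytes);
}
```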
This isn't a formal benchmark, but it seems calling atob gives the same result (if we call String.fromCharCode() first) and is faster for me (Chrome): https://jsfiddle.net/evcesa5h/. Would you be ok taking the simpler, much-less-code approach first and switching to what you have authored if we run into issues?
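The atob-based direction being discussed can be sketched like this (helper name is mine, not the PR's API): atob yields a "binary" string where each character's code is one byte, which charCodeAt then extracts.

```javascript
// Sketch: base64 -> byte array via atob. Each character of the decoded
// binary string has a code point in 0..255, i.e. exactly one byte.
function b64ToBytes(b64) {
  const bin = atob(b64);
  const bytes = new Uint8Array(bin.length);
  for (let i = 0; i < bin.length; i++) {
    bytes[i] = bin.charCodeAt(i);
  }
  return bytes;
}
```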
There are fundamentally four operations that need to exist: Unicode string to/from array of UTF-8 bytes, and array of arbitrary bytes to/from base64 string. For end-user convenience, it would be nice to have single entry points that can go from Unicode string to/from base64 string. Bottom line: all I really care about is reliably getting Unicode strings back exactly as they were, regardless of whether the producer and consumer are operating in different browser environments. I understand and sympathize with your concerns about "too much code", but I also think there is some benefit to minimizing exposure to browser compatibility issues (and realistically I don't have the resources to test a bunch of different browsers and operating systems).
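The four operations enumerated above, composed into the single string-to/from-base64 entry points, could look like this. All names are mine for illustration, not the PR's API, and the sketch leans on native TextEncoder/TextDecoder rather than the hand-rolled conversion:

```javascript
// The four primitive operations:
const strToUtf8 = (s) => new TextEncoder().encode(s);          // string -> UTF-8 bytes
const utf8ToStr = (b) => new TextDecoder("utf-8").decode(b);   // UTF-8 bytes -> string
const bytesToB64 = (b) => btoa(String.fromCharCode(...b));     // bytes -> base64
const base64ToBytes = (s) =>
  Uint8Array.from(atob(s), (c) => c.charCodeAt(0));            // base64 -> bytes

// The two convenience entry points, built by composition:
const encode = (s) => bytesToB64(strToUtf8(s));
const decode = (s) => utf8ToStr(base64ToBytes(s));
```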
This PR has bugs (e.g. \ud800 by itself, the 365 should probably be 248, it allows overlongs, etc.). I don't think it's worth putting more time into fixing these issues. I doubt escape/encodeURIComponent needs much cross-browser testing; it's been around for ages.
The JOSE spec mentions that the plaintext should be encoded as UTF-8 bytes. JavaScript by default uses UTF-16, so we need to convert when encrypting/decrypting/signing. This commit is based on the work by rapropos (see #42). I prefer to minimize the code size by re-using existing string manipulation functions.
see #46
Allow non-ASCII strings to be passed through encryption/decryption properly via UTF-8 encoding. Addresses #41.