Skip to content
This repository was archived by the owner on Feb 27, 2023. It is now read-only.

unicode support#42

Closed
rapropos wants to merge 1 commit intosquare:masterfrom
rapropos:unicode
Closed

unicode support#42
rapropos wants to merge 1 commit intosquare:masterfrom
rapropos:unicode

Conversation

@rapropos
Copy link
Copy Markdown
Contributor

@rapropos rapropos commented Aug 4, 2016

Allow non-ASCII strings to be passed through encryption/decryption properly via UTF-8 encoding. Addresses #41.

@csstaub
Copy link
Copy Markdown
Contributor

csstaub commented Aug 4, 2016

This seems good; but I'm not an expert on JavaScript. Our resident JS expert @alokmenghrajani is on vacation right now, but maybe I can find someone else to review it.

@rapropos
Copy link
Copy Markdown
Contributor Author

rapropos commented Aug 4, 2016

If it is accepted, I will make a stab at tackling #34 (as the necessary logic for my use case relies on proper UTF-8 handling of JWT payloads).

Comment thread lib/jose-utils.js
* https://github.com/google/closure-library
* @param str string
* @return Uint8Array
*/
Copy link
Copy Markdown
Contributor

@alokmenghrajani alokmenghrajani Sep 6, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about calling encodeURIComponent() and walking the resulting string looking for '%' characters? I feel it's less error-prone to leverage code which already exists.

@steike points out that encodeURIComponent gives you some strictness (e.g. disallows \ud800 by itself) which is probably a good thing.

@rapropos
Copy link
Copy Markdown
Contributor Author

rapropos commented Sep 7, 2016

I guess I simply figured that if Google bothered to implement this in their fundamental JavaScript library underpinning all their core web apps, they must have run into some problem with methods using browser functions. If it matters, my use case is large amounts of Japanese text, so I'm not dealing with "mostly ASCII with an occasional accented character", so one issue might be how much encodeURIComponent bloats the strings.

@alokmenghrajani
Copy link
Copy Markdown
Contributor

Unless you can show that this version is significantly faster than calling encodeURIComponent, I would prefer to use encodeURIComponent/escape. Less code == less bugs.

You might also want to look at https://developer.mozilla.org/en-US/docs/Web/API/WindowBase64/btoa#Unicode_strings for the base64 part.

Comment thread lib/jose-utils.js
out[c++] = String.fromCharCode(c1);
} else if (c1 > 191 && c1 < 224) {
c2 = bytes[pos++];
out[c++] = String.fromCharCode((c1 & 31) << 6 | c2 & 63);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the 365 constant is incorrect. Looking at https://en.wikipedia.org/wiki/UTF-8, the values we care about are:
192, 224, 240, 248, etc. (The values above 247 don't exist in javascript)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The 365 constant seems to have come in in this commit.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, still doesn't explain why they picked this value. If c1 is a byte, it makes no sense.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The incoming parameter is documented as {Uint8Array|Array<number>}, so perhaps in the latter case c1 isn't guaranteed to be <256.

@rapropos
Copy link
Copy Markdown
Contributor Author

rapropos commented Sep 7, 2016

The Mozilla link you mentioned has a note saying "A better, more faithful and less expensive solution is to use typed arrays to do the conversion". I followed that "typed arrays" link and the implementation in this PR is intended to be the "solution #2" from there.

@alokmenghrajani
Copy link
Copy Markdown
Contributor

alokmenghrajani commented Sep 7, 2016

This isn't a formal benchmark, but it seems calling atob gives the same result (if we call String.fromCharCode() first) and is faster for me (Chrome): https://jsfiddle.net/evcesa5h/.

Would you be ok taking the simpler/much less code approach first and switching to what you have authored if we run into issues?

@rapropos
Copy link
Copy Markdown
Contributor Author

rapropos commented Sep 8, 2016

There are fundamentally four operations that need to exist: Unicode string to/from array of UTF-8 bytes and array of arbitrary bytes to/from base64 string. For end user convenience, it would be nice to have single entry points that can go from Unicode string to/from base64 string. Maybe btoa can safely do the job of what I have as Utils.Base64Url.encodeArray, as in your fiddle.

Bottom line: all I really care about is reliably getting Unicode strings back exactly as they were, regardless of whether the producer and consumer are operating in different browser environments. I understand and sympathize with your concerns about "too much code", but also think there is some benefit to minimizing exposure to browser compatibility (and realistically don't have the resources to test a bunch of different browsers and operating systems).

@alokmenghrajani
Copy link
Copy Markdown
Contributor

This PR has bugs (e.g. \ud800 by itself, the 365 should probably be 248, allows overlongs, etc.). I don't think it's worth putting more time to fix these issue. I doubt escape/encodeURIComponent needs much cross-browser testing, it's been around for ages.

alokmenghrajani added a commit that referenced this pull request Sep 15, 2016
Jose spec mentions that the plaintext should be encoded as utf-8 bytes. Javascript by
default uses utf-16, so we need to convert when encrypting/decrypting/signing.

This commit is based on the work by rapropos (see #42).
I prefer to minimize the code size by re-using existing string manipulation functions.
@alokmenghrajani alokmenghrajani mentioned this pull request Sep 15, 2016
2 tasks
@alokmenghrajani
Copy link
Copy Markdown
Contributor

see #46

alokmenghrajani added a commit that referenced this pull request Sep 15, 2016
Jose spec mentions that the plaintext should be encoded as utf-8 bytes. Javascript by
default uses utf-16, so we need to convert when encrypting/decrypting/signing.

This commit is based on the work by rapropos (see #42).
I prefer to minimize the code size by re-using existing string manipulation functions.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants