Unicode issue #10

Closed
msmuenchen opened this issue Jan 24, 2014 · 11 comments

Comments

@msmuenchen

Hi,

I'm having trouble using Rusha to compare the hash of a string in JavaScript with the hash of the same string in PHP.

In Javascript, I use

var sha=new Rusha(); sha.digest("\u00e4")
"7e5c0f7aba32cf3e22fd30c4513a21e6d1c3aeff"

and in PHP (once using a literal ä, once a json-decode'd ä to rule out a bug in PHP or my file encoding)

$c1="ä";
$c2=json_decode("\"\\u00e4\"");
echo "1: -$c1- 2: -$c2-\n";
echo json_encode($c1)."\n";
echo json_encode($c2)."\n";

echo sha1($c1)."\n";
echo sha1($c2)."\n";

which gives me the output

1: -ä- 2: -ä-
"\u00e4"
"\u00e4"
961fa22f61a56e19f3f5f8867901ac8cf5e6d11f
961fa22f61a56e19f3f5f8867901ac8cf5e6d11f

Why are the SHA1 hashes different? After all, using the \u00e4 notation should result in the same byte sequence both in a PHP string and a Javascript string, right?

@msmuenchen
Author

Found out the reason: JS strings are stored as UTF-16 and Rusha hashes each char code as a single byte, while PHP hashes the bytes of the UTF-8 encoded string. The fix is easy with the library at http://www.onicos.com/staff/iz/amuse/javascript/expert/utf.txt; I described the usage in http://stackoverflow.com/questions/19835609/differing-sha1-hashes-for-identical-values-on-the-server-and-the-client/21341088#21341088 where someone had a similar issue.

Might it be worth incorporating this conversion into the digest function?
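
To make the mismatch concrete, here is a minimal sketch (not part of Rusha) that compares the char codes of the JS string with the bytes of its UTF-8 encoding. It assumes a runtime that provides the standard TextEncoder (modern browsers, Node.js); the helper names charCodes and utf8Bytes are made up for illustration.

```javascript
// Hypothetical helpers for illustration; neither is part of Rusha.

// Char codes (UTF-16 code units) of a JS string.
function charCodes(str) {
  var codes = [];
  for (var i = 0; i < str.length; i++) {
    codes.push(str.charCodeAt(i));
  }
  return codes;
}

// Bytes of the string's UTF-8 encoding, which is what PHP's sha1()
// sees when the PHP string is UTF-8 encoded.
function utf8Bytes(str) {
  return Array.from(new TextEncoder().encode(str));
}

var s = "\u00e4"; // ä
console.log(charCodes(s)); // one code unit: 228 (0xE4)
console.log(utf8Bytes(s)); // two bytes: 195, 164 (0xC3 0xA4)
```

The two sides feed different byte sequences into SHA-1, so the digests cannot match.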

@srijs
Owner

srijs commented Jan 25, 2014

Yes, it might be worth adding an encoding parameter to the digest method, which would be evaluated in the conversion function.

Would you like to make the change and submit a PR?

@msmuenchen
Author

I'm not that deep into JS; could you please do it?

@srijs
Owner

srijs commented Jan 25, 2014

I'm a bit short on time at the moment, but I'll see if I can get around to it sometime next week.

Anyway, thanks for pointing that out!

@stuartpb

I'll submit a patch that runs unescape(encodeURIComponent(str)) on the string before interpreting it (this converts the string to a binary string whose char codes are the bytes of its UTF-8 encoding).
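
For reference, a small sketch of what that one-liner does. The function name toUtf8BinaryString is hypothetical; only unescape and encodeURIComponent are standard. Note that unescape is deprecated (though still widely available) and that encodeURIComponent throws a URIError on lone surrogates.

```javascript
// Hypothetical helper name; the body is the one-liner proposed above.
// encodeURIComponent percent-encodes the UTF-8 bytes of the string,
// and unescape turns each %XX back into a single char with that code.
function toUtf8BinaryString(str) {
  return unescape(encodeURIComponent(str));
}

var bin = toUtf8BinaryString("\u00e4");       // ä
console.log(bin.length);                      // 2
console.log(bin.charCodeAt(0).toString(16));  // c3
console.log(bin.charCodeAt(1).toString(16));  // a4
```

Every char code in the result is below 256, which is the input shape Rusha's README asks for.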

@stuartpb

Where exactly would I insert that? https://github.com/srijs/rusha/blob/master/rusha.js#L164 looks like a good candidate.

@srijs
Owner

srijs commented Sep 15, 2014

Hi.

Please modify rusha.sweet.js. A good candidate would be the rawDigest method. It could take an optional options parameter, where you can opt-in to the unescape(encodeURIComponent(str)) conversion.
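
A rough sketch of what such an opt-in could look like, written as an external wrapper rather than a change to rawDigest itself. The names digestWithOptions, options.utf8 and the hasher argument are hypothetical and not Rusha's actual API; hasher stands for any object with a digest(binaryString) method, such as a Rusha instance.

```javascript
// Sketch only: `digestWithOptions` and `options.utf8` are hypothetical names,
// not Rusha's actual API. `hasher` is any object with a digest(binaryString)
// method, e.g. a Rusha instance.
function digestWithOptions(hasher, str, options) {
  var input = str;
  if (options && options.utf8) {
    // Opt-in: re-encode the string as a binary string of UTF-8 bytes.
    input = unescape(encodeURIComponent(str));
  }
  return hasher.digest(input);
}

// Usage, assuming Rusha is loaded:
//   var r = new Rusha();
//   digestWithOptions(r, "\u00e4", { utf8: true });
```

Keeping the conversion opt-in leaves the existing behaviour for binary strings unchanged.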

@sergeevabc

var r = new Rusha(); alert(r.digest("любовь"));

af48c12732ffdbd4299b792c2b6da6f77a0898d7 expected (works with jsSHA, CryptoJS, JSHash)
09c65cdd36ba4e6d767cde9acc71dfa75380655c rusha :(

Could you be so kind as to fix the UTF-8 issue at last?

@szydan

szydan commented Apr 6, 2016

@sergeevabc in case you still need it, from the documentation (README):
"Create a hex digest from a binary String. A binary string is expected to only contain characters whose charCode < 256"

So the library will not work on arbitrary strings.
The workaround I found for your case is to first convert your string to an array of its UTF-8 bytes and then pass that array to Rusha. See the code below:

function toUTF8Array(str) {
    var utf8 = [];
    for (var i=0; i < str.length; i++) {
        var charcode = str.charCodeAt(i);
        if (charcode < 0x80) utf8.push(charcode);
        else if (charcode < 0x800) {
            utf8.push(0xc0 | (charcode >> 6),
                      0x80 | (charcode & 0x3f));
        }
        else if (charcode < 0xd800 || charcode >= 0xe000) {
            utf8.push(0xe0 | (charcode >> 12),
                      0x80 | ((charcode>>6) & 0x3f),
                      0x80 | (charcode & 0x3f));
        }
        // surrogate pair
        else {
            i++;
            // UTF-16 encodes 0x10000-0x10FFFF by
            // subtracting 0x10000 and splitting the
            // 20 bits of 0x0-0xFFFFF into two halves
            charcode = 0x10000 + (((charcode & 0x3ff)<<10)
                      | (str.charCodeAt(i) & 0x3ff));
            utf8.push(0xf0 | (charcode >>18),
                      0x80 | ((charcode>>12) & 0x3f),
                      0x80 | ((charcode>>6) & 0x3f),
                      0x80 | (charcode & 0x3f));
        }
    }
    return utf8;
}

var r = new Rusha();
var s = "любовь";
var a = toUTF8Array(s);
console.log(r.digest(a));  // gives the expected SHA-1: af48c12732ffdbd4299b792c2b6da6f77a0898d7
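
As an aside, in runtimes that provide the standard TextEncoder (modern browsers, Node.js) the same bytes can be obtained without a hand-written encoder. This is only a sketch under that assumption; TextEncoder always encodes to UTF-8.

```javascript
// Assumes a runtime with the standard TextEncoder; it always produces UTF-8.
var bytes = Array.from(new TextEncoder().encode("любовь"));

// Print the bytes in hex to compare against a hand-written encoder.
var hex = bytes.map(function (b) { return b.toString(16); }).join(" ");
console.log(hex); // d0 bb d1 8e d0 b1 d0 be d0 b2 d1 8c

// With Rusha loaded, the array could then be hashed as in the snippet above:
//   new Rusha().digest(bytes);
```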

@sergeevabc

sergeevabc commented Apr 17, 2016

Thanks for your input, @szydan. At that time I chose Fast SHA256.

@srijs
Owner

srijs commented Jun 21, 2016

Closing this as wontfix -- Rusha is not meant to be used directly on encoded strings with code-points above 255. If you want to hash strings like these, please be sure to convert them into the desired binary encoding beforehand.
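
For callers who want to catch such input early, a minimal check might look like the sketch below. isBinaryString is a hypothetical helper, not something Rusha provides.

```javascript
// Hypothetical helper, not part of Rusha: true if every char code is below 256,
// i.e. the string is a "binary string" in the sense of the README.
function isBinaryString(str) {
  for (var i = 0; i < str.length; i++) {
    if (str.charCodeAt(i) > 255) return false;
  }
  return true;
}

console.log(isBinaryString("abc"));     // true
console.log(isBinaryString("\u00e4"));  // true
console.log(isBinaryString("любовь"));  // false
```

Passing this check does not mean the digest will match a UTF-8 based hash: "\u00e4" is a valid binary string, but it is hashed as the single byte 0xE4 rather than as its two UTF-8 bytes, so the conversion is still needed when UTF-8 semantics are intended.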

@srijs srijs closed this as completed Jun 21, 2016

5 participants