Unicode issue #10

msmuenchen · 2014-01-24T19:01:12Z

Hi,

I'm having problems using Rusha to compare the hash of a string in JavaScript with the hash of the same string in PHP.

In Javascript, I use

var sha=new Rusha(); sha.digest("\u00e4")
"7e5c0f7aba32cf3e22fd30c4513a21e6d1c3aeff"

and in PHP (once using a literal ä, once a json-decode'd ä to rule out a bug in PHP or my file encoding)

$c1="ä";
$c2=json_decode("\"\\u00e4\"");
echo "1: -$c1- 2: -$c2-\n";
echo json_encode($c1)."\n";
echo json_encode($c2)."\n";

echo sha1($c1)."\n";
echo sha1($c2)."\n";

which gives me the output

1: -ä- 2: -ä-
"\u00e4"
"\u00e4"
961fa22f61a56e19f3f5f8867901ac8cf5e6d11f
961fa22f61a56e19f3f5f8867901ac8cf5e6d11f

Why are the SHA1 hashes different? After all, using the \u00e4 notation should result in the same byte sequence both in a PHP string and a Javascript string, right?


msmuenchen · 2014-01-24T19:41:36Z

Found out the reason: JS strings are stored as UTF-16 code units, while PHP strings are byte sequences (here UTF-8 encoded), so the two hash functions see different bytes. The fix is easy with the library at http://www.onicos.com/staff/iz/amuse/javascript/expert/utf.txt; I described the usage in http://stackoverflow.com/questions/19835609/differing-sha1-hashes-for-identical-values-on-the-server-and-the-client/21341088#21341088 where someone had a similar issue.

Might it be worth incorporating this conversion into the digest function?
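
To make the mismatch concrete, here is a minimal sketch, assuming an environment that provides TextEncoder (modern browsers and Node.js). It compares the char codes of the JS string with its UTF-8 bytes; the variable names are illustrative only.

```javascript
// "\u00e4" is a single UTF-16 code unit in a JavaScript string.
var s = "\u00e4";

// One value per char code: what a hasher sees if it treats the
// string as a "binary string" (one byte per character).
var asCharCodes = [];
for (var i = 0; i < s.length; i++) {
  asCharCodes.push(s.charCodeAt(i));
}

// The UTF-8 encoding of the same text: the bytes a UTF-8 encoded
// PHP string holds.
var asUtf8 = Array.from(new TextEncoder().encode(s));

console.log(asCharCodes.join(" ")); // 228      (0xE4)
console.log(asUtf8.join(" "));      // 195 164  (0xC3 0xA4)
```

The text is the same, but the byte sequences differ, so the SHA-1 digests differ as well.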

srijs · 2014-01-25T00:58:12Z

Yes, it might be worth adding an encoding parameter to the digest method, which would be evaluated in the conversion function.

Would you like to make the change and submit a PR?

msmuenchen · 2014-01-25T18:23:52Z

I'm not that deep into JS, can you please do it?

srijs · 2014-01-25T18:32:10Z

I'm a bit short on time at the moment, but I'll see if I can get around to it sometime next week.

Anyway, thanks for pointing that out!

stuartpb · 2014-09-14T14:23:24Z

I'll submit a patch that runs unescape(encodeURIComponent(str)) on the string before interpreting it (this converts the string to its equivalent UTF-8 character codes).
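
A minimal sketch of that conversion, for reference. The helper name toUtf8BinaryString is hypothetical and not part of Rusha; it only wraps the two built-in calls mentioned above.

```javascript
// Hypothetical helper, not part of Rusha.
function toUtf8BinaryString(str) {
  // encodeURIComponent percent-encodes the UTF-8 bytes of str
  // ("\u00e4" becomes "%C3%A4"); unescape then turns every %XX
  // into a single character with that char code.
  return unescape(encodeURIComponent(str));
}

var bin = toUtf8BinaryString("\u00e4");
console.log(bin.length);                     // 2
console.log(bin.charCodeAt(0).toString(16)); // c3
console.log(bin.charCodeAt(1).toString(16)); // a4
```

Two caveats: unescape is a legacy function, and encodeURIComponent throws a URIError if the string contains a lone surrogate.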

stuartpb · 2014-09-14T14:27:04Z

Where exactly would I insert that? https://github.com/srijs/rusha/blob/master/rusha.js#L164 looks like a good candidate.

srijs · 2014-09-15T07:56:33Z

Hi.

Please modify rusha.sweet.js. A good candidate would be the rawDigest method. It could take an optional options parameter, where you can opt-in to the unescape(encodeURIComponent(str)) conversion.
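
The opt-in idea could look roughly like the sketch below. This is not Rusha's API: the wrapper digestString and the option name utf8 are made up for illustration, and hasher stands for any object with a digest(data) method, such as a Rusha instance.

```javascript
// Hypothetical wrapper; neither the function nor the `utf8` option
// exists in Rusha.
function digestString(hasher, str, options) {
  var input = (options && options.utf8)
    ? unescape(encodeURIComponent(str)) // opt in: hash the UTF-8 bytes
    : str;                              // default: treat str as a binary string
  return hasher.digest(input);
}

// Usage, assuming Rusha is loaded:
//   var r = new Rusha();
//   digestString(r, "\u00e4", { utf8: true });
```

Keeping the conversion opt-in leaves the behavior unchanged for callers that already pass binary strings.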

sergeevabc · 2014-10-14T01:23:31Z

var r = new Rusha(); alert(r.digest("любовь"));

af48c12732ffdbd4299b792c2b6da6f77a0898d7 expected (works with jsSHA, CryptoJS, JSHash)
09c65cdd36ba4e6d767cde9acc71dfa75380655c rusha :(

Could you be so kind as to fix the UTF-8 issue at last?

szydan · 2016-04-06T13:35:47Z

@sergeevabc in case you still need it, from the documentation (README):
"Create a hex digest from a binary String. A binary string is expected to only contain characters whose charCode < 256"

So the library will not work on arbitrary strings.
The workaround I found for your case is to first convert your string to an array of UTF-8 bytes and then pass that array to Rusha. See the code below:

function toUTF8Array(str) {
    var utf8 = [];
    for (var i=0; i < str.length; i++) {
        var charcode = str.charCodeAt(i);
        if (charcode < 0x80) utf8.push(charcode);
        else if (charcode < 0x800) {
            utf8.push(0xc0 | (charcode >> 6),
                      0x80 | (charcode & 0x3f));
        }
        else if (charcode < 0xd800 || charcode >= 0xe000) {
            utf8.push(0xe0 | (charcode >> 12),
                      0x80 | ((charcode>>6) & 0x3f),
                      0x80 | (charcode & 0x3f));
        }
        // surrogate pair
        else {
            i++;
            // UTF-16 encodes 0x10000-0x10FFFF by
            // subtracting 0x10000 and splitting the
            // 20 bits of 0x0-0xFFFFF into two halves
            charcode = 0x10000 + (((charcode & 0x3ff)<<10)
                      | (str.charCodeAt(i) & 0x3ff));
            utf8.push(0xf0 | (charcode >>18),
                      0x80 | ((charcode>>12) & 0x3f),
                      0x80 | ((charcode>>6) & 0x3f),
                      0x80 | (charcode & 0x3f));
        }
    }
    return utf8;
}

var r = new Rusha();
var s = "любовь";
var a = toUTF8Array(s);
console.log(r.digest(a)); // gives the expected SHA-1: af48c12732ffdbd4299b792c2b6da6f77a0898d7

sergeevabc · 2016-04-17T10:26:36Z

Thanks for your input, @szydan. At that time I chose Fast SHA256.

srijs · 2016-06-21T14:58:43Z

Closing this as wontfix -- Rusha is not meant to be used directly on encoded strings with code-points above 255. If you want to hash strings like these, please be sure to convert them into the desired binary encoding beforehand.
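
For callers who want to catch this early, a small guard can reject strings that are not binary strings before they reach the hasher. This is a sketch only; assertBinaryString is a hypothetical helper and not part of Rusha.

```javascript
// Hypothetical guard, not part of Rusha.
// Throws if any char code is above 255, otherwise returns the string.
function assertBinaryString(str) {
  for (var i = 0; i < str.length; i++) {
    if (str.charCodeAt(i) > 255) {
      throw new RangeError(
        "char code above 255 at index " + i + "; encode the string first"
      );
    }
  }
  return str;
}

// Usage, assuming Rusha is loaded:
//   var r = new Rusha();
//   r.digest(assertBinaryString(input));
```

Note that the guard only catches char codes above 255. A character such as "\u00e4" (char code 228) passes, but it is then treated as the single byte 0xE4 rather than as its two UTF-8 bytes, so it still needs encoding if the other side hashes UTF-8.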

srijs closed this as completed Jun 21, 2016
