Challenge: store 4000 characters in 4000 bytes #5

xem · 2013-12-20T09:46:51Z

Yesterday, @p01 noticed a (huge) problem about JSciissors:

the compression reduces the number of characters but increase the final size in bytes.

Of course, this side effect was not intentional.

In fact, we already made compressors based on the number of characters, which is way more efficient than JSciissors:
https://gist.github.com/xem/7086007 => 2 chars in one
https://gist.github.com/xem/7584765 => 2.85 chars in one

But the goal of JSciissors wasn't to do that. It was only to strip the eightth bit of ascii chars (which is always "0"), concatenate their binary code, cut it in chunks of 8 bytes, and make a 8-bit character from it.

Turns out, this failed, because if we save the compressed code in a UTF-8 file, all non-ascii characters are stored on 2 bytes (or more).

But all hope is not lost, because the encoded file can be saved as ANSI a.k.a latin-1 a.k.a iso-8859-1, and if we do that, each character is really stored on 8 bits.

Unfortunately, the decoding is broken because the 8-bit characters order is different between UTF-8 and in ASCII.

So the challenge is to generate an encrypted string which, when it's read in ANSI, has the right characters, (and the right binary code), in order to be correctly decoded.

Challenge accepted.

cc. @aemkei @subzey

xem · 2013-12-21T08:32:21Z

So, I kept searching what was going on...

If you encode alert(123), the encoded string is 'Ã³/.��²f¤S and the decoded string is still alert(123).

The problem is that if you save this in UTF-8, this string takes more bytes than its number of characters.

So the idea is to save it in a iso-8859-1 (latin-1) file, since both charsets have the same first 256 characters, and latin1 stores each of them on 8 bits.

Unfortunately, when we decode 'Ã³/.��²f¤S in latin-1, the decoded string is alers�123).

So close! But what happened?

In the binary code, the 5th block of 8 bits has changed during the conversion:

encoded: 11000011 10110011 00101111 00101110 10001010 00011000 10110010 01100110 10100101...
decoded: 11000011 10110011 00101111 00101110 01100000 00011000 10110010 01100110 10100101...

The big question is: why did this character 0x8A become a character 0x70 ?

...

xem · 2013-12-30T19:24:02Z

Maybe @BananaScript can help with that, he had a similar issue... :)

michaelbeckius · 2013-12-31T12:34:57Z

I don't have a solution unfortunately. For high ascii characters, charCodeAt() returns the unicode value for the character, not the latin1 value. This totally messes up the bits and is why decompression fails. For example, for the character with ascii code 0x8a, charCodeAt will return the value 352.

michaelbeckius · 2013-12-31T13:17:39Z

If anyone wants to try their luck: http://bay16.com/charcodeat/ view source for more.

xem · 2013-12-31T13:32:14Z

@BananaScript, thanks for your answer! I understand that the charCodeAt function can only output unicode.
Still, I was hoping to fix that, here are some ideas I have that could maybe work, but I need help to implement them. Can you please tell me what you think about them?

Solution 1: fix the encoder to handle latin-1

In the encoder, we have the binary code 0x8a to store in a character.
Unfortunately, if that character is written in a string in a latin-1 file, it is not decoded as 0x8a.
So the idea would be to output another character that will really be decoded as 0x8a.
The big challenge is to find that other character, and to generalize this fix for the 256 characters that can possibly exist in the string.

Solution 2: fix the decoder

The decoder gets the character 0x70 instead of 0x8A
So the idea would be to find all the 0x70 chars in the encoded string, and replace them with 0x8A (in binary) during the decoding. (and generalize that for all the buggy characters)

Solution 3: Use directly binary instead of a string

Instead of a string, that suffers from those charsets limitations, we could store the encoded JS in a binary file.
The decoder would just have to read this binary file with AJAX, and decode it to retrieve the original JS code.
I actually made a demo for that and it works. This 10.6kb HTML file reads its own binary code with AJAX, decodes it, and executes the decoded JS. The result is: console.log("...(a 12000 chars string)...").
That's a compression ratio near 10%!
Here it is: http://xem.github.io/JSciissors/test.html

xem · 2013-12-31T14:16:49Z

@BananaScript here's my implementation of the solution 2 for your test file:

var fixes = {352: 138}
for(i in a){
    b = a.charCodeAt(i);
    if(fixes[b]){
      b = fixes[b];
    }
    debug.value += "\r\n" + b;
}

The ideal would be to do implement the solution 1 (if it's possible), to keep the decoder as short as possible.

xem · 2013-12-31T14:39:26Z

Update: I made a full decoder corresponding to the solution 2. It fixes all the differences between latin-1 and unicode.

The source contains all latin1 chars from 0 to 255 and the decoder.
The results are in the console:

http://xem.github.io/JSciissors/fix.html

this solution works, but it adds about 250 chars to the decoder present on the JSciissors homepage...

michaelbeckius · 2013-12-31T16:10:41Z

How about this, just a little bit shorter (32 bytes or so):

http://bay16.com/charcodeat/fix2.html

I skipped the 0x00 char. It couldn't possibly be an issue?

xem · 2013-12-31T16:42:43Z

Nice job, but please take a look at my new version:

http://xem.github.io/JSciissors/fix.html

I managed to fit the unicode-to-latin1 thing in just 156 chars instead of 271. (see line 20)

Now THAT's code golfing :D

Feel free to do better though :)

PS: the char 0x00 is also taken care of. That's mandatory, because JSciissors can output it just as any other char.

michaelbeckius · 2013-12-31T17:30:05Z

Awesome!

xem · 2013-12-31T18:08:21Z

And 8 more bytes removed:
http://xem.github.io/JSciissors/fix.html

michaelbeckius · 2013-12-31T18:14:21Z

Impressive.
By the way, that issue that I had with Opera seems to be gone. Maybe it disappeared when they switched to webkit. At least one less problem to worry about. :)

xem · 2013-12-31T19:03:02Z

So... after a lot of maths, my final word is: 139 chars. Less than a tweet!

http://xem.github.io/JSciissors/fix.html

@p01 @aemkei @subzey @rlauck @bburky, I invite you to golf it even more if you can :)

For a reminder, the goal is to write a function that returns the charCode of a character in a latin-1 string.

Here are the differences with the regular charCodeAt function:

65533 => 0
8364 => 128
8218 => 130
402 => 131
8222 => 132
8230 => 133
8224 => 134
8225 => 135
710 => 136
8240 => 137
352 => 138
8249 => 139
338 => 140
381 => 142
8216 => 145
8217 => 146
8220 => 147
8221 => 148
8226 => 149
8211 => 150
8212 => 151
732 => 152
8482 => 153
353 => 154
8250 => 155
339 => 156
382 => 158
376 => 159

xem · 2013-12-31T21:19:37Z

@BananaScript was the bug you're talking about present on opera 12? I think it's still important to support that version.

michaelbeckius · 2014-01-01T10:33:25Z

For several years up until november 2013, there has been a test for this running for every visitor at http://bananascript.com and reporting back the results to the server. I checked the logs and the last version of Opera that had this problem was version 12.00. From version 12.01 and later, there has not been any more problems.

The problem was that Opera could not see a difference between the characters with ascii codes 129,141,143,144 and 157. They were all seen as the same charatcer by Opera with the same ascii code.

xem · 2014-01-01T11:12:53Z

Oh, cool! if everything works on opera 12.X (X>0) then it's fine for me. :)

xem · 2014-01-01T12:09:01Z

The issue is now fixed (but another one appeared...)

http://xem.github.io/JSciissors/

If you encode "alert(1)", and save the result in a latin-1 JS file, it is executed correctly!

If you encode jquery.min.js, it's screwed up... to be continued. #6

xem closed this as completed Jan 1, 2014

xem mentioned this issue May 26, 2014

use unicode #7

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Challenge: store 4000 characters in 4000 bytes #5

Challenge: store 4000 characters in 4000 bytes #5

xem commented Dec 20, 2013

xem commented Dec 21, 2013

xem commented Dec 30, 2013

michaelbeckius commented Dec 31, 2013

michaelbeckius commented Dec 31, 2013

xem commented Dec 31, 2013

xem commented Dec 31, 2013

xem commented Dec 31, 2013

michaelbeckius commented Dec 31, 2013

xem commented Dec 31, 2013

michaelbeckius commented Dec 31, 2013

xem commented Dec 31, 2013

michaelbeckius commented Dec 31, 2013

xem commented Dec 31, 2013

xem commented Dec 31, 2013

michaelbeckius commented Jan 1, 2014

xem commented Jan 1, 2014

xem commented Jan 1, 2014

Challenge: store 4000 characters in 4000 bytes #5

Challenge: store 4000 characters in 4000 bytes #5

Comments

xem commented Dec 20, 2013

xem commented Dec 21, 2013

xem commented Dec 30, 2013

michaelbeckius commented Dec 31, 2013

michaelbeckius commented Dec 31, 2013

xem commented Dec 31, 2013

Solution 1: fix the encoder to handle latin-1

Solution 2: fix the decoder

Solution 3: Use directly binary instead of a string

xem commented Dec 31, 2013

xem commented Dec 31, 2013

michaelbeckius commented Dec 31, 2013

xem commented Dec 31, 2013

michaelbeckius commented Dec 31, 2013

xem commented Dec 31, 2013

michaelbeckius commented Dec 31, 2013

xem commented Dec 31, 2013

xem commented Dec 31, 2013

michaelbeckius commented Jan 1, 2014

xem commented Jan 1, 2014

xem commented Jan 1, 2014