Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Challenge: store 4000 characters in 4000 bytes #5

Closed
xem opened this issue Dec 20, 2013 · 17 comments
Closed

Challenge: store 4000 characters in 4000 bytes #5

xem opened this issue Dec 20, 2013 · 17 comments

Comments

@xem
Copy link
Owner

xem commented Dec 20, 2013

Yesterday, @p01 noticed a (huge) problem about JSciissors:

the compression reduces the number of characters but increase the final size in bytes.

Of course, this side effect was not intentional.

In fact, we already made compressors based on the number of characters, which is way more efficient than JSciissors:
https://gist.github.com/xem/7086007 => 2 chars in one
https://gist.github.com/xem/7584765 => 2.85 chars in one

But the goal of JSciissors wasn't to do that. It was only to strip the eightth bit of ascii chars (which is always "0"), concatenate their binary code, cut it in chunks of 8 bytes, and make a 8-bit character from it.

Turns out, this failed, because if we save the compressed code in a UTF-8 file, all non-ascii characters are stored on 2 bytes (or more).

But all hope is not lost, because the encoded file can be saved as ANSI a.k.a latin-1 a.k.a iso-8859-1, and if we do that, each character is really stored on 8 bits.

Unfortunately, the decoding is broken because the 8-bit characters order is different between UTF-8 and in ASCII.

So the challenge is to generate an encrypted string which, when it's read in ANSI, has the right characters, (and the right binary code), in order to be correctly decoded.

Challenge accepted.

cc. @aemkei @subzey

@xem
Copy link
Owner Author

xem commented Dec 21, 2013

So, I kept searching what was going on...

If you encode alert(123), the encoded string is 'ó/.��²f¤S and the decoded string is still alert(123).

The problem is that if you save this in UTF-8, this string takes more bytes than its number of characters.

So the idea is to save it in a iso-8859-1 (latin-1) file, since both charsets have the same first 256 characters, and latin1 stores each of them on 8 bits.

Unfortunately, when we decode 'ó/.��²f¤S in latin-1, the decoded string is alers�123).

So close! But what happened?

In the binary code, the 5th block of 8 bits has changed during the conversion:

encoded: 11000011 10110011 00101111 00101110 10001010 00011000 10110010 01100110 10100101...
decoded: 11000011 10110011 00101111 00101110 01100000 00011000 10110010 01100110 10100101...

The big question is: why did this character 0x8A become a character 0x70 ?

...

@xem
Copy link
Owner Author

xem commented Dec 30, 2013

Maybe @BananaScript can help with that, he had a similar issue... :)

@michaelbeckius
Copy link

I don't have a solution unfortunately. For high ascii characters, charCodeAt() returns the unicode value for the character, not the latin1 value. This totally messes up the bits and is why decompression fails. For example, for the character with ascii code 0x8a, charCodeAt will return the value 352.

@michaelbeckius
Copy link

If anyone wants to try their luck: http://bay16.com/charcodeat/ view source for more.

@xem
Copy link
Owner Author

xem commented Dec 31, 2013

@BananaScript, thanks for your answer! I understand that the charCodeAt function can only output unicode.
Still, I was hoping to fix that, here are some ideas I have that could maybe work, but I need help to implement them. Can you please tell me what you think about them?

Solution 1: fix the encoder to handle latin-1

In the encoder, we have the binary code 0x8a to store in a character.
Unfortunately, if that character is written in a string in a latin-1 file, it is not decoded as 0x8a.
So the idea would be to output another character that will really be decoded as 0x8a.
The big challenge is to find that other character, and to generalize this fix for the 256 characters that can possibly exist in the string.

Solution 2: fix the decoder

The decoder gets the character 0x70 instead of 0x8A
So the idea would be to find all the 0x70 chars in the encoded string, and replace them with 0x8A (in binary) during the decoding. (and generalize that for all the buggy characters)

Solution 3: Use directly binary instead of a string

Instead of a string, that suffers from those charsets limitations, we could store the encoded JS in a binary file.
The decoder would just have to read this binary file with AJAX, and decode it to retrieve the original JS code.
I actually made a demo for that and it works. This 10.6kb HTML file reads its own binary code with AJAX, decodes it, and executes the decoded JS. The result is: console.log("...(a 12000 chars string)...").
That's a compression ratio near 10%!
Here it is: http://xem.github.io/JSciissors/test.html

@xem
Copy link
Owner Author

xem commented Dec 31, 2013

@BananaScript here's my implementation of the solution 2 for your test file:

var fixes = {352: 138}
for(i in a){
    b = a.charCodeAt(i);
    if(fixes[b]){
      b = fixes[b];
    }
    debug.value += "\r\n" + b;
}

The ideal would be to do implement the solution 1 (if it's possible), to keep the decoder as short as possible.

@xem
Copy link
Owner Author

xem commented Dec 31, 2013

Update: I made a full decoder corresponding to the solution 2. It fixes all the differences between latin-1 and unicode.

The source contains all latin1 chars from 0 to 255 and the decoder.
The results are in the console:

http://xem.github.io/JSciissors/fix.html

this solution works, but it adds about 250 chars to the decoder present on the JSciissors homepage...

@michaelbeckius
Copy link

How about this, just a little bit shorter (32 bytes or so):

http://bay16.com/charcodeat/fix2.html

I skipped the 0x00 char. It couldn't possibly be an issue?

@xem
Copy link
Owner Author

xem commented Dec 31, 2013

Nice job, but please take a look at my new version:

http://xem.github.io/JSciissors/fix.html

I managed to fit the unicode-to-latin1 thing in just 156 chars instead of 271. (see line 20)

Now THAT's code golfing :D

Feel free to do better though :)

PS: the char 0x00 is also taken care of. That's mandatory, because JSciissors can output it just as any other char.

@michaelbeckius
Copy link

Awesome!

@xem
Copy link
Owner Author

xem commented Dec 31, 2013

And 8 more bytes removed:
http://xem.github.io/JSciissors/fix.html

@michaelbeckius
Copy link

Impressive.
By the way, that issue that I had with Opera seems to be gone. Maybe it disappeared when they switched to webkit. At least one less problem to worry about. :)

@xem
Copy link
Owner Author

xem commented Dec 31, 2013

So... after a lot of maths, my final word is: 139 chars. Less than a tweet!

http://xem.github.io/JSciissors/fix.html

@p01 @aemkei @subzey @rlauck @bburky, I invite you to golf it even more if you can :)

For a reminder, the goal is to write a function that returns the charCode of a character in a latin-1 string.

Here are the differences with the regular charCodeAt function:

65533 => 0
8364 => 128
8218 => 130
402 => 131
8222 => 132
8230 => 133
8224 => 134
8225 => 135
710 => 136
8240 => 137
352 => 138
8249 => 139
338 => 140
381 => 142
8216 => 145
8217 => 146
8220 => 147
8221 => 148
8226 => 149
8211 => 150
8212 => 151
732 => 152
8482 => 153
353 => 154
8250 => 155
339 => 156
382 => 158
376 => 159

@xem
Copy link
Owner Author

xem commented Dec 31, 2013

@BananaScript was the bug you're talking about present on opera 12? I think it's still important to support that version.

@michaelbeckius
Copy link

For several years up until november 2013, there has been a test for this running for every visitor at http://bananascript.com and reporting back the results to the server. I checked the logs and the last version of Opera that had this problem was version 12.00. From version 12.01 and later, there has not been any more problems.

The problem was that Opera could not see a difference between the characters with ascii codes 129,141,143,144 and 157. They were all seen as the same charatcer by Opera with the same ascii code.

@xem
Copy link
Owner Author

xem commented Jan 1, 2014

Oh, cool! if everything works on opera 12.X (X>0) then it's fine for me. :)

@xem
Copy link
Owner Author

xem commented Jan 1, 2014

The issue is now fixed (but another one appeared...)

http://xem.github.io/JSciissors/

If you encode "alert(1)", and save the result in a latin-1 JS file, it is executed correctly!

If you encode jquery.min.js, it's screwed up... to be continued. #6

@xem xem closed this as completed Jan 1, 2014
@xem xem mentioned this issue May 26, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants