Challenge: store 4000 characters in 4000 bytes #5
Comments
So, I kept searching what was going on... The problem is that if you save the encoded string in UTF-8, it takes more bytes than its number of characters. So the idea is to save it in an ISO-8859-1 (latin-1) file, since both charsets share the same first 256 characters, and latin-1 stores each of them on 8 bits. Unfortunately, when we decode it... so close! But what happened? In the binary code, the 5th block of 8 bits changed during the conversion:
The big question is: why did this character 0x8A become a character 0x70? ...
Maybe @BananaScript can help with that, he had a similar issue... :)
I don't have a solution, unfortunately. For high-ASCII characters, charCodeAt() returns the Unicode value for the character, not the latin-1 value. This totally messes up the bits and is why decompression fails. For example, for the character with ASCII code 0x8A, charCodeAt returns the value 352.
If anyone wants to try their luck: http://bay16.com/charcodeat/ — view source for more.
@BananaScript, thanks for your answer! I understand that the charCodeAt function can only output Unicode. Solution 1: fix the encoder to handle latin-1. In the encoder, we have the binary code 0x8A to store in a character. Solution 2: fix the decoder. The decoder gets the character 0x70 instead of 0x8A. Solution 3: use binary directly instead of a string. Instead of a string, which suffers from those charset limitations, we could store the encoded JS in a binary file.
@BananaScript here's my implementation of solution 2 for your test file:
The ideal would be to implement solution 1 (if it's possible), to keep the decoder as short as possible.
Update: I made a full decoder corresponding to solution 2. It fixes all the differences between latin-1 and Unicode. The source contains all latin-1 chars from 0 to 255 and the decoder. http://xem.github.io/JSciissors/fix.html This solution works, but it adds about 250 chars to the decoder present on the JSciissors homepage...
How about this, just a little bit shorter (32 bytes or so): http://bay16.com/charcodeat/fix2.html I skipped the 0x00 char — it couldn't possibly be an issue, could it?
Nice job, but please take a look at my new version: http://xem.github.io/JSciissors/fix.html I managed to fit the unicode-to-latin1 thing in just 156 chars instead of 271. (see line 20) Now THAT's code golfing :D Feel free to do better though :) PS: the char 0x00 is also taken care of. That's mandatory, because JSciissors can output it just as any other char. |
Awesome! |
And 8 more bytes removed: |
Impressive. |
So... after a lot of maths, my final word is: 139 chars. Less than a tweet! http://xem.github.io/JSciissors/fix.html @p01 @aemkei @subzey @rlauck @bburky, I invite you to golf it even more if you can :) As a reminder, the goal is to write a function that returns the charCode of a character in a latin-1 string. Here are the differences with the regular charCodeAt function: 65533 => 0
@BananaScript was the bug you're talking about present in Opera 12? I think it's still important to support that version.
For several years up until November 2013, there was a test for this running for every visitor at http://bananascript.com and reporting the results back to the server. I checked the logs, and the last version of Opera that had this problem was version 12.00. From version 12.01 onwards, there have been no more problems. The problem was that Opera could not see a difference between the characters with ASCII codes 129, 141, 143, 144 and 157. They were all seen by Opera as the same character, with the same ASCII code.
Oh, cool! If everything works on Opera 12.X (X>0), then it's fine for me. :)
The issue is now fixed (but another one appeared...) http://xem.github.io/JSciissors/ If you encode "alert(1)", and save the result in a latin-1 JS file, it is executed correctly! If you encode jquery.min.js, it's screwed up... to be continued. #6 |
Yesterday, @p01 noticed a (huge) problem with JSciissors:
the compression reduces the number of characters but increases the final size in bytes.
Of course, this side effect was not intentional.
In fact, we have already made compressors based on the number of characters, which are way more efficient than JSciissors:
https://gist.github.com/xem/7086007 => 2 chars in one
https://gist.github.com/xem/7584765 => 2.85 chars in one
But the goal of JSciissors wasn't to do that. It was only to strip the eighth bit of ASCII chars (which is always "0"), concatenate their binary codes, cut the result into chunks of 8 bits, and make an 8-bit character from each chunk.
Turns out, this failed, because if we save the compressed code in a UTF-8 file, all non-ascii characters are stored on 2 bytes (or more).
But all hope is not lost, because the encoded file can be saved as ANSI a.k.a latin-1 a.k.a iso-8859-1, and if we do that, each character is really stored on 8 bits.
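The byte-size difference is easy to check in Node (assuming Node's Buffer; "é" is U+00E9, which lies inside the latin-1 range):

```javascript
// The same character costs 2 bytes in UTF-8 but only 1 in latin-1.
console.log(Buffer.byteLength("\u00E9", "utf8"));   // 2
console.log(Buffer.byteLength("\u00E9", "latin1")); // 1
```

So every packed character that lands above 0x7F doubles in size when the file is saved as UTF-8, which is what wiped out the 7-bit savings.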
Unfortunately, the decoding is broken, because the 8-bit characters map to different code points in ANSI and in Unicode.
So the challenge is to generate an encoded string which, when read as ANSI, has the right characters (and the right binary code), in order to be decoded correctly.
Challenge accepted.
cc. @aemkei @subzey