base131072 is an implementation of this encoding... however, it can't be used yet because there aren't enough safe Unicode characters.
Efficiency ratings are averaged over long inputs. Higher is better.
| Encoding | | Efficiency | | | Bytes per Tweet * |
| ASCII‑constrained | Unary / Base1 | 0% | 0% | 0% | 1 |
* New-style "long" Tweets: up to 280 Unicode characters, give or take Twitter's complex "weighting" calculation.
† Base85 is listed for completeness, but all variants use characters which are considered hazardous for general use in text: escape characters, brackets, punctuation, etc.
‡ Base131072 is a work in progress, not yet ready for general use.
For example, using Base64, up to 105 bytes of binary data can fit in a Tweet. With Base131072, 297 bytes are possible.
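The two figures above can be sanity-checked with a few lines of arithmetic. This is only a sketch: the helper names are made up for illustration, and the 140-character budget is an inference from the numbers (both 105 and 297 come out as the largest inputs that fit in 140 characters), not something the text states.

```python
def base64_chars(n_bytes: int) -> int:
    """Length of padded Base64 output: 4 characters per 3 input bytes."""
    return 4 * ((n_bytes + 2) // 3)


def base131072_chars(n_bytes: int) -> int:
    """Characters needed when each character carries up to 17 bits."""
    return (8 * n_bytes + 16) // 17


BUDGET = 140  # assumed character budget, inferred from the quoted figures

print(base64_chars(105), base64_chars(106))          # 140 144
print(base131072_chars(297), base131072_chars(298))  # 140 141
```

Under that assumed budget, one more byte pushes either encoding over the limit.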
How does it work?
Base131072 is a 17-bit encoding. We take the input binary data as a sequence of 8-bit numbers, compact it into a sequence of bits, then dice the bits up again to make a sequence of 17-bit numbers. We then encode each of these 2^17 = 131,072 possible numbers as a different Unicode code point.
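The repacking step can be sketched as follows. This is an illustration, not the module's actual code: `to_17_bit_chunks` is a hypothetical name, and bits are kept as strings of "0" and "1" for readability rather than speed.

```python
def to_17_bit_chunks(data: bytes) -> list:
    """Flatten bytes into one bit string, then slice it into 17-bit pieces.

    The last piece may be shorter than 17 bits.
    """
    bits = "".join(f"{byte:08b}" for byte in data)
    return [bits[i:i + 17] for i in range(0, len(bits), 17)]


# 3 bytes = 24 bits: one full 17-bit chunk plus a 7-bit leftover.
print(to_17_bit_chunks(bytes([0xFF, 0x00, 0xAA])))
# ['11111111000000001', '0101010']
```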
Note that the final 17-bit number in the sequence is likely to be "incomplete", i.e. missing some of its bits. We need to signal this fact in the output string somehow. Here's how we handle those cases.
Final 17-bit number has 1 to 7 missing bits
In the following cases:
    bbbbbbbbcccccccc_ // 1 missing bit
    bbbbbbbcccccccc__ // 2 missing bits
    bbbbbbcccccccc___ // 3 missing bits
    bbbbbcccccccc____ // 4 missing bits (note: this is how a Tweet containing 297 bytes of data will end)
    bbbbcccccccc_____ // 5 missing bits
    bbbcccccccc______ // 6 missing bits
    bbcccccccc_______ // 7 missing bits
we pad the incomplete 17-bit number out to 17 bits using 1s:
    bbbbbbbbcccccccc1
    bbbbbbbcccccccc11
    bbbbbbcccccccc111
    bbbbbcccccccc1111
    bbbbcccccccc11111
    bbbcccccccc111111
    bbcccccccc1111111
and then encode as normal using our 2^17-character repertoire.
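A minimal sketch of this padding rule, assuming the incomplete number is held as a bit string (`pad_to_17` is a hypothetical helper, not part of any published API):

```python
def pad_to_17(chunk: str) -> int:
    """Pad a final chunk that is 1 to 7 bits short with 1s; return its value."""
    missing = 17 - len(chunk)
    if not 1 <= missing <= 7:
        raise ValueError("this rule only covers 1 to 7 missing bits")
    return int(chunk + "1" * missing, 2)


# A 13-bit tail (4 missing bits) gains four trailing 1s.
print(pad_to_17("0000000000000"))  # 15, i.e. 0b00000000000001111
```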
Final 17-bit number has 8 to 15 missing bits
In the following cases:
    bcccccccc________ // 8 missing bits
    cccccccc_________ // 9 missing bits
    ccccccc__________ // 10 missing bits
    cccccc___________ // 11 missing bits
    ccccc____________ // 12 missing bits (note: this is how a Tweet containing 296 bytes of data will end)
    cccc_____________ // 13 missing bits
    ccc______________ // 14 missing bits
    cc_______________ // 15 missing bits
we encode them differently. We'll pad the incomplete number out to only 9 bits using 1s:
    bcccccccc
    cccccccc1
    ccccccc11
    cccccc111
    ccccc1111
    cccc11111
    ccc111111
    cc1111111
and then encode them using a completely different, 2^9-character repertoire. On decoding, we will treat that character differently, returning 9 bits, rather than 17 from characters in the main repertoire.
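The notes about how 297-byte and 296-byte inputs end can be checked with a little modular arithmetic. The sketch below uses hypothetical helper names; it reports how many bits the final 17-bit number is missing for a given input length, and how many bits the final character would therefore carry under the rules described in this section.

```python
def missing_bits(n_bytes: int) -> int:
    """Bits missing from the final 17-bit number (0 means it is complete)."""
    return (-8 * n_bytes) % 17


def final_character_bits(n_bytes: int) -> int:
    """Bits carried by the last character: 17, 9, or 1.

    Assumes n_bytes > 0; an empty input produces no characters at all.
    """
    missing = missing_bits(n_bytes)
    if missing <= 7:
        return 17
    if missing <= 15:
        return 9
    return 1


print(missing_bits(297), final_character_bits(297))  # 4 17
print(missing_bits(296), final_character_bits(296))  # 12 9
```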
Final 17-bit number has 16 missing bits
In this final case:
c________________ // 16 missing bits
we simply take this as a 1-bit number:
    c
and encode it using a third, 2^1-character repertoire. Again, on decoding, this is treated specially, and only 1 bit is added to the stream, rather than 9 or 17 as for the other characters.
In other words, Base131072 is a slight misnomer. It uses not 131,072 but 2^17 + 2^9 + 2^1 = 131,586 characters for its three repertoires. Of course, Base64 uses a 65th character for its padding too.
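Putting the three cases together, the encoder can be sketched symbolically. Because the real code point tables do not exist yet, this illustration emits `(width, value)` pairs instead of characters, where `width` (17, 9, or 1) stands in for the repertoire; the function name and this representation are assumptions made for the example.

```python
def encode_symbolic(data: bytes) -> list:
    """Return one (width, value) pair per output character."""
    bits = "".join(f"{byte:08b}" for byte in data)
    symbols = []
    for i in range(0, len(bits), 17):
        chunk = bits[i:i + 17]
        missing = 17 - len(chunk)
        if missing <= 7:       # complete, or 1 to 7 bits short
            width = 17
        elif missing <= 15:    # 8 to 15 bits short
            width = 9
        else:                  # 16 bits short: a lone bit
            width = 1
        # Pad on the right with 1s up to the chosen width.
        symbols.append((width, int(chunk.ljust(width, "1"), 2)))
    return symbols


REPERTOIRE_SIZES = {17: 2**17, 9: 2**9, 1: 2**1}

print(encode_symbolic(bytes([0xFF, 0x00, 0xAA])))  # [(17, 130561), (9, 171)]
print(sum(REPERTOIRE_SIZES.values()))              # 131586
```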
On decoding, we get a series of 8-bit values, the last of which might be incomplete, like so:
    1_______ // 7 missing bits
    11______ // 6 missing bits
    111_____ // 5 missing bits
    1111____ // 4 missing bits
    11111___ // 3 missing bits
    111111__ // 2 missing bits
    1111111_ // 1 missing bit
These are the padding 1s added at encoding time. We can check that the leftover bits are all 1s and then discard this final value.
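Decoding can be sketched in the same symbolic style. The example assumes each input character has already been mapped to a `(width, value)` pair, with `width` being 17, 9, or 1 depending on its repertoire; that representation and the function name are hypothetical, and rejecting a bad tail with `ValueError` is a design choice of this sketch.

```python
def decode_symbolic(symbols: list) -> bytes:
    """Rebuild the bytes and drop the trailing padding 1s."""
    bits = "".join(f"{value:0{width}b}" for width, value in symbols)
    usable = len(bits) - len(bits) % 8   # length of the whole-byte prefix
    tail = bits[usable:]                 # incomplete final 8-bit value
    if "0" in tail:
        raise ValueError("final partial byte is not all padding 1s")
    return bytes(int(bits[i:i + 8], 2) for i in range(0, usable, 8))


# 17 + 9 = 26 bits: three whole bytes and a 2-bit tail "11" that is discarded.
print(decode_symbolic([(17, 130561), (9, 171)]))  # b'\xff\x00\xaa'
```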
Is this ready yet?
No. We need 131,586 "safe" characters for this encoding, but as of Unicode 9.0 only 108,397 exist. However, future versions of Unicode may add enough safe characters for this to become possible. In any case, the groundwork can certainly be laid.