Binary-to-text encoding optimised for Twitter & UTF-32
Latest commit 8ac635b Jun 3, 2018


base131072

Base131072 is a binary encoding optimised for UTF-32-encoded text and Twitter; it is the intended successor to Base65536. This JavaScript module, base131072, is an implementation of this encoding. However, it can't be used yet because there aren't enough safe Unicode characters.

Efficiency ratings are averaged over long inputs. Higher is better.

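| | Encoding | Efficiency (UTF‑8) | Efficiency (UTF‑16) | Efficiency (UTF‑32) | Bytes per Tweet * |
|---|---|---|---|---|---|
| ASCII‑constrained | Unary / Base1 | 0% | 0% | 0% | 1 |
| ASCII‑constrained | Binary | 13% | 6% | 3% | 35 |
| ASCII‑constrained | Hexadecimal | 50% | 25% | 13% | 140 |
| ASCII‑constrained | Base64 | 75% | 38% | 19% | 210 |
| ASCII‑constrained | Base85 † | 80% | 40% | 20% | 224 |
| BMP‑constrained | HexagramEncode | 25% | 38% | 19% | 105 |
| BMP‑constrained | BrailleEncode | 33% | 50% | 25% | 140 |
| BMP‑constrained | Base2048 | 56% | 69% | 34% | 385 |
| BMP‑constrained | Base32768 | 63% | 94% | 47% | 263 |
| Full Unicode | Ecoji | 31% | 31% | 31% | 175 |
| Full Unicode | Base65536 | 56% | 64% | 50% | 280 |
| Full Unicode | Base131072 ‡ | 53%+ | 53%+ | 53% | 297 |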

* New-style "long" Tweets, up to 280 Unicode characters, give or take Twitter's complex "weighting" calculation.
† Base85 is listed for completeness, but all variants use characters which are considered hazardous for general use in text: escape characters, brackets, punctuation etc.
‡ Base131072 is a work in progress, not yet ready for general use.

For example, using Base64, up to 210 bytes of binary data can fit in a Tweet. With Base131072, 297 bytes are possible.
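As a rough check on the table, the sketch below recomputes a few UTF-32 efficiency and bytes-per-Tweet figures from the number of bits each character carries. The `charsPerTweet` values are an assumption, not something stated above: we assume ASCII characters count once against the 280 limit, while the characters used by Base65536 and Base131072 count twice, leaving room for 140 of them. The helper names are ours.

```javascript
// Sketch only: reproduce some table entries from bits-per-character figures.
// ASSUMPTION: 280 "weight-1" characters or 140 "weight-2" characters per Tweet.
const encodings = [
  { name: 'Hexadecimal', bitsPerChar: 4, charsPerTweet: 280 },
  { name: 'Base64', bitsPerChar: 6, charsPerTweet: 280 },
  { name: 'Base65536', bitsPerChar: 16, charsPerTweet: 140 },
  { name: 'Base131072', bitsPerChar: 17, charsPerTweet: 140 }
]

// Share of each 32-bit UTF-32 code unit that carries payload, as a whole percentage.
const utf32Efficiency = bitsPerChar => Math.round(100 * bitsPerChar / 32)

// Whole bytes of payload that fit in one Tweet.
const bytesPerTweet = ({ bitsPerChar, charsPerTweet }) =>
  Math.floor(bitsPerChar * charsPerTweet / 8)

for (const encoding of encodings) {
  console.log(
    encoding.name,
    utf32Efficiency(encoding.bitsPerChar) + '%',
    bytesPerTweet(encoding)
  )
}
// Hexadecimal 13% 140
// Base64 19% 210
// Base65536 50% 280
// Base131072 53% 297
```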

How does it work?

Base131072 is a 17-bit encoding. We take the input binary data as a sequence of 8-bit numbers, compact it into a sequence of bits, then dice the bits up again to make a sequence of 17-bit numbers. We then encode each of these 2^17 = 131,072 possible numbers as a different Unicode code point.
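The repacking step might look something like the sketch below. This is an illustration rather than this module's actual code: `repack` is a hypothetical helper, and we assume bits are read most significant bit first. Mapping each 17-bit number to a code point is left out, since it depends on the repertoire.

```javascript
// Sketch only: repack 8-bit bytes into 17-bit numbers.
// `repack` is a hypothetical name; bit order (MSB first) is an assumption.
const BITS_PER_CHAR = 17

function repack (bytes) {
  const chunks = [] // complete 17-bit numbers
  let acc = 0 // bits collected so far for the current number
  let accBits = 0 // how many bits `acc` currently holds

  for (const byte of bytes) {
    // Feed the byte in one bit at a time, most significant bit first.
    for (let i = 7; i >= 0; i--) {
      acc = (acc << 1) | ((byte >> i) & 1)
      accBits++
      if (accBits === BITS_PER_CHAR) {
        chunks.push(acc)
        acc = 0
        accBits = 0
      }
    }
  }

  // Any bits left over form the incomplete final number.
  return { chunks, leftover: acc, leftoverBits: accBits }
}

// Three bytes are 24 bits: one full 17-bit number plus 7 leftover bits.
console.log(repack(Uint8Array.of(0xff, 0xff, 0x80)))
// { chunks: [ 131071 ], leftover: 0, leftoverBits: 7 }
```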

Padding

Note that the final 17-bit number in the sequence is likely to be "incomplete", i.e. missing some of its bits. We need to signal this fact in the output string somehow. Here's how we handle those cases.
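Before going through the cases, here is a small sketch of how many bits the final 17-bit number is missing for a given input length. `missingBits` is a hypothetical helper used only for illustration.

```javascript
// Sketch only: bits missing from the final 17-bit number for `byteLength` input bytes.
// Returns 0 when the input fills its last 17-bit number exactly.
function missingBits (byteLength) {
  const used = (byteLength * 8) % 17
  return used === 0 ? 0 : 17 - used
}

console.log(missingBits(297)) // 4
console.log(missingBits(296)) // 12
console.log(missingBits(17)) // 0
```

Because 8 and 17 share no common factor, every count from 1 to 16 missing bits can occur, which is why all of the cases below need handling.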

Final 17-bit number has 1 to 7 missing bits

In the following cases:

bbbbbbbbcccccccc_ // 1 missing bit
bbbbbbbcccccccc__ // 2 missing bits
bbbbbbcccccccc___ // 3 missing bits
bbbbbcccccccc____ // 4 missing bits (note: this is how a Tweet containing 297 bytes of data will end)
bbbbcccccccc_____ // 5 missing bits
bbbcccccccc______ // 6 missing bits
bbcccccccc_______ // 7 missing bits

we pad the incomplete 17-bit number out to 17 bits using 1s:

bbbbbbbbcccccccc1
bbbbbbbcccccccc11
bbbbbbcccccccc111
bbbbbcccccccc1111
bbbbcccccccc11111
bbbcccccccc111111
bbcccccccc1111111

and then encode as normal using our 2^17-character repertoire.

Final 17-bit number has 8 to 15 missing bits

In the following cases:

bcccccccc________ // 8 missing bits
cccccccc_________ // 9 missing bits
ccccccc__________ // 10 missing bits
cccccc___________ // 11 missing bits
ccccc____________ // 12 missing bits (note: this is how a Tweet containing 296 bytes of data will end)
cccc_____________ // 13 missing bits
ccc______________ // 14 missing bits
cc_______________ // 15 missing bits

we encode them differently. We'll pad the incomplete number out to only 9 bits using 1s:

bcccccccc
cccccccc1
ccccccc11
cccccc111
ccccc1111
cccc11111
ccc111111
cc1111111

and then encode them using a completely different, 2^9-character repertoire. On decoding, we will treat that character differently, yielding 9 bits rather than the 17 bits yielded by characters in the main repertoire.

Final 17-bit number has 16 missing bits

In this final case:

c________________ // 16 missing bits

we simply take this as a 1-bit number:

c

and encode it using a third, 2^1-character repertoire. Again, on decoding, this is treated specially, and only 1 bit is added to the stream, rather than 9 or 17 as for the other characters.

In other words, Base131072 is a slight misnomer. It uses not 131,072 but 2^17 + 2^9 + 2^1 = 131,586 characters for its three repertoires. Of course, Base64 uses a 65th character for its padding too.
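The three padding cases can be summarised in one small function. This is a sketch under our own naming, not the module's API: `padFinal` takes the leftover bits as a number together with their count, and reports which repertoire width (17, 9 or 1 bits) applies along with the padded value.

```javascript
// Sketch only: pick a repertoire and pad the incomplete final number with 1s.
// `padFinal` and the returned field names are hypothetical.
function padFinal (leftover, leftoverBits) {
  if (leftoverBits < 1 || leftoverBits > 16) {
    throw new RangeError('expected 1 to 16 leftover bits')
  }
  const missing = 17 - leftoverBits

  // 1 to 7 missing bits: pad to 17 bits, main repertoire.
  // 8 to 15 missing bits: pad to 9 bits, second repertoire.
  // 16 missing bits: keep the single bit, third repertoire.
  const width = missing <= 7 ? 17 : missing <= 15 ? 9 : 1
  const padBits = width - leftoverBits

  // Shift the real bits up and fill the vacated low bits with 1s.
  const value = (leftover << padBits) | ((1 << padBits) - 1)
  return { width, value }
}

console.log(padFinal(0b10110, 5)) // { width: 9, value: 367 }, i.e. 0b101101111
console.log(padFinal(0, 13)) // { width: 17, value: 15 }
console.log(padFinal(1, 1)) // { width: 1, value: 1 }
```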

Decoding

On decoding, we get a series of 8-bit values, the last of which might be incomplete, like so:

1_______ // 7 missing bits
11______ // 6 missing bits
111_____ // 5 missing bits
1111____ // 4 missing bits
11111___ // 3 missing bits
111111__ // 2 missing bits
1111111_ // 1 missing bit

These are the padding 1s added at encoding time. We can check that every bit of this final value is a 1 and then discard it.
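A minimal sketch of that final step, assuming the decoded characters have already been expanded into an array of 0s and 1s. `bitsToBytes` is a hypothetical helper, and throwing on malformed padding is our choice for the sketch rather than documented behaviour.

```javascript
// Sketch only: regroup decoded bits into bytes and drop the padding.
// `bits` is an array of 0s and 1s; `bitsToBytes` is a hypothetical name.
function bitsToBytes (bits) {
  const wholeBytes = Math.floor(bits.length / 8)
  const bytes = new Uint8Array(wholeBytes)

  for (let i = 0; i < wholeBytes; i++) {
    let byte = 0
    for (let j = 0; j < 8; j++) {
      byte = (byte << 1) | bits[i * 8 + j]
    }
    bytes[i] = byte
  }

  // Whatever remains (up to 7 bits) must be padding, i.e. all 1s.
  const tail = bits.slice(wholeBytes * 8)
  if (tail.some(bit => bit !== 1)) {
    throw new Error('invalid padding: trailing bits must all be 1')
  }
  return bytes
}

// One byte (0xA5) followed by three padding bits.
console.log(bitsToBytes([1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1])) // Uint8Array(1) [ 165 ]
```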

Is this ready yet?

No. We need 131,586 "safe" characters for this encoding, but as of Unicode 9.0 only 108,397 exist. However, future versions of Unicode may add enough safe characters for this to become possible. In any case, the groundwork can certainly be laid.
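For a sense of scale, the shortfall can be worked out from the figures above:

```javascript
// Characters needed across the three repertoires.
const needed = 2 ** 17 + 2 ** 9 + 2 ** 1
// Safe characters available as of Unicode 9.0, as quoted above.
const available = 108397

console.log(needed) // 131586
console.log(needed - available) // 23189
```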