Binary-to-text encoding highly optimised for UTF-16

base32768

Base32768 is a binary encoding optimised for UTF-16-encoded text. This JavaScript module, base32768, is the first implementation of this encoding.

The efficiency chart speaks for itself. Efficiency ratings are averaged over long inputs. Higher is better.

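| Constraint | Encoding | Efficiency (UTF‑8) | Efficiency (UTF‑16) | Efficiency (UTF‑32) | Bytes per Tweet * |
| --- | --- | --- | --- | --- | --- |
| ASCII‑constrained | Unary / Base1 | 0% | 0% | 0% | 1 |
| | Binary | 13% | 6% | 3% | 35 |
| | Hexadecimal | 50% | 25% | 13% | 140 |
| | Base64 | 75% | 38% | 19% | 210 |
| | Base85 † | 80% | 40% | 20% | 224 |
| BMP‑constrained | HexagramEncode | 25% | 38% | 19% | 105 |
| | BrailleEncode | 33% | 50% | 25% | 140 |
| | Base2048 | 56% | 69% | 34% | 385 |
| | Base32768 | 63% | 94% | 47% | 263 |
| Full Unicode | Ecoji | 31% | 31% | 31% | 175 |
| | Base65536 | 56% | 64% | 50% | 280 |
| | Base131072 ‡ | 53%+ | 53%+ | 53% | 297 |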

* New-style "long" Tweets, up to 280 Unicode characters, give or take Twitter's complex "weighting" calculation.
† Base85 is listed for completeness, but all variants use characters which are considered hazardous for general use in text: escape characters, brackets, punctuation, etc.
‡ Base131072 is a work in progress, not yet ready for general use.
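
The percentages in the chart can be reproduced with a little arithmetic: efficiency is the number of payload bits each output character carries, divided by the number of bits that character occupies in the chosen Unicode encoding. The following is a minimal sketch, not part of this module. The `efficiency` helper is hypothetical, and the figure of 3 UTF-8 bytes per Base32768 character is an assumption inferred from the 63% entry in the chart.

```javascript
// Hypothetical helper, not part of the base32768 API.
// bitsPerChar: payload bits carried by each output character.
// bytesPerChar: bytes that character occupies in the target Unicode encoding.
function efficiency(bitsPerChar, bytesPerChar) {
  return bitsPerChar / (8 * bytesPerChar);
}

// Base32768 carries 15 payload bits per character.
// Assumed sizes per character: 3 bytes in UTF-8, 2 in UTF-16, 4 in UTF-32.
const rates = {
  utf8: efficiency(15, 3),
  utf16: efficiency(15, 2),
  utf32: efficiency(15, 4),
};

console.log(rates); // { utf8: 0.625, utf16: 0.9375, utf32: 0.46875 }
```

Rounded to whole percentages, these values give the 63%, 94% and 47% shown for Base32768. The same helper with 6 bits and 1, 2 or 4 bytes per character gives the Base64 row.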

Base32768 uses only "safe" Unicode code points: no unassigned code points, no whitespace, no control characters, etc.

Installation

npm install base32768

Usage

const base32768 = require("base32768");

// The bytes of the ASCII string "hello world"
const uint8Array = new Uint8Array([104, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100]);

const string = base32768.encode(uint8Array.buffer);
console.log(string); // 6 code points, '媒腻㐤┖ꈳ埳'

const uint8Array2 = new Uint8Array(base32768.decode(string));
console.log(uint8Array2); // [104, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100]

API

base32768.encode(buf)

Encodes an ArrayBuffer and returns a Base32768 String, suitable for passing safely through almost any "Unicode-clean" text-handling API. This string contains no special characters and is immune to Unicode normalization. Give or take some padding characters, the output string has 1 character per 15 bits of input.

All characters are chosen from the Basic Multilingual Plane. This means that when encoded as UTF-16, all characters occupy 16 bits. Thus, there are 16 bits of output UTF-16 text per 15 bits of input, an efficiency of 93.75%.
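
As a rough illustration of the "1 character per 15 bits" rule, the sketch below estimates the output length for a given input size. `encodedLength` is a hypothetical helper, not part of this module, and it assumes that a final partial group of fewer than 15 bits still occupies exactly one output character, which is consistent with the 11-byte usage example producing 6 code points.

```javascript
// Hypothetical helper, not part of the base32768 API.
// Estimates how many characters encode() returns for `byteLength` input bytes,
// assuming any trailing partial 15-bit group takes one character.
function encodedLength(byteLength) {
  return Math.ceil((byteLength * 8) / 15);
}

console.log(encodedLength(11)); // 6
```

Under that assumption, the 11-byte input becomes 6 characters, or 12 bytes of UTF-16, so short inputs fall slightly below the 93.75% long-input figure.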

base32768.decode(str)

Decodes a Base32768 String and returns an ArrayBuffer containing the original binary data.
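
Because `encode` takes an ArrayBuffer and `decode` returns one, text has to be converted to bytes first. The following is one possible way to wrap the two calls for UTF-8 text using the standard `TextEncoder` and `TextDecoder` globals. `encodeText` and `decodeText` are hypothetical names, not part of this module, and the `codec` parameter is assumed to be any object exposing the `encode` and `decode` functions described above.

```javascript
// Hypothetical helpers, not part of the base32768 API.
// `codec` is assumed to provide encode(ArrayBuffer) -> String and
// decode(String) -> ArrayBuffer, for example require("base32768").

function encodeText(codec, text) {
  const bytes = new TextEncoder().encode(text); // UTF-8 bytes as a Uint8Array
  // Copy only the bytes covered by the view, in case the view does not
  // span its entire underlying buffer.
  const buf = bytes.buffer.slice(bytes.byteOffset, bytes.byteOffset + bytes.byteLength);
  return codec.encode(buf);
}

function decodeText(codec, str) {
  // Wrap the returned ArrayBuffer and interpret its bytes as UTF-8.
  return new TextDecoder().decode(new Uint8Array(codec.decode(str)));
}
```

With the module loaded as `base32768`, `decodeText(base32768, encodeText(base32768, "hello world"))` would be expected to return the original string, assuming the module behaves as documented here.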

License

MIT