Encodes arbitrary data into strings that display invisibly on devices and platforms that support Unicode.
Operates in base 4096, meaning 1.5 bytes per character on average. In practice the efficiency is marginally lower due to one or two padding characters required in specific cases, but this is negligible as the input size increases.
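As a rough illustration of the efficiency claim, the sketch below computes how many characters a payload would occupy. `encoded_length` is a hypothetical helper, not part of the invisicode API. It assumes the padding rules described under the encoding steps: one character for a lone trailing byte, two characters plus one padding character for two trailing bytes, and one extra prefix character for text.

```python
def encoded_length(n_bytes: int, is_text: bool = False) -> int:
    """Estimate the number of invisicode characters for n_bytes of payload.

    Hypothetical helper for illustration only; not part of the library.
    """
    full_groups, remainder = divmod(n_bytes, 3)
    # Every full 3-byte group becomes two base-4096 characters.
    length = 2 * full_groups
    if remainder == 1:
        length += 1  # a lone trailing byte becomes one character
    elif remainder == 2:
        length += 3  # two characters plus one padding character (assumed layout)
    if is_text:
        length += 1  # text payloads carry one extra prefix character
    return length


for n in (12, 1_000, 10**8):
    chars = encoded_length(n)
    # The bytes-per-character ratio approaches 1.5 as the input grows.
    print(n, chars, round(n / chars, 4))
```

For inputs whose length is a multiple of 3 the ratio is exactly 1.5; otherwise the overhead is at most a couple of characters regardless of input size.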
Originally designed for Miza as one of several methods of hiding small amounts of persistent data in text messages. The hidden data represents instructions for future edits to the message while remaining visually unobtrusive to users.
Capable of encoding both byte-strings and text-strings, and distinguishing between the two, using an additional signifier character (0x1d17a).
See https://thomas-xin.github.io/invisicode for an interactive demo of the encoding! You may use this to verify whether various examples of encoded text or data correctly render invisibly on your device or platform.
pip install invisicode
encode(b: str | bytes | bytearray | memoryview | numpy.ndarray) -> str
# Encode bytes or text into an invisicode string (code points in the 0xE0000-0xE0FFF range, plus the optional string prefix).
decode(s: str | numpy.ndarray, expect: type = None, strict=True) -> bytes | str
# Decode an invisicode glyph sequence into bytes or text, enforcing optional type expectations.
l128_encode(s: str) -> memoryview
# Encode a text string using variable-length base-128 encoding.
l128_decode(b: bytes | bytearray | memoryview) -> str
# Decode bytes produced by l128_encode back into a Unicode string.
is_invisicode(s: str | numpy.ndarray, strict: bool = True)
# Return whether a string or array contains only invisicode code points. In non-strict mode, allow strings containing any invisicode code points, as well as empty strings.
detect(s: str | numpy.ndarray) -> numpy.ndarray
# Locate contiguous invisicode segments within the provided text.
detect_and_decode(s: str | numpy.ndarray, expect: type = None) -> list
# Detect all invisicode substrings in the input and decode each one.
- Encoding and decoding regular binary data
import invisicode
data = b"Hello World!"
encoded = invisicode.encode(data) # '\U000e0548\U000e06c6\U000e0f6c\U000e0206\U000e0f57\U000e0726\U000e046c\U000e0216'
assert invisicode.decode(encoded) == data # b"Hello World!"
- Encoding and decoding a regular string
import invisicode
data = "Hello World! ❤️"
encoded = invisicode.encode(data) # '\U0001d17a\U000e0548\U000e06c6\U000e0f6c\U000e0206\U000e0f57\U000e0726\U000e046c\U000e0216\U000e0420\U000e04ee\U000e0c8f\U000e003f'
assert invisicode.decode(encoded) == data # 'Hello World! ❤️'
- Encoding and decoding a (relatively) large amount of binary data
import invisicode
import numpy as np
data = np.random.randint(0, 256, size=10 ** 8, dtype=np.uint8)
encoded = invisicode.encode(data) # '\U000e05b7\U000e0504\U000e02cc\U000e09a9\U000e0df5\U000e0066\U000e0d96󠅋\U000e0959\U000e0469...
len(data), len(encoded) # (100000000, 66666667)
assert invisicode.decode(encoded) == data.tobytes()
- Invisicode also exposes its LEB128 encoding for strings, which it uses internally for slightly better coding efficiency than UTF-8 (since the information is re-encoded anyway, the redundancy and error checking that UTF-8 normally provides are of no use here).
import invisicode
invisicode.l128_encode("test") # memoryview(b'test')
invisicode.l128_encode("Hello World! ❤️") # memoryview(b'Hello World! \xe4N\x8f\xfc\x03'); 18 bytes vs 19 for utf-8
assert invisicode.l128_decode(invisicode.l128_encode("驈ꍬ啯ꍲᕤ")) == "驈ꍬ啯ꍲᕤ"
- Note: All numbers are encoded as little-endian bytes where applicable.
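To make the LEB128 string encoding shown in the examples above more concrete, here is a minimal pure-Python sketch. It assumes each code point is written as an unsigned LEB128 integer (7 data bits per byte, low bits first, high bit set on every byte except the last), which reproduces the `l128_encode` output documented above. The names `leb128_encode_text` and `leb128_decode_text` are hypothetical and not part of the library, whose actual implementation may differ.

```python
def leb128_encode_text(text: str) -> bytes:
    """Encode each code point of text as an unsigned LEB128 integer (sketch)."""
    out = bytearray()
    for ch in text:
        value = ord(ch)
        while True:
            low7 = value & 0x7F
            value >>= 7
            if value:
                out.append(low7 | 0x80)  # more bytes follow for this code point
            else:
                out.append(low7)         # final byte of this code point
                break
    return bytes(out)


def leb128_decode_text(data: bytes) -> str:
    """Inverse of leb128_encode_text (sketch)."""
    chars = []
    value = 0
    shift = 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7                   # continuation bit set: keep accumulating
        else:
            chars.append(chr(value))
            value = 0
            shift = 0
    if shift:
        raise ValueError("truncated LEB128 sequence")
    return "".join(chars)


sample = "Hello World! \u2764\ufe0f"
packed = leb128_encode_text(sample)
print(len(packed), len(sample.encode("utf-8")))
```

ASCII characters still take one byte each, while code points up to 0x3FFF fit in two bytes and all remaining code points fit in three, which is where the saving over UTF-8 comes from.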
The encoding is performed as follows:
- If the input is a string, encode it using its LEB128 representation (slightly more space-efficient than UTF-8), and prepend the string prefix character 0x1d17a (a non-printable character outside the normal invisicode range).
- Each group of 3 bytes from the input is converted to two base-4096 numbers, by reinterpreting as a base-16777216 number and then splitting.
- 0xE0000 is added to each resulting number, placing it in the Tags block, the Variation Selectors Supplement block, or the ranges that follow them. These code points typically render as non-printing, zero-width characters.
- If there is a single trailing byte (length % 3 == 1), it is encoded by itself by adding 0xE0000.
- If there are two trailing bytes (length % 3 == 2), they are encoded similarly, but with a padding character 0xE0FFF appended at the end. This lets the string still contain an odd number of characters and stay within invisicode's normal range, while remaining distinct from the (length % 3 == 1) case.
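The encoding steps above can be sketched in pure Python as follows. This is an illustrative reading of the description, not the library's implementation: `encode_bytes_sketch` is a hypothetical name, text input is assumed to have already been converted to LEB128 bytes, and the layout of the two-trailing-bytes case (low 12 bits, then the remaining high bits, then padding) is an assumption based on "encoded similarly".

```python
BASE = 0xE0000         # offset added to every 12-bit value
PADDING = 0xE0FFF      # appended when two trailing bytes are present
TEXT_PREFIX = 0x1D17A  # marks the payload as a text string


def encode_bytes_sketch(data: bytes, is_text: bool = False) -> str:
    """Encode bytes following the documented steps (hypothetical sketch)."""
    out = [chr(TEXT_PREFIX)] if is_text else []
    full = len(data) - len(data) % 3
    for i in range(0, full, 3):
        # Reinterpret 3 bytes as one little-endian 24-bit number...
        n = int.from_bytes(data[i:i + 3], "little")
        # ...and split it into two base-4096 digits, low digit first.
        out.append(chr(BASE + (n & 0xFFF)))
        out.append(chr(BASE + (n >> 12)))
    tail = data[full:]
    if len(tail) == 1:
        out.append(chr(BASE + tail[0]))
    elif len(tail) == 2:
        n = int.from_bytes(tail, "little")
        out.append(chr(BASE + (n & 0xFFF)))
        out.append(chr(BASE + (n >> 12)))
        out.append(chr(PADDING))  # keeps the length odd, distinct from the 1-byte case
    return "".join(out)


encoded = encode_bytes_sketch(b"Hello World!")
print([hex(ord(c)) for c in encoded])
```

For `b"Hello World!"` this yields the same eight code points as the binary example earlier in this document, which suggests the low digit of each 24-bit group is emitted first.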
The decoding is performed as follows:
- If the string begins with the string prefix character (0x1d17a), remove that and flag the content as string.
- If there is an odd number of characters, trailing bytes are present. Check whether the last character is the padding character (0xE0FFF) to determine whether one or two bytes should be extracted.
- 0xE0000 is subtracted from remaining characters; this step should raise an exception if any would go below 0.
- Each pair of results is combined into a base-16777216 number, split into three base-256 numbers, and reinterpreted as bytes.
- If necessary, convert the result back to a string.
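The decoding steps above can be sketched in the same way. Again this is a hypothetical illustration rather than the library's code: `decode_sketch` is an invented name, it returns the raw bytes together with a flag instead of converting text back to a string, and it assumes the same two-trailing-bytes layout (low 12 bits, high bits, padding) as an encoder built from the description.

```python
from typing import Tuple

BASE = 0xE0000
PADDING = 0xE0FFF
TEXT_PREFIX = 0x1D17A


def decode_sketch(s: str) -> Tuple[bytes, bool]:
    """Return (payload_bytes, is_text) for an invisicode string (sketch)."""
    is_text = s.startswith(chr(TEXT_PREFIX))
    if is_text:
        s = s[1:]
    points = [ord(c) for c in s]

    # An odd count means trailing bytes; the padding character tells us how many.
    tail_bytes = 0
    if len(points) % 2 == 1:
        if points[-1] == PADDING:
            points.pop()
            tail_bytes = 2
        else:
            tail_bytes = 1

    values = []
    for p in points:
        v = p - BASE
        if not 0 <= v <= 0xFFF:
            raise ValueError(f"code point {p:#x} is outside the invisicode range")
        values.append(v)

    last = values.pop() if tail_bytes == 1 else None
    out = bytearray()
    pairs = list(zip(values[0::2], values[1::2]))
    for idx, (low, high) in enumerate(pairs):
        n = low | (high << 12)  # recombine two base-4096 digits
        # The final pair holds only 2 bytes when padding was present.
        size = 2 if (tail_bytes == 2 and idx == len(pairs) - 1) else 3
        out += n.to_bytes(size, "little")
    if last is not None:
        out.append(last)  # raises ValueError if the value does not fit in a byte
    return bytes(out), is_text


payload, is_text = decode_sketch("\U000e0548\U000e06c6\U000e0f6c\U000e0206")
print(payload, is_text)
```

When the flag is set, the returned bytes would still need the LEB128 string decoding step to recover the original text.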
The text between the characters "X" and "Y" below may be decoded as invisicode. It contains 2173 invisible characters and represents 3258 bytes of LEB128 data, which may then be further decoded into 2568 Unicode characters. For comparison, UTF-8 would encode the same text as 3635 bytes.
X󠄠󠄠󠅷󠅮󠅮󠄠󠅮󠅮󠅳󠅧󠄠󠄠󠅭󠅮󠄠󠅮󠄠󠄠󠅳󠄠󠅮󠄠󠅎󠄠󠅮󠄠󠅳󠅧󠄠󠄠󠅮󠅮󠅮󠄠󠄠󠄠󠅭󠅮󠄠󠅮󠄠󠅮󠅮Y