LZ-String encoding for Java (cross-platform, including GWT)
tommyettinger Minor cleanup; release 1.4.4.3
This eliminates a seemingly-unnecessary step that wasted memory when decompressing URI-encoded text, and was putting one non-URI-safe char in URI-encoded text. It seems to encode and decode exactly the same.
Latest commit 579354f Jul 7, 2019


BlazingChain


This one-file library can be used to compress Java/JVM Strings with LZ-String encoding, as developed by pieroxy for JavaScript and continued by rufushuang in a Java port. This code is a cleaned-up and optimized copy of rufushuang's lz-string4java, and is MIT-licensed like that project.

LZ-String encoding can offer significant compression on UTF-16 Strings, like those in Java or in a web browser's (tightly constrained) LocalStorage. A simple example of the word "hello" repeated 15 times, each word followed by one of 15 different hex digits, goes from 90 UTF-16 chars to 23 UTF-16 chars with the default LZ-String encoding. There are options in the library for Base64 compression and URI component compression as well. These use only the chars possible in those formats, at a substantial loss to compression if the result is stored as UTF-16 (but a slight gain if you can store the Base64 or URI-encoded chars as 6 bits each instead of 16 bits, or an about-even comparison with the default UTF-16 compression scheme if you use 8 bits each). UTF-8 already does rather well at lowering content size for ASCII text, so a reduction to about 2/3 as many bytes is the most you should expect if you encode as Base64. Since 8-bit chars waste about 2 bits per byte when storing Base64 data, the gain is not due to better usage of the individual bits per char, but rather to the usage of a modified LZW compression on the text.

LZW belongs to the Lempel-Ziv family of compression algorithms, which do especially well on repetitive data. A related algorithm from that family, LZMA, is used by the .7z archive format. Excepting slow and heavy-weight arithmetic coding techniques, which may do better on file size, .7z with LZMA was the only format I found that could compress a 13GB folder of immensely-repetitive data down to about 60 MB, though less common formats like .xz and .lz use the same or a similar algorithm. Any patents on LZW seem to have expired, and it is common in various software.
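To make the effect of LZW on repetitive text concrete, here is a minimal sketch of textbook LZW applied to the "hello" plus hex digit sample described above. It is not this library's code: LZ-String uses a modified LZW with its own bit-packing, while this sketch emits plain integer codes and assumes every input char is below 256. The class name `LzwSketch` and its methods are hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class LzwSketch {
    /** Builds "hello0hello1...helloe": 15 words, each followed by a hex digit. */
    static String sample() {
        StringBuilder sb = new StringBuilder(90);
        for (int i = 0; i < 15; i++) {
            sb.append("hello").append(Integer.toHexString(i));
        }
        return sb.toString();
    }

    /** Textbook LZW: emits one integer code per longest already-seen phrase. */
    static List<Integer> compress(String text) {
        Map<String, Integer> dict = new HashMap<>();
        for (int i = 0; i < 256; i++) {
            dict.put(String.valueOf((char) i), i);
        }
        List<Integer> codes = new ArrayList<>();
        String phrase = "";
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c > 255) {
                throw new IllegalArgumentException("this sketch only handles chars below 256");
            }
            String extended = phrase + c;
            if (dict.containsKey(extended)) {
                phrase = extended; // keep growing the current phrase
            } else {
                codes.add(dict.get(phrase));
                dict.put(extended, dict.size()); // the new phrase gets the next free code
                phrase = String.valueOf(c);
            }
        }
        if (!phrase.isEmpty()) {
            codes.add(dict.get(phrase));
        }
        return codes;
    }

    /** Rebuilds the text by regrowing the same dictionary the compressor built. */
    static String decompress(List<Integer> codes) {
        if (codes.isEmpty()) {
            return "";
        }
        List<String> dict = new ArrayList<>();
        for (int i = 0; i < 256; i++) {
            dict.add(String.valueOf((char) i));
        }
        String previous = dict.get(codes.get(0));
        StringBuilder out = new StringBuilder(previous);
        for (int i = 1; i < codes.size(); i++) {
            int code = codes.get(i);
            String entry;
            if (code < dict.size()) {
                entry = dict.get(code);
            } else if (code == dict.size()) {
                entry = previous + previous.charAt(0); // phrase defined by this very step
            } else {
                throw new IllegalArgumentException("invalid code: " + code);
            }
            out.append(entry);
            dict.add(previous + entry.charAt(0));
            previous = entry;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String text = sample();
        List<Integer> codes = compress(text);
        System.out.println(text.length() + " chars in, " + codes.size() + " codes out");
        System.out.println("round trip ok: " + decompress(codes).equals(text));
    }
}
```

The number of codes printed here is not comparable to the 23 chars quoted above, since the library packs its codes into variable-width bit fields rather than whole integers.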

This particular version of LZ-String encoding has been optimized on top of rufushuang's optimizations. It removes all boxing of char primitives, almost all boxing of int primitives (what remains is only there to allow usage of a generic HashMap with Integer keys), much unnecessary conversion between primitive types, and all anonymous inner classes. It also has a few other performance tweaks, like appending to one StringBuilder instead of the earlier approach of making an ArrayList of boxed Characters, appending to that, and then re-appending each Character to another StringBuilder. If premature optimization is the root of all evil, I need an exorcist, but thankfully the code is small enough that not too much extra work was needed on the original Java code. Javadocs are available in the code and on Maven Central, but the method names are clear and the API surface is small at 8 methods, half for compression and half for decompression. Some small examples (really, really small) are below.
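The StringBuilder change mentioned above can be pictured with a small sketch. This is a reconstruction of the general pattern only, not code taken from either project; `AppendSketch`, `viaBoxedList`, and `viaBuilder` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

final class AppendSketch {
    /** The older pattern: box every char into a list, then copy it all again. */
    static String viaBoxedList(char[] data) {
        List<Character> boxed = new ArrayList<>();
        for (char c : data) {
            boxed.add(c); // autoboxes each char into a Character
        }
        StringBuilder sb = new StringBuilder(boxed.size());
        for (Character c : boxed) {
            sb.append(c.charValue()); // unboxes and appends a second time
        }
        return sb.toString();
    }

    /** The direct pattern: append primitive chars to a single StringBuilder. */
    static String viaBuilder(char[] data) {
        StringBuilder sb = new StringBuilder(data.length);
        for (char c : data) {
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        char[] data = "some compressed output".toCharArray();
        System.out.println(viaBoxedList(data).equals(viaBuilder(data)));
    }
}
```

Both methods produce the same String; the second skips the intermediate list and the boxing and unboxing of each char.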

A preview is available here on github.io, which shows the URI-encoding form of compression (mainly because the full range of UTF-16 characters used by UTF16 mode couldn't be displayed by most web browsers, so the compressed result would be either unreadable or un-copy-able). The preview runs with GWT, but if you have Strings compressed by this library using other JVM types (in URI-encoded mode only for now), you can enter the compressed Strings on the right and click "<- Decompress" to show their contents at left. You can also enter uncompressed text at left and compress it with "Compress ->", writing to the right pane.

Usage

import blazing.chain.LZSEncoding; // or use import static; every method here is static.
...
String longText, compressed, decompressed;
longText = "This is some long, long, long, long, long, repetitive text!";
////These next two lines use the tightest encoding; it can use all of Unicode,
//// but may produce invalid UTF-16 codepoint pairs. It should be noted that
//// invalid pairs can cause a compressed file to be read back incorrectly if
//// it has made a round-trip to the filesystem saved as UTF-16, UTF-8, or
//// possibly any encoding other than binary. If you aren't saving the compressed
//// String as its exact bytes, you should prefer a different pair of methods.
compressed = LZSEncoding.compress(longText);
decompressed = LZSEncoding.decompress(compressed);
////you can try the next line if you want to make sure they really are equal.
//assert(longText.equals(decompressed));

////Other encodings have similar pairings of compress method to decompress method.

////This kind of encoding uses 15 of the 16 bits in a UTF-16 char, but should
//// always produce valid UTF-16. It does not compress quite as well as the first
//// method, but is compatible with various places that primarily use UTF-16.
////This is the recommended way of using the library if files are involved.
////For optimal file size, save files in UTF-16 encoding when compressed this way.
//compressed = LZSEncoding.compressToUTF16(longText);
//decompressed = LZSEncoding.decompressFromUTF16(compressed);

////This kind of encoding uses pure ASCII, specifically the 64 Base64 characters,
//// plus possibly '=' as a suffix.
//compressed = LZSEncoding.compressToBase64(longText);
//decompressed = LZSEncoding.decompressFromBase64(compressed);

////This kind of encoding uses pure ASCII, specifically the 64 characters that
//// are valid in URI component encoding.
//compressed = LZSEncoding.compressToEncodedURIComponent(longText);
//decompressed = LZSEncoding.decompressFromEncodedURIComponent(compressed);
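The warning about invalid UTF-16 pairs in the comments above can be checked without the library. This sketch uses only the standard JDK and a hand-made String containing an unpaired surrogate as a stand-in for what compress() may produce. `RoundTripSketch` and `survivesRoundTrip` are hypothetical names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class RoundTripSketch {
    /** Returns true if the text is unchanged after encoding to bytes and decoding again. */
    static boolean survivesRoundTrip(String text, Charset charset) {
        byte[] bytes = text.getBytes(charset);
        return text.equals(new String(bytes, charset));
    }

    public static void main(String[] args) {
        String valid = "long, long, repetitive text";
        // An unpaired high surrogate: a legal Java char, but not valid UTF-16 text.
        String unpaired = "a\uD800b";
        System.out.println("valid via UTF-8:     " + survivesRoundTrip(valid, StandardCharsets.UTF_8));
        System.out.println("unpaired via UTF-8:  " + survivesRoundTrip(unpaired, StandardCharsets.UTF_8));
        System.out.println("unpaired via UTF-16: " + survivesRoundTrip(unpaired, StandardCharsets.UTF_16));
    }
}
```

String.getBytes(Charset) substitutes a replacement for input it cannot encode, so the unpaired surrogate does not come back as it went in. That is why compressToUTF16, Base64, or URI encoding is the safer choice when the compressed text will pass through a text encoding.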

Installation

You can get this version (which should be compatible with lz-string 1.4.4) using this info on Maven Central. That page provides dependency info for many build tools including Maven, Gradle, Ivy, SBT, and Lein. There should be a release on GitHub as well. For GWT, you will need this inherits line:

<inherits name='blazing.chain' />

Other

The name is a play on the LZ in Blazing and LZ-String, and Chain being a String-like object, but is also a reference to an obscure, no-longer-canon group from the distant past of a particular far, far away galaxy.

Included for test purposes are a public domain poem in Finnish (Suomi) called "Tuopa tuopi tuiman tunnon", by August Ahlqvist (retrieved from Wikisource here), and the third paragraph of the public domain novel "A Princess of Mars" by Edgar Rice Burroughs (retrieved from Wikisource here). I have no idea what the poem means, but it mixes ASCII and non-ASCII characters so it serves as good test data. Each has versions in uncompressed form as well as compressed with UTF16, URI Encoding, and Base-64 modes. The mode corresponding to compress() and decompress() is not provided because I don't know how to accurately write its invalid UTF-16 codepoints to disk.
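One possible way to store such a String as its exact bytes is to write each char as two raw bytes, bypassing any text encoding. This is only a sketch under that assumption. It has not been tried against this library's output, and `RawCharsSketch` and its methods are hypothetical names, not part of BlazingChain.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.file.Files;
import java.nio.file.Path;

final class RawCharsSketch {
    /** Copies every char as two big-endian bytes, with no validation of surrogates. */
    static byte[] toRawBytes(String text) {
        ByteBuffer buffer = ByteBuffer.allocate(text.length() * 2);
        for (int i = 0; i < text.length(); i++) {
            buffer.putChar(text.charAt(i));
        }
        return buffer.array();
    }

    /** Reverses toRawBytes; the byte count must be even. */
    static String fromRawBytes(byte[] bytes) {
        if ((bytes.length & 1) != 0) {
            throw new IllegalArgumentException("expected an even number of bytes");
        }
        ByteBuffer buffer = ByteBuffer.wrap(bytes);
        char[] chars = new char[bytes.length / 2];
        for (int i = 0; i < chars.length; i++) {
            chars[i] = buffer.getChar();
        }
        return new String(chars);
    }

    /** Writes the raw bytes to a file. */
    static void save(Path file, String text) throws IOException {
        Files.write(file, toRawBytes(text));
    }

    /** Reads a file written by save. */
    static String load(Path file) throws IOException {
        return fromRawBytes(Files.readAllBytes(file));
    }
}
```

A file written this way is binary data rather than text, so it should not be opened or re-saved with a text editor.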
