Latest commit 72d6315
Feb 19, 2014
|Failed to load latest commit information.|
This is a public-domain implementation of several rANS variants. rANS is an entropy coder from the ANS family, as described in Jarek Duda's paper "Asymmetric numeral systems" (http://arxiv.org/abs/1311.2540). - "rans_byte.h" has a byte-aligned rANS encoder/decoder and some comments on how to use it. This implementation should work on all 32-bit architectures. "main.cpp" is an example program that shows how to use it. - "rans64.h" is a 64-bit version that emits entire 32-bit words at a time. It is (usually) a good deal faster than rans_byte on 64-bit architectures, and also makes for a very precise arithmetic coder (i.e. it gets quite close to entropy). The trade-off is that this version will be slower on 32-bit machines, and the output bitstream is not endian-neutral. "main64.cpp" is the corresponding example. - "rans_word_sse41.h" has a SIMD decoder (SSE 4.1 to be precise) that does IO in units of 16-bit words. It has less precision than either rans_byte or rans64 (meaning that it doesn't get as close to entropy) and requires at least 4 independent streams of data to be useful; however, it is also a good deal faster. "main_simd.cpp" shows how to use it. See my blog http://fgiesen.wordpress.com/ for some notes on the design. I've also written a paper on interleaving output streams from multiple entropy coders: http://arxiv.org/abs/1402.3392 this documents the underlying design for "rans_word_sse41", and also shows how the same approach generalizes to e.g. GPU implementations, provided there are enough independent contexts coded at the same time to fill up a warp/wavefront or whatever your favorite GPU's terminology for its native SIMD width is. Finally, there's also "main_alias.cpp", which shows how to combine rANS with the alias method to get O(1) symbol lookup with table size proportional to the number of symbols. I presented an overview of the underlying idea here: http://fgiesen.wordpress.com/2014/02/18/rans-with-static-probability-distributions/ Results on my machine (Sandy Bridge i7-2600K) with rans_byte in 64-bit mode: ---- rANS encode: 12896496 clocks, 16.8 clocks/symbol (192.8MiB/s) 12486912 clocks, 16.2 clocks/symbol (199.2MiB/s) 12511975 clocks, 16.3 clocks/symbol (198.8MiB/s) 12660765 clocks, 16.5 clocks/symbol (196.4MiB/s) 12550285 clocks, 16.3 clocks/symbol (198.2MiB/s) rANS: 435113 bytes 17023550 clocks, 22.1 clocks/symbol (146.1MiB/s) 18081509 clocks, 23.5 clocks/symbol (137.5MiB/s) 16901632 clocks, 22.0 clocks/symbol (147.1MiB/s) 17166188 clocks, 22.3 clocks/symbol (144.9MiB/s) 17235859 clocks, 22.4 clocks/symbol (144.3MiB/s) decode ok! interleaved rANS encode: 9618004 clocks, 12.5 clocks/symbol (258.6MiB/s) 9488277 clocks, 12.3 clocks/symbol (262.1MiB/s) 9460194 clocks, 12.3 clocks/symbol (262.9MiB/s) 9582025 clocks, 12.5 clocks/symbol (259.5MiB/s) 9332017 clocks, 12.1 clocks/symbol (266.5MiB/s) interleaved rANS: 435117 bytes 10687601 clocks, 13.9 clocks/symbol (232.7MB/s) 10637918 clocks, 13.8 clocks/symbol (233.8MB/s) 10909652 clocks, 14.2 clocks/symbol (227.9MB/s) 10947637 clocks, 14.2 clocks/symbol (227.2MB/s) 10529464 clocks, 13.7 clocks/symbol (236.2MB/s) decode ok! ---- And here's rans64 in 64-bit mode: ---- rANS encode: 10256075 clocks, 13.3 clocks/symbol (242.3MiB/s) 10620132 clocks, 13.8 clocks/symbol (234.1MiB/s) 10043080 clocks, 13.1 clocks/symbol (247.6MiB/s) 9878205 clocks, 12.8 clocks/symbol (251.8MiB/s) 10122645 clocks, 13.2 clocks/symbol (245.7MiB/s) rANS: 435116 bytes 14244155 clocks, 18.5 clocks/symbol (174.6MiB/s) 15072524 clocks, 19.6 clocks/symbol (165.0MiB/s) 14787604 clocks, 19.2 clocks/symbol (168.2MiB/s) 14736556 clocks, 19.2 clocks/symbol (168.8MiB/s) 14686129 clocks, 19.1 clocks/symbol (169.3MiB/s) decode ok! interleaved rANS encode: 7691159 clocks, 10.0 clocks/symbol (323.3MiB/s) 7182692 clocks, 9.3 clocks/symbol (346.2MiB/s) 7060804 clocks, 9.2 clocks/symbol (352.2MiB/s) 6949201 clocks, 9.0 clocks/symbol (357.9MiB/s) 6876415 clocks, 8.9 clocks/symbol (361.6MiB/s) interleaved rANS: 435120 bytes 8133574 clocks, 10.6 clocks/symbol (305.7MB/s) 8631618 clocks, 11.2 clocks/symbol (288.1MB/s) 8643790 clocks, 11.2 clocks/symbol (287.7MB/s) 8449364 clocks, 11.0 clocks/symbol (294.3MB/s) 8331444 clocks, 10.8 clocks/symbol (298.5MB/s) decode ok! ---- Finally, here's the rans_word_sse41 decoder on an 8-way interleaved stream: ---- SIMD rANS: 435626 bytes 4597641 clocks, 6.0 clocks/symbol (540.8MB/s) 4514356 clocks, 5.9 clocks/symbol (550.8MB/s) 4780918 clocks, 6.2 clocks/symbol (520.1MB/s) 4532913 clocks, 5.9 clocks/symbol (548.5MB/s) 4554527 clocks, 5.9 clocks/symbol (545.9MB/s) decode ok! ---- There's also an experimental 16-way interleaved AVX2 version that hits faster rates still, developed by my colleague Won Chun; I will post it soon. Note that this is running "book1" which is a relatively short test, and the measurement setup is not great, so take the results with a grain of salt. -Fabian "ryg" Giesen, Feb 2014.