Data compression program

These are simple programs written for the 'Data compression' course at the University of Wrocław.

An online (and prettier) version of this document is available at https://github.com/swistak35/uwr_bwt_compression .

Compilation

make all will build all programs.

Usage

There are a few versions of the program, each using a different sorting algorithm.

compress-{lexi,uni,suffix} input_file output_file block_size

The default block size is 256 bytes.

decompress{0,1} input_file output_file block_size

The default block size is 256 bytes.
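
For example (the file names and the 1048576-byte block size below are just placeholders), the following pair of commands compresses a file with 1 MiB blocks and then restores it, presumably with the same block size used for both steps:

compress-suffix input_file output_file 1048576
decompress1 output_file restored_file 1048576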

Compression

  • compress-lexi uses a plain lexicographic sorting algorithm. Since it sorts all shifts of a string of length n, it is roughly O(n^2). (A minimal sketch of this rotation-sorting approach appears after this list.)
  • compress-uni uses the algorithm proposed in my university course. It runs in O(n log n) and is a kind of lexicographic sort that exploits the fact that all of the sorted strings are just shifts of one another.
  • compress-suffix builds a suffix tree; reading it from left to right yields the strings in sorted order.
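
To make the lexicographic variant concrete, here is a minimal, self-contained sketch of the rotation-sorting idea (an illustration only, not the code from LexiBWT.cpp):

```cpp
// Illustration only: BWT of one block by sorting all cyclic shifts,
// roughly O(n^2) comparisons in the worst case.
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <string>
#include <utility>
#include <vector>

// Returns the last column of the sorted rotation matrix and the index of the
// row holding the original (unrotated) block, which decompression needs.
std::pair<std::string, std::size_t> bwt_by_rotations(const std::string& block) {
    const std::size_t n = block.size();
    std::vector<std::size_t> rot(n);                // rotation start positions
    std::iota(rot.begin(), rot.end(), 0);
    std::sort(rot.begin(), rot.end(), [&](std::size_t a, std::size_t b) {
        for (std::size_t i = 0; i < n; ++i) {       // compare two rotations char by char
            const char ca = block[(a + i) % n];
            const char cb = block[(b + i) % n];
            if (ca != cb) return ca < cb;
        }
        return false;                               // rotations are equal
    });
    std::string last(n, '\0');
    std::size_t original = 0;
    for (std::size_t i = 0; i < n; ++i) {
        last[i] = block[(rot[i] + n - 1) % n];      // character preceding each rotation
        if (rot[i] == 0) original = i;
    }
    return {last, original};
}
```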

Suffix theoretical background

Suffix trees are built using Ukkonen's algorithm. As stated above, this variant just builds the suffix tree and reads the suffixes from left to right. However, we assume that there is a # letter at the end, where # denotes a letter greater than any other letter of the alphabet in use. Thus, this algorithm does not sort all shifts of the given string; it sorts its suffixes.
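
For illustration, the sketch below replaces the suffix tree with a plain std::sort over suffix start positions, but it shows the same idea: append the terminal symbol and then emit, for each suffix in sorted order, the character that precedes it. The '\x7f' sentinel and the assumption that the block only contains smaller byte values (e.g. plain ASCII text) are simplifications made for this example; SuffixBWT.cpp gets the suffix order from the Ukkonen tree instead.

```cpp
// Illustration only: the suffix-based transform, with std::sort standing in
// for the suffix tree. '\x7f' plays the role of the terminal '#', assuming
// the block contains only smaller byte values (e.g. plain ASCII text).
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

std::string bwt_by_suffixes(std::string block) {
    block.push_back('\x7f');                        // append the terminal symbol
    const std::size_t n = block.size();
    std::vector<std::size_t> sa(n);                 // suffix start positions
    std::iota(sa.begin(), sa.end(), 0);
    std::sort(sa.begin(), sa.end(), [&](std::size_t a, std::size_t b) {
        // Compare the suffixes starting at a and b lexicographically.
        return block.compare(a, std::string::npos, block, b, std::string::npos) < 0;
    });
    std::string last;
    last.reserve(n);
    for (std::size_t i : sa)
        last.push_back(block[(i + n - 1) % n]);     // character preceding each suffix
    return last;
}
```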

Decompression

There are two programs for decompression:

  • decompress0 - use this one when decompressing output of compress-lexi or compress-uni.
  • decompress1 - use this one when decompressing output of compress-suffix.
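
Whichever program is used, the core of decompression is inverting the transform. A minimal sketch of the standard inversion of the rotation-based BWT via the LF mapping (an illustration only, not the repository's exact code) could look like this:

```cpp
// Illustration only: inverting the rotation-based BWT of one block via the
// standard LF mapping, given the last column and the original row index.
#include <array>
#include <cstddef>
#include <string>
#include <vector>

std::string inverse_bwt(const std::string& last, std::size_t original) {
    const std::size_t n = last.size();
    std::array<std::size_t, 256> count{};           // occurrences of each byte value
    for (unsigned char c : last) ++count[c];
    std::array<std::size_t, 256> start{};           // where each byte's run begins in the first column
    for (std::size_t c = 1; c < 256; ++c) start[c] = start[c - 1] + count[c - 1];
    std::array<std::size_t, 256> seen{};
    std::vector<std::size_t> lf(n);                 // lf[i]: row that starts with the char last[i]
    for (std::size_t i = 0; i < n; ++i) {
        const unsigned char c = last[i];
        lf[i] = start[c] + seen[c]++;
    }
    std::string out(n, '\0');
    std::size_t row = original;
    for (std::size_t k = n; k-- > 0; ) {            // rebuild the block from right to left
        out[k] = last[row];
        row = lf[row];
    }
    return out;
}
```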

Obviously, one could have made a single decompression program that detects which sorting algorithm was used; we would just have to include a fingerprint of the sorting algorithm at the beginning of the compressed file.

However, remember that this is only a university project, so the file structure is not the best and DRY is violated in places (pretty much only in the compress-* and decompress* files).

Experiments

The whole purpose of these programs is experiments. That's why there is another script here, runtests.rb. It needs Ruby to run. I was using Ruby 2.2.2, but any Ruby >= 1.9.3 should be fine.

To run tests, run:

./runtests.rb <test_suit_name>

If test_suit_name is not provided, the suite named default is used.

All test suites are defined in the top section of the file. Ruby is a very readable language, so you should have no problems understanding and modifying the test suites.

Included test suites:

  • default - the test suite used to verify the correctness of my programs. It consists mostly of tests with sizes crafted to detect various off-by-one errors, and runs all of the algorithms with small block sizes (256, 1024 bytes).
  • rfc - this test suite was created to check how the programs compress bigger text files. There are 14 text files, each containing the content of 500 RFC documents. These are pretty big files, most of them 25-30MB. This was my favourite test for measuring the speed difference between the uni and suffix algorithms.
  • mobi, epub, pdf - all of these are similar: each consists of 20 files in the given format. I don't include them in the repository as these were my personal ebooks. I used these suites to check the efficiency (compression level) achieved on each file format.
  • bigeff - this test suite was for checking how efficiency changes when we start to use really big chunks (1..20 MB).
  • bin - a bunch of binary files generated with the dd tool (3-5MB in size). I think I will prepare a script for generating these.
  • trzysta - these were also generated by dd, but they were all small (300B) files; I generated a thousand of them to have many random examples, in the hope that they might expose some bugs.

Possible optimizations

Here I list possible optimizations which came to my mind, but which I didn't have enough time and/or enthusiasm to introduce.

  • The dynamic Huffman implementation I am using here was found on the internet. It assumes that the input consists of integers - rewriting it to accept one-byte unsigned chars could be beneficial.
  • Most of the classes could be rewritten using templates.
  • Cache-line optimizations. The suffix algorithm would probably gain the most here, as it is the one that accesses memory in a fairly random way.
  • Inlining simple functions.
  • A different MoveToFront implementation - an implementation using an array instead of a linked list may be much faster because of caching (both are O(n), but the array has a bigger constant); see the sketch after this list.
  • The uni sorting may be inefficient because of its use of vectors, if they are reallocated often.
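
As a point of reference for the MoveToFront item above, here is what a simple array-based MTF pass could look like (a hypothetical sketch, not the code in MoveToFront.cpp); the table is only 256 bytes, so it should stay in cache even though every lookup is a linear scan:

```cpp
// Hypothetical array-based move-to-front pass (not the repository's code).
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <vector>

std::vector<std::uint8_t> mtf_encode(const std::vector<std::uint8_t>& data) {
    std::vector<std::uint8_t> table(256);
    std::iota(table.begin(), table.end(), 0);       // alphabet in identity order
    std::vector<std::uint8_t> out;
    out.reserve(data.size());
    for (std::uint8_t byte : data) {
        auto it = std::find(table.begin(), table.end(), byte);
        out.push_back(static_cast<std::uint8_t>(it - table.begin()));
        std::rotate(table.begin(), it, it + 1);     // move the byte to the front
    }
    return out;
}
```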

It should be noted that, for example in the suffix implementation, the sorting part of the algorithm and the MTF phase take roughly similar amounts of time, while Huffman encoding takes an order of magnitude less. Also, I am talking about CPU time here, not wall-clock time, so the Huffman <-> filesystem communication may be a good place to look for improvements.

Tests & charts

'Big chunk' test

Big chunk test efficiency chart, Big chunk test compression time chart, Big chunk test decompression time chart

Ebooks test

Ebooks test efficiency chart

Text files test (RFC)

RFC test efficiency chart, RFC test compression chart, RFC test decompression chart

Short notes & conclusions

  • Before doing this task I heard the opinion that the suffix algorithm might end up slower than the uni algorithm because of the very big constant in suffix tree construction. From my experiments it looks like that is not true, and the suffix algorithm is significantly better than uni.
  • As one could predict...:
  • ... packaged data (epub, mobi) is hard to compress effectively.
  • ... ordinary text files compress very well.
  • ... using a bigger chunk gives better compression, but also prolongs compression time.
  • ... decompression time is a few orders of magnitude smaller than compression time.

Implementation notes

Initially (and by accident) I started implementing McCreight's algorithm, which is very similar. I tried to do it based on internet resources, but I could not work out the precise process of how and when to create suffix links in that algorithm. After that I implemented Ukkonen's algorithm, which I found quite easy to follow.

The uni algorithm may be significantly improved. Currently it uses the vector implementation from the standard C++ library, and it may be the case that this does some heavy reallocating. Substituting the vectors with an n*n array is not so easy, because even with chunks as small as 100KiB it would need at least (100 * 1000)^2 = 10GB of memory. However, one could experiment with an array of n rows but fewer than n columns (say k columns), and if a row happens to need more columns, just use a linked list or a vector for the remaining elements; a rough sketch of that layout follows.
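
A rough sketch of that hybrid layout (entirely hypothetical; nothing like it exists in the repository) could look as follows:

```cpp
// Hypothetical hybrid row storage: a dense n x k block for the first k entries
// of every row, with per-row overflow vectors for the rare longer rows.
#include <cstddef>
#include <cstdint>
#include <vector>

class HybridRows {
public:
    HybridRows(std::size_t rows, std::size_t fixed_cols)
        : k_(fixed_cols), dense_(rows * fixed_cols), sizes_(rows), overflow_(rows) {}

    void push_back(std::size_t row, std::uint32_t value) {
        const std::size_t len = sizes_[row]++;
        if (len < k_)
            dense_[row * k_ + len] = value;         // cache-friendly common case
        else
            overflow_[row].push_back(value);        // rare spill path
    }

    std::uint32_t at(std::size_t row, std::size_t col) const {
        return col < k_ ? dense_[row * k_ + col] : overflow_[row][col - k_];
    }

    std::size_t size(std::size_t row) const { return sizes_[row]; }

private:
    std::size_t k_;
    std::vector<std::uint32_t> dense_;              // n rows x k columns, contiguous
    std::vector<std::size_t> sizes_;
    std::vector<std::vector<std::uint32_t>> overflow_;
};
```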

The suffix implementation can probably also be improved. It represents the string as a tree, so memory access to edges and branch nodes is very random. Waiting for data in today's computers is estimated at around 200 cycles, and if we constantly miss the cache (because the data is scattered), we spend most of the time idling. The BWT algorithm used here is also the basis of the widely known bzip2 compression format, and the author of bzip2 seems to agree with this observation. Quoting from the bzip2 documentation:

"bzip2 usually allocates several megabytes of memory to operate in, and then charges all over it in a fairly random fashion. This means that performance, both for compressing and decompressing, is largely determined by the speed at which your machine can service cache misses. Because of this, small changes to the code to reduce the miss rate have been observed to give disproportionately large performance improvements. I imagine bzip2 will perform best on machines with very large caches."
