
Suggestions for additions to coding compression chapter #1030

Open
courtneycb opened this issue Jun 26, 2019 · 0 comments
Comments

@courtneycb
Contributor

Some questions arose about the coding compression chapter and Tim has provided answers to these questions below. It would be good to make sure the answers are covered somewhere in the chapter, even if as an "extra for experts".

Question:

Take run-length encoding for example. Does it only work for black-and-white images (or two-colour systems)? If it is just black and white, I assume I can use one bit to represent each pixel. With run-length representation, one comma will take 8 bits (one character) and each number may also take 8 bits (I'm not sure how the numbers are stored in this case). This confuses me as to how effective the compression method really is.

Tim's answer:
The explanation in the field guide leaves this a bit vague. Run-length encoding works particularly well for two colours, because you don't need to say what the colour is, just the length of the runs. In practice, the numbers are written in binary, and you don't need a comma. The fax system (for example) uses Huffman coding to represent the numbers, but to give a simplified version, you could use (say) 4 bits for each run. So for the colours:

WWWWWWBBBWWBBBBBBBWWWWWWWWWW

the runs could be given in a readable form as

6, 3, 2, 7, 10

but on a computer would be 4-bit binary numbers:

0110 0011 0010 0111 1010

and they don't really have spaces of course, so it's:

01100011001001111010

That's 20 bits used to represent what was 28 pixels (so 28 bits uncompressed). Not much compression, and maybe I could have used 3 bits for each run. But how would you deal with the run of 10? (There's a good answer to this, but I'll see if you can come up with it!)
Huffman coding removes the "shall I use 3 bits or 4 bits or... " question, and works out the best number of bits to use in each case.
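Tim's 4-bit scheme above can be sketched in a few lines of Python. This is just an illustration of the simplified fixed-width version he describes, not how a real fax codec works (the function name and the assertion cap are my own choices):

```python
def rle_encode_4bit(pixels):
    """Encode a two-colour pixel row as fixed 4-bit run lengths.

    Assumes no run is longer than 15, the largest value a 4-bit
    field can hold (Tim's "run of 10" teaser hints at why a real
    scheme needs to handle longer runs).
    """
    runs = []
    count = 1
    for prev, cur in zip(pixels, pixels[1:]):
        if cur == prev:
            count += 1
        else:
            runs.append(count)
            count = 1
    runs.append(count)
    assert all(r <= 15 for r in runs), "4-bit fields cap runs at 15"
    return "".join(format(r, "04b") for r in runs)

bits = rle_encode_4bit("WWWWWWBBBWWBBBBBBBWWWWWWWWWW")
print(bits)       # → 01100011001001111010 (20 bits for 28 pixels)
```

Running it on the example row reproduces the 20-bit string above.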

Question:

With the Ziv-Lempel method, each reference contains two numbers. From Tim's video, the first reference number usually takes a maximum of 11 to 12 bits and the second 3 or 4 bits, so the two numbers use 16 bits, or 2 bytes. Does this mean that, when compressing with this method, each repeating pattern of text must be at least two characters long just to break even, and a pattern with more than two characters will start saving space?

Tim's answer:
Correct, for that choice of bit lengths.
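The break-even arithmetic is worth making explicit. A quick back-of-the-envelope check in Python, assuming the 16-bit reference and 8-bit characters from the question:

```python
# Bits saved by replacing a match with one (offset, length) reference,
# assuming a 16-bit reference (e.g. 12-bit offset + 4-bit length,
# roughly the sizes mentioned in Tim's video) and 8-bit characters.
REF_BITS = 16
CHAR_BITS = 8

def saving(match_len):
    """Bits saved by one reference of the given match length."""
    return match_len * CHAR_BITS - REF_BITS

for n in (1, 2, 3, 4):
    print(n, saving(n))   # 1 → -8, 2 → 0, 3 → 8, 4 → 16
```

A one-character match costs 8 bits, a two-character match exactly breaks even, and every character beyond that saves a full byte.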

Question:

When the algorithm is run, does it only search for patterns with at least two characters and will not refer to a single character?

Tim's answer:
Yes, although in practice shorter codes can be used (Huffman again!), but single character matches are unlikely to be helpful. And there are many variations that address this question in different ways.

Question:

What does the computer use to detect/decide repeating patterns since the patterns can be of various length?

Tim's answer:
A longest-match variation of a searching algorithm. There are some very cool ways to do this, but way beyond high school expectations. Most practical systems use a hash search to find a two- or three-character match, and then keep checking characters one by one after that. But it's easier to explain with a whiteboard. The main thing is that you can see that it's an issue, and yet another reason we need good algorithms.
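A toy version of that longest-match search might look like the sketch below. A naive candidate scan stands in for the hash lookup real compressors use, and the window size and minimum match length are illustrative choices:

```python
def longest_match(data, pos, window=4096, min_len=3):
    """Find the longest earlier match for the text starting at pos.

    Looks for a min_len-character anchor within the search window,
    then extends each candidate one character at a time, exactly as
    described above. Real compressors replace the candidate scan
    with a chained hash table for speed.
    """
    best_off, best_len = 0, 0
    key = data[pos:pos + min_len]
    if len(key) < min_len:
        return best_off, best_len
    start = max(0, pos - window)
    for cand in range(start, pos):          # naive stand-in for a hash lookup
        if data[cand:cand + min_len] == key:
            length = min_len
            while (pos + length < len(data)
                   and data[cand + length] == data[pos + length]):
                length += 1
            if length > best_len:
                best_off, best_len = pos - cand, length
    return best_off, best_len

print(longest_match("banana banana bread", 7))   # → (7, 8): "banana b" repeats
```

Even this toy version shows why the search cost matters: the naive scan does work proportional to the window size at every position, which is exactly the problem the hash search solves.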

Question:

Huffman tree, I have heard of it before but never had a chance to find out how it works. Tim's video is a great explanation on another fascinating compression idea. Is there more than one version of the Huffman tree? Or the one based on frequency is the standard one?

Tim's answer:
They are always based on frequency, since the whole point is to allocate short codes to common (i.e. high frequency) symbols. Usually it's illustrated with letters of the alphabet ("e" is common, so gets a short code), but it applies to any "symbol" e.g. for run-length, runs of 2 or 3 black pixels might be common, so they get short codes, whereas runs of 6 or more white pixels might be more common.
If you're interested, the Huffman codes for fax machines are on page 13 of this document; the shortest white run codes are for length 2 to 7, whereas black runs are most commonly just 2 or 3 pixels. Faxes always use the same codes, rather than adjusting them for a particular document. Whole rabbit hole of discussion there!!
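For readers who want to see the frequency-based construction in code, here is a minimal sketch using Python's `heapq`. The symbols and frequencies are made up for illustration; as Tim notes, real fax codes come from a fixed published table rather than being rebuilt per document:

```python
import heapq

def huffman_codes(freqs):
    """Build Huffman codes from a symbol -> frequency mapping.

    Standard textbook construction: repeatedly merge the two
    lowest-frequency subtrees. Ties are broken arbitrarily, so the
    code *lengths* are what matter, not the exact bit patterns.
    """
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    counter = len(heap)                     # tiebreaker so dicts never compare
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + b for s, b in c1.items()}
        merged.update({s: "1" + b for s, b in c2.items()})
        heapq.heappush(heap, (f1 + f2, counter, merged))
        counter += 1
    return heap[0][2]

# Invented frequencies: common symbols get short codes, rare ones long codes.
codes = huffman_codes({"e": 40, "t": 20, "a": 15, "q": 2, "z": 1})
for sym, code in sorted(codes.items(), key=lambda kv: len(kv[1])):
    print(sym, code)
```

With these frequencies, "e" ends up with a 1-bit code while "q" and "z" get 4 bits each, which is the whole point: frequent symbols are cheap, rare ones expensive.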

Question:

For a standard piece of English text, which method is usually more effective, Ziv-Lempel or Huffman? Can they be used together or the algorithm needs to evaluate and then choose the better option?

Tim's answer:
The best result (and commonly used, e.g. in gzip, zip, rar, PNG) is a combination. Huffman on its own isn't great because it assumes all characters are independent e.g. after "Buckingham palace announced that the Q" it assumes that the most likely next character is a space or "e". LZ looks for previous chunks starting with " Q", and is pretty likely to find " Qu" or even " Queen".
To combine them, the lengths of matches (typically 2 to 12 characters) are fed into a Huffman algorithm, so the most common match lengths get the shortest codes.
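You can see this combination in practice with Python's standard `zlib` module, which implements DEFLATE: LZ77 matching followed by Huffman coding of the literals and match lengths, the same pairing used by gzip, zip, and PNG:

```python
import zlib

# Highly repetitive text: LZ77 finds the repeated chunks, and Huffman
# then squeezes the resulting literals and match lengths further.
text = ("Buckingham Palace announced that the Queen " * 20).encode()
compressed = zlib.compress(text, level=9)
print(len(text), "->", len(compressed))

assert zlib.decompress(compressed) == text   # lossless round trip
```

The exact compressed size depends on the zlib version, but for text this repetitive it comes out far smaller than the input.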
