Ratio implementation overflows? #17

Open · andrew-aladev opened this issue Nov 25, 2018 · 6 comments

andrew-aladev commented Nov 25, 2018

Hello. I couldn't understand the ratio implementation. I think there are some overflow issues in the current implementation. Please correct me if I am wrong.

I see the main idea: we have a source and a destination.
The ratio equals source_length / destination_length = s / d.
A new ratio of (s + 2) / (d + 1) is good, (s + 1) / (d + 2) is bad.
So we want to reset when new_ratio < old_ratio:
we won't reset when (s + 2) / (d + 1) > s / d, and we will reset when (s + 1) / (d + 2) < s / d.

Then we add a 10000-byte lag for source_length to get a more consolidated ratio.

I see a Ruby implementation of this algorithm in rb-compress-lzw. I have no questions about that implementation because it is backed by the GMP library, so it will never overflow.

Now I am trying to read the implementation in void compress(fdin, fdout).
I see these manipulations:

  1. int bytes_out = 0; bytes_in = 0
  2. bytes_out += OBUFSIZ
  3. bytes_out += (outbits+7)>>3
  4. bytes_in += i
  5. if (rpos > rlop) bytes_in += rpos-rlop

Let's imagine a large input. Both bytes_in and bytes_out will overflow.
When the dictionary fills up, we will use the overflowed bytes_in and bytes_out and compute a wrong ratio.

I see the same problem here:

rat = (bytes_out+(outbits>>3)) >> 8;

bytes_out + (outbits >> 3) can overflow, and rat will be invalid.
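
For illustration only (this is my own minimal example, not ncompress code), here is how a 32-bit signed counter goes wrong once enough input has been processed:

#include <stdio.h>

int main(void)
{
    /* A 32-bit signed counter such as bytes_in wraps after ~2 GiB of input.
     * Signed overflow is undefined behaviour in C; in practice the counter
     * turns negative, so any ratio computed from it is garbage. */
    int bytes_in = 2147483640;  /* close to INT_MAX */
    bytes_in += 64;             /* one more input block */
    printf("%d\n", bytes_in);   /* typically prints a large negative value */
    return 0;
}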

How do we fix it?
I have not yet come up with any solution =)

andrew-aladev commented Nov 25, 2018

We want to reset when (s + m) / (d + n) < s / d.
Here m is the number of new source bytes, m >= 10000, and n is the number of new destination bytes.
We can add a special condition to exclude the cases where s == 0, d == 0 or n == 0.
So s > 0, d > 0 and n > 0.

(s + m) / (d + n) - s / d < 0.
Let's multiply by d(d + n) > 0:
sd + md - sd - sn < 0,
md - sn < 0,
md < sn.

So we can drop the inaccurate division. Then I will use GMP to implement it. Two multiplications with GMP per 10000 bytes is not a big deal. I could implement it in ncompress too, but would you accept a new GMP dependency?
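
To make this concrete, here is a minimal sketch of that division-free check in plain C. should_reset is a hypothetical helper name, and the 128-bit widening is a GCC/Clang extension, shown here only as an assumed alternative to GMP:

#include <stdint.h>

/* Reset the dictionary when the new ratio (s + m) / (d + n) drops
 * below the old ratio s / d, rearranged to m * d < s * n so that no
 * division is needed. Widening the products to 128 bits means they
 * cannot overflow for 64-bit byte counters. */
static int should_reset(uint64_t s, uint64_t d, uint64_t m, uint64_t n)
{
    if (s == 0 || d == 0 || n == 0)
        return 0; /* degenerate cases excluded, as above */
    return (unsigned __int128)m * d < (unsigned __int128)s * n;
}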

andrew-aladev commented:

I've found a small issue:

outbuf[0] = MAGIC_1;
outbuf[1] = MAGIC_2;
outbuf[2] = (char)(maxbits | block_mode);
boff = outbits = (3<<3);

rat = (bytes_in << 8) / (bytes_out+(outbits>>3));

The 3 header bytes are not related to compression. You forgot to subtract them from the output length:

#define HEADER_BYTES_LENGTH 3
rat = (bytes_in << 8) / (bytes_out + (outbits >> 3) - HEADER_BYTES_LENGTH);

The main condition is a bit wrong:

if (!stcode && bytes_in >= checkpoint && fcode.e.ent < FIRST)

I think it should not be possible for us to enter it with the clear code (FIRST is 257, one past the CLEAR code 256). We can just use 256:

if (!stcode && bytes_in >= checkpoint && fcode.e.ent < 256)

andrew-aladev commented:

I couldn't decode this "bomb":

boff = outbits = (outbits-1)+((n_bits<<3)-((outbits-boff-1+(n_bits<<3))%(n_bits<<3)));

It looks impossible.

andrew-aladev commented Nov 29, 2018

@vapier, hello, could you please describe what boff is?

Let's try to simplify it.

boff = outbits = (outbits-1)+((n_bits<<3)-((outbits-boff-1+(n_bits<<3))%(n_bits<<3)));

size_t n = n_bits * 8; // Why do we multiply `n_bits` by 8 instead of dividing? I don't know.
size_t some_bits = outbits - 1 + n;
size_t new_bits = some_bits - (some_bits - boff) % n;

boff = outbits = new_bits;
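
My reading of this simplification (an assumption on my part, since the original intent is undocumented): it rounds outbits up to the next multiple of n_bits * 8 bits counted from boff, i.e. it pads the output to an n_bits-byte group boundary. A standalone demo:

#include <stdio.h>
#include <stddef.h>

/* Standalone copy of the simplified formula above. */
static size_t align_outbits(size_t outbits, size_t boff, size_t n_bits)
{
    size_t n = n_bits << 3;              /* group size in bits */
    size_t some_bits = outbits - 1 + n;
    return some_bits - (some_bits - boff) % n;
}

int main(void)
{
    /* With n_bits = 16 a group is 128 bits: a position 96 bits past
     * the boundary gets padded up with 32 zero bits... */
    printf("%zu\n", align_outbits(96, 0, 16));   /* prints 128 */
    /* ...while an already aligned position stays untouched. */
    printf("%zu\n", align_outbits(128, 0, 16));  /* prints 128 */
    return 0;
}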

Can I just ignore this formula and continue implementing LZW? No.

Let's modify the code a bit.

fprintf(stderr, "clear, source length: %lu, destination length: %lu\n", bytes_in, bytes_out + (outbits >> 3) - 3);

ratio = 0;
clear_htab();
output(outbuf,outbits,CLEAR,n_bits);

size_t n = n_bits << 3;
size_t some_bits = outbits - 1 + n;
size_t new_bits = some_bits - (some_bits - boff) % n;
fprintf(stderr, "clear, surprise offset (bits): %ld\n", new_bits - outbits);
boff = outbits = new_bits;

extcode = MAXCODE(n_bits = INIT_BITS)+1;
free_ent = FIRST;
stcode = 1;

Now let's try to compress an example.

wget "http://norvig.com/big.txt"
head -c 571881 big.txt > fit.txt
lzws-cli < fit.txt > fit_1.txt.Z

compress < fit.txt > fit_2.txt.Z
clear, source length: 571880, destination length: 223882
clear, surprise offset (bits): 32

tail -c 15 fit_1.txt.Z | ruby -e "pp STDIN.read.unpack('C*')"
[201, 251, 3, 1, 247, 39, 17, 44, 199, 6, 0, 1, 40, 194, 0]

tail -c 15 fit_2.txt.Z | ruby -e "pp STDIN.read.unpack('C*')"
[247, 39, 17, 44, 199, 6, 0, 1, 0, 0, 0, 0, 40, 194, 0]

You can see 32 zero bits (4 bytes). My lzws-cli produces no magic zeroes.

head -c 1146861 big.txt > fit.txt

compress < fit.txt > fit_2.txt.Z
clear, source length: 1146860, destination length: 454566
clear, surprise offset (bits): 64

tail -c 15 fit_2.txt.Z | ruby -e "pp STDIN.read.unpack('C*')"
[121, 70, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 97, 196, 0]

You can see 64 zero bits (8 bytes) without any reason.

I've found version 4.0 (from 2001) here.
Second archive:

} else {
  ratio = 0;

  #ifdef DEBUG
  if(verbose)
    dump_tab();	/* dump string table */
  #endif

  cl_hash ( (count_int) hsize );

  free_ent = FIRST;
  clear_flg = 1;
  output ( (code_int) CLEAR );
  #ifdef DEBUG
  if(debug)
    fprintf ( stderr, "clear\n" );
  #endif /* DEBUG */
}

So this surprise was planted between 4.0 (2001) and 4.2.4 (2006).
Nobody knows what it means, but it is required:
both ncompress and gzip's unlzw can't decompress content without this surprise offset because of the same formula inside them.

vapier commented Jan 4, 2019

i don't think the code is obfuscated so much as it was written decades ago when coding standards were significantly different. advice on the original implementation is pretty much lost at this point, so your guess is probably as good as mine.

iirc, gzip & ncompress had a lot of code/algorithm copied between them long ago, which is why they're similar.

andrew-aladev commented:

@vapier, I've just finished implementing a decompressor, and I can confirm that I can't decompress ncompress output because of the zeroes produced by the formula above. I wouldn't ever touch this formula if it didn't cause a serious issue.

Please see the example above with big.txt. From the decompressor's point of view, I am receiving zeroes after the clear code without any reason. Ncompress outputs a 16-bit clear code, then a seemingly random amount of zeroes. I've reproduced 4 and 8 zero bytes.

I can fix this issue in ncompress by removing this formula, but the new version of ncompress wouldn't be compatible with previous versions in production, and nobody would accept such a pull request.

So we can't fix this issue in ncompress. We have to document it as a known bug. We need a way for new software to detect that data was compressed by ncompress with the broken formula. Then the decompressor should skip the zeroes, and the compressor has to output the same amount of zeroes, as sketched below.
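
A rough sketch of what that skipping could look like on the decompressor side (skip_clear_padding is a hypothetical helper of mine that mirrors the compressor's alignment formula, not existing ncompress code):

#include <stddef.h>

/* After reading a CLEAR code, advance the input bit offset to the next
 * multiple of n_bits * 8 bits counted from the start of the current
 * group, jumping over the padding zeroes that ncompress emits. */
static void skip_clear_padding(size_t *inbits, size_t boff, size_t n_bits)
{
    size_t n = n_bits << 3;          /* group size in bits */
    size_t used = (*inbits - boff) % n;
    if (used != 0)
        *inbits += n - used;         /* skip the zero bits */
}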
