Ratio implementation overflows? #17

Open · andrew-aladev opened this issue Nov 25, 2018 · 6 comments

andrew-aladev commented Nov 25, 2018

Hello. I couldn't understand the ratio implementation. I think there are some overflow issues in the current implementation. Please correct me if I am wrong.

I see the main idea: we have a source and a destination.
The ratio equals source_length / destination_length = s / d.
A new ratio of (s + 2) / (d + 1) is good, (s + 1) / (d + 2) is bad.
So we want to reset when new_ratio < old_ratio:
we won't reset when (s + 2) / (d + 1) > s / d, and we will reset when (s + 1) / (d + 2) < s / d.

Then we add a 10000-byte lag for source_length to get a more consolidated ratio.

I see a Ruby implementation of this algorithm in rb-compress-lzw. I have no questions about that implementation because it is backed by the GMP library, so it will never overflow.

Now I am trying to read the implementation in void compress(fdin, fdout).
I see these manipulations:

  1. int bytes_out = 0; bytes_in = 0
  2. bytes_out += OBUFSIZ
  3. bytes_out += (outbits+7)>>3
  4. bytes_in += i
  5. if (rpos > rlop) bytes_in += rpos-rlop

Let's imagine a large input. Both bytes_in and bytes_out will overflow.
When the dictionary fills up, we will use the overflowed bytes_in and bytes_out and compute a wrong ratio.

I see the same problem here:

rat = (bytes_out+(outbits>>3)) >> 8;

bytes_out + (outbits >> 3) can overflow, and rat will be invalid.
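
For illustration only (this is my own minimal example, not ncompress code), here is how a 32-bit signed counter goes wrong once enough input has been processed:

#include <stdio.h>

int main(void)
{
    /* A 32-bit signed counter such as bytes_in wraps after ~2 GiB of input.
     * Signed overflow is undefined behaviour in C; in practice the counter
     * turns negative, so any ratio computed from it is garbage. */
    int bytes_in = 2147483640;  /* close to INT_MAX */
    bytes_in += 64;             /* one more input block */
    printf("%d\n", bytes_in);   /* typically prints a large negative value */
    return 0;
}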

How do we fix it?
I have not yet come up with any solution =)

andrew-aladev commented Nov 25, 2018

We want to reset when (s + m) / (d + n) < s / d.
Here m is the number of new source bytes, m >= 10000, and n is the number of new destination bytes.
We can add a special condition to exclude the cases where s == 0, d == 0 or n == 0.
So s > 0, d > 0 and n > 0.

(s + m) / (d + n) - s / d < 0.
Let's multiply by d(d + n) > 0:
sd + md - sd - sn < 0,
md - sn < 0,
md < sn.

So we can drop the inaccurate division. Then I will use GMP to implement it. Two multiplications with GMP per 10000 bytes is not a big deal. I could implement it in ncompress too, but would you accept a new GMP dependency?
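
To make this concrete, here is a minimal sketch of that division-free check in plain C. should_reset is a hypothetical helper name, and the 128-bit widening is a GCC/Clang extension, shown here only as an assumed alternative to GMP:

#include <stdint.h>

/* Reset the dictionary when the new ratio (s + m) / (d + n) drops
 * below the old ratio s / d, rearranged to m * d < s * n so that no
 * division is needed. Widening the products to 128 bits means they
 * cannot overflow for 64-bit byte counters. */
static int should_reset(uint64_t s, uint64_t d, uint64_t m, uint64_t n)
{
    if (s == 0 || d == 0 || n == 0)
        return 0; /* degenerate cases excluded, as above */
    return (unsigned __int128)m * d < (unsigned __int128)s * n;
}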

andrew-aladev commented:

I've found a small issue:

outbuf[0] = MAGIC_1;
outbuf[1] = MAGIC_2;
outbuf[2] = (char)(maxbits | block_mode);
boff = outbits = (3<<3);

rat = (bytes_in << 8) / (bytes_out+(outbits>>3));

The 3 header bytes are not related to compression. You forgot to subtract them from the output length:

#define HEADER_BYTES_LENGTH 3
rat = (bytes_in << 8) / (bytes_out + (outbits >> 3) - HEADER_BYTES_LENGTH);

The main condition is a bit wrong:

if (!stcode && bytes_in >= checkpoint && fcode.e.ent < FIRST)

I think it should not be possible for us to enter it with the clear code (FIRST is 257, one past the CLEAR code 256). We can just use 256:

if (!stcode && bytes_in >= checkpoint && fcode.e.ent < 256)

andrew-aladev commented:

I couldn't decode this "bomb":

boff = outbits = (outbits-1)+((n_bits<<3)-((outbits-boff-1+(n_bits<<3))%(n_bits<<3)));

It looks impossible.

andrew-aladev commented Nov 29, 2018

@vapier, hello, could you please describe what boff is?

Let's try to simplify it.

boff = outbits = (outbits-1)+((n_bits<<3)-((outbits-boff-1+(n_bits<<3))%(n_bits<<3)));

size_t n = n_bits * 8; // Why do we multiply `n_bits` by 8 instead of dividing? I don't know.
size_t some_bits = outbits - 1 + n;
size_t new_bits = some_bits - (some_bits - boff) % n;

boff = outbits = new_bits;
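
My reading of this simplification (an assumption on my part, since the original intent is undocumented): it rounds outbits up to the next multiple of n_bits * 8 bits counted from boff, i.e. it pads the output to an n_bits-byte group boundary. A standalone demo:

#include <stdio.h>
#include <stddef.h>

/* Standalone copy of the simplified formula above. */
static size_t align_outbits(size_t outbits, size_t boff, size_t n_bits)
{
    size_t n = n_bits << 3;              /* group size in bits */
    size_t some_bits = outbits - 1 + n;
    return some_bits - (some_bits - boff) % n;
}

int main(void)
{
    /* With n_bits = 16 a group is 128 bits: a position 96 bits past
     * the boundary gets padded up with 32 zero bits... */
    printf("%zu\n", align_outbits(96, 0, 16));   /* prints 128 */
    /* ...while an already aligned position stays untouched. */
    printf("%zu\n", align_outbits(128, 0, 16));  /* prints 128 */
    return 0;
}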

Can I just ignore this formula and continue implementing LZW? No.

Let's modify the code a bit.

fprintf(stderr, "clear, source length: %lu, destination length: %lu\n", bytes_in, bytes_out + (outbits >> 3) - 3);

ratio = 0;
clear_htab();
output(outbuf,outbits,CLEAR,n_bits);

size_t n = n_bits << 3;
size_t some_bits = outbits - 1 + n;
size_t new_bits = some_bits - (some_bits - boff) % n;
fprintf(stderr, "clear, surprise offset (bits): %ld\n", new_bits - outbits);
boff = outbits = new_bits;

extcode = MAXCODE(n_bits = INIT_BITS)+1;
free_ent = FIRST;
stcode = 1;

Now let's try to compress an example.

wget "http://norvig.com/big.txt"
head -c 571881 big.txt > fit.txt
lzws-cli < fit.txt > fit_1.txt.Z

compress < fit.txt > fit_2.txt.Z
clear, source length: 571880, destination length: 223882
clear, surprise offset (bits): 32

tail -c 15 fit_1.txt.Z | ruby -e "pp STDIN.read.unpack('C*')"
[201, 251, 3, 1, 247, 39, 17, 44, 199, 6, 0, 1, 40, 194, 0]

tail -c 15 fit_2.txt.Z | ruby -e "pp STDIN.read.unpack('C*')"
[247, 39, 17, 44, 199, 6, 0, 1, 0, 0, 0, 0, 40, 194, 0]

You can see 32 zero bits (4 bytes). My lzws-cli produces no magic zeroes.

head -c 1146861 big.txt > fit.txt

compress < fit.txt > fit_2.txt.Z
clear, source length: 1146860, destination length: 454566
clear, surprise offset (bits): 64

tail -c 15 fit_2.txt.Z | ruby -e "pp STDIN.read.unpack('C*')"
[121, 70, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 97, 196, 0]

You can see 64 zero bits (8 bytes) without any reason.

I've found version 4.0 (from 2001) here.
Second archive:

} else {
  ratio = 0;

  #ifdef DEBUG
  if(verbose)
    dump_tab();	/* dump string table */
  #endif

  cl_hash ( (count_int) hsize );

  free_ent = FIRST;
  clear_flg = 1;
  output ( (code_int) CLEAR );
  #ifdef DEBUG
  if(debug)
    fprintf ( stderr, "clear\n" );
  #endif /* DEBUG */
}

So this surprise was planted between 4.0 (2001) and 4.2.4 (2006).
Nobody knows what it means, but it is required:
both ncompress and gzip's unlzw can't decompress content without this surprise offset because of the same formula inside them.

vapier commented Jan 4, 2019

i don't think the code is obfuscated so much as it was written decades ago when coding standards were significantly different. advice on the original implementation is pretty much lost at this point, so your guess is probably as good as mine.

iirc, gzip & ncompress had a lot of code/algorithm copied between them long ago, which is why they're similar.

andrew-aladev commented:

@vapier, I've just finished implementing a decompressor, and I can confirm that I can't decompress ncompress output because of the zeroes produced by the formula above. I wouldn't ever touch this formula if it didn't cause a serious issue.

Please see the example above with big.txt. From the decompressor's point of view, I am receiving zeroes after the clear code without any reason. Ncompress outputs a 16-bit clear code, then a seemingly random amount of zeroes. I've reproduced 4 and 8 zero bytes.

I can fix this issue in ncompress by removing this formula, but the new version of ncompress wouldn't be compatible with previous versions in production, and nobody would accept such a pull request.

So we can't fix this issue in ncompress. We have to document it as a known bug. We need a way for new software to detect that data was compressed by ncompress with the broken formula. Then the decompressor should skip the zeroes, and the compressor has to output the same amount of zeroes, as sketched below.
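
A rough sketch of what that skipping could look like on the decompressor side (skip_clear_padding is a hypothetical helper of mine that mirrors the compressor's alignment formula, not existing ncompress code):

#include <stddef.h>

/* After reading a CLEAR code, advance the input bit offset to the next
 * multiple of n_bits * 8 bits counted from the start of the current
 * group, jumping over the padding zeroes that ncompress emits. */
static void skip_clear_padding(size_t *inbits, size_t boff, size_t n_bits)
{
    size_t n = n_bits << 3;          /* group size in bits */
    size_t used = (*inbits - boff) % n;
    if (used != 0)
        *inbits += n - used;         /* skip the zero bits */
}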
