Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

x86_64 deflate_quick fails on chromium-benchmark fireworks.jpeg #284

Closed
sebpop opened this issue Jan 17, 2019 · 27 comments
Closed

x86_64 deflate_quick fails on chromium-benchmark fireworks.jpeg #284

sebpop opened this issue Jan 17, 2019 · 27 comments
Labels

Comments

@sebpop
Copy link
Contributor

sebpop commented Jan 17, 2019

In the past I have seen some errors on x86_64 with the snappy chromium-benchmark at compression level 1. Check that these errors are fixed.

@sebpop
Copy link
Contributor Author

sebpop commented Jan 17, 2019

I still see it failing at level 1 with

fireworks.jpeg                           : compress failed (0)

@sebpop
Copy link
Contributor Author

sebpop commented Jan 17, 2019

It passes on aarch64 and fails on x86_64 which points to some bug in intel's patches for deflate_quick.

@sebpop sebpop changed the title make sure chromium-benchmark works at compression level 1 x86_64 deflate_quick fails on chromium-benchmark fireworks.jpeg Jan 17, 2019
@sebpop
Copy link
Contributor Author

sebpop commented Jan 17, 2019

When replacing deflate_quick with deflate_fast it passes.

@mtl1979
Copy link
Collaborator

mtl1979 commented Jan 17, 2019

It would be good to know why deflate_quick fails... That pretty much requires enabling debug output and possibly adding more sanity checks.

@sebpop
Copy link
Contributor Author

sebpop commented Jan 17, 2019

The benchmark uses an estimate of the compress size to resize a buffer before decompressing.
In the case of fireworks.jpeg with deflate_quick we grow the output by 5.2%:

test/bench/snappy/fireworks.jpeg         : GZIP -1: [b 1M] bytes 123093 -> 129548 105.2% comp  34.9 ( 34.9) MB/s uncomp 139.7 (139.7) MB/s

The benchmark passes if we set the estimated compression size to double of the input:

 z_size_t ZEXPORT PREFIX(compressBound)(z_size_t sourceLen) {
-    return sourceLen + (sourceLen >> 12) + (sourceLen >> 14) + (sourceLen >> 25) + 13;
+    return 2 * sourceLen;

When printing the original estimated size, we see far less than 5%:

 zng_compressBound (sourceLen=123093)
(gdb) finish
Value returned is $7 = 123143

The fix is to special case intel's deflate_quick to have a different function to estimate its output buffer size.

@sebpop
Copy link
Contributor Author

sebpop commented Jan 17, 2019

There is at least one more bug that we see with lcet10.txt and others that returns -3 which is different than the jpeg which was returning 0 instead of 1.

@mtl1979
Copy link
Collaborator

mtl1979 commented Jan 18, 2019

Maybe we should adjust one of the shifts to make it return slightly larger value... That way we don't completely butcher the logic inside compressBound...

    return sourceLen + (sourceLen >> 4) + (sourceLen >> 12) + (sourceLen >> 14) + (sourceLen >> 25) + 13;

This should allow 6.25~6.28% increase.

@bmrzycki
Copy link
Contributor

I did some further investigation and it's not entirely clear to me who's at fault. Here's my patch to alter the method that the chromium zlib benchmark compresses data that passes all tests and requires no changes to zlib-ng.

--- a/zlib/ng_snappy/src/test/bench/chromium_zlib_bench.cc
+++ b/zlib/ng_snappy/src/test/bench/chromium_zlib_bench.cc
@@ -81,7 +81,10 @@ Data read_file_data_or_exit(const char* name) {
 }
 
 size_t zlib_estimate_compressed_size(size_t input_size) {
-  return PREFIX(compressBound)(input_size);
+  // Be conservative. Some algorithms like Intel's deflate_quick
+  // (x86 level 1) may make a file larger than the original if
+  // already compressed (eg. jpeg).
+  return PREFIX(compressBound)(input_size * 3);
 }
 
 enum zlib_wrapper {
@@ -113,6 +116,14 @@ const char* zlib_wrapper_name(zlib_wrapper type) {
   return 0;
 }
 
+// Mimic what minigzip.c does, uses 8 for memory level to deflateInit2.
+#ifndef DEF_MEM_LEVEL
+# define DEF_MEM_LEVEL 8
+#endif
+
+// Mimic minigzip.c, deflate in chunk sizes of 16384 * 3.
+#define BUFLENW 49152
+
 void zlib_compress(
     const zlib_wrapper type,
     const char* input,
@@ -129,24 +140,56 @@ void zlib_compress(
   memset(&stream, 0, sizeof(stream));
 
   int result = PREFIX(deflateInit2)(&stream, compression_level, Z_DEFLATED,
-      zlib_stream_wrapper_type(type), MAX_MEM_LEVEL, Z_DEFAULT_STRATEGY);
+      zlib_stream_wrapper_type(type), DEF_MEM_LEVEL, Z_DEFAULT_STRATEGY);
   if (result != Z_OK)
     error_exit("deflateInit2 failed", result);
 
-  stream.next_out = (unsigned char *)string_data(output);
-  stream.avail_out = (uint32_t)output_size;
-  stream.next_in = (const unsigned char *)input;
-  stream.avail_in = (uint32_t)input_size;
+  int in_off = 0;
+  unsigned char *in = (unsigned char *)input;
+  size_t remain = input_size;
+
+  int out_off = 0;
+  unsigned char *out = (unsigned char *)string_data(output);
+  size_t out_end = (size_t)out + output_size;
+
+  for (;;) {
+    if (out_end < ((size_t)out + out_off + BUFLENW))
+      error_exit("not enough space in output buffer!", 1);
+
+    stream.next_out = (out + out_off);
+    stream.avail_out = BUFLENW;
+    stream.next_in = (in + in_off);
+    stream.avail_in = remain;
+
+    result = PREFIX(deflate)(&stream, Z_NO_FLUSH);
+    if (result != Z_OK)
+      error_exit("deflate did not return Z_OK!", result);
+
+    out_off += (BUFLENW - stream.avail_out);
+    in_off += (remain - stream.avail_in);
+    remain = stream.avail_in;
+    if (remain == 0)
+      break;
+  }
+
+  if (out_end < ((size_t)out + out_off + BUFLENW))
+    error_exit("not enough space in output buffer!", 1);
+
+  stream.next_out = (out + out_off);
+  stream.avail_out = BUFLENW;
+  stream.next_in = NULL;
+  stream.avail_in = 0;
 
   result = PREFIX(deflate)(&stream, Z_FINISH);
-  if (result == Z_STREAM_END)
-    output_size = stream.total_out;
-  result |= PREFIX(deflateEnd)(&stream);
   if (result != Z_STREAM_END)
-    error_exit("compress failed", result);
+    error_exit("compress failed Z_FINISH", result);
+
+  result = PREFIX(deflateEnd)(&stream);
+  if (result != Z_OK)
+    error_exit("compress failed end", result);
 
   if (resize_output)
-    output->resize(output_size);
+    output->resize(stream.total_out);
 }
 
 void zlib_uncompress(

This patch was inspired by minigzip.c's method of decompressing in chunks of 3 * 16KB instead of attempting to do it all in one call. This patch passes for all the input files at level 1 and level 9 with all 3 wrapper types.

In the case of zlib_estimate_compressed_size() I had to multiply by 3 for a baddata input file, 2 wasn't enough. At the very least compressBound is not sufficient for Intel's deflate_quick() algorithm and there are some cases where the compressed data is larger than the input file.

It's also a bit unexpected that all other algorithm variant has no trouble with PREFIX(deflate)(&stream, Z_NO_FLUSH) except deflate_quick(). If this is a supported method of using the zlib library then I have concerns there's a bug in the library code when called on this path.

@sebpop and I can try to push for changes to the benchmark but I'm worried that may just be hiding a deeper set of bugs.

@Dead2 Dead2 added the bug label Jan 21, 2019
@bmrzycki
Copy link
Contributor

I've made a small change to my previous patch and everything continues to pass:

--- a/zlib/ng_snappy/src/test/bench/chromium_zlib_bench.cc
+++ b/zlib/ng_snappy/src/test/bench/chromium_zlib_bench.cc
@@ -80,8 +80,16 @@ Data read_file_data_or_exit(const char* name) {
   return data;
 }

+// Mimic minigzip.c, deflate in chunk sizes of 16384 * 3.
+#define BUFLENW 49152
+
 size_t zlib_estimate_compressed_size(size_t input_size) {
-  return PREFIX(compressBound)(input_size);
+  // Be conservative. Some algorithms like Intel's deflate_quick
+  // (x86 level 1) may make a file larger than the original if
+  // already compressed (eg. jpeg).
+  // We work in BUFLENW blocks, make sure we have at least 2
+  // full blocks worth of room for padding.
+  return PREFIX(compressBound)(input_size) + (2 * BUFLENW);
 }

The big difference is I am now using zlib-ng's compressBound() value, just padding it with 2 additional write buffers. I've tried this on large files >200MB and do not see errors with the Google framework. Minigzip doesn't have to do this because it doesn't keep a copy in memory, it just streams the same chunk over and over to a filehandle.

@sebpop
Copy link
Contributor Author

sebpop commented Jan 25, 2019

Some algorithms like Intel's deflate_quick
(x86 level 1) may make a file larger than the original if
already compressed (eg. jpeg).

I think the change in behavior needs to be documented for the compression algorithm on x86 at level 1.
Where should we put this documentation?
In the past we spoke about documenting the changes made in zlib-ng compared to madler/zlib.
What about adding a new file in doc/ or a top level ChangeLog.zlib-ng ?

@bmrzycki
Copy link
Contributor

For those interested, here is my latest patch to the Google zlib benchmark. It is less prone to failure, especially if you alter the MEM_LEVEL passed to deflateInit2(). I am now using MAX_MEM_LEVEL like the original code did and do not alter zlib_estimate_compressed_size() other than adding two BUFLENW buffers of padding.

--- a/zlib/ng_snappy/src/test/bench/chromium_zlib_bench.cc
+++ b/zlib/ng_snappy/src/test/bench/chromium_zlib_bench.cc
@@ -80,8 +80,15 @@ Data read_file_data_or_exit(const char* name) {
   return data;
 }

+// Mimic minigzip.c, deflate in chunk sizes of 16384 * 3.
+#define BUFLENW 49152
+
 size_t zlib_estimate_compressed_size(size_t input_size) {
-  return PREFIX(compressBound)(input_size);
+  // Some algorithms like Intel's deflate_quick (x86 level 1) may make a file
+  // larger than the original if already compressed (eg. jpeg). Add two
+  // write buffers on the end for padding and to ensure there is space
+  // for Z_FINISH.
+  return PREFIX(compressBound)(input_size) + (BUFLENW * 2);
 }

 enum zlib_wrapper {
@@ -133,20 +140,60 @@ void zlib_compress(
   if (result != Z_OK)
     error_exit("deflateInit2 failed", result);

-  stream.next_out = (unsigned char *)string_data(output);
-  stream.avail_out = (uint32_t)output_size;
-  stream.next_in = (const unsigned char *)input;
-  stream.avail_in = (uint32_t)input_size;
+  int in_off = 0;
+  unsigned char *in = (unsigned char *)input;
+  size_t remain = input_size;

-  result = PREFIX(deflate)(&stream, Z_FINISH);
-  if (result == Z_STREAM_END)
-    output_size = stream.total_out;
-  result |= PREFIX(deflateEnd)(&stream);
-  if (result != Z_STREAM_END)
-    error_exit("compress failed", result);
+  int out_off = 0;
+  unsigned char *out = (unsigned char *)string_data(output);
+  size_t out_end = (size_t)out + output_size;
+
+  for (;;) {
+    if (out_end < ((size_t)out + out_off + BUFLENW))
+      error_exit("not enough space in output buffer!", 1);
+
+    stream.next_out = (out + out_off);
+    stream.avail_out = BUFLENW;
+    stream.next_in = (in + in_off);
+    stream.avail_in = remain;
+
+    result = PREFIX(deflate)(&stream, Z_NO_FLUSH);
+    if (result != Z_OK)
+      error_exit("deflate did not return Z_OK!", result);
+
+    out_off += (BUFLENW - stream.avail_out);
+    in_off += (remain - stream.avail_in);
+    remain = stream.avail_in;
+    if (remain == 0)
+      break;
+  }
+
+  // No more input data to process, iterate until we see Z_FINISH.
+  stream.next_in = NULL;
+  stream.avail_in = 0;
+  for(;;) {
+      if (out_end < ((size_t)out + out_off + BUFLENW))
+         error_exit("not enough space in output buffer!", 1);
+
+      stream.next_out = (out + out_off);
+      stream.avail_out = BUFLENW;
+
+      result = PREFIX(deflate)(&stream, Z_FINISH);
+      if (result == Z_STREAM_END) {
+         break;
+      } else if (result == Z_OK) {
+         out_off += (BUFLENW - stream.avail_out);
+      } else {
+         error_exit("compress failed Z_FINISH", result);
+      }
+  }
+
+  result = PREFIX(deflateEnd)(&stream);
+  if (result != Z_OK)
+    error_exit("compress failed end", result);

   if (resize_output)
-    output->resize(output_size);
+    output->resize(stream.total_out);
 }

 void zlib_uncompress(

@Myriachan
Copy link

Why can't this be fixed instead by changing deflateBound and compressBound to always include this extra size #if x86?

@bmrzycki
Copy link
Contributor

Hello @Myriachan , I think that's one option. My concern with such a patch is if it masks a deeper bug or issue. I cant say with confidence if x86 at level 1 is operating correctly and don't have the deeper knowledge of Intel's deflate_quick overall algorithm.

@Dead2
Copy link
Member

Dead2 commented Aug 21, 2019

Does PR #382 affect this problem at all? It seems like a real bug in deflate_quick, but I am not clear on whether it fixes this problem or not.

@bmrzycki
Copy link
Contributor

Hello @Dead2 . I rebased my internal test framework to c0f0acf after seeing PR #382 was merged to tip. I then re-ran bench/chromium_zlib_bench.cc without any of my above patches to prevent failures. When running I still see the same failure message:

compress failed (0)
fireworks.jpeg 

@Dead2
Copy link
Member

Dead2 commented Aug 23, 2019

@bmrzycki Thank you for checking, I really appreciate it :)

I am hoping that this bug is related to #390, since that seems to erroneously emit an extra end block each time it runs out of data in the in-buffer.
At least that offers a logical explanation to this bug. I have otherwise been unable to understand how deflate_quick could deliver worse worst-case compression than the algorithm itself explains (without delivering corrupt files), but I had not thought about the possibility of it emitting extra metadata blocks other than at the start and end.

@bmrzycki
Copy link
Contributor

@Dead2 you're welcome. :)

I manually patched c0f0acf with the code from PR #390 :

$ git diff
diff --git a/zlib/ng_snappy/src/arch/x86/deflate_quick.c b/zlib/ng_snappy/src/arch/x86/deflate_quick.c
index ce1f52f5..ac13a714 100644
--- a/zlib/ng_snappy/src/arch/x86/deflate_quick.c
+++ b/zlib/ng_snappy/src/arch/x86/deflate_quick.c
@@ -238,7 +238,7 @@ ZLIB_INTERNAL block_state deflate_quick(deflate_state *s, int flush) {
         if (s->lookahead < MIN_LOOKAHEAD) {
             fill_window_sse(s);
             if (s->lookahead < MIN_LOOKAHEAD && flush == Z_NO_FLUSH) {
-                static_emit_end_block(s, 0);
+                /* static_emit_end_block(s, 0); -- BMR test PR 390 */
                 return need_more;
             }
             if (s->lookahead == 0)

and I still see the same compress failed (0) message for fireworks.jpeg. It's unclear to me if this is a false path or merely an incomplete patch.

@nmoinvaz
Copy link
Member

nmoinvaz commented Sep 7, 2019

With fireworks.jpeg I am getting the following error using deflate level 1 and mem level 1:

chunkmemset should be called on larger chunks

When I comment out the following line, it works:

#define INFFAST_CHUNKSIZE sizeof(inffast_chunk_t)

@nmoinvaz
Copy link
Member

nmoinvaz commented Sep 9, 2019

@bmrzycki can you test my PR #396 and see if it solves the problem for you?

@bmrzycki
Copy link
Contributor

Hello @nmoinvaz , I applied the code in #396 on top of 5d4d630 :

$ git diff
diff --git a/zlib/ng_snappy/src/inffast.c b/zlib/ng_snappy/src/inffast.c
index 62cb9513..fa8992f4 100644
--- a/zlib/ng_snappy/src/inffast.c
+++ b/zlib/ng_snappy/src/inffast.c
@@ -307,7 +307,9 @@ void ZLIB_INTERNAL zng_inflate_fast(PREFIX3(stream) *strm, unsigned long start)
                        operations can write beyond `out+len` so long as they
                        stay within 258 bytes of `out`.
                     */
-                    if (dist >= len || dist >= INFFAST_CHUNKSIZE)
+                    if (len < sizeof(uint64_t))
+                        out = chunkcopysafe(out, out - dist, len, safe);
+                    else if (dist >= len || dist >= INFFAST_CHUNKSIZE)
                         out = chunkcopy(out, out - dist, len);
                     else
                         out = chunkmemset(out, dist, len);
diff --git a/zlib/ng_snappy/src/inflate.c b/zlib/ng_snappy/src/inflate.c
index fae8847d..2c1dfbd2 100644
--- a/zlib/ng_snappy/src/inflate.c
+++ b/zlib/ng_snappy/src/inflate.c
@@ -978,7 +978,10 @@ int ZEXPORT PREFIX(inflate)(PREFIX3(stream) *strm, int flush) {
                 if (copy > left)
                     copy = left;
 #if defined(INFFAST_CHUNKSIZE)
-                put = chunkcopysafe(put, from, copy, put + left);
+               if (copy >= sizeof(uint64_t))
+                    put = chunkmemsetsafe(put, state->offset, copy, left);
+                else
+                    put = chunkcopysafe(put, put - state->offset, copy, put);
 #else
                 if (copy >= sizeof(uint64_t))
                     put = chunk_memcpy(put, from, copy);

And the test still failed on x86_64 (zlib wrapper):

compress failed (0)
fireworks.jpeg                           :   GO_ALL_SCORES_FILE ==> /work/b.rzycki/single-test/zlib/ng_snappy/b.rzycki_zlibng_zlib_bench.5682-1848-8150/score.out
  TIME_END ==> Wed Sep 11 11:14:57 CDT 2019

@nmoinvaz
Copy link
Member

nmoinvaz commented Mar 4, 2020

Worst case for deflate_quick is no matches found in entire data and max of 9 bits for each literal.

max_output = ((input_size + 17) >> 3) + input_size + 18

That is, 1 bit extra for each byte + 3 bits for block start + 10 bits for block end + 7 bits for round up, converted to bytes + the input size + 18 bytes for gzip header and footer.

Here is a text file that produces this worse case result:
nomatch.txt
nomatch.bin.gz

Buffer size must be about 1.1367 * input_size. I don't think there is anything wrong with the deflate_quick algorithm. It is just emitting the distance and literal codes as it sees them without creating a huffman tree.

@nmoinvaz
Copy link
Member

@sebpop, @bmrzycki can you please retest your use cases against develop branch?

@gdevenyi
Copy link
Contributor

@neurolabusc is still indicating this triggers a segfault in pigz at least: afni/afni#138 (comment)

@nmoinvaz
Copy link
Member

I think that might be a separate issue fixed in #541. This change made in develop was to fix compressBound calculation.

@bmrzycki
Copy link
Contributor

bmrzycki commented Mar 24, 2020

Hello @nmoinvaz. I am unable to perform this testing now. I'm at a new company and do not have formal clearance from management to contribute to this project yet. In the meantime you should be able to test it yourself on any x86_64 Linux host. The code diff to fix the errors is from #284 (comment) on top of the Chromium zlib benchmark C++ code: https://github.com/chromium/chromium/blob/master/third_party/zlib/contrib/bench/zlib_bench.cc . Without the patch we see the failure.

Linking to zlib.a is a simple compile. The zlib_bench is a standalone C++ program:

$CC -O3 -Wall -std=c++11 -c zlib_bench.cc
$CC zlib_bench.o /path/to/zlib-ng/develop/zlib.a -o zlib_bench

The source files for compression/decompression come from Google's snappy testdata directory: https://github.com/google/snappy/tree/master/testdata .

You can execute the test with the following:

zlib_bench gzip --compression 1 fireworks.jpeg

@nmoinvaz
Copy link
Member

Thanks @sebpop. I have tested it on x86 myself but I was looking for independent verification since this issue has been open a long time. Congrats on your new position! Hopefully they will let you contribute.

@nmoinvaz
Copy link
Member

nmoinvaz commented Apr 17, 2020

This should be resolved now that #541 has been merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

7 participants