Increase in SnappyOutputStream output size after #82 #100

JoshRosen · 2015-04-09T02:27:55Z

It appears that the size of the compressed output generated by SnappyOutputStream increased between versions 1.1.1.1 and 1.1.1.2. To see this, I ran a microbenchmark which serializes 1000 integers using Java serialization, compresses the result using a SnappyOutputStream, and reports the serialized size.
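For readers who want to try this kind of measurement without opening the gist, here is a minimal sketch of the idea. It is not the gist's code: the class name `CompressedSizeBenchmark`, the `StreamWrapper` interface, and the use of `writeObject` for each integer are assumptions made for illustration, and `GZIPOutputStream` stands in for `SnappyOutputStream` so that the sketch runs on a plain JDK.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.zip.GZIPOutputStream;

class CompressedSizeBenchmark {

    /** Hypothetical helper: wraps a raw stream in a compressing stream. */
    interface StreamWrapper {
        OutputStream wrap(OutputStream out) throws IOException;
    }

    /**
     * Serializes the integers 0..count-1 with Java serialization through the
     * wrapped stream and returns the number of bytes that reached the sink.
     */
    static int compressedSize(StreamWrapper wrapper, int count) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        // Closing the ObjectOutputStream also closes (and finishes) the wrapper.
        try (ObjectOutputStream oos = new ObjectOutputStream(wrapper.wrap(sink))) {
            for (int i = 0; i < count; i++) {
                oos.writeObject(Integer.valueOf(i));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.size();
    }

    public static void main(String[] args) {
        // GZIPOutputStream is a JDK stand-in so the sketch needs no extra jars.
        // With snappy-java on the classpath, SnappyOutputStream::new could be
        // passed instead to compare library versions.
        int raw = compressedSize(out -> out, 1000);
        int gzip = compressedSize(GZIPOutputStream::new, 1000);
        System.out.println("uncompressed " + raw);
        System.out.println("gzip         " + gzip);
    }
}
```

Running the same `compressedSize` call against different snappy-java jars is what produces a per-version comparison; the absolute numbers from this stand-in will not match the ones reported in this issue.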

You can find the full source of my benchmark at https://gist.github.com/JoshRosen/f2b568662c3c6011df08. I've included a script that runs my benchmark against all recently-published snappy-java versions. Here are the results:

1.1.1.6    489
1.1.1.5    489
1.1.1.4
1.1.1.3    489
1.1.1.2    489
1.1.1.1    386
1.1.1    386
1.1.1-M4    386
1.1.1-M3    386
1.1.1-M2    386
1.1.1-M1    386
1.1.0.1    386
1.1.0    386
1.1.0-M4    386
1.1.0-M3    386
1.1.0-M2    386
1.1.0-M1    386
1.0.x
1.0.5.4    386
1.0.5.3    386
1.0.5.2    386
1.0.5.1    386
1.0.5    386
1.0.5-M4    386
1.0.5-M3    386
1.0.5-M2    386
1.0.5-M1    386

Based on this, it looks like the compressed output grew between 1.1.1.1 and 1.1.1.2 (from 386 to 489 in this benchmark, an increase of about 27%). When I compare the commits between these versions (1.1.1...1.1.1.2), it looks like the only change was #82.

This result might be workload-dependent, so it may be worth investigating this with other benchmarks. I discovered this issue while investigating https://issues.apache.org/jira/browse/SPARK-5081, a Spark bug in which the size of shuffle data increased across Spark versions.

xerial · 2015-04-09T02:43:07Z

Nice catch!

I'll look at the change that caused the output size increase.

JoshRosen · 2015-04-13T21:26:25Z

I've opened #101 to add a failing regression test for this issue. If I have time, I may attempt a fix. I think the problem is that rawWrite() seems to call compressInput() even in cases where the buffer has not been filled; see https://github.com/xerial/snappy-java/pull/83/files#diff-d42f4d3946d5e119272e1fc3a9fef168R222
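To illustrate why compressing before the buffer is full would inflate the output, here is a small self-contained sketch. It is not snappy-java's code: `BlockFlushSketch`, its method names, the 4-byte per-block header, and the 32 KB block size are all assumptions, and `java.util.zip.Deflater` stands in for Snappy. The only point is that many tiny, independently compressed blocks cost more than a few full ones.

```java
import java.util.zip.Deflater;

class BlockFlushSketch {

    /** Compresses one block on its own; returns its size plus an assumed 4-byte header. */
    static int compressedBlockSize(byte[] buf, int len) {
        Deflater deflater = new Deflater();
        deflater.setInput(buf, 0, len);
        deflater.finish();
        byte[] scratch = new byte[1024];
        int total = 4; // assumed per-block length header
        while (!deflater.finished()) {
            total += deflater.deflate(scratch);
        }
        deflater.end();
        return total;
    }

    /**
     * Pushes `writes` writes of `writeSize` bytes through a block buffer.
     * With flushOnEveryWrite, each write becomes its own compressed block
     * (the suspected behaviour); otherwise a block is emitted only when the
     * buffer is full, plus once at the end for any remainder.
     */
    static int totalOutputSize(int writes, int writeSize, int blockSize,
                               boolean flushOnEveryWrite) {
        byte[] block = new byte[blockSize];
        int used = 0;
        int output = 0;
        for (int w = 0; w < writes; w++) {
            for (int i = 0; i < writeSize; i++) {
                if (used == blockSize) {
                    output += compressedBlockSize(block, used);
                    used = 0;
                }
                block[used++] = (byte) ((w * writeSize + i) % 16); // repetitive data
            }
            if (flushOnEveryWrite && used > 0) {
                output += compressedBlockSize(block, used);
                used = 0;
            }
        }
        if (used > 0) {
            output += compressedBlockSize(block, used);
        }
        return output;
    }

    public static void main(String[] args) {
        int eager = totalOutputSize(1000, 10, 32 * 1024, true);
        int buffered = totalOutputSize(1000, 10, 32 * 1024, false);
        System.out.println("compress on every write: " + eager);
        System.out.println("compress when full:      " + buffered);
    }
}
```

In the eager case every block pays its own framing overhead and the compressor never sees repetition across writes, so the total is larger than in the buffered case.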

xerial · 2015-04-14T07:06:53Z

Fixed in #102 and released as snappy-java-1.1.1.7. Thanks for the test code. It made it easy to fix SnappyOutputStream.

JoshRosen · 2015-04-14T17:29:47Z

I'll try bumping Spark's snappy-java version to 1.1.1.7 to see if this fixes our shuffle size issue. Thanks for the fix!

We should upgrade our snappy-java dependency to 1.1.1.7 in order to include a fix for a bug that results in worse compression in SnappyOutputStream (see xerial/snappy-java#100).

Author: Josh Rosen <joshrosen@databricks.com>

Closes #5512 from JoshRosen/snappy-1.1.1.7 and squashes the following commits:

f1ac0f8 [Josh Rosen] Upgrade to snappy-java 1.1.1.7.

We should upgrade our snappy-java dependency to 1.1.1.7 in order to include a fix for a bug that results in worse compression in SnappyOutputStream (see xerial/snappy-java#100).

Author: Josh Rosen <joshrosen@databricks.com>

Closes #5512 from JoshRosen/snappy-1.1.1.7 and squashes the following commits:

f1ac0f8 [Josh Rosen] Upgrade to snappy-java 1.1.1.7.

(cherry picked from commit 6adb8bc)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>

Conflicts: pom.xml

JoshRosen added a commit to JoshRosen/snappy-java that referenced this issue Apr 13, 2015

Add a failing regression test for issue xerial#100.

96a5909

JoshRosen added a commit to JoshRosen/snappy-java that referenced this issue Apr 13, 2015

Add failing regression test for xerial#100.

42e3ccc

JoshRosen added a commit to JoshRosen/snappy-java that referenced this issue Apr 13, 2015

Add failing regression test for xerial#100.

a6eb0a6

JoshRosen mentioned this issue Apr 13, 2015

Add failing regression test for #100 #101

Merged

xerial added a commit that referenced this issue Apr 14, 2015

Fixes for #100

6d9925b

xerial mentioned this issue Apr 14, 2015

Stabilize compressed data size of SnappyOutputStream #102

Merged

xerial closed this as completed Apr 14, 2015

JoshRosen mentioned this issue Apr 14, 2015

[SPARK-6905] Upgrade to snappy-java 1.1.1.7 apache/spark#5512

Closed

ntolia mentioned this issue May 13, 2015

Change in size of compressed arrays between versions #106

Closed

JoshRosen added a commit to JoshRosen/snappy-java that referenced this issue May 14, 2015

Add failing regression test for xerial#100.

e349e5e

JoshRosen added a commit to JoshRosen/snappy-java that referenced this issue May 14, 2015

Add failing regression test for xerial#100.

1285253

JoshRosen added a commit to JoshRosen/snappy-java that referenced this issue May 14, 2015

Add failing regression test for xerial#100.

11f08de
