Increase in SnappyOutputStream output size after #82 (#100)

Closed
JoshRosen opened this issue Apr 9, 2015 · 4 comments

Comments

@JoshRosen
Contributor

It appears that the size of the compressed output generated by SnappyOutputStream increased between versions 1.1.1.1 and 1.1.1.2. To see this, I ran a microbenchmark which serializes 1000 integers using Java serialization, compresses the result using a SnappyOutputStream, and reports the serialized size.
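As a rough illustration of this kind of measurement, here is a minimal sketch. It is not the benchmark from the gist: the class name `CompressedSizeBenchmark`, the helper `compressedSize`, and the choice to write each integer with `writeObject` are assumptions made for illustration. The compressing stream is passed in as a function so the sketch runs with only the JDK, using `DeflaterOutputStream` as a stand-in.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.function.Function;
import java.util.zip.DeflaterOutputStream;

// Hypothetical helper, not the code from the gist.
final class CompressedSizeBenchmark {

    /**
     * Serializes `count` boxed integers with Java serialization, routing the
     * bytes through the stream produced by `wrap`, and returns how many bytes
     * reached the underlying sink.
     */
    static int compressedSize(int count, Function<OutputStream, OutputStream> wrap) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        // Closing the ObjectOutputStream also closes (and therefore flushes)
        // the wrapping stream, so the sink holds the complete output.
        try (ObjectOutputStream out = new ObjectOutputStream(wrap.apply(sink))) {
            for (int i = 0; i < count; i++) {
                out.writeObject(i); // assumption: one boxed Integer per write
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.size();
    }

    public static void main(String[] args) {
        // Identity wrapper: size of the serialized data without compression.
        System.out.println("uncompressed: " + compressedSize(1000, s -> s));
        // JDK stand-in for a compressing stream.
        System.out.println("deflate:      " + compressedSize(1000, DeflaterOutputStream::new));
    }
}
```

With snappy-java on the classpath, passing `org.xerial.snappy.SnappyOutputStream::new` as the wrapper should give a number comparable across library versions, though it will not necessarily match the figures reported here.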

You can find the full source of my benchmark at https://gist.github.com/JoshRosen/f2b568662c3c6011df08. I've included a script that runs my benchmark against all recently-published snappy-java versions. Here are the results:

Version     Compressed size
1.1.1.6     489
1.1.1.5     489
1.1.1.4
1.1.1.3     489
1.1.1.2     489
1.1.1.1     386
1.1.1       386
1.1.1-M4    386
1.1.1-M3    386
1.1.1-M2    386
1.1.1-M1    386
1.1.0.1     386
1.1.0       386
1.1.0-M4    386
1.1.0-M3    386
1.1.0-M2    386
1.1.0-M1    386
1.0.x
1.0.5.4     386
1.0.5.3     386
1.0.5.2     386
1.0.5.1     386
1.0.5       386
1.0.5-M4    386
1.0.5-M3    386
1.0.5-M2    386
1.0.5-M1    386

Based on this, it looks like the compression size got worse between 1.1.1.1 and 1.1.1.2. When I compare the commits between these versions (1.1.1...1.1.1.2), it looks like the only change was #82.

This result might be workload-dependent, so it may be worth investigating this with other benchmarks. I discovered this issue while investigating https://issues.apache.org/jira/browse/SPARK-5081, a Spark bug in which the size of shuffle data increased across Spark versions.

@xerial
Owner

xerial commented Apr 9, 2015

Nice catch!

I'll look into the change that caused the output size to increase.

JoshRosen added a commit to JoshRosen/snappy-java that referenced this issue Apr 13, 2015
@JoshRosen
Contributor Author

I've opened #101 to add a failing regression test for this issue. If I have time, I may attempt a fix. I think the problem is that rawWrite() calls compressInput() even when the buffer has not been filled; see https://github.com/xerial/snappy-java/pull/83/files#diff-d42f4d3946d5e119272e1fc3a9fef168R222
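To see why compressing before the buffer is full can inflate the output, consider a simplified model. It is not snappy-java's actual code: `BlockFlushModel`, `framedSize`, and the 4-byte per-block header are hypothetical, and the model counts only framing overhead without doing any real compression. It compares emitting a block after every write with emitting a block only when the buffer fills or the stream closes.

```java
// Simplified, hypothetical model of block framing; not snappy-java internals.
final class BlockFlushModel {

    // Assumed per-block overhead (for example, a length prefix).
    static final int HEADER_BYTES = 4;

    /**
     * Returns the total framed size when `writes` chunks of `chunkSize` bytes
     * pass through a buffer of `blockSize` bytes. If `flushOnEveryWrite` is
     * true, a partially filled buffer is emitted as its own block after each
     * write; otherwise blocks are emitted only when full or at close.
     */
    static int framedSize(int writes, int chunkSize, int blockSize, boolean flushOnEveryWrite) {
        int buffered = 0; // bytes currently waiting in the buffer
        int emitted = 0;  // bytes written downstream, headers included
        for (int w = 0; w < writes; w++) {
            int remaining = chunkSize;
            while (remaining > 0) {
                int n = Math.min(remaining, blockSize - buffered);
                buffered += n;
                remaining -= n;
                if (buffered == blockSize) { // buffer full: emit one block
                    emitted += HEADER_BYTES + buffered;
                    buffered = 0;
                }
            }
            if (flushOnEveryWrite && buffered > 0) { // premature flush
                emitted += HEADER_BYTES + buffered;
                buffered = 0;
            }
        }
        if (buffered > 0) { // final flush on close
            emitted += HEADER_BYTES + buffered;
        }
        return emitted;
    }

    public static void main(String[] args) {
        // 1000 writes of 10 bytes each through a 32 KiB buffer.
        System.out.println("flush per write: " + framedSize(1000, 10, 32 * 1024, true));
        System.out.println("flush when full: " + framedSize(1000, 10, 32 * 1024, false));
    }
}
```

In this model the per-write variant pays the header 1000 times instead of once. A real compressor would likely lose further ground on tiny blocks, because each block gives it less repeated data to exploit.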

@xerial
Owner

xerial commented Apr 14, 2015

Fixed in #102 and released as snappy-java-1.1.1.7. Thanks for the test code; it made SnappyOutputStream easy to fix.

@xerial xerial closed this as completed Apr 14, 2015
@JoshRosen
Contributor Author

I'll try bumping Spark's snappy-java version to 1.1.1.7 to see if this fixes our shuffle size issue. Thanks for the fix!

asfgit pushed a commit to apache/spark that referenced this issue Apr 14, 2015
We should upgrade our snappy-java dependency to 1.1.1.7 in order to include a fix for a bug that results in worse compression in SnappyOutputStream (see xerial/snappy-java#100).

Author: Josh Rosen <joshrosen@databricks.com>

Closes #5512 from JoshRosen/snappy-1.1.1.7 and squashes the following commits:

f1ac0f8 [Josh Rosen] Upgrade to snappy-java 1.1.1.7.
asfgit pushed a commit to apache/spark that referenced this issue Apr 14, 2015
We should upgrade our snappy-java dependency to 1.1.1.7 in order to include a fix for a bug that results in worse compression in SnappyOutputStream (see xerial/snappy-java#100).

Author: Josh Rosen <joshrosen@databricks.com>

Closes #5512 from JoshRosen/snappy-1.1.1.7 and squashes the following commits:

f1ac0f8 [Josh Rosen] Upgrade to snappy-java 1.1.1.7.

(cherry picked from commit 6adb8bc)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>

Conflicts:
	pom.xml
mingyukim pushed a commit to palantir/spark that referenced this issue Apr 17, 2015
We should upgrade our snappy-java dependency to 1.1.1.7 in order to include a fix for a bug that results in worse compression in SnappyOutputStream (see xerial/snappy-java#100).

Author: Josh Rosen <joshrosen@databricks.com>

Closes apache#5512 from JoshRosen/snappy-1.1.1.7 and squashes the following commits:

f1ac0f8 [Josh Rosen] Upgrade to snappy-java 1.1.1.7.

(cherry picked from commit 6adb8bc)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>

Conflicts:
	pom.xml
JoshRosen added a commit to JoshRosen/snappy-java that referenced this issue May 14, 2015
mccheah pushed a commit to palantir/spark that referenced this issue May 15, 2015
We should upgrade our snappy-java dependency to 1.1.1.7 in order to include a fix for a bug that results in worse compression in SnappyOutputStream (see xerial/snappy-java#100).

Author: Josh Rosen <joshrosen@databricks.com>

Closes apache#5512 from JoshRosen/snappy-1.1.1.7 and squashes the following commits:

f1ac0f8 [Josh Rosen] Upgrade to snappy-java 1.1.1.7.

(cherry picked from commit 6adb8bc)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>

Conflicts:
	pom.xml