Increase in SnappyOutputStream output size after #82 #100
Comments
Nice catch! I'll look at the change that caused the output size increase.
I've opened #101 to add a failing regression test for this issue. If I have time, I may attempt a fix. I think the problem is that
Fixed in #102 and released as snappy-java 1.1.1.7. Thanks for the test code. It made it easy to fix SnappyOutputStream.
I'll try bumping Spark's snappy-java version to 1.1.1.7 to see if this fixes our shuffle size issue. Thanks for the fix!
We should upgrade our snappy-java dependency to 1.1.1.7 in order to include a fix for a bug that results in worse compression in SnappyOutputStream (see xerial/snappy-java#100). Author: Josh Rosen <joshrosen@databricks.com> Closes #5512 from JoshRosen/snappy-1.1.1.7 and squashes the following commits: f1ac0f8 [Josh Rosen] Upgrade to snappy-java 1.1.1.7.
We should upgrade our snappy-java dependency to 1.1.1.7 in order to include a fix for a bug that results in worse compression in SnappyOutputStream (see xerial/snappy-java#100). Author: Josh Rosen <joshrosen@databricks.com> Closes #5512 from JoshRosen/snappy-1.1.1.7 and squashes the following commits: f1ac0f8 [Josh Rosen] Upgrade to snappy-java 1.1.1.7. (cherry picked from commit 6adb8bc) Signed-off-by: Josh Rosen <joshrosen@databricks.com> Conflicts: pom.xml
It appears that the size of the compressed output generated by SnappyOutputStream increased between versions 1.1.1.1 and 1.1.1.2. To see this, I ran a microbenchmark which serializes 1000 integers using Java serialization, compresses the result using a SnappyOutputStream, and reports the serialized size.
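For readers who want the shape of such a measurement without opening the gist, here is a minimal sketch. It is not the benchmark from the gist: the class name `CompressedSizeBenchmark`, the `CompressorFactory` interface, and the choice to write each integer with `writeObject` are assumptions made for illustration. To keep it runnable with only the JDK, it uses `java.util.zip.DeflaterOutputStream` as a stand-in compressor; with snappy-java on the classpath you would pass `SnappyOutputStream::new` instead and run it once per library version.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.util.zip.DeflaterOutputStream;

// Hypothetical sketch, not the code from the gist.
final class CompressedSizeBenchmark {

    /** Wraps a raw sink in a compressing stream (helper type invented for this sketch). */
    interface CompressorFactory {
        OutputStream wrap(OutputStream sink) throws IOException;
    }

    /**
     * Java-serializes the integers 0..count-1 through the stream produced by
     * the factory and returns how many bytes reached the underlying sink.
     */
    static int compressedSize(int count, CompressorFactory factory) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        // Closing the ObjectOutputStream also closes (and so flushes) the compressor.
        try (ObjectOutputStream out = new ObjectOutputStream(factory.wrap(sink))) {
            for (int i = 0; i < count; i++) {
                out.writeObject(i); // autoboxed to java.lang.Integer
            }
        }
        return sink.size();
    }

    public static void main(String[] args) throws IOException {
        // Baseline: no compression at all.
        int raw = compressedSize(1000, sink -> sink);
        // Stand-in compressor from the JDK. Assumption: replace with
        // SnappyOutputStream::new when snappy-java is available.
        int compressed = compressedSize(1000, DeflaterOutputStream::new);
        System.out.println("uncompressed: " + raw + " bytes");
        System.out.println("compressed (stand-in): " + compressed + " bytes");
    }
}
```

Because the size is read from the sink only after the stream is closed, any data still buffered inside the compressor is included in the reported number.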
You can find the full source of my benchmark at https://gist.github.com/JoshRosen/f2b568662c3c6011df08. I've included a script that runs my benchmark against all recently-published snappy-java versions. Here are the results:
Based on this, it looks like the compressed output size got worse between 1.1.1.1 and 1.1.1.2. When I compare the commits between these versions (1.1.1...1.1.1.2), it looks like the only change was #82.
This result might be workload-dependent, so it may be worth investigating this with other benchmarks. I discovered this issue while investigating https://issues.apache.org/jira/browse/SPARK-5081, a Spark bug in which the size of shuffle data increased across Spark versions.