
Update Kryo to 3.0.1 #230

Closed
wants to merge 8 commits into from

Conversation

figpope
Contributor

@figpope figpope commented Jun 7, 2015

This updates Kryo to the 3.0.1 shaded jar and resolves the deprecations this upgrade causes. Specifically:

  • The jar's organization (group ID) was changed from org.esotericsoftware.kryo to org.esotericsoftware
  • output.total() / input.total() now return long values instead of int
  • ObjectInstantiator is now a generic, in order to preserve type information. To retain the original library behavior, ObjectInstantiator[_] and ObjectInstantiator[AnyRef] were used
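To make the third point concrete, here is a minimal self-contained sketch of the migration pattern. The interface below mirrors the Objenesis `ObjectInstantiator` signature (which became generic in the version bundled with Kryo 3.x), but it is redefined locally so the example compiles without Kryo on the classpath; `forClass` is an illustrative helper, not a library method.

```java
public class InstantiatorSketch {
    // Mirrors org.objenesis.instantiator.ObjectInstantiator, now generic in T.
    interface ObjectInstantiator<T> {
        T newInstance();
    }

    // Before the upgrade, call sites used the raw type; afterwards a binding
    // such as ObjectInstantiator<Object> (ObjectInstantiator[AnyRef] in Scala)
    // preserves the original "untyped" behavior.
    static ObjectInstantiator<Object> forClass(final Class<?> clazz) {
        return new ObjectInstantiator<Object>() {
            public Object newInstance() {
                try {
                    return clazz.getDeclaredConstructor().newInstance();
                } catch (ReflectiveOperationException e) {
                    throw new RuntimeException(e);
                }
            }
        };
    }

    public static void main(String[] args) {
        Object s = forClass(StringBuilder.class).newInstance();
        System.out.println(s.getClass().getSimpleName()); // prints "StringBuilder"
    }
}
```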

All tests pass as far as I can tell. (Is there any way to test other than running sbt +test?) Assuming the tests are reasonably exhaustive, this should also allow Java 8 compatibility, which can be tested with the following patch:

--- a/project/Build.scala
+++ b/project/Build.scala
@@ -28,8 +28,12 @@ object ChillBuild extends Build {
     ScalariformKeys.preferences := formattingPreferences,

     // Twitter Hadoop needs this, sorry 1.7 fans
-    javacOptions ++= Seq("-target", "1.6", "-source", "1.6", "-Xlint:-options"),
-    javacOptions in doc := Seq("-source", "1.6"),
+    javacOptions ++= (
+      if (scalaVersion.value.startsWith("2.11")) Seq("-target", "1.8", "-source", "1.8", "-Xlint:-options")
+      else Seq("-target", "1.7", "-source", "1.7", "-Xlint:-options")),
+    javacOptions in doc := (
+      if (scalaVersion.value.startsWith("2.11")) Seq("-source", "1.8")
+      else Seq("-source", "1.7")),

     resolvers ++= Seq(
       Opts.resolver.sonatypeSnapshots,

This excludes the JavaDoc updates required to move to Java 8.

@figpope figpope mentioned this pull request Jun 7, 2015
@ianoc
Collaborator

ianoc commented Jun 8, 2015

Given this is for Java 8, can you configure Travis in this PR so it's tested against Java 8? -- http://docs.travis-ci.com/user/languages/java/

@@ -47,7 +47,7 @@ public void serialize(Object o) throws IOException {
     try {
       st.writeObject(o);
       // Copy from buffer to output stream.
-      outputStream.writeInt(st.numOfWrittenBytes());
+      outputStream.writeLong(st.numOfWrittenBytes());
Collaborator


this is going to write 4 extra bytes for each record, right? Even 4 bytes is more than we need, since we don't expect 4 GB records.

Why not cast this to an int here and save the space? (Consider serializing Unit with this, which will now take twice as much data as it did before.)
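The overhead is easy to quantify with plain java.io (no Kryo involved): each record's payload is prefixed by its length, and a long prefix costs 4 bytes more per record than an int one. With zero-byte payloads (the Unit case above), the framing literally doubles. `framedSize` is an illustrative helper, not code from this PR.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class LengthPrefixCost {
    // Frame `records` empty payloads, each prefixed by its length,
    // and return the total number of bytes written.
    static int framedSize(int records, boolean useLong) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        for (int i = 0; i < records; i++) {
            if (useLong) out.writeLong(0L); else out.writeInt(0);
            // payload bytes would follow here; zero-length for Unit
        }
        return buf.size();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(framedSize(1000, false)); // prints 4000
        System.out.println(framedSize(1000, true));  // prints 8000: 4 extra bytes per record
    }
}
```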

Collaborator


If this is a legit use case, we should probably move this to a varint encoding; that would probably be a win over our usual fixed-width int in some cases.
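A sketch of the suggested varint framing, using the standard unsigned LEB128 scheme (7 payload bits per byte, as used by Kryo's own writeVarInt and by protobuf). This is a self-contained illustration, not the PR's Varint class: small lengths cost one byte instead of four.

```java
import java.io.ByteArrayOutputStream;

public class VarintSketch {
    // Unsigned LEB128: 7 payload bits per byte,
    // high bit set on every byte except the last.
    static byte[] writeVarInt(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(writeVarInt(0).length);                 // prints 1
        System.out.println(writeVarInt(300).length);               // prints 2
        System.out.println(writeVarInt(Integer.MAX_VALUE).length); // prints 5 (worst case)
    }
}
```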

@johnynek
Copy link
Collaborator

This makes me a bit nervous unfortunately. Maybe someone should help me through my anxiety. Here is the issue: storm, spark, scalding and heron all put Kryo on the classpath. Upgrading here means there is a real possibility for diamonds and lots of binary pain if you have two of them in scope at the same time. The easiest way (for us) for that to happen is via summingbird since that will have Kryo from chill and from storm/heron.

How do we get out of this mess?

* <p>Signed values are further encoded using so-called zig-zag encoding
* in order to make them "compatible" with variable-length encoding.</p>
*/
final class Varint {
Collaborator


We need tests for this. It's super testable with ScalaCheck.
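The zig-zag mapping mentioned in the doc comment is indeed easy to property-test. A self-contained Java sketch of the standard encode/decode pair (not the PR's code) with a ScalaCheck-style roundtrip loop:

```java
import java.util.Random;

public class ZigZagRoundTrip {
    // Zig-zag maps signed ints onto small unsigned values: 0,-1,1,-2,2 -> 0,1,2,3,4,
    // so small-magnitude negative numbers stay short under varint encoding.
    static int zigZagEncode(int n) { return (n << 1) ^ (n >> 31); }
    static int zigZagDecode(int n) { return (n >>> 1) ^ -(n & 1); }

    public static void main(String[] args) {
        Random rng = new Random(42);
        for (int i = 0; i < 10_000; i++) {       // poor man's ScalaCheck forAll
            int n = rng.nextInt();
            if (zigZagDecode(zigZagEncode(n)) != n)
                throw new AssertionError("roundtrip failed for " + n);
        }
        System.out.println("ok"); // prints "ok"
    }
}
```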

@steveloughran

Kryo is one of the JAR version trouble spots: to deal with 2.21 & 2.22 import/link incompatibilities, Spark 1.5 has to ship a custom version of Hive (downgraded against 2.21). Ideally Hive (unshaded) and Spark would use the same release, but that would only work if chill used the same version (2.21+).

Before a big switch to 3.x, would a smaller upgrade to 2.22 be possible? Maybe even 2.24, with some coordination across Hive and Spark for a consistent version.

@ianoc
Collaborator

ianoc commented Sep 15, 2015

The APIs alas changed enough between those versions that the binary incompatibility problems would drive lots of pain everywhere. Going direct to 3 seems better to me than repeating the process multiple times (given the level of coordination pain this will drive).

I actually asked in June about some coordination on the spark dev list but got no response there. There is also the likes of Storm too. Unfortunately this bleeds everywhere pretty quickly.

@ozawa-hi

I'm trying to get Spark to upgrade their Kryo to 3. It seems there's a recent issue at Hive too. Would it be better if I try to create a Spark patch for 3.0.1 or 3.0.3?
https://issues.apache.org/jira/browse/HIVE-12175

@ozawa-hi

It seems there was a Spark issue that upgrading Kryo would have solved, but the upgrade was stopped because of Chill. It would be great if we could coordinate activities.
https://issues.apache.org/jira/browse/SPARK-7708

@steveloughran

@ozawa-hi —we really need to get hive & spark to be in sync; it's the sole reason there's a mutant hive-exec JAR for spark. If Hive and Chill can agree on a version, and it doesn't cause regressions in Spark, then it would sync things up nicely

@ianoc
Collaborator

ianoc commented Nov 11, 2015

I had asked on the spark dev list a few months ago for a point person on that side to coordinate an upgrade with... but got no response. Scalding/Summingbird need to synchronize on the big upgrade too, but I can drive most of that. Do we have a point person that can vet and drive any spark upgrade volunteer?

@steveloughran

...I'm not a person who can help with Spark updates, but could work to keep the Hive code in sync (they've generally been on the cutting edge of Kryo issues)

@ozawa-hi

I'm not a Spark committer, but I'm strongly willing to put in all the effort to get this upgrade through, because I desperately need it.
The funny thing is, there's an issue at Spark saying they can't upgrade Kryo because it's an uphill battle to coordinate the upgrade with Chill. :-)
https://issues.apache.org/jira/browse/SPARK-7708

@ianoc
Collaborator

ianoc commented Nov 11, 2015

I dunno, I've not said no to any of the main Spark committers that I'm aware of. We've generally moved slowly due to the wide web of dependencies around this, especially as the version of Kryo here now is right before some major binary version breakage (the versions after it, a long time ago now, had some bugs, so we had to go back). But if someone wants to take over this PR (it's got conflicts now) and bring it up to the latest released version of Kryo, I think we can get the ball rolling and try to get an updated release of chill out soon.

@steveloughran

One q: how much coordinated update is needed? If a version of chill ships with Kryo 3, Spark doesn't have to update immediately until they want to. Same for Hive. One thing that will be trouble is if the Kryo APIs change significantly and Hive moves to Kryo 3.x, as then the ugly hacks needed to get Spark and Hive consistent (rebuilt Hive with a different Kryo version, a re-group-ID'd JAR without shading) won't work: upgrading is the only option.

That would be an issue independent of Chill; a new version of chill would simply make it possible to get Hive and Spark in sync.

@steveloughran

FWIW, I have a branch of chill using Kryo 2.22; this is the version Hive updated to for bug fixes, and they didn't find any problems worth rolling back for.

https://github.com/steveloughran/chill/tree/feature/support-kryo-2.22

@ianoc
Collaborator

ianoc commented Nov 11, 2015

So if the versions are not binary compatible, then code cannot depend on the new chill. Or, say, versions of Algebird serializers can become a problem. And for us, we have a monorepo, so taking one version of Kryo via our chill dep, another via our Spark deps, and another via our Scalding ones is a mess, alas.

@ozawa-hi

So, it may be better to upgrade Kryo in Chill first, then Hive, then Spark.

@steveloughran

Hive will update on its own schedule, with its own needs; it looks like Hive 2 already has the update in.

Hive 1.2.1 is already ahead of Chill & Spark: for Spark 1.5.1 we had to patch it to downgrade (POM file and a couple of source files with changed import statements), and to not shade Kryo.

It's Spark that needs to pick up a changed Kryo via chill; then the resyncing process can begin again.

I've just pushed up patch #244, which patches chill for Kryo 2.22 instead, making it consistent with Hive's current release. This is a lower-risk update, which means it's more likely to be picked up by Spark in the 1.x line after a chill update; it's clear that a Kryo 3.x update won't happen until Spark 2.

(as a colleague who works on Hive once said to me, "Kryo is the guava of serialization")

@ianoc
Collaborator

ianoc commented Nov 13, 2015

Upgrading just to support Hive isn't a big win for most people and is a huge cost in effort. We should just be trying to get onto the latest stable Kryo. We've tried a Kryo 2.22 update; it was lots and lots of pain for little benefit. There was a chill release with it, long since reverted.

@steveloughran

I didn't know about that update; I'll cancel the pull req. If even a minor release hurts that much, you may as well go to the latest version you are happy with, skipping the intermediates.

@ozawa-hi

We've tried Kryo 2.22 but it really didn't solve the issues. We're currently using Kryo 3.0 in our projects and it seems to be usable. Java is the default serde for Spark and I don't think too many people are switching to Kryo as it is; we've found 2.22 to be insufficient in resolving the issues and not worth the pain to maintain.

Kryo 3.x won't happen before Spark 2.x, but as you've already mentioned, these upgrades are a lot of work and it's probably better to start early. I'm not suggesting an upgrade just because there's a newer version out there: I really need to use Kryo because we need performance, and Kryo 2.x just isn't usable.

@ianoc ianoc mentioned this pull request Nov 16, 2015
@figpope figpope closed this Dec 1, 2015