
Use reflection over Jobs to find serialized classes #1654

Merged
merged 8 commits into twitter:develop from jcdavis:job-type-reflection on Apr 7, 2017

Conversation


@jcdavis jcdavis commented Mar 15, 2017

Using scala reflection, we can look at the container types of TypedPipe/Grouped
etc to identify classes being serialized and automatically assign them
compact cascading tokens instead of writing full names.

A few followups probably needed on this first pass:

  1. Is the correct copyright stanza at the top just to copy and paste, changing the year?
  2. Does my use of a boolean flag to disable make sense? This should definitely be disableable in the event of issues, but checking for the flag's existence rather than some boolean value seems a little hacky.
  3. (Most importantly) What is the proper way of determining whether a field's class in a Job is a typed scalding container? The current solution of listing a bunch of base types mostly works, but it is obviously a little hacky and possibly incomplete. Is it worth having a marker trait (e.g. TypedContainer for TypedPipe, etc.)? That doesn't truly solve the coverage issue, though.
  4. Is it worth recursively examining more than just Tuple2? There isn't much cost in adding additional tokens to the cascading tokens config, even if they aren't used.
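For concreteness, the field-reflection idea can be sketched as a toy, stand-alone example (`Container`, `Record`, and `ExampleJob` are stand-ins invented here so the sketch compiles without scalding; the real code inspects TypedPipe, Grouped, etc.):

```scala
import scala.reflect.runtime.universe._

// Stand-ins so the sketch compiles without scalding on the classpath
class Container[T]
case class Record(x: Int, s: String)
class ExampleJob {
  val pipe: Container[Record] = new Container[Record]
}

object FieldTypeFinder {
  private val mirror = runtimeMirror(getClass.getClassLoader)

  // Collect the full names of the type arguments of any val whose outer
  // type is in the given set of container type names
  def containedTypeNames(clazz: Class[_], containerNames: Set[String]): Set[String] = {
    val tpe = mirror.classSymbol(clazz).toType
    tpe.members.collect {
      case m if m.isTerm && m.asTerm.isVal => m.typeSignature
    }.collect {
      case TypeRef(_, sym, args) if containerNames(sym.fullName) =>
        args.map(_.typeSymbol.fullName)
    }.flatten.toSet
  }
}
```

Calling `containedTypeNames(classOf[ExampleJob], Set(typeOf[Container[_]].typeSymbol.fullName))` should surface `Record`'s full name, which is the kind of class the PR then hands to the cascading token machinery.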

Using scala reflection, we can look at the types of TypedPipe/Grouped
etc to identify classes being serialized and automatically assign them
compact cascading tokens instead of writing full names.
* Note: this is not guaranteed to find every used type. E.g., it can't find types used in a step that isn't
* referred to in a field.
*/
def findUsedClasses(jobClazz: Class[_ <: Job]): Set[Class[_]] = {
Contributor Author:

either this method or the object should be private[scalding]

Reviewer:

why do we need jobClass: Class[_ <: Job] rather than outerClass: Class[_]? Why does Job matter?

For instance, I'd like to use this with Execution as well, and we might want to pass an ExecutionApp to get the same result (if you can find items).

Reviewer:

Also, could we possibly also walk the methods and look at the input and return types? This could cover all the cases perhaps?

Contributor Author:

It's not really job-related specifically; that's just what I started with to self-limit usage. I'll change it.

I passed on methods for this PR to simplify things, but could probably add them. For methods, we might care about ones that return A in addition to TypedPipe[A] etc., so you might need to be a little smarter about what you consider/ignore.

}
}

private def getClassOpt(name: String): Option[Class[_]] = {
Contributor Author:

Could avoid loading the classes if I refactored CascadingTokenUpdater.updater to take just the names instead of the classes, since it only calls .getName
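A plausible body for the helper above (a sketch; the actual PR code may differ) is just a guarded Class.forName:

```scala
import scala.util.Try

object ClassLookup {
  // Resolve a class name to a Class[_] if it is on the classpath, None otherwise
  def getClassOpt(name: String): Option[Class[_]] =
    Try(Class.forName(name)).toOption
}

// ClassLookup.getClassOpt("java.lang.String").isDefined == true
// ClassLookup.getClassOpt("no.such.Clazz").isEmpty == true
```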

Jackson Davis added 2 commits March 15, 2017 18:09
PSA: stringToTermName is deprecated in 2.11,
so we might need to revert some of this for 2.12 support
def setSerialization(
kryo: Either[(Class[_ <: KryoInstantiator], KryoInstantiator), Class[_ <: KryoInstantiator]],
userHadoop: Seq[Class[_ <: HSerialization[_]]] = Nil,
additionalSerializedClasses: Set[Class[_]] = Set.empty): Config = {

Reviewer:

can we remove the additionalSerializedClasses here? I think callers should just do:

.setSerialization( ... )
.addCascadingClassSerializationTokens(additionalSerializedClasses)

But another concern is that sadly both cascading and kryo need to give tokens to classes. Can we make a kryo registrar:
https://github.com/twitter/chill/blob/develop/chill-java/src/main/java/com/twitter/chill/IKryoRegistrar.java

that registers any classes not already added? Then it can be combined with the KryoInstantiator:
https://github.com/twitter/chill/blob/develop/chill-java/src/main/java/com/twitter/chill/KryoInstantiator.java#L98

Contributor Author:

I thought about doing Kryo as well, but I'm not super familiar with things there, so I've punted for this PR.

My main confusion/concern is: how does the KryoRegistrar know the job class to inspect (Hadoop config? I don't see the full scalding job name anywhere), or does it parse the cascading token config to find classes? And does this get added to the chill KryoHadoop, or do we add a new instantiator for this? (And then what happens if someone extends KryoHadoop)

@@ -185,9 +185,13 @@ class Job(val args: Args) extends FieldConversions with java.io.Serializable {

val init = base ++ modeConf

val usedClasses: Set[Class[_]] = if (args.boolean("scalding.nojobclassreflection")) Set.empty else {

Reviewer:

can we make this a private def?


private val baseContainers = List(
classOf[Execution[_]],
classOf[TypedPipe[_]],
classOf[TypedSink[_]],

Reviewer:

do we not want TypedSource[_]?

Contributor Author:

I'll add

case universe.TypeRef(_, _, args) =>
args.flatMap { generic =>
//If the wrapped type is a Tuple2, recurse into its types
if (generic.typeSymbol.fullName == "scala.Tuple2") {

Reviewer:

why not .fullName.startsWith("scala.Tuple") to support all tuples?

Contributor Author:

No reason beyond reducing the scope, I'll add
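The generalized recursion being agreed on here could look roughly like the following stand-alone sketch, using the same startsWith check the reviewer suggests (`flattenTuples` is an illustrative name, not the PR's):

```scala
import scala.reflect.runtime.universe._

// Recursively expand tuple types into their components, so e.g.
// (A, (B, C)) contributes A, B and C rather than the Tuple2 itself
def flattenTuples(tpe: Type): List[Type] = tpe match {
  case TypeRef(_, sym, args) if sym.fullName.startsWith("scala.Tuple") =>
    args.flatMap(flattenTuples)
  case other => List(other)
}
```

For instance, `flattenTuples(typeOf[(Int, (String, Long))])` yields the Int, String and Long types.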



import scala.reflect.runtime.universe

object JobClassFinder {

Reviewer:

I don't think Job has anything really to do with this. Can we just say ReferencedClassFinder or something? Also, we might add a method like:

object ReferencedClassFinder {
  /**
   * Add the given type, as well as all referenced types to the cascading tokens list.
   * note, for maximal efficiency, you should also register those types with the kryo
   * instantiator being used.
   */
  def addCascadingTokensFrom(c: Class[_], config: Config): Config
}

@jcdavis (Contributor Author) commented Mar 17, 2017

A small wrinkle here we have to think about: if we have a TypedPipe of a type that has a cascading protected token (so <128, e.g. BytesWritable), we will assign a token in the config, which will cause an IllegalStateException when initializing the map in TupleSerialization.

There are a few options here, of varying degrees of complication/hackiness:

  1. Explicitly blacklist classes known to have private cascading tokens. AFAICT, those are only BigDecimal (125), Array[Byte] (126), and BytesWritable (127).
  2. Construct a TupleSerialization.SerializationElementReader in CascadingTokenUpdater.update, and then only tokenize classes in the config which don't already have a token, by checking getTokenFor.
  3. Replicate the logic in TupleSerialization.initTokenMaps, which looks over all the Hadoop serializations for the SerializationToken annotation to determine which classes are tokenized. This has the benefit that it could be done in Config.getCascadingSerializationTokens, which is a little cleaner.

Both 2 and 3 require the Hadoop serializations config to have been set, which seems like a reasonable requirement based on how the Scalding config is init'ed. 2 and 3 could also protect against either class or token conflicts with other user-defined serializations that have the SerializationToken annotation set, although the token conflict issue is already a concern regardless of this PR (and given the brittleness of explicitly assigned tokens, I'm not sure it's a thing that actually gets used?)
Thoughts?

@johnynek (Collaborator):

2 sounds pretty good to me.

@jcdavis (Contributor Author) commented Mar 23, 2017

TupleSerialization.getTokenFor isn't public, so that strategy didn't work.

Instead, I've replicated the logic of looking over the serializations for the SerializationToken annotation, which avoids re-assigning tokens.
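The resulting token assignment can be sketched stand-alone (`assignTokens` and its parameter names are illustrative; the 128 floor reflects the reserved range, below 128, mentioned earlier in the thread):

```scala
// Assign fresh cascading tokens to candidate class names, skipping names
// already tokenized in the config and names claimed by SerializationToken
// annotations on the configured hadoop serializations
def assignTokens(
  existing: Map[Int, String],       // token -> class name already in the config
  fromSerializations: Set[String],  // names claimed via SerializationToken
  candidates: Set[String]): Map[Int, String] = {
  // Start above both the reserved range (< 128) and any existing token
  val first = math.max(128, existing.keys.foldLeft(127)(math.max) + 1)
  val fresh = (candidates -- fromSerializations -- existing.values).toList.sorted
  existing ++ fresh.zipWithIndex.map { case (name, i) => (first + i) -> name }
}
```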

@johnynek (Collaborator) left a comment:

Looks good to me.

@johnynek (Collaborator):

What do you think @piyushnarang ?

CascadingTokenUpdater.update(config, findReferencedClasses(c) + c)
}
/**
* Reflect over a scalding to try and identify types it uses so they can be tokenized by cascading.
Collaborator:

Reflect over a scalding container type?

@@ -197,6 +198,10 @@ class Job(val args: Args) extends FieldConversions with java.io.Serializable {
.toMap.toMap[AnyRef, AnyRef] // linter:ignore the second one is to lift from String -> AnyRef
}

def reflectedClasses: Set[Class[_]] = if (args.boolean("scalding.nojobclassreflection")) Set.empty else {
Collaborator:

could we add this to Config.scala: https://github.com/twitter/scalding/blob/develop/scalding-core/src/main/scala/com/twitter/scalding/Config.scala#L431
with details on what users get when they turn this on / off?

Contributor Author:

👍

Contributor Author:

Actually, now that I look at it, I'm not sure that makes sense? The Config params are for stuff that goes into the hadoop config, whereas this is just being read from the Args. Other things read from args in the Job (e.g. scalding.nocounters or scalding.flowstats) aren't referenced in the Config.

Agreed I should add documentation somewhere, just not sure Config is the right place.

Collaborator:

How about we add it to the companion Args object? We don't need to refactor the other Args in this review, but it seems nicer than having them scattered in a bunch of places, right?

Contributor Author:

Args companion object seems pretty reasonable, I'll add it there

case class C3(c: Int)
case class C4(d: Int)

trait TraitType {
Collaborator:

do we need to add some tests for:

  1. Java types?
  2. Thrift / protobuf types?
  3. Testing that if we have some primitive types (like Int) referenced, we don't end up adding their tokens? (Or is that the same scenario as the BytesWritable case?)

@jcdavis (Contributor Author) Mar 24, 2017:

1 and 2 should be fine, at least based on my testing.

Primitives seem to work, but setSerialization filters out primitives and array types:

    val kryoClasses = withKryo.getKryoRegisteredClasses
      .filterNot(_.isPrimitive) // Cascading handles primitives and arrays
      .filterNot(_.isArray)

I'll add those filters to findReferencedClasses

Collaborator:

could we add tests for these scenarios?

Contributor Author:

Yea I'll add some tests

Contributor:

also, let's have tests for InnerClasses (Case classes in Object)

// We don't want to assign tokens to classes already in the map
- val newClasses: Iterable[String] = clazzes.map { _.getName } -- toks.values
+ val newClasses: Iterable[String] = clazzes.map { _.getName } -- fromSerializations -- toks.values
Collaborator:

should we be checking if the number of these tokens exceeds what cascading supports? (max int?). Wondering if there are some really complex types that end up referencing a large graph of types.

Contributor Author:

TupleSerialization writes out the token as a varint, so there shouldn't be any concerns about assigning too many tokens

@johnynek (Collaborator):

@jcdavis one last thing.

here:
https://github.com/twitter/scalding/blob/develop/scalding-core/src/main/scala/com/twitter/scalding/serialization/KryoHadoop.scala#L32

you could get the registered class list back out of the config. With that list we could make sure at the end of all the registrations that everything named is registered on the resulting kryo. This would make sure that all classes use tokens at the hadoop, cascading and kryo levels (the tokens don't need to match with Kryo, but it would be nice to not write the class names on nested structures).

using code similar to here:
https://github.com/twitter/scalding/blob/develop/scalding-core/src/main/scala/com/twitter/scalding/Config.scala#L163

you check if each item in the list has a registration, if not, add it.

@isnotinvain (Contributor):

Sort of related question: Can we re-use some of this to make JobTests try to serialize everything that's going to need serializing, to catch those sorts of errors in unit tests? That'd be super cool.

@jcdavis (Contributor Author) commented Mar 25, 2017

@johnynek I have already been testing that for us locally, but for some reason I thought KryoHadoop was in chill, so I was going to have to PR it there, deal with some versioning hell, etc. This makes it much easier :)

My version locally uses a different config that has classes that are definitely not registered yet, which makes things simpler. I'll do some more looking at the Kryo API, but it seems like I can just call .register(clazz)? That's a no-op if the class is already registered, and if not it will use the appropriate default serializer.

My one concern is that if someone overrides KryoHadoop with their own instantiator (as we do), they might register default serializers after classes have already been registered. E.g., for Thrift:

class MyKryoHadoop(config: Config) extends KryoHadoop(config) {
  override def newKryo(): Kryo = {
    val kryo = super.newKryo()
    kryo.addDefaultSerializer(classOf[TBase[_, _]], new TBaseSerializer)
    ...

So if a thrift type were in the tokenized list, it wouldn't be registered with the TBaseSerializer, which might be worse than writing out the full string name but using an optimized serializer.

@jcdavis (Contributor Author) commented Mar 27, 2017

Some more thought on this: what if KryoHadoop added a new method, e.g. def registerKryoSerializers(kryo: Kryo): Unit = {}, with the intention that it should be overridden instead of newKryo? Then the registering of classes could happen at the very end, after the call, to ensure any custom default serializers have been properly configured.

@johnynek (Collaborator):

@jcdavis what you are describing is related to what chill calls an IKryoRegistrar:
https://github.com/twitter/chill/blob/develop/chill-java/src/main/java/com/twitter/chill/IKryoRegistrar.java#L23

The problem is that newKryo is implementing a class method, so it has to be there. The better approach is probably to invert it:

add

def registrar: IKryoRegistrar =
  // This could even be on the companion so people could access the default registrar
  new IKryoRegistrar {
    def apply(k: Kryo) = {
      // add all the registrations here
    }
  }

def newKryo = {
  val k = // make the kryo with no registrations
  registrar(k) // register, here is where people can override
  // now make sure all the classes are registered.
}

what do you think of that?
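Made concrete, the inversion sketched above could look like the following stand-alone code (the Kryo class here is a minimal stand-in for com.esotericsoftware.kryo.Kryo, and InstantiatorSketch for KryoHadoop; all names are invented for illustration):

```scala
// Stand-in for the real Kryo, just enough to show the registration order
class Kryo {
  private var registered = Set.empty[String]
  def register(c: Class[_]): Unit = registered += c.getName
  def isRegistered(c: Class[_]): Boolean = registered(c.getName)
}

abstract class InstantiatorSketch(tokenizedClasses: Set[Class[_]]) {
  // Override point: subclasses add custom serializers/registrations here
  def registrar: Kryo => Unit = { _ => () }

  final def newKryo(): Kryo = {
    val k = new Kryo
    registrar(k) // user registrations run first...
    // ...then back-fill any tokenized class that is still unregistered
    tokenizedClasses.foreach { c => if (!k.isRegistered(c)) k.register(c) }
    k
  }
}
```

The design point is the ordering: because the back-fill runs after `registrar`, a subclass's custom default serializers are in place before the tokenized classes are registered.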

@johnynek (Collaborator):

@isnotinvain I don't follow exactly what you mean. This tells us the types in play, but it does not tell us a way to get instances of them, so I still don't see a way to test serialization better with this. If the user does .runHadoop it should serialize the data the user sets up. Did I misunderstand?

@dieu (Contributor) commented Mar 28, 2017

@jcdavis / @johnynek what about Execution? Do we want to turn it on for ExecutionApp?

@johnynek (Collaborator):

I would love to turn it on for Execution. Should be doable, but I have not looked at it yet. Any thoughts?

@dieu (Contributor) commented Mar 28, 2017

@jcdavis / @johnynek the problem is that the current implementation is based on finding fields, but in Execution we don't have fields; the whole logic is inside ExecutionApp.job.

The only option I can think of is using a macro to inspect the method body.

@piyushnarang (Collaborator):

I'm not sure if adding support for Executions is something we need to tackle as part of this PR. It seems like we could keep this iterative and first push out the Job based functionality and then followup with a different review for Execution support.
For the current review, what are the major items left to tackle? I was hoping to get some more tests in (which I commented on earlier) and @dieu also had some test scenarios in mind (I'll let him chime in with them). Not sure what other items there are. Trying to get a sense of roughly how long this might take, given that we're waiting on this PR for the 0.17.0 release (#1641) and there's a bunch of other items waiting on that as well (like Storehaus 2.12 and Summingbird 2.12).

@jcdavis (Contributor Author) commented Mar 28, 2017

I think the last thing I'm working on is adding support for registering these classes in KryoHadoop. I'm not too familiar with Execution so punting on that for this PR makes sense.

@johnynek (Collaborator):

+1 to separate into a second PR.

relatedly, maybe #1658 could be useful for execution. I think some function like addCascadingTokens(classOf[MyObjectThatHasTheMethods], conf) could just work currently, but is a little manual.

@piyushnarang (Collaborator):

@jcdavis sounds good. Let's also add some tests around the scenarios I mentioned if possible. We've been hit a few times lately by bugs that would've been caught with more unit tests but instead broke prod jobs, with all the resultant pain.

Also added a new customRegistrar method so users can add additional default
serializers before the tokenized classes are registered
@jcdavis (Contributor Author) commented Apr 4, 2017

Sorry for the delay on this one, I spent most of last week firefighting so I haven't been able to make progress until today.

As discussed, I've added support in KryoHadoop for registering cascading tokenized classes which haven't been registered yet.

Also, per @dieu, it turns out that inner classes don't work currently. The root of the issue is that Java's class names use $ for inner classes, whereas Type.fullName returns . all the way through (e.g. it sees com.twitter.scalding.ReferencedClassFinder.C5 instead of ...$C5). This doesn't cause any issues because we catch the ClassNotFoundException and silently ignore it, but it does mean such classes aren't tokenized. I've added a case to the test to check that things still work.

This is almost certainly fixable, but given this PR is already pretty lengthy I would suggest passing on that case for now.
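For reference, one likely shape for the deferred inner-class fix (a sketch, not part of this PR) is to retry Class.forName while swapping '.' separators for '$' from the right:

```scala
import scala.annotation.tailrec
import scala.util.Try

// Scala reflection's Type.fullName uses '.' throughout, but JVM binary names
// use '$' for nested classes; retry with '$' substituted right-to-left
@tailrec
def classForScalaName(name: String): Option[Class[_]] =
  Try(Class.forName(name)).toOption match {
    case found @ Some(_) => found
    case None =>
      val i = name.lastIndexOf('.')
      if (i < 0) None
      else classForScalaName(name.substring(0, i) + "$" + name.substring(i + 1))
  }
```

For example, classForScalaName("java.util.Map.Entry") resolves java.util.Map$Entry, while a name with no resolvable variant eventually returns None.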

@jcdavis (Contributor Author) commented Apr 4, 2017

FWIW, just the cascading tokenization has improved our jobs' CPU usage by as much as 20%. That plus these kryo changes and better kryo registration on our end (not using the default serializer unnecessarily) seems to be worth another ~20% on top (so a 30-35% CPU reduction overall), plus an 8-10% drop in bytes written to the network.

As always with hadoop benchmarks, YMMV. We were doing about as badly as one could (no manual registration of any sort + missing serializers), so I suspect other folks might not see such big wins.

@isnotinvain (Contributor):

@johnynek oh ok, I guess I misunderstood. I think it'd be nice if JobTest could trigger all the serialization that's going to happen in production so that we can make sure it works. But you're right, that's at the instance level and this is just scanning for classes.

@piyushnarang (Collaborator):

@jcdavis checking in, were you planning on adding more tests here? (Think you added tests for the inner classes, not sure if you were planning to add others for the other scenarios).
@johnynek / @isnotinvain / @dieu - are you guys good with these changes?

@johnynek (Collaborator) commented Apr 7, 2017

I say we merge. Users can disable it if they have a problem, and we can add tests later. Let's get scalding 0.17 out.

@piyushnarang (Collaborator):

Sounds good to me. We can follow up with more tests if needed.

@johnynek johnynek merged commit 68519c1 into twitter:develop Apr 7, 2017
@jcdavis jcdavis deleted the job-type-reflection branch May 2, 2017 14:50