Use reflection over Jobs to find serialized classes #1654
Conversation
Using scala reflection, we can look at the types of TypedPipe/Grouped etc to identify classes being serialized and automatically assign them compact cascading tokens instead of writing full names.
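The general shape of the approach can be sketched as follows. This is a rough, self-contained illustration with hypothetical names (not the PR's actual code), using `List` as a stand-in container since the real code walks types like `TypedPipe`:

```scala
import scala.reflect.runtime.{ universe => ru }

// Rough sketch (hypothetical names): use Scala runtime reflection to walk
// a class's getters and collect the type arguments of generic field types,
// analogous to how the PR inspects a Job's TypedPipe/Grouped fields.
object FieldTypeArgs {
  def of(clazz: Class[_]): Set[String] = {
    val mirror = ru.runtimeMirror(clazz.getClassLoader)
    val tpe = mirror.classSymbol(clazz).toType
    tpe.decls
      .collect { case m if m.isMethod && m.asMethod.isGetter => m.asMethod.returnType }
      .flatMap {
        // For a generic field type like List[String], collect "String" etc.
        case ru.TypeRef(_, _, args) => args.map(_.typeSymbol.fullName)
        case _                      => Nil
      }
      .toSet
  }
}

class ExampleJob {
  // Stand-in for a field like `val pipe: TypedPipe[String]`
  val names: List[String] = Nil
}
```

The real implementation would then map these type names back to `Class[_]` instances and hand them to the cascading token updater.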
```scala
 * Note: this is not guaranteed to find every used type. E.g., it can't find types used in a step that isn't
 * referred to in a field
 */
def findUsedClasses(jobClazz: Class[_ <: Job]): Set[Class[_]] = {
```
either this method or the object should be private[scalding]
why do we need `jobClass: Class[_ <: Job]` rather than `outerClass: Class[_]`? Why does `Job` matter? For instance, I'd like to use this with `Execution` as well, and we might want to pass an `ExecutionApp` to get the same result (if you can find items).
Also, could we possibly also walk the methods and look at the input and return types? This could cover all the cases perhaps?
It's not really job-related specifically, that's just what I started with to self-limit usage. I'll change it.
I passed on methods for this PR to simplify things, but could probably add them. For methods, we might care about ones that return `A` in addition to `TypedPipe[A]` etc, so you might need to be a little smarter about what you consider/ignore.
```scala
private def getClassOpt(name: String): Option[Class[_]] = {
```
Could avoid loading the classes if I refactored `CascadingTokenUpdater.updater` to take just the names instead of the classes, since it just calls `.getName`.
PSA: `stringToTermName` is deprecated in 2.11, so we might need to revert some of this for 2.12 support.
```scala
def setSerialization(
  kryo: Either[(Class[_ <: KryoInstantiator], KryoInstantiator), Class[_ <: KryoInstantiator]],
  userHadoop: Seq[Class[_ <: HSerialization[_]]] = Nil,
  additionalSerializedClasses: Set[Class[_]] = Set.empty): Config = {
```
can we remove the `additionalSerializedClasses` here? I think callers should just do:

```scala
.setSerialization(...)
  .addCascadingClassSerializationTokens(additionalSerializedClasses)
```
But another concern is that sadly both cascading and kryo need to give tokens to classes. Can we make a kryo registrar:
https://github.com/twitter/chill/blob/develop/chill-java/src/main/java/com/twitter/chill/IKryoRegistrar.java
that registers any classes not already added? Then it can be combined with the KryoInstantiator:
https://github.com/twitter/chill/blob/develop/chill-java/src/main/java/com/twitter/chill/KryoInstantiator.java#L98
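The register-if-absent idea can be illustrated like this. Note this uses a minimal local stand-in for the Kryo registration surface rather than chill's real `IKryoRegistrar`/`Kryo` API, purely to show the logic of registering only classes not already known:

```scala
// Minimal stand-in (assumed shape, NOT chill's actual API) for the parts
// of Kryo that a register-if-absent registrar would need.
trait KryoLike {
  def isRegistered(c: Class[_]): Boolean
  def register(c: Class[_]): Unit
}

// Registers any class that isn't already registered, so it can be safely
// combined with an existing instantiator's registrations.
class RegisterIfAbsent(classes: Seq[Class[_]]) {
  def apply(k: KryoLike): Unit =
    classes.foreach { c => if (!k.isRegistered(c)) k.register(c) }
}
```

In the real thing, this would be an `IKryoRegistrar` composed onto the `KryoInstantiator` via `withRegistrar`, as the linked chill sources suggest.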
I thought about doing Kryo as well, but I'm not super familiar with it, so I've punted for this PR.
My main confusion/concern is: how does the KryoRegistrar know the job class to inspect (Hadoop config? I don't see the full scalding job name anywhere), or does it parse the cascading token config to find classes? And does this get added to the chill `KryoHadoop`, or do we add a new instantiator for this? (And then what happens if someone extends `KryoHadoop`?)
@@ -185,9 +185,13 @@ class Job(val args: Args) extends FieldConversions with java.io.Serializable {

```scala
val init = base ++ modeConf

val usedClasses: Set[Class[_]] = if (args.boolean("scalding.nojobclassreflection")) Set.empty else {
```
can we make this a `private def`?
```scala
private val baseContainers = List(
  classOf[Execution[_]],
  classOf[TypedPipe[_]],
  classOf[TypedSink[_]],
```
do we not want `TypedSource[_]`?
I'll add
```scala
case universe.TypeRef(_, _, args) =>
  args.flatMap { generic =>
    // If the wrapped type is a Tuple2, recurse into its types
    if (generic.typeSymbol.fullName == "scala.Tuple2") {
```
why not `.fullName.startsWith("scala.Tuple")` to support all tuples?
No reason beyond reducing the scope, I'll add
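A sketch of what the broader tuple check could look like (helper name hypothetical, not the PR's code); requiring a digit suffix guards against accidentally matching other names that merely share the `scala.Tuple` prefix:

```scala
// Hypothetical helper: accept any scala.TupleN class name.
def isTupleClassName(fullName: String): Boolean = {
  val prefix = "scala.Tuple"
  val suffix = fullName.stripPrefix(prefix)
  fullName.startsWith(prefix) && suffix.nonEmpty && suffix.forall(_.isDigit)
}
```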
```scala
import scala.reflect.runtime.universe

object JobClassFinder {
```
I don't think `Job` has anything really to do with this. Can we just say `ReferencedClassFinder` or something? Also, we might add a method like:

```scala
object ReferencedClassFinder {
  /**
   * Add the given type, as well as all referenced types, to the cascading tokens list.
   * Note: for maximal efficiency, you should also register those types with the kryo
   * instantiator being used.
   */
  def addCascadingTokensFrom(c: Class[_], config: Config): Config
}
```
A small wrinkle here we have to think about. There are a few options here, of varying degrees of complication/hackiness:

Both 2 and 3 require the Hadoop Serializations config to have been set, which seems like a reasonable requirement based on how the Scalding config is init'ed. 2 and 3 could also protect against either class or token conflicts with other user-defined serializations.
2 sounds pretty good to me.
Instead, I've replicated the logic of looking over the serializations for the
Looks good to me.
What do you think @piyushnarang?
```scala
  CascadingTokenUpdater.update(config, findReferencedClasses(c) + c)
}

/**
 * Reflect over a scalding to try and identify types it uses so they can be tokenized by cascading.
```
Reflect over a scalding container type?
@@ -197,6 +198,10 @@ class Job(val args: Args) extends FieldConversions with java.io.Serializable {

```scala
  .toMap.toMap[AnyRef, AnyRef] // linter:ignore the second one is to lift from String -> AnyRef
}

def reflectedClasses: Set[Class[_]] = if (args.boolean("scalding.nojobclassreflection")) Set.empty else {
```
could we add this to Config.scala: https://github.com/twitter/scalding/blob/develop/scalding-core/src/main/scala/com/twitter/scalding/Config.scala#L431
with details on what users get when they turn this on / off?
👍
Actually, now that I look at it, I'm not sure that makes sense? The `Config` params are for stuff that goes into the hadoop config, whereas this is just being read from the `Args`. Other things being read from args in the `Job` (e.g. `scalding.nocounters` or `scalding.flowstats`) aren't referenced in the config.
Agreed I should add documentation somewhere, just not sure if Config seems like the right place.
How about we add it to the companion `Args` object? We don't need to refactor the other Args in this review, but it seems nicer than having them scattered in a bunch of places, right?
Args companion object seems pretty reasonable, I'll add it there
```scala
case class C3(c: Int)
case class C4(d: Int)

trait TraitType {
```
do we need to add some tests for:
- Java types?
- Thrift / protobuf types?
- Testing that if we have some primitive types (like Int) referenced, we don't end up adding their tokens? (Or is that the same scenario as the BytesWritable?)
1 and 2 should be fine, at least based on my testing.
Primitives seem to work, but in `setSerialization` it filters out primitives and array types:

```scala
val kryoClasses = withKryo.getKryoRegisteredClasses
  .filterNot(_.isPrimitive) // Cascading handles primitives and arrays
  .filterNot(_.isArray)
```

I'll add those filters to `findReferencedClasses`.
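For illustration, here is that filtering applied to a small set of classes (a sketch, not the PR's code). In Scala, `classOf[Int]` is the JVM primitive `int` class, so `isPrimitive` holds for it:

```scala
// Cascading already handles primitives and arrays, so drop them before
// assigning tokens; only ordinary reference classes remain.
val classes: Set[Class[_]] = Set(classOf[Int], classOf[Array[Byte]], classOf[String])
val tokenizable = classes.filterNot(_.isPrimitive).filterNot(_.isArray)
// tokenizable is Set(classOf[String])
```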
could we add tests for these scenarios?
Yea I'll add some tests
also, let's have tests for inner classes (case classes inside an object)
```diff
 // We don't want to assign tokens to classes already in the map
-val newClasses: Iterable[String] = clazzes.map { _.getName } -- toks.values
+val newClasses: Iterable[String] = clazzes.map { _.getName } -- fromSerializations -- toks.values
```
should we be checking if the number of these tokens exceeds what cascading supports? (max int?). Wondering if there are some really complex types that end up referencing a large graph of types.
`TupleSerialization` writes out the token as a varint, so there shouldn't be any concerns about assigning too many tokens.
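A sketch of how such token assignment might work (names and shapes hypothetical, not the actual `CascadingTokenUpdater` code): skip names already tokenized or already covered by a Hadoop serialization, and continue numbering past the highest token in use.

```scala
// Hypothetical token assignment: existing maps token -> class name,
// covered holds names already handled by Hadoop serializations.
def assignTokens(
    existing: Map[Int, String],
    covered: Set[String],
    clazzes: Set[Class[_]],
    firstToken: Int): Map[Int, String] = {
  // Names already covered or tokenized don't get a new token.
  val newNames = clazzes.map(_.getName) -- covered -- existing.values.toSet
  // Continue numbering after the highest existing token.
  val start =
    if (existing.isEmpty) firstToken
    else math.max(existing.keys.max + 1, firstToken)
  existing ++ newNames.toList.sorted.zipWithIndex.map {
    case (name, i) => (start + i) -> name
  }
}
```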
@jcdavis one last thing: you could get the registered class list back out of the config. With that list, we could make sure at the end of all the registrations that everything named is registered on the resulting kryo. This would make sure that all classes use tokens at the hadoop, cascading, and kryo levels (the tokens don't need to match with Kryo, but it would be nice to not write the class names on nested structures). Using code similar to here: you check if each item in the list has a registration, and if not, add it.
Sort of related question: can we re-use some of this to make JobTests try to serialize everything that's going to need serializing, to catch those sorts of errors in unit tests? That'd be super cool.
@johnynek I have already been testing that for us locally, but for some reason I thought KryoHadoop was in chill, thus I was going to have to PR it there, do some versioning hell, etc. This makes that much easier :) My version locally uses a different config that has classes that are definitely not registered yet, which makes things simpler. I'll do some more looking at the Kryo api, but it seems like I can just call
My one concern is that if someone overrides
So if a thrift type was in the tokenized list, it won't be registered with the
Some more thought on this: what if
@jcdavis what you are describing is related to what chill calls an IKryoRegistrar: The problem is that add
what do you think of that?
@isnotinvain I don't follow exactly what you mean. This tells us the types in play, but it does not tell us a way to get instances of them, so I still don't see a way to test serialization better with this. If the user does
I would love to turn it on for Execution. Should be doable, but I have not looked at it yet. Any thoughts?
I'm not sure if adding support for Executions is something we need to tackle as part of this PR. It seems like we could keep this iterative and first push out the Job based functionality, then follow up with a different review for Execution support.
I think the last thing I'm working on is adding support for registering these classes in
+1 to separate into a second PR. Relatedly, maybe #1658 could be useful for execution. I think some function like
@jcdavis sounds good. Let's also add some tests around the scenarios I mentioned if possible. We have been hit a few times lately with bugs that would've been caught with more unit tests, but instead they ended up breaking prod jobs, with the resultant pain.
Also added a new customRegistrar method so users can add additional default serializers before the tokenized classes are registered.
Sorry for the delay on this one, I spent most of last week firefighting so I haven't been able to make progress until today. As discussed, I've added support in
Also, as per @dieu, it turns out that inner classes don't work currently. The root of the issue is that Java's class names use
This is almost certainly fixable, but given this PR is already pretty lengthy I would suggest passing on that case for now.
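The naming mismatch at the root of the inner-class issue is a general JVM/Scala fact (this example is illustrative, not code from the PR): the JVM's binary class name joins nesting levels with `$`, while Scala reflection's `fullName` uses `.` throughout, so a name produced by one can't be looked up directly with the other.

```scala
import scala.reflect.runtime.{ universe => ru }

object Outer {
  case class Inner(x: Int)
}

// JVM binary name: nesting is separated with '$', e.g. "...Outer$Inner".
val binaryName = classOf[Outer.Inner].getName

// Scala reflection's fullName: segments are '.'-separated, e.g. "...Outer.Inner".
val reflectedName =
  ru.runtimeMirror(getClass.getClassLoader).classSymbol(classOf[Outer.Inner]).fullName
```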
FWIW, just the cascading tokenization has improved our jobs' CPU usage by as much as 20%. That plus these kryo changes and better kryo registration on our end (not using the default serializer unnecessarily) seems to be worth another ~20% on top (so 30-35% CPU reduction overall), plus an 8-10% drop in bytes written to the network. As always with hadoop benchmarks, YMMV. We were doing about as bad as one could do (no manual registration of any sort + missing serializers), so I suspect other folks might not see such big wins.
@johnynek oh ok, I guess I misunderstood. I think it'd be nice if JobTest could trigger all the serialization that's going to happen in production so that we can make sure it works. But you're right, that's at the instance level and this is just scanning for classes.
@jcdavis checking in, were you planning on adding more tests here? (Think you added tests for the inner classes, not sure if you were planning to add others for the other scenarios.)
I say we merge. Users can disable it if they have a problem, and we can add tests. Let's get scalding 0.17 out.
Sounds good to me. We can follow up with more tests if needed. |
Using scala reflection, we can look at the container types of TypedPipe/Grouped etc to identify classes being serialized and automatically assign them compact cascading tokens instead of writing full names.

A few followups probably needed on this first pass:
- `TypedContainer` for `TypedPipe` etc? That doesn't truly change the coverage issue, though
- `Tuple2`? There isn't too much cost in adding additional tokens to the cascading tokens config, even if they aren't used.