-
Notifications
You must be signed in to change notification settings - Fork 50
Annotated instances WIP #74
Annotated instances WIP #74
Conversation
I feel like something like |
Yeah, that would work just fine. Seems like this would also require removing the |
Related question: would the cached |
Ah interesting. For a while we had a Incidentally, a strategy we've played with for the problem you may be worrying about here (of overfitting to specific, high-volume customers), is to use the customer for the Instance |
It's for two different problems: one is preventing high traffic users from dominating the trees (the top 1% are being discarded entirely now), the other is for visualization and pattern/sequence mining. Sticking a HLL in |
Another related idea: splits that cover a wider time period might be preferred over a narrower period, particularly towards the root of a tree. |
Yeah, interesting idea. It does feel unsatisfying to combine the simple minimization of training error (or information gain or however you want to think about it) with what are basically anti-overfitting heuristics (ranging from a simple minimum instance count to this kind of much more sophisticated thing). It does seem reasonable for the framework to deal with these independently... I wonder if the right thing here is to have a
|
43a65ed
to
ebd14b8
Compare
Okay, now I'm attempting to implement your last suggestion. It's also not functional yet but it does seem a bit nicer. Does this seem like it's going in the right direction? |
} | ||
|
||
case class OutOfTimeSampler[K](base: Sampler[K], threshold: Long) extends Sampler[K] { | ||
case class OutOfTimeSampler[-M, -K](timestamp: M => Long, base: Sampler[M, K], threshold: Long) extends Sampler[M, K] { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could make this implicit and add a view from DefaultMetadata => Long
, or leave as-is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or this could change to being a MetadataFilterSampler
or something that takes a M=>Boolean
rather than hardcoding the idea of a Long
threshold.
Oh great! This definitely feels like the right direction. (Will respond to your more detailed comments later). |
(At this point do we want to make |
ebd14b8
to
40ff329
Compare
|
* @param features a map of named features that make up this instance | ||
* @param target a distribution of predictions or labels for this instance | ||
*/ | ||
case class Instance[K, V, T](id: String, timestamp: Long, features: Map[K, V], target: T) | ||
case class Instance[M, F, T](metadata: M, features: F, target: T) { | ||
def id(implicit eq: M =:= DefaultMetadata): String = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand adding these for source-level compatibility, but I think I'd rather just make this a breaking change (we can bump the version number appropriately) and get rid of the special casing here - or rather, push it to the places that actually need it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a side note, I've set up sbt-release and released a new version of Diorama. We plan on being much better about this in the future!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll yank this out.
It's getting closer, still some failing tests though. |
c51a053
to
943698e
Compare
@@ -1,6 +1,5 @@ | |||
package com.stripe | |||
|
|||
package object brushfire { | |||
type Tree[K, V, T] = AnnotatedTree[K, V, T, Unit] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should I just delete this file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
You asked about Hashable but I can't find that comment here anymore. Anyway, I kinda agree that |
5938023
to
4d2603c
Compare
4d2603c
to
9fe0d4e
Compare
Alright, I've backed out the |
@@ -320,18 +344,18 @@ case class Trainer[K: Ordering, V, T: Monoid]( | |||
} | |||
|
|||
/** add out of time validation */ | |||
def outOfTime(quantile: Double = 0.8) = { | |||
def outOfTime(quantile: Double = 0.8)(implicit eq: M =:= DefaultMetadata): Trainer[M, K, V, T, A] = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably better to make this timestamp: M => Long
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed.
Alright, done with comments for now. |
Following up to @tixxit's earlier comment: using |
Hm, if I understand @tixxit's comment, though, he's implying that the final (post-present) type is what would live in the actual tree? I guess that's ok as long as you are storing those values for the interior nodes as well, but the fact that you lose the ability to sum the metadata for two subtrees to get the metadata for the parent bothers me a bit... |
(OTOH from a serialization POV I get it: it's often going to be much easier to serialize the final presentation type than the intermediate, summable one. So on reflection that probably does make it worthwhile). |
@NathanHowell what's the status of this PR - should we consider it complete at this point? |
I have some fixes from Spark that need to get rolled into the Scalding On Wed, Dec 23, 2015 at 10:34 AM, Avi Bryant notifications@github.com
|
NOTE: this doesn't work right now, wanted to open this now before I actually complete the work.
I'm looking at efficient ways to calculate some statistics from Instance instances and attach them to tree leaf nodes. One idea I had was to replace the timestamp parameter (which I don't always need) with
A
and useSemigroup.sumOption
to populateLeafNode
instances as they're created. If the timestamp is needed it may also be an interesting thing to track with a(Min[Long], Max[Long])
tuple (or something fancier) to see time ranges per node.The real world problem is that I have a lot of
Instance
s and a lot of customers. AnInstance
is tied to a single customer, but a customer can have multipleInstance
s. I want to count the number of distinct customers per split (viaMonoid[HyperLogLog]
, which is usually much different from the target distribution. It would be interesting to calculate these statistics per target as well.