Annotated instances WIP #74

NathanHowell · 2015-10-09T02:09:16Z

NOTE: this doesn't work right now; I wanted to open it before I actually complete the work.

I'm looking at efficient ways to calculate some statistics from Instance instances and attach them to tree leaf nodes. One idea I had was to replace the timestamp parameter (which I don't always need) with A and use Semigroup.sumOption to populate LeafNode instances as they're created. If the timestamp is needed it may also be an interesting thing to track with a (Min[Long], Max[Long]) tuple (or something fancier) to see time ranges per node.

The real-world problem is that I have a lot of Instances and a lot of customers. An Instance is tied to a single customer, but a customer can have multiple Instances. I want to count the number of distinct customers per split (via Monoid[HyperLogLog]), which is usually much different from the target distribution. It would be interesting to calculate these statistics per target as well.
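For readers following along, here is a minimal sketch of the `sumOption` idea applied to the time-range annotation mentioned above. It is not the code in this PR: the local `Semigroup` trait is a pared-down stand-in for Algebird's, and `TimeRange` and `annotate` are hypothetical names used only for illustration.

```scala
object LeafAnnotationSketch {
  // Minimal stand-in for a semigroup type class (assumption: Algebird's is richer).
  trait Semigroup[A] {
    def plus(x: A, y: A): A
  }

  object Semigroup {
    // Combine all values, or return None when there are none.
    def sumOption[A](xs: Iterable[A])(implicit sg: Semigroup[A]): Option[A] =
      xs.reduceOption(sg.plus)
  }

  // Hypothetical annotation: the earliest and latest timestamp seen at a node.
  final case class TimeRange(min: Long, max: Long)

  object TimeRange {
    def of(timestamp: Long): TimeRange = TimeRange(timestamp, timestamp)

    implicit val semigroup: Semigroup[TimeRange] = new Semigroup[TimeRange] {
      def plus(x: TimeRange, y: TimeRange): TimeRange =
        TimeRange(math.min(x.min, y.min), math.max(x.max, y.max))
    }
  }

  // Annotation for a leaf, built from the timestamps of the instances routed to it.
  def annotate(timestamps: Seq[Long]): Option[TimeRange] =
    Semigroup.sumOption(timestamps.map(TimeRange.of))
}
```

A leaf that receives no instances gets `None`, which is why `sumOption` fits here better than a plain sum that would need a zero value.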

tixxit · 2015-10-09T15:00:27Z

I feel like something like Aggregator may be better suited here to "sum" instance annotations. Namely, the annotation on the instance (call it L) is not necessarily the same type that we want as the final tree node annotation (call it A). So, Trainer would also take some Aggregator[L, _, A]. In your case, your L is the customer, the ignored middle type would be an HLL, and A would be the count (Int?).
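A rough sketch of the three-type shape described above, assuming a simplified local `Aggregator` trait rather than Algebird's actual class. The exact `Set[String]` is a stand-in for the HLL (an assumption made to keep the example self-contained), and `distinctCustomers` is a hypothetical name.

```scala
object AggregatorSketch {
  // Simplified stand-in for an aggregator: L is the per-instance annotation,
  // B the summable intermediate, A the value stored on the tree node.
  trait Aggregator[L, B, A] {
    def prepare(input: L): B
    def combine(x: B, y: B): B
    def present(reduced: B): A

    // Prepare every input, combine the intermediates, then present the result.
    def apply(inputs: Iterable[L]): Option[A] =
      inputs.map(prepare).reduceOption(combine).map(present)
  }

  // L = customer id, B = set of customers (stand-in for an HLL), A = distinct count.
  val distinctCustomers: Aggregator[String, Set[String], Int] =
    new Aggregator[String, Set[String], Int] {
      def prepare(customer: String): Set[String] = Set(customer)
      def combine(x: Set[String], y: Set[String]): Set[String] = x ++ y
      def present(reduced: Set[String]): Int = reduced.size
    }
}
```

With this shape, a Trainer would only need to see `L` and `A`; the middle type stays an implementation detail of the aggregator.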

NathanHowell · 2015-10-09T18:20:50Z

Yeah, that would work just fine. Seems like this would also require removing the Semigroup constraint on A (related to that other open pull request). Do you think this approach is worth pursuing?

NathanHowell · 2015-10-09T18:35:06Z

Related question: would the cached annotation on SplitNode still be necessary for your usage?

avibryant · 2015-10-11T05:28:32Z

Ah interesting. For a while we had a weight: W attribute on Instance for nearly exactly this use case (multiple charges per merchant, interested in unique merchants per split). What we switched to was just having a more complex T value - like, instead of having T be Map[Boolean,Long] it could be Map[Boolean,(Long,HLL)]. Would that work for you? There's probably some worthwhile work to be done making this kind of thing more convenient...
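As an illustration of the richer `T` described above, the following sketch merges targets of the form `Map[Boolean, (Long, ...)]` key by key. It uses an exact `Set[String]` where the comment uses an HLL (an assumption, to avoid a library dependency); `Target`, `single`, and `merge` are hypothetical names.

```scala
object CompositeTargetSketch {
  // Per-label target: (instance count, distinct customers).
  // An exact Set stands in for the HLL (assumption, for brevity).
  type Target = Map[Boolean, (Long, Set[String])]

  // Target contributed by one instance with the given label and customer.
  def single(label: Boolean, customer: String): Target =
    Map((label, (1L, Set(customer))))

  // Key-wise merge: add the counts and union the customer sets.
  def merge(x: Target, y: Target): Target =
    y.foldLeft(x) { case (acc, (label, (count, customers))) =>
      val (c0, s0) = acc.getOrElse(label, (0L, Set.empty[String]))
      acc.updated(label, (c0 + count, s0 ++ customers))
    }
}
```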

Incidentally, a strategy we've played with for the problem you may be worrying about here (overfitting to specific, high-volume customers) is to use the customer as the Instance id. Trees then sample whole customers instead of individual instances, so no customer can be in every tree, which helps with the overfitting. (This, too, might be worth making more convenient as a pattern.)
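One way to picture sampling whole customers is a deterministic, hash-based rule keyed on the customer id. This is only a sketch under that assumption and not necessarily how Brushfire's samplers are implemented; `inTree`, `treesFor`, and the `rate` parameter are hypothetical.

```scala
import scala.util.hashing.MurmurHash3

object CustomerSamplingSketch {
  // Hypothetical rule: a key is included in a tree when a hash of the key,
  // seeded by the tree index, falls below the sampling rate.
  def inTree(key: String, treeIndex: Int, rate: Double): Boolean = {
    val h = MurmurHash3.stringHash(key, treeIndex)
    // Map the 32-bit hash to [0, 1) and compare it with the sampling rate.
    val unit = (h.toLong - Int.MinValue.toLong).toDouble / (1L << 32).toDouble
    unit < rate
  }

  // Trees (out of numTrees) whose training sample contains this key.
  def treesFor(key: String, numTrees: Int, rate: Double): Set[Int] =
    (0 until numTrees).filter(inTree(key, _, rate)).toSet
}
```

Because the decision depends only on the key and the tree index, every Instance that shares a customer id lands in the same set of trees.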

NathanHowell · 2015-10-12T19:20:50Z

It's for two different problems: one is preventing high traffic users from dominating the trees (the top 1% are being discarded entirely now), the other is for visualization and pattern/sequence mining.

Sticking a HLL in T would also be acceptable. I'll try this out later today.

NathanHowell · 2015-10-12T19:25:51Z

Another related idea: splits that cover a wider time period might be preferred over a narrower period, particularly towards the root of a tree.

avibryant · 2015-10-15T17:41:55Z

Yeah, interesting idea.

It does feel unsatisfying to combine the simple minimization of training error (or information gain or however you want to think about it) with what are basically anti-overfitting heuristics (ranging from a simple minimum instance count to this kind of much more sophisticated thing). It does seem reasonable for the framework to deal with these independently...

I wonder if the right thing here is to have an Instance[M,Map[K,V],T], where:

- we roll the idea of timestamp and id into a generic metadata type param M
- Sampler operates on M
- we have some new trait which provides M => N and Semigroup[N] for aggregations of the metadata to annotate the nodes with
- we expect splits to be evaluated both on T and on N
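A minimal sketch of the shape proposed in the list above, assuming a hypothetical trait name `Annotator` for the "new trait" and a hypothetical `Meta` case class for the metadata. Nothing here is taken from the eventual implementation.

```scala
object MetadataSketch {
  final case class Instance[M, F, T](metadata: M, features: F, target: T)

  // Hypothetical name for the "new trait": maps instance metadata M to a
  // node annotation N and says how two annotations combine.
  trait Annotator[M, N] {
    def prepare(metadata: M): N
    def plus(x: N, y: N): N
  }

  // Example metadata carrying what used to be the id and timestamp fields.
  final case class Meta(id: String, timestamp: Long)

  // Example annotation: the latest timestamp among the instances at a node.
  val latest: Annotator[Meta, Long] = new Annotator[Meta, Long] {
    def prepare(m: Meta): Long = m.timestamp
    def plus(x: Long, y: Long): Long = math.max(x, y)
  }

  // Annotation for a node, or None when no instances reach it.
  def annotate[M, F, T, N](instances: Iterable[Instance[M, F, T]])(
      ann: Annotator[M, N]): Option[N] =
    instances.map(i => ann.prepare(i.metadata)).reduceOption(ann.plus)
}
```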

NathanHowell · 2015-10-20T20:38:36Z

Okay, now I'm attempting to implement your last suggestion. It's also not functional yet but it does seem a bit nicer. Does this seem like it's going in the right direction?

NathanHowell · 2015-10-20T22:56:24Z

brushfire-core/src/main/scala/com/stripe/brushfire/Samplers.scala

 }

-case class OutOfTimeSampler[K](base: Sampler[K], threshold: Long) extends Sampler[K] {
+case class OutOfTimeSampler[-M, -K](timestamp: M => Long, base: Sampler[M, K], threshold: Long) extends Sampler[M, K] {


I could make this implicit and add a view from DefaultMetadata => Long, or leave as-is.

Or this could change to being a MetadataFilterSampler or something that takes an M => Boolean rather than hardcoding the idea of a Long threshold.
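To make the second option concrete, here is a sketch of a sampler that filters on an arbitrary metadata predicate. The `Sampler` trait below is a pared-down, hypothetical stand-in (the real one has a different interface), and the choice to keep instances with timestamps strictly below the threshold is an assumption.

```scala
object SamplerSketch {
  // Hypothetical, pared-down Sampler: only the part needed for this example.
  trait Sampler[M] {
    def includes(metadata: M, treeIndex: Int): Boolean
  }

  // Wraps a base sampler and additionally requires the metadata to pass a predicate.
  final class MetadataFilterSampler[M](filter: M => Boolean, base: Sampler[M])
      extends Sampler[M] {
    def includes(metadata: M, treeIndex: Int): Boolean =
      filter(metadata) && base.includes(metadata, treeIndex)
  }

  // A base sampler that accepts every instance for every tree.
  def everything[M]: Sampler[M] = new Sampler[M] {
    def includes(metadata: M, treeIndex: Int): Boolean = true
  }

  // The out-of-time behaviour becomes one particular predicate
  // (assumption: keep instances strictly before the threshold).
  def outOfTime[M](timestamp: M => Long, threshold: Long, base: Sampler[M]): Sampler[M] =
    new MetadataFilterSampler[M](m => timestamp(m) < threshold, base)
}
```

The threshold case then needs no special class, and callers supply `timestamp: M => Long` only where time actually matters.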

avibryant · 2015-10-21T06:14:31Z

Oh great! This definitely feels like the right direction. (Will respond to your more detailed comments later).

avibryant · 2015-10-21T06:15:39Z

(At this point do we want to make AnnotatedTree be just Tree? I don't think we use it without an A anywhere anymore, right?)

NathanHowell · 2015-10-21T20:22:58Z

AnnotatedTree isn't required anymore. The latest push removes this, but is still not fully working yet.

avibryant · 2015-10-25T17:08:27Z

brushfire-core/src/main/scala/com/stripe/brushfire/Brushfire.scala

 * @param features a map of named features that make up this instance
 * @param target a distribution of predictions or labels for this instance
 */
-case class Instance[K, V, T](id: String, timestamp: Long, features: Map[K, V], target: T)
+case class Instance[M, F, T](metadata: M, features: F, target: T) {
+  def id(implicit eq: M =:= DefaultMetadata): String = {


I understand adding these for source-level compatibility, but I think I'd rather just make this a breaking change (we can bump the version number appropriately) and get rid of the special casing here - or rather, push it to the places that actually need it.

As a side note, I've set up sbt-release and released a new version of Diorama. We plan on being much better about this in the future!

I'll yank this out.

NathanHowell · 2015-11-05T23:10:30Z

It's getting closer, still some failing tests though.

NathanHowell · 2015-11-10T02:10:23Z

brushfire-core/src/main/scala/com/stripe/brushfire/package.scala

@@ -1,6 +1,5 @@
 package com.stripe

 package object brushfire {
-  type Tree[K, V, T] = AnnotatedTree[K, V, T, Unit]


Should I just delete this file?

avibryant · 2015-11-10T19:39:16Z

You asked about Hashable but I can't find that comment here anymore. Anyway, I kinda agree that M => String is probably just as good for now.

NathanHowell · 2015-11-10T21:43:52Z

Alright, I've backed out the Hashable goo and tried to minimize the diff.

NathanHowell · 2015-11-10T21:48:31Z

brushfire-scalding/src/main/scala/com/stripe/brushfire/scalding/Trainer.scala

@@ -320,18 +344,18 @@ case class Trainer[K: Ordering, V, T: Monoid](
  }

  /** add out of time validation */
-  def outOfTime(quantile: Double = 0.8) = {
+  def outOfTime(quantile: Double = 0.8)(implicit eq: M =:= DefaultMetadata): Trainer[M, K, V, T, A] = {


Probably better to make this timestamp: M => Long

NathanHowell · 2015-11-10T22:06:26Z

Alright, done with comments for now.

NathanHowell · 2015-11-13T21:10:40Z

Following up on @tixxit's earlier comment: using Aggregator may be the best route. We're halfway there already and I ended up needing something like Aggregator.present in our training pipeline.

avibryant · 2015-12-03T19:00:56Z

Hm, if I understand @tixxit's comment, though, he's implying that the final (post-present) type is what would live in the actual tree? I guess that's ok as long as you are storing those values for the interior nodes as well, but the fact that you lose the ability to sum the metadata for two subtrees to get the metadata for the parent bothers me a bit...

avibryant · 2015-12-03T19:01:30Z

(OTOH from a serialization POV I get it: it's often going to be much easier to serialize the final presentation type than the intermediate, summable one. So on reflection that probably does make it worthwhile).
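The trade-off in the two comments above can be seen with a small example. It assumes distinct-customer counts as the annotation and uses an exact `Set[String]` as a stand-in for the summable intermediate; the customer names are made up.

```scala
object PresentSketch {
  // Presentation step: turn the summable intermediate into the final count.
  // (An exact Set stands in for an HLL; that substitution is an assumption.)
  def present(customers: Set[String]): Int = customers.size

  // Customers seen in the left and right subtrees; "bob" appears in both.
  val left: Set[String] = Set("alice", "bob")
  val right: Set[String] = Set("bob", "carol")

  // Summing before presenting gives the parent's true distinct count.
  val parentFromIntermediate: Int = present(left ++ right)

  // Adding already-presented counts double-counts customers seen on both sides.
  val parentFromPresented: Int = present(left) + present(right)
}
```

Here the intermediate route gives 3 and the presented route gives 4, so interior nodes would need their own stored values if only the presented type lives in the tree.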

avibryant · 2015-12-23T18:33:59Z

@NathanHowell what's the status of this PR - should we consider it complete at this point?

NathanHowell · 2015-12-23T18:35:05Z

I have some fixes from Spark that need to get rolled into the Scalding driver. Otherwise it's working.


NathanHowell force-pushed the instance-annotations branch from 43a65ed to ebd14b8 Compare October 20, 2015 20:35

NathanHowell reviewed Oct 20, 2015

NathanHowell force-pushed the instance-annotations branch from ebd14b8 to 40ff329 Compare October 21, 2015 20:22

avibryant reviewed Oct 25, 2015

Rename AnnotatedTree to Tree

d212d8c

NathanHowell force-pushed the instance-annotations branch 2 times, most recently from c51a053 to 943698e Compare November 10, 2015 01:57

NathanHowell reviewed Nov 10, 2015

NathanHowell force-pushed the instance-annotations branch from 5938023 to 4d2603c Compare November 10, 2015 21:41

Nathan Howell added 4 commits November 10, 2015 13:42

Merge instance metadata into tree metadata when generating splits

367e08c

Add Lazy class to emulate existing lazy annotation behavior

26d9cd8

Rename OutOfTimeSampler to MetadataFilterSampler

fe9aed7

Setting version to 0.7.0-SNAPSHOT

9fe0d4e

NathanHowell force-pushed the instance-annotations branch from 4d2603c to 9fe0d4e Compare November 10, 2015 21:42

NathanHowell reviewed Nov 10, 2015

avibryant mentioned this pull request Dec 23, 2015

Make all splits binary #76

Merged

NathanHowell closed this Jul 20, 2020
