Skip to content
This repository has been archived by the owner on Apr 8, 2021. It is now read-only.

Annotated instances WIP #74

Closed

Conversation

NathanHowell
Copy link
Contributor

NOTE: this doesn't work right now, wanted to open this now before I actually complete the work.

I'm looking at efficient ways to calculate some statistics from Instance instances and attach them to tree leaf nodes. One idea I had was to replace the timestamp parameter (which I don't always need) with A and use Semigroup.sumOption to populate LeafNode instances as they're created. If the timestamp is needed it may also be an interesting thing to track with a (Min[Long], Max[Long]) tuple (or something fancier) to see time ranges per node.

The real world problem is that I have a lot of Instances and a lot of customers. An Instance is tied to a single customer, but a customer can have multiple Instances. I want to count the number of distinct customers per split (via Monoid[HyperLogLog], which is usually much different from the target distribution. It would be interesting to calculate these statistics per target as well.

@tixxit
Copy link
Contributor

tixxit commented Oct 9, 2015

I feel like something like Aggregator may be better suited here to "sum" instance annotations. Namely, the annotation on the instance (call it L) is not necessarily the same type that we want as the final tree node annotation (call it A). So, Trainer would also take some Aggregator[L, _, A]. In your case, your L is the customer, the ignored middle type would be an HLL, and A would be the count (Int?).

@NathanHowell
Copy link
Contributor Author

Yeah, that would work just fine. Seems like this would also require removing the Semigroup constraint on A (related to that other open pull request). Do you think this approach is worth pursuing?

@NathanHowell
Copy link
Contributor Author

Related question: would the cached annotation on SplitNode still be necessary for your usage?

@avibryant
Copy link
Contributor

Ah interesting. For a while we had a weight: W attribute on Instance for nearly exactly this use case (multiple charges per merchant, interested in unique merchants per split). What we switched to was just having a more complex T value - like, instead of having T be Map[Boolean,Long] it could be Map[Boolean,(Long,HLL)]. Would that work for you? There's probably some worthwhile work to be done making this kind of thing more convenient...

Incidentally, a strategy we've played with for the problem you may be worrying about here (of overfitting to specific, high-volume customers), is to use the customer for the Instance id, which means trees are sampling whole customers instead of individual instances, which means that no customer can be in every tree, which helps with the overfitting. (This, too, might be worth making more convenient as a pattern).

@NathanHowell
Copy link
Contributor Author

It's for two different problems: one is preventing high traffic users from dominating the trees (the top 1% are being discarded entirely now), the other is for visualization and pattern/sequence mining.

Sticking a HLL in T would also be acceptable. I'll try this out later today.

@NathanHowell
Copy link
Contributor Author

Another related idea: splits that cover a wider time period might be preferred over a narrower period, particularly towards the root of a tree.

@avibryant
Copy link
Contributor

Yeah, interesting idea.

It does feel unsatisfying to combine the simple minimization of training error (or information gain or however you want to think about it) with what are basically anti-overfitting heuristics (ranging from a simple minimum instance count to this kind of much more sophisticated thing). It does seem reasonable for the framework to deal with these independently...

I wonder if the right thing here is to have a Instance[M,Map[K,V],T], where:

  • we roll the idea of timestamp and id into a generic metadata type param M
  • Sampler operates on M
  • we have some new trait which provides M => N and Semigroup[N] for aggregations of the metadata to annotate the nodes with
  • we expect splits to be evaluated both on T and on N

@NathanHowell
Copy link
Contributor Author

Okay, now I'm attempting to implement your last suggestion. It's also not functional yet but it does seem a bit nicer. Does this seem like it's going in the right direction?

}

case class OutOfTimeSampler[K](base: Sampler[K], threshold: Long) extends Sampler[K] {
case class OutOfTimeSampler[-M, -K](timestamp: M => Long, base: Sampler[M, K], threshold: Long) extends Sampler[M, K] {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could make this implicit and add a view from DefaultMetadata => Long, or leave as-is.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or this could change to being a MetadataFilterSampler or something that takes a M=>Boolean rather than hardcoding the idea of a Long threshold.

@avibryant
Copy link
Contributor

Oh great! This definitely feels like the right direction. (Will respond to your more detailed comments later).

@avibryant
Copy link
Contributor

(At this point do we want to make AnnotatedTree be just Tree? I don't think we use it without an A anywhere anymore, right?)

@NathanHowell
Copy link
Contributor Author

AnnotatedTree isn't required anymore. The latest push removes this, but is still not fully working yet.

* @param features a map of named features that make up this instance
* @param target a distribution of predictions or labels for this instance
*/
case class Instance[K, V, T](id: String, timestamp: Long, features: Map[K, V], target: T)
case class Instance[M, F, T](metadata: M, features: F, target: T) {
def id(implicit eq: M =:= DefaultMetadata): String = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand adding these for source-level compatibility, but I think I'd rather just make this a breaking change (we can bump the version number appropriately) and get rid of the special casing here - or rather, push it to the places that actually need it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a side note, I've set up sbt-release and released a new version of Diorama. We plan on being much better about this in the future!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll yank this out.

@NathanHowell
Copy link
Contributor Author

It's getting closer, still some failing tests though.

@NathanHowell NathanHowell force-pushed the instance-annotations branch 2 times, most recently from c51a053 to 943698e Compare November 10, 2015 01:57
@@ -1,6 +1,5 @@
package com.stripe

package object brushfire {
type Tree[K, V, T] = AnnotatedTree[K, V, T, Unit]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should I just delete this file?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@avibryant
Copy link
Contributor

You asked about Hashable but I can't find that comment here anymore. Anyway, I kinda agree that M => String is probably just as good for now.

@NathanHowell
Copy link
Contributor Author

Alright, I've backed out the Hashable goo and tried to minimize the diff.

@@ -320,18 +344,18 @@ case class Trainer[K: Ordering, V, T: Monoid](
}

/** add out of time validation */
def outOfTime(quantile: Double = 0.8) = {
def outOfTime(quantile: Double = 0.8)(implicit eq: M =:= DefaultMetadata): Trainer[M, K, V, T, A] = {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably better to make this timestamp: M => Long

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed.

@NathanHowell
Copy link
Contributor Author

Alright, done with comments for now.

@NathanHowell
Copy link
Contributor Author

Following up to @tixxit's earlier comment: using Aggregator may be the best route. We're half way there already and I ended up needing something like Aggregator.present in our training pipeline.

@avibryant
Copy link
Contributor

Hm, if I understand @tixxit's comment, though, he's implying that the final (post-present) type is what would live in the actual tree? I guess that's ok as long as you are storing those values for the interior nodes as well, but the fact that you lose the ability to sum the metadata for two subtrees to get the metadata for the parent bothers me a bit...

@avibryant
Copy link
Contributor

(OTOH from a serialization POV I get it: it's often going to be much easier to serialize the final presentation type than the intermediate, summable one. So on reflection that probably does make it worthwhile).

@avibryant
Copy link
Contributor

@NathanHowell what's the status of this PR - should we consider it complete at this point?

@NathanHowell
Copy link
Contributor Author

I have some fixes from Spark that need to get rolled into the Scalding
driver. Otherwise it's working.

On Wed, Dec 23, 2015 at 10:34 AM, Avi Bryant notifications@github.com
wrote:

@NathanHowell https://github.com/NathanHowell what's the status of this
PR - should we consider it complete at this point?


Reply to this email directly or view it on GitHub
#74 (comment).

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants