Generalize Sparse/Dense vector #84

Closed
avibryant opened this Issue Dec 19, 2012 · 10 comments

Collaborator

avibryant commented Dec 19, 2012

The recent refactoring of HLL does almost exactly what I am contemplating doing with MinHash, except that HLL is Map[Int, Max[Byte]] and I would want Map[Int, Min[Int]]. I'm wondering if we want to generalize to an OptionallySparseVector[V], with an OptionallySparseVectorMonoid[V: Monoid]. (A better name would be appreciated.)
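
A rough sketch of what that generalization might look like (all names hypothetical; the `Semigroup` trait is a stand-in for Algebird's, just to keep the sketch self-contained):

```scala
// Stand-in for Algebird's Semigroup.
trait Semigroup[V] { def plus(l: V, r: V): V }

// A fixed-length vector stored sparsely as a Map[Int, V], merged
// index-wise under V's semigroup. HLL corresponds to V = Max[Byte];
// MinHash would use V = Min[Int].
case class OptionallySparseVector[V](size: Int, items: Map[Int, V])

class OptionallySparseVectorMonoid[V](implicit sg: Semigroup[V]) {
  def plus(l: OptionallySparseVector[V],
           r: OptionallySparseVector[V]): OptionallySparseVector[V] = {
    require(l.size == r.size, "cannot merge vectors of different sizes")
    // Merge the right-hand entries into the left-hand map, combining
    // values at shared indices with the semigroup.
    val merged = r.items.foldLeft(l.items) { case (acc, (i, v)) =>
      acc.updated(i, acc.get(i).map(sg.plus(_, v)).getOrElse(v))
    }
    OptionallySparseVector(l.size, merged)
  }
}
```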

Collaborator

avibryant commented Dec 19, 2012

Name suggestion: AdaptiveVector. Thinking of having 4 representations: all zeros, exactly one non-zero, sparse (i.e., majority zeros), and dense (i.e., majority non-zeros).
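
One plausible shape for those four cases as a sealed hierarchy (hypothetical, just to make the proposal concrete):

```scala
sealed trait AdaptiveVector[V] { def size: Int }
// All entries are zero: no per-entry state at all.
case class AllZeros[V](size: Int) extends AdaptiveVector[V]
// Exactly one non-zero entry: a single small object, no Map needed.
case class SingleValue[V](size: Int, index: Int, value: V) extends AdaptiveVector[V]
// Mostly zeros: store only the non-zero entries.
case class Sparse[V](size: Int, items: Map[Int, V]) extends AdaptiveVector[V]
// Mostly non-zeros: store everything.
case class Dense[V](items: Vector[V]) extends AdaptiveVector[V] {
  def size: Int = items.size
}
```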

Contributor

aaron-siegel commented Dec 19, 2012

Yes! Oscar suggested exactly this and we discussed it in the context of HLL. I decided to go ahead and implement the direct solution for HLL, ship it to the develop branch (I have an immediate need for it: jobs that otherwise completely choke on their heap limits), and then look for ways to factor out the sparse/dense logic.

Honestly, I'm having trouble seeing the value of special-casing the "all zeros" and "exactly one non-zero" representations. It seems like these can be efficiently conflated with the sparse representation. I also think the optimal cutoff for the sparse -> dense conversion is much less than majority; in HLL I'm using 1/16 sparseness as the threshold, but that's just a gut call and not based on any science. It should probably be a parameter, since it partly depends on the use case. (In HLL, the sparse keys are Ints and the values are Bytes, which already gives a 5:1 ratio in unit storage requirements for sparse vs. dense, and there's added inefficiency in the map representation vs. a vector, so I think 1/16 is reasonable. In other cases, with more complicated values, a higher threshold could be optimal.)
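
The cutoff-as-parameter idea amounts to something like this trivial helper (hypothetical):

```scala
// Decide when a sparse representation should be converted to dense.
// The 1/16 default echoes the gut-call threshold from the HLL work;
// callers with heavier value types could pass a larger cutoff.
def shouldDensify(size: Int, nonZeros: Int, cutoff: Double = 1.0 / 16.0): Boolean =
  nonZeros >= size * cutoff
```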

Contributor

aaron-siegel commented Dec 19, 2012

Btw, I'd also suggest not implementing an AdaptiveVector structure at all; rather, just have separate SparseVector and DenseVector types, and then an AdaptiveVectorMonoid that knows how to switch between them intelligently. This seems cleaner to me. The AdaptiveVectorMonoid could then be parameterized with the switching cutoff etc., keeping the data structures themselves more orderly.
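
A sketch of that factoring (hypothetical names, reusing the Semigroup stand-in from the earlier sketch): the data structures stay dumb, and only the monoid knows the cutoff and how to convert.

```scala
// Stand-in for Algebird's Monoid.
trait Monoid[V] extends Semigroup[V] { def zero: V }

case class SparseVector[V](size: Int, items: Map[Int, V])
case class DenseVector[V](items: Vector[V])

class AdaptiveVectorMonoid[V](cutoff: Double)(implicit m: Monoid[V]) {
  // The switching decision lives in the monoid, not the data structures.
  def shouldDensify(s: SparseVector[V]): Boolean =
    s.items.size >= s.size * cutoff

  // Fill in the monoid's zero for absent indices.
  def densify(s: SparseVector[V]): DenseVector[V] =
    DenseVector(Vector.tabulate(s.size)(i => s.items.getOrElse(i, m.zero)))
}
```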

Collaborator

avibryant commented Dec 19, 2012

FWIW, I'll have a draft pull request (with no tests yet) ready to submit shortly.

Collaborator

avibryant commented Dec 19, 2012

The "all zeros" and "exactly one non-zero" representations are memory/GC optimizations. Having a special all-zero object makes it very easy to do the optimization where plus(0, N) just directly returns N, without needing to allocate a new object. The "exactly one non-zero" case is extremely common in real-world use, where you might do a map or flatMap over some event stream and produce a stream of (idx,value) pairs, which you are then going to merge into increasingly dense vectors. Having a compact, single-object representation (vs. a wrapper around a Map which is at least two objects and likely more) seems worthwhile given the frequency with which these objects will be created.

I chose to enclose the whole thing as nested classes inside an AdaptiveSeq instance, which fixes the length and internal monoid; this lets the compiler enforce that you're not trying to merge two incompatible instances, rather than having to jump through runtime hoops as in the IndexedSeqMonoid.
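
A sketch of the nesting trick (hypothetical; the actual draft surely differs): because the variants are path-dependent types nested in the enclosing instance, mixing vectors from two different AdaptiveSeq instances fails at compile time.

```scala
class AdaptiveSeq[V](val size: Int)(implicit sg: Semigroup[V]) {
  sealed trait Vec
  case object Zeros extends Vec
  case class Single(index: Int, value: V) extends Vec

  def plus(l: Vec, r: Vec): Vec = (l, r) match {
    case (Zeros, other) => other
    case (other, Zeros) => other
    case (Single(i, a), Single(j, b)) if i == j => Single(i, sg.plus(a, b))
    case _ => sys.error("remaining cases elided in this sketch")
  }
}

// val a = new AdaptiveSeq[Int](100)
// val b = new AdaptiveSeq[Int](100)
// a.plus(x, y) compiles only if x and y are both a.Vec; passing a b.Vec
// is a type error, so no runtime length/monoid checks are needed.
```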

Contributor

aaron-siegel commented Dec 19, 2012

I understand that they're memory optimizations; I'm just not convinced they're worth the extra complexity in the logic. It seems like the "plus(0, N) directly returns N" shortcut can be easily implemented regardless of the representation of 0, and it's hard for me to see how the "exactly-one" instance is substantively more efficient than a sparse instance with a singleton Map(k -> v), which is pretty well optimized by Scala. But you have a lot more experience with Scala optimization than I do, so I won't press the case.

I like the idea of using nested instances of a parameterized enclosing class.

Contributor

aaron-siegel commented Dec 19, 2012

Tried to add line comments to your pull request, but got a "you have been blocked from this repository" message.

Collaborator

avibryant commented Dec 19, 2012

Uh... what does that even mean? Blocked from my repo? From the Twitter repo?

Contributor

aaron-siegel commented Dec 19, 2012

It must mean your repo, since I'm still able to add comments here. The comment I wanted to add is for Single.apply:

Should there be range checks on the value of idx (throw an IndexOutOfBoundsException if idx < 0 || idx >= len)? Likewise for Empty and Sparse... otherwise the semantics are not compatible with Dense.
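
Something like this, presumably (a hypothetical sketch of a Single variant; it returns Option[V] here rather than consulting the monoid's zero):

```scala
case class Single[V](len: Int, idx: Int, value: V) {
  def apply(i: Int): Option[V] = {
    // Match Dense semantics: out-of-range lookups throw rather than
    // silently returning nothing.
    if (i < 0 || i >= len)
      throw new IndexOutOfBoundsException(s"$i not in [0, $len)")
    if (i == idx) Some(value) else None
  }
}
```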

Collaborator

avibryant commented Dec 19, 2012

Yes, agreed, I'll add those. And I'll try to figure out the perms issue that's stopping you from commenting.

@sritchie closed this Oct 24, 2016
