New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added combinations(n,k) to RichPipe #221
Conversation
flatMap on IterableSource breaks! Strange... |
Let me try and compile it on my laptop tomorrow. By the way, a little unit test would help. If anything just to show how this works. |
|
||
}).pipe | ||
|
||
val res2 = (1 to k).foldLeft(res)((a,b)=>{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why capture res2, why not just return the result of the foldLeft?
// pipe 2 = 2 to n | ||
// pipe 3 = 3 to n etc | ||
val n = input.size | ||
val pipes = (1 to k).foldLeft(List[PN]())( (a,x) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you need the foldLeft here? Can't you just do a (1 to k).toList.map { x => ... }.reverse and get the same result?
This is cool, and I want to merge it, but we do need unit tests first. The good news is combinatorics should be very testable for small n and k. Can you add tests following the existing JobTest code in the test directory? |
On: f(a,b,c) must have same semantics as f(f(a,b), c) for all of this to work. a+b has the same semantics as a+b+c as a+b+c+d, in the sense that Similarly for f(a,b) = math.sqrt(a_a+b_b), f(a,b) = a*b etc Why is this important ? So if you wanted 3-tuples in (1 to 1000) space, the final result would have Explanation of the API: Say we want 3-tuples that sit inside a sphere of radius 20.0 Gimme the type of the result: Double Unit tests forthcoming... |
So it sounds like you definitely want to use Monoid[T] because we are already using that, and that is the contract of associativity (strictly, you just want a SemiGroup[T] but we don't have a class like that yet, but maybe we should add one). |
Lemme give an example. Let f(a,b,c) = sqrt( a_a + b_b + c_c) But lets consider g(a,b,c) = a_a+b_b+c_c I can give several such examples where associativity is preserved, but the functions are unsuitable for progressive filtering. |
I'm sure not all monoids are okay. That's what the resultOk and resultNotOk functions are for right? It seems like the constraint of associativity is not encoded in op: (T,T) => T, but if op: Monoid[T] it is. Why not be explicit about using Monoid[T] since we already have that dependency and the contract for Monoid[T].plus is associativity (which is required here). |
Other than using a Monoid, I've made most of the changes you've suggested.
|
Let's leave aside the question of the monoid for a second. When you say f(a,b,c) == f(f(a,b), c), it is a bit unclear to me exactly what f is, since f is clearly multi-arity (takes 2 or 3 inputs). I assume you are saying that: f(f(a,b),c)=f(a,f(b,c)) Which is to say that f is an associative binary operation on objects of a fixed type T (f :(T,T) => T). Is this accurate? Can you give any counterexample to the fact that you are requiring an associative operation? |
My next comment: there is nothing to grok in Monoid except that the name for the two concepts: associativity and a zero-element. But even more than that, I think you just need the associativity, and so you really need a SemiGroup which is exactly just an associative closed operation. So you are solving these problems over ordered-monotonic semigroups (where x+y >= x,y). I think encoding that requirement in the types is nice. |
a. I am only requiring left-associativity, not right-associativity. Those are two very different things (atleast in the math community). Left-associativity is needed because I essentially build tuples using a crossWithSmaller inside a foldLeft. b. The monoid-bit...ok, I'll abide by your protocols on that one. Personally I would prefer 2 function calls, one using encoded types & the other without, so the stats/OR folk can simply use the latter api....which is what I think would end up happening anyways. |
val allc = (1 to k).toList.map( x=> Symbol("n"+x)) // all column names | ||
|
||
val pipes = allc.zipWithIndex.map( x=> { | ||
val pipe = IterableSource(List(""+n), 'n).read.flatMap('n->x._1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did you use "" + n rather than n.toString
Also, why are you using strings at all here? It looks like you are generating something that should be:
val pipe = IterableSource( (x._2 + 1 to n), x._1 ).read
Also, pipe
is a bad name. Can you name it, val kUpToNPipe or something?
Also: where are the unit tests? |
Added combinations(n,k) to RichPipe
Added combinations(n,k) to RichPipe :
Given an int k, and an input of size n,
return a pipe with nCk combinations, with k columns per row
Computes nCk = n choose k, for large values of nCk
Use-case: Say you have 100 hashtags sitting in an array
You want a table with 5 hashtags per row, all possible combinations
But 100C5 is huge!!! Scala cannot handle it. But Scalding can!
If the hashtags are sitting in a string array, then
combinations[String](hashtags, 5)
will do the needful.
Algorithm: Use k pipes, cross pipes two at a time, filter out non-monotonic entries
eg. 10C2 = 10 choose 2
Use 2 pipes.
Pipe1 = (1,2,3,...10)
Pipe2 = (2,3,4....10)
Cross Pipe1 with Pipe2 for 10*9 = 90 tuples
Filter out tuples that are non-monotonic
For (t1,t2) we want t1<t2, otherwise reject.
This brings down 90 tuples to the desired 45 tuples = 10C2