Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added combinations(n,k) to RichPipe #221

Merged
merged 16 commits into from Dec 11, 2012
Merged

Added combinations(n,k) to RichPipe #221

merged 16 commits into from Dec 11, 2012

Conversation

krishnanraman
Copy link

Added combinations(n,k) to RichPipe :

Given an int k, and an input of size n,
return a pipe with nCk combinations, with k columns per row

Computes nCk = n choose k, for large values of nCk

Use-case: Say you have 100 hashtags sitting in an array
You want a table with 5 hashtags per row, all possible combinations
But 100C5 is huge!!! Scala cannot handle it. But Scalding can!
If the hashtags are sitting in a string array, then
combinations[String](hashtags, 5)
will do the needful.

Algorithm: Use k pipes, cross pipes two at a time, filter out non-monotonic entries

eg. 10C2 = 10 choose 2
Use 2 pipes.
Pipe1 = (1,2,3,...10)
Pipe2 = (2,3,4....10)
Cross Pipe1 with Pipe2 for 10*9 = 90 tuples
Filter out tuples that are non-monotonic
For (t1,t2) we want t1<t2, otherwise reject.
This brings down 90 tuples to the desired 45 tuples = 10C2

@krishnanraman
Copy link
Author

flatMap on IterableSource breaks! Strange...
Compiles locally on my box.
Wonder what's going on...

@azymnis
Copy link
Contributor

azymnis commented Nov 7, 2012

Let me try and compile it on my laptop tomorrow. By the way, a little unit test would help. If anything just to show how this works.


}).pipe

val res2 = (1 to k).foldLeft(res)((a,b)=>{
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why capture res2, why not just return the result of the foldLeft?

// pipe 2 = 2 to n
// pipe 3 = 3 to n etc
val n = input.size
val pipes = (1 to k).foldLeft(List[PN]())( (a,x) => {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you need the foldLeft here? Can't you just do a (1 to k).toList.map { x => ... }.reverse and get the same result?

@johnynek
Copy link
Collaborator

This is cool, and I want to merge it, but we do need unit tests first. The good news is combinatorics should be very testable for small n and k.

Can you add tests following the existing JobTest code in the test directory?

@krishnanraman
Copy link
Author

On: f(a,b,c) must have same semantics as f(f(a,b), c) for all of this to work.

a+b has the same semantics as a+b+c as a+b+c+d, in the sense that
a+b+c = ((a+b)+c)
a+b+c+d = (a+(b+(c+d)))

Similarly for f(a,b) = math.sqrt(a_a+b_b), f(a,b) = a*b etc

Why is this important ?
To perform Progressive Filtering , you compute a crossproduct & filter immediately
The tuples that pass the filter go on to the next cross-product
And so on & so forth

So if you wanted 3-tuples in (1 to 1000) space, the final result would have
1000_1000_1000 3-tuples...
But if you applied the filter at 1000*1000, you could eliminate a whole bunch of entries that failed the predicate
Ofcourse, for that to work, you would need f(a,b,c) to have the same semantics as f(f(a,b),c)

Explanation of the API: Say we want 3-tuples that sit inside a sphere of radius 20.0

Gimme the type of the result: Double
Gimme the bag the tuples come from : 0 to 10.0 by 0.1
Gimme the arity of the tuple : 3
Gimme the operator between any two tuples : sqrt( a_a+b_b)
Gimme the test that will break during progressive filtering : x=> x> 20
Gimme the final test that will pass after we have all the 3-tuples: x => x<20

Unit tests forthcoming...

@johnynek
Copy link
Collaborator

So it sounds like you definitely want to use Monoid[T] because we are already using that, and that is the contract of associativity (strictly, you just want a SemiGroup[T] but we don't have a class like that yet, but maybe we should add one).

@krishnanraman
Copy link
Author

Lemme give an example.

Let f(a,b,c) = sqrt( a_a + b_b + c_c)
Then sqrt( a_a+b_b+c_c) = sqrt( (sqrt(a_a+b_b)_sqrt(a_a+b_b) + c_c )
So f is good to go for progressive filtering.

But lets consider g(a,b,c) = a_a+b_b+c_c
Then a_a+b_b+c_c is not the same as (a_a+b_b)_(a_a+b_b) + c_c
So g is not a suitable function for this sort of progressive filtering operation.

I can give several such examples where associativity is preserved, but the functions are unsuitable for progressive filtering.

@johnynek
Copy link
Collaborator

I'm sure not all monoids are okay. That's what the resultOk and resultNotOk functions are for right? It seems like the constraint of associativity is not encoded in op: (T,T) => T, but if op: Monoid[T] it is.

Why not be explicit about using Monoid[T] since we already have that dependency and the contract for Monoid[T].plus is associativity (which is required here).

@krishnanraman
Copy link
Author

Other than using a Monoid, I've made most of the changes you've suggested.
My concern with Monoids is twofold -

  1. The consumers of the API are not going to grok monoid. Most of the linear-programming folk who use stuff like a+b+c==n, do not know or care if the composing operator is a monoid. These are generally Operations Research folk who are using Scalding for its capabilities with big data, not abstract algebra folk.
  2. We haven't shown that the composing operator is indeed a monoid. There are many functions, some of which seem to pass the f(a,b,c) = f(f(a,b),c) test, and others don't. Whether passing that test has something to do with associativity seems likely, but again we don't have a conclusive proof of the same.

@johnynek
Copy link
Collaborator

Let's leave aside the question of the monoid for a second.

When you say f(a,b,c) == f(f(a,b), c), it is a bit unclear to me exactly what f is, since f is clearly multi-arity (takes 2 or 3 inputs).

I assume you are saying that:

f(f(a,b),c)=f(a,f(b,c))

Which is to say that f is an associative binary operation on objects of a fixed type T (f :(T,T) => T).

Is this accurate?

Can you give any counterexample to the fact that you are requiring an associative operation?

@johnynek
Copy link
Collaborator

My next comment: there is nothing to grok in Monoid except that the name for the two concepts: associativity and a zero-element. But even more than that, I think you just need the associativity, and so you really need a SemiGroup which is exactly just an associative closed operation. So you are solving these problems over ordered-monotonic semigroups (where x+y >= x,y). I think encoding that requirement in the types is nice.

@krishnanraman
Copy link
Author

a. I am only requiring left-associativity, not right-associativity. Those are two very different things (atleast in the math community). Left-associativity is needed because I essentially build tuples using a crossWithSmaller inside a foldLeft.
If I were doing crossWithSmaller inside a foldRight, I'd need right-associativity....which would change the semantics...I mean, there'd probably be functions for whom right-assoc makes sense but not left-assoc ( say a/b/c, where the results on (a/b)/c differ from ((c/b)/a) )

b. The monoid-bit...ok, I'll abide by your protocols on that one. Personally I would prefer 2 function calls, one using encoded types & the other without, so the stats/OR folk can simply use the latter api....which is what I think would end up happening anyways.

val allc = (1 to k).toList.map( x=> Symbol("n"+x)) // all column names

val pipes = allc.zipWithIndex.map( x=> {
val pipe = IterableSource(List(""+n), 'n).read.flatMap('n->x._1) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you use "" + n rather than n.toString

Also, why are you using strings at all here? It looks like you are generating something that should be:

val pipe = IterableSource( (x._2 + 1 to n), x._1 ).read

Also, pipe is a bad name. Can you name it, val kUpToNPipe or something?

@johnynek
Copy link
Collaborator

johnynek commented Dec 5, 2012

Also: where are the unit tests?

johnynek added a commit that referenced this pull request Dec 11, 2012
Added combinations(n,k) to RichPipe
@johnynek johnynek merged commit 256de84 into twitter:develop Dec 11, 2012
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants