Batched groups for TypedPipe #1318
I think that's not quite right. In fact, in that approach, wouldn't this work:

```scala
this.map { v =>
  val k = (random.nextDouble() * batchSize).toInt
  (k, v)
}
.group
```
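For what it's worth, the key assignment above can be simulated outside Scalding to see what it actually does: it scatters values across `batchSize` buckets of roughly equal size, with no hard per-bucket cap. A minimal plain-Scala sketch (the seed and counts are made-up, purely for illustration):

```scala
import scala.util.Random

object ApproxBucketing {
  def main(args: Array[String]): Unit = {
    val random = new Random(42)   // fixed seed, purely for reproducibility
    val n = 100000
    val batchSize = 10            // note: here this is the *number* of buckets

    // Same key assignment as the snippet above, minus Scalding:
    val keys = (1 to n).map(_ => (random.nextDouble() * batchSize).toInt)
    val counts = keys.groupBy(identity).map { case (k, ks) => (k, ks.size) }

    // Bucket ids stay in [0, batchSize), and sizes come out near
    // n / batchSize on average -- but nothing caps any single bucket.
    assert(counts.keys.forall(k => k >= 0 && k < batchSize))
    assert(counts.values.sum == n)
  }
}
```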
These won't be able to give you an at-most bound, right? They're approximate bucketing?
Yeah, I totally lost track of what was supposed to be happening here; disregard. :) Still, what is this doing: `(random.nextLong % size) / batchSize`?
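Evaluating that expression in plain Scala shows what it computes: `random.nextLong % size` lands in `(-size, size)` (Java/Scala `%` keeps the sign of the dividend, so it can be negative), and integer-dividing by `batchSize` then yields a bucket id in roughly `(-size/batchSize, size/batchSize)`. Taking `math.abs` first would keep ids non-negative. A small sketch (the seed and sizes are assumptions):

```scala
import scala.util.Random

object BucketExpr {
  def main(args: Array[String]): Unit = {
    val random = new Random(7)   // hypothetical seed
    val size = 1000000L          // total entry count, e.g. from a first pass
    val batchSize = 50000L

    // As written, negative bucket ids are possible:
    val raw = (1 to 10).map(_ => (random.nextLong() % size) / batchSize)
    assert(raw.forall(b => b > -(size / batchSize + 1) && b < size / batchSize + 1))

    // Taking the absolute value first confines ids to [0, size / batchSize]:
    val fixed = (1 to 10).map(_ => math.abs(random.nextLong() % size) / batchSize)
    assert(fixed.forall(b => b >= 0 && b <= size / batchSize))
  }
}
```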
Yep, that could be exact, since we can revisit the same persisted data on disk from the first pass. (So I guess you'd need to ensure you've done a `forceToDisk` before the operation to make it deterministic.) But that'd work, I think.
I am currently trying to solve this problem. I need to partition a pipe into multiple groups, where each group can contain at most 50000 (say, `batchSize`) entries.
@aalmahmud that can be a workaround, as long as your total number of buckets isn't too many reducers to spawn in one step. I will work on a PR based on this thread so we have a proper solution.
With a TypedPipe, suppose we want to generate groups such that there are at most k values in each group. This can currently be done using something like:
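One common shape for this (a hypothetical sketch, not necessarily the snippet originally posted here) is to number every value and key it by `index / k`, which gives an exact at-most-k guarantee; in plain Scala the idea looks like:

```scala
object BatchByIndex {
  // Plain-Scala analogue of numbering all values under one key and
  // batching by index / k (reconstruction; names are made up):
  def batched[V](values: Seq[V], k: Int): Map[Long, Seq[V]] =
    values.zipWithIndex
      .map { case (v, i) => (i.toLong / k, v) } // batch id = index / k
      .groupBy(_._1)
      .map { case (g, pairs) => (g, pairs.map(_._2)) }

  def main(args: Array[String]): Unit = {
    val groups = batched(1 to 10, 3)
    // Every batch holds at most k = 3 values, and nothing is dropped.
    assert(groups.values.forall(_.size <= 3))
    assert(groups.values.map(_.size).sum == 10)
  }
}
```

In Scalding this corresponds to funneling everything through a single key (a `groupAll`-style step), which is exactly why it gets slow on large data.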
However, this can be too slow for large datasets. It should be possible to first compute the total size of the dataset (via `groupAll` + `size`) and use that instead. The extra MR step should still be faster than the earlier `groupAll`. Will something like the following work?
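A plain-Scala sketch of that two-pass idea (a guess at the shape, not the code from the issue): pass one counts the data, standing in for `groupAll` + `size`; pass two assigns each value to one of `ceil(size / batchSize)` buckets, standing in for a `map` + `group`:

```scala
import scala.util.Random

object TwoPassBatching {
  def main(args: Array[String]): Unit = {
    val data = (1 to 100000).toSeq   // stand-in for a persisted TypedPipe
    val batchSize = 5000

    // Pass 1: analogue of groupAll + size.
    val size = data.size.toLong

    // Pass 2: derive the bucket count from the total size, then assign
    // each value a random bucket. Sizes are ~batchSize on average but
    // only approximately bounded unless assignment is index-based.
    val numBuckets = ((size + batchSize - 1) / batchSize).toInt
    val random = new Random(0)       // hypothetical seed
    val grouped = data.groupBy(_ => random.nextInt(numBuckets))

    assert(numBuckets == 20)
    assert(grouped.keySet.subsetOf((0 until numBuckets).toSet))
    assert(grouped.values.map(_.size).sum == size)
  }
}
```

As noted above, random assignment only bounds bucket sizes in expectation; making the bound exact requires a deterministic second pass over persisted data (hence the `forceToDisk` caveat in this thread).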