This repository has been archived by the owner on Mar 10, 2021. It is now read-only.

Reduce shuffle write size in DistributedLDA #4

Closed
rjagerman opened this issue Jun 10, 2015 · 2 comments

Comments

@rjagerman
Owner

The shuffle write size grows with the number of topics (K), the number of vocabulary terms (V), and the number of documents. Even for a relatively small topic model (K=50, V=100,000) the shuffle write is already larger than the input: e.g. 40 GB of input data produces a 50 GB shuffle write.

Running LDA on a data set larger than the cluster's storage size will likely result in failures due to insufficient space for the shuffle writes.
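
For a rough sense of scale, a back-of-the-envelope sketch (the partition count below is a hypothetical assumption, not a measured value): if every partition has to emit a dense K×V matrix of doubles during aggregation, the shuffle volume ends up in the same ballpark as the numbers above.

```scala
// Back-of-the-envelope estimate of per-iteration shuffle volume, assuming each
// partition emits a dense K x V matrix of Double counts that must be merged.
object ShuffleEstimate {
  def main(args: Array[String]): Unit = {
    val K = 50                // number of topics
    val V = 100000            // vocabulary size
    val numPartitions = 1000  // hypothetical partition count for a ~40 GB input

    val bytesPerMatrix = K.toLong * V * 8L             // 8 bytes per Double
    val totalShuffleBytes = bytesPerMatrix * numPartitions

    println(f"Per-partition matrix: ${bytesPerMatrix / 1e6}%.1f MB")    // ~40 MB
    println(f"Total shuffle write:  ${totalShuffleBytes / 1e9}%.1f GB") // ~40 GB
  }
}
```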

@rjagerman
Owner Author

Accumulators seem promising. The documentation only mentions accumulators for single values, but perhaps they can be extended to matrices.
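
A minimal sketch of what that could look like with the Spark 1.x `AccumulatorParam` API, assuming a Breeze `DenseMatrix[Double]` holds the K×V topic-term counts (the names `k`, `vocabSize`, `documents` and `localCounts` are placeholders):

```scala
import breeze.linalg.DenseMatrix
import org.apache.spark.AccumulatorParam

// Sketch of a matrix-valued accumulator (Spark 1.x API).
object DenseMatrixAccumulatorParam extends AccumulatorParam[DenseMatrix[Double]] {
  // Merging two partial results is an element-wise sum.
  def addInPlace(m1: DenseMatrix[Double], m2: DenseMatrix[Double]): DenseMatrix[Double] =
    m1 += m2

  // The zero element is an all-zeros matrix of the same shape.
  def zero(initial: DenseMatrix[Double]): DenseMatrix[Double] =
    DenseMatrix.zeros[Double](initial.rows, initial.cols)
}

// Usage sketch:
// val counts = sc.accumulator(DenseMatrix.zeros[Double](k, vocabSize))(DenseMatrixAccumulatorParam)
// documents.foreach { doc => counts += localCounts(doc) }  // localCounts is a hypothetical helper
// val globalCounts = counts.value                          // only readable on the driver
```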

@rjagerman
Owner Author

Accumulators suffer from the same problem, as they are implemented internally as a reduce operation.

It turns out that this is intended behavior in Spark. Any reduce operation is a 'pull' operation, not a 'push' operation. As a consequence, the full shuffle write has to fit in the cluster (preferably in memory, although a spill-to-disk mechanism exists in the newest versions of Spark). Since the shuffle write grows with the number of partitions (and thus with the data set), this becomes problematic when the full data set does not fit.
For more information, see section 3.2.2 of Optimizing Shuffle Performance in Spark.
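
To make the 'pull' behaviour concrete, a sketch of the equivalent explicit aggregation (reusing the hypothetical `documents`, `k`, `vocabSize` and `localCounts` names from above): every partition materializes one partial K×V matrix, and all of those partials have to be held and merged by the cluster, which is the same partial state the accumulator route moves around.

```scala
// Pull-style aggregation: each partition produces a partial K x V count
// matrix, and all partials are shuffled/merged toward the driver.
val globalCounts: DenseMatrix[Double] =
  documents
    .mapPartitions { docs =>
      val partial = DenseMatrix.zeros[Double](k, vocabSize)
      docs.foreach(doc => partial += localCounts(doc))  // hypothetical per-document counts
      Iterator.single(partial)
    }
    .reduce(_ += _)
```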
