This repository has been archived by the owner on Mar 10, 2021. It is now read-only.

Reduce shuffle write size in DistributedLDA #4

Closed
rjagerman opened this issue Jun 10, 2015 · 2 comments

Comments

@rjagerman
Owner

The shuffle write size grows with the number of topics (K), the number of vocabulary terms (V), and the number of documents. Even for a relatively small topic model (K=50, V=100,000) the shuffle write is already larger than the input: e.g. 40 GB of input data produces a 50 GB shuffle write.

Running LDA on a data set larger than the cluster's storage size will likely result in failures due to insufficient space for the shuffle writes.
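
For a rough sense of scale, a back-of-the-envelope sketch (the partition count below is a hypothetical assumption, not a measured value): if every partition has to emit a dense K×V matrix of doubles during aggregation, the shuffle volume ends up in the same ballpark as the numbers above.

```scala
// Back-of-the-envelope estimate of per-iteration shuffle volume, assuming each
// partition emits a dense K x V matrix of Double counts that must be merged.
object ShuffleEstimate {
  def main(args: Array[String]): Unit = {
    val K = 50                // number of topics
    val V = 100000            // vocabulary size
    val numPartitions = 1000  // hypothetical partition count for a ~40 GB input

    val bytesPerMatrix = K.toLong * V * 8L             // 8 bytes per Double
    val totalShuffleBytes = bytesPerMatrix * numPartitions

    println(f"Per-partition matrix: ${bytesPerMatrix / 1e6}%.1f MB")    // ~40 MB
    println(f"Total shuffle write:  ${totalShuffleBytes / 1e9}%.1f GB") // ~40 GB
  }
}
```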

@rjagerman
Owner Author

Accumulators seem promising. The documentation only mentions accumulators for single values, but perhaps they can be extended to matrices.
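
A minimal sketch of what that could look like with the Spark 1.x `AccumulatorParam` API, assuming a Breeze `DenseMatrix[Double]` holds the K×V topic-term counts (the names `k`, `vocabSize`, `documents` and `localCounts` are placeholders):

```scala
import breeze.linalg.DenseMatrix
import org.apache.spark.AccumulatorParam

// Sketch of a matrix-valued accumulator (Spark 1.x API).
object DenseMatrixAccumulatorParam extends AccumulatorParam[DenseMatrix[Double]] {
  // Merging two partial results is an element-wise sum.
  def addInPlace(m1: DenseMatrix[Double], m2: DenseMatrix[Double]): DenseMatrix[Double] =
    m1 += m2

  // The zero element is an all-zeros matrix of the same shape.
  def zero(initial: DenseMatrix[Double]): DenseMatrix[Double] =
    DenseMatrix.zeros[Double](initial.rows, initial.cols)
}

// Usage sketch:
// val counts = sc.accumulator(DenseMatrix.zeros[Double](k, vocabSize))(DenseMatrixAccumulatorParam)
// documents.foreach { doc => counts += localCounts(doc) }  // localCounts is a hypothetical helper
// val globalCounts = counts.value                          // only readable on the driver
```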

@rjagerman
Owner Author

Accumulators suffer from the same problem, as they are implemented internally as a reduce operation.

It turns out that this is intended behavior in Spark. Any reduce operation is a 'pull' operation, not a 'push' operation. As a consequence, the full shuffle write has to fit in the cluster (preferably in memory, although a spill-to-disk mechanism exists in the newest versions of Spark). Since the shuffle write grows with the number of partitions (and thus with the data set), this becomes problematic when the full data set does not fit.
For more information, see section 3.2.2 of Optimizing Shuffle Performance in Spark.
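
To make the 'pull' behaviour concrete, a sketch of the equivalent explicit aggregation (reusing the hypothetical `documents`, `k`, `vocabSize` and `localCounts` names from above): every partition materializes one partial K×V matrix, and all of those partials have to be held and merged by the cluster, which is the same partial state the accumulator route moves around.

```scala
// Pull-style aggregation: each partition produces a partial K x V count
// matrix, and all partials are shuffled/merged toward the driver.
val globalCounts: DenseMatrix[Double] =
  documents
    .mapPartitions { docs =>
      val partial = DenseMatrix.zeros[Double](k, vocabSize)
      docs.foreach(doc => partial += localCounts(doc))  // hypothetical per-document counts
      Iterator.single(partial)
    }
    .reduce(_ += _)
```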
