The shuffle write size grows with the number of topics (K), the number of vocabulary terms (V), and the number of documents. For relatively small topic models (K=50, V=100,000) the shuffle write is already larger than the input (e.g., 40 GB of input data results in a 50 GB shuffle write).
Running LDA on a data set larger than the cluster's storage size will therefore likely fail due to insufficient space for the shuffle writes.
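For concreteness, here is a minimal sketch of the kind of job that produces these numbers. It assumes Spark MLlib's `LDA` API rather than this project's own trainer, and the corpus path, input format, K, and V below are illustrative, not taken from a specific run:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.linalg.Vectors

// Hypothetical driver: trains a K=50 topic model on a bag-of-words corpus.
// The point is that every iteration shuffles per-(doc, term, topic)
// statistics whose size grows with K, V and the number of documents.
object TrainLda {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext() // master/app name taken from spark-submit conf

    val vocabSize = 100000 // V

    // One line per document: "docId termId:count termId:count ..."
    val corpus = sc.textFile("hdfs:///corpus/bow")
      .map { line =>
        val fields = line.split(" ")
        val docId = fields.head.toLong
        val termCounts = fields.tail.map { f =>
          val Array(t, c) = f.split(":")
          (t.toInt, c.toDouble)
        }.sortBy(_._1)
        val (indices, counts) = termCounts.unzip
        (docId, Vectors.sparse(vocabSize, indices, counts))
      }
      .cache()

    val model = new LDA()
      .setK(50)              // K
      .setMaxIterations(100)
      .run(corpus)           // each iteration triggers a full shuffle

    println(s"Trained a ${model.k}-topic model on ${corpus.count()} documents")
    sc.stop()
  }
}
```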
Accumulators suffer from the same problem, since they are implemented internally as a reduce operation.
It turns out that this is intended behavior in Spark. Any reduce operation is a 'pull' operation, not a 'push' operation: the map side materializes its full output as shuffle files before the reduce side starts pulling them. As a consequence, the full shuffle write has to fit in the cluster's storage (preferably in memory, though a spill-to-disk mechanism exists in the newest versions of Spark). Since the shuffle write grows with the number of partitions (and thus with the data set), this becomes problematic when the full data set does not fit.
For more info, see section 3.2.2 of Optimizing Shuffle Performance in Spark.
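To make the mechanism concrete, a hedged sketch of the aggregation shape involved (not this project's actual code): each partition emits partial counts keyed by (termId, topicId), and the reduce that combines them is pull-based, so the shuffle write on disk scales with roughly V × K per contributing partition regardless of how small the final reduced result is.

```scala
import org.apache.spark.rdd.RDD

object ShuffleShape {
  // Illustrative aggregation shape only. Each document contributes partial
  // sufficient statistics keyed by (termId, topicId). reduceByKey is
  // pull-based: every map-side partition first writes its (combined) partial
  // counts to shuffle files on disk, and only then do the reduce-side tasks
  // pull and sum them.
  def aggregateTopicCounts(
      perDocStats: RDD[((Int, Int), Double)]): RDD[((Int, Int), Double)] =
    perDocStats.reduceByKey(_ + _)
}
```

Note that `reduceByKey` does combine counts map-side within each partition, but the combined output still has to be written out in full before any reducer can pull it, which is why the whole shuffle write must fit in the cluster's storage.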