DistBelief
Downpour SGD
Algorithm
- Model parameters are sharded across machines; each machine keeps one shard of the model.
- The training data are divided into subsets; each subset is trained with SGD on a (potentially outdated and inconsistent) copy of the model.
- Asynchrony appears in two places: the model replicas on the trainers and the model shards on the PS (see the sketch after this list).
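A minimal, self-contained sketch of the Downpour SGD pattern on a toy linear-regression objective. The `ShardedPS` class, the in-process setup, and the `n_fetch`/`n_push` knobs are illustrative assumptions, not the paper's actual interfaces; in DistBelief the shards live on separate machines and the pushes/fetches are asynchronous RPCs.

```python
import numpy as np

class ShardedPS:
    """Toy in-process stand-in for a sharded parameter server."""
    def __init__(self, shard_sizes, lr=0.1):
        self.shards = [np.zeros(n) for n in shard_sizes]
        self.lr = lr

    def fetch(self):
        # A worker pulls a (possibly stale) copy of every shard.
        return np.concatenate([s.copy() for s in self.shards])

    def push(self, grad):
        # Async push: each shard applies its slice of the gradient
        # immediately, with no coordination across workers.
        offset = 0
        for s in self.shards:
            s -= self.lr * grad[offset:offset + s.size]
            offset += s.size

def worker_loop(ps, X, y, n_fetch=5, n_push=1):
    w = ps.fetch()                          # local, soon-to-be-stale model copy
    for step in range(1, len(X) + 1):
        xi, yi = X[step - 1], y[step - 1]
        grad = 2 * (xi @ w - yi) * xi       # squared-error gradient for this sample
        w -= ps.lr * grad                   # keep training the local copy
        if step % n_push == 0:
            ps.push(grad)                   # push gradients every n_push steps
        if step % n_fetch == 0:
            w = ps.fetch()                  # refresh parameters every n_fetch steps

# Each data subset would normally be handled by a separate model replica/machine.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 4)), rng.normal(size=100)
ps = ShardedPS(shard_sizes=[2, 2])
worker_loop(ps, X, y)
```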
Findings
- "There is little theory grounding for the safety of these operations on no-convex models, but in practice [...] remarkably effective"
- An adaptive learning rate, such as Adagrad, increases the robustness of training.
- Adagrad keeps a learning rate for each parameter on the PS (a sketch follows this list):
  $\eta_{i,K} = \gamma / \sqrt{\sum_{j=1}^{K} \Delta w_{i,j}^2}$
- Warm-starting the PS with a pre-trained local model replica also helps.
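A sketch of how a PS shard could apply Adagrad: it accumulates the squared gradient for each parameter and divides a global rate $\gamma$ by the square root of that running sum, coordinate by coordinate. The class and method names (`AdagradShard`, `apply_update`, the `eps` term) are illustrative assumptions, not the paper's API.

```python
import numpy as np

class AdagradShard:
    def __init__(self, size, gamma=0.01, eps=1e-8):
        self.w = np.zeros(size)
        self.accum = np.zeros(size)   # per-parameter sum of squared gradients
        self.gamma = gamma
        self.eps = eps                # small constant to avoid division by zero

    def apply_update(self, grad):
        self.accum += grad ** 2
        eta = self.gamma / (np.sqrt(self.accum) + self.eps)  # eta_i = gamma / sqrt(sum of Δw_i^2)
        self.w -= eta * grad

shard = AdagradShard(size=4)
shard.apply_update(np.array([0.5, -0.1, 0.0, 0.2]))
```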
Sandblaster L-BFGS
L-BFGS is an offline batch training method. Sandblaster L-BFGS improves the robustness of the algorithm in a distributed environment.
Algorithm
- In addition to keeping a model shard, each PS machine carries out a small set of operations on its parameters, such as addition and dot products. A PS machine also keeps the history cache required by the L-BFGS algorithm.
- A central "coordinator" runs the L-BFGS algorithm. It doesn't fetch parameters from the PS directly; instead, it asks the PS to carry out the required operations (see the sketch after this list).
- Workers only fetch the model at the beginning of each batch.
- For robustness, Sandblaster L-BFGS employs a technique similar to MapReduce backup tasks: each batch is divided into much smaller chunks, and each worker is assigned a chunk to work on.
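A sketch of the coordinator idea: the full parameter vector is never materialized on the coordinator; it issues small vector operations (dot product, scaled addition) that each shard executes on its own slice, and only scalars are combined centrally. The `LBFGSShard`/`Coordinator` classes and their methods are illustrative assumptions, not the paper's interfaces.

```python
import numpy as np

class LBFGSShard:
    """One PS machine: holds a parameter slice plus its slice of the L-BFGS history vectors."""
    def __init__(self, size):
        self.vectors = {"w": np.zeros(size), "g": np.zeros(size)}

    def dot(self, a, b):
        # Partial dot product over this shard's slice only.
        return float(self.vectors[a] @ self.vectors[b])

    def axpy(self, alpha, x, y):
        # y <- alpha * x + y, computed locally on this shard's slice.
        self.vectors[y] += alpha * self.vectors[x]

class Coordinator:
    """Runs the optimization logic; never holds the full parameter vector."""
    def __init__(self, shards):
        self.shards = shards

    def dot(self, a, b):
        # Global dot product = sum of per-shard partial results (scalars only).
        return sum(s.dot(a, b) for s in self.shards)

    def axpy(self, alpha, x, y):
        for s in self.shards:
            s.axpy(alpha, x, y)

shards = [LBFGSShard(3), LBFGSShard(2)]
for s in shards:
    s.vectors["g"][:] = 1.0
coord = Coordinator(shards)
print(coord.dot("g", "g"))   # 5.0: the 5-dimensional dot product, computed shard-locally
```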
Findings
- Downpour SGD with Adagrad uses fewer resources than Sandblaster L-BFGS