Description
How to scale distributed deep learning?
The paper compares the strengths and weaknesses of synchronous and asynchronous SGD in terms of scalability and convergence speed. The authors consider All-Reduce SGD for the synchronous case, and Elastic Averaging SGD and Gossiping SGD for the asynchronous case.
In general, synchronous all-reduce systems suffer from straggling processors and are not robust in the face of failing processors. On the other hand, asynchronous SGD that relies on a parameter server (PS) may create a network bottleneck at the PS, slowing convergence.
Algorithms
Synchronous All-Reduce SGD
For iteration k:
every node i computes its own model update ΔΘᵢ
compute the global model update ΔΘ ⇐ all-reduce-average(ΔΘᵢ)
every node i updates its own copy of the global model Θ ⇐ Θ + ΔΘ
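A minimal single-process sketch of this loop on a toy least-squares problem (numpy only; the problem, node count, and step size are illustrative choices, and the all-reduce-average step is just np.mean here, whereas a real deployment would use something like MPI or NCCL all-reduce):

```python
import numpy as np

# Toy least-squares problem: each "node" holds its own data shard.
rng = np.random.default_rng(0)
p, n_per_node, d = 4, 64, 10          # number of nodes, samples per node, dimension
true_theta = rng.normal(size=d)
X = [rng.normal(size=(n_per_node, d)) for _ in range(p)]
y = [x @ true_theta + 0.01 * rng.normal(size=n_per_node) for x in X]

theta = np.zeros(d)                    # every node starts from the same model copy
lr = 0.05

for k in range(200):                   # iteration k
    # every node i computes its own model update from its local shard
    local_updates = [-lr * (X[i].T @ (X[i] @ theta - y[i])) / n_per_node
                     for i in range(p)]
    # all-reduce-average: np.mean stands in for a real all-reduce
    delta = np.mean(local_updates, axis=0)
    # every node applies the same averaged update, so all copies stay identical
    theta = theta + delta

print("distance to true_theta:", np.linalg.norm(theta - true_theta))
```

Because every node applies the identical averaged update, all model copies stay in sync, which makes the method equivalent to large-batch SGD but also sensitive to the slowest node in each iteration.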
Elastic Averaging SGD
Unlike "traditional" asynchronous SGD, where workers send gradients to the PS, elastic averaging SGD sends model differences to the PS.
Worker loop:
compute local model Θ(l)
every τ local steps:
pull global model Θ(g)
compute model difference Θ(u) = β(Θ(l) - Θ(g))
send Θ(u) to PS
update local model Θ(l) ⇐ Θ(l) - Θ(u)
PS loop:
receive model difference Θ(u)
update global model Θ(g) ⇐ Θ(g) + Θ(u)
Note that:
- β is a hyperparameter controlling the moving rate
- the local model and the global model move in opposite directions with respect to the model difference Θ(u)
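A single-process sketch of the two loops above on a toy quadratic problem (the workers are stepped round-robin here instead of running truly asynchronously; β, τ, the step size, and the data are illustrative choices, not values from the paper):

```python
import numpy as np

# Toy problem: each worker owns a data shard and a local model; one array plays the PS.
rng = np.random.default_rng(0)
p, n_per_worker, d = 4, 64, 10
true_theta = rng.normal(size=d)
X = [rng.normal(size=(n_per_worker, d)) for _ in range(p)]
y = [x @ true_theta for x in X]

theta_g = np.zeros(d)                       # global model Θ(g) held by the PS
theta_l = [np.zeros(d) for _ in range(p)]   # local model Θ(l) of each worker
lr, beta, tau = 0.05, 0.5, 5

for step in range(1, 501):
    for i in range(p):
        # worker loop: local SGD step -> "compute local model Θ(l)"
        grad = X[i].T @ (X[i] @ theta_l[i] - y[i]) / n_per_worker
        theta_l[i] -= lr * grad
        if step % tau == 0:
            # every τ steps: elastic difference between local and global model
            theta_u = beta * (theta_l[i] - theta_g)
            theta_l[i] -= theta_u        # worker moves toward the global model
            theta_g += theta_u           # PS loop: global model moves toward the worker

print("worker-0 distance:", np.linalg.norm(theta_l[0] - true_theta))
print("global distance:  ", np.linalg.norm(theta_g - true_theta))
```

The elastic term pulls each local model toward the global one while the PS drifts toward the workers, so workers can explore away from the global model between communications.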
Gossiping SGD
This algorithm can be seen as an asynchronous version of all-reduce SGD.
"Pull" version, worker loop:
compute local model Θ
every τ local steps:
pick a random target t and pull its model Θ(t)
update local model Θ ⇐ average(Θ, Θ(t))
"Push" version, worker loop:
compute local model Θ
every τ local steps:
pick a random target t, push Θ to t and to itself
update local model Θ ⇐ average(all received Θ)
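A single-process sketch of the "pull" variant (again round-robin rather than truly asynchronous; all constants are illustrative). The "push" variant differs only in that a worker sends its model to the target and to itself, then averages everything it received:

```python
import numpy as np

# Toy problem: p workers, each with its own shard and its own model copy.
rng = np.random.default_rng(0)
p, n_per_worker, d = 8, 64, 10
true_theta = rng.normal(size=d)
X = [rng.normal(size=(n_per_worker, d)) for _ in range(p)]
y = [x @ true_theta for x in X]

theta = [np.zeros(d) for _ in range(p)]   # every worker keeps its own model Θ
lr, tau = 0.05, 5

for step in range(1, 501):
    for i in range(p):
        # local SGD step on worker i's shard
        grad = X[i].T @ (X[i] @ theta[i] - y[i]) / n_per_worker
        theta[i] -= lr * grad
    if step % tau == 0:
        # gossip round: each worker pulls a random peer's model and averages with it
        snapshot = [t.copy() for t in theta]           # models as visible this round
        for i in range(p):
            j = rng.integers(p)                        # pick a random target t
            theta[i] = 0.5 * (theta[i] + snapshot[j])  # Θ ⇐ average(Θ, Θ(t))

print("max distance to true_theta:",
      max(np.linalg.norm(t - true_theta) for t in theta))
```

Each gossip round only mixes pairs of models, so consensus spreads gradually through the cluster instead of being enforced every iteration as in all-reduce.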
Findings
- For large step sizes, asynchronous SGD methods scale better.
- For smaller step sizes, all-reduce converges faster and gives more accurate results.
- For a small cluster (p = 8 nodes), all methods converge to roughly the same minimum, and elastic averaging SGD performs better than the others.
- For larger clusters (p = 16 and above), elastic averaging SGD performs badly, especially as the step size becomes smaller. Gossiping SGD is a bit better but also has convergence problems at small step sizes.