
Reading notes for "How to scale distributed deep learning?" by Li #147


How to scale distributed deep learning?

The paper compares the strengths and weaknesses of synchronous and asynchronous SGD in terms of scalability and convergence speed. The authors consider All-Reduce SGD for the synchronous case, and Elastic Averaging SGD and Gossiping SGD for the asynchronous case.

In general, synchronous all-reduce systems suffer from straggling processors and are not robust to processor failures. On the other hand, asynchronous SGD using a parameter server (PS) may create a network bottleneck at the PS, slowing convergence.

Algorithms

Synchronous All-Reduce SGD

For iteration k:

    every node i computes its own model update ΔΘᵢ
    compute the global model update ΔΘ ⇐ all-reduce-average(ΔΘᵢ)
    every node i updates its own copy of the global model Θ ⇐ Θ + ΔΘ
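A minimal single-process sketch of this loop (not from the paper): p nodes are simulated on a least-squares problem in NumPy, and the data, step size, and all_reduce_average helper are made up for illustration, with the helper standing in for a real collective such as an MPI or NCCL all-reduce.

    import numpy as np

    rng = np.random.default_rng(0)
    p, n, d = 4, 128, 10                          # nodes, samples per node, dimension
    X = [rng.normal(size=(n, d)) for _ in range(p)]
    w_true = rng.normal(size=d)
    y = [Xi @ w_true + 0.01 * rng.normal(size=n) for Xi in X]

    def local_update(theta, i, lr=0.1):
        # one local gradient step on node i's data shard; returns the model delta
        grad = X[i].T @ (X[i] @ theta - y[i]) / n
        return -lr * grad

    def all_reduce_average(deltas):
        # stand-in for an all-reduce collective: every node receives the mean
        return np.mean(deltas, axis=0)

    theta = np.zeros(d)                           # global model, replicated on every node
    for k in range(200):
        deltas = [local_update(theta, i) for i in range(p)]  # each node computes ΔΘᵢ
        delta = all_reduce_average(deltas)                   # ΔΘ ⇐ all-reduce-average(ΔΘᵢ)
        theta = theta + delta                                # Θ ⇐ Θ + ΔΘ (identical on all nodes)

    print("distance to w_true:", np.linalg.norm(theta - w_true))

Because every node applies the same averaged update, all replicas of Θ stay identical; the whole iteration is gated on the slowest node, which is where the straggler problem comes from.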

Elastic Averaging SGD

Unlike "traditional" asynchronous SGD methods, where workers send gradients to the parameter server (PS), elastic averaging SGD sends model differences to the PS.

Worker loop:

    run local SGD steps on the local model Θ(l)
    every τ iterations:
        pull global model Θ(g)
        compute model difference Θ(u) = β(Θ(l) - Θ(g))
        send Θ(u) to PS
        update local model Θ(l) ⇐ Θ(l) - Θ(u)

PS loop:

    receive model difference Θ(u)
    update global model Θ(g) ⇐ Θ(g) + Θ(u)

Note that:

  • β is a hyperparameter controlling the moving rate
  • the local and global models move in opposite directions with respect to the model difference Θ(u)
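A single-process sketch of the worker and PS loops (not from the paper): p simulated workers are updated round-robin as a stand-in for real asynchrony, and the data, β, τ, and step size are illustrative choices rather than the paper's settings.

    import numpy as np

    rng = np.random.default_rng(0)
    p, n, d = 4, 128, 10
    X = [rng.normal(size=(n, d)) for _ in range(p)]
    w_true = rng.normal(size=d)
    y = [Xi @ w_true + 0.01 * rng.normal(size=n) for Xi in X]

    beta, tau, lr = 0.5, 4, 0.1
    theta_g = np.zeros(d)                       # PS copy of the global model Θ(g)
    theta_l = [np.zeros(d) for _ in range(p)]   # per-worker local models Θ(l)

    for k in range(1, 401):
        for i in range(p):                      # round-robin stands in for asynchrony
            grad = X[i].T @ (X[i] @ theta_l[i] - y[i]) / n
            theta_l[i] -= lr * grad             # local SGD step
            if k % tau == 0:                    # every τ iterations, talk to the PS
                theta_u = beta * (theta_l[i] - theta_g)   # Θ(u) = β(Θ(l) - Θ(g))
                theta_g += theta_u              # PS: Θ(g) ⇐ Θ(g) + Θ(u)
                theta_l[i] -= theta_u           # worker: Θ(l) ⇐ Θ(l) - Θ(u)

    print("global model error:", np.linalg.norm(theta_g - w_true))

The last two lines show the "elastic" coupling from the notes above: the global model is pulled toward the worker and the worker is pulled toward the global model, with β controlling how strongly.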

Gossiping SGD

This algorithm can be seen as an asynchronous version of all-reduce SGD.

"Pull" version, worker loop:

    run local SGD steps on the local model Θ
    every τ iterations:
        pick random target t, pull model Θ(t) from it
        update local model Θ ⇐ average(Θ, Θ(t))

"Push" version, worker loop:

    run local SGD steps on the local model Θ
    every τ iterations:
        pick a random target t, push Θ to t and to itself
        update local model Θ ⇐ average(all received Θ)
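A single-process sketch of the "pull" variant (not from the paper): p simulated workers occasionally average their model with a random peer's, and the data, τ, and step size are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    p, n, d = 8, 128, 10
    X = [rng.normal(size=(n, d)) for _ in range(p)]
    w_true = rng.normal(size=d)
    y = [Xi @ w_true + 0.01 * rng.normal(size=n) for Xi in X]

    tau, lr = 4, 0.1
    theta = [np.zeros(d) for _ in range(p)]     # one local model per worker

    for k in range(1, 401):
        snapshot = [t.copy() for t in theta]    # models as peers would see them this round
        for i in range(p):
            grad = X[i].T @ (X[i] @ theta[i] - y[i]) / n
            theta[i] -= lr * grad               # local SGD step
            if k % tau == 0:                    # every τ iterations, gossip
                t = rng.integers(p)             # pick a random target t
                theta[i] = 0.5 * (theta[i] + snapshot[t])   # Θ ⇐ average(Θ, Θ(t))

    print("mean error:", np.mean([np.linalg.norm(t - w_true) for t in theta]))

There is no parameter server and no global barrier: each worker only touches one random peer per gossip round, so the models drift apart between rounds and are only kept close in expectation.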

Findings

  • For large step sizes, the asynchronous SGD methods scale better.
  • For smaller step sizes, all-reduce converges faster and gives more accurate results.
  • For a small cluster (p = 8), all methods converge to roughly the same minimum, with elastic averaging SGD performing better than the others.
  • For larger clusters (p = 16 and up), elastic averaging SGD performs badly, especially as the step size becomes smaller. Gossiping SGD does a bit better but also has convergence problems for small step sizes.
