
Reading notes for "How to scale distributed deep learning?" by Li #147


How to scale distributed deep learning?

The paper compares the strengths and weaknesses of synchronous and asynchronous SGD in terms of scalability and convergence speed. The authors consider All-Reduce SGD for the synchronous case, and Elastic Averaging SGD and Gossiping SGD for the asynchronous case.

In general, synchronous all-reduce systems suffer from straggling processors and are not robust to processor failures. On the other hand, asynchronous SGD using a parameter server (PS) may create a network bottleneck at the PS, slowing convergence.

Algorithms

Synchronous All-Reduce SGD

For iteration k:

    every node i computes its own model update ΔΘᵢ
    compute the global model update ΔΘ ⇐ all-reduce-average(ΔΘᵢ)
    every node i updates its own copy of the global model Θ ⇐ Θ + ΔΘ
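A minimal single-process sketch of this loop (not from the paper): p nodes are simulated on a least-squares problem in NumPy, and the data, step size, and all_reduce_average helper are made up for illustration, with the helper standing in for a real collective such as an MPI or NCCL all-reduce.

    import numpy as np

    rng = np.random.default_rng(0)
    p, n, d = 4, 128, 10                          # nodes, samples per node, dimension
    X = [rng.normal(size=(n, d)) for _ in range(p)]
    w_true = rng.normal(size=d)
    y = [Xi @ w_true + 0.01 * rng.normal(size=n) for Xi in X]

    def local_update(theta, i, lr=0.1):
        # one local gradient step on node i's data shard; returns the model delta
        grad = X[i].T @ (X[i] @ theta - y[i]) / n
        return -lr * grad

    def all_reduce_average(deltas):
        # stand-in for an all-reduce collective: every node receives the mean
        return np.mean(deltas, axis=0)

    theta = np.zeros(d)                           # global model, replicated on every node
    for k in range(200):
        deltas = [local_update(theta, i) for i in range(p)]  # each node computes ΔΘᵢ
        delta = all_reduce_average(deltas)                   # ΔΘ ⇐ all-reduce-average(ΔΘᵢ)
        theta = theta + delta                                # Θ ⇐ Θ + ΔΘ (identical on all nodes)

    print("distance to w_true:", np.linalg.norm(theta - w_true))

Because every node applies the same averaged update, all replicas of Θ stay identical; the whole iteration is gated on the slowest node, which is where the straggler problem comes from.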

Elastic Averaging SGD

Unlike "traditional" asynchronous SGD methods, where workers send gradients to the parameter server (PS), elastic averaging SGD sends model differences to the PS.

Worker loop:

    run local SGD steps on the local model Θ(l)
    every τ iterations:
        pull global model Θ(g)
        compute model difference Θ(u) = β(Θ(l) - Θ(g))
        send Θ(u) to PS
        update local model Θ(l) ⇐ Θ(l) - Θ(u)

PS loop:

    receive model difference Θ(u)
    update global model Θ(g) ⇐ Θ(g) + Θ(u)

Note that:

  • β is a hyperparameter controlling the moving rate
  • the local and global models move in opposite directions with respect to the model difference Θ(u)
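A single-process sketch of the worker and PS loops (not from the paper): p simulated workers are updated round-robin as a stand-in for real asynchrony, and the data, β, τ, and step size are illustrative choices rather than the paper's settings.

    import numpy as np

    rng = np.random.default_rng(0)
    p, n, d = 4, 128, 10
    X = [rng.normal(size=(n, d)) for _ in range(p)]
    w_true = rng.normal(size=d)
    y = [Xi @ w_true + 0.01 * rng.normal(size=n) for Xi in X]

    beta, tau, lr = 0.5, 4, 0.1
    theta_g = np.zeros(d)                       # PS copy of the global model Θ(g)
    theta_l = [np.zeros(d) for _ in range(p)]   # per-worker local models Θ(l)

    for k in range(1, 401):
        for i in range(p):                      # round-robin stands in for asynchrony
            grad = X[i].T @ (X[i] @ theta_l[i] - y[i]) / n
            theta_l[i] -= lr * grad             # local SGD step
            if k % tau == 0:                    # every τ iterations, talk to the PS
                theta_u = beta * (theta_l[i] - theta_g)   # Θ(u) = β(Θ(l) - Θ(g))
                theta_g += theta_u              # PS: Θ(g) ⇐ Θ(g) + Θ(u)
                theta_l[i] -= theta_u           # worker: Θ(l) ⇐ Θ(l) - Θ(u)

    print("global model error:", np.linalg.norm(theta_g - w_true))

The last two lines show the "elastic" coupling from the notes above: the global model is pulled toward the worker and the worker is pulled toward the global model, with β controlling how strongly.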

Gossiping SGD

This algorithm can be seen as an asynchronous version of all-reduce SGD.

"Pull" version, worker loop:

    run local SGD steps on the local model Θ
    every τ iterations:
        pick random target t, pull model Θ(t) from it
        update local model Θ ⇐ average(Θ, Θ(t))

"Push" version, worker loop:

    run local SGD steps on the local model Θ
    every τ iterations:
        pick a random target t, push Θ to t and to itself
        update local model Θ ⇐ average(all received Θ)
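A single-process sketch of the "pull" variant (not from the paper): p simulated workers occasionally average their model with a random peer's, and the data, τ, and step size are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    p, n, d = 8, 128, 10
    X = [rng.normal(size=(n, d)) for _ in range(p)]
    w_true = rng.normal(size=d)
    y = [Xi @ w_true + 0.01 * rng.normal(size=n) for Xi in X]

    tau, lr = 4, 0.1
    theta = [np.zeros(d) for _ in range(p)]     # one local model per worker

    for k in range(1, 401):
        snapshot = [t.copy() for t in theta]    # models as peers would see them this round
        for i in range(p):
            grad = X[i].T @ (X[i] @ theta[i] - y[i]) / n
            theta[i] -= lr * grad               # local SGD step
            if k % tau == 0:                    # every τ iterations, gossip
                t = rng.integers(p)             # pick a random target t
                theta[i] = 0.5 * (theta[i] + snapshot[t])   # Θ ⇐ average(Θ, Θ(t))

    print("mean error:", np.mean([np.linalg.norm(t - w_true) for t in theta]))

There is no parameter server and no global barrier: each worker only touches one random peer per gossip round, so the models drift apart between rounds and are only kept close in expectation.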

Findings

  • For large step sizes, the asynchronous SGD methods scale better.
  • For smaller step sizes, all-reduce converges faster and gives more accurate results.
  • For a small cluster (p = 8), all methods converge to roughly the same minimum, with elastic averaging SGD performing better than the others.
  • For larger clusters (p = 16 and up), elastic averaging SGD performs badly, especially as the step size becomes smaller. Gossiping SGD does a bit better but also has convergence problems for small step sizes.
