Communication Efficient Distributed Machine Learning with the Parameter Server
This paper proposed several ideas to optimize communication between workers and parameter servers. It demonstrated the efficacy of the approach by implementing the Proximal Gradient Method in this framework, and also mathematically proved that the method converges under the bounded-delay PS model (see below). The most important thing for us, I think, is the categorization of PS consistency models.
Sequential distributed SGD
Worker loop, at iteration k
Load training dataset
Compute gradient g(k) using model m(k)
Push g(k) to PS
Pull m(k+1) from PS
PS loop, at iteration k
Sum g(k) from all workers
Compute m(k+1) from m(k) and the gradient sum
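Below is a minimal in-process sketch of the two loops above, assuming a toy least-squares objective and a single process simulating four workers; `ParameterServer` and `worker_gradient` are hypothetical names for illustration, not the paper's actual API.

```python
import numpy as np

# Sketch of sequential (fully synchronous) distributed SGD on a toy
# least-squares problem.  All names here are illustrative, not the paper's API.

class ParameterServer:
    def __init__(self, dim, lr=0.1):
        self.m = np.zeros(dim)   # model m(k)
        self.lr = lr
        self.buffer = []         # gradients pushed during iteration k

    def push(self, grad):
        self.buffer.append(grad)

    def update(self):
        # Sum g(k) from all workers, then compute m(k+1).
        g = np.sum(self.buffer, axis=0)
        self.m -= self.lr * g / len(self.buffer)
        self.buffer = []

    def pull(self):
        return self.m.copy()

def worker_gradient(m, X, y):
    # Gradient of 0.5 * ||X m - y||^2 on this worker's data shard.
    return X.T @ (X @ m - y)

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
true_m = rng.normal(size=5)
y = X @ true_m
shards = np.array_split(np.arange(400), 4)   # 4 workers, one shard each

ps = ParameterServer(dim=5, lr=0.002)
for k in range(100):
    m_k = ps.pull()
    for idx in shards:                        # every worker finishes before the PS updates
        ps.push(worker_gradient(m_k, X[idx], y[idx]))
    ps.update()

print("error:", np.linalg.norm(ps.m - true_m))
```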
Relaxations
The latency of the sequential approach is dominated by the slowest worker. There are two possible relaxations:
Eventual consistency
This is essentially the same as Downpour SGD in DistBelief, where both the PS and the workers run asynchronously.
Bounded Delay
This method limits the staleness of the parameters, e.g. a worker at iteration k will block until all parameter updates from τ iterations ago are finished.
The authors also observed that as the delay bound increases, the learning rate should be decreased to ensure convergence.
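Here is a small sketch of just the blocking rule, under one reasonable reading of the bound (a worker starting iteration k waits until every iteration up to k - τ - 1 has been applied at the PS): τ = 0 then degenerates to the sequential model, while a very large τ behaves like eventual consistency. `BoundedDelayClock` is a hypothetical helper, not the paper's implementation.

```python
import threading

# Sketch of the bounded-delay blocking rule only (no gradients or model state).
# BoundedDelayClock is a hypothetical helper, not the paper's API.

class BoundedDelayClock:
    def __init__(self, tau):
        self.tau = tau
        self.applied = -1                  # highest iteration already applied at the PS
        self.cv = threading.Condition()

    def worker_wait(self, k):
        # Called by a worker before it starts iteration k.
        with self.cv:
            self.cv.wait_for(lambda: self.applied >= k - self.tau - 1)

    def server_applied(self, k):
        # Called by the PS after the update for iteration k is fully applied.
        with self.cv:
            self.applied = max(self.applied, k)
            self.cv.notify_all()

# Toy usage: with tau = 1, iteration 2 may start once iteration 0 is applied,
# even though iteration 1 is still outstanding.
clock = BoundedDelayClock(tau=1)
clock.server_applied(0)
clock.worker_wait(2)       # returns immediately: applied (0) >= 2 - 1 - 1
print("iteration 2 allowed with one outstanding update")
```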
Other Communication-Saving Techniques
The authors also proposed several methods to reduce the bandwidth cost of transmitting parameters (a sketch of the first two filters follows the list):
Only push parameters with significant change.
Only push a random subset of parameters.
(In proximal GD) Only push gradients that affect parameter computation on the PS.
Cache ranges of keys and use a hash of the range when pulling parameters.
Use lossy or lossless compression on parameter values.
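Below is a rough sketch of the first two filters under assumed semantics (push only coordinates whose change since the last push exceeds a threshold, or push only a random subset of coordinates); `significant_filter` and `random_filter` are hypothetical names and not how the paper's filters are actually implemented.

```python
import numpy as np

# Illustrative filters that decide which coordinates a worker pushes.
# Hypothetical names and semantics, not the paper's implementation.

def significant_filter(new, last_pushed, threshold):
    """Return indices and values of coordinates that changed by more than threshold."""
    idx = np.flatnonzero(np.abs(new - last_pushed) > threshold)
    return idx, new[idx]

def random_filter(values, keep_prob, rng):
    """Return a random subset of coordinates, each kept with probability keep_prob."""
    idx = np.flatnonzero(rng.random(values.shape[0]) < keep_prob)
    return idx, values[idx]

rng = np.random.default_rng(0)
grad = rng.normal(size=10)
last = np.zeros(10)

idx, vals = significant_filter(grad, last, threshold=1.0)
print("significant coords:", idx, vals)

idx, vals = random_filter(grad, keep_prob=0.3, rng=rng)
print("random coords:", idx, vals)
```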