Reading notes of "Communication Efficient Distributed Machine Learning with the Parameter Server" by Li #138

zou000 commented Dec 28, 2018

Communication Efficient Distributed Machine Learning with the Parameter Server

This paper proposed several ideas to optimize communication between workers and parameter servers. It demonstrated the efficacy of the approach by implementing the proximal gradient method in this framework, and it also proved mathematically that the method converges under the bounded-delay consistency model (see below). The most important thing for us, I think, is the categorization of PS consistency models.

Sequential distributed SGD

  • Worker loop, at iteration k

    Load training dataset
    Compute gradient g(k) using model m(k)
    Push g(k) to PS
    Pull m(k+1) from PS
    
  • PS loop, at iteration k

    Sum g(k) from all workers
    Compute m(k+1) from m(k) and the gradient sum
    

Relaxations

The latency of the sequential approach is dominated by the slowest worker. There are two possible relaxations:

Eventual consistency

This is essentially the same as Downpour SGD in DistBelief, where both the PS and the workers run asynchronously.

Bounded Delay

This method limits the staleness of the parameters: for example, a worker at iteration k will block until all parameter computations from τ iterations ago have finished.

The authors also observed that as the delay bound increases, the learning rate should be decreased to ensure convergence.
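
A small sketch of one reading of the blocking condition, assuming simple logical iteration counters rather than the paper's task-dependency machinery (the function name and arguments are illustrative). Note that τ = 0 recovers the sequential case and τ = ∞ recovers eventual consistency:

    # Bounded-delay check: a worker about to start iteration k proceeds only
    # if the PS has already applied all updates up to iteration k - tau.
    # Logical counters stand in for the paper's dependency tracking.
    def may_start(worker_iter: int, ps_completed_iter: int, tau: int) -> bool:
        """True if the worker's staleness is within the delay bound tau."""
        return worker_iter - ps_completed_iter <= tau

    # Example: the PS has applied updates through iteration 7, tau = 2.
    assert may_start(worker_iter=9, ps_completed_iter=7, tau=2)       # may proceed
    assert not may_start(worker_iter=10, ps_completed_iter=7, tau=2)  # must block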

Other Communication-Saving Techniques

The authors also proposed several methods to reduce the bandwidth cost of transmitting parameters; the first two are sketched after the list.

  • Only push parameters with significant changes.
  • Only push a random subset of parameters.
  • (In proximal GD) Only push gradients that affect the parameter computation on the PS.
  • Cache ranges of keys and use a hash of the range when pulling parameters.
  • Use lossy or lossless compression on parameter values.
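
A sketch of the first two filters as a worker might apply them before a push; the threshold, keep fraction, and function names are illustrative assumptions, not values from the paper:

    # Worker-side filters applied to a gradient vector before pushing it.
    import numpy as np

    def significant_change_filter(grad, threshold=1e-3):
        # Keep only entries whose magnitude exceeds the threshold and send
        # them as (index, value) pairs instead of the dense vector.
        idx = np.flatnonzero(np.abs(grad) > threshold)
        return idx, grad[idx]

    def random_subset_filter(grad, keep_fraction=0.1, rng=None):
        # Push only a random subset of coordinates this iteration.
        rng = rng or np.random.default_rng()
        k = max(1, int(keep_fraction * len(grad)))
        idx = rng.choice(len(grad), size=k, replace=False)
        return idx, grad[idx]

    g = np.random.default_rng(1).normal(scale=1e-3, size=1000)
    idx, vals = significant_change_filter(g)
    print(f"significant-change filter keeps {len(idx)} of {len(g)} entries")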