
[ICML 2023] Decentralized SGD and Average-direction SAM are Asymptotically Equivalent

License: MIT · arXiv · Blog · Twitter · Slides · Poster

The Best of All Worlds: Embracing Decentralization for Improved Communication Efficiency, Privacy, and Generalization

This repository contains the official implementation of the paper:

[ICML 2023] Decentralized SGD and Average-direction SAM are Asymptotically Equivalent


Overview

Motivating question: The Best of All Worlds? Can we guarantee communication efficiency, privacy, and generalizability all at once? Our ICML 2023 paper proves that decentralized training might be the answer!

TL;DR: This is the first work to reveal the surprising sharpness-aware minimization nature of decentralized learning. We provide a completely new perspective for understanding decentralization, which helps bridge the gap between theory and practice in decentralized learning.

Abstract: Decentralized stochastic gradient descent (D-SGD) allows collaborative learning on massive devices simultaneously without the control of a central server. However, existing theories claim that decentralization invariably undermines generalization. In this paper, we challenge the conventional belief and present a completely new perspective for understanding decentralized learning. We prove that D-SGD implicitly minimizes the loss function of an average-direction Sharpness-aware minimization (SAM) algorithm under general non-convex non-$\beta$-smooth settings. This surprising asymptotic equivalence reveals an intrinsic regularization-optimization trade-off and three advantages of decentralization: (1) there exists a free uncertainty evaluation mechanism in D-SGD to improve posterior estimation; (2) D-SGD exhibits a gradient smoothing effect; and (3) the sharpness regularization effect of D-SGD does not decrease as total batch size increases, which justifies the potential generalization benefit of D-SGD over centralized SGD (C-SGD) in large-batch scenarios.
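To make the equivalence concrete, here is a minimal, self-contained sketch (a toy quadratic loss and a hand-built ring mixing matrix; this is an illustration, not the repository code): each D-SGD round combines gossip averaging with local stochastic gradient steps, so the consensus model is effectively updated with gradients evaluated at slightly perturbed copies of itself, which is the average-direction SAM behaviour the paper analyzes.

```python
# Minimal sketch (not the repository code) of one run of D-SGD on a toy
# quadratic loss, highlighting the two ingredients the paper studies:
# (i) gossip averaging with a doubly stochastic mixing matrix W, and
# (ii) a local stochastic gradient step at each node.
import numpy as np

rng = np.random.default_rng(0)
n, d, lr = 4, 10, 0.1                       # nodes, parameter dim, learning rate

# Ring mixing matrix: each node averages itself with its two ring neighbors.
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = W[i, (i - 1) % n] = W[i, (i + 1) % n] = 1.0 / 3.0

A = rng.standard_normal((d, d)); A = A @ A.T / d    # toy quadratic loss x^T A x / 2
def grad(x, noise=0.1):                             # noisy local gradient oracle
    return A @ x + noise * rng.standard_normal(d)

x = rng.standard_normal((n, d))                     # local copies, one row per node
for _ in range(100):
    x = W @ x                                       # gossip averaging step
    x = x - lr * np.stack([grad(x[i]) for i in range(n)])   # local SGD steps

x_bar = x.mean(axis=0)                              # consensus (averaged) model
# The effective update of x_bar uses (1/n) * sum_i grad(x_bar + delta_i) with
# perturbations delta_i = x_i - x_bar, i.e. gradients averaged over perturbation
# directions: the average-direction SAM flavour discussed in the paper.
print("mean consensus distance:", np.linalg.norm(x - x_bar, axis=1).mean())
print("loss at consensus model:", 0.5 * x_bar @ A @ x_bar)
```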


Environment Setup

The required packages can be installed directly from requirements.txt:

pip install -r requirements.txt

Usage examples

Train ResNet-18 on CIFAR-10 using C-SGD and D-SGD with a total batch size of 1024 (16 workers × 64 per-worker batch size):

python main.py --dataset_name "CIFAR10" --image_size 56 --batch_size 64 --mode "csgd" --size 16 --lr 0.1 --model "ResNet18_M" --warmup_step 60 --milestones 2400 4800 --early_stop 6000 --epoch 6000 --seed 666 --pretrained 1 --device 0

python main.py --dataset_name "CIFAR10" --image_size 56 --batch_size 64 --mode "ring" --size 16 --lr 0.1 --model "ResNet18_M" --warmup_step 60 --milestones 2400 4800 --early_stop 6000 --epoch 6000 --seed 666 --pretrained 1 --device 0

Train ResNet-18 on CIFAR-10 using C-SGD and D-SGD with a total batch size of 8192 (16 workers × 512 per-worker batch size):

python main.py --dataset_name "CIFAR10" --image_size 56 --batch_size 512 --mode "csgd" --size 16 --lr 0.8 --model "ResNet18_M" --warmup_step 60 --milestones 2400 4800 --early_stop 6000 --epoch 6000 --seed 666 --pretrained 1 --device 0

python main.py --dataset_name "CIFAR10" --image_size 56 --batch_size 512 --mode "ring" --size 16 --lr 0.8 --model "ResNet18_M" --warmup_step 60 --milestones 2400 4800 --early_stop 6000 --epoch 6000 --seed 666 --pretrained 1 --device 0
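The --mode flag selects the communication pattern: "csgd" corresponds to centralized training (roughly, exact averaging across all workers at every step), while "ring" runs decentralized gossip in which each worker communicates only with its two ring neighbors. The sketch below illustrates the mixing matrices this implies; it is an assumed illustration, not the construction used inside main.py.

```python
# Sketch of the mixing matrices implied by the two modes above (assumed
# interpretation, not the repository's construction): "csgd" averages all
# workers exactly, while "ring" lets each worker mix with its two neighbors.
import numpy as np

size = 16                                    # matches --size 16 above

W_csgd = np.full((size, size), 1.0 / size)   # exact global averaging

W_ring = np.zeros((size, size))              # gossip over a ring
for i in range(size):
    for j in (i - 1, i, i + 1):
        W_ring[i, j % size] = 1.0 / 3.0

# Both matrices are doubly stochastic, but the ring mixes information much more
# slowly: the spectral gap (1 minus the second-largest eigenvalue magnitude)
# controls how fast the workers reach consensus.
for name, W in [("csgd", W_csgd), ("ring", W_ring)]:
    eig = np.sort(np.abs(np.linalg.eigvals(W)))[::-1]
    print(f"{name}: rows sum to 1: {np.allclose(W.sum(1), 1)}, "
          f"spectral gap ~ {1 - eig[1]:.3f}")
```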

More detailed scripts can be found in the "scripts" folder.

The 3D local loss landscape visualization is based on the linked visualization tool.
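For intuition about what such a plot computes, the following toy sketch (an illustration of the general random-direction slicing idea, not the linked visualization code) evaluates a loss surface on a 2D grid of perturbations around the trained parameters:

```python
# Rough sketch of the 3D local loss landscape idea (illustration only): evaluate
# the loss on a 2D grid of perturbations of the trained parameters along two
# random directions, then plot the resulting grid as a surface.
import numpy as np

rng = np.random.default_rng(0)
d = 50
theta = rng.standard_normal(d)                  # stand-in for trained weights
A = rng.standard_normal((d, d)); A = A @ A.T / d

def loss(w):                                    # toy quadratic loss surface
    return 0.5 * w @ A @ w

# Two random directions, rescaled to the magnitude of theta
u = rng.standard_normal(d); u *= np.linalg.norm(theta) / np.linalg.norm(u)
v = rng.standard_normal(d); v *= np.linalg.norm(theta) / np.linalg.norm(v)

alphas = np.linspace(-1, 1, 25)
surface = np.array([[loss(theta + a * u + b * v) for b in alphas] for a in alphas])
print(surface.shape)                            # (25, 25) grid for a 3D surface plot
```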


Citing this repository

Please cite our paper if you find this repo useful in your work:


@InProceedings{pmlr-v202-zhu23e,
  title     = {Decentralized {SGD} and Average-direction {SAM} are Asymptotically Equivalent},
  author    = {Zhu, Tongtian and He, Fengxiang and Chen, Kaixuan and Song, Mingli and Tao, Dacheng},
  booktitle = {Proceedings of the 40th International Conference on Machine Learning},
  pages     = {43005--43036},
  year      = {2023},
  editor    = {Krause, Andreas and Brunskill, Emma and Cho, Kyunghyun and Engelhardt, Barbara and Sabato, Sivan and Scarlett, Jonathan},
  volume    = {202},
  series    = {Proceedings of Machine Learning Research},
  month     = {23--29 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v202/zhu23e/zhu23e.pdf},
  url       = {https://proceedings.mlr.press/v202/zhu23e.html},
  abstract  = {Decentralized stochastic gradient descent (D-SGD) allows collaborative learning on massive devices simultaneously without the control of a central server. However, existing theories claim that decentralization invariably undermines generalization. In this paper, we challenge the conventional belief and present a completely new perspective for understanding decentralized learning. We prove that D-SGD implicitly minimizes the loss function of an average-direction Sharpness-aware minimization (SAM) algorithm under general non-convex non-$\beta$-smooth settings. This surprising asymptotic equivalence reveals an intrinsic regularization-optimization trade-off and three advantages of decentralization: (1) there exists a free uncertainty evaluation mechanism in D-SGD to improve posterior estimation; (2) D-SGD exhibits a gradient smoothing effect; and (3) the sharpness regularization effect of D-SGD does not decrease as total batch size increases, which justifies the potential generalization benefit of D-SGD over centralized SGD (C-SGD) in large-batch scenarios.}
}

Contact

Please feel free to contact us via email (raiden@zju.edu.cn) or WeChat (RaidenT_T) if you have any questions.