
Conversation

iramazanli
Contributor

@iramazanli iramazanli commented May 26, 2021

Fixes: #24892

In the paper https://arxiv.org/pdf/1908.03265.pdf, Liyuan Liu et al. propose a new optimization algorithm, Rectified Adam (RAdam), similar in spirit to the Adam algorithm.

The paper shows that, without a warmup heuristic, adaptive learning-rate algorithms can suffer from undesirably large variance in their early stages, which can slow the overall convergence process.

The authors therefore propose rectifying the variance of the adaptive learning rate whenever it is expected to be high.

Differing from the paper, we set the variance tractability cutoff at 5 instead of 4. This adjustment is common practice and can be found in both the authors' code repository and the TensorFlow Swift optimizer library:

https://github.com/LiyuanLucasLiu/RAdam/blob/2f03dd197022da442c6a15c47321f4335d113a3f/radam/radam.py#L156

https://github.com/tensorflow/swift-apis/blob/f51ee4618d652a2419e998bf9418ad80bda67454/Sources/TensorFlow/Optimizers/MomentumBased.swift#L638
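
For reference, here is a minimal sketch of the rectification logic described above (illustrative only, not the exact PyTorch source; the function name and structure are ours):

```python
import math

def radam_step_size(step, beta2, lr):
    """Sketch of RAdam's rectified step size for one iteration (step >= 1).

    rho_inf is the maximum length of the approximated simple moving average
    (SMA); rho_t is its value at the current step.
    """
    rho_inf = 2.0 / (1.0 - beta2) - 1.0
    beta2_t = beta2 ** step
    rho_t = rho_inf - 2.0 * step * beta2_t / (1.0 - beta2_t)
    if rho_t > 5.0:  # cutoff of 5 rather than the paper's 4, as noted above
        # Variance is tractable: apply the rectification multiplier r_t.
        r_t = math.sqrt(
            ((rho_t - 4) * (rho_t - 2) * rho_inf)
            / ((rho_inf - 4) * (rho_inf - 2) * rho_t)
        )
        # The full update also divides by sqrt(v_hat) + eps at this point.
        return lr * r_t
    # Variance not yet tractable: fall back to an unadapted (SGD-like) step.
    return lr
```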

@facebook-github-bot
Contributor

facebook-github-bot commented May 26, 2021

💊 CI failures summary and remediations

As of commit d8ac387 (more details on the Dr. CI page and at hud.pytorch.org/pr/58968):


  • 2/2 failures possibly* introduced in this PR
    • 1/2 non-scanned failure(s)

1 failure not recognized by patterns:

Job: CircleCI pytorch_linux_xenial_py3_clang5_asan_test2 | Step: Run tests

This comment was automatically generated by Dr. CI.


iramazanli force-pushed the adding_radam branch 15 times, most recently from a648b1c to 2714cb0 on May 27, 2021 19:33
iramazanli requested a review from vincentqb on May 27, 2021 19:34
iramazanli changed the title from "To add Rectified Adam Algorithm to Optim package" to "To add Rectified Adam Algorithm to Optimizers" on Jun 1, 2021
iramazanli force-pushed the adding_radam branch 5 times, most recently from c2d19b3 to ee985fc on June 18, 2021 20:50
Contributor

@vincentqb vincentqb left a comment

LGTM!

@vincentqb
Contributor

vincentqb commented Jun 18, 2021

discussed offline: we'll keep the name as RAdam instead of PlainRAdam. If we need another version, we could explore calling it something else (ModifiedRAdam/ApproximateRAdam for this one?) or have a toggle?

@facebook-github-bot
Contributor

@iramazanli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

This pull request has been merged in 0ff3634.
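
With the PR merged, the optimizer is exposed as torch.optim.RAdam. A minimal usage sketch (the toy model and data below are purely illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy model and data, illustrative only.
model = nn.Linear(4, 1)
optimizer = torch.optim.RAdam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))

x, y = torch.randn(8, 4), torch.randn(8, 1)
loss = F.mse_loss(model(x), y)
loss.backward()
optimizer.step()
optimizer.zero_grad()
```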

facebook-github-bot pushed a commit that referenced this pull request Jun 22, 2021
Summary:
Fixes: #24892

Pull Request resolved: #58968

Reviewed By: gchanan

Differential Revision: D29241736

Pulled By: iramazanli

fbshipit-source-id: 288b9b1f3125fdc6c7a7bb23fde1ea5c201c0448
@facebook-github-bot
Contributor

This pull request has been reverted by 57967dc498dee032dc189f9ab4fc264ab905581e.

@facebook-github-bot
Contributor

This pull request has been reverted by 1abf45e.

@facebook-github-bot
Contributor

@iramazanli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@iramazanli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@iramazanli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

iramazanli force-pushed the adding_radam branch 5 times, most recently from d35dadf to a5e1c12 on June 23, 2021 21:40
@facebook-github-bot
Contributor

@iramazanli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

facebook-github-bot pushed a commit that referenced this pull request Jun 24, 2021
Summary:
Fixes: #24892

Pull Request resolved: #58968

Reviewed By: vincentqb

Differential Revision: D29310601

Pulled By: iramazanli

fbshipit-source-id: b7bd487f72f1074f266687fd9c0c6be264a748a9
facebook-github-bot pushed a commit that referenced this pull request Jun 27, 2021
Summary:
Previously, in PR #58968, we added RAdam to the optimizers. In this PR we propose a multi-tensor version of RAdam for PyTorch.

RAdam was proposed by Liyuan Liu et al. in the paper https://arxiv.org/pdf/1908.03265.pdf.

It has become one of the most widely used algorithms in the deep learning community.

Differing from the paper, we set the variance tractability cutoff at 5 instead of 4, as is common practice.

Pull Request resolved: #59161

Reviewed By: vincentqb

Differential Revision: D29360576

Pulled By: iramazanli

fbshipit-source-id: 7ccdbf12b1ee7f12e66f7d7992123a70cc818b6b
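
To illustrate the multi-tensor idea mentioned above, here is a sketch with made-up data: instead of a Python loop launching one elementwise kernel per parameter, a single horizontally fused call (via PyTorch's private _foreach_ primitives) processes the whole parameter list.

```python
import torch

# Made-up parameter and gradient lists, illustrative only.
params = [torch.randn(10) for _ in range(4)]
grads = [torch.randn(10) for _ in range(4)]

# Single-tensor style: one kernel launch per parameter.
for p, g in zip(params, grads):
    p.add_(g, alpha=-0.01)

# Multi-tensor ("foreach") style: one fused call for the whole list.
torch._foreach_add_(params, grads, alpha=-0.01)
```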
asuhan pushed a commit to asuhan/pytorch that referenced this pull request Jun 28, 2021
asuhan pushed a commit that referenced this pull request Jun 30, 2021
xuruiyang pushed a commit to facebookresearch/ReAgent that referenced this pull request Sep 20, 2025
xuruiyang pushed a commit to facebookresearch/ReAgent that referenced this pull request Sep 20, 2025


Development

Successfully merging this pull request may close these issues.

Implement RAdam optimizer ?

4 participants