Changes from all commits
45 commits
- efb91b3 Update diffGrad.py (shivram1987, Feb 12, 2020)
- 93fb7fc Create diffGrad_v2.py (shivram1987, Mar 1, 2020)
- 9a834f5 Update diffGrad.py (shivram1987, Mar 1, 2020)
- 3642f5a Update README.md (shivram1987, Mar 1, 2020)
- 03e3b08 Update README.md (shivram1987, Mar 1, 2020)
- f48985a Update README.md (shivram1987, Mar 1, 2020)
- 9b847c3 Update README.md (swalpa, Mar 1, 2020)
- b1d5790 Update README.md (swalpa, Mar 1, 2020)
- d829732 Update README.md (swalpa, Mar 1, 2020)
- a139346 Update README.md (swalpa, Mar 1, 2020)
- e6602ef Update README.md (swalpa, Mar 1, 2020)
- 0cb8110 Update README.md (swalpa, Mar 1, 2020)
- 4da5b9e Add files via upload (swalpa, Mar 1, 2020)
- 3b9cc44 Update README.md (swalpa, Mar 1, 2020)
- 41880bc Update README.md (swalpa, Mar 1, 2020)
- b680fe6 Update README.md (swalpa, Mar 1, 2020)
- 4d229d6 Update README.md (shivram1987, Mar 1, 2020)
- 4711eb1 Update README.md (shivram1987, Mar 1, 2020)
- feef68f Update README.md (shivram1987, Mar 1, 2020)
- 04ffc1c Delete paper.pdf (shivram1987, Mar 1, 2020)
- 8b8e1d1 Update README.md (swalpa, Mar 1, 2020)
- a8d9c63 Update README.md (swalpa, Mar 1, 2020)
- 28ce049 Update README.md (shivram1987, Mar 9, 2020)
- d488956 Update README.md (swalpa, Mar 9, 2020)
- a48101f Update README.md (swalpa, Mar 9, 2020)
- f8797c0 Update diffGrad.py (suvojit-0x55aa, Apr 17, 2020)
- 6cfc81d Merge pull request #4 from suvojit-0x55aa/patch-1 (swalpa, Apr 17, 2020)
- de8d634 Update README.md (swalpa, Apr 17, 2020)
- 46a5e22 Update README.md (swalpa, Apr 17, 2020)
- 6c41db9 Update README.md (swalpa, May 3, 2020)
- 4dc28d1 Update README.md (swalpa, May 3, 2020)
- de546fe Update README.md (swalpa, May 3, 2020)
- 4ce100d Update README.md (swalpa, May 3, 2020)
- 0937223 Update README.md (swalpa, May 3, 2020)
- 89f3e60 Update README.md (swalpa, May 3, 2020)
- b652078 Update README.md (swalpa, May 3, 2020)
- 4ee7ed6 Update README.md (swalpa, May 3, 2020)
- 7598837 Update README.md (swalpa, May 3, 2020)
- 1828a87 Update README.md (swalpa, May 6, 2020)
- fcebf03 Update README.md (shivram1987, May 6, 2020)
- 0d64eff Update README.md (swalpa, Aug 3, 2020)
- 2535d78 Add files via upload (swalpa, Oct 9, 2020)
- 7b2e9d4 Update README.md (swalpa, Nov 30, 2020)
- 3b41b1e Update README.md (swalpa, Nov 30, 2020)
- 73233ba Update README.md (shivram1987, Oct 12, 2022)
56 changes: 47 additions & 9 deletions README.md
@@ -1,20 +1,58 @@
# [diffGrad: An Optimization Method for Convolutional Neural Networks](https://ieeexplore.ieee.org/document/8939562)

The PyTorch implementation of diffGrad is available in [torch-optimizer](https://pypi.org/project/torch-optimizer/#diffgrad) and can be used as follows.

## How to use

<pre>
pip install torch-optimizer

import torch_optimizer as optimizer

# model = ...
optimizer = optimizer.DiffGrad(
    model.parameters(),
    lr=1e-3,
    betas=(0.9, 0.999),
    eps=1e-8,
    weight_decay=0,
)
optimizer.step()
</pre>
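
The block above is the whole API surface; as a further illustration (ours, not part of the repository), the sketch below shows where `DiffGrad` sits in an ordinary PyTorch training loop. The toy linear model and random data are placeholders.

```python
import torch
import torch.nn as nn
import torch_optimizer as optimizer

# placeholder model and data, standing in for a real network and data loader
model = nn.Linear(10, 1)
x, y = torch.randn(32, 10), torch.randn(32, 1)

opt = optimizer.DiffGrad(model.parameters(), lr=1e-3, betas=(0.9, 0.999),
                         eps=1e-8, weight_decay=0)
loss_fn = nn.MSELoss()

for epoch in range(5):
    opt.zero_grad()             # clear gradients from the previous step
    loss = loss_fn(model(x), y)
    loss.backward()             # compute gradients
    opt.step()                  # diffGrad parameter update
```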

## Issues

It is recommended to use diffGrad_v2.py, which fixes [an issue](https://github.com/shivram1987/diffGrad/issues/2) in diffGrad.py.

It is also recommended to refer to the [arXiv version](https://arxiv.org/abs/1909.11015) for the updated results.

## Abstract

Stochastic Gradient Descent (SGD) is one of the core techniques behind the success of deep neural networks. The gradient provides information on the direction in which a function has the steepest rate of change. The main problem with basic SGD is that it changes all parameters by equal-sized steps, irrespective of gradient behavior. Hence, an efficient way of optimizing a deep network is to use adaptive step sizes for each parameter. Recently, several attempts have been made to improve gradient descent methods, such as AdaGrad, AdaDelta, RMSProp, and Adam. These methods rely on the square roots of exponential moving averages of squared past gradients and thus do not take advantage of local changes in gradients. In this paper, a novel optimizer is proposed based on the difference between the present and the immediate past gradient (i.e., diffGrad). In the proposed diffGrad optimization technique, the step size is adjusted for each parameter so that parameters with a rapidly changing gradient take larger steps and parameters with a slowly changing gradient take smaller steps. The convergence analysis is done using the regret-bound approach of the online learning framework. Rigorous analysis is carried out over three synthetic complex non-convex functions. Image categorization experiments are also conducted on the CIFAR10 and CIFAR100 datasets to compare diffGrad with state-of-the-art optimizers such as SGDM, AdaGrad, AdaDelta, RMSProp, AMSGrad, and Adam. A residual-unit (ResNet) based Convolutional Neural Network (CNN) architecture is used in the experiments. The experiments show that diffGrad outperforms the other optimizers. We also show that diffGrad performs uniformly well on networks using different activation functions.
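
For readers who want the update rule at a glance, here is a minimal sketch (ours, not code from this repository) of one diffGrad step on a single tensor, written in Adam-style notation; the helper name `diffgrad_step` and the `state` dictionary layout are our own.

```python
import torch

def diffgrad_step(p, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # lazily initialise the Adam-style moments and the stored previous gradient
    if not state:
        state.update(step=0, m=torch.zeros_like(p), v=torch.zeros_like(p),
                     prev_grad=torch.zeros_like(p))
    state['step'] += 1
    state['m'].mul_(beta1).add_(grad, alpha=1 - beta1)            # first moment
    state['v'].mul_(beta2).addcmul_(grad, grad, value=1 - beta2)  # second moment
    # diffGrad friction coefficient: sigmoid of the absolute gradient change,
    # between 0.5 (gradient barely changing) and 1 (gradient changing fast)
    dfc = torch.sigmoid((state['prev_grad'] - grad).abs())
    state['prev_grad'] = grad.clone()                             # store a copy, not a reference
    bc1 = 1 - beta1 ** state['step']
    bc2 = 1 - beta2 ** state['step']
    step_size = lr * (bc2 ** 0.5) / bc1
    # dfc scales the first moment, so parameters whose gradient changes slowly take smaller steps
    p.sub_(step_size * dfc * state['m'] / (state['v'].sqrt() + eps))

# example: one step on a scalar parameter of f(p) = p**2, whose gradient is 2*p
state, p = {}, torch.tensor([1.0])
diffgrad_step(p, 2 * p, state)
```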

## Citation

If you use this code in your research, please cite as:

@article{dubey2019diffgrad,
  title={diffGrad: An Optimization Method for Convolutional Neural Networks},
  author={Dubey, Shiv Ram and Chakraborty, Soumendu and Roy, Swalpa Kumar and Mukherjee, Snehasis and Singh, Satish Kumar and Chaudhuri, Bidyut Baran},
  journal={IEEE Transactions on Neural Networks and Learning Systems},
  volume={31},
  number={11},
  pages={4500--4511},
  year={2020},
  publisher={IEEE}
}

## Acknowledgement

All experiments are performed using the following framework: https://github.com/kuangliu/pytorch-cifar


## License

Copyright (©2019): Shiv Ram Dubey, Indian Institute of Information Technology, Sri City, Chittoor, A.P., India. Released under the MIT License. See [LICENSE](LICENSE) for details.
7 changes: 4 additions & 3 deletions diffGrad.py
@@ -1,3 +1,4 @@
# This diffGrad implementation has a bug. Use diffGrad_v2.py.
import math
import torch
from torch.optim.optimizer import Optimizer
@@ -96,9 +97,9 @@ def step(self, closure=None):
                # compute diffgrad coefficient (dfc)
                diff = abs(previous_grad - grad)
                dfc = 1. / (1. + torch.exp(-diff))
                #state['previous_grad'] = grad  # used in paper but has the bug that previous grad is overwritten with grad and diff becomes always zero. Fixed in the next line.
                state['previous_grad'] = grad.clone()

                # update momentum with dfc
                exp_avg1 = exp_avg * dfc

                step_size = group['lr'] * math.sqrt(bias_correction2) / bias_correction1
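
The comment above refers to a Python aliasing pitfall; the snippet below (ours, not repository code) illustrates why the previous gradient must be stored with `.clone()`: PyTorch reuses the gradient buffer in place across steps, so storing a bare reference makes the difference collapse to zero and pins dfc at 0.5.

```python
import torch

grad = torch.tensor([1.0, -2.0])

prev_by_reference = grad        # what the buggy line effectively stores
prev_by_copy = grad.clone()     # what the fixed line stores

# simulate the next backward pass overwriting the gradient buffer in place
grad.copy_(torch.tensor([3.0, 0.5]))

print((prev_by_reference - grad).abs())  # tensor([0., 0.])        -> dfc stuck at 0.5
print((prev_by_copy - grad).abs())       # tensor([2.0000, 2.5000]) -> informative dfc
```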
312 changes: 312 additions & 0 deletions diffGrad_Regression_Loss.ipynb

Large diffs are not rendered by default.

106 changes: 106 additions & 0 deletions diffGrad_v2.py
@@ -0,0 +1,106 @@
# Fixes a bug in original diffGrad code
import math
import torch
from torch.optim.optimizer import Optimizer
import numpy as np
import torch.nn as nn
#import torch.optim as Optimizer

class diffgrad(Optimizer):
    r"""Implements diffGrad algorithm. It is modified from the pytorch implementation of Adam.
    It has been proposed in `diffGrad: An Optimization Method for Convolutional Neural Networks`_.
    Arguments:
        params (iterable): iterable of parameters to optimize or dicts defining
            parameter groups
        lr (float, optional): learning rate (default: 1e-3)
        betas (Tuple[float, float], optional): coefficients used for computing
            running averages of gradient and its square (default: (0.9, 0.999))
        eps (float, optional): term added to the denominator to improve
            numerical stability (default: 1e-8)
        weight_decay (float, optional): weight decay (L2 penalty) (default: 0)
        amsgrad (boolean, optional): whether to use the AMSGrad variant of this
            algorithm from the paper `On the Convergence of Adam and Beyond`_
            (default: False)
    .. _diffGrad: An Optimization Method for Convolutional Neural Networks:
        https://arxiv.org/abs/1909.11015
    .. _Adam\: A Method for Stochastic Optimization:
        https://arxiv.org/abs/1412.6980
    .. _On the Convergence of Adam and Beyond:
        https://openreview.net/forum?id=ryQu7f-RZ
    """

    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=0):
        if not 0.0 <= lr:
            raise ValueError("Invalid learning rate: {}".format(lr))
        if not 0.0 <= eps:
            raise ValueError("Invalid epsilon value: {}".format(eps))
        if not 0.0 <= betas[0] < 1.0:
            raise ValueError("Invalid beta parameter at index 0: {}".format(betas[0]))
        if not 0.0 <= betas[1] < 1.0:
            raise ValueError("Invalid beta parameter at index 1: {}".format(betas[1]))
        defaults = dict(lr=lr, betas=betas, eps=eps, weight_decay=weight_decay)
        super(diffgrad, self).__init__(params, defaults)

    def __setstate__(self, state):
        super(diffgrad, self).__setstate__(state)

    def step(self, closure=None):
        """Performs a single optimization step.
        Arguments:
            closure (callable, optional): A closure that reevaluates the model
                and returns the loss.
        """
        loss = None
        if closure is not None:
            loss = closure()

        for group in self.param_groups:
            for p in group['params']:
                if p.grad is None:
                    continue
                grad = p.grad.data
                if grad.is_sparse:
                    raise RuntimeError('diffGrad does not support sparse gradients, please consider SparseAdam instead')

                state = self.state[p]

                # State initialization
                if len(state) == 0:
                    state['step'] = 0
                    # Exponential moving average of gradient values
                    state['exp_avg'] = torch.zeros_like(p.data)
                    # Exponential moving average of squared gradient values
                    state['exp_avg_sq'] = torch.zeros_like(p.data)
                    # Previous gradient
                    state['previous_grad'] = torch.zeros_like(p.data)

                exp_avg, exp_avg_sq, previous_grad = state['exp_avg'], state['exp_avg_sq'], state['previous_grad']
                beta1, beta2 = group['betas']

                state['step'] += 1

                if group['weight_decay'] != 0:
                    grad.add_(group['weight_decay'], p.data)

                # Decay the first and second moment running average coefficient
                exp_avg.mul_(beta1).add_(1 - beta1, grad)
                exp_avg_sq.mul_(beta2).addcmul_(1 - beta2, grad, grad)
                denom = exp_avg_sq.sqrt().add_(group['eps'])

                bias_correction1 = 1 - beta1 ** state['step']
                bias_correction2 = 1 - beta2 ** state['step']

                # compute diffgrad coefficient (dfc)
                diff = abs(previous_grad - grad)
                dfc = 1. / (1. + torch.exp(-diff))
                #state['previous_grad'] = grad %used in paper but has the bug that previous grad is overwritten with grad and diff becomes always zero. Fixed in the next line.
                state['previous_grad'] = grad.clone()

                # update momentum with dfc
                exp_avg1 = exp_avg * dfc

                step_size = group['lr'] * math.sqrt(bias_correction2) / bias_correction1

                p.data.addcdiv_(-step_size, exp_avg1, denom)

        return loss
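
Assuming diffGrad_v2.py is importable from the working directory, usage mirrors any torch.optim optimizer; the class name `diffgrad` is exactly the one defined above, while the model and data below are placeholders. Note that the file uses the older positional overloads of `add_`/`addcmul_`/`addcdiv_`, so recent PyTorch releases may emit deprecation warnings.

```python
import torch
import torch.nn as nn
from diffGrad_v2 import diffgrad

model = nn.Linear(4, 2)                      # placeholder model
opt = diffgrad(model.parameters(), lr=1e-3)  # defaults: betas=(0.9, 0.999), eps=1e-8, weight_decay=0

inp, target = torch.randn(8, 4), torch.randn(8, 2)
loss = nn.MSELoss()(model(inp), target)
loss.backward()
opt.step()                                   # one diffGrad_v2 update step
```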