# Adadelta
:label:`sec_adadelta`

Adadelta is yet another variant of AdaGrad. The main difference is that it reduces the degree to which the learning rate adapts to individual coordinates. Moreover, it is traditionally described as having no learning rate, since it uses the amount of change itself as calibration for future change. The algorithm was proposed in :cite:`Zeiler.2012`. It is fairly straightforward, given the discussion of the previous algorithms.

## The Algorithm

In a nutshell, Adadelta uses two state variables: $\mathbf{s}_t$ stores a leaky average of the second moment of the gradient, and $\Delta\mathbf{x}_t$ stores a leaky average of the second moment of the change of the parameters in the model itself. Note that we use the original notation and naming of the authors for compatibility with other publications and implementations (there is no other real reason why one should use different Greek variables to indicate a parameter serving the same purpose in momentum, AdaGrad, RMSProp, and Adadelta). The parameter du jour is $\rho$. We obtain the following leaky updates:

$$\begin{aligned}
    \mathbf{s}_t & = \rho \mathbf{s}_{t-1} + (1 - \rho) \mathbf{g}_t^2, \\
    \mathbf{g}_t' & = \sqrt{\frac{\Delta\mathbf{x}_{t-1} + \epsilon}{\mathbf{s}_t + \epsilon}} \odot \mathbf{g}_t, \\
    \mathbf{x}_t  & = \mathbf{x}_{t-1} - \mathbf{g}_t', \\
    \Delta \mathbf{x}_t & = \rho \Delta\mathbf{x}_{t-1} + (1 - \rho) {\mathbf{g}_t'}^2.
\end{aligned}$$

The difference from the earlier algorithms is that we perform updates with the rescaled gradient $\mathbf{g}_t'$, which is obtained by scaling the gradient with the square root of the ratio between the leaky average of the squared parameter changes and the leaky average of the squared gradient. The use of $\mathbf{g}_t'$ is purely for notational convenience. In practice we can implement this algorithm without the need to use additional temporary space for $\mathbf{g}_t'$. As before, $\epsilon$ is a small constant ensuring nontrivial numerical results, i.e., avoiding a zero step size or a division by zero. Typically we set $\epsilon = 10^{-5}$.
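To make the four updates concrete, here is a minimal sketch in plain Kotlin that applies them elementwise to ordinary arrays, independent of DJL. The names `adadeltaStep`, `toyObjective`, and `toyGradient` are hypothetical helpers introduced only for this illustration, and the quadratic objective $f(\mathbf{x}) = 0.1 x_1^2 + 2 x_2^2$ with starting point $(-5, -2)$ is an assumed toy problem rather than part of the chapter's experiments.

```kotlin
import kotlin.math.abs
import kotlin.math.sqrt

// Hypothetical helper (not part of the chapter's code): one Adadelta step on plain arrays.
// x: parameters, g: gradient at x, s: leaky average of g^2, dx: leaky average of g'^2.
fun adadeltaStep(
    x: DoubleArray, g: DoubleArray, s: DoubleArray, dx: DoubleArray,
    rho: Double = 0.9, eps: Double = 1e-5
) {
    for (i in x.indices) {
        // s_t = rho * s_{t-1} + (1 - rho) * g_t^2
        s[i] = rho * s[i] + (1 - rho) * g[i] * g[i]
        // g'_t = sqrt((dx_{t-1} + eps) / (s_t + eps)) * g_t
        val rescaled = sqrt((dx[i] + eps) / (s[i] + eps)) * g[i]
        // x_t = x_{t-1} - g'_t
        x[i] -= rescaled
        // dx_t = rho * dx_{t-1} + (1 - rho) * g'_t^2
        dx[i] = rho * dx[i] + (1 - rho) * rescaled * rescaled
    }
}

// Assumed toy objective f(x) = 0.1 * x1^2 + 2 * x2^2 and its gradient.
fun toyObjective(x: DoubleArray) = 0.1 * x[0] * x[0] + 2.0 * x[1] * x[1]
fun toyGradient(x: DoubleArray) = doubleArrayOf(0.2 * x[0], 4.0 * x[1])

val x = doubleArrayOf(-5.0, -2.0)
val s = DoubleArray(2)   // both state arrays start at zero
val dx = DoubleArray(2)
val initialObjective = toyObjective(x)

// First step: record how far each coordinate moves.
adadeltaStep(x, toyGradient(x), s, dx)
val firstStep = doubleArrayOf(abs(x[0] + 5.0), abs(x[1] + 2.0))
println("first step sizes: ${firstStep.joinToString()}")

// Continue for another 999 steps and compare objective values.
repeat(999) { adadeltaStep(x, toyGradient(x), s, dx) }
println("objective: $initialObjective -> ${toyObjective(x)}")
```

Since both states start at zero, the first step has $\mathbf{s}_1 = (1-\rho)\mathbf{g}_1^2$, so whenever $\mathbf{g}_1^2 \gg \epsilon$ each coordinate moves by roughly $\sqrt{\epsilon / (1 - \rho)} = 0.01$ regardless of the gradient's magnitude. In that sense $\epsilon$ and $\rho$ together set the initial step scale, even though no explicit learning rate appears.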

## Implementation

Adadelta needs to maintain two state variables for each parameter, $\mathbf{s}_t$ and $\Delta\mathbf{x}_t$. This yields the following implementation.


In [1]:
%use @file[../djl.json]
%use lets-plot
@file:DependsOn("../D2J-1.0-SNAPSHOT.jar")
//import jp.live.ugai.d2j.attention.Chap10Utils
import jp.live.ugai.d2j.util.GradDescUtils.plotGammas
import jp.live.ugai.d2j.util.GradDescUtils.train2d
import jp.live.ugai.d2j.util.GradDescUtils.showTrace2d
import jp.live.ugai.d2j.util.TrainingChapter11.getDataCh11
import jp.live.ugai.d2j.util.TrainingChapter11.plotLossEpoch
import jp.live.ugai.d2j.util.TrainingChapter11.trainCh11
import jp.live.ugai.d2j.util.TrainingChapter11.trainConciseCh11
import jp.live.ugai.d2j.util.LossTime

In [2]:
fun initAdadeltaStates(featureDimension: Int) : NDList {
    val manager = NDManager.newBaseManager();
    val sW = manager.zeros(Shape(featureDimension.toLong(), 1));
    val sB = manager.zeros(Shape(1));
    val deltaW = manager.zeros(Shape(featureDimension.toLong(), 1));
    val deltaB = manager.zeros(Shape(1));
    return NDList(sW, deltaW, sB, deltaB);
}

object Optimization {
    fun adadelta(params: NDList, states: NDList,  hyperparams: Map<String, Float>) {
        val rho = hyperparams.get("rho")!!
        val eps = 1e-5.toFloat()
        for (i in 0 until params.size) {
            val param = params.get(i);
            val state = states.get(2 * i);
            val delta = states.get(2 * i + 1);
            // Update state, parameter, and delta, in that order.
            // Methods ending in 'i' (e.g., muli, addi, subi) update the NDArray in place.
            // state = rho * state + (1 - rho) * param.gradient^2
            state.muli(rho).addi(param.getGradient().square().mul(1 - rho));
            // rescaledGradient = sqrt(delta + eps) / sqrt(state + eps) * param.gradient
            val rescaledGradient = delta.add(eps).sqrt()
                .div(state.add(eps).sqrt()).mul(param.getGradient());
            // param -= rescaledGradient
            param.subi(rescaledGradient);
            // delta = rho * delta + (1 - rho) * rescaledGradient^2
            delta.muli(rho).addi(rescaledGradient.square().mul(1 - rho));
        }
    }
}

Choosing $\rho = 0.9$ gives the leaky averages an effective window of about $1/(1-\rho) = 10$ parameter updates. This tends to work quite well. We get the following behavior.
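To see where the figure of 10 comes from, it helps to unroll the leaky average: the squared gradient from $k$ steps ago enters $\mathbf{s}_t$ with weight $(1-\rho)\rho^k$. The sketch below, in plain Kotlin with hypothetical names (`weight`, `effectiveWindow`, `halfLife`, `recentShare`), computes three common summaries of this weighting for $\rho = 0.9$.

```kotlin
import kotlin.math.ln
import kotlin.math.pow

val rho = 0.9

// Unrolling s_t = rho * s_{t-1} + (1 - rho) * g_t^2 puts weight
// (1 - rho) * rho^k on the squared gradient from k steps ago.
fun weight(k: Int, rho: Double) = (1 - rho) * rho.pow(k)

// Rule-of-thumb effective window of the leaky average: 1 / (1 - rho).
val effectiveWindow = 1 / (1 - rho)

// Half-life in the strict sense: the lag k at which rho^k has dropped to 1/2.
val halfLife = ln(0.5) / ln(rho)

// Share of the total weight carried by the 10 most recent steps, i.e. 1 - rho^10.
val recentShare = (0 until 10).sumOf { weight(it, rho) }

println("effective window = $effectiveWindow")
println("half-life = $halfLife")
println("weight on last 10 steps = $recentShare")
```

The value 10 corresponds to the effective window $1/(1-\rho)$. The half-life in the strict sense is shorter, about 6.6 steps, and the 10 most recent steps carry roughly 65% of the total weight.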


In [3]:
val airfoil = getDataCh11(10, 1500);

fun trainAdadelta(rho: Float, numEpochs: Int) : LossTime {
    val featureDimension = airfoil.getColumnNames().size
    val hyperparams = mutableMapOf<String, Float>()
    hyperparams.put("rho", rho)
    return trainCh11(Optimization::adadelta, 
                                       initAdadeltaStates(featureDimension), 
                                       hyperparams, airfoil, 
                                       featureDimension, numEpochs)
}

val lossTime = trainAdadelta(0.9f, 2)
plotLossEpoch(lossTime.loss, lossTime.epoch)

loss: 0.244, 0.140 sec/epoch


As usual, for a concise implementation, we simply build an Adadelta optimizer via `Optimizer.adadelta()`.

In [6]:
val adadelta = Optimizer.adadelta().optRho(0.9f).build();

val lossTime = trainConciseCh11(adadelta, airfoil, 6);
plotLossEpoch(lossTime.loss, lossTime.epoch)

Training:    100% |████████████████████████████████████████| Accuracy: 0.67, L2Loss: 0.49
Training:    100% |████████████████████████████████████████| Accuracy: 0.67, L2Loss: 0.48
Training:    100% |████████████████████████████████████████| Accuracy: 0.67, L2Loss: 0.47
Training:    100% |████████████████████████████████████████| Accuracy: 0.67, L2Loss: 0.45
Training:    100% |████████████████████████████████████████| Accuracy: 0.67, L2Loss: 0.44
Training:    100% |████████████████████████████████████████| Accuracy: 0.67, L2Loss: 0.43
loss: 0.371, 0.216 sec/epoch


## Summary

* Adadelta has no learning rate parameter. Instead, it uses the rate of change in the parameters themselves to adapt the learning rate.
* Adadelta requires two state variables to store the second moments of the gradient and of the change in parameters.
* Adadelta uses leaky averages to keep a running estimate of the appropriate statistics. 

## Exercises

1. Adjust the value of $\rho$. What happens?
1. Show how to implement the algorithm without the use of $\mathbf{g}_t'$. Why might this be a good idea?
1. Is Adadelta really learning rate free? Could you find optimization problems that break Adadelta?
1. Compare Adadelta to AdaGrad and RMSProp and discuss their convergence behavior.
