Highlighting in the documentation that square root comes before adding epsilon. #23796

ghstack-source-id: 6c4dbd396edeb987c422ec69fa32b60840b3d108
Pull Request resolved: #26735
vincentqb committed Sep 25, 2019
1 parent 5001ec4 commit 014019d
Showing 1 changed file with 7 additions and 1 deletion.
8 changes: 7 additions & 1 deletion torch/optim/rmsprop.py
@@ -3,14 +3,20 @@


class RMSprop(Optimizer):
-"""Implements RMSprop algorithm.
+r"""Implements RMSprop algorithm.
Proposed by G. Hinton in his
`course <http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf>`_.
The centered version first appears in `Generating Sequences
With Recurrent Neural Networks <https://arxiv.org/pdf/1308.0850v5.pdf>`_.
+The implementation here takes the square root of the gradient average before
+adding epsilon (note that TensorFlow interchanges these two operations). The effective
+learning rate is thus :math:`\alpha/(\sqrt{v} + \epsilon)` where :math:`\alpha`
+is the scheduled learning rate and :math:`v` is the weighted moving average
+of the squared gradient.
Arguments:
params (iterable): iterable of parameters to optimize or dicts defining
parameter groups
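The ordering difference the new docstring text describes can be illustrated with a minimal scalar sketch. The function name, signature, and defaults below are illustrative only, not PyTorch's actual implementation:

```python
import math

def rmsprop_step(param, grad, v, lr=0.01, alpha=0.99, eps=1e-8,
                 eps_outside_sqrt=True):
    """One RMSprop update for a scalar parameter (illustrative sketch)."""
    # Weighted moving average of the squared gradient.
    v = alpha * v + (1 - alpha) * grad * grad
    if eps_outside_sqrt:
        # Ordering described in the docstring: take the square root first,
        # then add epsilon, so the effective learning rate is
        # lr / (sqrt(v) + eps).
        param -= lr * grad / (math.sqrt(v) + eps)
    else:
        # TensorFlow-style ordering: epsilon is added inside the square root.
        param -= lr * grad / math.sqrt(v + eps)
    return param, v

# The two orderings give slightly different updates for identical inputs.
p_outside, v1 = rmsprop_step(1.0, 0.1, 0.0)
p_inside, _ = rmsprop_step(1.0, 0.1, 0.0, eps_outside_sqrt=False)
```

The difference only matters when `v` is small relative to `eps` (early in training, or with vanishing gradients), but it means the two libraries are not bit-for-bit interchangeable even with identical hyperparameters.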
