Constant Stochastic Gradient Descent #2544

Merged
merged 8 commits into pymc-devs:master from the csgb branch on Dec 2, 2017

Conversation

shkr
Contributor

@shkr shkr commented Sep 5, 2017

Hey,

I recently came across the publication "Stochastic Gradient Descent as Approximate Bayesian Inference" (https://arxiv.org/pdf/1704.04289v1.pdf), which I found interesting.

In comparison to Stochastic Gradient Fisher Scoring (SGFS), which uses a preconditioning matrix to sample
from the posterior even with decreasing learning rates, this work uses an optimal constant learning rate chosen so that the Kullback-Leibler divergence between the stationary distribution of SGD and the posterior is minimized.

It approximates the posterior in the manner of stochastic variational inference, whereas SGFS and many
other MCMC techniques converge towards the exact posterior. Unlike SGFS, the paper derives the optimal
preconditioning matrix from a variational-inference argument, so the preconditioning matrix is not an input.

I have implemented it by extending the BaseStochasticGradient class introduced in the SGFS PR.

I am submitting this PR before it is complete in order to get feedback on the algorithm.
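For readers skimming the thread, the core idea can be sketched in a few lines of NumPy (an illustration, not the PR's implementation): run SGD with a fixed, preconditioned step on noisy minibatch gradients and keep the iterates as approximate posterior samples. The function name, the preconditioner H, and the step size eps below are illustrative; the paper derives theoretically optimal values for them.

import numpy as np

def constant_sgd_samples(grad_logp_minibatch, theta0, H, eps, n_steps, batch_iter):
    """Constant-step, preconditioned SGD whose iterates are kept as samples.

    grad_logp_minibatch(theta, batch) -> noisy estimate of the log-posterior gradient
    H   -> fixed (d x d) preconditioning matrix
    eps -> constant learning rate
    """
    theta = np.asarray(theta0, dtype=float).copy()
    samples = []
    for _ in range(n_steps):
        batch = next(batch_iter)
        g = grad_logp_minibatch(theta, batch)
        theta = theta + eps * H.dot(g)   # gradient ascent on the log posterior
        samples.append(theta.copy())
    # After burn-in, the iterates approximate draws from the stationary
    # distribution whose KL divergence to the posterior the paper minimizes.
    return np.array(samples)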

@shkr
Contributor Author

shkr commented Sep 5, 2017

I am unable to understand what is meant by the statement "We show projections on the smallest and largest principal component of the posterior" in Figure 1. Any help on how to calculate these projections of a posterior?

I want to replicate the results from Figures 1, 2 and 3.

@twiecki
Member

twiecki commented Sep 5, 2017

This is a great start @shkr.

To do the PCA projection, you run an SVD on the covariance matrix of the posterior, which gives you U, S, and V.T (for a symmetric covariance matrix, U equals V). Take the first and last columns of U, i.e. the directions with the largest and smallest singular values, and make that your projection matrix. Let me know if that's not clear.
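A small NumPy sketch of that projection (trace_samples is a hypothetical (draws x parameters) array of posterior draws):

import numpy as np

def pca_projection(trace_samples):
    # Project posterior draws onto the largest and smallest principal components.
    centered = trace_samples - trace_samples.mean(axis=0)
    sigma = np.cov(centered, rowvar=False)   # (d x d) posterior covariance
    U, s, Vt = np.linalg.svd(sigma)          # singular values sorted descending
    proj = U[:, [0, -1]]                     # first and last principal directions
    return centered @ proj                   # (draws x 2) projected samples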

@shkr
Contributor Author

shkr commented Sep 10, 2017

Okay.
I computed the covariance of the posterior sample matrix P of shape (Q x S), where
Q : number of parameters
S : number of samples in the trace

Sigma = { P - mean(P) } * { P - mean(P) }.T, of shape (Q x Q)

Then I selected the first and last rows of V.H from the SVD decomposition Sigma = U S V.H.
Afterwards I projected the S samples of size (Q x 1) from the trace onto the first and last components.

This gives me S two-dimensional vectors.

Is that what is being done? I am trying to interpret these projections, but I think I am doing something incorrect. Can you confirm the above steps?

@shkr shkr force-pushed the csgb branch 3 times, most recently from d16a6cc to 3ce1537 on September 16, 2017 at 22:08
@shkr
Contributor Author

shkr commented Sep 20, 2017

^ @twiecki any comments?

@twiecki
Member

twiecki commented Sep 25, 2017

@shkr Sorry, I've been on vacation, will try to take a look soon.

@twiecki
Member

twiecki commented Sep 26, 2017

@shkr I think that's close. However, I think they compute the principal components of the posterior only once (e.g. on the NUTS samples) and then project the individual traces onto those components.
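A sketch of that suggestion: compute the principal directions once on a reference trace (for example the NUTS samples) and reuse the same projection matrix for every other trace. The random arrays below are only stand-ins for real traces.

import numpy as np

rng = np.random.RandomState(0)
nuts_samples = rng.randn(2000, 5)   # placeholder for a (draws x params) NUTS trace
csg_samples = rng.randn(2000, 5)    # placeholder for the CSG trace

U, _, _ = np.linalg.svd(np.cov(nuts_samples, rowvar=False))
proj = U[:, [0, -1]]                # fixed projection: largest and smallest component

center = nuts_samples.mean(axis=0)
nuts_2d = (nuts_samples - center) @ proj   # project every trace with the same matrix
csg_2d = (csg_samples - center) @ proj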

@twiecki
Member

twiecki commented Sep 28, 2017

It does seem to work much better than SGFS. What are your conclusions?

@fonnesbeck
Member

I actually used this as an example of extending PyMC3 in a presentation last week. It worked really well!

@shkr
Contributor Author

shkr commented Oct 3, 2017

@twiecki I have updated the notebook. The posterior from CSG does generate a good approximation to the true posterior; the figures are in line with the paper. I expected independent projections onto the largest and smallest eigenvectors of the sample covariance matrix, but that does not hold for SGFS, possibly because of its two untuned hyperparameters. No such tuning is required for CSG, since there are theoretically optimal values for all of its hyperparameters. I want to try using CSG for hyperparameter optimization of the lasso model before drawing final conclusions and requesting a merge.

@twiecki
Member

twiecki commented Nov 9, 2017

@shkr I somewhat forgot about this, but I'm fairly excited about the work you've put in here. Do you think we should include the sampler in the code base? It seems preferable to SGFS.

@shkr
Contributor Author

shkr commented Nov 9, 2017

@twiecki Yes. I was busy with some other work, so I was unable to push the update here. I will push a commit this weekend at the latest; it will then be ready for review/merge.

@shkr
Contributor Author

shkr commented Nov 9, 2017

And yes, I agree that CSG is preferable to SGFS.

@shkr
Contributor Author

shkr commented Nov 10, 2017

@twiecki @fonnesbeck some debugging help required.

[screenshots: Theano traceback showing the disconnected input error, Nov 9, 2017 10:20 PM]

I am unable to understand why Theano is throwing a disconnected input error here.

As per the model, mu is the Laplacian and s is the regularizer parameter for the distribution. I would expect ohs_var, which depends on mu, to have a gradient with respect to s.
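For reference, a hypothetical minimal reproduction of this class of error (not the PR's model): Theano raises DisconnectedInputError when the expression being differentiated does not depend, anywhere in its symbolic graph, on the variable the gradient is taken with respect to.

import theano
import theano.tensor as tt

s = tt.dscalar('s')
mu = tt.dscalar('mu')
cost = mu ** 2                  # the cost depends on mu only

g_mu = theano.grad(cost, mu)    # works

# Raises theano.gradient.DisconnectedInputError: `s` never enters the
# symbolic graph of `cost`, so there is no path to differentiate through.
g_s = theano.grad(cost, s)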

@shkr
Contributor Author

shkr commented Nov 14, 2017

That question is non-blocking for this PR. I just ran into that error while trying to implement the hyperparameter section of the paper. But, having thought about a few use cases, it does not make sense to me, since hyperparameters such as the number of nodes in a neural network are non-differentiable. So I am a bit unclear on how the EM routine is helpful for general problems.

@shkr
Contributor Author

shkr commented Nov 14, 2017

I have updated sgmcmc.py and created a new notebook showing usage of ConstantStochasticGradient. This PR is ready for merge.
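A schematic usage sketch only; the argument names (batch_size, total_size, minibatch_tensors, minibatches) are assumptions based on the BaseStochasticGradient/SGFS interface this class extends, and the merged notebook is the authoritative reference for the exact call.

import numpy as np
import theano
import pymc3 as pm

# Toy regression data, purely illustrative.
X = np.random.randn(5000, 2)
y = X.dot(np.array([1.0, -2.0])) + 0.1 * np.random.randn(5000)

batch_size, total_size = 500, len(y)
X_t = theano.shared(X[:batch_size])   # shared tensors the step method refreshes each step
y_t = theano.shared(y[:batch_size])

def minibatches():
    while True:
        idx = np.random.randint(0, total_size, batch_size)
        yield X[idx], y[idx]

with pm.Model():
    w = pm.Normal('w', mu=0, sd=10, shape=2)
    pm.Normal('obs', mu=pm.math.dot(X_t, w), sd=0.1,
              observed=y_t, total_size=total_size)
    step = pm.CSG(batch_size=batch_size, total_size=total_size,
                  minibatch_tensors=[X_t, y_t], minibatches=minibatches())
    trace = pm.sample(5000, step=step)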

Theano variables, default continuous vars
kwargs: passed to BaseHMC
"""
super(ConstantStochasticGradient, self).__init__(vars, **kwargs)
Member

Should add an experimental warning.

@twiecki
Member

twiecki commented Nov 16, 2017

Can you also add a note to the RELEASE-NOTES?

@@ -298,3 +311,98 @@ def competence(var, has_grad):
        if var.dtype in continuous_types and has_grad:
            return Competence.COMPATIBLE
        return Competence.INCOMPATIBLE


class ConstantStochasticGradient(BaseStochasticGradient):
Member

maybe a shorter name?

Contributor Author

I can change it to CSG, just like I did with SGFS. Is that okay?

Member

Yeah, it's not great, but for consistency it's probably the best option.

Contributor Author

I have renamed it.

@shkr
Contributor Author

shkr commented Nov 27, 2017

@twiecki I have inserted a line in the RELEASE-NOTES and added my name as a community member. Let me know if that's what you wanted.

RELEASE-NOTES.md Outdated
@@ -249,6 +250,7 @@ Patricio Benavente <patbenavente@gmail.com>
Raymond Roberts
Rodrigo Benenson <rodrigo.benenson@gmail.com>
Sergei Lebedev <superbobry@gmail.com>
Shashank Shekhar <shashank.f1@gmail.com>
Member

Can you add yourself instead to a new Contributors section for the upcoming 3.3 release?

Contributor Author

Done

@twiecki
Member

twiecki commented Nov 28, 2017

Can you also add the NB to the docs? And make sure you only have one top-level heading # in the NB.

@shkr
Contributor Author

shkr commented Nov 30, 2017

@twiecki Done! I added the 3 notebooks I have created for stochastic algorithms to the examples doc.

@junpenglao
Member

Great job @shkr! Just a nitpick: in constant_stochastic_gradient.ipynb you still have a top-level heading # at the end. You should change # Result to ## Result.

@shkr
Contributor Author

shkr commented Nov 30, 2017

@junpenglao done!

@twiecki
Member

twiecki commented Nov 30, 2017

************* Module pymc3.sampling

pymc3/sampling.py:13: [W0611(unused-import), ] Unused CSG imported from step_methods

pymc3/sampling.py:13: [W0611(unused-import), ] Unused SGFS imported from step_methods

Would also be curious how CSG does on the neural network, but this doesn't have to be part of this PR.

@shkr
Contributor Author

shkr commented Nov 30, 2017

@twiecki Yes, I agree. I will put up a follow-up PR with CSG on the neural net and other notebook updates to the stochastic gradient docs.

RELEASE-NOTES.md Outdated
@@ -7,7 +7,8 @@

- Improve NUTS initialization `advi+adapt_diag_grad` and add `jitter+adapt_diag_grad` (#2643)
- Update loo, new improved algorithm (#2730)

- New CSG (Constant Stochastic Gradient) approximate posterior sampling
algorithm added
Member

Link to PR like above.



Stochastic Gradient
=====================
Member

extra ==

https://github.com/pymc-devs/pymc3/tree/master/docs/source/notebooks/constant_stochastic_gradient.ipynb

Parameters
-----
Member

Make the line as long as the text above it.

@twiecki twiecki merged commit 0a72bca into pymc-devs:master Dec 2, 2017
@twiecki
Member

twiecki commented Dec 2, 2017

Thanks @shkr, this is a significant contribution!

@twiecki
Member

twiecki commented Dec 2, 2017

Just tried running this on a larger NN, but

output, _ = theano.scan(lambda i, logX=logL, v=var: theano.grad(logX[i], v).flatten(),
                        sequences=[tt.arange(logL.shape[0])])

seems to take forever. Is there no way to vectorize this?
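For context, a hedged sketch of the pattern under discussion: per-sample gradients require one scan iteration per data point, whereas the gradient of the summed log-likelihood needs a single symbolic grad and no scan at all. The toy linear-Gaussian log-likelihood is an assumption, not the PR's model, and the summed gradient only replaces the scan where the individual per-sample terms are not needed.

import theano
import theano.tensor as tt

theta = tt.dvector('theta')
X = tt.dmatrix('X')
y = tt.dvector('y')

# Per-sample log-likelihood of a toy linear-Gaussian model.
logL = -0.5 * tt.sqr(y - tt.dot(X, theta))

# Per-sample gradients: one scan step per data point (slow for large N).
per_sample_grads, _ = theano.scan(
    lambda i: theano.grad(logL[i], theta),
    sequences=[tt.arange(logL.shape[0])])

# Gradient of the summed log-likelihood: a single symbolic grad, no scan.
sum_grad = theano.grad(tt.sum(logL), theta)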

jordan-melendez pushed a commit to jordan-melendez/pymc3 that referenced this pull request Feb 6, 2018
* add csg

* Fig 1 and likelihood plotted

* posterior comparison

* csg nb and python file updated

* ConstantStochasticGradient renamed as CSG

* inserted update in RELEASE-NOTES

* nb updated and added to examples
@shkr shkr deleted the csgb branch May 13, 2018 20:07