No gradients calculated with dense variational layers #409
Comments
Fixing:

    def posterior_mean_field(kernel_size, bias_size=0, dtype=None):
        n = kernel_size + bias_size
        c = np.log(np.expm1(1.))
        return tf.keras.Sequential([
            tfp.layers.VariableLayer(2 * n, dtype=dtype),
            tfp.layers.DistributionLambda(lambda t: tfd.Independent(  # pylint: disable=g-long-lambda
                tfd.Normal(loc=t[..., :n],
                           scale=1e-5 + tf.nn.softplus(c + t[..., n:])),
                reinterpreted_batch_ndims=1)),
        ])

    def prior_trainable(kernel_size, bias_size=0, dtype=None):
        n = kernel_size + bias_size
        return tf.keras.Sequential([
            tfp.layers.VariableLayer(n, dtype=dtype),
            tfp.layers.DistributionLambda(
                lambda t: tfd.Independent(tfd.Normal(loc=t, scale=1),  # pylint: disable=g-long-lambda
                                          reinterpreted_batch_ndims=1)),
        ])

    self.dense1 = tfp.layers.DenseVariational(
        100, posterior_mean_field, prior_trainable,
        activation=tf.nn.relu, kl_weight=1/training_size)
    # Note: DenseFlipout has a different API from DenseVariational; it does
    # not accept make-posterior/make-prior callables or kl_weight. Its KL
    # contribution can be scaled through kernel_divergence_fn instead.
    self.dense2 = tfp.layers.DenseFlipout(
        10,
        kernel_divergence_fn=lambda q, p, _: tfd.kl_divergence(q, p) / training_size)
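For completeness, here is a minimal sketch of how these callables plug into a working model. This is not from the original post; `training_size`, the layer sizes, and the random data are illustrative placeholders.

```python
import numpy as np
import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions

# Illustrative placeholders, not the poster's data.
training_size = 1000
x_train = np.random.randn(training_size, 20).astype(np.float32)
y_train = np.random.randn(training_size, 1).astype(np.float32)

# Wire the posterior_mean_field / prior_trainable callables defined above
# into two DenseVariational layers, scaling each KL term by 1/training_size.
model = tf.keras.Sequential([
    tfp.layers.DenseVariational(
        100, posterior_mean_field, prior_trainable,
        activation=tf.nn.relu, kl_weight=1 / training_size),
    tfp.layers.DenseVariational(
        1, posterior_mean_field, prior_trainable,
        kl_weight=1 / training_size),
])
model.compile(optimizer='adam', loss='mse')
model.fit(x_train, y_train, epochs=2, batch_size=32)
```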
We are also experiencing problems with this. Here's an even simpler example to recreate the error! It throws the following error after a single successful gradient update.
Good news: we've recently developed a way to fix these issues in a generic and backwards-compatible way, so a fix for the variational layers should be coming soon.
Nice! Looking forward to the fix.
Which one is correct when we divide the model.losses: the total number of training examples, or the number of batches?
As with all things, I'd say it depends, and in practice I've seen both. We assume we should be less certain about our view of the world with fewer training examples, meaning we're strongly tied to our prior (i.e. a large contribution from the KL loss in our total objective). In that context, dividing by the total number of examples makes sense. In practice, using mini-batch stochastic optimisation, I weight the KL loss term by mini-batch size and restrict that contribution on an epoch basis. I've played with other, more complex weighting strategies, and in the end I don't find large changes in my final weight distributions, assuming I'm not doing anything extreme. Blundell et al. have a pretty straightforward strategy defined in their 2015 paper if you're looking for a reference (https://arxiv.org/abs/1505.05424). I think the more interesting question is which prior, and what we don't know about what we don't know.
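To make the "divide by the total number of examples" convention concrete, here is a rough sketch of a single training step. This is not code from the thread; `model`, the cross-entropy loss, and `num_train_examples` are placeholder assumptions.

```python
import tensorflow as tf

def train_step(model, optimizer, x, y, num_train_examples):
    with tf.GradientTape() as tape:
        logits = model(x, training=True)
        nll = tf.reduce_mean(
            tf.keras.losses.sparse_categorical_crossentropy(
                y, logits, from_logits=True))
        # For DenseReparameterization / DenseFlipout, model.losses holds the
        # unscaled per-layer KL(q || p) terms; dividing by the dataset size
        # follows the Blundell et al. (2015) convention discussed above.
        kl = tf.add_n(model.losses) / num_train_examples
        loss = nll + kl
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```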
Hi, is there a way to work around this no-gradient problem in TFP 0.7? I tried the latest TFP and it works fine, but for some reason I need to run my code on TF 1.14 only. Any information would be appreciated. Many thanks!
There is no easy workaround, as the fix involved quite a number of changes.
Is it conceivable for you to use a tfp-nightly package? They still have the ones going back a few months on PyPI. If you grab one from sometime in August, it should still only require TF 1.14.
Hi, thanks for the reply. I have tried the nightly version and it works OK. May I check whether there is any way to push this further back to 1.12? I checked the nightlies, and even early versions like 0.6.xx cannot be loaded with TF 1.12.
@SiegeLordEx This is still a problem with the following versions of TensorFlow and TFP
Specifically, I am getting the error
Note that I am not using gradient tape. This and many other bugs in TF and TFP are far from being solved!
@nbro I'll take a look at the Stack Overflow code, but for your trainable prior woes, could you share how you define the prior?
@SiegeLordEx I am using the default prior that the class provides. I had initially opened this issue: #887.

Meanwhile, I think I cannot directly change the property. I tried to do that and, apparently, the variable changes, but I am getting unexpected values for the KL divergence when the optimizer (or fit) prints out the loss. See https://stackoverflow.com/q/61371627/3924118. In particular, the KL divergence that I compute manually in the callback is different from the KL divergence that the optimizer prints (in the progress bar) for the training data. The KL divergence that I compute manually in the callback at each step of an epoch is closer to the KL divergence for the validation data.

I am really lost and stuck (and I can't make progress with my work). I can't understand what's going on.

(Btw, I was expecting the KL divergence for the training and test sets to be similar, but maybe this is another issue.)
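For what it's worth, here is a sketch of one way to log the summed KL term from a callback; it illustrates the approach described above and is not the poster's actual code. One thing worth noting: the training loss Keras prints in the progress bar is a running average over the epoch while the weights keep changing, whereas an end-of-epoch callback sees only the final weights, which by itself can make a manually computed value look closer to the validation-side numbers.

```python
import tensorflow as tf

class KLLogger(tf.keras.callbacks.Callback):
    """Logs the sum of the model's KL terms at the end of each epoch.

    Sketch only: assumes the KL regularizers are exposed through
    model.losses, as they are for tfp.layers' variational dense layers.
    Depending on the TF version, a forward pass may be needed before
    model.losses is populated.
    """

    def on_epoch_end(self, epoch, logs=None):
        kl = tf.add_n(self.model.losses)
        print(f'epoch {epoch}: summed KL = {float(kl):.4f}')
```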
The default prior doesn't create variables, yet your error message has things like: |
@SiegeLordEx Ha, sorry, I had forgotten about this. I am using the following function to initialise the prior.
@SiegeLordEx Am I correct in interpreting your answer as the DenseFlipout layer being completely functional now? I am asking because I'm currently working on a Bayesian neural network and have implemented it using these layers. I'm still working on it to obtain better results, but I have stumbled on some posts saying that the DenseFlipout layer is outdated (#359 (comment)). In short, I would like to know whether it's safe and correct to keep using this functionality, or whether it is advised to look into the DenseVariational layer instead. Thanks in advance!
They should be functional. Strictly speaking, DenseVariational is newer, but in practice both should work, with slightly different APIs. If you want even more options, there are newer experimental layers as well.
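To make the API difference concrete, a short sketch (the layer sizes and divergence scaling are illustrative; `posterior_mean_field` and `prior_trainable` are the callables from earlier in the thread):

```python
import tensorflow_probability as tfp

tfd = tfp.distributions

# DenseVariational: you pass callables that build the surrogate posterior
# and the prior, plus an explicit KL weight.
layer_a = tfp.layers.DenseVariational(
    10, posterior_mean_field, prior_trainable, kl_weight=1 / 1000)

# DenseFlipout: the posterior, prior, and divergence are configured via
# keyword arguments with library-provided defaults, and the layer uses the
# Flipout estimator for lower-variance gradient estimates.
layer_b = tfp.layers.DenseFlipout(
    10,
    kernel_divergence_fn=lambda q, p, _: tfd.kl_divergence(q, p) / 1000.)
```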
Hi,
I'm trying to test a Bayesian approach I'm working on against some of the new variational layers (dense, for now) in TFP. I'm just trying to throw together a quick-and-dirty working example on MNIST to see how TFP's variational layers comparatively perform.
I'm running into trouble calculating gradients for DenseReparameterization and DenseFlipout layers using the latest nightly builds of TF and TFP, along with eager execution and gradient tape.
Am I missing something simple here, or is there a deeper issue?
Basic example below:
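The snippet itself was not preserved in this copy of the thread; a minimal example in the spirit the poster describes (eager execution, gradient tape, flattened MNIST-shaped inputs; all names and sizes here are illustrative reconstructions, not the original code) would look roughly like this:

```python
import numpy as np
import tensorflow as tf
import tensorflow_probability as tfp

# Illustrative stand-in for a flattened MNIST batch.
x = np.random.rand(32, 784).astype(np.float32)
y = np.random.randint(0, 10, size=32)

model = tf.keras.Sequential([
    tfp.layers.DenseFlipout(100, activation=tf.nn.relu),
    tfp.layers.DenseFlipout(10),
])
optimizer = tf.keras.optimizers.Adam()

with tf.GradientTape() as tape:
    logits = model(x)
    nll = tf.reduce_mean(
        tf.keras.losses.sparse_categorical_crossentropy(
            y, logits, from_logits=True))
    # model.losses is populated with the layers' KL terms by the forward pass.
    kl = tf.add_n(model.losses) / x.shape[0]
    loss = nll + kl

# On the affected nightlies this reportedly produced missing gradients for
# the variational weights; on fixed versions the gradients are populated.
grads = tape.gradient(loss, model.trainable_variables)
optimizer.apply_gradients(zip(grads, model.trainable_variables))
```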