
torch.normal accepts Variables but does not propagate gradients #4620

Closed
samuela opened this issue Jan 11, 2018 · 23 comments

Comments

samuela commented Jan 11, 2018

Simple example:

import torch
from torch.autograd import Variable


mu = Variable(torch.Tensor([1]), requires_grad=True)
sigma = Variable(torch.Tensor([1]), requires_grad=True)
loss = torch.pow(torch.normal(mu, sigma), 2)
loss.backward()

produces:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-3-52a0569421b1> in <module>()
----> 1 loss.backward()

~/stuff/venv/lib/python3.6/site-packages/torch/autograd/variable.py in backward(self, gradient, retain_graph, create_graph, retain_variables)
    165                 Variable.
    166         """
--> 167         torch.autograd.backward(self, gradient, retain_graph, create_graph, retain_variables)
    168 
    169     def register_hook(self, hook):

~/stuff/venv/lib/python3.6/site-packages/torch/autograd/__init__.py in backward(variables, grad_variables, retain_graph, create_graph, retain_variables)
     97 
     98     Variable._execution_engine.run_backward(
---> 99         variables, grad_variables, retain_graph)
    100 
    101 

RuntimeError: element 0 of variables does not require grad and does not have a grad_fn

samuela commented Jan 11, 2018

Even worse, I've somehow found a way to run torch.normal and backward() through it successfully, but the actual gradients aren't correct even though the code runs without errors. I'll try to get a minimal repro of that behavior as well.

samuela commented Jan 12, 2018

Ok, here's another example of the behavior described in the previous comment: backward() runs without error, but the gradients are not actually calculated.

It's something like a VAE. It demonstrates using torch.normal for the reparameterization trick vs. a hand-written version.

import torch
from torch.autograd import Variable

import math


torch.manual_seed(0)

LOG2PI = torch.log(torch.FloatTensor([2 * math.pi]))[0]

class DiagonalMVN(object):
  def __init__(self, mean, log_stddev):
    assert mean.size() == log_stddev.size()
    self.mean = mean
    self.log_stddev = log_stddev

  def sample_bad(self):
    return torch.normal(self.mean, torch.exp(self.log_stddev))

  def sample_good(self):
    return self.mean + torch.exp(self.log_stddev) * Variable(torch.randn(self.mean.size()))

  def logprob(self, x):
    return -0.5 * (
      self.mean.numel() * LOG2PI
      + 2 * torch.sum(self.log_stddev)
      + torch.sum((x - self.mean) * torch.exp(-2 * self.log_stddev) * (x - self.mean))
    )

class FixedVarianceNet(object):
  def __init__(self, mean_net, log_stddev):
    self.mean_net = mean_net
    self.log_stddev = log_stddev

  def __call__(self, x):
    return DiagonalMVN(self.mean_net(x), self.log_stddev)

inference_net = FixedVarianceNet(torch.nn.Linear(2, 1), Variable(torch.zeros(1)))
generative_net = FixedVarianceNet(torch.nn.Linear(1, 2), Variable(torch.zeros(2)))

print('### torch.normal broken!')
Xvar = Variable(torch.randn(2))
vi_posterior = inference_net(Xvar)
loss = -generative_net(vi_posterior.sample_bad()).logprob(Xvar)
loss.backward()

print('inference_net.mean_net.bias.grad ==', inference_net.mean_net.bias.grad)
print()

print('### Custom sample() works...')
Xvar = Variable(torch.randn(2))
vi_posterior = inference_net(Xvar)
loss = -generative_net(vi_posterior.sample_good()).logprob(Xvar)
loss.backward()

print('inference_net.mean_net.bias.grad ==', inference_net.mean_net.bias.grad)

It produces:

### torch.normal broken!
inference_net.mean_net.bias.grad == None

### Custom sample() works...
inference_net.mean_net.bias.grad == Variable containing:
 0.4637
[torch.FloatTensor of size 1]

samuela commented Jan 12, 2018

I'm using

torch==0.3.0.post4
torchvision==0.2.0

ssnl commented Jan 12, 2018

It becomes backprop-able because an nn.Linear participates in the computation. Its parameters require gradients, so the autograd engine can execute in your second example.

I don't think torch.normal should be backprop-able as it is a sampling function. IMO, there is a fundamental difference between:

  1. sampling using mu and sigma as distribution parameters, and
  2. training mu and sigma using randomly sampled data, i.e. N(0, 1).

But I do agree that this is confusing.

samuela commented Jan 12, 2018

It becomes backprop-able because an nn.Linear participates in the computation. Its parameters require gradients, so the autograd engine can execute in your second example.

But I've also marked the mu and sigma Variables as requiring grad in the first example. What makes them different from nn.Linear in the eyes of backward()?

Personally I would like to lobby for torch.normal (and friends) to be backprop-able, since it's convenient for cases where one wants to use a reparameterization trick. But I would also understand if they were deemed to not be backprop-able. In that case though, shouldn't pytorch throw an error any time it tries to backprop through torch.normal?

samuela commented Jan 12, 2018

I recently tracked down a particularly nasty bug that was due to the behavior shown in the second example: no error, but also no gradients.

ssnl commented Jan 12, 2018

But I've also marked the mu and sigma Variables as requiring grad in the first example. What makes them different from nn.Linear in the eyes of backward()?

I believe torch.normal sets its output to not requiring grad.

Personally I would like to lobby for torch.normal (and friends) to be backprop-able, since it's convenient for cases where one wants to use a reparameterization trick.

It is useful in the case of the reparameterization trick. However, it still doesn't quite make sense for backprop to work on sampling methods in general. For example, what would you say the "gradient" is for discrete distributions? And we should definitely not allow backprop through things like an entire MCMC trace, naive RL rewards, etc. When users want noisy gradients, the best way, in my opinion, is to let them manually write out something like N(0, 1).

I'm not sure that we should throw an error. A warning might be a good solution. But I definitely agree that we should make the doc clearer.

samuela commented Jan 12, 2018

I believe torch.normal sets its output to not requiring grad.

I'm still confused as to why one would error and the other would not. Shouldn't I be able to write my own linear layer W @ x + b and have it do the same thing? It looks like nn.Linear is receiving some special treatment here.

It is useful in the case of the reparameterization trick. However, it still doesn't quite make sense for backprop to work on sampling methods in general. For example, what would you say the "gradient" is for discrete distributions? And we should definitely not allow backprop through things like an entire MCMC trace, naive RL rewards, etc. When users want noisy gradients, the best way, in my opinion, is to let them manually write out something like N(0, 1).

That makes sense. In that case torch.normal should not accept Variables as input, correct?

ssnl commented Jan 12, 2018

I'm still confused as to why one would error and the other would not. Shouldn't I be able to write my own linear layer W @ x + b and have it do the same thing? It looks like nn.Linear is receiving some special treatment here.

nn.Linear has no special treatment. Here is what happens when it is involved.

  1. mu and sigma both require grad, but torch.normal(mu, sigma) doesn't.
  2. torch.normal(mu, sigma), linear.weight, and linear.bias interact to produce out. Although torch.normal(mu, sigma) doesn't require grad, linear.weight and linear.bias are nn.Parameters and naturally require grad, so out does too.
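
A minimal sketch of that rule (0.3-era API; the Linear layer here is just for illustration, not from the examples above):

import torch
from torch.autograd import Variable

mu = Variable(torch.Tensor([1]), requires_grad=True)
sigma = Variable(torch.Tensor([1]), requires_grad=True)

sample = torch.normal(mu, sigma)
print(sample.requires_grad)  # False: the sampling op detaches its output

linear = torch.nn.Linear(1, 1)
out = linear(sample.view(1, 1))
print(out.requires_grad)     # True: linear.weight and linear.bias require grad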

ssnl commented Jan 12, 2018

That makes sense. In that case torch.normal should not accept Variables as input, correct?

Maybe, but Tensor and Variable will soon become the same thing :P

samuela commented Jan 12, 2018

nn.Linear has no special treatment. Here is what happens when it is involved.
mu and sigma both require grad, but torch.normal(mu, sigma) doesn't.
torch.normal(mu, sigma), linear.weight, and linear.bias interact to produce out. Although torch.normal(mu, sigma) doesn't require grad, linear.weight and linear.bias are nn.Parameters and naturally require grad, so out does too.

Ok, so this all makes sense to me. But I don't see why

mu = Variable(torch.Tensor([1]), requires_grad=True)
sigma = Variable(torch.Tensor([1]), requires_grad=True)
x = torch.normal(mu, sigma)
loss = torch.pow(x, 2)
loss.backward()

should produce a RuntimeError.

samuela commented Jan 12, 2018

Tensor and Variable will soon become the same thing :P

Exciting!

ssnl commented Jan 12, 2018

produce a RuntimeError.

Without nn.Linear, the loss doesn't require grad because none of the things that compute it require grad. Calling backward on a Variable that doesn't require grad causes an error!

alicanb commented Jan 12, 2018

If you are on master, you can do x = torch.distributions.Normal(mu, sigma).rsample()
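
Applied to the repro at the top, that looks something like this (a sketch; assumes a build where torch.distributions exists with rsample()):

import torch
from torch.autograd import Variable

mu = Variable(torch.Tensor([1]), requires_grad=True)
sigma = Variable(torch.Tensor([1]), requires_grad=True)

# rsample() draws as mu + sigma * eps with eps ~ N(0, 1), so the sample
# stays attached to the graph and gradients flow back to mu and sigma.
x = torch.distributions.Normal(mu, sigma).rsample()
loss = torch.pow(x, 2)
loss.backward()

print(mu.grad, sigma.grad)  # both populated now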

talesa commented Jan 15, 2018

The following code doesn't cause an error, because var_requiring_grad.requires_grad = True implies loss.requires_grad = True, and hence you can call loss.backward():

mu = Variable(torch.Tensor([1]), requires_grad=True)
sigma = Variable(torch.Tensor([1]), requires_grad=True)
x = torch.normal(mu, sigma)

var_requiring_grad = Variable(torch.Tensor([1]), requires_grad=True)
loss = torch.pow(x, 2) + var_requiring_grad
loss.backward()

ssnl commented Jan 16, 2018

@talesa Yeah, but mu and sigma won't have grads.
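
You can check that by printing the grads after running that snippet:

print(mu.grad, sigma.grad)      # None None: nothing flowed back to them
print(var_requiring_grad.grad)  # populated: d(loss)/d(var_requiring_grad) = 1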

apaszke commented Jan 16, 2018

After some discussion we decided that all random ops in autograd should never propagate gradients. If you want a reparametrized sampler, use torch.distributions.

apaszke closed this as completed Jan 16, 2018

samuela commented Jan 16, 2018

@apaszke Would it be possible to have a warning or error when you try to use Variables with random ops then? That way it becomes much more difficult to fall into this trap.

colesbury reopened this Jan 16, 2018

colesbury commented

Yeah, the "fix" should be to throw an error in the backward of:

  • normal
  • log_normal
  • cauchy
  • exponential
  • geometric
  • bernoulli

Basically, anywhere Generator appears in derivatives.yaml

Kelym commented Jan 23, 2018

Just encountered this trap. I came from TensorFlow and was hoping to write a policy gradient agent for reinforcement learning, where I need to sample an action tensor from a normal distribution whose mean is the output of a network and whose deviation is a variable. I need to propagate to both the mean and the variance.

  1. Would vote for backprop-able distributions, because otherwise it is hard to write an RL agent in torch.
  2. Can anyone point me to a place to check which variables are affected by loss.backward()? It is easy to assume that a lot of things carry gradients (lists of variables, etc.), which resulted in painful debugging; see the sketch after this list.
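
One way to catch this up front seems to be inspecting the loss before calling backward() (a sketch, using only attributes Variables already expose):

import torch
from torch.autograd import Variable

mu = Variable(torch.Tensor([1]), requires_grad=True)
sigma = Variable(torch.Tensor([1]), requires_grad=True)
loss = torch.pow(torch.normal(mu, sigma), 2)

# A loss that is cut off from its inputs has no grad_fn; backward() will
# then either raise or silently leave the leaves' .grad as None.
print(loss.requires_grad)  # False here, since torch.normal detaches
print(loss.grad_fn)        # None here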

apaszke commented Jan 23, 2018

  1. Starting from 0.4 the recommended way will be to use torch.distributions, which has clear gradient semantics (.sample() doesn't propagate grads, .rsample() does).
  2. In general everything can be differentiated except for the random operations.
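
A sketch of those two semantics under the 0.4-style API (where Tensors and Variables are merged; variable names are mine):

import torch

mu = torch.zeros(1, requires_grad=True)
sigma = torch.ones(1, requires_grad=True)
dist = torch.distributions.Normal(mu, sigma)

s = dist.sample()   # detached draw: no grad_fn, gradients don't flow
r = dist.rsample()  # reparameterized draw: grad_fn is set, gradients flow
print(s.requires_grad, r.requires_grad)  # False True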

samuela commented Jan 24, 2018

@colesbury Is there a reason that you say the error should be thrown when calling backward, as opposed to when calling these functions with Variables? It seems it would be even safer to prevent the entire calling convention itself.

samuela commented May 2, 2018

Closing this now that 0.4 has come out.
