
torch.normal accepts Variables but does not propagate gradients #4620

Closed
samuela opened this issue Jan 11, 2018 · 23 comments

Comments

samuela commented Jan 11, 2018

Simple example:

import torch
from torch.autograd import Variable


mu = Variable(torch.Tensor([1]), requires_grad=True)
sigma = Variable(torch.Tensor([1]), requires_grad=True)
loss = torch.pow(torch.normal(mu, sigma), 2)
loss.backward()

produces:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-3-52a0569421b1> in <module>()
----> 1 loss.backward()

~/stuff/venv/lib/python3.6/site-packages/torch/autograd/variable.py in backward(self, gradient, retain_graph, create_graph, retain_variables)
    165                 Variable.
    166         """
--> 167         torch.autograd.backward(self, gradient, retain_graph, create_graph, retain_variables)
    168 
    169     def register_hook(self, hook):

~/stuff/venv/lib/python3.6/site-packages/torch/autograd/__init__.py in backward(variables, grad_variables, retain_graph, create_graph, retain_variables)
     97 
     98     Variable._execution_engine.run_backward(
---> 99         variables, grad_variables, retain_graph)
    100 
    101 

RuntimeError: element 0 of variables does not require grad and does not have a grad_fn

samuela commented Jan 11, 2018

Even worse, I've somehow found a way to run torch.normal and backward() through it successfully, but the actual gradients aren't correct even though the code runs without errors. I'll try to get a minimal repro of that behavior as well.

samuela commented Jan 12, 2018

Ok, here's another example of the behavior described in the previous comment: backward() runs without error, but the gradients are not actually calculated.

It's something like a VAE. It demonstrates using torch.normal for the reparameterization trick vs. a hand-written version.

import torch
from torch.autograd import Variable

import math


torch.manual_seed(0)

LOG2PI = torch.log(torch.FloatTensor([2 * math.pi]))[0]

class DiagonalMVN(object):
  def __init__(self, mean, log_stddev):
    assert mean.size() == log_stddev.size()
    self.mean = mean
    self.log_stddev = log_stddev

  def sample_bad(self):
    return torch.normal(self.mean, torch.exp(self.log_stddev))

  def sample_good(self):
    return self.mean + torch.exp(self.log_stddev) * Variable(torch.randn(self.mean.size()))

  def logprob(self, x):
    return -0.5 * (
      self.mean.numel() * LOG2PI
      + 2 * torch.sum(self.log_stddev)
      + torch.sum((x - self.mean) * torch.exp(-2 * self.log_stddev) * (x - self.mean))
    )

class FixedVarianceNet(object):
  def __init__(self, mean_net, log_stddev):
    self.mean_net = mean_net
    self.log_stddev = log_stddev

  def __call__(self, x):
    return DiagonalMVN(self.mean_net(x), self.log_stddev)

inference_net = FixedVarianceNet(torch.nn.Linear(2, 1), Variable(torch.zeros(1)))
generative_net = FixedVarianceNet(torch.nn.Linear(1, 2), Variable(torch.zeros(2)))

print('### torch.normal broken!')
Xvar = Variable(torch.randn(2))
vi_posterior = inference_net(Xvar)
loss = -generative_net(vi_posterior.sample_bad()).logprob(Xvar)
loss.backward()

print('inference_net.mean_net.bias.grad ==', inference_net.mean_net.bias.grad)
print()

print('### Custom sample() works...')
Xvar = Variable(torch.randn(2))
vi_posterior = inference_net(Xvar)
loss = -generative_net(vi_posterior.sample_good()).logprob(Xvar)
loss.backward()

print('inference_net.mean_net.bias.grad ==', inference_net.mean_net.bias.grad)

It produces:

### torch.normal broken!
inference_net.mean_net.bias.grad == None

### Custom sample() works...
inference_net.mean_net.bias.grad == Variable containing:
 0.4637
[torch.FloatTensor of size 1]

samuela commented Jan 12, 2018

I'm using

torch==0.3.0.post4
torchvision==0.2.0

ssnl commented Jan 12, 2018

It becomes backprop-able because an nn.Linear participates in the computation. Its parameters require gradients, so the autograd engine can execute in your second example.

I don't think torch.normal should be backprop-able as it is a sampling function. IMO, there is a fundamental difference between:

  1. sampling using mu and sigma as distribution parameters, and
  2. training mu and sigma using randomly sampled data, i.e. N(0, 1).

But I do agree that this is confusing.

samuela commented Jan 12, 2018

It becomes backprop-able because an nn.Linear participates in the computation. Its parameters require gradients, so the autograd engine can execute in your second example.

But I've also marked the mu and sigma Variables as requiring grad in the first example. What makes them different from nn.Linear in the eyes of backward()?

Personally I would like to lobby for torch.normal (and friends) to be backprop-able, since it's convenient for cases where one wants to use a reparameterization trick. But I would also understand if they were deemed to not be backprop-able. In that case though, shouldn't pytorch throw an error any time it tries to backprop through torch.normal?

samuela commented Jan 12, 2018

I recently tracked down a particularly nasty bug that was due to the behavior shown in the second example: no error, but also no gradients.

ssnl commented Jan 12, 2018

But I've also marked the mu and sigma Variables as requiring grad in the first example. What makes them different from nn.Linear in the eyes of backward()?

I believe torch.normal sets its output to not requiring grad.

Personally I would like to lobby for torch.normal (and friends) to be backprop-able, since it's convenient for cases where one wants to use a reparameterization trick.

It is useful in the case of the reparameterization trick. However, it still doesn't quite make sense for backprop to work on sampling methods in general. For example, what would you say the "gradient" is for discrete distributions? And we should definitely not allow backprop through things like an entire MCMC trace, naive RL rewards, etc. When users want noisy gradients, the best way, in my opinion, is to let them manually write out something like N(0, 1).

I'm not sure that we should throw an error. A warning might be a good solution. But I definitely agree that we should make the doc clearer.

samuela commented Jan 12, 2018

I believe torch.normal sets its output to not requiring grad.

I'm still confused as to why one would error and the other would not. Shouldn't I be able to write my own linear layer W @ x + b and have it do the same thing? It looks like nn.Linear is receiving some special treatment here.

It is useful in the case of the reparameterization trick. However, it still doesn't quite make sense for backprop to work on sampling methods in general. For example, what would you say the "gradient" is for discrete distributions? And we should definitely not allow backprop through things like an entire MCMC trace, naive RL rewards, etc. When users want noisy gradients, the best way, in my opinion, is to let them manually write out something like N(0, 1).

That makes sense. In that case torch.normal should not accept Variables as input, correct?

ssnl commented Jan 12, 2018

I'm still confused as to why one would error and the other would not. Shouldn't I be able to write my own linear layer W @ x + b and have it do the same thing? It looks like nn.Linear is receiving some special treatment here.

nn.Linear has no special treatment. Here is what happens when it is involved.

  1. mu and sigma both require grad, but torch.normal(mu, sigma) doesn't.
  2. torch.normal(mu, sigma), linear.weight, and linear.bias interact to produce out. Although torch.normal(mu, sigma) doesn't require grad, linear.weight and linear.bias are nn.Parameters and naturally require grad, so out does too.
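
A minimal sketch of that rule (0.3-era API; the Linear layer here is just for illustration, not from the examples above):

import torch
from torch.autograd import Variable

mu = Variable(torch.Tensor([1]), requires_grad=True)
sigma = Variable(torch.Tensor([1]), requires_grad=True)

sample = torch.normal(mu, sigma)
print(sample.requires_grad)  # False: the sampling op detaches its output

linear = torch.nn.Linear(1, 1)
out = linear(sample.view(1, 1))
print(out.requires_grad)     # True: linear.weight and linear.bias require grad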

ssnl commented Jan 12, 2018

That makes sense. In that case torch.normal should not accept Variables as input, correct?

Maybe, but Tensor and Variable will soon become the same thing :P

samuela commented Jan 12, 2018

nn.Linear has no special treatment. Here is what happens when it is involved.
mu and sigma both require grad, but torch.normal(mu, sigma) doesn't.
torch.normal(mu, sigma), linear.weight, and linear.bias interact to produce out. Although torch.normal(mu, sigma) doesn't require grad, linear.weight and linear.bias are nn.Parameters and naturally require grad, so out does too.

Ok, so this all makes sense to me. But I don't see why

mu = Variable(torch.Tensor([1]), requires_grad=True)
sigma = Variable(torch.Tensor([1]), requires_grad=True)
x = torch.normal(mu, sigma)
loss = torch.pow(x, 2)
loss.backward()

should produce a RuntimeError.

samuela commented Jan 12, 2018

Tensor and Variable will soon become the same thing :P

Exciting!

ssnl commented Jan 12, 2018

produce a RuntimeError.

Without nn.Linear, the loss doesn't require grad because none of the things that compute it require grad. Calling backward on a Variable that doesn't require grad causes an error!

alicanb commented Jan 12, 2018

If you are on master, you can do x = torch.distributions.Normal(mu, sigma).rsample()
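
Applied to the repro at the top, that looks something like this (a sketch; assumes a build where torch.distributions exists with rsample()):

import torch
from torch.autograd import Variable

mu = Variable(torch.Tensor([1]), requires_grad=True)
sigma = Variable(torch.Tensor([1]), requires_grad=True)

# rsample() draws as mu + sigma * eps with eps ~ N(0, 1), so the sample
# stays attached to the graph and gradients flow back to mu and sigma.
x = torch.distributions.Normal(mu, sigma).rsample()
loss = torch.pow(x, 2)
loss.backward()

print(mu.grad, sigma.grad)  # both populated now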

talesa commented Jan 15, 2018

The following code doesn't cause an error, because var_requiring_grad.requires_grad = True implies loss.requires_grad = True, and hence you can call loss.backward():

mu = Variable(torch.Tensor([1]), requires_grad=True)
sigma = Variable(torch.Tensor([1]), requires_grad=True)
x = torch.normal(mu, sigma)

var_requiring_grad = Variable(torch.Tensor([1]), requires_grad=True)
loss = torch.pow(x, 2) + var_requiring_grad
loss.backward()

ssnl commented Jan 16, 2018

@talesa Yeah, but mu and sigma won't have grads.
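
You can check that by printing the grads after running that snippet:

print(mu.grad, sigma.grad)      # None None: nothing flowed back to them
print(var_requiring_grad.grad)  # populated: d(loss)/d(var_requiring_grad) = 1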

apaszke commented Jan 16, 2018

After some discussion we decided that all random ops in autograd should never propagate gradients. If you want a reparametrized sampler, use torch.distributions.

apaszke closed this as completed Jan 16, 2018

samuela commented Jan 16, 2018

@apaszke Would it be possible to have a warning or error when you try to use Variables with random ops then? That way it becomes much more difficult to fall into this trap.

colesbury reopened this Jan 16, 2018

colesbury commented

Yeah, the "fix" should be to throw an error in the backward of:

  • normal
  • log_normal
  • cauchy
  • exponential
  • geometric
  • bernoulli

Basically, anywhere Generator appears in derivatives.yaml

Kelym commented Jan 23, 2018

Just encountered this trap. I came from TensorFlow and was hoping to write a policy gradient agent for reinforcement learning, where I need to sample an action tensor from a normal distribution whose mean is the output of a network and whose deviation is a variable. I need to propagate to both the mean and the variance.

  1. Would vote for backprop-able distributions, because otherwise it is hard to write an RL agent in torch.
  2. Can anyone point me to a place to check which variables are affected by loss.backward()? It is easy to assume that a lot of things carry gradients (lists of variables, etc.), which resulted in painful debugging; see the sketch after this list.
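
One way to catch this up front seems to be inspecting the loss before calling backward() (a sketch, using only attributes Variables already expose):

import torch
from torch.autograd import Variable

mu = Variable(torch.Tensor([1]), requires_grad=True)
sigma = Variable(torch.Tensor([1]), requires_grad=True)
loss = torch.pow(torch.normal(mu, sigma), 2)

# A loss that is cut off from its inputs has no grad_fn; backward() will
# then either raise or silently leave the leaves' .grad as None.
print(loss.requires_grad)  # False here, since torch.normal detaches
print(loss.grad_fn)        # None here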

apaszke commented Jan 23, 2018

  1. Starting from 0.4 the recommended way will be to use torch.distributions, which has clear gradient semantics (.sample() doesn't propagate grads, .rsample() does).
  2. In general everything can be differentiated except for the random operations.
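
A sketch of those two semantics under the 0.4-style API (where Tensors and Variables are merged; variable names are mine):

import torch

mu = torch.zeros(1, requires_grad=True)
sigma = torch.ones(1, requires_grad=True)
dist = torch.distributions.Normal(mu, sigma)

s = dist.sample()   # detached draw: no grad_fn, gradients don't flow
r = dist.rsample()  # reparameterized draw: grad_fn is set, gradients flow
print(s.requires_grad, r.requires_grad)  # False True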

samuela commented Jan 24, 2018

@colesbury Is there a reason that you say the error should be thrown when calling backward, as opposed to when calling these functions with Variables? It seems it would be even safer to prevent the entire calling convention itself.

samuela commented May 2, 2018

Closing this now that 0.4 has come out.
