Adjust ELBO so that baselines don't have to output scaled costs #555

Open
null-a opened this Issue Nov 10, 2017 · 12 comments


null-a commented Nov 10, 2017

#533 removed a superfluous scaling factor from the LR term of the ELBO estimator by undoing the scaling of log_q. I propose that we achieve this by undoing the scaling of downstream_cost instead, like this.

Doing so does not change gradient estimates, but it does rescale the values we're asking baseline nets to output.
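A minimal sketch of the difference (names like log_q, downstream_cost, and scale are illustrative, not Pyro's actual internals):

```python
def lr_term_533(log_q, downstream_cost, scale):
    # What #533 does: undo the scaling of log_q.
    return (log_q / scale) * downstream_cost

def lr_term_proposed(log_q, downstream_cost, scale):
    # Proposed here: undo the scaling of downstream_cost instead.
    # The product (and hence the gradient estimate) is unchanged,
    # but the baseline target becomes downstream_cost / scale.
    return log_q * (downstream_cost / scale)
```

Either way the LR term is identical; only the quantity the baseline net is asked to match changes.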

For example, here is some debug output from an inference run of the AIR example. (With a batch size of 2.) It shows the output of the baseline network at various points (output: ...) as well as the values we'd like it to output (target: ...).

target: [0.0, 8691153.0]
output: [0.0, 33.2]
i=100, epochs=0.00, elapsed=0.00, elbo=-271.15
target: [0.0, 0.0]
output: [0.0, 0.0]
i=200, epochs=0.01, elapsed=0.01, elbo=-150.59
target: [7280237.5, 0.0]
output: [88.6, 0.0]
i=300, epochs=0.01, elapsed=0.01, elbo=-477.05
target: [0.0, 21047686.0]
output: [0.0, 114.2]
i=400, epochs=0.01, elapsed=0.02, elbo=-704.59
target: [0.0, 21167150.0]
output: [0.0, 140.4]
i=500, epochs=0.02, elapsed=0.02, elbo=-627.35
target: [16254318.0, 0.0]
output: [164.9, 0.0]
i=600, epochs=0.02, elapsed=0.02, elbo=-624.58
target: [0.0, 12181961.0]
output: [0.0, 189.0]
i=700, epochs=0.02, elapsed=0.03, elbo=-557.61
target: [0.0, 0.0]
output: [0.0, 0.0]
i=800, epochs=0.03, elapsed=0.03, elbo=-586.60
target: [21169508.0, 0.0]
output: [235.7, 0.0]
i=900, epochs=0.03, elapsed=0.04, elbo=-448.33
target: [0.0, 21032276.0]
output: [0.0, 260.1]
i=1000, epochs=0.03, elapsed=0.04, elbo=-704.91

As you can see, the target values are huge, and even after 1K steps the net's outputs are nowhere near the expected order of magnitude.

After making the suggested change, however, things look more sensible:

target: [0.0, 229.6]
output: [0.0, 32.9]
i=100, epochs=0.00, elapsed=0.00, elbo=-144.65
target: [0.0, 610.7]
output: [0.0, 61.8]
i=200, epochs=0.01, elapsed=0.01, elbo=-688.94
target: [290.0, 587.9]
output: [86.5, 86.5]
i=300, epochs=0.01, elapsed=0.01, elbo=-516.98
target: [590.7, 0.0]
output: [109.8, 0.0]
i=400, epochs=0.01, elapsed=0.02, elbo=-591.06
target: [532.3, 566.2]
output: [132.9, 132.9]
i=500, epochs=0.02, elapsed=0.02, elbo=-679.37
target: [0.0, 516.3]
output: [0.0, 153.9]
i=600, epochs=0.02, elapsed=0.02, elbo=-492.29
target: [0.0, 405.7]
output: [0.0, 175.5]
i=700, epochs=0.02, elapsed=0.03, elbo=-556.48
target: [237.1, 491.8]
output: [194.0, 194.0]
i=800, epochs=0.03, elapsed=0.03, elbo=-554.03
target: [0.0, 487.9]
output: [0.0, 212.6]
i=900, epochs=0.03, elapsed=0.04, elbo=-705.26
target: [470.2, 470.2]
output: [233.2, 233.2]
i=1000, epochs=0.03, elapsed=0.04, elbo=-697.53

This change also has the effect of making the scale of the baseline targets invariant to the batch size. I'm not sure there are any immediate practical consequences of this, but it seems like a desirable property.

One objection to this change might be that we ought to simply increase the step size instead. This would work, but I suspect that using large targets and correspondingly large step sizes will make it more difficult to compare baseline step sizes with published results. So for that reason, I think making this change is a better approach.
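For plain SGD, scaling the loss down and the step size up by the same factor is exactly equivalent, which is easy to check (a sketch; the equivalence does not hold exactly for adaptive optimizers like Adam):

```python
def sgd_update(param, grad, lr):
    # One step of vanilla SGD.
    return param - lr * grad

# Scaling the gradient down by s while scaling the step size up by s
# gives the identical update, so "just increase the step size" would
# indeed work for plain SGD:
s = 4.0  # stand-in for the large scaling factor discussed above
small_target = sgd_update(1.0, 8.0 / s, 0.25 * s)
large_target = sgd_update(1.0, 8.0, 0.25)
```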

(EDIT: updated to use Github's <details> tags)


martinjankowiak commented Nov 10, 2017

@null-a is 'target' the downstream cost for one sample or is it something else?


null-a commented Nov 10, 2017

Yeah, it's downstream_cost. (ETA: for a single sample)


martinjankowiak commented Nov 10, 2017

one thing that's kind of unnatural about that is that in the general case the downstream cost may contain nested subsamples and so multiple scale factors.

maybe instead of being (possibly over-)clever it'd make sense to have SVI take a normalizer argument that is used to scale the entire loss?


fritzo commented Nov 10, 2017

@null-a This change makes sense to me, in that your new baselines are independent of batch size. (So, e.g., you could in theory do distributed training on two different machines with different batch sizes but with shared baselines.)

  1. Do you plan to submit a separate PR for this? If so, it would be good to keep this code in sync with Trace_ELBO.
  2. Do you think this should be included in our 0.1.2 patch release?

ngoodman commented Nov 10, 2017

the baseline targets invariant to the batch size

this seems very desirable to me: i can easily imagine fancy training setups where we change the batch size during training, and it'd be sad if this gave us unusable baselines.... (this makes me wonder about other cases where we might want reusable baselines, like missing observations.)


null-a commented Nov 10, 2017

one thing that's kind of unnatural

@martinjankowiak Hmm, I don't think I follow, sorry. Could you maybe expand a bit on where this would be problematic?

Do you plan to submit a separate PR for this?

@fritzo: If we go ahead, yeah, I'll make a separate PR. I also agree about Trace_ELBO.

Do you think this should be included in our 0.1.2 patch release?

I don't think so. It doesn't seem urgent, and of course I'd like to understand @martinjankowiak's reservation before submitting a PR.


martinjankowiak commented Nov 10, 2017

well, i mean something like this.

surrogate_elbo = elbo + sum_z log q(z) x DSC(z)

where the z are the non-rep latents and DSC is the downstream cost at node z.

if there's nested subsampling so that there are various scale factors, they will appear in the terms in elbo as well as in the terms in DSC(z). so unless the scale factors appear everywhere and are the same everywhere (like in a model with only local latent variables and no globals and only one layer of subsampling, like AIR), i don't think your change would actually yield invariance of the baseline under batch size. in general the baseline is integrating over information at different scales (there are multiple batch sizes), and so i don't think it can be made invariant like that (unless the baseline is built with a particular architecture to reflect the multi-scale structure of the sum of costs it's tracking?).

that said, i think your change is reasonable. i just suspect we might end up needing to revisit it down the line.
thus my suggestion: if the automagic thing isn't going to be quite right in all cases, maybe the user should just decide how he or she wants things scaled.
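The surrogate objective above can be sketched as follows (hypothetical names; z ranges over the non-reparameterized latents, and each DSC(z) would be held fixed, i.e. detached, when differentiating):

```python
def surrogate_elbo(elbo, log_q, dsc):
    # surrogate_elbo = elbo + sum_z log q(z) * DSC(z).
    # Differentiating this (with each DSC(z) treated as a constant)
    # yields the score-function / LR gradient estimator.
    return elbo + sum(log_q[z] * dsc[z] for z in log_q)
```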


null-a commented Nov 13, 2017

@martinjankowiak: Thanks for that, I appreciate you taking the time to explain! However, I'm afraid I still don't see where this goes wrong. :(

Here's a rough attempt to explain why I suspected it would work. Maybe seeing this will make it clear where I'm going wrong!

Imagine this model/guide structure:

def model():
    # illustrative site names; the nesting structure is what matters
    with pyro.iarange("outer", 2, subsample_size=b_1) as i1:
        pyro.sample("z_outer", ...)
        with pyro.iarange("inner", 2, subsample_size=b_2) as i2:
            pyro.sample("z_inner", ...)

Here's a picture of the dependencies we track, with * showing sample statements and o showing iarange.

      o
     / \
    /   \
  (*)    *     costs at this level scaled by m_1 = 2/b_1
   |     |
   o     o
  / \   / \
 *   * *   *   costs at this level scaled by m_2 = m_1 * 2/b_2

We scale log_pdfs and therefore 'costs' depending on the level of nesting. The scaling factors at each level are on the picture.

For simplicity imagine that the local cost (log(q/p)) at each node is c.

As an example, focus on the (*) node, then its downstream_cost (the thing the baseline is currently trying to output) is:

downstream_cost for (*) = m_1 * c + m_2 * b_2 * c

The b_2 comes from the fact that we incur a cost of c for each sub-sample we take.

If we divide this by the scaling factor at (*)'s level:

scaled_cost = (m_1 * c + m_2 * b_2 * c) / m_1
            = (m_1 * c + m_1 * (2 / b_2) * b_2 * c) / m_1
            = c + 2*c

Then we get something which isn't dependent on the batch size.

I think something similar applies as you extend out with more levels of nesting. For example, at the next level out you'd encounter b_2 * b_3 choices, but this term disappears from downstream_cost because the scaling factor at that level is m_1 * 2/b_2 * 2/b_3. Any idea what I'm missing?
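The arithmetic above can be checked numerically (a sketch using the m_1, m_2, b_1, b_2 names from the picture; both iarange sizes are 2, as in the example model):

```python
def scaled_cost(b1, b2, c=1.0, size1=2, size2=2):
    # Scale factors at each level of nesting, as in the picture.
    m1 = size1 / b1
    m2 = m1 * size2 / b2
    # downstream_cost at the outer (*) node, with local cost c per node.
    downstream = m1 * c + m2 * b2 * c
    # Undo the scaling at (*)'s level.
    return downstream / m1

# Invariant to the subsample sizes: always c + size2*c = 3*c here.
results = {scaled_cost(1, 1), scaled_cost(2, 1), scaled_cost(1, 2), scaled_cost(2, 2)}
```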


martinjankowiak commented Dec 17, 2017

@null-a do you think this works for models and guides with different structures?


null-a commented Dec 18, 2017

do you think this works for models and guides with different structures?

@martinjankowiak Ah, good question! (I hadn't considered that.) My initial impression is that if sub-sampling requires a particular random choice to show up in the same (potentially nested sequence of) iarange in both the model and the guide, then it will probably work out, since this similarity means that the picture I drew earlier still makes sense, and a similar argument goes through even if there is arbitrary internal structure in each *.

(Clearly this would need more thought before we considered implementing this.)


fritzo commented Apr 24, 2018

@martinjankowiak I totally forgot about this issue; IIRC we wanted to get it in before 0.2. I assume it's now too late, correct?

@fritzo fritzo added this to the 0.3.0 release milestone Apr 24, 2018


martinjankowiak commented Apr 24, 2018

well, i'm not sure there is a clear automagical solution here, or whether a tailored solution is really necessary. for example, if you're worried about big numbers, one thing you can do is wrap everything with a scale decorator:

@poutine.scale(scale=0.001) or @poutine.scale(scale=1/batch_size)

@fritzo fritzo removed this from the 0.3.0 release milestone Nov 19, 2018
