NUTS refactoring towards GPU support #2345

Merged
merged 8 commits into pymc-devs:master from aseyboldt:no-theano-nuts on Jul 20, 2017

7 participants
@aseyboldt
Member

aseyboldt commented Jun 23, 2017

So far we have mostly investigated possibilities of moving more of nuts into theano. But it might be better to go the opposite direction, and limit theano to computing logp and gradient. The remaining computations are mostly simple BLAS1 operations, so we should be able to do those quickly on a gpu or cpu on our own.
This was meant as an experiment to see how much slower it would be if we did the leapfrog steps in pure python, but to my surprise I get from 5% to 60% faster sampling.
I'm not sure I understand why that is yet, but my guess is that the shared computations between logp and gradient are responsible. Usually theano should be smart enough to do those only once, but if the leapfrog steps are part of the graph this seems to break. In complicated models the resulting runtime cost can be high.
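
To make the division of labor concrete, here is a minimal sketch of the idea with a toy standard-normal logp and an identity mass matrix (not the PR's actual code): theano only compiles logp and its gradient, and the leapfrog loop runs in plain numpy.

    import numpy as np
    import theano
    import theano.tensor as tt

    # Toy standard-normal logp; theano is only responsible for logp and its gradient.
    q_var = tt.dvector('q')
    logp_expr = -0.5 * tt.sum(q_var ** 2)
    logp_dlogp = theano.function([q_var], [logp_expr, tt.grad(logp_expr, q_var)])

    def leapfrog(q, p, epsilon, n_steps):
        """Leapfrog trajectory in plain numpy (identity mass matrix)."""
        _, grad = logp_dlogp(q)
        for _ in range(n_steps):
            p = p + 0.5 * epsilon * grad   # half momentum step
            q = q + epsilon * p            # full position step
            _, grad = logp_dlogp(q)
            p = p + 0.5 * epsilon * grad   # half momentum step
        return q, p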

CC @ColCarroll @hvasbath

@aseyboldt aseyboldt changed the title from "Use …" to "NUTS refactoring towards GPU support" on Jun 23, 2017

@aseyboldt aseyboldt added gpu WIP labels Jun 23, 2017

@ColCarroll
Member

ColCarroll left a comment

This looks really cool! Left some comments just to keep myself engaged. I will try to do some benchmarking this weekend and see if I get similar speedups, and under what conditions.

self._dtype = theano.config.floatX
self._ctx = gpu_ctx

"""


@ColCarroll

ColCarroll Jun 23, 2017

Member

these lines are left over experiments, yeah?


@aseyboldt

aseyboldt Jun 23, 2017

Author Member

Oops, I'll take those out on the next push.

def velocity(self, x):
def velocity(self, x, out=None):
    if out is not None:
        out[:] = self.v


@ColCarroll

ColCarroll Jun 23, 2017

Member

any reason for this syntax?


@ColCarroll

ColCarroll Jun 23, 2017

Member

oh, does it do automatic shape checking/something clever with reusing allocated memory?


@aseyboldt

aseyboldt Jun 23, 2017

Author Member

Yes, I'm trying to avoid allocations. Haven't done much profiling yet though.
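
As an illustration of that pattern, a hypothetical diagonal potential could write into a caller-supplied buffer via numpy's out= argument (a sketch, not the class in the diff):

    import numpy as np

    class DiagPotential:
        """Hypothetical diagonal mass matrix; velocity(x) = v * x."""
        def __init__(self, v):
            self.v = v

        def velocity(self, x, out=None):
            if out is None:
                return self.v * x
            # write into the caller's preallocated buffer, no new allocation
            np.multiply(self.v, x, out=out)
            return out

    pot = DiagPotential(np.ones(3))
    buf = np.empty(3)
    pot.velocity(np.array([1., 2., 3.]), out=buf)   # buf now holds the result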

def energy(self, x):
def energy(self, x, velocity=None):
    if velocity is not None:
        return 0.5 * scipy.linalg.blas.ddot(x, velocity)


@ColCarroll

ColCarroll Jun 23, 2017

Member

this is x.dot(velocity), but modestly faster, I guess? (just ran a silly benchmark, and it seems like it is ~30% faster)


@aseyboldt

aseyboldt Jun 23, 2017

Author Member

yes. I thought we might as well just use the blas version directly. This would probably also be what we want to call on a gpu. (Although pygpu doesn't expose level 1 BLAS at the moment...)
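
For a rough sense of the comparison (results depend on machine and array size, and for float32/floatX one would use blas.sdot instead), a sketch:

    import numpy as np
    from scipy.linalg import blas

    x = np.random.randn(30000)
    v = np.random.randn(30000)

    # Same result either way; the direct BLAS call skips some numpy dispatch overhead.
    assert np.isclose(x.dot(v), blas.ddot(x, v))

    # In IPython one could compare, e.g.:
    #   %timeit x.dot(v)
    #   %timeit blas.ddot(x, v)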

@junpenglao
Member

junpenglao commented Jun 23, 2017

If this works, it will also be much easier to move to other backends (PyTorch?).

@hvasbath
Contributor

hvasbath commented Jun 23, 2017

I always had the feeling that theano is very slow in some cases. I guess that's the cost of wrapping everything into layers and layers of code, some in C, some in Python, with a lot of communication overhead between them. I know too little about NUTS, and nothing about pygpu, so I can't comment on the code, but thanks for the cc anyway, as this touches the other issue. Can you profile how much is being shifted back and forth to the GPU?

@aseyboldt
Member Author

aseyboldt commented Jun 23, 2017

@hvasbath It's not running on the gpu yet; this is still about figuring out how we could do that in the future.
I don't think calling overhead is much of a problem, at least in most of my models. Most of the ops are in C anyway. I think the reason for the particular slowdown this PR fixes has more to do with what theano is computing in the first place. There are many cases where the gradient and the value both contain the same expression somewhere. Theano analyses the graph to find nodes that occur more than once, but if that expression is first changed by some other optimization, the merging doesn't work anymore. This might be why it ends up computing something twice. I'm still not sure about the exact reason, but just from the Apply node counts, it definitely does some things more often before the PR than after. If those things happen to be very expensive in a particular model, then we get a large speedup.
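
A toy illustration of that kind of sharing (not one of the models above): when the value and the gradient are compiled together as outputs of one function, theano's merge pass can compute the shared subexpression once.

    import theano
    import theano.tensor as tt

    x = tt.dvector('x')
    shared_term = tt.exp(x).sum()      # shows up in the value and in its gradient
    logp = -shared_term
    dlogp = tt.grad(logp, x)           # mathematically -exp(x), same exp(x) term

    # Compiled together, the merge optimizer should leave only one Exp node.
    f = theano.function([x], [logp, dlogp])
    theano.printing.debugprint(f)      # count the Exp apply nodes in the graph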

@hvasbath
Contributor

hvasbath commented Jun 24, 2017

You could try switching optimizations off? What's the timing then? If you know there are things used repeatedly in the graph, couldn't you program it explicitly in a way that doesn't recompute them?
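
For reference, a sketch of how the optimizations could be turned down or off for such a timing comparison, using theano's standard config and compile modes:

    import theano

    # Either globally, before compiling:
    theano.config.optimizer = 'fast_compile'   # or 'None' to disable optimizations entirely

    # ...or per compiled function:
    # f = theano.function(inputs, outputs, mode='FAST_COMPILE')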

@ferrine
Member

ferrine commented Jun 24, 2017

Interesting, waiting for benchmarks.

@aseyboldt
Member Author

aseyboldt commented Jun 24, 2017

Edit: There was a bug in my profiling code, so I updated the times. The new time went up from 7ms to 14ms (still much faster than master, but not by as much :-( ).

@ferrine I just tried it on an old model with ~30000 variables I still had around:

Before

N = model.ndim
with model:
    step = pm.step_methods.hmc.base_hmc.BaseHMC(
        scaling=np.ones(N), profile=False, use_single_leapfrog=True)

np.random.seed(42)
q = floatX(np.random.randn(N))
p = floatX(np.random.randn(N))
epsilon = floatX(np.array(0.01))
q_grad = model.dlogp_array(q)
%timeit _ = step.leapfrog(q, p, q_grad, epsilon)
20.8 ms ± 889 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Profiling:

<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
  82.7%    82.7%       1.327s       1.02e-04s     C    13041     161   theano.tensor.elemwise.Elemwise
  10.6%    93.3%       0.170s       5.51e-05s     C     3078      38   theano.tensor.elemwise.Sum
   5.7%    99.0%       0.092s       2.85e-04s     Py     324       4   theano.sparse.basic.Dot
   0.2%    99.2%       0.003s       2.10e-05s     C      162       2   theano.tensor.basic.Join
   0.2%    99.4%       0.003s       9.77e-06s     C      324       4   theano.tensor.basic.Alloc
   0.1%    99.5%       0.002s       2.79e-05s     C       81       1   theano.tensor.blas_c.CGemv
   0.1%    99.7%       0.002s       1.06e-06s     C     1944      24   theano.tensor.subtensor.Subtensor
   0.1%    99.8%       0.002s       7.30e-07s     C     2592      32   theano.tensor.elemwise.DimShuffle
   0.1%    99.9%       0.001s       1.39e-05s     C       81       1   theano.compile.ops.DeepCopyOp
   0.1%    99.9%       0.001s       8.60e-07s     C     1134      14   theano.tensor.basic.Reshape
   0.0%    99.9%       0.000s       3.44e-07s     C     1134      14   theano.compile.ops.ViewOp
   0.0%   100.0%       0.000s       7.84e-07s     C      405       5   theano.tensor.basic.ScalarFromTensor
   0.0%   100.0%       0.000s       3.05e-07s     C      810      10   theano.compile.ops.Rebroadcast
   0.0%   100.0%       0.000s       2.12e-06s     C       81       1   theano.compile.ops.Shape_i
   0.0%   100.0%       0.000s       1.26e-06s     C       81       1   theano.tensor.basic.AllocEmpty
   ... (remaining 0 Classes account for   0.00%(0.00s) of the runtime)

After

N = model.ndim
with model:
    step = pm.step_methods.hmc.base_hmc.BaseHMC(
        scaling=np.ones(N), use_single_leapfrog=True, profile=False)

np.random.seed(42)
q = floatX(np.random.randn(N))
p = floatX(np.random.randn(N))
epsilon = floatX(np.array(0.01))
state = step.integrator.compute_state(q, p)
%timeit step.integrator.step(epsilon, state)
14.4 ms ± 25.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Profiling:

<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
  86.6%    86.6%       9.967s       1.29e-04s     C    77140      95   theano.tensor.elemwise.Elemwise
   8.6%    95.2%       0.991s       5.31e-05s     C    18676      23   theano.tensor.elemwise.Sum
   3.9%    99.0%       0.445s       2.74e-04s     Py    1624       2   theano.sparse.basic.Dot
   0.4%    99.4%       0.047s       4.82e-06s     C     9744      12   theano.tensor.subtensor.IncSubtensor
   0.3%    99.7%       0.036s       1.50e-05s     C     2436       3   theano.tensor.basic.Alloc
   0.1%    99.8%       0.009s       7.26e-07s     C    12180      15   theano.tensor.elemwise.DimShuffle
   0.1%    99.9%       0.009s       8.97e-07s     C     9744      12   theano.tensor.subtensor.Subtensor
   0.1%   100.0%       0.006s       6.67e-07s     C     9744      12   theano.tensor.basic.Reshape
   0.0%   100.0%       0.002s       3.57e-07s     C     5684       7   theano.compile.ops.ViewOp
   0.0%   100.0%       0.002s       3.97e-07s     C     4872       6   theano.tensor.opt.MakeVector
   0.0%   100.0%       0.001s       1.04e-06s     C      812       1   theano.compile.ops.Shape_i
   ... (remaining 0 Classes account for   0.00%(0.00s) of the runtime)

@junpenglao
Member

junpenglao commented Jun 24, 2017

Wow. I will try it on some of my slow models.

@hvasbath
Contributor

hvasbath commented Jun 25, 2017

Did you find out where the speedup comes from? If you compare the numbers in the profiling, it is not so clear, at least not to me. The state calculation is included in leapfrog for case 1, isn't it? So for a fair comparison you would need to time that as well? Or is that the point, that this is stuff that doesn't need to be recalculated all the time?

@aseyboldt
Member Author

aseyboldt commented Jun 25, 2017

@hvasbath They both do just one leapfrog step. The compute_state is pretty much the same as q_grad = model.dlogp_array(q).
I still haven't really figured it out. I'm trying to look at the theano graphs of some very simple models. What makes this difficult is that those graphs get very large very quickly. :-)
But have a look at the number of Apply nodes in the profile above (#apply). In the model I do one multiplication with a sparse matrix. So I would expect to see an apply count of 2 for theano.sparse.basic.Dot: One for computing the value, and one because of the gradient. This is the case in the second run. But in master there are 4 of those.

@aseyboldt
Member Author

aseyboldt commented Jun 25, 2017

I think I have a rough idea now where the problem in master is coming from.
This is the (slightly expanded) part where the leapfrog step is done (trajectory.py::_theano_single_leapfrog):

    p_new = p + 0.5 * epsilon * q_grad  # half momentum update
    q_new = q + epsilon * H.pot.velocity(p_new)  # full position update
    q_new_grad = H.dlogp(q_new)
    p_new += 0.5 * epsilon * q_new_grad  # half momentum update
    # energy_new = energy(H, q_new, p_new)
    energy_new = H.pot.energy(q_new) - H.logp(q_new)
    v_new = H.pot.velocity(p_new)

In the third line we compute the gradient of the logp at q_new. In the sixth line we compute the logp at q_new. H.logp and H.dlogp are both instances of theanof.CallableTensor. They use theano.clone to change the input, so we end up with two clones for most of our computations, and theano doesn't seem to merge them again.
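
A toy version of that cloning pattern (not the actual trajectory.py code), just to show where the duplicated subgraphs can come from:

    import theano
    import theano.tensor as tt

    q = tt.dvector('q')
    logp = -tt.exp(q).sum()
    dlogp = tt.grad(logp, q)

    q_new = q + 1.0
    # Each clone substitutes q_new into its own copy of the subgraph, so the
    # logp and dlogp clones no longer share a single exp(q_new) node unless a
    # later merge pass recombines them.
    logp_at_qnew = theano.clone(logp, replace={q: q_new})
    dlogp_at_qnew = theano.clone(dlogp, replace={q: q_new})
    f = theano.function([q], [logp_at_qnew, dlogp_at_qnew])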

@aseyboldt
Member Author

aseyboldt commented Jun 25, 2017

I messed up the profiling a bit. I updated the comment. The times are now 20ms vs 14ms instead of 20ms vs 7ms.

@aseyboldt aseyboldt force-pushed the aseyboldt:no-theano-nuts branch 2 times, most recently from fe05701 to f07ea00 Jun 25, 2017

@@ -179,3 +182,80 @@ def test_empty_observed():
    npt.assert_allclose(a.tag.test_value, np.zeros((2, 3)))
    b = pm.Beta('b', alpha=1, beta=1, observed=data)
    npt.assert_allclose(b.tag.test_value, np.ones((2, 3)) / 2)


class TestValueGradFunction(unittest.TestCase):


@ferrine

ferrine Jun 29, 2017

Member

I thought we moved to pytest


@aseyboldt

aseyboldt Jun 29, 2017

Author Member

Ah, I didn't really think about this, I just used the unittest version. pytest supports this though. I guess the alternative would be to use a pytest fixture? Is that what you'd use?


@ferrine

ferrine Jun 29, 2017

Member

Yes, I tried pytest and fell in love with it


@ColCarroll

ColCarroll Jul 18, 2017

Member

we've still got unittest stuff sprinkled around. it looks like this would actually work fine if you just deleted the super class, but I don't think it hurts anything, either.
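
For comparison, a pytest-only version of such a test class might look like the following sketch (hypothetical contents, no unittest.TestCase super class):

    import numpy as np
    import pytest

    class TestValueGradFunction:           # plain class, collected by pytest
        @pytest.fixture(autouse=True)
        def setup_model(self):
            self.val = np.zeros(3)          # stands in for the real model setup

        def test_shape(self):
            assert self.val.shape == (3,)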

@aseyboldt aseyboldt force-pushed the aseyboldt:no-theano-nuts branch 2 times, most recently from 28b954d to 56a26fa Jul 16, 2017

@ColCarroll
Member

ColCarroll commented Jul 18, 2017

Happy to help/review this when needed/ready

@aseyboldt
Member Author

aseyboldt commented Jul 18, 2017

@ColCarroll great. I'm still working on fixing HamiltonianMC again, but I'll let you know when it's done.

@aseyboldt aseyboldt force-pushed the aseyboldt:no-theano-nuts branch from 56a26fa to 444308b Jul 18, 2017

@aseyboldt
Member Author

aseyboldt commented Jul 18, 2017

@ColCarroll I guess there will still be a test failure somewhere, but I think this is ready for review now.

@ColCarroll
Member

ColCarroll left a comment

looks good to me -- wish we had the benchmark suite set up to check this out from a performance standpoint, but I think numpy is preferable for maintaining the library (and will give better error messages).

array = self._logp_dlogp_func.dict_to_array(point)

if self.generates_stats:
    apoint, stats = self.astep(array)


@ColCarroll

ColCarroll Jul 18, 2017

Member

just thinking for the future, this would be easier to reason about if self.astep always returned at least an empty dictionary for stats


@aseyboldt

aseyboldt Jul 18, 2017

Author Member

When I wrote this I thought I'd rather not change the interface for step methods, so that we don't break custom step methods. Now I guess we should probably treat that as a private part of the API, so yes, I agree.
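
The suggested convention might look roughly like this hypothetical sketch (names made up, not the current interface):

    class SomeStepMethod:
        generates_stats = True              # could then default to True everywhere

        def astep(self, array):
            new_array = array.copy()        # placeholder for the real update
            stats = []                      # always a list, possibly empty
            return new_array, stats

    # callers could then always unpack uniformly:
    # apoint, stats = step.astep(array)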


def step(self, epsilon, state, out=None):
    pot = self._potential
    axpy = linalg.blas.get_blas_funcs('axpy', dtype=self._dtype)


@ColCarroll

ColCarroll Jul 18, 2017

Member

can you add a comment or link to explain what axpy is/does?


@aseyboldt

aseyboldt Jul 18, 2017

Author Member

That's in the comments just above the places I use the function.
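
For anyone reading along: axpy is the BLAS level-1 update a*x + y. A small sketch of how the scipy wrapper is used (the in-place update of y applies when the dtype and memory layout match):

    import numpy as np
    from scipy import linalg

    x = np.random.randn(5)
    y = np.random.randn(5)
    expected = 0.1 * x + y

    axpy = linalg.blas.get_blas_funcs('axpy', dtype=x.dtype)
    res = axpy(x, y, a=0.1)            # computes 0.1*x + y, updating y in place when possible
    np.testing.assert_allclose(res, expected)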

max_treedepth : int, default=10
    The maximum tree depth. Trajectories are stopped when this
    depth is reached.
early_max_treedepth : int, default=8
    The maximum tree depth during tuning.
    The maximum tree depth during the first 200 tuning samples.


@ColCarroll

ColCarroll Jul 18, 2017

Member

technically, min(200, tune_steps)


@aseyboldt

aseyboldt Jul 18, 2017

Author Member

I really hope people don't use less than 200 tuning steps :-)
I think "during the first 200 tuning samples" would imply that, no?

@@ -73,15 +71,18 @@ def __str__(self):


class QuadPotential(object):
    def velocity(self, x):
    def velocity(self, x, out=None):


@ColCarroll

ColCarroll Jul 18, 2017

Member

this is to use blas functions to change values in-place?


@aseyboldt

aseyboldt Jul 18, 2017

Author Member

yes. Since we are doing this in python now, avoiding allocations seems to matter.


@kyleabeauchamp
Contributor

kyleabeauchamp commented Jul 18, 2017

I have a dumb question: do we have a fully-GPU version of vanilla HMC? That could be easier to implement than NUTS, and because of architectural differences might end up being almost as good...

@aseyboldt
Member Author

aseyboldt commented Jul 18, 2017

@kyleabeauchamp no, we don't have that yet. My plan going forward would be to use the transfer method in theano to allow us to return data on the gpu and avoid the copy, and then use pygpu for the couple of array ops we need to do in nuts/hmc.
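
Roughly along these lines, a sketch assuming the gpuarray backend is configured; the transfer target (None for the default context) is taken from the gpuarray docs and is an assumption here:

    import theano
    import theano.tensor as tt

    x = tt.fvector('x')
    logp = -0.5 * tt.sum(x ** 2)
    grad = tt.grad(logp, x)

    # transfer() asks theano to hand back the gradient as a GpuArray in the
    # default gpu context instead of copying it to the host; the BLAS1 updates
    # in nuts/hmc could then stay on the device via pygpu.
    f = theano.function([x], grad.transfer(None))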

@aseyboldt aseyboldt force-pushed the aseyboldt:no-theano-nuts branch 3 times, most recently from 511a6da to 6076d71 Jul 18, 2017

@aseyboldt aseyboldt force-pushed the aseyboldt:no-theano-nuts branch from 6076d71 to 56aba98 Jul 19, 2017

Binomial('outcome', n=1, p=prob)
with warnings.catch_warnings(record=True) as warns:
with pytest.warns(None) as warns:


@junpenglao

junpenglao Jul 19, 2017

Member

Is this finally going to solve the weird Travis failure? 💯


@aseyboldt

aseyboldt Jul 19, 2017

Author Member

No clue, but I thought it was worth a shot.


@aseyboldt

aseyboldt Jul 19, 2017

Author Member

ok, that wasn't it. This failure is a bit of a mystery to me, but it's not related to the PR anyway.

@aseyboldt
Member Author

aseyboldt commented Jul 20, 2017

I think this is ready to be merged now.
I tested this against a couple of my models and got speedups between 20% and 40% for all of them. For very simple models (e.g. just a single normal variable) it is slower (~20%). I guess we could remedy that by working on the integration a bit more, maybe optionally using numba in there to avoid a couple of unnecessary passes over the arrays.

@twiecki
Member

twiecki commented Jul 20, 2017

@aseyboldt Very cool! Thanks, also to the reviewers.

@twiecki twiecki merged commit 61ba6ee into pymc-devs:master Jul 20, 2017

2 checks passed

continuous-integration/travis-ci/pr: The Travis CI build passed
coverage/coveralls: Coverage increased (+0.6%) to 87.38%