Memory Error: OrderedLogistic #3535

Closed
aloctavodia opened this issue Jun 29, 2019 · 19 comments · Fixed by #3563 or #3572
Comments

@aloctavodia
Member

I am updating and re-running the Statistical Rethinking notebooks, and I get a memory allocation error with model m_11 (code block 11.5). The problem seems to be related to #3383: reverting the changes that PR made to the Categorical distribution fixes the issue. @lucianopaz, you probably have a better idea of what is going on here.

@tohtsky
Contributor

tohtsky commented Jul 3, 2019

I think the basic example provided for OrderedLogistic (slightly modified to use ADVI) behaves differently in recent commits.
I've run the following code on the current HEAD and on older commits like 9ef2947.

import numpy as np
import pymc3 as pm

# Three clusters of 300 points each, centred at -1, 0, and 2
n1_c = 300; n2_c = 300; n3_c = 300
cluster1 = np.random.randn(n1_c) + -1
cluster2 = np.random.randn(n2_c) + 0
cluster3 = np.random.randn(n3_c) + 2

x = np.concatenate((cluster1, cluster2, cluster3))
y = np.concatenate((1*np.ones(n1_c),
                    2*np.ones(n2_c),
                    3*np.ones(n3_c))) - 1

# Ordered logistic regression
with pm.Model() as model:
    cutpoints = pm.Normal("cutpoints", mu=[-1,1], sigma=10, shape=2,
                          transform=pm.distributions.transforms.ordered)
    y_ = pm.OrderedLogistic("y", cutpoints=cutpoints, eta=x, observed=y)
    inference = pm.fit(30000, method='advi')
    tr = inference.sample(1000)

In the older versions the ELBO seems to converge to a reasonable value (~10^2), but in the recent commits the progress bar shows a much larger value (~10^5).

I suspect this is caused by the logp method of Categorical.
In the older versions, when p.ndim == 2 (which is the case in the above example), the log-likelihood

a = tt.log(p[tt.arange(p.shape[0]), value_clip])

is a vector of length p.shape[0].
In recent versions, however, a, defined as

pattern = (p.ndim - 1,) + tuple(range(p.ndim - 1))
a = tt.log(p.dimshuffle(pattern)[value_clip])

is a p.shape[0] x p.shape[0] matrix (which has 900**2 entries in the above example, roughly matching the scale of the ELBO).
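The shape difference is easy to check with plain NumPy (an illustrative sketch only; the real code operates on Theano tensors, but the advanced-indexing semantics are the same):

import numpy as np

N, K = 900, 3  # 900 observations, 3 categories, as in the example above
p = np.random.dirichlet(np.ones(K), size=N)   # shape (N, K)
value_clip = np.random.randint(0, K, size=N)  # shape (N,)

# Old behaviour: one log-probability per observation.
a_old = np.log(p[np.arange(p.shape[0]), value_clip])
print(a_old.shape)  # (900,)

# New behaviour: the class axis is moved to the front, and fancy indexing
# with value_clip then selects 900 rows of length 900.
pattern = (p.ndim - 1,) + tuple(range(p.ndim - 1))
a_new = np.log(p.transpose(pattern)[value_clip])
print(a_new.shape)  # (900, 900)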

@junpenglao
Member

Thanks @tohtsky, I think your diagnosis is right.

@lucianopaz
Contributor

@aloctavodia, sorry for the delay in taking a look at this. I'm still mostly offline until next week.

I looked a bit more into the commit you referenced and I think that there may be a bug in Categorical.logp, specifically here. If p is multidimensional, the dimshuffle moves the last axis around to simplify advanced indexing into it with the supplied values. The resulting a will have a weird shape like (value.shape, p.shape[:-1]) instead of having value.shape at the end. This could lead to strange broadcasting with the observed values and produce an intermediate array of shape (value.shape, value.shape), which could cause the memory error you encountered. Maybe that line could be replaced with this:

a = tt.log(p[..., value_clip])

I'll be able to test this in a bit less than two weeks, but if you want to try it out first, you're welcome to do so.

@lucianopaz
Contributor

Oops, I skipped over @tohtsky's answer, which points to exactly the same line of code that I thought was causing the bug. Maybe we could try the ellipsis in the indexing instead of doing a dimshuffle.

@aloctavodia
Member Author

I tried replacing the line you suggested and I still see the same problem :-(

@tohtsky
Contributor

tohtsky commented Jul 16, 2019

Would simply flattening the multi-dimensional tensor into a 2D array and then reshaping the resulting logp vector back into the original shape work?

if p.ndim > 1:
    original_shape = p.shape[:-1]
    p_flatten = p.reshape((-1, p.shape[-1]))
    a = tt.log(
        p_flatten[tt.arange(p_flatten.shape[0]), value_clip.ravel()]
    ).reshape(original_shape)
else:
    a = tt.log(p[value_clip])
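For what it's worth, the same idea in plain NumPy (a sketch only, not the actual Theano code) does give one log-probability per observation:

import numpy as np

p = np.random.dirichlet(np.ones(4), size=(5, 6))   # shape (5, 6, 4)
value_clip = np.random.randint(0, 4, size=(5, 6))  # shape (5, 6)

# Flatten all leading dimensions, index one probability per row,
# then reshape back to the original leading shape.
original_shape = p.shape[:-1]
p_flatten = p.reshape((-1, p.shape[-1]))            # shape (30, 4)
a = np.log(
    p_flatten[np.arange(p_flatten.shape[0]), value_clip.ravel()]
).reshape(original_shape)
print(a.shape)  # (5, 6) -- one entry per observation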

@lucianopaz
Contributor

I'll have time to look into this issue this Friday.

@bdyetton
Contributor

Hi,
Any progress on this issue?

@lucianopaz
Contributor

lucianopaz commented Jul 22, 2019

I managed to write a small test involving logp that highlights the error:

import numpy as np
from scipy.special import logit
import pymc3 as pm


loge = np.log10(np.exp(1))
size = 100
p = np.ones(10) / 10
cutpoints = logit(np.linspace(0, 1, 11)[1:-1])
obs = np.random.randint(0, 1, size=size)
with pm.Model():
    ol = pm.OrderedLogistic("ol", eta=0, cutpoints=cutpoints, observed=obs)
    c = pm.Categorical("c", p=p, observed=obs)

print(ol.logp({"ol": 1}) * loge)
print(c.logp({"c": 1}) * loge)

The OrderedLogistic variable ol should, given the provided cutpoints, be equivalent to the Categorical RV c. When we look at the logp returned by each RV for a given class, since all classes are equally likely, the base-10 log probability should be approximately equal to -size == -100. This is the case for the Categorical, but the OrderedLogistic returns -size**2 == -10000. The problem seems to be an unnecessary shape padding done in OrderedLogistic.__init__.

Furthermore,

>>> ol.distribution.p.ndim
2
>>> c.distribution.p.ndim
1

However, there is also an additional shape issue with how an array of cutpoints broadcasts against the passed observed values. I'll try to finish a fix tomorrow.

lucianopaz added a commit to lucianopaz/pymc that referenced this issue Jul 23, 2019
@lucianopaz
Contributor

In the end, it wasn't a broadcasting problem. It was an indexing problem when p was multidimensional.

twiecki pushed a commit that referenced this issue Jul 23, 2019
* Added tests for issue

* Fix for #3535

* Added release notes
@lucianopaz lucianopaz reopened this Jul 25, 2019
@lucianopaz
Contributor

tt.choose ended up making more of a mess, so we'll have to come up with a different fix for this problem.

@lucianopaz
Contributor

@seberg was kind enough to point me to take_along_axis, which is exactly what we need to replace choose with in order to fix this issue without falling into the problems raised in #3564 and #3567. Luckily, take_along_axis is implemented using advanced indexing, so we should be able to copy the relevant part to get a Theano-based implementation to use in Categorical.logp.
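As an illustration of why take_along_axis has the right semantics, here is the NumPy version (a sketch; the actual fix would rebuild the equivalent advanced indexing with Theano ops):

import numpy as np

p = np.random.dirichlet(np.ones(3), size=900)   # shape (900, 3)
value_clip = np.random.randint(0, 3, size=900)  # shape (900,)

# take_along_axis requires the index array to have the same ndim as p,
# so we add a trailing length-1 axis and squeeze it away afterwards.
picked = np.take_along_axis(p, value_clip[:, None], axis=-1)  # shape (900, 1)
a = np.log(picked.squeeze(axis=-1))                           # shape (900,)
print(a.shape)  # (900,) -- no (900, 900) blow-up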

@bdyetton
Contributor

bdyetton commented Jul 30, 2019

Hi @lucianopaz,
Unfortunately, that PR is still very slow for a multi-dim p.

import numpy as np
import pymc3 as pm

data = np.random.randint(0, 3, size=(1000, 1))

with pm.Model() as model:
    tp1 = pm.Dirichlet('tp1', a=np.array([0.25]*4), shape=(4,))  # 4 free RVs
    obs = pm.Categorical('obs', p=tp1, observed=data)
    trace = pm.sample()  # super fast!

data_indexer = np.random.randint(0, 2, size=(1000,))

with pm.Model() as model:
    tp1 = pm.Dirichlet('tp1', a=np.array([0.25]*4), shape=(2, 4))  # 8 free RVs
    obs = pm.Categorical('obs', p=tp1[data_indexer, :], observed=data)
    trace = pm.sample()  # takes ages!

Does the second model sample OK for you (just in case I've done something silly with my install)?

@lucianopaz
Contributor

@bdyetton, it samples super slow for me too. I'll try to find out why this is happening.

@bdyetton
Contributor

bdyetton commented Jul 31, 2019 via email

@bdyetton
Contributor

bdyetton commented Jul 31, 2019

@lucianopaz I'm not sure if this is helpful at all, but the second model above will not begin sampling with the slice step method, so this is not an issue affecting NUTS only.

@bdyetton
Contributor

bdyetton commented Aug 5, 2019

Hi @lucianopaz, any progress? Anything I can do to help?

@lucianopaz
Contributor

@bdyetton, sorry, I have to deal with other stuff from work first. Once I finish, I'll be able to look into this more deeply.

@bdyetton
Contributor

bdyetton commented Aug 6, 2019

@lucianopaz, Thanks!!!
