Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Compilation error for large number of categorical features #624

Closed
wcbeard opened this issue Oct 16, 2014 · 12 comments
Closed

Compilation error for large number of categorical features #624

wcbeard opened this issue Oct 16, 2014 · 12 comments
Labels

Comments

@wcbeard
Copy link

wcbeard commented Oct 16, 2014

I'm not sure if using a large number of categorical variables is an abuse of pymc, or if I'm just doing it wrong. I've reproduced the error with synthetic data with 500 possible string values for feature X (though the error appears with fewer values, like 250). I'm using the glm module with the following model, which I can get working in statsmodels: glm('Y ~ C(X)':

import itertools as it
import string

def f(st):
    return ord(st[0]) + ord(st[1]) + np.random.randn()

wds = map(''.join, it.islice(it.permutations(string.ascii_uppercase, 2), 500))
wd_dat = np.random.choice(wds, 5000)
y = map(f, wd_dat)

data = pd.DataFrame(dict(Y=y, X=wd_dat))
data[:4]
Out[49]:
    X           Y
0  JA  139.636050
1  GU  156.806869
2  FZ  161.310029
3  HU  157.979341

When I try to run the following model

with mc.Model() as model:
    mc.glm.glm('Y ~ C(X)', data)
    trace = mc.sample(2000, mc.NUTS(), progressbar=True)

I get Exception: ('Compilation failed (return status=1): /Users/me/.theano/compiledir_Darwin-13.3.0-x86_64-i386-64bit-i386-2.7.8-64/tmp_cmBr5/mod.cpp:28159:32: fatal error: bracket nesting level exceeded maximum of 256.
full trace here.

Is this expected?

@stefan-pdx
Copy link

I came across the same error as well. Were you able to come up with a workaround?

@twiecki twiecki added the bug label Dec 17, 2014
@wcbeard
Copy link
Author

wcbeard commented Dec 17, 2014

@slnovak Not yet, unfortunately.

@twiecki
Copy link
Member

twiecki commented Dec 17, 2014

It's kinda odd to get the nesting level error. At the core, glm should just create a very large design matrix that then gets matrix-multiplied the coefficients vector (https://github.com/pymc-devs/pymc/blob/master/pymc/glm/glm.py#L94). Perhaps the issue is not the design matrix but rather the coefficients which I think are all individual RVs but maybe should be just a single vector (https://github.com/pymc-devs/pymc/blob/master/pymc/glm/glm.py#L92).

@twiecki
Copy link
Member

twiecki commented Dec 17, 2014

So I guess you could try to replace this loop https://github.com/pymc-devs/pymc/blob/master/pymc/glm/glm.py#L88 with the creation of a random vector (e.g. coeffs = pm.Normal('coeffs', mu=0, sd=1, shape=len(reg_names)).

@stefan-pdx
Copy link

Well, for me, I was using a Dirichlet distribution with a shape of ~800. I was able to get the resulting Theano code to compile with include the following in ~/.theanorc:

[gcc]
cxxflags = -fbracket-depth=1024

However, I gave up as it was taking 20+ min for Theano to compile the model. I'm trying PyMC 2.3.4 to see if the model will run.

@wcbeard
Copy link
Author

wcbeard commented Dec 17, 2014

@twiecki When I try your suggestion (coeffs = Normal('coeffs', mu=0, sd=1, shape=len(reg_names) + 1) for the intercept) and y_est = theano.dot(np.asarray(dmatrix), coeffs) in the following line I get an error further on down when it tries converting coeffs to a set. (set(coeffs) => ValueError: length not known). I'm not familiar with the distribution types, so can't tell how to deal with iteration/data structure conversion.

@twiecki
Copy link
Member

twiecki commented Dec 17, 2014

@d10genes Hm, yeah there are few changes more I'm afraid.
y_est = theano.dot(np.asarray(dmatrix), theano.tensor.stack(*coeffs)).reshape((1, -1)) needs to change to:
y_est = theano.dot(np.asarray(dmatrix), coeffs).reshape((1, -1))

and
return y_est, coeffs
to:
return y_est, [coeffs]

@twiecki
Copy link
Member

twiecki commented Dec 17, 2014

@slnovak Note that there is no glm module in pymc 2. You can also try to do the regression manually in pymc3 which has syntax bit nicer for matrices.

@wcbeard
Copy link
Author

wcbeard commented Dec 17, 2014

@twiecki thanks! The second suggestion seemed to do it. At least, it ran without errors.

But either I'm not reading the trace plots right, or it's having trouble converging to the right solution. I can't tell if that's due to this change, or if my artificial data set and model are just too ill-defined.

If the code seems right to you, should I send a PR, or do we need something more robust (not sure what kind of additional tests this would call for)?

@hgbrian
Copy link
Contributor

hgbrian commented Dec 18, 2014

I had a similar problem when I was adding two Multinomials to the model per row of data. I could not include more than 20 datapoints in the model (out of one million!) I preferred the Python-based solution:
theano.config.gcc.cxxflags = "-fbracket-depth=16000" # default is 256
However, after making this change, I also found the theano compilation was too slow to be practical. It will be interesting to see if I can adapt the advice here to my problem. Thanks!

@twiecki
Copy link
Member

twiecki commented Dec 18, 2014

Just setting the bracket depth is not the right solution. The model being constructed is just more complex than it needs to be.

@d10genes Cool that it's at least compling now! The convergence looks pretty odd indeed. Seems like one of the coefficients is being drawn to a bad region and that this interacts with the intercept, so this suggest a colinearity. You could try and scatter plot the intercept vs the offending coefficient. Above you said that you did len(coeffs) + 1 to include the intercept into the random vector. But then why is there an intercept still in the graph? That might be what's causing the problem.

Regarding a PR, it's a bit more tricky as I think both are valid use cases (individual priors for each regressor, and a random vector for all of them). Actually, maybe just an additional kwarg that causes creation of a random vector instead (reg_prior_as_vector or something like that). Thoughts?

@eigenfoo
Copy link
Member

eigenfoo commented Dec 5, 2019

Assuming that this issue is stale. Closing.

@eigenfoo eigenfoo closed this as completed Dec 5, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants