Compilation error for large number of categorical features #624

wcbeard · 2014-10-16T16:24:46Z

I'm not sure if using a large number of categorical variables is an abuse of pymc, or if I'm just doing it wrong. I've reproduced the error with synthetic data with 500 possible string values for feature X (though the error also appears with fewer values, such as 250). I'm using the glm module with the model glm('Y ~ C(X)'), which I can get working in statsmodels. The synthetic data is generated as follows:

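import itertools as it
import string

import numpy as np
import pandas as pd

def f(st):
    # Response: sum of the two character codes plus standard normal noise
    return ord(st[0]) + ord(st[1]) + np.random.randn()

# 500 distinct two-letter strings serve as the categorical levels
# (list() keeps this working on Python 3, where map returns an iterator)
wds = list(map(''.join, it.islice(it.permutations(string.ascii_uppercase, 2), 500)))
wd_dat = np.random.choice(wds, 5000)
y = list(map(f, wd_dat))

data = pd.DataFrame(dict(Y=y, X=wd_dat))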
data[:4]
Out[49]:
    X           Y
0  JA  139.636050
1  GU  156.806869
2  FZ  161.310029
3  HU  157.979341

When I try to run the following model

import pymc as mc  # the PyMC 3 development version; the package was named `pymc` at the time

with mc.Model() as model:
    mc.glm.glm('Y ~ C(X)', data)
    trace = mc.sample(2000, mc.NUTS(), progressbar=True)

I get Exception: ('Compilation failed (return status=1): /Users/me/.theano/compiledir_Darwin-13.3.0-x86_64-i386-64bit-i386-2.7.8-64/tmp_cmBr5/mod.cpp:28159:32: fatal error: bracket nesting level exceeded maximum of 256.
full trace here.

Is this expected?


stefan-pdx · 2014-12-17T16:57:57Z

I came across the same error as well. Were you able to come up with a workaround?

wcbeard · 2014-12-17T17:04:02Z

@slnovak Not yet, unfortunately.

twiecki · 2014-12-17T17:10:01Z

It's kinda odd to get the nesting level error. At the core, glm should just create a very large design matrix that then gets matrix-multiplied with the coefficients vector (https://github.com/pymc-devs/pymc/blob/master/pymc/glm/glm.py#L94). Perhaps the issue is not the design matrix but rather the coefficients, which I think are all individual RVs but maybe should be just a single vector (https://github.com/pymc-devs/pymc/blob/master/pymc/glm/glm.py#L92).
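To make the "design matrix times coefficients vector" idea concrete, here is a minimal pure-Python sketch of dummy coding a categorical feature by hand. It assumes treatment coding with the first sorted level as the reference (which is how patsy handles C(X) by default); the function names and coefficient values are hypothetical and are not taken from glm.py.

```python
def design_matrix(values):
    """Build an intercept column plus one indicator per non-reference level."""
    levels = sorted(set(values))
    non_reference = levels[1:]  # the first sorted level is the reference
    names = ["Intercept"] + ["C(X)[T.{}]".format(lv) for lv in non_reference]
    rows = [[1.0] + [1.0 if v == lv else 0.0 for lv in non_reference]
            for v in values]
    return names, rows


def linear_predictor(rows, coeffs):
    """Matrix-vector product: one dot product per row of the design matrix."""
    return [sum(x * b for x, b in zip(row, coeffs)) for row in rows]


x = ["JA", "GU", "FZ", "GU"]
names, rows = design_matrix(x)

# Hypothetical coefficients: intercept (reference level FZ), GU effect, JA effect
coeffs = [10.0, 2.0, -3.0]
y_est = linear_predictor(rows, coeffs)

print(names)  # ['Intercept', 'C(X)[T.GU]', 'C(X)[T.JA]']
print(y_est)  # [7.0, 12.0, 10.0, 12.0]
```

With 500 levels the matrix simply gets 500 columns. The arithmetic stays a single matrix-vector product, which is why a single vector-valued coefficient variable is a natural fit.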

twiecki · 2014-12-17T17:17:46Z

So I guess you could try to replace this loop https://github.com/pymc-devs/pymc/blob/master/pymc/glm/glm.py#L88 with the creation of a random vector (e.g. coeffs = pm.Normal('coeffs', mu=0, sd=1, shape=len(reg_names))).

stefan-pdx · 2014-12-17T17:54:39Z

Well, for me, I was using a Dirichlet distribution with a shape of ~800. I was able to get the resulting Theano code to compile by including the following in ~/.theanorc:

[gcc]
cxxflags = -fbracket-depth=1024

However, I gave up as it was taking 20+ min for Theano to compile the model. I'm trying PyMC 2.3.4 to see if the model will run.

wcbeard · 2014-12-17T19:17:36Z

@twiecki When I try your suggestion, using coeffs = Normal('coeffs', mu=0, sd=1, shape=len(reg_names) + 1) (the + 1 is for the intercept) and y_est = theano.dot(np.asarray(dmatrix), coeffs) on the following line, I get an error further down when it tries to convert coeffs to a set (set(coeffs) => ValueError: length not known). I'm not familiar with the distribution types, so I can't tell how to deal with the iteration/data structure conversion.

twiecki · 2014-12-17T19:23:31Z

@d10genes Hm, yeah, there are a few more changes needed, I'm afraid.
y_est = theano.dot(np.asarray(dmatrix), theano.tensor.stack(*coeffs)).reshape((1, -1)) needs to change to:
y_est = theano.dot(np.asarray(dmatrix), coeffs).reshape((1, -1))

and
return y_est, coeffs
to:
return y_est, [coeffs]

twiecki · 2014-12-17T19:24:35Z

@slnovak Note that there is no glm module in pymc 2. You can also try to do the regression manually in pymc3, which has a somewhat nicer syntax for matrices.

wcbeard · 2014-12-17T20:38:05Z

@twiecki thanks! The second suggestion seemed to do it. At least, it ran without errors.

But either I'm not reading the trace plots right, or it's having trouble converging to the right solution. I can't tell if that's due to this change, or if my artificial data set and model are just too ill-defined.

If the code seems right to you, should I send a PR, or do we need something more robust (not sure what kind of additional tests this would call for)?

hgbrian · 2014-12-18T01:40:21Z

I had a similar problem when I was adding two Multinomials to the model per row of data. I could not include more than 20 datapoints in the model (out of one million!). Rather than editing ~/.theanorc, I preferred the Python-based solution:
theano.config.gcc.cxxflags = "-fbracket-depth=16000" # default is 256
However, after making this change, I also found the theano compilation was too slow to be practical. It will be interesting to see if I can adapt the advice here to my problem. Thanks!
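For anyone who still wants to experiment with the flag despite the slow compile times, another option is the THEANO_FLAGS environment variable, which takes comma-separated option=value pairs and is read when Theano is imported. The sketch below only builds and sets the string; the helper name with_bracket_depth is hypothetical, and note that it replaces any gcc.cxxflags entry already present rather than merging with it.

```python
import os


def with_bracket_depth(depth, existing_flags=""):
    """Return a THEANO_FLAGS string that sets -fbracket-depth via gcc.cxxflags."""
    new_flag = "gcc.cxxflags=-fbracket-depth={}".format(depth)
    parts = [p for p in existing_flags.split(",") if p]
    # Drop an earlier gcc.cxxflags entry so the option is not given twice.
    parts = [p for p in parts if not p.startswith("gcc.cxxflags=")]
    parts.append(new_flag)
    return ",".join(parts)


# Set this before `import theano`; changing it afterwards has no effect
# on the already-loaded configuration.
os.environ["THEANO_FLAGS"] = with_bracket_depth(
    1024, os.environ.get("THEANO_FLAGS", "")
)
print(os.environ["THEANO_FLAGS"])
```

This only raises the compiler limit. It does nothing about the size of the generated code, so the long compile times described in this thread would be expected to remain.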

twiecki · 2014-12-18T07:27:55Z

Just setting the bracket depth is not the right solution. The model being constructed is just more complex than it needs to be.

@d10genes Cool that it's at least compiling now! The convergence does look pretty odd. It seems like one of the coefficients is being drawn to a bad region and that this interacts with the intercept, which suggests collinearity. You could try a scatter plot of the intercept vs. the offending coefficient. Above you said that you used len(coeffs) + 1 to include the intercept in the random vector. But then why is there still an intercept in the graph? That might be what's causing the problem.
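As a general illustration of how an intercept can trade off against categorical coefficients (not a diagnosis of the model in this thread), the pure-Python sketch below builds a design matrix with an intercept column plus an indicator for every level. The indicator columns then sum to the intercept column, so shifting the intercept up and every level effect down by the same amount leaves the predictions unchanged. All names and numbers are hypothetical.

```python
def one_hot_with_intercept(values):
    """Intercept column plus an indicator for EVERY level (no reference level)."""
    levels = sorted(set(values))
    return [[1.0] + [1.0 if v == lv else 0.0 for lv in levels] for v in values]


def predict(rows, coeffs):
    """Linear predictor: dot product of each design row with the coefficients."""
    return [sum(x * b for x, b in zip(row, coeffs)) for row in rows]


rows = one_hot_with_intercept(["JA", "GU", "FZ", "GU", "JA"])

# The indicator columns add up to the intercept column: perfect collinearity.
intercept_col = [row[0] for row in rows]
indicator_sum = [sum(row[1:]) for row in rows]
print(intercept_col == indicator_sum)  # True

# Two different coefficient vectors that give identical predictions.
a = [10.0, 1.0, 2.0, 3.0]                       # intercept, FZ, GU, JA
shift = 5.0
b = [a[0] + shift] + [c - shift for c in a[1:]]
print(predict(rows, a) == predict(rows, b))     # True
```

When the data cannot tell such coefficient vectors apart, only the priors constrain the shift, and a sampler can wander along that direction. That would show up as a strong negative correlation in a scatter plot of the intercept against a level coefficient.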

Regarding a PR, it's a bit more tricky as I think both are valid use cases (individual priors for each regressor, and a random vector for all of them). Actually, maybe just an additional kwarg that causes creation of a random vector instead (reg_prior_as_vector or something like that). Thoughts?

eigenfoo · 2019-12-05T06:01:09Z

Assuming that this issue is stale. Closing.

twiecki added the bug label Dec 17, 2014

fonnesbeck mentioned this issue Feb 2, 2016

Inscrutable Theano code dump using NUTS #955

Closed

junpenglao mentioned this issue Mar 13, 2016

Inefficient sampling of large categorical model #1018

Closed

fonnesbeck mentioned this issue Oct 24, 2016

PyMC3 Modeling Show-stoppers #840

Closed

unrealwill mentioned this issue Feb 22, 2017

Problems with LocallyConnected1D Layer and Theano keras-team/keras#5479

Closed

eigenfoo closed this as completed Dec 5, 2019
