Running multiple chains causes RecursionError #879

Closed
fonnesbeck opened this issue Nov 25, 2015 · 78 comments
@fonnesbeck
Member

Setting the njobs parameter to run multiple chains results in an error:

---------------------------------------------------------------------------
RecursionError                            Traceback (most recent call last)
<ipython-input-59-548e16bedce3> in <module>()
      6 
      7 
----> 8     trace = sample(5000, njobs=2)

/Users/fonnescj/Github/pymc3/pymc3/sampling.py in sample(draws, step, start, trace, chain, njobs, tune, progressbar, model, random_seed)
    153         sample_args = [draws, step, start, trace, chain,
    154                        tune, progressbar, model, random_seed]
--> 155     return sample_func(*sample_args)
    156 
    157 

/Users/fonnescj/Github/pymc3/pymc3/sampling.py in _mp_sample(njobs, args)
    274 def _mp_sample(njobs, args):
    275     p = mp.Pool(njobs)
--> 276     traces = p.map(argsample, args)
    277     p.close()
    278     return merge_traces(traces)

/Users/fonnescj/anaconda3/lib/python3.5/multiprocessing/pool.py in map(self, func, iterable, chunksize)
    258         in a list that is returned.
    259         '''
--> 260         return self._map_async(func, iterable, mapstar, chunksize).get()
    261 
    262     def starmap(self, func, iterable, chunksize=None):

/Users/fonnescj/anaconda3/lib/python3.5/multiprocessing/pool.py in get(self, timeout)
    606             return self._value
    607         else:
--> 608             raise self._value
    609 
    610     def _set(self, i, obj):

/Users/fonnescj/anaconda3/lib/python3.5/multiprocessing/pool.py in _handle_tasks(taskqueue, put, outqueue, pool, cache)
    383                         break
    384                     try:
--> 385                         put(task)
    386                     except Exception as e:
    387                         job, ind = task[:2]

/Users/fonnescj/anaconda3/lib/python3.5/multiprocessing/connection.py in send(self, obj)
    204         self._check_closed()
    205         self._check_writable()
--> 206         self._send_bytes(ForkingPickler.dumps(obj))
    207 
    208     def recv_bytes(self, maxlength=None):

/Users/fonnescj/anaconda3/lib/python3.5/multiprocessing/reduction.py in dumps(cls, obj, protocol)
     48     def dumps(cls, obj, protocol=None):
     49         buf = io.BytesIO()
---> 50         cls(buf, protocol).dump(obj)
     51         return buf.getbuffer()
     52 

RecursionError: maximum recursion depth exceeded
@twiecki
Member

twiecki commented Nov 27, 2015

Ran into the same issue.

@twiecki
Member

twiecki commented Nov 27, 2015

It seems to work for simpler models, but the stochastic volatility model runs with njobs=2 and breaks with njobs=4. So odd.

@twiecki
Member

twiecki commented Dec 17, 2015

Can you check whether e873d6d fixes it?

@fonnesbeck
Member Author

Well, I get a different error, so that's progress.

MaybeEncodingError: Error sending result: '[<MultiTrace: 1 chains, 40000 iterations, 9 variables>]'. Reason: 'error("'i' format requires -2147483648 <= number <= 2147483647",)'

@twiecki
Member

twiecki commented Dec 17, 2015

And what a specific error it is. MaybeEncodingErrorMaybeNot

@fonnesbeck
Member Author

Yeah, that seemed odd -- creating an Exception subclass for an error that you're not totally sure about.

@twiecki
Member

twiecki commented Dec 17, 2015

Anyway, it looks like we might be passing an object where an int is expected?

@hvasbath
Contributor

hvasbath commented Feb 8, 2016

You can somewhat hack around this with sys.setrecursionlimit(2000), but that only works up to a certain number of parameters. With my latest model of around 450 parameters it doesn't help. As I really need the parallel implementation to work (otherwise my model has to run for months), I would like to look into this. Can you point me to some code lines where I could start looking, as I am not yet familiar with the code base? Thank you!
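
For reference, the workaround looks roughly like this (a minimal sketch; the limit is whatever your model needs, and model stands for any already-built PyMC3 model):

import sys
import pymc3 as pm

# raise the interpreter's recursion ceiling (the default is usually 1000)
# before the model gets pickled for the worker processes
sys.setrecursionlimit(2000)

with model:
    trace = pm.sample(5000, njobs=4)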

@hvasbath
Contributor

hvasbath commented Feb 8, 2016

With the increased recursion limit and the latest commit from twiecki above (e873d6d), I get the error below. It keeps running while doing nothing. Does anybody have advice on where I could start investigating?

Exception in thread Thread-14:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 504, in run
self.__target(*self.__args, **self.__kwargs)
File "/usr/lib/python2.7/multiprocessing/pool.py", line 319, in _handle_tasks
put(task)
SystemError: NULL result without error in PyObject_Call

@fonnesbeck
Member Author

All of the multiprocessing business for PyMC3 is in the sampling module. It's a pretty basic mapping of processes over the elements of a multiprocessing Pool. We might want to explore using ipyparallel for parallel processing.

@twiecki
Member

twiecki commented Feb 8, 2016

I have also considered switching. The issue is that currently you can't launch processes internally (see ipython/ipyparallel#22 for a plan to change that).

@fonnesbeck
Member Author

That should not be a deal-breaker. Forcing the user to spin up ipcluster is not particularly onerous, especially if you are working in Jupyter, where it is just a tab in the interface. I think it's a small price to pay for more robust parallelism, and if it gets automated in the future, all the better.

@datnamer

datnamer commented Feb 8, 2016

What about Dask?

@fonnesbeck
Member Author

Would Dask be effective here? I could see it if we were applying the same algorithm to subsets of a dataset, but a set of parallel chains executes over the entire dataset in each chain. So it's not clear how Dask's collections would be beneficial. That said, it may be useful if we ever implement expectation propagation, which does subdivide the data.

@datnamer

datnamer commented Feb 8, 2016

Dask imperative plus the multiprocessing scheduler can schedule the chains without needing a specific collection to chunk the data.
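
Something like this minimal sketch, using dask.delayed (the current name of the imperative API), where sample_one_chain is a hypothetical stand-in for PyMC3's per-chain sampler:

from dask import compute, delayed

def sample_one_chain(chain_id, draws):
    # hypothetical: run one MCMC chain to completion and return its trace
    ...

# one lazy task per chain; nothing executes until compute()
tasks = [delayed(sample_one_chain)(i, 5000) for i in range(4)]
traces = compute(*tasks, scheduler="processes")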

But this is out of my depth.

Maybe @mrocklin can chime in.

@twiecki
Member

twiecki commented Feb 8, 2016

I don't think Dask, although awesome, can be leveraged here.

@mrocklin

mrocklin commented Feb 8, 2016

If someone can briefly describe the problem I'd be happy to chime in if there is potential overlap. The dask schedulers are useful well outside the common use case of big chunked arrays. If you're considering technologies like multiprocessing or ipyparallel it's possible that one of the dask schedulers could be relevant.

@fonnesbeck
Member Author

@mrocklin Matt, this is Monte Carlo sampling for Bayesian statistical modeling. It's an embarrassingly parallel task that just simulates Markov chains using the same model on the same dataset, then uses the sampled chains (the output of the algorithm) for inference. We are currently using the multiprocessing module for this, but are contemplating a move to something more robust.

@mrocklin

mrocklin commented Feb 8, 2016

Something non-trivial must be going on to cause multiprocessing to hang.

Looking at the traceback it seems like you might be trying to send something that pickle doesn't like? Historically I've gotten around this by pre-serializing everything with dill or cloudpickle before I hand things off to multiprocessing. This is what dask.multiprocessing.get does.

If this is what is going on, then the pathos library would probably be a decent drop-in replacement for you all. It's a multiprocessing clone that uses dill.

But really, I'm just guessing at the problem that you're trying to solve and so am probably out of my depth here. Happy to help if I can. Best of luck.

@fonnesbeck
Member Author

Thanks, Matt. Unfortunately pathos appears not to support Python 3 yet, so I will look at explicitly passing everything to dill.

@mrocklin

I write a function like the following:

import dill

def apply(serialized_func, serialized_args, serialized_kwargs):
    func = dill.loads(serialized_func)
    args = dill.loads(serialized_args)
    kwargs = dill.loads(serialized_kwargs)
    return func(*args, **kwargs)

And then I dill.dumps my func, args, and kwargs ahead of time and call them with the apply function remotely. Something like the following (starmap, so that each task gets its three serialized pieces; kwargs are empty here):

pool.starmap(apply, [(dill.dumps(func), dill.dumps(args), dill.dumps({}))
                     for args in sequence])

<self serving> Or, you can always just use dask.multiprocessing.get, where this work is already done. </self serving>

@fonnesbeck
Member Author

I might have found a solution using Joblib, but will give this a shot if that doesn't work. Thanks again.

@mrocklin

Oh great. That's much simpler.

@tyarkoni
Contributor

I don't think this solves the problem, unfortunately... On the joblib branch, with njobs=4 and a pretty big model, I still get a max recursion exceeded exception (see below). On inspection, it looks like Joblib uses multiprocessing as its default backend, so I guess that makes sense. I tried switching to the threading backend, but that failed with a different set of errors.

Traceback (most recent call last):
  File "run_wm.py", line 53, in <module>
    run_model(40)
  File "run_wm.py", line 40, in run_model
    trace = model.run(samples=250, verbose=True, find_map=False, njobs=4)
  File "/Users/tal/Dropbox/Projects/RandomStimuli/code/pymcwrap/pymcwrap/model.py", line 383, in run
    samples, start=start, step=step, progressbar=verbose, njobs=njobs)
  File "/usr/local/lib/python3.5/site-packages/pymc3-3.0-py3.5.egg/pymc3/sampling.py", line 146, in sample
    return sample_func(**sample_args)
  File "/usr/local/lib/python3.5/site-packages/pymc3-3.0-py3.5.egg/pymc3/sampling.py", line 272, in _mp_sample
    **kwargs) for i in range(njobs))
  File "/usr/local/lib/python3.5/site-packages/joblib-0.9.4-py3.5.egg/joblib/parallel.py", line 810, in __call__
    self.retrieve()
  File "/usr/local/lib/python3.5/site-packages/joblib-0.9.4-py3.5.egg/joblib/parallel.py", line 727, in retrieve
    self._output.extend(job.get())
  File "/usr/local/Cellar/python3/3.5.0/Frameworks/Python.framework/Versions/3.5/lib/python3.5/multiprocessing/pool.py", line 608, in get
    raise self._value
  File "/usr/local/Cellar/python3/3.5.0/Frameworks/Python.framework/Versions/3.5/lib/python3.5/multiprocessing/pool.py", line 385, in _handle_tasks
    put(task)
  File "/usr/local/lib/python3.5/site-packages/joblib-0.9.4-py3.5.egg/joblib/pool.py", line 368, in send
    CustomizablePickler(buffer, self._reducers).dump(obj)
RecursionError: maximum recursion depth exceeded

@fonnesbeck
Member Author

It was worth a shot. I will try flavoring it with a little dill.

@fonnesbeck
Member Author

Actually, joblib serializes the arguments for us, so that's not the solution. Increasing the recursion limit helps (as @eigenblutwurst notes), which I have now done inside sample. This problem may resurface with bigger models, but it works well on the rugby analytics example (which I have modified) using 4 cores.
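
Roughly, the parallel path is now shaped like this (a sketch, not the actual code; _sample is a stand-in for the per-chain worker and the limit is illustrative):

import sys
from joblib import Parallel, delayed

def _mp_sample(njobs, draws, **kwargs):
    # pickling the deeply nested Theano graph is what overflows the stack,
    # so raise the ceiling before fanning the chains out to workers
    sys.setrecursionlimit(10000)
    return Parallel(n_jobs=njobs)(
        delayed(_sample)(draws, chain=i, **kwargs) for i in range(njobs))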

@twiecki
Member

twiecki commented Feb 11, 2016

joblib/joblib#240

@grburgess

I just updated to the latest Theano 0.8 and pymc3, and this problem has disappeared for me.
Strange thing though: while I installed it manually with setup.py install, it still complained that it wanted Theano 0.7. The install seemed to go OK though.

@hvasbath
Contributor

Yes, for me it also wants to install Theano 0.7 although I have the dev version, which is somewhat annoying. I simply disabled the requirement in the setup script, although there must be a nicer way.

@twiecki
Member

twiecki commented Mar 18, 2016

It's trying to pull 0.7 when you run pymc3's setup.py?

@hvasbath
Contributor

Yes it does.

@grburgess

Yes, it seemed to install fine and use Theano 0.8, but it was rather confusing.

@hvasbath
Contributor

I have to abort it, because when I let it install, my import uses the 0.7 version instead of the dev version. They made so many improvements in the current dev version that it is really important to use it.

@twiecki
Member

twiecki commented Mar 18, 2016

f9de16e should fix that.

@hvasbath
Contributor

Ah great thx!

@grburgess

Fixed it. thanks!

@springcoil
Contributor

Is it time to close this?

@grburgess

I haven't done extensive testing, but on some high dimensional problems that originally threw the recursion error, the problem has disappeared. So perhaps for now it is solved. :)

@twiecki
Member

twiecki commented Mar 18, 2016

That sounds amazing. I'll close it but feel free to reopen if the problem persists with master pymc3 and theano.

@twiecki twiecki closed this as completed Mar 18, 2016
@jonsedar
Contributor

Thanks for the recent bugfixes, guys. The updates to the build dependencies also mean I'm now running theano 0.8.0rc1, and either or both changes seem to have raised the threshold at which I was hitting recursion errors.

EDIT: Okay, well - that does seem to have fixed it. I think I have a different bug though:

  1. for a sufficiently complex model, the first time I create and sample it using njobs > 1, the processes start (I'm viewing in htop) and then they die without throwing an error
  2. If I re-run the sampling then the processes seem to run fine.

I assume the difference in 2 is that the model is already cached. It's tricky to replicate though, a bit of a Heisenbug!

@hvasbath
Contributor

I also still get my segmentation faults, even when creating all the Text backends in advance...

@vivek-hari

vivek-hari commented Apr 19, 2016

Oh, really? Even with the latest pymc3 version I am getting the same error with njobs=2.

multiprocessing.pool.MaybeEncodingError: Error sending result: '[<MultiTrace: 1 chains, 10 iterations, 2106 variables>]'. Reason: 'RuntimeError('maximum recursion depth exceeded',)'

    trace = pm.sample(n_samples, step=step_func, start=start, njobs=n_chains, progressbar=False)
  File "/home/user/.local/lib/python2.7/site-packages/pymc3/sampling.py", line 150, in sample
    return sample_func(**sample_args)
  File "/home/user/.local/lib/python2.7/site-packages/pymc3/sampling.py", line 282, in _mp_sample
    **kwargs) for i in range(njobs))
  File "/home/user/.local/lib/python2.7/site-packages/joblib/parallel.py", line 810, in __call__
    self.retrieve()
  File "/home/user/.local/lib/python2.7/site-packages/joblib/parallel.py", line 727, in retrieve
    self._output.extend(job.get())
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 558, in get
    raise self._value
multiprocessing.pool.MaybeEncodingError: Error sending result: '[<MultiTrace: 1 chains, 10 iterations, 2106 variables>]'. Reason: 'RuntimeError('maximum recursion depth exceeded',)'

I have pymc3-3.0, numpy-1.11.0, Theano-0.8.1, scipy-0.17.0 installed.
Anyone else facing the same issue in the latest version of pymc3?

@fonnesbeck
Member Author

By "latest pymc3 version" do you mean that you installed it from GitHub master? That is,

pip install -U git+https://github.com/pymc-devs/pymc3.git

@vivek-hari

I installed using

pip install --process-dependency-links git+https://github.com/pymc-devs/pymc3

@fonnesbeck
Member Author

Make sure you use the -U flag or it may not update. I have not had this error since we closed this issue, so my first guess is that your update did not stick.

@vivek-hari

Oh, thank you so much for your quick response. I'll update using the -U flag and get back to you. Thanks again!

@vivek-hari

Sorry @fonnesbeck, installing pymc3 with -U also leads to the same error.
I even removed all the packages (pymc3, numpy, scipy, theano) from my machine and tried a fresh installation of pymc3 using pip install -U git+https://github.com/pymc-devs/pymc3.git. It also ended up in RuntimeError('maximum recursion depth exceeded',).

I have
Python 2.7.6,
pymc3-3.0,
matplotlib-1.5.1, joblib-0.9.4, numpy-1.11.0, pandas-0.18.0, patsy-0.4.1, pydot_ng-1.0.0, pyparsing-2.1.1, scipy-0.17.0,
Theano-0.8.1
installed on my machine.

The nvidia-smi toolkit gives the following details:
NVIDIA-SMI 346.96, Driver Version: 346.96, 4 GPUs (0, 1, 2, 3).

My .theanorc config is:

[global]
device = gpu
floatX = float32
assert_no_cpu_op = warn
[cuda]
root = /usr/local/cuda
[nvcc]
fastmath = True
[pycuda]
init = True

Is there anything else to be done?

@twiecki
Member

twiecki commented Apr 21, 2016

Perhaps the GPU utilization is at fault? Have you tried with CPU?


@vivek-hari

Thanks @twiecki, I will try with CPU and post my updates.

@vivek-hari

Setting device=cpu in .theanorc also raises RuntimeError('maximum recursion depth exceeded',).

@vivek-hari

Below is the snippet I am trying to execute.

import pymc3 as pm
import theano.tensor as T
import pandas

def tinvlogit(x):
    return T.exp(x) / (1 + T.exp(x))

pandas_df = pandas.read_csv("data.csv")

x_col1 = pandas_df['col1']
x_col2 = pandas_df['col2']
x_col3 = pandas_df['col3']
n_col3 = len(pandas_df['col3'].unique())

with pm.Model() as model:
        b_0 = pm.Normal('b_0', mu=0, sd=100)
        b_col1 = pm.Normal('b_col1', mu=0, sd=100)
        b_col2 = pm.Normal('b_col2', mu=0, sd=100)
        sigma_col3 = pm.HalfNormal('sigma_col3', sd=100)
        b_col3 = pm.Normal('b_col3', mu=0, sd=sigma_col3, shape=n_col3)

        for i in range(0, len(pandas_df)):
            p = pm.Deterministic('p', T.maximum(0, T.minimum(1, tinvlogit(
                b_0 + b_col1 * x_col1.at[i] + b_col2 * x_col2.at[i] + b_col3[x_col3.at[i]))))

        y = pm.Bernoulli('y', p, observed=pandas_df.y)

        start = pm.find_MAP()

        step_func = pm.NUTS()

        trace = pm.sample(5000, step=step_func, start=start, njobs=2, progressbar=True)

pm.sample fails with RuntimeError('maximum recursion depth exceeded')

pandas_df is a pandas DataFrame with columns col1 (decimal), col2 (decimal), col3 (integer between 1 and 10), and y (0 or 1), and it has 50,000 rows.

@hvasbath
Contributor

You get the recursion error because your graph will be very long: your loop runs 50k times, each time adding all the nodes again. Although I don't really get the purpose of your model, I have the feeling you could vectorize it and get rid of the loop. The RVs have a shape parameter with which you can simply create vectors the length of your data frame.
The way you do it now, p will always be overwritten and only the last row of your dataframe will go into the cost. Or am I missing something?
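
An untested sketch of the vectorized version (T.nnet.sigmoid replaces the hand-rolled tinvlogit, which also makes the clipping unnecessary since a sigmoid is already bounded in (0, 1); as in your loop, col3 is assumed to hold valid zero-based indices into b_col3):

import pymc3 as pm
import theano.tensor as T
import pandas

pandas_df = pandas.read_csv("data.csv")

x_col1 = pandas_df['col1'].values
x_col2 = pandas_df['col2'].values
x_col3 = pandas_df['col3'].values
n_col3 = len(pandas_df['col3'].unique())

with pm.Model() as model:
    b_0 = pm.Normal('b_0', mu=0, sd=100)
    b_col1 = pm.Normal('b_col1', mu=0, sd=100)
    b_col2 = pm.Normal('b_col2', mu=0, sd=100)
    sigma_col3 = pm.HalfNormal('sigma_col3', sd=100)
    b_col3 = pm.Normal('b_col3', mu=0, sd=sigma_col3, shape=n_col3)

    # one vectorized expression over all 50k rows instead of a Python loop;
    # the fancy index b_col3[x_col3] replaces the per-row lookup
    p = pm.Deterministic('p', T.nnet.sigmoid(
        b_0 + b_col1 * x_col1 + b_col2 * x_col2 + b_col3[x_col3]))

    y = pm.Bernoulli('y', p, observed=pandas_df.y.values)

    trace = pm.sample(5000, njobs=2)

With a graph this small, the pickling depth stays tiny, so njobs > 1 should no longer hit the recursion limit either.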
