Memory issue using GPU #2448
Some more notes and updates.

Nothing fancy here, I think. When using PyMC3 v3.1 from conda-forge, the kernel dies spontaneously whenever I try to sample traces from the posterior distribution. This is consistent behaviour across all of the notebooks in my Bayesian analysis recipes repository, and switching to the master branch of PyMC3 doesn't change things.

From prior experience, I suspected pygpu might be the culprit: I was on 0.6.8, while the latest version is 0.6.9, so I updated pygpu to 0.6.9 and re-tested with one notebook. I then switched to a different notebook to see if things were working okay. Confident that my "baseline" PyMC3 work could be done, I proceeded to try out the neural network notebook.

Side note on multinomial:
|
To eliminate a moving part, how do things go if you run this as a script, instead of in the notebook? |
Hmmm, let me give that a shot. |
Same issue shows up - |
Man, I'm feeling really defeated right now... Working with the GPU is really difficult... FWIW, I am cross-linking this issue with the same post on Theano. |
GPU computing is tricky. I take it the model runs fine on CPUs? |
Smaller models are working fine. With the Bayesian NN model, when forcing usage of the CPU, the kernel dies spontaneously. I think it's a memory-related issue - Theano devs have responded. |
I ran the MNIST Conv Net on a very large instance (128GB I think) as it does require a lot of RAM. @ericmjl have you tried the same model without pymc3? |
@twiecki I haven't - I did forget to mention earlier on that I have successfully run the model within PyMC3 on my machine. I did something (cannot remember what) to my environment, and suddenly these memory issues cropped up. Rebuilding the environment from scratch gave me this memory issue. I'll give your suggestion a shot. |
Updating my notes here for the benefit of everybody, in case someone else comes across the same issue. cc those who have partaken in the discussion: @twiecki, @fonnesbeck, @ColCarroll

I began manually profiling my Jupyter notebook code, chunk-by-chunk. It looks like my memory woes are being caused by the following line:

```python
with model:
    trace = approx.sample(5000)
```

Using that, all 16GB of my CPU RAM plus some swap space was being consumed. I saw this by using the

Seeing that this was the issue, I hypothesized that the number of samples being drawn was the problem. Thus, I dropped the number of samples from 5K to 1K:

```python
with model:
    trace = approx.sample(1000)
```

And now my memory woes are gone, with only 6GB of RAM being consumed. Now I'm a happy Bayesian 😄, and might want to talk about the basics of Bayesian NNs next year at PyCon! 😃 😃

That said, it got me thinking - ~6GB of RAM consumption? Is this normal? To @twiecki: do you happen to know if there's a quick rule-of-thumb way of calculating memory consumption for NNs? |
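One possible rule of thumb, sketched here as my own back-of-the-envelope estimate and not an official PyMC3 formula: an in-memory trace holds roughly one float64 per free parameter per draw, so memory scales as draws × parameters × 8 bytes.

```python
# Napkin math for trace memory (an assumption, not an official PyMC3
# formula): one float64 per free parameter per draw, overhead ignored.
def trace_memory_gb(n_draws, n_params, bytes_per_value=8):
    """Rough lower bound on in-memory trace size, in GiB."""
    return n_draws * n_params * bytes_per_value / 1024**3

# With ~1 million free parameters (the estimate from later in this thread):
print(round(trace_memory_gb(5000, 1_000_000), 1))  # roughly 37 GiB
print(round(trace_memory_gb(1000, 1_000_000), 1))  # roughly 7.5 GiB
```

These numbers are in the right ballpark for the behaviour above: 5000 draws blowing past 16GB of RAM plus swap, while 1000 draws fit in single-digit gigabytes.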
So, you have a large model, resulting in a big preallocated trace? I guess this speaks to us still needing a non-RAM backend. Can you try seeing if 5k samples works with either a text or hdf5 trace? Would be nice to use dask for this. |
Yep, I have encountered similar problems training NNs in pymc3. One thing I have done in the past is separate my runs into multiple traces. I don't want to use the words "burn-in" and "thinning" for fear of inciting a flame-war, but those two knobs do allow micro-management of RAM usage.
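The multiple-traces idea can be sketched without any PyMC3 specifics: draw samples in fixed-size chunks and stream each chunk to disk, so peak RAM is bounded by the chunk size rather than the total draw count. Everything below (`stream_samples`, the fake sampler, the JSONL format) is a hypothetical stand-in for repeated `approx.sample(chunk_size)` calls, not PyMC3 API.

```python
import json
import os
import tempfile

def stream_samples(draw_chunk, total, chunk_size, path):
    """Write `total` draws to `path`, holding at most `chunk_size` in RAM."""
    written = 0
    with open(path, "w") as f:
        while written < total:
            n = min(chunk_size, total - written)
            for sample in draw_chunk(n):  # only `n` draws live in memory
                f.write(json.dumps(sample) + "\n")
            written += n
    return written

# Toy stand-in for a posterior sampler returning dicts of parameter values.
fake_sampler = lambda n: [{"w": 0.0} for _ in range(n)]
path = os.path.join(tempfile.mkdtemp(), "trace.jsonl")
print(stream_samples(fake_sampler, 5000, 1000, path))  # 5000
```

The trade-off is the one mentioned below in the thread: disk I/O replaces RAM pressure, so this can become an I/O bottleneck.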
|
Thanks for all the detail @ericmjl -- I have nothing helpful to add, but quick napkin math from looking at the notebook you posted: the model specification has six objects, whose sizes appear to be:
which means there are ~1 million |
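That napkin math generalizes: for a dense Bayesian NN, the free-parameter count is the sum over layers of fan_in × fan_out weights plus fan_out biases. The layer widths below are made-up placeholders for illustration, not the actual sizes from the notebook.

```python
def n_params(layer_widths):
    """Total free parameters in a dense net with the given layer widths."""
    total = 0
    for fan_in, fan_out in zip(layer_widths, layer_widths[1:]):
        total += fan_in * fan_out + fan_out  # weight matrix + bias vector
    return total

# Hypothetical MNIST-ish architecture: 784 inputs, two hidden layers of
# width 800, 10 outputs.
print(n_params([784, 800, 800, 10]))  # 1276810, i.e. ~1.3 million
```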
@fonnesbeck: Thanks for the tip! I'm groping around in the dark w.r.t. how to pass in a trace backend, though:

```
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-23-265b5ff3aeb0> in <module>()
      1 with model:
----> 2     trace = approx.sample(1000, trace='text')

TypeError: sample() got an unexpected keyword argument 'trace'
```

I dug into the PyMC3 codebase - it looks like `approx.sample()` doesn't accept a `trace` keyword argument at all. Doing some manual tracing (no pun intended) through the codebase, I'm starting to feel genuinely lost on how I could store the approx. traces to, say, an HDF5 file. Would you be kind enough to point me in the right direction?

@ColCarroll: Thanks too for helping me think through this! Yes, this now makes a lot of sense - each trace has to be one set of sampled parameter values.

@kyleabeauchamp: Thanks too for chiming in! I'll keep that tip in mind. |
You’d want to import the `HDF5` backend from `pymc3.backends` and pass an instance of it as the `trace` argument to `pm.sample`.
|
@fonnesbeck: hmm, I've just tried that, but I get an error.

```
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-26-4c38acdde054> in <module>()
      1 with model:
      2     # trace = approx.sample(1000)
----> 3     trace = pm.sample(trace=HDF5('trace.h5'))

~/github/software/pymc3/pymc3/sampling.py in sample(draws, step, init, n_init, start, trace, chain, njobs, tune, nuts_kwargs, step_kwargs, progressbar, model, random_seed, live_plot, discard_tuned_samples, live_plot_kwargs, **kwargs)
    278     discard = tune if discard_tuned_samples else 0
    279
--> 280     return sample_func(**sample_args)[discard:]
    281
    282

~/github/software/pymc3/pymc3/sampling.py in _sample(draws, step, start, trace, chain, tune, progressbar, model, random_seed, live_plot, live_plot_kwargs, **kwargs)
    293     try:
    294         strace = None
--> 295         for it, strace in enumerate(sampling):
    296             if live_plot:
    297                 if live_plot_kwargs is None:

~/anaconda/envs/bayesian/lib/python3.6/site-packages/tqdm-4.15.0-py3.6.egg/tqdm/_tqdm.py in __iter__(self)
    870             """, fp_write=getattr(self.fp, 'write', sys.stderr.write))
    871
--> 872         for obj in iterable:
    873             yield obj
    874             # Update and print the progressbar.

~/github/software/pymc3/pymc3/sampling.py in _iter_sample(draws, step, start, trace, chain, tune, model, random_seed)
    393             point, states = step.step(point)
    394             if strace.supports_sampler_stats:
--> 395                 strace.record(point, states)
    396             else:
    397                 strace.record(point)

~/github/software/pymc3/pymc3/backends/hdf5.py in record(self, point, sampler_stats)
    176             data = self.stats[str(i)]
    177             for key, val in vars.items():
--> 178                 data[key][self.draw_idx] = val
    179
    180         self.draw_idx += 1

h5py/_objects.pyx in h5py._objects.with_phil.wrapper (/home/ilan/minonda/conda-bld/h5py_1496889914775/work/h5py/_objects.c:2846)()
h5py/_objects.pyx in h5py._objects.with_phil.wrapper (/home/ilan/minonda/conda-bld/h5py_1496889914775/work/h5py/_objects.c:2804)()

~/anaconda/envs/bayesian/lib/python3.6/site-packages/h5py/_hl/dataset.py in __setitem__(self, args, val)
    628         mspace = h5s.create_simple(mshape_pad, (h5s.UNLIMITED,)*len(mshape_pad))
    629         for fspace in selection.broadcast(mshape):
--> 630             self.id.write(mspace, fspace, val, mtype, dxpl=self._dxpl)
    631
    632     def read_direct(self, dest, source_sel=None, dest_sel=None):

h5py/_objects.pyx in h5py._objects.with_phil.wrapper (/home/ilan/minonda/conda-bld/h5py_1496889914775/work/h5py/_objects.c:2846)()
h5py/_objects.pyx in h5py._objects.with_phil.wrapper (/home/ilan/minonda/conda-bld/h5py_1496889914775/work/h5py/_objects.c:2804)()
h5py/h5d.pyx in h5py.h5d.DatasetID.write (/home/ilan/minonda/conda-bld/h5py_1496889914775/work/h5py/h5d.c:3700)()
h5py/_proxy.pyx in h5py._proxy.dset_rw (/home/ilan/minonda/conda-bld/h5py_1496889914775/work/h5py/_proxy.c:2028)()
h5py/_proxy.pyx in h5py._proxy.H5PY_H5Dwrite (/home/ilan/minonda/conda-bld/h5py_1496889914775/work/h5py/_proxy.c:1738)()

OSError: Can't prepare for writing data (No appropriate function for conversion path)
```

When switching to a |
Maybe this just causes an IO bottleneck. I wonder if dask would be faster. Haven't seen the HDF5 error before. Perhaps others have. |
The error when using hdf5 is due to a bug in the stats for NUTS. It should be fixed by #2467. |
Should this issue and associated Theano issue be closed? Theano/Theano#6206 |
Will wait for @ericmjl to confirm, but yes, I think this is more of a PyMC3 feature request than a bug. |
@ColCarroll @nouiz yes, please go ahead and close the Theano issue. I will close this one here. |
Hey team,
I've been trying to debug this memory allocation issue for a while now, but can't seem to find a way out.
Firstly, on what I'm trying to do - I'm trying to implement a Bayesian NN using PyMC3. An example notebook is here.
Next up, my environment:
- pygpu 0.6.8
- libgpuarray 0.6.8

The error I keep getting when I run the aforementioned notebook crops up at the stage of sampling from the PPC:
Error message is:
Stack trace looks like this:
I initially suspected that this was a Theano problem, so I've cross-posted there already (will update with more detail), but could there be something that I'm doing wrong here?
Any help's appreciated, but if you guys are too busy, no rush here!