
Remove default multiprocessing when on Windows? #3403

Closed
david-cortes opened this issue Mar 11, 2019 · 9 comments

Comments

@david-cortes

I’m not sure how to reproduce it, but oftentimes, depending on the type of model and random seed, pymc3’s NUTS sampler fails on Windows 10 and crashes the IPython notebook along with it. This is the stack trace I get from running it in a regular Python process (not an IPython notebook):

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Users\david.cortes\AppData\Local\Continuum\anaconda3\lib\multiprocessing\spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
  File "C:\Users\david.cortes\AppData\Local\Continuum\anaconda3\lib\multiprocessing\spawn.py", line 114, in _main
    prepare(preparation_data)
  File "C:\Users\david.cortes\AppData\Local\Continuum\anaconda3\lib\multiprocessing\spawn.py", line 225, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "C:\Users\david.cortes\AppData\Local\Continuum\anaconda3\lib\multiprocessing\spawn.py", line 277, in _fixup_main_from_path
    run_name="__mp_main__")
  File "C:\Users\david.cortes\AppData\Local\Continuum\anaconda3\lib\runpy.py", line 263, in run_path
    pkg_name=pkg_name, script_name=fname)
  File "C:\Users\david.cortes\AppData\Local\Continuum\anaconda3\lib\runpy.py", line 96, in _run_module_code
    mod_name, mod_spec, pkg_name, script_name)
  File "C:\Users\david.cortes\AppData\Local\Continuum\anaconda3\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\ipython\macroeconomic_forecast\crashing_script.py", line 76, in <module>
    trace = pm.sample(random_seed = 1)
  File "C:\Users\david.cortes\AppData\Local\Continuum\anaconda3\lib\site-packages\pymc3\sampling.py", line 439, in sample
    trace = _mp_sample(**sample_args)
  File "C:\Users\david.cortes\AppData\Local\Continuum\anaconda3\lib\site-packages\pymc3\sampling.py", line 986, in _mp_sample
    chain, progressbar)
  File "C:\Users\david.cortes\AppData\Local\Continuum\anaconda3\lib\site-packages\pymc3\parallel_sampling.py", line 313, in __init__
    for chain, seed, start in zip(range(chains), seeds, start_points)
  File "C:\Users\david.cortes\AppData\Local\Continuum\anaconda3\lib\site-packages\pymc3\parallel_sampling.py", line 313, in <listcomp>
    for chain, seed, start in zip(range(chains), seeds, start_points)
  File "C:\Users\david.cortes\AppData\Local\Continuum\anaconda3\lib\site-packages\pymc3\parallel_sampling.py", line 204, in __init__
    self._process.start()
  File "C:\Users\david.cortes\AppData\Local\Continuum\anaconda3\lib\multiprocessing\process.py", line 105, in start
    self._popen = self._Popen(self)
  File "C:\Users\david.cortes\AppData\Local\Continuum\anaconda3\lib\multiprocessing\context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "C:\Users\david.cortes\AppData\Local\Continuum\anaconda3\lib\multiprocessing\context.py", line 322, in _Popen
    return Popen(process_obj)
  File "C:\Users\david.cortes\AppData\Local\Continuum\anaconda3\lib\multiprocessing\popen_spawn_win32.py", line 33, in __init__
    prep_data = spawn.get_preparation_data(process_obj._name)
  File "C:\Users\david.cortes\AppData\Local\Continuum\anaconda3\lib\multiprocessing\spawn.py", line 143, in get_preparation_data
    _check_not_importing_main()
  File "C:\Users\david.cortes\AppData\Local\Continuum\anaconda3\lib\multiprocessing\spawn.py", line 136, in _check_not_importing_main
    is not going to be frozen to produce an executable.''')
RuntimeError:
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.
Traceback (most recent call last):
  File "crashing_script.py", line 76, in <module>
    trace = pm.sample(random_seed = 1)
  File "C:\Users\david.cortes\AppData\Local\Continuum\anaconda3\lib\site-packages\pymc3\sampling.py", line 439, in sample
    trace = _mp_sample(**sample_args)
  File "C:\Users\david.cortes\AppData\Local\Continuum\anaconda3\lib\site-packages\pymc3\sampling.py", line 986, in _mp_sample
    chain, progressbar)
  File "C:\Users\david.cortes\AppData\Local\Continuum\anaconda3\lib\site-packages\pymc3\parallel_sampling.py", line 313, in __init__
    for chain, seed, start in zip(range(chains), seeds, start_points)
  File "C:\Users\david.cortes\AppData\Local\Continuum\anaconda3\lib\site-packages\pymc3\parallel_sampling.py", line 313, in <listcomp>
    for chain, seed, start in zip(range(chains), seeds, start_points)
  File "C:\Users\david.cortes\AppData\Local\Continuum\anaconda3\lib\site-packages\pymc3\parallel_sampling.py", line 204, in __init__
    self._process.start()
  File "C:\Users\david.cortes\AppData\Local\Continuum\anaconda3\lib\multiprocessing\process.py", line 105, in start
    self._popen = self._Popen(self)
  File "C:\Users\david.cortes\AppData\Local\Continuum\anaconda3\lib\multiprocessing\context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "C:\Users\david.cortes\AppData\Local\Continuum\anaconda3\lib\multiprocessing\context.py", line 322, in _Popen
    return Popen(process_obj)
  File "C:\Users\david.cortes\AppData\Local\Continuum\anaconda3\lib\multiprocessing\popen_spawn_win32.py", line 65, in __init__
    reduction.dump(process_obj, to_child)
  File "C:\Users\david.cortes\AppData\Local\Continuum\anaconda3\lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
BrokenPipeError: [Errno 32] Broken pipe

Running it on Jupyter, I get the following message in the console:

forrtl: error (200): program aborting due to control-C event
Image              PC                Routine            Line        Source
libifcoremd.dll    00007FFC2EA33B58  Unknown               Unknown  Unknown
KERNELBASE.dll     00007FFC886156FD  Unknown               Unknown  Unknown
KERNEL32.DLL       00007FFC8BBA3DC4  Unknown               Unknown  Unknown
ntdll.dll          00007FFC8BCF3691  Unknown               Unknown  Unknown
forrtl: error (200): program aborting due to control-C event
Image              PC                Routine            Line        Source
libifcoremd.dll    00007FFC2EA33B58  Unknown               Unknown  Unknown
KERNELBASE.dll     00007FFC886156FD  Unknown               Unknown  Unknown
KERNEL32.DLL       00007FFC8BBA3DC4  Unknown               Unknown  Unknown
ntdll.dll          00007FFC8BCF3691  Unknown               Unknown  Unknown
forrtl: error (200): program aborting due to control-C event
Image              PC                Routine            Line        Source
libifcoremd.dll    00007FFC2EA33B58  Unknown               Unknown  Unknown
KERNELBASE.dll     00007FFC886156FD  Unknown               Unknown  Unknown
KERNEL32.DLL       00007FFC8BBA3DC4  Unknown               Unknown  Unknown
ntdll.dll          00007FFC8BCF3691  Unknown               Unknown  Unknown
forrtl: error (200): program aborting due to control-C event
Image              PC                Routine            Line        Source
libifcoremd.dll    00007FFC2EA33B58  Unknown               Unknown  Unknown
KERNELBASE.dll     00007FFC886156FD  Unknown               Unknown  Unknown
KERNEL32.DLL       00007FFC8BBA3DC4  Unknown               Unknown  Unknown
ntdll.dll          00007FFC8BCF3691  Unknown               Unknown  Unknown
forrtl: error (200): program aborting due to control-C event
Image              PC                Routine            Line        Source
libifcoremd.dll    00007FFC2EA33B58  Unknown               Unknown  Unknown
KERNELBASE.dll     00007FFC886156FD  Unknown               Unknown  Unknown
KERNEL32.DLL       00007FFC8BBA3DC4  Unknown               Unknown  Unknown
ntdll.dll          00007FFC8BCF3691  Unknown               Unknown  Unknown
[I 10:24:27.464 NotebookApp] Interrupted...
[I 10:24:27.474 NotebookApp] Shutting down 2 kernels

If I run the same script on Linux, I instead get SamplingError: Bad initial energy, without crashing the notebook.

Setup: Python 3.6.5, running on Windows 10, theano 1.0.3, pymc3 3.6.

I can attach a script and data to reproduce it, but it's not a minimal example; I haven't been able to create one.

@twiecki
Member

twiecki commented Mar 11, 2019

There is a known problem with multiprocessing on Windows that causes exceptions to fail in this spectacular fashion.

@twiecki twiecki changed the title NUTS crashes IPython kernel with some models in Windows Remove default multiprocessing when on Windows? Mar 11, 2019
@david-cortes
Author

There is a known problem with multiprocessing on Windows that causes exceptions to fail in this spectacular fashion.

I guess that's one way of solving it, but if this only happens when exceptions are thrown, wouldn't it be simpler and better for end users to instead raise the exception after all processes have finished and their output has been collected?

@twiecki
Member

twiecki commented Mar 11, 2019

@david-cortes Unfortunately we never see the output in that case.

@lucianopaz
Contributor

@david-cortes, to address your particular problem, could you enclose the sample call in the following if statement?

if __name__ == "__main__":
    with model:
        pm.sample()

This could help you run the notebook on Windows (but you would still hit the Bad initial energy error you saw on Linux).

This recurrent issue was discussed in many places (1, 2, 3, 4, and I imagine in other threads too). The problem is that pm.sample tries to sample multiple chains in parallel using multiprocessing processes. On Linux, processes are forked from the main process, and the child processes get a copy of their parent's memory state. Windows does not support process forking, so it defaults to spawning new processes. These are completely new and independent processes, and they re-run the main module (on Jupyter, that would mean they try to run the same cell as the parent, or maybe up to the same cell, I'm not totally sure). The re-run hits the pm.sample call, which again tries to start a process pool, which would lead to an infinite creation of processes. That is why Windows requires the if __name__ == "__main__" guard, to ensure that only the original process creates offspring. There are also some other errors involving pickling which don't show up on Linux (unless the user explicitly chooses the spawn or forkserver start methods of multiprocessing).
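The fork-vs-spawn distinction described above can be inspected with nothing but the standard library. This is a minimal sketch (not pymc3 code) that lists the start methods the running platform offers and shows the guard idiom from this thread in context:

```python
import multiprocessing as mp

# On Linux the default start method is "fork"; Windows only offers "spawn".
# Under "spawn" every child process re-imports the main module, so any
# top-level pm.sample() call would run again in each child -- hence the
# __main__ guard.
available = mp.get_all_start_methods()
print(available)  # e.g. ['fork', 'spawn', 'forkserver'] on Linux

# The guard idiom, shown schematically (model/pm.sample as in this thread):
#
#     if __name__ == "__main__":
#         with model:
#             trace = pm.sample()
```

On Windows, `available` contains only `'spawn'`, which is why the guard is mandatory there but usually goes unnoticed on Linux.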

The reason we cannot catch these errors and handle them nicely is that they occur before the new process is fully initialized, which means the communication pipe between it and the root process is not set up yet. So the child process fails and raises an exception, which is printed to its stderr; the failed child cannot report the error to the root process because its communication pipe isn't ready; the root process just finds that its pipe is broken and raises a BrokenPipeError. We would love to handle these errors better, but they are deeply rooted in how Python's multiprocessing package works, so it is a bit out of our hands. For now, my recommendations are:

  1. Try the if __name__ == "__main__": guard.
  2. Use pm.sample(cores=1), which does not sample in parallel and avoids multiprocessing entirely.

@david-cortes
Author

Unfortunately, it seems adding if __name__ == "__main__" didn't make any difference. I guess the only "solution" is not to use multiprocessing.

@lucianopaz
Contributor

Could you share a minimal example notebook gist that reproduces your error?

@david-cortes
Author

I found an easy way to trigger an error: passing data with NaN always ends up throwing Bad initial energy. Here's a small snippet:

import numpy as np, pymc3 as pm
a = np.repeat(np.nan, 10)
with pm.Model() as model:
    a_prior_mu = pm.Normal('a_prior_mu', mu = 0, sd = 1)
    a_prior_sd = pm.Gamma('a_prior_sd', alpha = 1, beta = 1)
    a_pm = pm.Normal('a', mu = a_prior_mu, sd = a_prior_sd, shape = 10, observed = a)
    trace = pm.sample()

Put under if __name__ == "__main__":

import numpy as np, pymc3 as pm
a = np.repeat(np.nan, 10)
with pm.Model() as model:
    a_prior_mu = pm.Normal('a_prior_mu', mu = 0, sd = 1)
    a_prior_sd = pm.Gamma('a_prior_sd', alpha = 1, beta = 1)
    a_pm = pm.Normal('a', mu = a_prior_mu, sd = a_prior_sd, shape = 10, observed = a)
with model:
    if __name__ == "__main__":
        trace = pm.sample()

@ricardoV94
Member

Do you know if there are any plans to fix this issue?

@michaelosthege
Member

Since #4116 the situation has gotten much better. I'll close this issue since the cause and ways to debug it were explained above.
