Support different contexts in multiprocessing #63199
Comments
The new multiprocessing based on forkserver (bpo-8713) looks great, but it has two problems. The first: "set_start_method() should not be used more than once in the program." The documentation does not explain what the effect of calling it twice would be. Judging from the documentation, it should be possible to do

    start_method = get_start_method()
    if start_method is None:
        set_start_method('forkserver')

but this code shows the second problem: it always succeeds, with the (undocumented!) side effect that get_start_method sets the start method to the system default if it hasn't been set already.

I was just going to put together a patch for joblib (http://pythonhosted.org/joblib/) to set the start method to forkserver at import time. But in the current state of affairs, it would be impossible for the user to safely override the start method before importing joblib, because joblib can't figure out whether it has been set already without setting it.

The enclosed patch solves the problem by making the new functions more robust.

Unfortunately, I had to hack up the tests a bit, because they were violating the set_start_method contract. There is a test for the new set_start_method behavior, though, and all {fork,forkserver,spawn} tests pass on Linux. |
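A minimal sketch of the probe-before-set pattern being discussed, assuming the get_start_method(allow_none=True) API that is introduced later in this thread; the choice of forkserver as the library default is only an example:

```python
# Library-side initialisation sketch: choose a default start method only
# if the application has not fixed one already (Python 3.4+ API).
import multiprocessing as mp

if mp.get_start_method(allow_none=True) is None:
    # Nothing has been set yet, so the library may state its preference.
    # 'forkserver' is only available on Unix.
    mp.set_start_method('forkserver')
```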
Related question: is there any good reason that would prevent passing a custom start method argument to the Process and Pool constructors? This would allow libraries such as joblib to keep side effects minimal by impacting the default multiprocessing runtime as little as possible. |
Cleaned up the patch. |
With your patch, I think if you call get_start_method() without later calling set_start_method() then the helper process(es) will never be started. With the current code, popen.Popen() automatically starts the helper processes if they have not already been started. Also, set_start_method() can have the side-effect of starting helper process(es).

I do not really approve of new processes being started as a side-effect of importing a library. But it is reasonable for a library to want a specific start method unless the user demands otherwise. I will have to think this over.

BTW, the reason for discouraging the use of set_start_method() more than once is that some shared resources are created differently depending on what the current start method is. For instance, with the fork method, semaphores are created and then immediately unlinked. But with the other start methods we must not unlink the semaphore until we are finished with it (while being paranoid about cleanup).

Maybe it would be better to have separate contexts for each start method. That way joblib could use the forkserver context without interfering with the rest of the user's program.

    from multiprocessing import forkserver_context as mp
    l = mp.Lock()
    p = mp.Process(...)
    with mp.Pool() as pool:
        ...
|
In my patched version, the private popen.get_start_method gets a kwarg set_if_needed=True. popen.Popen calls it as before, so its behavior should not change, while the public get_start_method sets the kwarg to False. I realise now that this has the side effect that get_start_method's output changes once multiprocessing has first been used, but then that reflects how the library works. Maybe this should be documented.

As for the contexts, those would be great. |
My mistake. |
Yes, in general it would be great if libraries could customize the multiprocessing default options without impacting any of the module singletons. That also includes the ForkingPickler registry for custom reducers: currently it is a class attribute. It would be great to be able to scope custom reducer registration to a given multiprocessing.Pool or multiprocessing.Process instance.

Now, how to implement that kind of isolation: it could be done either by adding new constructor parameters or new public methods to the Process and Pool classes, to customize their behavior while sticking to the OOP paradigm if possible, or by using a context manager as you suggested. I am not sure which option is best. Prototyping both is probably the best way to get a feel for the tradeoffs. |
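To make the global-registry point concrete, a hedged sketch of registering a custom reducer, assuming the Python 3.4 layout where ForkingPickler lives in multiprocessing.reduction; the Point class and reduce_point function are invented for illustration:

```python
# ForkingPickler.register is a classmethod that updates a class-level
# table, so the reducer applies to every Process/Pool in the program
# rather than being scoped to a single instance.
from multiprocessing.reduction import ForkingPickler

class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

def reduce_point(p):
    # Rebuild the Point from its coordinates on the receiving side.
    return (Point, (p.x, p.y))

ForkingPickler.register(Point, reduce_point)
```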
I don't really see the benefit of a context manager over an argument. It's a power user feature anyway, and context managers (at least to me) signal cleanup actions, rather than construction options. |
By "context" I did not really mean a context manager. I just meant an object (possibly a singleton or module) which implements the same interface as multiprocessing. (However, it may be a good idea to also make it a context manager whose __enter__() method starts the helper processes, and whose __exit__() method shuts them down.) |
The process pool executor [1] from the concurrent futures API would be suitable for explicitly starting and stopping the helper process(es).

[1] http://docs.python.org/3.4/library/concurrent.futures.html#concurrent.futures.ProcessPoolExecutor

The point would be to have as little state as possible encoded in the multiprocessing module (and its singletons), and to move that state so that it is directly managed by multiprocessing Process and Pool class instances. That way libraries could customize the behavior (start_method, executable, the ForkingPickler reducers registry and so on) without mutating the state of the multiprocessing module singletons. |
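For illustration, a small sketch of the explicit start/stop behaviour mentioned above, using the concurrent.futures API as it exists; the square function is just a stand-in workload:

```python
# The executor starts worker processes as needed and shuts them down
# when the with-block exits, keeping the lifecycle explicit.
from concurrent.futures import ProcessPoolExecutor

def square(x):
    return x * x

if __name__ == '__main__':
    with ProcessPoolExecutor(max_workers=2) as executor:
        print(list(executor.map(square, range(5))))
    # The worker processes have been shut down at this point.
```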
There are lots of things that behave differently depending on the currently set start method: Lock(), Semaphore(), Queue(), Value(), ... It is not just when creating a Process or Pool that you need to know the start method. Passing a context or start_method argument to all of these constructors would be very awkward, which is why I think it is better to treat the context as an object with methods Process(), Pool(), Lock(), Semaphore(), etc. Unfortunately, I do not have time to work on this just now... |
Richard Oudkerk: thanks for the clarification, that makes sense. I don't have the time either in the coming month, maybe later. |
Ok. Do you (or jnoller?) have time to review my proposed patch, at least before 3.4 is released? I didn't see it in the release schedule, so it's probably not planned soon, but I wouldn't want the API to change *again* in 3.5. |
I'll review the patch. (According to http://www.python.org/dev/peps/pep-0429/ feature freeze is expected in late November, so there is not too much of rush.) |
Attached is a patch which allows the use of separate contexts. For example

    ctx = multiprocessing.get_context('forkserver')
    q = ctx.Queue()
    p = ctx.Process(target=foo, args=(q,))
    p.start()
    ...

Also, get_start_method(allow_none=True) will return None if the start method has not yet been fixed. |
BTW, the context objects are singletons. I could not see a sensible way to make ctx.Process be a picklable class (rather than a method) if there can be multiple instances of a context type. This means that the helper processes survive until the program closes down. |
I haven't read all of your patch yet, but does this mean a forkserver will be started regardless of whether it is later used? That would be a good thing, since starting the fork server after reading in large data sets would mean the fork server would hold on to large swaths of memory even when the data set is deallocated in the master process. |
No, it is started on demand. But since it is started using _posixsubprocess.fork_exec(), nothing is inherited from the main process. |
Ok, great. |
New changeset 72a5ac909c7a by Richard Oudkerk in branch 'default': |
Thanks, much better than my solution! |
Strange, I can't actually get it to work:

    >>> from multiprocessing import Pool, get_context
    >>> forkserver = get_context('forkserver')
    >>> Pool(context=forkserver)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: Pool() got an unexpected keyword argument 'context'
|
I also tried from multiprocessing.pool import Pool but that died with ImportError: cannot import name get_context |
I guess this should be clarified in the docs, but multiprocessing.pool.Pool is a *class* whose constructor takes a context argument, whereas multiprocessing.Pool() is a *bound method* of the default context. (In previous versions multiprocessing.Pool was a *function*.) The only reason you might need the context argument is if you have subclassed multiprocessing.pool.Pool.

    >>> from multiprocessing import pool, get_context
    >>> forkserver = get_context('forkserver')
    >>> p = forkserver.Pool()
    >>> q = pool.Pool(context=forkserver)
    >>> p, q
    (<multiprocessing.pool.Pool object at 0xb71f3eec>, <multiprocessing.pool.Pool object at 0xb6edb06c>)

I suppose we could just make the bound methods accept a context argument which (if not None) is used instead of self. |
New changeset b941a320601a by R David Murray in branch 'default': |
It seems like this issue has been fixed, so I set its status to closed. |