
Fix sleeping processes #295

Merged
8 commits merged into from Sep 21, 2018

Conversation


@ghost ghost commented Aug 17, 2018

The joblib package responsible for the MemmappingPool has been updated to
pick up any bug fixes that could produce sleeping processes in the
parallel sampler. The environment variable JOBLIB_START_METHOD has also
been removed, since it's no longer implemented by joblib.
However, if run_experiment is interrupted during the optimization steps,
sleeping processes are still produced. To fix the problem, the child
processes of the parallel sampler now ignore SIGINT so they're not killed
while holding a lock that is also acquired by the parent process, which
avoids a deadlock.
To make sure the child processes are terminated, the SIGINT handler in
the parent process is overridden to call the terminate and join
functions on the process pool.
The process (thread in TF) used in Plotter is already terminated by
registering the shutdown method with atexit, but one important missing
step was to drain the Queue that communicates with the worker process.
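
For reviewers skimming the diff, the shape of the fix is roughly the following (a minimal sketch with hypothetical names, not the exact garage code):

import multiprocessing
import signal

def _worker_init():
    # Children ignore SIGINT so they are never killed while holding a lock
    # that the parent also acquires (which would otherwise deadlock).
    signal.signal(signal.SIGINT, signal.SIG_IGN)

pool = multiprocessing.Pool(4, initializer=_worker_init)

def _on_sigint(signum, frame):
    # The parent is now the one responsible for tearing the pool down.
    pool.terminate()
    pool.join()
    raise KeyboardInterrupt

signal.signal(signal.SIGINT, _on_sigint)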

@ghost ghost self-assigned this Aug 17, 2018
@ghost ghost requested review from eric-heiden, CatherineSue and jonashen August 17, 2018 15:57
@ghost ghost self-requested a review as a code owner August 17, 2018 15:57
@ryanjulian
Member

@jonashen did you test this on your machine?

@jonashen
Member

Oops, good point. Sorry.

@@ -27,7 +27,7 @@ dependencies:
   - ipdb
   - ipywidgets
   - jsonmerge
-  - joblib==0.10.3
+  - joblib<0.13,>=0.12
Member

@jonashen jonashen Aug 17, 2018

(garage) garage (fix_sleeping_proc *) $ pip install joblib<0.13,>=0.12
bash: 0.13,: No such file or directory

Member

try pip install "joblib<0.13,>=0.12"

Member

it works

Member

Is there a reason we're specifying a version here? The installed version (0.12.2) is the same as what I get if I simply call pip install joblib.

Author

We're being more careful about the major changes introduced in joblib, so we're restricting updates of this library to minor changes only.

Member

The semantic versioning scheme used by many open source libraries:
XX.YY.ZZ

XX changes -- no backwards compatibility. literally anything can happen. whole packages can disappear.
YY changes -- backwards compatible for the same value of YY. features may be added but not removed. most increments to YY will be small changes but they may sometimes be backwards-incompatible, especially for libraries <1.0.
ZZ changes -- bug fixes/maintenance within a release only. generally no new features.

@@ -84,6 +84,9 @@ def shutdown(self):
         if not Plotter.enable:
             return
         if self._process and self._process.is_alive():
+            while not self._queue.empty():
+                self._queue.get()
+                self._queue.task_done()
Member

Traceback (most recent call last):
  File "/Users/jonathon/Documents/garage/garage/garage/plotter/plotter.py", line 89, in shutdown
    self._queue.task_done()
AttributeError: 'Queue' object has no attribute 'task_done'

Member

Use JoinableQueue instead of Queue. I think Python 3.6 cleaned up multiprocessing a little bit.
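
For reference, the JoinableQueue pattern looks like this (a minimal sketch, not garage code):

import multiprocessing

def worker(q):
    while True:
        item = q.get()
        if item is None:      # sentinel: stop the worker
            q.task_done()
            break
        print("processed", item)
        q.task_done()         # exactly one task_done() per get()

if __name__ == "__main__":
    q = multiprocessing.JoinableQueue()
    p = multiprocessing.Process(target=worker, args=(q,))
    p.start()
    for i in range(3):
        q.put(i)
    q.put(None)
    q.join()                  # unblocks once every get() has a matching task_done()
    p.join()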

Author

Done

@@ -2,9 +2,9 @@
import ast
Member

(garage) garage (fix_sleeping_proc *) $ python scripts/run_experiment.py
2018-08-17 09:45:25.883375 PDT | tensorboard data will be logged into:/Users/jonathon/Documents/garage/garage/data/experiment_2018_08_17_09_45_25_880884_PDT_95e7c
Traceback (most recent call last):
  File "scripts/run_experiment.py", line 201, in <module>
    run_experiment(sys.argv)
  File "scripts/run_experiment.py", line 187, in run_experiment
    data = pickle.loads(base64.b64decode(args.args_data))
  File "/anaconda2/envs/garage/lib/python3.6/base64.py", line 80, in b64decode
    s = _bytes_from_decode_data(s)
  File "/anaconda2/envs/garage/lib/python3.6/base64.py", line 46, in _bytes_from_decode_data
    "string, not %r" % s.__class__.__name__) from None
TypeError: argument should be a bytes-like object or ASCII string, not 'NoneType'

Member

You forgot an argument. run_experiment.py is usually called by garage.misc.instrument.run_experiment

@@ -98,6 +98,9 @@ def _start_worker(self):

     def shutdown(self):
         if self.worker_thread.is_alive():
+            while not self.queue.empty():
+                self.queue.get()
+                self.queue.task_done()
Member

Traceback (most recent call last):
  File "examples/tf/trpo_cartpole.py", line 25, in <module>
    algo.train()
  File "/Users/jonathon/Documents/garage/garage/garage/tf/algos/batch_polopt.py", line 146, in train
    self.shutdown_worker()
  File "/Users/jonathon/Documents/garage/garage/garage/tf/algos/batch_polopt.py", line 101, in shutdown_worker
    self.plotter.shutdown()
  File "/Users/jonathon/Documents/garage/garage/garage/tf/plotter/plotter.py", line 103, in shutdown
    self.queue.task_done()
  File "/anaconda2/envs/garage/lib/python3.6/queue.py", line 68, in task_done
    raise ValueError('task_done() called too many times')
ValueError: task_done() called too many times
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/anaconda2/envs/garage/lib/python3.6/queue.py", line 68, in task_done
    raise ValueError('task_done() called too many times')
ValueError: task_done() called too many times

Author

I fixed this by moving some previously inserted task_done calls to the right places: right after the tasks are completed in the worker thread.

Member

@jonashen jonashen left a comment

I wasn't sure how to test the sampler.

@ryanjulian
Member

ryanjulian commented Aug 17, 2018

Maybe a good exercise is to think about how to test this bug, since it's so severe and recurring.

You can use subprocess in Python to spawn a new process and send it signals (e.g. SIGINT). I think the standard library should also have the tools to crawl the spawned process tree and make sure everything terminates.

You can test the sampler by

  • Starting up "run_experiment" with subprocess and having it sample some environment.
  • Once it is started, you crawl the process tree to find the PIDs of all its child processes.
  • Then you send SIGINT to the "run_experiment" parent process and wait some time.
  • Finally, verify all child PIDs are gone.

Seeing as this is also a script to reliably reproduce this bug on all platforms, I propose that we write it and then just wrap it in a unit test so this doesn't happen again.
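
A rough sketch of that flow, with hypothetical command line and timings (psutil, suggested below, does the tree crawling):

import signal
import subprocess
import time

import psutil

# Launch run_experiment and give the sampler time to spawn its workers.
proc = subprocess.Popen(["python", "scripts/run_experiment.py"])
time.sleep(5)

# Crawl the process tree to record every child PID.
children = psutil.Process(proc.pid).children(recursive=True)

# Interrupt the parent and wait for it to shut down.
proc.send_signal(signal.SIGINT)
proc.wait(timeout=60)

# Verify every child is gone; wait_procs polls with a timeout.
gone, alive = psutil.wait_procs(children, timeout=10)
assert not alive, "orphaned processes: %s" % alive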

@ryanjulian
Member

There is a nice tutorial here about using process groups to jail fork trees created by subprocess. It might be of use.
https://pymotw.com/3/subprocess/

psutil will also let you crawl the process tree.

@ryanjulian
Member

Please create a test fixture that can reproduce (and then prove fixed) this bug before we submit.

    if args.seed is not None:
        set_seed(args.seed)

    sigint_hdlr = signal.getsignal(signal.SIGINT)

    def terminte_sampler(signum, frame):
Member

terminte --> terminate


    def terminte_sampler(signum, frame):
        parallel_sampler.terminate()
        parallel_sampler.join()
Member

what if parallel_sampler hangs after terminate? can we join() with a timeout and take more drastic termination steps if it hangs?

Author

Unfortunately the Pool class that MemmappingPool inherits from does not have a join method with a timeout parameter.
In the Pool API, close is the nice way to ask for all tasks to finish, while terminate just interrupts and finishes right away.
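
For context, the Pool API in question (a minimal sketch):

import multiprocessing

pool = multiprocessing.Pool(2)
pool.map(abs, range(4))

# Graceful path: stop accepting work, let outstanding tasks finish, wait.
pool.close()
pool.join()   # no timeout parameter, so this can block indefinitely

# Abrupt path: pool.terminate() stops the workers immediately instead,
# discarding outstanding work, and can then be followed by pool.join().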

Member

Ok SGTM. We can test this and see how robust it is before going to greater lengths.

Author

Actually, calling join here is wrong since the signal handler is called by the Python main thread, and join has an assert to verify that it's the parent process doing the joining.
Also, I found some handling that I did for this in a previous fix here. I will add the join call there instead of overriding the signal handler.


    signal.signal(signal.SIGINT, terminte_sampler)

    signal.pthread_sigmask(signal.SIG_BLOCK, [signal.SIGINT])
Member

can you make this a context manager?

# SIGINT unblocked here
with mask_signal(signal.SIGINT):
    # SIGINT blocked in here
    # do uninterruptible stuff

# SIGINT unblocked again out here
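
A minimal sketch of such a context manager (the name mask_signal is from the example above, not an existing API):

import contextlib
import signal

@contextlib.contextmanager
def mask_signal(*signals):
    # Block the given signals for the calling thread; pending signals are
    # delivered once the old mask is restored, rather than being lost.
    old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, signals)
    try:
        yield
    finally:
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)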

    if args.seed is not None:
        set_seed(args.seed)

    sigint_hdlr = signal.getsignal(signal.SIGINT)
Member

Please include a large block comment explaining why this is necessary and how it works.

It seems to me that it might be more appropriate to implement this inside stateful_pool if possible.

@ryanjulian
Copy link
Member

Can you verify that this bug https://bugs.python.org/issue8296 doesn't affect Python 3.5+? Otherwise we may need to change stateful_pool to only use _async mappers.

@ghost
Author

ghost commented Aug 21, 2018

Regarding the bug mentioned above, I ran an experiment with Python 3.6.6 based on this example:

import multiprocessing
import os
import time


def create():
    try:
        time.sleep(3)
    except KeyboardInterrupt:
        print("Exiting child gracefully", os.getpid())
        return
    return "Finishing child creation %s" % os.getpid()


def main():
    def cb(what):
        print("Callback:", what)

    print("Parent", os.getpid())
    pool = multiprocessing.Pool(2)
    try:
        for i in range(2):
            pool.apply_async(create, args=(), callback=cb)
        print("Initialization of child processes requested")
        pool.close()
        print("Child processes were requested to finish")
        pool.join()
        print("Child processes joined")
    except KeyboardInterrupt:
        print("Keyboard Interrupt caught")
        pool.terminate()
        print("Pool terminated")


if __name__ == "__main__":
    main()

Without interrupting the execution, the outcome is:

Parent 2578
Initialization of child processes requested
Child processes were requested to finish
Callback: Finishing child creation 2584
Callback: Finishing child creation 2585
Child processes joined

Interrupting the execution:

Parent 3070
Initialization of child processes requested
Child processes were requested to finish
^CKeyboard Interrupt caught
Exiting child gracefully 3072
Exiting child gracefully 3071
Pool terminated

Even though the child processes catch the KeyboardInterrupt, they become joinable and the parent process exits without leaving any zombie processes hanging around.

@ryanjulian
Member

Alright. To merge this we still need a test which reproduces the bug (before the change) and verifies it's gone after.

@ghost
Author

ghost commented Aug 21, 2018

I've been thinking of a way to perform the tests for this change. My algorithm would be the following:

  1. Create a subprocess to run an example file using the parallel sampler and plotter.
  2. Once the parallel sampler and plotter are up and running, assert the number of child processes based on the configuration of the example file.
  3. After a random time, send SIGINT to the subprocess in step one.
  4. Wait for all the child processes to finish, and assert that the number of child processes is zero.

To make sure we're trying this test at different execution points, we can run the above sequence for a certain number of iterations, but please let me know what could be improved.

One of the things I'm not so sure about is how to detect that the child processes under the parallel sampler are running. The way I'm currently doing this in my sandbox is reading stdout and catching the string "Populated", which is written once all the child processes have finished their initialization. Please let me know if you have a better idea for this.

@ryanjulian
Member

Some thoughts on this plan:

  • Tests should be deterministic so that they can reliably detect bugs when run once. Choosing a random time point to stop the process will automatically make your test flaky.
  • You should always test your desired outcome directly if at all possible. Our desired outcome is that there are 0 child processes for run_experiment after it receives a SIGINT. This is easiest to verify IMO if we enumerate all of the child processes when it launches and then verify they are all gone when it terminates. It also means that if you detect an orphan you can tell the user which one it is. The number and sources of child processes generated by run_experiment might change as the code evolves, but we will always want all of them to terminate on SIGINT.
  • There are actually several test cases here. Don't try to blur them into a giant integration test. At a minimum, we need to verify successful termination at several points in the run_experiment lifecycle (setup, sampling, optimizing, shutdown).
  • Don't use pipe hacks and other things which read stdout to figure out the process tree. We can change the fact that new processes print "Populated" at any time, and your test would fail silently. The plotter process doesn't even print this currently. Unix and Python provide you with ample tools to enumerate the process tree created by your subprocess, and I have provided pointers to them. Use the right tool for the job and you will only ever have to write this once.

@ghost
Author

ghost commented Aug 23, 2018

I added the test, but now I'm wondering about three things:

  1. Add a message to the user when a forced finalization occurs. Currently only a forced finalization is executed, since the join call for Pool does not have a timeout parameter that would wake up the process if the graceful finalization never returns. Maybe we could use a polling approach instead: check on each process in the pool rather than calling join, sleeping in relatively small intervals to avoid hogging the CPU.
  2. Add more malicious cases. One case I have in mind is sending SIGINT not to the launcher but to one of its children, and checking that all of them die. If you have more cases in mind, let me know and I will implement them.
  3. Right now the test covers Theano, but I will also add the same for TF to make sure there are no errors on that side.

@ghost
Author

ghost commented Aug 23, 2018

It seems my test is not behaving the same way as in my working area. As shown at the bottom of these logs, the test hangs.
It seems that SIGINT is propagated to all children even if the signal is masked when they're created, since there's a KeyboardInterrupt traceback for every process, but I'm not really sure in which call the test is getting stuck, since the call that waits for the child processes to die has a timeout.

@ghost ghost force-pushed the fix_sleeping_proc branch 3 times, most recently from 2da09c0 to f93c359 Compare August 23, 2018 21:27
@ghost ghost force-pushed the fix_sleeping_proc branch 4 times, most recently from c97ef82 to ad23f72 Compare August 28, 2018 02:45
@coveralls

coveralls commented Aug 28, 2018

Coverage Status

Coverage decreased (-0.0006%) to 61.868% when pulling 83a4813 on fix_sleeping_proc into de91deb on master.

@ryanjulian
Member

Does parallel_sampler.terminate() block? If not, I think the best option here is to busy-wait for all the children to disappear, then either join() once they have disappeared, or exit with an error if they don't disappear within some timeout.
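
Something like the following could implement that (a sketch; psutil.wait_procs already does the poll-with-timeout loop internally):

import psutil

def wait_for_children(timeout=15):
    children = psutil.Process().children(recursive=True)
    gone, alive = psutil.wait_procs(children, timeout=timeout)
    if alive:
        # Drastic fallback: report (or kill) whatever refused to die.
        raise RuntimeError("children still alive after %ss: %s" % (timeout, alive))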

@ghost
Author

ghost commented Aug 29, 2018

Okay, I have added the corresponding calls to shut down the parallel sampler and plotters in run_experiment. However, when run_experiment runs under test_sigint_theano, a bug appears in some tests: some sleeping processes remain and the corresponding user warning is produced. The output comes with a traceback that points to the fork of the process that remains:

2018-08-28 23:09:36.891154 PDT | Perplexity                 4.13273
2018-08-28 23:09:36.891207 PDT | StdReturn                 65.6678
2018-08-28 23:09:36.891261 PDT | dLoss                      0.0341225
2018-08-28 23:09:36.891313 PDT | -----------------------  -------------
Traceback (most recent call last):
  File "tests/fixtures/theano/trpo_cartpole_instrumented.py", line 38, in <module>
    plot=True,
  File "/home/aigonzal/ivanWorkspace/garage/garage/misc/instrument.py", line 524, in run_experiment
    command, shell=True, env=dict(os.environ, **env))
  File "/home/aigonzal/miniconda2/envs/garage/lib/python3.6/subprocess.py", line 269, in call
    return p.wait(timeout=timeout)
  File "/home/aigonzal/miniconda2/envs/garage/lib/python3.6/subprocess.py", line 1457, in wait
    (pid, sts) = self._try_wait(0)
  File "/home/aigonzal/miniconda2/envs/garage/lib/python3.6/subprocess.py", line 1404, in _try_wait
    (pid, sts) = os.waitpid(self.pid, wait_flags)
KeyboardInterrupt
The following processes didn't die after the shutdown of run_experiment:
{'status': 'sleeping', 'name': 'python', 'pid': 12288}
This is a sign of an unclean shutdown. Please reopen the following issue 
 with a detailed description of how the error was produced:
https://github.com/rlworkgroup/garage/issues/120
Traceback (most recent call last):
  File "/home/aigonzal/ivanWorkspace/garage/scripts/run_experiment.py", line 235, in <module>
    run_experiment(sys.argv)
  File "/home/aigonzal/ivanWorkspace/garage/scripts/run_experiment.py", line 185, in run_experiment
    method_call(variant_data)
  File "tests/fixtures/theano/trpo_cartpole_instrumented.py", line 26, in run_task
    algo.train()
  File "/home/aigonzal/ivanWorkspace/garage/tests/fixtures/theano/batch_polopt_instrumented.py", line 35, in train
    paths = self.sampler.obtain_samples(itr)
  File "/home/aigonzal/ivanWorkspace/garage/garage/algos/batch_polopt.py", line 28, in obtain_samples
    scope=self.algo.scope,
  File "/home/aigonzal/ivanWorkspace/garage/garage/sampler/parallel_sampler.py", line 130, in sample_paths
    show_prog_bar=True)
  File "/home/aigonzal/ivanWorkspace/garage/garage/sampler/stateful_pool.py", line 162, in run_collect
    manager = mp.Manager()
  File "/home/aigonzal/miniconda2/envs/garage/lib/python3.6/multiprocessing/context.py", line 56, in Manager
    m.start()
  File "/home/aigonzal/miniconda2/envs/garage/lib/python3.6/multiprocessing/managers.py", line 513, in start
    self._process.start()
  File "/home/aigonzal/miniconda2/envs/garage/lib/python3.6/multiprocessing/process.py", line 105, in start
    self._popen = self._Popen(self)
  File "/home/aigonzal/miniconda2/envs/garage/lib/python3.6/multiprocessing/context.py", line 277, in _Popen
    return Popen(process_obj)
  File "/home/aigonzal/miniconda2/envs/garage/lib/python3.6/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/home/aigonzal/miniconda2/envs/garage/lib/python3.6/multiprocessing/popen_fork.py", line 66, in _launch
    self.pid = os.fork()
KeyboardInterrupt

It seems the problem is related to the use of psutil.wait_procs in both run_experiment and test_sigint_theano at the same time, but I need to look further into this.
This bug does not happen when a launcher is used.


 class Plotter:

     # Static variable used to disable the plotter
     enable = True

-    def __init__(self):
+    def __init__(self, standalone=False):
         __plotters__.append(self)
Member

don't use a dunder (__foo__). Those are reserved for Python internals. __plotters should be fine.

@@ -21,13 +21,16 @@ class Op(Enum):

 Message = namedtuple("Message", ["op", "args", "kwargs"])

+__plotters__ = []
Member

This can be a class member of Plotter rather than package-global.

See https://stackoverflow.com/a/12102666 for a useful pattern.
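
For example (a minimal sketch of the class-member registry; close and shutdown_all are hypothetical stand-ins):

class Plotter:
    # Class-level registry: shared by all instances, but namespaced under
    # Plotter rather than living as a package global.
    _plotters = []

    def __init__(self):
        Plotter._plotters.append(self)

    def close(self):
        pass  # the real implementation would shut down the worker process

    @classmethod
    def shutdown_all(cls):
        for plotter in cls._plotters:
            plotter.close()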

@@ -186,5 +202,34 @@ def run_experiment(argv):
     logger.pop_prefix()


+def child_proc_shutdown(shutdown_sampler=False):
Member

If you change the Plotter API to use close() (or the sampler API to use shutdown()) then this could be called as

def child_proc_shutdown(children):
    run_exp_proc = psutil.Process()
    alive = run_exp_proc.children(recursive=True)
    for c in children:
        c.shutdown()
    # etc...

# example at the callsite
child_proc_shutdown(__plotters__ + [parallel_sampler])

@@ -2,7 +2,7 @@

 # Python packages considered local to the project, for the purposes of import
 # order checking. Comma-delimited.
-garage_packages="garage,sandbox,examples,tests"
+garage_packages="tests,garage,sandbox,examples,contrib"
Member

contrib was removed for a reason: it no longer exists.

setup.cfg Outdated
@@ -4,7 +4,7 @@
 # style, this rule is ignored.
 ignore = W503
 import-order-style = google
-application-import-names = sandbox,garage,examples,tests
+application-import-names = tests,sandbox,garage,examples,contrib
Member

Remove contrib

@@ -0,0 +1,4 @@
+from tests.fixtures.theano.batch_polopt_instrumented import \
Member

PEP8 says to use () to split lines instead of \
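
i.e. (sketch):

# backslash continuation (discouraged):
from tests.fixtures.theano.batch_polopt_instrumented import \
    InstrumentedBatchPolopt

# parenthesized form (preferred by PEP8):
from tests.fixtures.theano.batch_polopt_instrumented import (
    InstrumentedBatchPolopt)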

@@ -0,0 +1,4 @@
+from tests.fixtures.theano.batch_polopt_instrumented import \
+    InstrumentedBatchPolopt
+from tests.fixtures.theano.npo_instrumented import InstrumentedNPO
Member

Package names should be instrumented_batchpolopt, instrumented_npo and instrumented_trpo to match class names.

@@ -0,0 +1,81 @@
+from enum import Enum
Member

Please put this file in tests/integration_tests

@@ -0,0 +1,81 @@
+from enum import Enum
+from multiprocessing.connection import Listener
+import os
Member

Can you add an equivalent for TensorFlow?

class TestSigInt(unittest.TestCase):
    def test_sigint(self):
        """Interrupt the experiment in different stages of its lifecycle."""
        for stage in list(ExpLifecycle):
Member

use a parameterized test instead of a for loop: https://nose2.readthedocs.io/en/latest/params.html
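
For example (a self-contained sketch; the inline ExpLifecycle enum stands in for the fixture's):

import enum
import unittest

from nose2.tools import params

class ExpLifecycle(enum.Enum):
    START = 1
    SAMPLE = 2
    SHUTDOWN = 3

class TestSigInt(unittest.TestCase):
    @params(*ExpLifecycle)
    def test_sigint(self, stage):
        # Each lifecycle stage becomes its own named test case, so one
        # failing stage doesn't hide the others.
        self.assertIsInstance(stage, ExpLifecycle)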

            os.kill(child.pid, signal.SIGINT)

        if not clean_exit:
            raise AssertionError(colorize(error_msg, "red"))
Member

Don't raise exceptions in tests. Just use assert.

@codecov

codecov bot commented Aug 30, 2018

Codecov Report

Merging #295 into master will decrease coverage by <.01%.
The diff coverage is 49.23%.


@@            Coverage Diff             @@
##           master     #295      +/-   ##
==========================================
- Coverage   61.41%   61.41%   -0.01%     
==========================================
  Files         213      213              
  Lines       14312    14328      +16     
==========================================
+ Hits         8790     8799       +9     
- Misses       5522     5529       +7
Impacted Files Coverage Δ
garage/misc/instrument.py 29.01% <ø> (+0.04%) ⬆️
garage/tf/algos/ddpg.py 8.61% <0%> (ø) ⬆️
garage/tf/algos/batch_polopt.py 90.42% <100%> (ø) ⬆️
garage/algos/cem.py 94.54% <100%> (ø) ⬆️
garage/theano/algos/ddpg.py 93.87% <100%> (ø) ⬆️
garage/algos/cma_es.py 93.97% <100%> (ø) ⬆️
garage/envs/box2d/parser/xml_types.py 84% <100%> (ø) ⬆️
garage/tf/envs/parallel_vec_env_executor.py 14.28% <15.38%> (ø) ⬆️
garage/sampler/parallel_sampler.py 71.26% <50%> (ø) ⬆️
garage/misc/logger.py 43.01% <50%> (+0.21%) ⬆️
... and 5 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 0d37327...d3adb1a.

Angel Gonzalez added 7 commits September 18, 2018 13:19
The joblib package responsible for the MemmappingPool has been updated to
pick up any bug fixes that could produce sleeping processes in the
parallel sampler. The environment variable JOBLIB_START_METHOD has also
been removed, since it's no longer implemented by joblib.
However, if run_experiment is interrupted during the optimization steps,
sleeping processes are still produced. To fix the problem, the child
processes of the parallel sampler now ignore SIGINT so they're not killed
while holding a lock that is also acquired by the parent process, which
avoids a deadlock.
To make sure the child processes are terminated, the SIGINT handler in
the parent process is overridden to call the terminate and join
functions on the process pool.
The process (thread in TF) used in Plotter is already terminated by
registering the shutdown method with atexit, but one important missing
step was to drain the Queue that communicates with the worker process.
The class BatchPolopt has been overridden as BatchPoloptCallback to
notify the test of the different stages in the experiment life cycle so
it can be interrupted with SIGINT.
The test makes sure that zero child processes remain after the SIGINT
is sent, or it throws an assertion error listing those processes that
didn't die.
Also the context manager MasksSignals has been created to handle the
masking of SIGINT in parallel_sampler.
Also, some of the codacy issues were solved, as well as some legacy
pylint and flake8 issues.
All plotters are appended to a static list, so they are easily
reachable from run_experiment to call shutdown on them.
Also, terminate was replaced by close to shut down the parallel sampler,
since terminate calls join and may block the shutdown in run_experiment.
In order to check that all processes died after a timeout, a loop to
poll for alive processes was implemented.
A warning message for processes that remain after shutdown is printed so
users of garage can reopen the corresponding issue.
Other changes include renaming the files of the instrumented policy
optimizers, as well as renaming the shutdown method in plotters to close
in order to close all processes under run_experiment with the same
method.
Otherwise, comparing two enumerations assigned to variables does not
work.
@codecov

codecov bot commented Sep 18, 2018

Codecov Report

Merging #295 into master will decrease coverage by <.01%.
The diff coverage is 49.23%.


@@            Coverage Diff             @@
##           master     #295      +/-   ##
==========================================
- Coverage   61.32%   61.31%   -0.01%     
==========================================
  Files         213      213              
  Lines       14316    14332      +16     
==========================================
+ Hits         8779     8788       +9     
- Misses       5537     5544       +7
Impacted Files Coverage Δ
garage/misc/instrument.py 29.01% <ø> (+0.04%) ⬆️
garage/tf/algos/ddpg.py 8.61% <0%> (ø) ⬆️
garage/tf/algos/batch_polopt.py 90.42% <100%> (ø) ⬆️
garage/algos/cem.py 94.54% <100%> (ø) ⬆️
garage/theano/algos/ddpg.py 93.87% <100%> (ø) ⬆️
garage/algos/cma_es.py 93.97% <100%> (ø) ⬆️
garage/envs/box2d/parser/xml_types.py 84% <100%> (ø) ⬆️
garage/tf/envs/parallel_vec_env_executor.py 14.28% <15.38%> (ø) ⬆️
garage/sampler/parallel_sampler.py 71.26% <50%> (ø) ⬆️
garage/misc/logger.py 43.01% <50%> (+0.21%) ⬆️
... and 5 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update de91deb...83a4813.

A process known as the semaphore tracker is spawned from the
run_experiment process, but we cannot stop this process since it ignores
SIGINT and SIGTERM and we don't have access to it. Therefore, it's
removed from the list of children to wait for in both run_experiment and
test_sigint.
Another process that is spawned from run_experiment is the Manager,
which owns multiprocessing objects such as RLocks and counters used
during the run_collect method in the StatefulPool class. This process
was also making the warning message appear and the test fail.
However, the manager has a shutdown method to terminate the process, so
we can verify the termination of the process.
@ghost ghost merged commit d5b9e63 into master Sep 21, 2018
@ryanjulian
Member

NICE
