Fix sleeping processes #295

ghost · 2018-08-17T15:57:47Z

The joblib package, which provides MemmappingPool, has been updated to
pick up fixes for any bugs that could produce the sleeping processes in
the parallel sampler. The environment variable JOBLIB_START_METHOD has
also been removed, since joblib no longer implements it.
However, if run_experiment is interrupted during the optimization steps,
the sleeping processes are still produced. To fix the problem, the child
processes of the parallel sampler ignore SIGINT so they're not killed
while holding a lock that is also acquired by the parent process,
which avoids a deadlock.
To make sure the child processes are terminated, the SIGINT handler in
the parent process is overridden to call the terminate and join
functions on the process pool.
The process (thread in TF) used in Plotter is terminated by registering
its shutdown method with atexit, but one important missing step was to
drain the Queue that interacts with the worker process.

ryanjulian · 2018-08-17T16:28:25Z

@jonashen did you test this on your machine?

jonashen · 2018-08-17T16:42:57Z

Oops, good point. Sorry.

jonashen · 2018-08-17T16:43:32Z

environment.yml

@@ -27,7 +27,7 @@ dependencies:
        - ipdb
        - ipywidgets
        - jsonmerge
-        - joblib==0.10.3
+        - joblib<0.13,>=0.12


(garage) garage (fix_sleeping_proc *) $ pip install joblib<0.13,>=0.12
bash: 0.13,: No such file or directory

try pip install "joblib<0.13,>=0.12"

Is there a reason we're specifying a version here? The installed version (0.12.2) is equivalent to if I simply called pip install joblib.

We are being more careful with the major changes introduced in joblib, so we are restricting the updates of this library to only minor changes.

The semantic versioning scheme used by many open source libraries:
XX.YY.ZZ

XX changes -- no backwards compatibility. literally anything can happen. whole packages can disappear.
YY changes -- backwards compatible for the same value of XX. features may be added but not removed. most increments to YY will be small changes but they may sometimes be backwards-incompatible, especially for libraries <1.0.
ZZ changes -- bug fixes/maintenance within a release only. generally no new features.
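
To make the effect of the pin concrete, here is a minimal sketch of what `joblib<0.13,>=0.12` admits. The helper names `parse_version` and `satisfies` are made up for this illustration, and the comparison is a plain tuple comparison that ignores pre-release tags and the rest of PEP 440, so it is not how pip itself evaluates specifiers.

```python
def parse_version(text):
    """Turn a dotted version such as '0.12.2' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def satisfies(version, lower, upper):
    """Return True if lower <= version < upper, compared component-wise.

    Simplified illustration only: real tools follow PEP 440.
    """
    return parse_version(lower) <= parse_version(version) < parse_version(upper)


if __name__ == "__main__":
    # Candidate versions checked against the range >=0.12,<0.13.
    for candidate in ("0.10.3", "0.12.2", "0.13.0"):
        print(candidate, satisfies(candidate, "0.12", "0.13"))
```

Under this reading the pin only lets ZZ move: 0.12.2 is accepted, while both the old 0.10.3 and a future 0.13.0 are rejected.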

jonashen · 2018-08-17T16:44:20Z

garage/plotter/plotter.py

@@ -84,6 +84,9 @@ def shutdown(self):
        if not Plotter.enable:
            return
        if self._process and self._process.is_alive():
+            while not self._queue.empty():
+                self._queue.get()
+                self._queue.task_done()


Traceback (most recent call last):
  File "/Users/jonathon/Documents/garage/garage/garage/plotter/plotter.py", line 89, in shutdown
    self._queue.task_done()
AttributeError: 'Queue' object has no attribute 'task_done'

Use JoinableQueue instead of Queue. I think Python 3.6 cleaned up multiprocessing a little bit.
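
For reference, a minimal sketch of draining a `multiprocessing.JoinableQueue`, which (unlike `multiprocessing.Queue`) has `task_done()`. The `drain` helper and its `timeout` parameter are hypothetical and not taken from the PR. It uses `get(timeout=...)` rather than the `empty()` check in the diff, because `empty()` is documented as unreliable for multiprocessing queues.

```python
import multiprocessing as mp
import queue


def drain(q, timeout=0.1):
    """Remove every pending item from a JoinableQueue.

    Each successful get() is paired with exactly one task_done(), so a
    later q.join() does not block on items nobody will process.
    Returns the number of items that were discarded.
    """
    drained = 0
    while True:
        try:
            q.get(timeout=timeout)
        except queue.Empty:
            # Nothing arrived within `timeout` seconds: treat as drained.
            return drained
        q.task_done()
        drained += 1


if __name__ == "__main__":
    q = mp.JoinableQueue()
    for i in range(3):
        q.put(i)
    print("drained", drain(q, timeout=0.5))
    q.join()  # returns immediately: every item was marked done
```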

jonashen · 2018-08-17T16:45:50Z

scripts/run_experiment.py

@@ -2,9 +2,9 @@
 import ast


(garage) garage (fix_sleeping_proc *) $ python scripts/run_experiment.py
2018-08-17 09:45:25.883375 PDT | tensorboard data will be logged into:/Users/jonathon/Documents/garage/garage/data/experiment_2018_08_17_09_45_25_880884_PDT_95e7c
Traceback (most recent call last):
  File "scripts/run_experiment.py", line 201, in <module>
    run_experiment(sys.argv)
  File "scripts/run_experiment.py", line 187, in run_experiment
    data = pickle.loads(base64.b64decode(args.args_data))
  File "/anaconda2/envs/garage/lib/python3.6/base64.py", line 80, in b64decode
    s = _bytes_from_decode_data(s)
  File "/anaconda2/envs/garage/lib/python3.6/base64.py", line 46, in _bytes_from_decode_data
    "string, not %r" % s.__class__.__name__) from None
TypeError: argument should be a bytes-like object or ASCII string, not 'NoneType'

You forgot an argument. run_experiment.py is usually called by garage.misc.instrument.run_experiment

jonashen · 2018-08-17T16:46:30Z

garage/tf/plotter/plotter.py

@@ -98,6 +98,9 @@ def _start_worker(self):

    def shutdown(self):
        if self.worker_thread.is_alive():
+            while not self.queue.empty():
+                self.queue.get()
+                self.queue.task_done()


Traceback (most recent call last):
  File "examples/tf/trpo_cartpole.py", line 25, in <module>
    algo.train()
  File "/Users/jonathon/Documents/garage/garage/garage/tf/algos/batch_polopt.py", line 146, in train
    self.shutdown_worker()
  File "/Users/jonathon/Documents/garage/garage/garage/tf/algos/batch_polopt.py", line 101, in shutdown_worker
    self.plotter.shutdown()
  File "/Users/jonathon/Documents/garage/garage/garage/tf/plotter/plotter.py", line 103, in shutdown
    self.queue.task_done()
  File "/anaconda2/envs/garage/lib/python3.6/queue.py", line 68, in task_done
    raise ValueError('task_done() called too many times')
ValueError: task_done() called too many times
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/anaconda2/envs/garage/lib/python3.6/queue.py", line 68, in task_done
    raise ValueError('task_done() called too many times')
ValueError: task_done() called too many times

I fixed this by moving some previously inserted task_done calls to the right places: right after the tasks are completed in the worker thread.
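
A minimal sketch of the pairing rule behind that error, using the standard library `queue.Queue` with a worker thread. The `worker` and `run` functions are illustrative stand-ins, not the plotter code: `task_done()` is called exactly once per `get()`, in the worker, right after the item has been handled.

```python
import queue
import threading


def worker(q, results):
    """Consume items until a None sentinel arrives."""
    while True:
        item = q.get()
        try:
            if item is None:
                return
            results.append(item * 2)
        finally:
            # Exactly one task_done() per get(), sentinel included.
            q.task_done()


def run(items):
    q = queue.Queue()
    results = []
    thread = threading.Thread(target=worker, args=(q, results))
    thread.start()
    for item in items:
        q.put(item)
    q.put(None)  # ask the worker to stop
    q.join()     # returns once every put() has a matching task_done()
    thread.join()
    return results


if __name__ == "__main__":
    print(run([1, 2, 3]))
```

Calling `task_done()` more often than `get()` returned an item is what raises `ValueError: task_done() called too many times`.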

jonashen

I wasn't sure how to test the sampler.

ryanjulian · 2018-08-17T16:55:05Z

Maybe a good exercise is to think about how to test this bug, since it's so severe and recurring.

You can use subprocess in Python to spawn a new process and send it signals (e.g. SIGINT). I think the standard library should also have the tools to crawl the spawned process tree and make sure everything terminates.

You can test the sampler by

starting up "run_experiment" with subprocess and having it sample some environment.
Once it is started, you crawl the process tree to find the PIDs of all its child processes.
Then you send SIGINT to the "run_experiment" parent process and wait some time.
Finally, verify all child PIDs are gone

Seeing as this is also a script to reliably reproduce this bug on all platforms, I propose that we write it and then just wrap it in a unit test so this doesn't happen again.
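
The steps above can be sketched roughly as follows. This is only an outline under several assumptions: it swaps the PID crawl for a process group (`start_new_session=True` plus `os.killpg(pgid, 0)`) so that it needs nothing outside the standard library, it is POSIX-only, the fixed `startup_delay` is a simplification, and `EXPERIMENT` is a hypothetical stand-in script rather than the real run_experiment. With psutil, the group check could be replaced by enumerating `children(recursive=True)` before sending the signal.

```python
import os
import signal
import subprocess
import sys
import time

# Hypothetical stand-in for run_experiment: it spawns one worker and
# cleans it up when interrupted. SIGINT is blocked while spawning so
# the worker handle can never be lost to an early interrupt.
EXPERIMENT = """
import signal, subprocess, sys
signal.signal(signal.SIGINT, signal.default_int_handler)
worker = None
try:
    signal.pthread_sigmask(signal.SIG_BLOCK, [signal.SIGINT])
    worker = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"])
    signal.pthread_sigmask(signal.SIG_UNBLOCK, [signal.SIGINT])
    worker.wait()
except KeyboardInterrupt:
    pass
finally:
    if worker is not None:
        worker.terminate()
        worker.wait()
"""


def group_is_gone(pgid, timeout=5.0, interval=0.05):
    """Poll until no process is left in group `pgid`, or time out."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            os.killpg(pgid, 0)  # signal 0 only checks for existence
        except ProcessLookupError:
            return True
        time.sleep(interval)
    return False


def interrupt_and_check(script, startup_delay=1.0, timeout=10.0):
    """Run `script`, send SIGINT to the parent only, and report
    whether its whole process group disappeared afterwards."""
    proc = subprocess.Popen(
        [sys.executable, "-c", script], start_new_session=True)
    pgid = proc.pid  # session leader: its PID is also the group ID
    try:
        time.sleep(startup_delay)
        os.kill(proc.pid, signal.SIGINT)
        proc.wait(timeout=timeout)
        return group_is_gone(pgid, timeout=timeout)
    finally:
        # Never leave stragglers behind, whatever the outcome was.
        try:
            os.killpg(pgid, signal.SIGKILL)
        except ProcessLookupError:
            pass
        proc.wait()


if __name__ == "__main__":
    print("clean shutdown:", interrupt_and_check(EXPERIMENT))
```

Wrapping `interrupt_and_check` in a unit test would then amount to asserting that it returns True for each lifecycle stage of interest.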

ryanjulian · 2018-08-17T19:12:18Z

There is a nice tutorial here about using process groups to jail fork trees created by subprocess. It might be of use.
https://pymotw.com/3/subprocess/

psutil will also let you crawl the process tree.

ryanjulian · 2018-08-20T17:46:03Z

Please create a test fixture that can reproduce (and then prove fixed) this bug before we submit.

ryanjulian · 2018-08-20T21:20:30Z

scripts/run_experiment.py

    if args.seed is not None:
        set_seed(args.seed)

+    sigint_hdlr = signal.getsignal(signal.SIGINT)
+
+    def terminte_sampler(signum, frame):


terminte --> terminate

ryanjulian · 2018-08-20T21:22:50Z

scripts/run_experiment.py

+
+    def terminte_sampler(signum, frame):
+        parallel_sampler.terminate()
+        parallel_sampler.join()


what if parallel_sampler hangs after terminate? can we join() with a timeout and take more drastic termination steps if it hangs?

Unfortunately the Pool class that MemmappingPool inherits from does not have a join method with timeout parameter.
From the Pool API, close is the nice way to ask to finish all tasks, and terminate just interrupts and finishes right away.

Ok SGTM. We can test this and see how robust it is before going to greater lengths.

Actually, calling join here is wrong: the signal handler is called by the Python main thread, and join has an assert call to make sure that it is the parent process that is trying to join.
Also, I found some handling that I did for this in a previous fix here. I will add the join call there instead of overriding the signal handler.

ryanjulian · 2018-08-20T21:30:54Z

scripts/run_experiment.py

+
+    signal.signal(signal.SIGINT, terminte_sampler)
+
+    signal.pthread_sigmask(signal.SIG_BLOCK, [signal.SIGINT])


can you make this a context manager?

# SIGINT unblocked here
with mask_signal(signal.SIGINT):
    # SIGINT blocked in here
    # do uninterruptible stuff
# SIGINT unblocked again out here
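
One possible shape for such a context manager, as a minimal sketch: it assumes a Unix platform (where `signal.pthread_sigmask` is available) and reuses the name `mask_signal` from the comment above. It is not the implementation that ended up in the PR.

```python
import contextlib
import signal


@contextlib.contextmanager
def mask_signal(*signums):
    """Block the given signals in the calling thread for the duration
    of the with-block, then restore the previous signal mask."""
    old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, signums)
    try:
        yield
    finally:
        # Restoring the saved mask (rather than unblocking) keeps any
        # signal that was already blocked before entry blocked.
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)


def currently_blocked():
    """Return the set of signals blocked in the calling thread."""
    # Blocking an empty set changes nothing and returns the mask.
    return signal.pthread_sigmask(signal.SIG_BLOCK, [])


if __name__ == "__main__":
    with mask_signal(signal.SIGINT):
        print("inside:", signal.SIGINT in currently_blocked())
    print("outside:", signal.SIGINT in currently_blocked())
```

A SIGINT that arrives inside the block stays pending and is delivered when the old mask is restored, so the interrupt is deferred rather than dropped.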

ryanjulian · 2018-08-20T21:39:46Z

scripts/run_experiment.py

    if args.seed is not None:
        set_seed(args.seed)

+    sigint_hdlr = signal.getsignal(signal.SIGINT)


Please include a large block comment explaining why this is necessary and how it works.

It seems to me that it might be more appropriate to implement this inside stateful_pool if possible.

ryanjulian · 2018-08-20T21:41:49Z

Can you verify that this bug https://bugs.python.org/issue8296 doesn't affect Python 3.5+? Otherwise we may need to change stateful_pool to only use _async mappers.

ghost · 2018-08-21T16:42:22Z

Regarding the bug mentioned above, I did an experiment with Python 3.6.6 based on this example:

import multiprocessing
import time
import os

def create():
    try:
        time.sleep(3)
    except KeyboardInterrupt:
        print("Exiting child gracefully", os.getpid())
        return
    return "Finishing child creation %s"%(os.getpid())

def main():
    def cb(what):
        print("Callback:", what)

    print("Parent", os.getpid())
    pool = multiprocessing.Pool(2)
    try:
        for i in range(2):
            pool.apply_async(create, args=(), callback=cb)
        print("Initialization of child processes requested")
        pool.close()
        print("Child processes were requested to finish")
        pool.join()
        print("Child processes joined")
    except KeyboardInterrupt:
        print("Keyboard Interrupt caught")
        pool.terminate()
        print("Pool terminated")

if __name__ == "__main__":
    main()

Without interrupting the execution, the outcome is:

Parent 2578
Initialization of child processes requested
Child processes were requested to finish
Callback: Finishing child creation 2584
Callback: Finishing child creation 2585
Child processes joined

Interrupting the execution:

Parent 3070
Initialization of child processes requested
Child processes were requested to finish
^CKeyboard Interrupt caught
Exiting child gracefully 3072
Exiting child gracefully 3071
Pool terminated

Even though the child processes catch the KeyboardInterrupt, they become joinable and the parent process exits without leaving any zombie processes hanging around.

ryanjulian · 2018-08-21T16:55:03Z

Alright. To merge this, we still need a test which reproduces the bug (before the change) and verifies it's gone after.

ghost · 2018-08-21T17:02:29Z

I've been thinking of a way to perform the tests for this change. My algorithm would be the following:

Create a subprocess to run an example file using parallel sampler and plotter.
Once the parallel sampler and plotter are up and running, assert the number of child processes based on the configuration of the example file.
After a random time, send SIGINT to the subprocess in step one.
Wait for all the child processes to finish, and assert that the number of child processes is zero.

To make sure we're trying this test in different execution points, we can run the above sequence for a certain number of iterations, but please let me know what could be improved.

One of the things I'm not so sure about is how to detect that the child processes under the parallel sampler are running. How I'm currently doing this in my sandbox is reading stdout and catching the string "Populated", which is written once all the child processes finished their initialization. Please let me know if you have a better idea to do this.

ryanjulian · 2018-08-21T17:18:37Z

Some thoughts on this plan:

Tests should be deterministic so that they can reliably detect bugs when run once. Choosing a random time point to stop the process will automatically make your test flaky.
You should always test your desired outcome directly if at all possible. Our desired outcome is that there are 0 child processes for run_experiment after it receives a SIGINT. This is easiest to verify IMO if we enumerate all of the child processes when it launches and then verify they are all gone when it terminates. It also means if you detect an orphan you can tell the user which it is. The number and sources of child processes generated by run_experiment might change as the code evolves, but we will always want all of them to terminate on SIGINT.
There are actually several test cases here. Don't try to blur them with a giant integration test. At a minimum, we need to verify successful termination at several points in the run_experiment lifecycle (setup, sampling, optimizing, shutdown).
Don't use pipe hacks and other things which read stdout to figure out the process tree. We can change the fact that new processes print "populated" at any time and your test would fail silently. The plotter process doesn't even print this currently. Unix and Python provide you with ample tools to enumerate the process tree created by your subprocess, and I have provided pointers to them. Use the right tool for the job and you will only ever have to write this once.

ghost · 2018-08-23T03:55:46Z

I added the test, but now I'm wondering about three things:

Add a message to the user when a forced finalization is produced. Currently only a forced finalization is executed, since the join call for Pool does not have a timeout parameter that would wake up the process if the nice finalization never returns. Maybe we could use a polling approach instead of calling join, checking on each process in the pool and sleeping in relatively small intervals to avoid hogging the CPU.
Add more malicious cases. One case I have in mind is sending SIGINT not to the launcher but to any of its children, and checking that all of them die. If you have more cases in mind, let me know and I will implement them.
Right now I added the test considering Theano, but I will also add the same for TF to make sure there are no errors on that side.

ghost · 2018-08-23T17:30:52Z

It seems my test is not behaving the same way as in my working area. As shown at the bottom of these logs, the test hangs.
It seems that SIGINT is propagated to all children even if the signal is masked when they're created, since there's a KeyboardInterrupt traceback for all processes, but I am not really sure in which call of the test it's getting stuck, since the call that waits for the children processes to die has a timeout.

coveralls · 2018-08-28T02:46:14Z

Coverage decreased (-0.0006%) to 61.868% when pulling 83a4813 on fix_sleeping_proc into de91deb on master.

ryanjulian · 2018-08-28T16:31:25Z

Does parallel_sampler.terminate() block? If not, I think the best option here is to busy wait for all the children to disappear, then either join() once they have disappeared, or exit with an error if they don't disappear within some timeout.
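
A minimal sketch of that wait-with-timeout idea. The helper name `wait_for_exit` and its parameters are hypothetical; it accepts any objects that expose `is_alive()`, such as `multiprocessing.Process` instances or the list returned by `multiprocessing.active_children()`, and it sleeps between checks so that it does not hog the CPU.

```python
import multiprocessing as mp
import time


def wait_for_exit(procs, timeout=5.0, interval=0.05):
    """Poll until every process has exited or `timeout` seconds pass.

    Returns the list of processes that are still alive, which is empty
    on a clean shutdown.
    """
    deadline = time.monotonic() + timeout
    alive = [p for p in procs if p.is_alive()]
    while alive and time.monotonic() < deadline:
        time.sleep(interval)
        alive = [p for p in alive if p.is_alive()]
    return alive


if __name__ == "__main__":
    child = mp.Process(target=time.sleep, args=(0.2,))
    child.start()
    survivors = wait_for_exit(mp.active_children(), timeout=5.0)
    if survivors:
        # Unclean shutdown: report it, then take more drastic steps.
        print("still alive:", [p.pid for p in survivors])
        for p in survivors:
            p.terminate()
    child.join()  # safe to join once nothing is left running
```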

ghost · 2018-08-29T06:36:20Z

Okay, I have added the corresponding calls to shutdown the parallel sampler and plotters in run_experiment. However, when run_experiment runs under test_sigint_theano, a bug happens in some tests, where some sleeping processes remain and the corresponding user warning is produced. The output comes with a traceback that points to the fork of the process that is remaining:

2018-08-28 23:09:36.891154 PDT | Perplexity                 4.13273
2018-08-28 23:09:36.891207 PDT | StdReturn                 65.6678
2018-08-28 23:09:36.891261 PDT | dLoss                      0.0341225
2018-08-28 23:09:36.891313 PDT | -----------------------  -------------
Traceback (most recent call last):
  File "tests/fixtures/theano/trpo_cartpole_instrumented.py", line 38, in <module>
    plot=True,
  File "/home/aigonzal/ivanWorkspace/garage/garage/misc/instrument.py", line 524, in run_experiment
    command, shell=True, env=dict(os.environ, **env))
  File "/home/aigonzal/miniconda2/envs/garage/lib/python3.6/subprocess.py", line 269, in call
    return p.wait(timeout=timeout)
  File "/home/aigonzal/miniconda2/envs/garage/lib/python3.6/subprocess.py", line 1457, in wait
    (pid, sts) = self._try_wait(0)
  File "/home/aigonzal/miniconda2/envs/garage/lib/python3.6/subprocess.py", line 1404, in _try_wait
    (pid, sts) = os.waitpid(self.pid, wait_flags)
KeyboardInterrupt
The following processes didn't die after the shutdown of run_experiment:
{'status': 'sleeping', 'name': 'python', 'pid': 12288}
This is a sign of an unclean shutdown. Please reopen the following issue 
 with a detailed description of how the error was produced:
https://github.com/rlworkgroup/garage/issues/120
Traceback (most recent call last):
  File "/home/aigonzal/ivanWorkspace/garage/scripts/run_experiment.py", line 235, in <module>
    run_experiment(sys.argv)
  File "/home/aigonzal/ivanWorkspace/garage/scripts/run_experiment.py", line 185, in run_experiment
    method_call(variant_data)
  File "tests/fixtures/theano/trpo_cartpole_instrumented.py", line 26, in run_task
    algo.train()
  File "/home/aigonzal/ivanWorkspace/garage/tests/fixtures/theano/batch_polopt_instrumented.py", line 35, in train
    paths = self.sampler.obtain_samples(itr)
  File "/home/aigonzal/ivanWorkspace/garage/garage/algos/batch_polopt.py", line 28, in obtain_samples
    scope=self.algo.scope,
  File "/home/aigonzal/ivanWorkspace/garage/garage/sampler/parallel_sampler.py", line 130, in sample_paths
    show_prog_bar=True)
  File "/home/aigonzal/ivanWorkspace/garage/garage/sampler/stateful_pool.py", line 162, in run_collect
    manager = mp.Manager()
  File "/home/aigonzal/miniconda2/envs/garage/lib/python3.6/multiprocessing/context.py", line 56, in Manager
    m.start()
  File "/home/aigonzal/miniconda2/envs/garage/lib/python3.6/multiprocessing/managers.py", line 513, in start
    self._process.start()
  File "/home/aigonzal/miniconda2/envs/garage/lib/python3.6/multiprocessing/process.py", line 105, in start
    self._popen = self._Popen(self)
  File "/home/aigonzal/miniconda2/envs/garage/lib/python3.6/multiprocessing/context.py", line 277, in _Popen
    return Popen(process_obj)
  File "/home/aigonzal/miniconda2/envs/garage/lib/python3.6/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/home/aigonzal/miniconda2/envs/garage/lib/python3.6/multiprocessing/popen_fork.py", line 66, in _launch
    self.pid = os.fork()
KeyboardInterrupt

It seems the problem is related to the use of psutil.wait_procs in both run_experiment and test_sigint_theano at the same time, but I need to look further into this.
This bug does not happen when a launcher is used.

ryanjulian · 2018-08-29T18:24:18Z

garage/plotter/plotter.py


 class Plotter:

    # Static variable used to disable the plotter
    enable = True

-    def __init__(self):
+    def __init__(self, standalone=False):
+        __plotters__.append(self)


don't use a dunder (__foo__). Those are reserved for Python internals. __plotters should be fine.

ryanjulian · 2018-08-29T18:26:32Z

garage/plotter/plotter.py

@@ -21,13 +21,16 @@ class Op(Enum):

 Message = namedtuple("Message", ["op", "args", "kwargs"])

+__plotters__ = []


This can be a class member of Plotter rather than package-global.

See https://stackoverflow.com/a/12102666 for a useful pattern.
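
As an illustration of the class-member idea, here is a minimal sketch. `PlotterSketch`, `get_plotters`, and the `closed` flag are hypothetical names made up for this example, not the garage Plotter API: the registry lives on the class, so callers reach every instance through the class instead of through a package-level global.

```python
class PlotterSketch:
    """Illustrative stand-in for a plotter that tracks its instances."""

    # Class-level registry shared by all instances.
    _plotters = []

    def __init__(self):
        self.closed = False
        PlotterSketch._plotters.append(self)

    @classmethod
    def get_plotters(cls):
        """Return a copy so callers cannot mutate the registry."""
        return list(cls._plotters)

    def close(self):
        self.closed = True


if __name__ == "__main__":
    PlotterSketch()
    PlotterSketch()
    for plotter in PlotterSketch.get_plotters():
        plotter.close()
    print(all(p.closed for p in PlotterSketch.get_plotters()))
```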

ryanjulian · 2018-08-29T18:31:27Z

scripts/run_experiment.py

@@ -186,5 +202,34 @@ def run_experiment(argv):
    logger.pop_prefix()


+def child_proc_shutdown(shutdown_sampler=False):


If you change the Plotter API to use close() (or the sampler API to use shutdown()) then this could be called as

def child_proc_shutdown(children):
    run_exp_proc = psutil.Process()
    alive = run_exp_proc.children(recursive=True)
    for c in children:
        c.shutdown()
    # etc...

# example at the callsite
child_proc_shutdown(__plotters__ + [parallel_sampler])

ryanjulian · 2018-08-29T18:31:43Z

scripts/travisci/check_flake8.sh

@@ -2,7 +2,7 @@

 # Python packages considered local to the project, for the purposes of import
 # order checking. Comma-delimited.
-garage_packages="garage,sandbox,examples,tests"
+garage_packages="tests,garage,sandbox,examples,contrib"


contrib was removed for a reason: it no longer exists

ryanjulian · 2018-08-29T18:31:55Z

setup.cfg

@@ -4,7 +4,7 @@
 # style, this rule is ignored.
 ignore = W503
 import-order-style = google
-application-import-names = sandbox,garage,examples,tests
+application-import-names = tests,sandbox,garage,examples,contrib


Remove contrib

ryanjulian · 2018-08-29T18:32:11Z

tests/fixtures/theano/__init__.py

@@ -0,0 +1,4 @@
+from tests.fixtures.theano.batch_polopt_instrumented import \


PEP8 says to use () to split lines instead of \

ryanjulian · 2018-08-29T18:36:12Z

tests/fixtures/theano/__init__.py

@@ -0,0 +1,4 @@
+from tests.fixtures.theano.batch_polopt_instrumented import \
+        InstrumentedBatchPolopt
+from tests.fixtures.theano.npo_instrumented import InstrumentedNPO


Package names should be instrumented_batchpolopt, instrumented_npo and instrumented_trpo to match class names.

ryanjulian · 2018-08-29T18:37:57Z

tests/sampler/test_sigint_theano.py

@@ -0,0 +1,81 @@
+from enum import Enum


Please put this file in tests/integration_tests

ryanjulian · 2018-08-29T18:39:54Z

tests/sampler/test_sigint_theano.py

@@ -0,0 +1,81 @@
+from enum import Enum
+from multiprocessing.connection import Listener
+import os


Can you add an equivalent for TensorFlow?

ryanjulian · 2018-08-29T18:41:16Z

tests/sampler/test_sigint_theano.py

+class TestSigInt(unittest.TestCase):
+    def test_sigint(self):
+        """Interrupt the experiment in different stages of its lifecyle."""
+        for stage in list(ExpLifecycle):


use a parameterized test instead of for loop https://nose2.readthedocs.io/en/latest/params.html

ryanjulian · 2018-08-29T18:41:58Z

tests/sampler/test_sigint_theano.py

+        os.kill(child.pid, signal.SIGINT)
+
+    if not clean_exit:
+        raise AssertionError(colorize(error_msg, "red"))


Don't raise exceptions in tests. Just use assert.

codecov · 2018-08-30T22:30:53Z

Codecov Report

Merging #295 into master will decrease coverage by <.01%.
The diff coverage is 49.23%.

@@            Coverage Diff             @@
##           master     #295      +/-   ##
==========================================
- Coverage   61.41%   61.41%   -0.01%     
==========================================
  Files         213      213              
  Lines       14312    14328      +16     
==========================================
+ Hits         8790     8799       +9     
- Misses       5522     5529       +7

Impacted Files	Coverage Δ
garage/misc/instrument.py	`29.01% <ø> (+0.04%)`	⬆️
garage/tf/algos/ddpg.py	`8.61% <0%> (ø)`	⬆️
garage/tf/algos/batch_polopt.py	`90.42% <100%> (ø)`	⬆️
garage/algos/cem.py	`94.54% <100%> (ø)`	⬆️
garage/theano/algos/ddpg.py	`93.87% <100%> (ø)`	⬆️
garage/algos/cma_es.py	`93.97% <100%> (ø)`	⬆️
garage/envs/box2d/parser/xml_types.py	`84% <100%> (ø)`	⬆️
garage/tf/envs/parallel_vec_env_executor.py	`14.28% <15.38%> (ø)`	⬆️
garage/sampler/parallel_sampler.py	`71.26% <50%> (ø)`	⬆️
garage/misc/logger.py	`43.01% <50%> (+0.21%)`	⬆️
... and 5 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0d37327...d3adb1a. Read the comment docs.

The class BatchPolopt has been overridden as BatchPoloptCallback to notify the test of the different stages in the experiment life cycle so it can be interrupted with SIGINT. The test makes sure that no child processes remain after the SIGINT is sent; otherwise it throws an assertion error listing the processes that didn't die. Also, the context manager MasksSignals has been created to handle the masking of SIGINT in parallel_sampler.

Also, some Codacy issues were solved, as well as some legacy pylint and flake8 issues.

All plotters are appended to a static list, so they can be easily reached from run_experiment to call shutdown on them. Also, terminate was replaced by close to shut down the parallel sampler, since terminate calls join and may block the shutdown in run_experiment. In order to check that all processes died after a timeout, a loop that polls for alive processes was implemented. A warning message for processes that remain after shutdown is printed so users of garage can reopen the corresponding issue.

Other changes include the file renaming of instrumented policy optimizers, as well as renaming the shutdown method in plotters to close in order to close all processes under run_experiment with the same method.

Enum was changed to IntEnum; otherwise, comparing two enumerations assigned to variables does not work.

codecov · 2018-09-18T20:19:54Z

Codecov Report

Merging #295 into master will decrease coverage by <.01%.
The diff coverage is 49.23%.

@@            Coverage Diff             @@
##           master     #295      +/-   ##
==========================================
- Coverage   61.32%   61.31%   -0.01%     
==========================================
  Files         213      213              
  Lines       14316    14332      +16     
==========================================
+ Hits         8779     8788       +9     
- Misses       5537     5544       +7

Impacted Files	Coverage Δ
garage/misc/instrument.py	`29.01% <ø> (+0.04%)`	⬆️
garage/tf/algos/ddpg.py	`8.61% <0%> (ø)`	⬆️
garage/tf/algos/batch_polopt.py	`90.42% <100%> (ø)`	⬆️
garage/algos/cem.py	`94.54% <100%> (ø)`	⬆️
garage/theano/algos/ddpg.py	`93.87% <100%> (ø)`	⬆️
garage/algos/cma_es.py	`93.97% <100%> (ø)`	⬆️
garage/envs/box2d/parser/xml_types.py	`84% <100%> (ø)`	⬆️
garage/tf/envs/parallel_vec_env_executor.py	`14.28% <15.38%> (ø)`	⬆️
garage/sampler/parallel_sampler.py	`71.26% <50%> (ø)`	⬆️
garage/misc/logger.py	`43.01% <50%> (+0.21%)`	⬆️
... and 5 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update de91deb...83a4813. Read the comment docs.

A process known as the semaphore tracker is spawned from the run_experiment process, but we cannot stop this process, as it ignores SIGINT and SIGTERM and we don't have access to it. Therefore, it's removed from the list of children to wait for in both run_experiment and test_sigint. Another process that is spawned from run_experiment is the Manager, which owns multiprocessing objects such as RLocks and counters used during the run_collect method in the StatefulPool class. This process was also making the warning message appear and the test fail. However, the manager has a shutdown method to terminate the process, so we can verify the termination of the process.

ryanjulian · 2018-09-21T17:40:26Z

NICE

ghost self-assigned this Aug 17, 2018

ghost requested review from eric-heiden, CatherineSue and jonashen August 17, 2018 15:57

ghost self-requested a review as a code owner August 17, 2018 15:57

jonashen approved these changes Aug 17, 2018

jonashen reviewed Aug 17, 2018

ghost force-pushed the fix_sleeping_proc branch from f0c384c to 597ac9d Compare August 20, 2018 17:36

ryanjulian reviewed Aug 20, 2018

ghost force-pushed the fix_sleeping_proc branch from 597ac9d to 6717887 Compare August 23, 2018 03:46

ghost force-pushed the fix_sleeping_proc branch from 6717887 to b17dde5 Compare August 23, 2018 16:33

ghost force-pushed the fix_sleeping_proc branch 3 times, most recently from 2da09c0 to f93c359 Compare August 23, 2018 21:27

ghost force-pushed the fix_sleeping_proc branch 4 times, most recently from c97ef82 to ad23f72 Compare August 28, 2018 02:45

ryanjulian reviewed Aug 29, 2018

Angel Gonzalez added 7 commits September 18, 2018 13:19

Move and rename test files

49a9026

Also, some of codacy issues were solved, as well as some legacy pylint and flake8 issues.

Add SIGINT test for TensorFlow

0bcceb7

Other changes include the file renaming of instrumented policy optimizers, as well as renaming the shutdown method in plotters to close in order to close all processes under run_experiment with the same method.

Fix PEP8 issues

6ce2599

Change Enum to IntEnum

c3e41b9

Otherwise, comparing two enumerations assigned to variables does not work.
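
The commit message does not say which comparison failed, so the following is only one plausible illustration: members of a plain `Enum` cannot be ordered and never compare equal to integers, while `IntEnum` members behave like ints. The class names `Stage` and `OrderedStage` and the helper `can_order` are made up for this sketch.

```python
from enum import Enum, IntEnum


class Stage(Enum):
    SETUP = 1
    SAMPLING = 2


class OrderedStage(IntEnum):
    SETUP = 1
    SAMPLING = 2


def can_order(a, b):
    """Return True if `a < b` is a supported comparison."""
    try:
        a < b
        return True
    except TypeError:
        return False


if __name__ == "__main__":
    # Plain Enum members do not define ordering.
    print(can_order(Stage.SETUP, Stage.SAMPLING))
    # IntEnum members compare like the integers they wrap.
    print(can_order(OrderedStage.SETUP, OrderedStage.SAMPLING))
    print(OrderedStage.SETUP == 1, Stage.SETUP == 1)
```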

ghost force-pushed the fix_sleeping_proc branch from d3adb1a to c3e41b9 Compare September 18, 2018 20:19

ghost merged commit d5b9e63 into master Sep 21, 2018

ryanjulian deleted the fix_sleeping_proc branch September 21, 2018 17:40

ghost mentioned this pull request Sep 21, 2018

run_experiment should not leave zombie processes #120

Closed

naeioi mentioned this pull request Oct 10, 2018

Failed SIGINT integration test is not failing the build #339

Closed

This pull request was closed.

