
Logging of bus.shutdown() #115

Open
m3d opened this issue Nov 23, 2018 · 9 comments
@m3d
Member

m3d commented Nov 23, 2018

We discussed the need to log termination of the program run, initiated by bus.shutdown (typically from Node.request_stop()). The question is whether every stream in the log file should have a "terminal" mark, or whether we should use service channel 0 instead. It is triggered for all modules at the same time and the log file is ordered by timestamp ... so one common mark should be enough. On the other hand, this means that whenever we replay an individual node we also have to take channel 0 into account (and add some structure there to distinguish error reports, stored config, etc.). Any suggestion?
This issue is related to #110.
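
For illustration, a minimal sketch of the channel-0 variant (the b'shutdown' payload is made up here; LogWriter.write(stream_id, data) is the existing API, and the reader loop assumes records come back as (timestamp, stream_id, data) tuples):

```python
from osgar.logger import LogReader

# writing side - one common mark on service channel 0 at shutdown
log.write(0, b'shutdown')

# replay side - replaying an individual node now has to watch channel 0
# too, and channel 0 needs some structure to tell this record apart from
# the stored config, error reports, etc.
for timestamp, stream_id, data in LogReader(filename):
    if stream_id == 0 and data == b'shutdown':
        break  # stop feeding the replayed node at the logged time
```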

@zbynekwinkler
Member

I am looking into it now. There seems to be an impedance mismatch between what we log (messages in streams) and what we use at runtime (one queue per Node with messages from multiple streams). It would be natural to log shutdown at the queue level, but there is no such thing in the log file.

In the larger scheme of things we want to be able to detect when the log file is not complete (see #241). For that, using service channel 0 is OK, I think.

However: how do we best replay logs where the program was ended by Ctrl-C or something similar (i.e. asynchronous, not logged)? We might have to handle this case first - for context see https://vorpus.org/blog/control-c-handling-in-python-and-trio/ and https://docs.python.org/3/library/signal.html.

From the documentation:

Python signal handlers are always executed in the main Python thread

only the main thread is allowed to set a new signal handler

We need to set our own signal handler that notes globally that a shutdown was requested and also logs it. Each thread then needs to periodically check this flag. During replay we just recover the value of this flag at the right (simulated) time from the log.
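
A minimal sketch of that, assuming the flag is a threading.Event and that logging it on channel 0 works as sketched above (the worker-loop body is hypothetical):

```python
import signal
import threading

g_shutdown = threading.Event()  # global "shutdown was requested" flag

def sigint_handler(signum, frame):
    # Signal handlers run only in the main Python thread (see docs above).
    g_shutdown.set()
    log.write(0, b'sigint')  # hypothetical: note it in the log for replay

signal.signal(signal.SIGINT, sigint_handler)

# Each worker thread then periodically checks the flag instead of being
# interrupted asynchronously by KeyboardInterrupt:
while not g_shutdown.is_set():
    process_next_message()  # hypothetical per-thread work step
```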

While we are at it, I think it would be beneficial to allow for a 2-stage shutdown. First we let the nodes know that shutdown was requested, but we do not stop the communication right away. Instead we wait (with a timeout) for the communication to stop and then we exit. The benefit is that we would be able, for example, to stop the robot or do other state-preserving actions, without relying on communication timeouts and defaults.
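
A sketch of what such a 2-stage shutdown could look like, reusing recorder.modules and Node.request_stop() mentioned earlier; the timeout handling and helper name are illustrative:

```python
import time

SHUTDOWN_TIMEOUT = 5.0  # illustrative value, in seconds

def two_stage_shutdown(recorder):
    # Stage 1: let the nodes know that shutdown was requested, but keep
    # the communication running so they can e.g. stop the robot or do
    # other state-preserving actions.
    for module in recorder.modules.values():
        module.request_stop()
    # Stage 2: wait (with a timeout) for the communication to stop,
    # then exit.
    deadline = time.monotonic() + SHUTDOWN_TIMEOUT
    for module in recorder.modules.values():
        module.join(timeout=max(0.0, deadline - time.monotonic()))
    recorder.finish()
```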

So to summarize: we first need to properly handle the asynchronous Ctrl+C, and the rest will somehow follow by itself.

@zbynekwinkler
Member

See https://christopherdavis.me/blog/threading-basics.html for some other examples.

@zbynekwinkler
Member

record is already catching KeyboardInterrupt - see

osgar/osgar/record.py

Lines 65 to 81 in ebbf1f4

```python
try:
    log.write(0, bytes(str(config), 'ascii'))  # write configuration
    recorder = Recorder(config=config['robot'], logger=log, application=application)
    recorder.start()
    if application is not None:
        app = recorder.modules['app']  # TODO nicer reference
        app.join()  # wait for application termination
    else:
        if duration_sec is None:
            while True:
                time.sleep(1.0)
        else:
            time.sleep(duration_sec)
except KeyboardInterrupt:
    print("record() caught KeyboardInterrupt")
finally:
    recorder.finish()
```
So we are partially there. But not everything is using record - often Recorder is used directly instead, as it predates record.

So maybe record should not take config_filename as a parameter but an already-built config dictionary? Then we could replace

osgar/subt/main.py

Lines 630 to 636 in ebbf1f4

```python
with LogWriter(prefix=prefix, note=str(sys.argv)) as log:
    log.write(0, bytes(str(config), 'ascii'))  # write configuration
    robot = Recorder(config=config['robot'], logger=log, application=SubTChallenge)
    app = robot.modules['app']  # TODO nicer reference
    robot.start()
    app.play()
    robot.finish()
```

with

```python
record(config=config['robot'], prefix=prefix, application=SubTChallenge)
```

This should allow us to move from using Recorder to using record everywhere, which would allow us to handle shutdown properly.

There is one more quirk to think about - when there is/isn't an application, and whether said application is/isn't a Thread instance. What if we forbid application from being a Thread instance? Grr. Node is a Thread (#124). I am afraid that this technical debt is going to eat us alive 😟

@zbynekwinkler
Member

Ok, so the plan could be:

  • Make Node not a Thread (Node should not inherit from Thread, #124); instead create the Thread in Recorder
  • Once Threads come to Node from the outside, it is easy to handle the case with an application, so we can replace all Recorder usages with record - maybe it will even lend itself to using launch everywhere
  • Only after that can we add a threading.Event to record, set on SIGINT, to replace the asynchronous nature of KeyboardInterrupt and therefore allow logging of shutdown - see the sketch below
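
A sketch of that last step, based on the record() snippets above; wiring the Event exactly this way is an assumption:

```python
import signal
import sys
import threading

g_quit = threading.Event()

def request_quit(signum, frame):
    g_quit.set()

def record(config, prefix, application=None, duration_sec=None):
    signal.signal(signal.SIGINT, request_quit)
    with LogWriter(prefix=prefix, note=str(sys.argv)) as log:
        log.write(0, bytes(str(config), 'ascii'))  # write configuration
        recorder = Recorder(config=config['robot'], logger=log,
                            application=application)
        recorder.start()
        # wait() replaces both the time.sleep() loop and the asynchronous
        # KeyboardInterrupt: it returns after duration_sec, or as soon as
        # SIGINT sets the event - at a well-defined point that can also
        # be logged for deterministic replay.
        g_quit.wait(timeout=duration_sec)
        recorder.finish()
```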

@zbynekwinkler
Member

I almost got to the point where Node was not a Thread and all threads were created in Recorder. The structure was such that it would allow easy replacement of Thread by a Process from multiprocessing. However, I failed at the point where slots are defined, because by definition you cannot call a method on an object living in another process.

So... there should be one more item at the beginning of the plan - what to do with slots? Without them there were some problems, which we hot-fixed by adding slots. But we were (are?) not sure what the real problem was 😩. It seems that the Threads were not switching fast enough.

So maybe time.sleep(0.000001) would be enough? Or maybe we should allow multiple nodes to live in one Thread - meaning they would share the message queue (actually BusHandler) and we would have to note in the queue item which node is the destination. This would decrease the dependency on the process scheduler, which is a good thing I think. It would also allow us to decrease the number of Threads in the system, which is also a good thing.
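
A rough sketch of the shared-queue variant (the queue-item layout and the Node.update() call are assumptions for illustration):

```python
import queue
import threading

class SharedNodeThread(threading.Thread):
    """One thread serving several nodes from a single shared queue."""

    def __init__(self, nodes):
        super().__init__()
        self.nodes = nodes          # dict: name -> node instance
        self.queue = queue.Queue()  # shared, BusHandler-like

    def run(self):
        while True:
            # each queue item notes which node is the destination
            dest, timestamp, channel, data = self.queue.get()
            if dest is None:
                return  # shutdown sentinel
            self.nodes[dest].update(timestamp, channel, data)  # assumed API
```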

But OMG - I am terrified by how intertwined all this stuff is; you cannot fix one thing before fixing some other thing, which itself needs yet another fix first...

I'll give it a rest for a while to think about it some more 😓

@m3d
Member Author

m3d commented Sep 30, 2019

I think that if you accept Ctrl+C, i.e. "terminal input", then there should be a dedicated "terminal" node (probably covering not only stdin, but also stdout). Ctrl+C is fine with a CLI interface, but we would maybe like to move to a GUI with a "Shutdown" button?? So I would probably handle it not per whole application but per "terminal node".

@zbynekwinkler
Member

I am not sure I follow. Are we still talking about logging of shutdown?

Handling the SIGINT signal was/is a necessary precondition for logging and replay to be successful, because not handling it is a source of non-determinism: the KeyboardInterrupt exception that Python generates in the default SIGINT handler can cause the main thread to exit at any point of the execution, and since we are not able to recreate that during replay, we must not allow it to happen during record.

That we would possibly like to end applications by some other means is a valid issue, though orthogonal to both logging of shutdown and handling of the SIGINT signal.

Or have I misunderstood your note?

@m3d
Member Author

m3d commented Sep 30, 2019

OK, Ctrl+C in my case typically means a failure of the application, an infinite loop, the robot almost hitting the wall ... so it is a kind of "hard kill", and maybe for a "nice shutdown" just "press any key" would be more suitable. But these are probably two different things - first of all I would like to see that the application finished properly, which could be recorded in channel 0, as you wrote above. I would postpone this discussion for a week (note that this issue was opened a year ago).

@zbynekwinkler
Member

So you are actually responding to this comment, right?

While we are at it, I think it would be beneficial to allow for a 2-stage shutdown. First we let the nodes know that shutdown was requested, but we do not stop the communication right away. Instead we wait (with a timeout) for the communication to stop and then we exit. The benefit is that we would be able, for example, to stop the robot or do other state-preserving actions, without relying on communication timeouts and defaults.

We should probably track it as a separate issue. But for the sake of faster understanding, could you please use quotes next time?

BTW: my status on this issue is this:

I'll give it a rest for a while to think about it some more 😓

and #243 is something I have been able to come up with since then. So there is not much of a hurry.
