
Logging of bus.shutdown() #115

Open
m3d opened this issue Nov 23, 2018 · 9 comments
@m3d
Member

m3d commented Nov 23, 2018

We discussed the need to log termination of the program run, initiated by bus.shutdown (typically from Node.request_stop()). The question is whether every stream in the log file should have a "terminal" mark, or whether we should use service channel 0 instead. It is triggered for all modules at the same time and the log file is ordered by timestamp ... so one common mark should be enough. On the other hand, this means that whenever we replay an individual node we also have to take channel 0 into account (and add some structure there to distinguish error reports, stored config, etc.). Any suggestion?
This issue is related to #110.
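
For illustration, a minimal sketch of the channel-0 variant (the b'shutdown' payload is made up here; LogWriter.write(stream_id, data) is the existing API, and the reader loop assumes records come back as (timestamp, stream_id, data) tuples):

```python
from osgar.logger import LogReader

# writing side - one common mark on service channel 0 at shutdown
log.write(0, b'shutdown')

# replay side - replaying an individual node now has to watch channel 0
# too, and channel 0 needs some structure to tell this record apart from
# the stored config, error reports, etc.
for timestamp, stream_id, data in LogReader(filename):
    if stream_id == 0 and data == b'shutdown':
        break  # stop feeding the replayed node at the logged time
```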

@zbynekwinkler
Member

I am looking into it now. There seems to be an impedance mismatch between what we log (messages in streams) and what we use at runtime (one queue per Node with messages from multiple streams). It would be natural to log shutdown at the queue level, but there is no such thing in the log file.

In the larger scheme of things we want to be able to detect when the log file is not complete (see #241). For that, using service channel 0 is OK, I think.

However: how do we best replay logs where the program was ended by Ctrl-C or something similar (i.e. asynchronous, not logged)? We might have to handle this case first - for context see https://vorpus.org/blog/control-c-handling-in-python-and-trio/ and https://docs.python.org/3/library/signal.html.

From the documentation:

Python signal handlers are always executed in the main Python thread

only the main thread is allowed to set a new signal handler

We need to set our own signal handler that notes globally that a shutdown was requested and also logs it. Each thread then needs to periodically check this flag. During replay we just recover the value of this flag at the right (simulated) time from the log.
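
A minimal sketch of that, assuming the flag is a threading.Event and that logging it on channel 0 works as sketched above (the worker-loop body is hypothetical):

```python
import signal
import threading

g_shutdown = threading.Event()  # global "shutdown was requested" flag

def sigint_handler(signum, frame):
    # Signal handlers run only in the main Python thread (see docs above).
    g_shutdown.set()
    log.write(0, b'sigint')  # hypothetical: note it in the log for replay

signal.signal(signal.SIGINT, sigint_handler)

# Each worker thread then periodically checks the flag instead of being
# interrupted asynchronously by KeyboardInterrupt:
while not g_shutdown.is_set():
    process_next_message()  # hypothetical per-thread work step
```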

While we are at it, I think it would be beneficial to allow for a 2-stage shutdown. First we let the nodes know that shutdown was requested, but we do not stop the communication right away. Instead we wait (with a timeout) for the communication to stop and then we exit. The benefit is that we would be able, for example, to stop the robot or do other state-preserving actions, without relying on communication timeouts and defaults.
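
A sketch of what such a 2-stage shutdown could look like, reusing recorder.modules and Node.request_stop() mentioned earlier; the timeout handling and helper name are illustrative:

```python
import time

SHUTDOWN_TIMEOUT = 5.0  # illustrative value, in seconds

def two_stage_shutdown(recorder):
    # Stage 1: let the nodes know that shutdown was requested, but keep
    # the communication running so they can e.g. stop the robot or do
    # other state-preserving actions.
    for module in recorder.modules.values():
        module.request_stop()
    # Stage 2: wait (with a timeout) for the communication to stop,
    # then exit.
    deadline = time.monotonic() + SHUTDOWN_TIMEOUT
    for module in recorder.modules.values():
        module.join(timeout=max(0.0, deadline - time.monotonic()))
    recorder.finish()
```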

So to summarize: we first need to properly handle the asynchronous Ctrl+C, and the rest will somehow follow by itself.

@zbynekwinkler
Member

See https://christopherdavis.me/blog/threading-basics.html for some other examples.

@zbynekwinkler
Member

record is already catching KeyboardInterrupt - see

osgar/osgar/record.py

Lines 65 to 81 in ebbf1f4

```python
try:
    log.write(0, bytes(str(config), 'ascii'))  # write configuration
    recorder = Recorder(config=config['robot'], logger=log, application=application)
    recorder.start()
    if application is not None:
        app = recorder.modules['app']  # TODO nicer reference
        app.join()  # wait for application termination
    else:
        if duration_sec is None:
            while True:
                time.sleep(1.0)
        else:
            time.sleep(duration_sec)
except KeyboardInterrupt:
    print("record() caught KeyboardInterrupt")
finally:
    recorder.finish()
```
So we are partially there. But not everything is using record - often Recorder is used directly instead, as it predates record.

So maybe record should not take config_filename as a parameter but an already-built config dictionary? Then we could replace

osgar/subt/main.py

Lines 630 to 636 in ebbf1f4

```python
with LogWriter(prefix=prefix, note=str(sys.argv)) as log:
    log.write(0, bytes(str(config), 'ascii'))  # write configuration
    robot = Recorder(config=config['robot'], logger=log, application=SubTChallenge)
    app = robot.modules['app']  # TODO nicer reference
    robot.start()
    app.play()
    robot.finish()
```

with

```python
record(config=config['robot'], prefix=prefix, application=SubTChallenge)
```

This should allow us to move from using Recorder to using record everywhere, which would allow us to handle shutdown properly.

There is one more quirk to think about - when there is/isn't an application, and whether said application is/isn't a Thread instance. What if we forbid application from being a Thread instance? Grr. Node is a Thread (#124). I am afraid that this technical debt is going to eat us alive 😟

@zbynekwinkler
Member

Ok, so the plan could be:

  • Make Node not a Thread (Node should not inherit from Thread, #124); instead create the Thread in Recorder
  • Once Threads come to Node from the outside, it is easy to handle the case with an application, so we can replace all Recorder usages with record - maybe it will even lend itself to using launch everywhere
  • Only after that can we add a threading.Event to record, set on SIGINT, to replace the asynchronous nature of KeyboardInterrupt and therefore allow logging of shutdown - see the sketch below
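
A sketch of that last step, based on the record() snippets above; wiring the Event exactly this way is an assumption:

```python
import signal
import sys
import threading

g_quit = threading.Event()

def request_quit(signum, frame):
    g_quit.set()

def record(config, prefix, application=None, duration_sec=None):
    signal.signal(signal.SIGINT, request_quit)
    with LogWriter(prefix=prefix, note=str(sys.argv)) as log:
        log.write(0, bytes(str(config), 'ascii'))  # write configuration
        recorder = Recorder(config=config['robot'], logger=log,
                            application=application)
        recorder.start()
        # wait() replaces both the time.sleep() loop and the asynchronous
        # KeyboardInterrupt: it returns after duration_sec, or as soon as
        # SIGINT sets the event - at a well-defined point that can also
        # be logged for deterministic replay.
        g_quit.wait(timeout=duration_sec)
        recorder.finish()
```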

@zbynekwinkler
Member

I almost got to the point where Node was not a Thread and all threads were created in Recorder. The structure was such that it would allow easy replacement of Thread by a Process from multiprocessing. However, I failed at the point where slots are defined, because by definition you cannot call a method on an object living in another process.

So... there should be one more item at the beginning of the plan - what to do with slots? Without them there were some problems, which we hot-fixed by adding slots. But we were (are?) not sure what the real problem was 😩. It seems that the Threads were not switching fast enough.

So maybe time.sleep(0.000001) would be enough? Or maybe we should allow multiple nodes to live in one Thread - meaning they would share the message queue (actually BusHandler) and we would have to note in the queue item which node is the destination. This would decrease the dependency on the process scheduler, which is a good thing I think. It would also allow us to decrease the number of Threads in the system, which is also a good thing.
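
A rough sketch of the shared-queue variant (the queue-item layout and the Node.update() call are assumptions for illustration):

```python
import queue
import threading

class SharedNodeThread(threading.Thread):
    """One thread serving several nodes from a single shared queue."""

    def __init__(self, nodes):
        super().__init__()
        self.nodes = nodes          # dict: name -> node instance
        self.queue = queue.Queue()  # shared, BusHandler-like

    def run(self):
        while True:
            # each queue item notes which node is the destination
            dest, timestamp, channel, data = self.queue.get()
            if dest is None:
                return  # shutdown sentinel
            self.nodes[dest].update(timestamp, channel, data)  # assumed API
```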

But OMG - I am terrified by how intertwined all this stuff is; you cannot fix one thing before fixing some other thing, which itself needs yet another fix first...

I'll give it a rest for a while to think about it some more 😓

@m3d
Member Author

m3d commented Sep 30, 2019

I think that if you accept Ctrl+C, i.e. "terminal input", then there should be a dedicated "terminal" node (probably covering not only stdin, but also stdout). Ctrl+C is fine with a CLI interface, but we would maybe like to move to a GUI with a "Shutdown" button?? So I would probably handle it not per whole application but per "terminal node".

@zbynekwinkler
Member

I am not sure I follow. Are we still talking about logging of shutdown?

Handling the SIGINT signal was/is a necessary precondition for logging and replay to be successful, because not handling it is a source of non-determinism: the KeyboardInterrupt exception that Python generates in the default SIGINT handler can cause the main thread to exit at any point of the execution, and since we are not able to recreate that during replay, we must not allow it to happen during record.

That we would possibly like to end applications by some other means is a valid issue, though orthogonal to both logging of shutdown and handling of the SIGINT signal.

Or have I misunderstood your note?

@m3d
Member Author

m3d commented Sep 30, 2019

OK, Ctrl+C in my case typically means a failure of the application, an infinite loop, the robot almost hitting the wall ... so it is a kind of "hard kill", and maybe for a "nice shutdown" just "press any key" would be more suitable. But these are probably two different things - first of all I would like to see that the application finished properly, which could be recorded in channel 0, as you wrote above. I would postpone this discussion for a week (note that this issue was opened a year ago).

@zbynekwinkler
Member

So you are actually responding to this comment, right?

While we are at it, I think it would be beneficial to allow for a 2-stage shutdown. First we let the nodes know that shutdown was requested, but we do not stop the communication right away. Instead we wait (with a timeout) for the communication to stop and then we exit. The benefit is that we would be able, for example, to stop the robot or do other state-preserving actions, without relying on communication timeouts and defaults.

We should probably track it as a separate issue. But for the sake of faster understanding, could you please use quotes next time?

BTW: my status on this issue is this:

I'll give it a rest for a while to think about it some more 😓

and #243 is something I have been able to come up with since then. So there is not much of a hurry.
