Logging of bus.shutdown() #115
Comments
I am looking into it now. There seems to be an impedance mismatch between what we log (messages in streams) and what we use at runtime (one queue per Node with messages from multiple streams). It would be natural to log shutdown at queue level, but there is no such thing in the log file. In the larger scheme of things we want to be able to detect when the log file is not complete (see #241). For that, using service channel 0 is OK, I think. However, how do we best replay logs where the program was ended by Ctrl-C or something similar (i.e. asynchronously and without being logged)? We might have to handle this case first. For context see https://vorpus.org/blog/control-c-handling-in-python-and-trio/ and https://docs.python.org/3/library/signal.html. From the documentation:
We need to set our own signal handler that notes globally that a shutdown was requested and also logs it. Each thread then needs to check this flag periodically. During replay we just recover the value of this flag at the right (simulated) time from the log. While we are at it, I think it would be beneficial to allow for a two-stage shutdown. First we let the nodes know that shutdown was requested, but we do not stop the communication right away. Instead we wait (with a timeout) for the communication to stop and then we exit. The benefit would be that we should be able, for example, to stop the robot or do other state-preserving actions without relying on communication timeouts and defaults. So to summarize: we need to first properly handle asynchronous Ctrl+C and the rest will somehow follow by itself.
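To make the idea above concrete, here is a minimal sketch of a SIGINT handler that logs the request and sets a global flag, with node threads that poll the flag and a two-stage shutdown that waits with a timeout. All names (`shutdown_requested`, `service_log`, `on_sigint`, `node_loop`, `start_nodes`, `wait_for_nodes`) are made up for illustration and are not the project's API. The in-memory list only stands in for the real log file.

```python
import signal
import threading
import time

shutdown_requested = threading.Event()   # the global "shutdown was requested" flag
service_log = []                         # stand-in for the real log (hypothetical)


def on_sigint(signum, frame):
    """Replaces the default SIGINT handler: log the request and set the flag."""
    service_log.append((time.monotonic(), 'shutdown requested'))
    shutdown_requested.set()


def node_loop(name, stopped):
    """Hypothetical node thread that periodically checks the flag."""
    while not shutdown_requested.is_set():
        time.sleep(0.01)                 # one step of regular work
    # Stage 1: the request is known but communication is still up, so
    # state-preserving actions (e.g. stopping the robot) would go here.
    stopped.append(name)


def start_nodes(names, stopped):
    threads = [threading.Thread(target=node_loop, args=(n, stopped), name=n)
               for n in names]
    for t in threads:
        t.start()
    return threads


def wait_for_nodes(threads, timeout):
    """Stage 2: wait (with a timeout) for the nodes to finish."""
    deadline = time.monotonic() + timeout
    for t in threads:
        t.join(max(0.0, deadline - time.monotonic()))
    return [t.name for t in threads if t.is_alive()]   # nodes that did not stop in time


def main():
    signal.signal(signal.SIGINT, on_sigint)   # has to be called from the main thread
    stopped = []
    threads = start_nodes(['a', 'b'], stopped)
    while not shutdown_requested.wait(0.1):   # short waits keep the main thread responsive
        pass
    return wait_for_nodes(threads, timeout=1.0)
```

With the handler installed, Ctrl+C no longer raises KeyboardInterrupt at an arbitrary point in the main thread. During replay, the flag would be set when the logged entry is reached instead of by a signal.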
See https://christopherdavis.me/blog/threading-basics.html for some other examples.
Lines 65 to 81 in ebbf1f4
record - often Recorder is used instead, as it predates record.
So, maybe Lines 630 to 636 in ebbf1f4
with record(config=config['robot'], prefix=prefix, application=SubTChallenge) This should allow us to move from using There is one more quirk to think about - when there is/isn't an application and when the said application is/isn't a Thread instance. What if we forbid application to be a Thread instance? Grr. Node is a Thread (#124). I am afraid that this technical debt is going to eat us alive 😟
Ok, so the plan could be:
I almost got to a point where Node was not a Thread and all threads were created in Recorder. The structure was such that it would allow easy replacement of Thread by a Process from multiprocessing. However, I failed at the point where slots are defined, because by definition you cannot call a method on an object living in another process. So... there should be one more item at the beginning of the plan: what to do with slots? Without them, there were some problems. We hot-fixed them by adding slots, but we were (are?) not sure what the real problem was 😩. It seems that the Threads were not switching fast enough. So maybe time.sleep(0.000001) would be enough? Or maybe we should allow multiple nodes to live in one Thread, meaning they would share the message queue (actually BusHandler) and we would have to note in each queue item which node is the destination. This would decrease the dependency on the process scheduler, which is a good thing I think. It would also allow us to decrease the number of Threads in the system, which is also a good thing. But OMG - I am terrified by how all this stuff is intertwined and how you cannot fix one thing before fixing some other thing before fixing yet another... I'll give it a rest for a while to think about it some more 😓
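The "multiple nodes in one Thread" option could look roughly like the sketch below: one thread owns one shared queue, and every queue item carries the name of its destination node. `SharedQueueWorker` and its methods are hypothetical names for illustration. The real BusHandler interface is not reproduced here, and nodes are reduced to plain callables.

```python
import queue
import threading


class SharedQueueWorker(threading.Thread):
    """One thread serving several nodes through a single shared queue (hypothetical)."""

    def __init__(self, nodes):
        super().__init__()
        self.nodes = nodes            # destination name -> callable(channel, data)
        self.inbox = queue.Queue()

    def publish(self, destination, channel, data):
        # The destination travels with the message, as suggested above.
        self.inbox.put((destination, channel, data))

    def request_stop(self):
        self.inbox.put(None)          # sentinel: stop after the queued items

    def run(self):
        while True:
            item = self.inbox.get()
            if item is None:
                break
            destination, channel, data = item
            self.nodes[destination](channel, data)
```

Because all nodes of one worker are served in queue order by a single thread, the order in which they see messages no longer depends on how the scheduler switches between threads.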
I think that if you accept Ctrl+C, i.e. "terminal input", then there should be a dedicated "terminal" node (probably not only for stdin, but also for stdout). Ctrl+C is fine with a CLI interface, but we would probably like to have a GUI too, maybe with a "Shutdown" button? So I would probably handle it not for the whole application but per "terminal node".
I am not sure I follow. Are we still talking about logging of shutdown? Handling the SIGINT signal was/is a necessary precondition for logging and replay to be successful, because not handling it is a source of non-determinism: the KeyboardInterrupt exception that Python raises in the default SIGINT handler can cause the main thread to exit at any point of the execution, and since we are not able to recreate that during replay, we must not allow it to happen during recording. That we would possibly like to end applications by some other means is a valid, though orthogonal, issue to both logging of shutdown and handling of the SIGINT signal. Or have I misunderstood your note?
OK, Ctrl+C in my case typically means a failure of the application, an infinite loop, the robot almost hitting the wall ... so it is kind of a "hard kill", and maybe for a "nice shutdown" just "press any key" would be more suitable. But these are probably two different things. First of all I would like to see that the application finished properly, which could be recorded in channel 0, as you wrote above. I would postpone this discussion for a week (note that it was opened a year ago).
So you are actually responding to this comment, right?
We should probably track it as a separate issue. But for the sake of faster understanding, could you please use quotes next time? BTW: my status on this issue is this:
and #243 is something I've been able to come up with afterwards. So there is not much of a hurry.
We discussed the need to log termination of the program run, which was initiated by bus.shutdown (typically from Node.request_stop()). The question is whether every stream in the log file should have a "terminal" mark or whether we should use service channel 0 instead. Shutdown is triggered for all modules at the same time and the log file is ordered by timestamp ... so one common mark should be enough. On the other hand, this means that whenever we replay an individual node we also have to take channel 0 into account (and add some structure there to distinguish error reports, stored config etc.). Any suggestions?
This issue is related to #110.
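As a rough sketch of the "one common mark on channel 0" variant: the code below assumes log records are `(timestamp, stream_id, data)` tuples ordered by timestamp, and that channel 0 payloads are JSON with a `kind` field to tell shutdown marks from error reports and stored config. That payload layout and the names `service_record` and `replay_node` are assumptions for illustration, not the existing log format.

```python
import json

SERVICE_CHANNEL = 0


def service_record(timestamp, kind, payload=None):
    """Build a channel-0 record; 'kind' separates shutdown marks from errors, config, ..."""
    data = json.dumps({'kind': kind, 'payload': payload}).encode()
    return (timestamp, SERVICE_CHANNEL, data)


def replay_node(records, node_streams):
    """Yield the records of one node's streams, ending at the common shutdown mark."""
    for timestamp, stream_id, data in records:
        if stream_id == SERVICE_CHANNEL:
            if json.loads(data)['kind'] == 'shutdown':
                yield timestamp, 'shutdown', None
                return                      # nothing after the mark is replayed
        elif stream_id in node_streams:
            yield timestamp, stream_id, data
```

A log that ends without a shutdown mark on channel 0 could then be treated as incomplete, which ties in with #241.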