There's a logic bug here when the Xorg server starts slowly #3107
/kickstart-test --testtype smoke
Thanks for the fix! Do I understand this correctly? If a timeout happens, we return everything to as it was before. But the X11 process is still starting. Once it manages to start, we receive the signal, which we no longer handle. This non-handling crashes the program.
Yes, it is.
I verified that this works exactly as it should. Once X does start, I get a black screen and must manually switch to tty1 to see the messages. I don't know enough about this to say if that's wrong or not. However, based on the need for your changes, I think this might be a code path that is not used too much and might be somewhat out of date. Then this would need also switching to console in the places that handle exceptions from the
I will ask.
/packit build
Sorry, that was unclear. What happens in tty1 is correct, it's alive and asks for things. Your fix works. However I don't see that, I must manually switch to that (ctrl+alt+f1).
Maybe you can follow the steps in https://bugzilla.redhat.com/show_bug.cgi?id=1918702. I set the timeout to 1 second and had the process sleep for 0.5 seconds before starting X11, to make sure that X11 doesn't finish starting before the timeout.
I ran into this problem in production today because I was installing the system from a network drive.
@poncovka says that the timeout can be changed to help when Xorg starts very slowly, to allow normal use. The situation where it actually times out was not considered. Therefore, the missing switch to tty1 is simply another part to fix.
I think I figured out the tty part, the two lines from earlier go to
@bitcoffeeiux , I see two things to change:
If you could do that, I will be happy to approve this pull request. Example commit message:
Great, I will.
Please review, thanks.
pyanaconda/core/util.py
Outdated
@@ -254,7 +254,7 @@ def sigusr1_preexec():
     finally:
         # Put everything back where it was
         signal.alarm(0)
-        signal.signal(signal.SIGUSR1, old_sigusr1_handler)
+        signal.signal(signal.SIGUSR1, signal.SIG_IGN)
I am afraid that it is necessary to restore the previous signal handler.
Oh. I forgot about that part. We use SIGUSR1 to test the crash handler. But if Xorg here keeps starting and eventually starts and sends it... goodbye Anaconda. So that's actually the original bug: Xorg sends the same signal that we use for debugging, and we take that to mean "let's crash".
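To make the failure mode concrete, here is a minimal sketch of the sequence described above. It does not start Xorg: `signal.raise_signal` stands in for the server's late SIGUSR1, and `crash_test_handler`, `start_x_that_times_out` and `simulate_late_x_start` are hypothetical names, not Anaconda code.

```python
import signal

crash_handler_calls = []


def crash_test_handler(signum, frame):
    """Hypothetical stand-in for the handler used to test crash reporting."""
    crash_handler_calls.append(signum)


def start_x_that_times_out():
    """Mimic the try/finally shape from the diff; no real X server is started."""
    # Temporary handler that would normally record "X is ready".
    old_handler = signal.signal(signal.SIGUSR1, lambda signum, frame: None)
    try:
        # Pretend the alarm fired before X reported readiness.
        raise TimeoutError("X did not start in time")
    finally:
        # Put everything back where it was
        signal.signal(signal.SIGUSR1, old_handler)


def simulate_late_x_start():
    """Return how many times the crash-test handler ran."""
    signal.signal(signal.SIGUSR1, crash_test_handler)
    try:
        start_x_that_times_out()
    except TimeoutError:
        pass
    # The slow X server finally finishes starting and signals its parent.
    signal.raise_signal(signal.SIGUSR1)
    return len(crash_handler_calls)
```

In this sketch the late signal lands in the restored crash-test handler, which is the "goodbye Anaconda" case.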
I guess we could try to kill childproc if we don't get X11 started in time? Timeout calls sigalrm_handler, so that would set some flag, and in finally we'd check the flag and try to kill the process.
(One alternative solution I can think of is: Ignore the first SIGUSR1 after trying to start X. But that introduces behavior that is very hard to guess when poking the program.)
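A rough sketch of the flag-and-kill idea, under several assumptions: a plain `subprocess.Popen` child stands in for the X server, `signal.setitimer` replaces `signal.alarm` so the timeout can be shorter than a second, and `start_child_or_kill` is a hypothetical name. It is not the code in pyanaconda/core/util.py.

```python
import signal
import subprocess
import sys
import time


def start_child_or_kill(argv, timeout):
    """Start argv, wait for SIGUSR1, and kill the child if the timeout fires.

    Returns (ready, childproc).
    """
    state = {"ready": False, "timed_out": False}

    def sigusr1_handler(signum, frame):
        state["ready"] = True

    def sigalrm_handler(signum, frame):
        # Only set a flag here; the cleanup happens in the finally block.
        state["timed_out"] = True

    old_usr1 = signal.signal(signal.SIGUSR1, sigusr1_handler)
    old_alrm = signal.signal(signal.SIGALRM, sigalrm_handler)
    childproc = None
    try:
        signal.setitimer(signal.ITIMER_REAL, timeout)
        childproc = subprocess.Popen(argv)
        # Poll instead of signal.pause() to avoid missing a signal that
        # arrives between the check and the pause.
        while not (state["ready"] or state["timed_out"]):
            time.sleep(0.01)
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)
        if state["timed_out"] and childproc is not None:
            childproc.terminate()  # SIGTERM first
            try:
                childproc.wait(timeout=5)
            except subprocess.TimeoutExpired:
                childproc.kill()  # SIGKILL as a fallback
                childproc.wait()
        signal.signal(signal.SIGUSR1, old_usr1)
        signal.signal(signal.SIGALRM, old_alrm)
    return state["ready"], childproc
```

Because the child is reaped before the old SIGUSR1 handler is restored, it can no longer send a late signal. The 5-second grace period before SIGKILL is an arbitrary choice for the sketch.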
Is this signal used anywhere in Anaconda other than for testing the crash handler?
In fact, I haven't found the code that sends SIGUSR1. Could I change the signal that Xorg sends from SIGUSR1 to a different one?
If I kill the Xorg process by calling terminate in the finally block, is it possible that Xorg is blocked in I/O and does not respond to the signal?
From the man page:
The X server attaches special meaning to the following signals:
SIGHUP This signal causes the server to close all existing connections, free all resources, and restore all defaults. It is sent by the display manager whenever the main user's main application (usually an xterm or window manager) exits to force the server to clean up and prepare for the next user.
SIGTERM This signal causes the server to exit cleanly.
SIGUSR1 This signal is used quite differently from either of the above. When the server starts, it checks to see if it has inherited SIGUSR1 as SIG_IGN instead of the usual SIG_DFL. In this case, the server sends a SIGUSR1 to its parent process after it has set up the various connection schemes. Xdm uses this feature to recognize when connecting to the server is possible.
It doesn't look like it is possible to change SIGUSR1. However, it might be interesting to try SIGTERM to kill the X server.
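The SIGUSR1 handshake from the man page can be imitated without Xorg. In the sketch below a short Python child plays the server: it checks whether it inherited SIGUSR1 as SIG_IGN and, if so, signals its parent. The hunk header above mentions `sigusr1_preexec`, but its body is not shown, so the version here is a guess; `FAKE_X_SERVER` and `child_reports_ready` are hypothetical names.

```python
import signal
import subprocess
import sys
import time

# Stand-in for the X server: signal the parent only if SIGUSR1 was
# inherited as SIG_IGN, as the man page describes.
FAKE_X_SERVER = """
import os, signal
if signal.getsignal(signal.SIGUSR1) == signal.SIG_IGN:
    os.kill(os.getppid(), signal.SIGUSR1)
"""


def sigusr1_preexec():
    # Runs in the child between fork and exec; ignored signals stay
    # ignored across exec, so the child sees SIG_IGN.
    signal.signal(signal.SIGUSR1, signal.SIG_IGN)


def child_reports_ready(use_preexec, timeout=5.0):
    """Return True if the child sent SIGUSR1 before the timeout."""
    received = []
    old_handler = signal.signal(
        signal.SIGUSR1, lambda signum, frame: received.append(signum)
    )
    try:
        proc = subprocess.Popen(
            [sys.executable, "-c", FAKE_X_SERVER],
            preexec_fn=sigusr1_preexec if use_preexec else None,
        )
        deadline = time.monotonic() + timeout
        while not received and time.monotonic() < deadline:
            time.sleep(0.01)
        proc.wait()
    finally:
        signal.signal(signal.SIGUSR1, old_handler)
    return bool(received)
```

Without the preexec step the child starts with the default disposition and stays silent, so the parent only ever gets the signal when it asked for it this way.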
I tried to modify the sigalrm_handler code, but it did not work well: if the X server was running too slowly and had not terminated by the end of the sleep, Anaconda would still crash.
Ouch. So, if I'm reading it right...
One way out would be to test exceptions with a different signal. Not sure which - SIGSTOP, SIGRTMIN+???... This requires more discussion within team, unfortunately. Either way, any changes to this need more discussion, sorry @bitcoffeeiux :( Still, thank you for bringing this to our attention. I will get back to this once we know what we want to do.
Developers send it to test exception handler. In anaconda.py around line 600:
OK, I will keep waiting for news. Will this stay open, or will it be closed?
I will keep this as open, so that everyone knows this should be revisited. ...or that's the plan, anyway!
So, after discussing this, @poncovka came up with the idea to keep the handler for X11 until it gets something. That interferes with exception testing a bit, but only in the case where X11 failed. I've merged all the changes into one commit - see here: VladimirSlavik@82ff9fb
@bitcoffeeiux, if you want to make this your change, feel free to do so. Or if you want, I can do a new PR with the aggregated changes.
@jstodola, your thoughts? This would change the behavior a bit, even if only in case of a failure. (See linked commit message for details.)
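For readers who want the idea in code: one way to "keep the handler until it gets something" is a one-shot handler that swallows the first SIGUSR1 and then puts the previous handler back. This is only a sketch of the idea as described in this thread, not a copy of the linked commit, and `swallow_first_sigusr1` is a hypothetical name.

```python
import signal


def swallow_first_sigusr1(old_handler):
    """After a timeout, ignore one SIGUSR1 and then restore old_handler."""

    def one_shot_handler(signum, frame):
        # This is presumably the late "ready" signal from X. Drop it and
        # hand SIGUSR1 back to whoever owned it before.
        signal.signal(signal.SIGUSR1, old_handler)

    signal.signal(signal.SIGUSR1, one_shot_handler)
```

The cost is the one already discussed: if X never sends its signal, the first SIGUSR1 sent for exception testing is swallowed, and it has to be sent a second time.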
@VladimirSlavik, the change you proposed looks reasonable to me. Really, it's a corner case, so sending the SIGUSR1 twice is fine if someone really needs to do it.
There is also a question about this modification: if Xorg fails to start, it never sends SIGUSR1, so when SIGUSR1 is later used for crash testing, Anaconda will not respond to the first signal. Is that too big a change for the crash test?
Yes, that is true. The consensus is that it is an even more rare condition, so it is a good tradeoff.
OK, I will revise it again and submit it as soon as possible.
It doesn't have to be part of this PR, but anaconda should print hints on how to work around the X server timeout if the timeout is reached. rd.live.ram (loads stage2 into memory)
When the Xorg service starts too slowly and exceeds the timeout, the program triggers the timeout and executes sigalrm_handler, which raises the exit error and causes the finally block to run.
The finally block stops handling the SIGUSR1 signal, so the SIGUSR1 that Xorg sends to Anaconda once it has started is not handled and the program dies.
Link: https://bugzilla.redhat.com/show_bug.cgi?id=1918702