There's a logic bug here when the Xorg server slow start #3107

bitcoffeeiux · 2021-01-21T09:27:13Z

the program triggers a timeout and executes sigalrm_handler When the Xorg service starts too slowly and exceeds the timeout time, which executes the raise exiterror causing the finally block to be executed.
The finally block will cancel processing of the SIGUSER1 signal, which will cause the signuser1 sent to Anaconda after the Xorg service is started to fail to process and the program will die

link:https://bugzilla.redhat.com/show_bug.cgi?id=1918702

VladimirSlavik · 2021-01-21T13:51:12Z

/kickstart-test --testtype smoke

VladimirSlavik · 2021-01-21T13:52:04Z

Thanks for the fix!

Do I understand this correctly? If a timeout happens, we return everything to as it was before. But the X11 process is still starting. Once it manages to start, we receive the signal, which we no longer handle. This non-handling crashes the program.

bitcoffeeiux · 2021-01-21T14:21:57Z

Thanks for the fix!

Do I understand this correctly? If a timeout happens, we return everything to as it was before. But the X11 process is still starting. Once it manages to start, we receive the signal, which we no longer handle. This non-handling crashes the program.

yes,it is

VladimirSlavik · 2021-01-21T15:13:18Z

I verified that this works exactly as it should.

Once X does start, I get a black screen and must manually switch to tty1 to see the messages. I don't know enough about this to say if that's wrong or not. However, based on the need for your changes, I think this might be a code path that is not used too much and might be somewhat out of date. Then this would need also switching to console in the places that handle exceptions from the startX function.

if conf.system.can_switch_tty:
    util.vtActivate(1)

I will ask.

VladimirSlavik · 2021-01-21T15:18:17Z

/packit build

bitcoffeeiux · 2021-01-21T15:25:56Z

I verified that this works exactly as it should.

Once X does start, I get a black screen and must manually switch to tty1 to see the messages. I don't know enough about this to say if that's wrong or not. However, based on the need for your changes, I think this might be a code path that is not used too much and might be somewhat out of date. Then this would need also switching to console in the places that handle exceptions from the startX function.
if conf.system.can_switch_tty:
    util.vtActivate(1)
I will ask.
can you see a line at the bottom of tty1?like this 'panel is dead...'

VladimirSlavik · 2021-01-21T15:34:54Z

Sorry, that was unclear. What happens in tty1 is correct, it's alive and asks for things. Your fix works. However I don't see that, I must manually switch to that (ctrl+alt+f1).

bitcoffeeiux · 2021-01-21T15:43:45Z

Sorry, that was unclear. What happens in tty1 is correct, it's alive and asks for things. Your fix works. However I don't see that, I must manually switch to that (ctrl+alt+f1).

maybe you can follow the step in this:https://bugzilla.redhat.com/show_bug.cgi?id=1918702

I set the timeout to 1 second and the process starts X11 after 0.5 seconds of sleep to make sure that X11 doesn't start up before the timeout

bitcoffeeiux · 2021-01-21T15:45:04Z

I ran into this problem in production today because I was using a network drive to install the system

VladimirSlavik · 2021-01-21T16:10:21Z

@poncovka says that the timeout can be changed to help when Xorg starts very slowly, to allow normal use. The situation where it actually times out was not considered. Therefore, the missing switch to tty1 is simply another part to fix.

VladimirSlavik · 2021-01-21T16:43:46Z

I think I figured out the tty part, the two lines from earlier go to pyanaconda.display.setup_display() like this:

        try:
            start_x11(xtimeout)
            do_startup_x11_actions()
        except (OSError, RuntimeError) as e:
            log.warning("X startup failed: %s", e)
            stdout_log.warning("X startup failed, falling back to text mode")
            anaconda.display_mode = constants.DisplayModes.TUI
            anaconda.gui_startup_failed = True
+           if conf.system.can_switch_tty:
+               util.vtActivate(1)
            time.sleep(2)

VladimirSlavik · 2021-01-21T17:06:07Z

@bitcoffeeiux , I see two things to change:

Please fix the commit message to fit our conventions: First line as description, then details max. 79 chars per line, link to the bug like Resolves: rhbz#XXX. See example below.
Add another commit with the tty switching so that people can actually see that Anaconda can continue.

If you could do that, I will be happy to approve this pull request.

Example commit message:

Handle correctly signal from delayed Xorg start

Handling of the SIGUSR1 signal after Xorg start timed out was wrong. If Xorg
eventually started and sent SIGUSR1 to Anaconda, the signal was not handled
and Anaconda crashed.

Resolves: rhbz#1918702

bitcoffeeiux · 2021-01-22T01:25:56Z

great，i will

bitcoffeeiux · 2021-01-22T02:12:35Z

Please review， thanks

poncovka · 2021-01-22T09:43:45Z

pyanaconda/core/util.py

@@ -254,7 +254,7 @@ def sigusr1_preexec():
    finally:
        # Put everything back where it was
        signal.alarm(0)
-        signal.signal(signal.SIGUSR1, old_sigusr1_handler)
+        signal.signal(signal.SIGUSR1, signal.SIG_IGN)


I am afraid that it is necessary to restore the previous signal handler.

Oh. I forgot about that part. We use SIGUSR1 to test the crash handler. But if Xorg here keeps starting and eventually starts and sends it... goodbye Anaconda. So that's actually the original bug: Xorg sends the same signal that we use for debugging, and we take that to mean "let's crash".

I guess we could try to kill childproc if we don't get X11 started in time? Timeout calls sigalrm_handler, so that would set some flag, and in finally we'd check the flag and try to kill the process.

(One alternative solution I can think of is: Ignore the first SIGUSR1 after trying to start X. But that introduces behavior that is very hard to guess when poking the program.)

Is there any other place to use this signal than a test crash accident in anaconda?

In fact, I haven't found the code that sends SIGUER1. Could I please change the SIGUER1 sent by XORG to other signals?

If I kill the Xorg process by calling terminate in finally block, is it possible that Xorg is calling IO and not responding to the signal

From the man page:

The X server attaches special meaning to the following signals: SIGHUP This signal causes the server to close all existing connections, free all re‐ sources, and restore all defaults. It is sent by the display manager whenever the main user's main application (usually an xterm or window manager) exits to force the server to clean up and prepare for the next user. SIGTERM This signal causes the server to exit cleanly. SIGUSR1 This signal is used quite differently from either of the above. When the server starts, it checks to see if it has inherited SIGUSR1 as SIG_IGN instead of the usual SIG_DFL. In this case, the server sends a SIGUSR1 to its parent process af‐ ter it has set up the various connection schemes. Xdm uses this feature to recog‐ nize when connecting to the server is possible.

It doesn't look like it is possible to change SIGUSR1. However, it might be interesting to try SIGTERM to kill the X server.

bitcoffeeiux · 2021-01-25T06:14:04Z

I tried to modify the SIGALR_Handler code, but it was not very good, because if the X server was running too slowly and did not terminate after the end of the sleep, the anaconda would still crash.

    def sigalrm_handler(num, frame):
    # Check that it didn't make it under the wire
    if x11_started[0]:
        return
    log.error("Timeout trying to start %s", argv[0])
    if x11_started[0] is False and childproc is not None:
        WatchProcesses.unwatch_process(childproc)
        childproc.terminate()
        time.sleep(3)
    raise ExitError("Timeout trying to start %s" % argv[0])

VladimirSlavik · 2021-01-25T14:01:02Z

Ouch. So, if I'm reading it right...

X will send SIGUSR1. I did not find anything to configure it differently. (manpages: xserver, xorg, xorg.conf)
Once we attempt to start X, we can't reliably prevent receiving SIGUSR1 from it. ("If I kill the Xorg process by calling terminate in finally block, is it possible that Xorg is calling IO and not responding to the signal")
We use SIGUSR1 to test our exception handler. (Fake crash.)
We do two different things when we get SIGUSR1: Normally it tests the exception handler, but when starting X, it means X started successfully.
If Xorg is too slow to start, we switch the meaning of SIGUSR1 back to exception testing, and if X is not stuck, then it kills us with is "I'm ok" report.
This pull request exists because the "killed by late Xorg start" actually did happen to an user.
We can't give up the exception handler testing with a signal - both Fedora and RHEL QE use it a lot.
SIGUSR2 is already taken by a similar debugging functionality which does not terminate Anaconda, only saves a traceback.

One way out would be to test exceptions with a different signal. Not sure which - SIGSTOP, SIGRTMIN+???... This requires more discussion within team, unfortunately.

Either way, any changes to this need more discussion, sorry @bitcoffeeiux :( Still, thank you for bringing this to our attention. I will get back to this once we know what we want to do.

In fact, I haven't found the code that sends SIGUER1.

Developers send it to test exception handler. In anaconda.py around line 600:

    # add our own additional signal handlers
    signal.signal(signal.SIGUSR1, lambda signum, frame:
                  exception.test_exception_handling())
    signal.signal(signal.SIGUSR2, lambda signum, frame: anaconda.dumpState())

bitcoffeeiux · 2021-01-25T14:05:42Z

OK, I will continue to wait for the news. Is this issue only open? Or close?

VladimirSlavik · 2021-01-25T14:46:07Z

I will keep this as open, so that everyone knows this should be revisited. ...or that's the plan, anyway!

VladimirSlavik · 2021-01-25T15:59:44Z

So, after discussing this, @poncovka came up with the idea to keep the handler for X11 until it gets something. That interferes with exception testing a bit, but only in the case where X11 failed. I've merged all the changes into one commit - see here: VladimirSlavik@82ff9fb

@bitcoffeeiux, if you want to make this your change, feel free to do so. Or if you want, I can do a new PR with the aggregated changes.

@jstodola, your thoughts? This would change the behavior a bit, even if only in case of a failure. (See linked commit message for details.)

jstodola · 2021-01-25T19:26:17Z

@VladimirSlavik , the change you proposed looks reasonable to me. Really, it's a corner case, so sending the SIGUSR1 twice is fine if someone really needs to do it.

bitcoffeeiux · 2021-01-26T03:39:52Z

There is also the question, if this modification, Xorg does not send a SIGUSER1 signal if it fails to start, then the first signal anaconda does not respond when using the SIGUSR1 signal for crash testing? Is this modified too much for the crash test?

VladimirSlavik · 2021-01-26T10:34:42Z

Yes, that is true. The consensus is that it is an even more rare condition, so it is a good tradeoff.

bitcoffeeiux · 2021-01-26T11:46:44Z

OK, I will revise it again and submit it as soon as possible

jstodola · 2021-01-26T15:44:49Z

It doesn't have to be part of this PR, but anaconda should print hints how to workaround the X server timeout if the timeout is reached.
For example the following boot arguments can help for installation from slow media:

rd.live.ram (loads stage2 into memory)
inst.xtimeout=
inst.text (to avoid even trying to start GUI)

bitcoffeeiux changed the title ~~There's a logic bug here.~~ There's a logic bug here when the Xorg server slow start Jan 21, 2021

VladimirSlavik added the master Please, use the `f39` label instead. label Jan 21, 2021

VladimirSlavik self-assigned this Jan 21, 2021

bitcoffeeiux closed this Jan 22, 2021

bitcoffeeiux reopened this Jan 22, 2021

poncovka reviewed Jan 22, 2021

View reviewed changes

VladimirSlavik added the blocked Don't merge this pull request! label Jan 25, 2021

VladimirSlavik removed the blocked Don't merge this pull request! label Jan 25, 2021

bitcoffeeiux closed this Jan 30, 2021

bitcoffeeiux reopened this Jan 30, 2021

bitcoffeeiux closed this Jan 30, 2021

bitcoffeeiux mentioned this pull request Jan 30, 2021

Handle correctly signal from delayed Xorg start(#1918702) #3132

Closed

VladimirSlavik mentioned this pull request Feb 3, 2021

Handle correctly too slow X11 start #3141

Closed

This was referenced Apr 16, 2021

Add a shim script to translate signals from Xorg #3295

Closed

Xorg timeout give up exception handler test #3304

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

There's a logic bug here when the Xorg server slow start #3107

There's a logic bug here when the Xorg server slow start #3107

bitcoffeeiux commented Jan 21, 2021 •

edited

VladimirSlavik commented Jan 21, 2021

VladimirSlavik commented Jan 21, 2021 •

edited

bitcoffeeiux commented Jan 21, 2021

VladimirSlavik commented Jan 21, 2021

VladimirSlavik commented Jan 21, 2021

bitcoffeeiux commented Jan 21, 2021

VladimirSlavik commented Jan 21, 2021

bitcoffeeiux commented Jan 21, 2021 •

edited

bitcoffeeiux commented Jan 21, 2021

VladimirSlavik commented Jan 21, 2021

VladimirSlavik commented Jan 21, 2021

VladimirSlavik commented Jan 21, 2021

bitcoffeeiux commented Jan 22, 2021 •

edited

bitcoffeeiux commented Jan 22, 2021

poncovka Jan 22, 2021

VladimirSlavik Jan 22, 2021

VladimirSlavik Jan 22, 2021

bitcoffeeiux Jan 22, 2021 •

edited

bitcoffeeiux Jan 25, 2021

bitcoffeeiux Jan 25, 2021

poncovka Jan 26, 2021

bitcoffeeiux commented Jan 25, 2021 •

edited

VladimirSlavik commented Jan 25, 2021

bitcoffeeiux commented Jan 25, 2021

VladimirSlavik commented Jan 25, 2021 •

edited

VladimirSlavik commented Jan 25, 2021

jstodola commented Jan 25, 2021

bitcoffeeiux commented Jan 26, 2021

VladimirSlavik commented Jan 26, 2021

bitcoffeeiux commented Jan 26, 2021

jstodola commented Jan 26, 2021

There's a logic bug here when the Xorg server slow start #3107

There's a logic bug here when the Xorg server slow start #3107

Conversation

bitcoffeeiux commented Jan 21, 2021 • edited

VladimirSlavik commented Jan 21, 2021

VladimirSlavik commented Jan 21, 2021 • edited

bitcoffeeiux commented Jan 21, 2021

VladimirSlavik commented Jan 21, 2021

VladimirSlavik commented Jan 21, 2021

bitcoffeeiux commented Jan 21, 2021

VladimirSlavik commented Jan 21, 2021

bitcoffeeiux commented Jan 21, 2021 • edited

bitcoffeeiux commented Jan 21, 2021

VladimirSlavik commented Jan 21, 2021

VladimirSlavik commented Jan 21, 2021

VladimirSlavik commented Jan 21, 2021

bitcoffeeiux commented Jan 22, 2021 • edited

bitcoffeeiux commented Jan 22, 2021

poncovka Jan 22, 2021

Choose a reason for hiding this comment

VladimirSlavik Jan 22, 2021

Choose a reason for hiding this comment

VladimirSlavik Jan 22, 2021

Choose a reason for hiding this comment

bitcoffeeiux Jan 22, 2021 • edited

Choose a reason for hiding this comment

bitcoffeeiux Jan 25, 2021

Choose a reason for hiding this comment

bitcoffeeiux Jan 25, 2021

Choose a reason for hiding this comment

poncovka Jan 26, 2021

Choose a reason for hiding this comment

bitcoffeeiux commented Jan 25, 2021 • edited

VladimirSlavik commented Jan 25, 2021

bitcoffeeiux commented Jan 25, 2021

VladimirSlavik commented Jan 25, 2021 • edited

VladimirSlavik commented Jan 25, 2021

jstodola commented Jan 25, 2021

bitcoffeeiux commented Jan 26, 2021

VladimirSlavik commented Jan 26, 2021

bitcoffeeiux commented Jan 26, 2021

jstodola commented Jan 26, 2021

bitcoffeeiux commented Jan 21, 2021 •

edited

VladimirSlavik commented Jan 21, 2021 •

edited

bitcoffeeiux commented Jan 21, 2021 •

edited

bitcoffeeiux commented Jan 22, 2021 •

edited

bitcoffeeiux Jan 22, 2021 •

edited

bitcoffeeiux commented Jan 25, 2021 •

edited

VladimirSlavik commented Jan 25, 2021 •

edited