Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

There's a logic bug here when the Xorg server slow start #3107

Closed
wants to merge 0 commits into from
Closed

There's a logic bug here when the Xorg server slow start #3107

wants to merge 0 commits into from

Conversation

bitcoffeeiux
Copy link
Contributor

@bitcoffeeiux bitcoffeeiux commented Jan 21, 2021

the program triggers a timeout and executes sigalrm_handler When the Xorg service starts too slowly and exceeds the timeout time, which executes the raise exiterror causing the finally block to be executed.
The finally block will cancel processing of the SIGUSER1 signal, which will cause the signuser1 sent to Anaconda after the Xorg service is started to fail to process and the program will die

link:https://bugzilla.redhat.com/show_bug.cgi?id=1918702

@bitcoffeeiux bitcoffeeiux changed the title There's a logic bug here. There's a logic bug here when the Xorg server slow start Jan 21, 2021
@VladimirSlavik VladimirSlavik added the master Please, use the `f39` label instead. label Jan 21, 2021
@VladimirSlavik
Copy link
Contributor

/kickstart-test --testtype smoke

@VladimirSlavik
Copy link
Contributor

VladimirSlavik commented Jan 21, 2021

Thanks for the fix!

Do I understand this correctly? If a timeout happens, we return everything to as it was before. But the X11 process is still starting. Once it manages to start, we receive the signal, which we no longer handle. This non-handling crashes the program.

@bitcoffeeiux
Copy link
Contributor Author

Thanks for the fix!

Do I understand this correctly? If a timeout happens, we return everything to as it was before. But the X11 process is still starting. Once it manages to start, we receive the signal, which we no longer handle. This non-handling crashes the program.

yes,it is

@VladimirSlavik
Copy link
Contributor

I verified that this works exactly as it should.

Once X does start, I get a black screen and must manually switch to tty1 to see the messages. I don't know enough about this to say if that's wrong or not. However, based on the need for your changes, I think this might be a code path that is not used too much and might be somewhat out of date. Then this would need also switching to console in the places that handle exceptions from the startX function.

if conf.system.can_switch_tty:
    util.vtActivate(1)

I will ask.

@VladimirSlavik
Copy link
Contributor

/packit build

@bitcoffeeiux
Copy link
Contributor Author

I verified that this works exactly as it should.

Once X does start, I get a black screen and must manually switch to tty1 to see the messages. I don't know enough about this to say if that's wrong or not. However, based on the need for your changes, I think this might be a code path that is not used too much and might be somewhat out of date. Then this would need also switching to console in the places that handle exceptions from the startX function.

if conf.system.can_switch_tty:
    util.vtActivate(1)

I will ask.
can you see a line at the bottom of tty1?like this 'panel is dead...'

@VladimirSlavik
Copy link
Contributor

Sorry, that was unclear. What happens in tty1 is correct, it's alive and asks for things. Your fix works. However I don't see that, I must manually switch to that (ctrl+alt+f1).

@bitcoffeeiux
Copy link
Contributor Author

bitcoffeeiux commented Jan 21, 2021

Sorry, that was unclear. What happens in tty1 is correct, it's alive and asks for things. Your fix works. However I don't see that, I must manually switch to that (ctrl+alt+f1).

maybe you can follow the step in this:https://bugzilla.redhat.com/show_bug.cgi?id=1918702

I set the timeout to 1 second and the process starts X11 after 0.5 seconds of sleep to make sure that X11 doesn't start up before the timeout

@bitcoffeeiux
Copy link
Contributor Author

I ran into this problem in production today because I was using a network drive to install the system

@VladimirSlavik
Copy link
Contributor

@poncovka says that the timeout can be changed to help when Xorg starts very slowly, to allow normal use. The situation where it actually times out was not considered. Therefore, the missing switch to tty1 is simply another part to fix.

@VladimirSlavik
Copy link
Contributor

I think I figured out the tty part, the two lines from earlier go to pyanaconda.display.setup_display() like this:

        try:
            start_x11(xtimeout)
            do_startup_x11_actions()
        except (OSError, RuntimeError) as e:
            log.warning("X startup failed: %s", e)
            stdout_log.warning("X startup failed, falling back to text mode")
            anaconda.display_mode = constants.DisplayModes.TUI
            anaconda.gui_startup_failed = True
+           if conf.system.can_switch_tty:
+               util.vtActivate(1)
            time.sleep(2)

@VladimirSlavik
Copy link
Contributor

@bitcoffeeiux , I see two things to change:

  1. Please fix the commit message to fit our conventions: First line as description, then details max. 79 chars per line, link to the bug like Resolves: rhbz#XXX. See example below.

  2. Add another commit with the tty switching so that people can actually see that Anaconda can continue.

If you could do that, I will be happy to approve this pull request.


Example commit message:

Handle correctly signal from delayed Xorg start

Handling of the SIGUSR1 signal after Xorg start timed out was wrong. If Xorg
eventually started and sent SIGUSR1 to Anaconda, the signal was not handled
and Anaconda crashed.

Resolves: rhbz#1918702

@VladimirSlavik VladimirSlavik self-assigned this Jan 21, 2021
@bitcoffeeiux
Copy link
Contributor Author

bitcoffeeiux commented Jan 22, 2021

great,i will

@bitcoffeeiux
Copy link
Contributor Author

Please review, thanks

@@ -254,7 +254,7 @@ def sigusr1_preexec():
finally:
# Put everything back where it was
signal.alarm(0)
signal.signal(signal.SIGUSR1, old_sigusr1_handler)
signal.signal(signal.SIGUSR1, signal.SIG_IGN)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am afraid that it is necessary to restore the previous signal handler.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh. I forgot about that part. We use SIGUSR1 to test the crash handler. But if Xorg here keeps starting and eventually starts and sends it... goodbye Anaconda. So that's actually the original bug: Xorg sends the same signal that we use for debugging, and we take that to mean "let's crash".

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we could try to kill childproc if we don't get X11 started in time? Timeout calls sigalrm_handler, so that would set some flag, and in finally we'd check the flag and try to kill the process.

(One alternative solution I can think of is: Ignore the first SIGUSR1 after trying to start X. But that introduces behavior that is very hard to guess when poking the program.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any other place to use this signal than a test crash accident in anaconda?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact, I haven't found the code that sends SIGUER1. Could I please change the SIGUER1 sent by XORG to other signals?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I kill the Xorg process by calling terminate in finally block, is it possible that Xorg is calling IO and not responding to the signal

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the man page:

       The X server attaches special meaning to the following signals:

       SIGHUP  This signal causes the server to close all  existing  connections,  free  all  re‐
               sources, and restore all defaults.  It is sent by the display manager whenever the
               main user's main application (usually an xterm or window manager) exits  to  force
               the server to clean up and prepare for the next user.

       SIGTERM This signal causes the server to exit cleanly.

       SIGUSR1 This  signal  is used quite differently from either of the above.  When the server
               starts, it checks to see if it has inherited SIGUSR1 as  SIG_IGN  instead  of  the
               usual SIG_DFL.  In this case, the server sends a SIGUSR1 to its parent process af‐
               ter it has set up the various connection schemes.  Xdm uses this feature to recog‐
               nize when connecting to the server is possible.

It doesn't look like it is possible to change SIGUSR1. However, it might be interesting to try SIGTERM to kill the X server.

@bitcoffeeiux
Copy link
Contributor Author

bitcoffeeiux commented Jan 25, 2021

I tried to modify the SIGALR_Handler code, but it was not very good, because if the X server was running too slowly and did not terminate after the end of the sleep, the anaconda would still crash.

    def sigalrm_handler(num, frame):
    # Check that it didn't make it under the wire
    if x11_started[0]:
        return
    log.error("Timeout trying to start %s", argv[0])
    if x11_started[0] is False and childproc is not None:
        WatchProcesses.unwatch_process(childproc)
        childproc.terminate()
        time.sleep(3)
    raise ExitError("Timeout trying to start %s" % argv[0])

@VladimirSlavik
Copy link
Contributor

Ouch. So, if I'm reading it right...

  • X will send SIGUSR1. I did not find anything to configure it differently. (manpages: xserver, xorg, xorg.conf)
  • Once we attempt to start X, we can't reliably prevent receiving SIGUSR1 from it. ("If I kill the Xorg process by calling terminate in finally block, is it possible that Xorg is calling IO and not responding to the signal")
  • We use SIGUSR1 to test our exception handler. (Fake crash.)
  • We do two different things when we get SIGUSR1: Normally it tests the exception handler, but when starting X, it means X started successfully.
  • If Xorg is too slow to start, we switch the meaning of SIGUSR1 back to exception testing, and if X is not stuck, then it kills us with is "I'm ok" report.
  • This pull request exists because the "killed by late Xorg start" actually did happen to an user.
  • We can't give up the exception handler testing with a signal - both Fedora and RHEL QE use it a lot.
  • SIGUSR2 is already taken by a similar debugging functionality which does not terminate Anaconda, only saves a traceback.

One way out would be to test exceptions with a different signal. Not sure which - SIGSTOP, SIGRTMIN+???... This requires more discussion within team, unfortunately.

Either way, any changes to this need more discussion, sorry @bitcoffeeiux :( Still, thank you for bringing this to our attention. I will get back to this once we know what we want to do.


In fact, I haven't found the code that sends SIGUER1.

Developers send it to test exception handler. In anaconda.py around line 600:

    # add our own additional signal handlers
    signal.signal(signal.SIGUSR1, lambda signum, frame:
                  exception.test_exception_handling())
    signal.signal(signal.SIGUSR2, lambda signum, frame: anaconda.dumpState())

@VladimirSlavik VladimirSlavik added the blocked Don't merge this pull request! label Jan 25, 2021
@bitcoffeeiux
Copy link
Contributor Author

OK, I will continue to wait for the news. Is this issue only open? Or close?

@VladimirSlavik
Copy link
Contributor

VladimirSlavik commented Jan 25, 2021

I will keep this as open, so that everyone knows this should be revisited. ...or that's the plan, anyway!

@VladimirSlavik
Copy link
Contributor

So, after discussing this, @poncovka came up with the idea to keep the handler for X11 until it gets something. That interferes with exception testing a bit, but only in the case where X11 failed. I've merged all the changes into one commit - see here: VladimirSlavik@82ff9fb

@bitcoffeeiux, if you want to make this your change, feel free to do so. Or if you want, I can do a new PR with the aggregated changes.

@jstodola, your thoughts? This would change the behavior a bit, even if only in case of a failure. (See linked commit message for details.)

@VladimirSlavik VladimirSlavik removed the blocked Don't merge this pull request! label Jan 25, 2021
@jstodola
Copy link
Contributor

@VladimirSlavik , the change you proposed looks reasonable to me. Really, it's a corner case, so sending the SIGUSR1 twice is fine if someone really needs to do it.

@bitcoffeeiux
Copy link
Contributor Author

There is also the question, if this modification, Xorg does not send a SIGUSER1 signal if it fails to start, then the first signal anaconda does not respond when using the SIGUSR1 signal for crash testing? Is this modified too much for the crash test?

@VladimirSlavik
Copy link
Contributor

Yes, that is true. The consensus is that it is an even more rare condition, so it is a good tradeoff.

@bitcoffeeiux
Copy link
Contributor Author

OK, I will revise it again and submit it as soon as possible

@jstodola
Copy link
Contributor

It doesn't have to be part of this PR, but anaconda should print hints how to workaround the X server timeout if the timeout is reached.
For example the following boot arguments can help for installation from slow media:

rd.live.ram (loads stage2 into memory)
inst.xtimeout=
inst.text (to avoid even trying to start GUI)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
master Please, use the `f39` label instead.
4 participants