Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Xorg timeout give up exception handler test #3304

Conversation

VladimirSlavik
Copy link
Contributor

@VladimirSlavik VladimirSlavik commented Apr 22, 2021

Another attempt at the X thing. This gives up the exception handler test temporarily, and solves almost everything.

The main problem of other solutions is that once X starts, it steals the screen by going to tty6. If the exception handler test-invoking handler is not returned back immediately, an "after-timeout" handler can be installed instead, which switches back to tty1.

With that in place, it's also safe to terminate Xorg once it's clear it's not coming in time. The termination will happen later, but that does not matter any more.

Finally, with the termination happening, it is also safe to return the crash report text handler.

Resolves: rhbz#1918702

Previous work: #3107, #3132, #3141, #3295. Thanks to @bitcoffeeiux and @poncovka.

The one avenue left unexplored is using the -displayfd option.

@pep8speaks
Copy link

pep8speaks commented Apr 22, 2021

Hello @VladimirSlavik! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-05-06 18:09:15 UTC

@VladimirSlavik
Copy link
Contributor Author

So, as it is right now, it's about 100% of my original vision.

But there's still more. I find it jarring that the options for dealing with timeout are printed with timestamps. That's because they don't come from print() but rather stdout_log.warning(). I can't tell (yet?) what's the difference, but with print I get the exceptions bubbling up (again!?!) and whatnot.

I am not sure what this means. The code contains sleep(2) which smells of some timing stuff. I traced that particular line to a commit from 2001 with no relevant information.

@VladimirSlavik
Copy link
Contributor Author

I was wrong, I can replace stdout_log with print. I just have to do that for all calls in the function.

From the history, it seems that stdout_log replaced print, and later got replaced by the logging we have now. It might be time to check if it's needed at all.

@jstodola
Copy link
Contributor

I gave it a try and the first attempt failed this way:
Screenshot_20210422_215854
Tested with just "inst.xtimeout=1" and updates.img in a VM.

@jstodola
Copy link
Contributor

And second attempt:
Screenshot_20210422_220542
This time with inst.xtimeout=5 and limited cdrom speed:
# virsh blkdeviotune kill_x sda --read-bytes-sec 500000

@VladimirSlavik
Copy link
Contributor Author

/kickstart-test --testtype smoke

@github-actions
Copy link

Requested workflow run has to be approved. You can do that here

@VladimirSlavik VladimirSlavik marked this pull request as ready for review April 23, 2021 16:11
@VladimirSlavik
Copy link
Contributor Author

I think I finally got why the exception weirdness happened. The one part is that it was tied to killing the process, which was asynchronous, so it popped up elsewhere. The other part is that Python's finally does not handle the exception, it keeps propagating. With these two bits, there are no mysteries left.

Copy link
Contributor Author

@VladimirSlavik VladimirSlavik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some self-conscious notes.

Apart from the below, commit "Reorganize control flow in startX" also needs amending the message to drop the sentence about exception mysteries.

pyanaconda/display.py Outdated Show resolved Hide resolved
pyanaconda/display.py Outdated Show resolved Hide resolved
pyanaconda/core/util.py Outdated Show resolved Hide resolved
@jstodola
Copy link
Contributor

The change seems to work nice, Xorg is killed and I can select text/vnc mode.

There is one issue and I wonder if you can also reproduce it - starting anaconda in text mode after the X timeout expired results in magnified TUI. When I choose vnc instead of text mode, the TUI to enter the vnc password looks fine. Reproduced 2 out of 2 attempts:
Screenshot_20210426_213646

Copy link
Contributor

@poncovka poncovka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. Thank you for solving this issue!

# Use a list so the value can be modified from the handler function
x11_started = [False]
# Use a class so the values can be modified from the handler functions
class X11Status:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feel free to define the class outside of the function.

@VladimirSlavik
Copy link
Contributor Author

The resolution change happens because after X failure, we reset locale which includes running setfont and that destroys the resolution. Frankly, I am not sure what should be done here. Under "normal" text mode, we do not change fonts, which preserves the screen size and breaks TUI for non-ansi characters - it's all rectangles.

Screenshot_vs_main_2021-05-04_18:09:38

The tradeoff actually seems fair, in a bizarre way: Choose between the comforts of Anaconda in ýóůř ĺáňĝúàğë or above-VGA resolution.

We could presumably restore resolution with fbset, but I can't even find it (quickly) in Fedora.


Here we try to set up Xorg, and reinitialize_locale() if that fails, which eventually calls setup_locale() and set_console_font() (here), and that runs setfont <something> (here).

Copy link
Contributor

@jstodola jstodola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your work on this.
I think the current solution is OK, it displays possible workarounds and allows to continue with the installation, despite the changed resolution. The resolution problem can be solved in a separate PR if necessary.

@VladimirSlavik
Copy link
Contributor Author

VladimirSlavik commented May 6, 2021

Thank you, let's have it then. That means an interactive rebase session to address the review comments above.

For the record, fbset would likely enable a pretty painless fix for this. However, it is orphaned in Fedora and apparently nobody missed it enough for the past 3 years: https://src.fedoraproject.org/rpms/fbset

- Raise a more semantically correct exception
- Clarify comments
- Replace list with a class instance

Related: rhbz#1918702
When Anaconda receives the SIGUSR1 signal, it executes its crash handler. When
Anaconda launches Xorg and it successfully starts, it sends SIGUSR1 to
Anaconda. Thus, we use a special handler for this signal while waiting for
Xorg to start, and a "normal" one otherwise.

Previously, if a Xorg start timed out, the temporary Xorg start handler was
always replaced with the handler that started exception handler test. However,
if Xorg kept starting in the background, it would eventually send that signal,
which would be interpreted as a test of the exception handler. Then, Anaconda
would stop with the error report message.

This changes the handling of SIGUSR1 after Xorg start times out, such that it
is no longer wrongly interpreted as a QE test, but correctly as a late Xorg
start. To prevent the newly started Xorg from stealing the screen, receiving
this signal switches back to tty1 with tmux and TUI.

Resolves: rhbz#1918702
Previously, control flow in case of timeout was very complicated:
- sigalrm_handler() in startX() raised an exception
- finally in startX() caught the exception and executed
- except in pyanaconda.display.do_startup_x11_actions() caught the exception
  and executed too
That spread the handling over two different places.

This changes the control flow to make it obvious the caller of startX will
receive an exception. It is also easier to reason about the flow.

Related: rhbz#1918702
After Xorg timeout, we can receive SIGUSR1 from two sources:
- Xorg if it managed to start
- User trying to initiate the crash report test

To make the first case predictable, terminate Xorg if it times out. It will
still send SIGUSR1, because SIGTERM is handled later. However, this
guarantees that there will be no more than one SIGUSR1 from Xorg.

With this fact ensured, reinstate the normal signal handler after first such
signal is received. Any further SIGUSR1 can be sent only by user trying to
test crash reporting.

An user sending this signal before a timed-out Xorg starts is not probable,
but still technically possible. That particular corner case remains buggy:
If that happens, the first signal (from user) will be "eaten" and appear to be
ignored, while the second one (from Xorg) will run a crash report test, but
without switching back to tty1, it might not be visible because Xorg switches
to tty6. This is mainly of interest to QE (RTT).

Related: rhbz#1918702
- Add separate timeout message with workarounds
- Print the messages instead of logging them
- Move tty switching to where messages are printed

Related: rhbz#1918702
@VladimirSlavik VladimirSlavik force-pushed the master-xorg-give-up-exception-handler-test branch from 195ac51 to 9b8d428 Compare May 6, 2021 17:38
@VladimirSlavik VladimirSlavik temporarily deployed to staging May 6, 2021 17:38 Inactive
@VladimirSlavik VladimirSlavik temporarily deployed to staging May 6, 2021 17:38 Inactive
The message is not a positional argument.
@VladimirSlavik VladimirSlavik temporarily deployed to staging May 6, 2021 18:09 Inactive
@VladimirSlavik VladimirSlavik temporarily deployed to staging May 6, 2021 18:09 Inactive
@VladimirSlavik
Copy link
Contributor Author

I addressed all of the feedback and squashed some of the commits. That's ready to merge, I think.

There was another bug left: ask_vnc_question did not ask, in fact. The choices appeared without the preceding message. I am almost certain it didn't work before. So I fixed that too...

@VladimirSlavik
Copy link
Contributor Author

/kickstart-test --testtype smoke

@github-actions
Copy link

github-actions bot commented May 6, 2021

Requested workflow run has to be approved. You can do that here

@VladimirSlavik VladimirSlavik merged commit 650608e into rhinstaller:master May 7, 2021
@VladimirSlavik VladimirSlavik deleted the master-xorg-give-up-exception-handler-test branch May 7, 2021 13:05
@VladimirSlavik VladimirSlavik removed the port to RHEL9 The pull request needs to be ported to RHEL 9. label May 7, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
master Please, use the `f39` label instead.
4 participants