Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Metacity on fail reaped after timeout, fixes #4958 #682

Closed
wants to merge 5 commits into from
Closed

Metacity on fail reaped after timeout, fixes #4958 #682

wants to merge 5 commits into from

Conversation

mpdmanash
Copy link
Contributor

Bug: https://bugs.sugarlabs.org/ticket/4958
After calling metacity, the existence of the process is
checked after a timeout. If the process has exited, then
the defunct process is cleared using Popen.communicate()
and metacity is called again.

Bug: https://bugs.sugarlabs.org/ticket/4958
After calling metacity, the existence of the process is
checked after a timeout. If the process has exited, then
the defunct process is cleared using Popen.communicate()
and metacity is called again.
@quozl
Copy link
Contributor

quozl commented Apr 20, 2016

Thanks. Tested.

In shell.log is

Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/jarabe/main.py", \
  line 185, in _metacity_timeout_cb
    except TimeoutExpired:
NameError: global name 'TimeoutExpired' is not defined

But not worth fixing, because you don't need to call subprocess.communicate, as .poll gives you what you need.

Process on system is now embedded in a shell parent; because of your shell=True; please don't do that, it wastes a process slot;

# ps axfww
...
 1444 ?        Ssl    0:00 /usr/sbin/lightdm
 1451 tty7     Ssl+   0:00  \_ /usr/lib/xorg/Xorg -core :0 ...
 1459 ?        Sl     0:00  \_ lightdm --session-child 12 15
 1464 ?        Ssl    0:04      \_ python2 -m jarabe.main
 1522 ?        Ss     0:00          \_ /usr/bin/ssh-agent /usr/bin/dbus-launch ...
 1555 ?        S      0:00          \_ /bin/sh -c metacity --no-force-fullscreen
 1558 ?        Sl     0:00          |   \_ metacity
 1652 ?        Sl     0:01          \_ /usr/bin/python /usr/bin/sugar-activity terminal ...
 1656 pts/1    Ss+    0:00              \_ /bin/bash

Played around with it a bit, and here's what I came up with for comparison;

_METACITY_TIMEOUT = 1000

...

def _metacity_timeout_cb():
    rc = _metacity_process.poll()
    if rc is None:
        return True

    logging.warning('metacity returncode %r' % rc)
    GObject.timeout_add(_METACITY_TIMEOUT, _metacity_restart_timeout_cb)
    return False


def _metacity_restart_timeout_cb():
    _start_metacity()
    return False


def _start_metacity():
    global _metacity_process

    _metacity_process = subprocess.Popen(['metacity', '--no-force-fullscreen'])
    GObject.timeout_add(_METACITY_TIMEOUT, _metacity_timeout_cb)


def _start_window_manager():
    logging.warning('_start_window_manager')

    settings = Gio.Settings.new('org.gnome.desktop.interface')
    settings.set_string('cursor-theme', 'sugar')

    _start_metacity()

    screen = Wnck.Screen.get_default()
    screen.connect('window-manager-changed', __window_manager_changed_cb)

    _check_for_window_manager(screen)


def _stop_window_manager():
    global _metacity_process
    _metacity_process.terminate()

Not finished. There's a risk of metacity being restarted during shutdown; the timeout is not cancelled in _stop_window_manager. Hope that helps!

@mpdmanash
Copy link
Contributor Author

@quozl Thankyou.
Why do we need to keep the timeout alive with return True? Once metacity is running successfully, we can forget it right? I guess calling _metacity_timeout_cb once for every failed attempt of Popen metacity will be sufficient. As the timeout is not getting repeated, metacity will not get restarted.

I have modified my previous code by adding logging and removing communicate and shell=True.
Please have a look.

The log went:

_start_window_manager
metacity returncode None
metacity returncode 1

@quozl
Copy link
Contributor

quozl commented Apr 21, 2016

No, you cannot forget it just because it began running. The event which triggered the bug was a race condition on Ubuntu 16.04 on particularly fast hardware, which led to the new version of metacity terminating. This happened several minutes after Sugar startup.

It isn't necessary to use timeouts, you might instead add the subprocess as a source of events for the GObject main loop with io_add_watch. But if you do that, make sure the restart is delayed, otherwise a metacity bug which fails early would keep happening over and over.

@mpdmanash
Copy link
Contributor Author

@quozl

make sure the restart is delayed

Can we not simply remove the source for io_add_watch using source_remove in the method _stop_window_manager()?

@quozl
Copy link
Contributor

quozl commented Apr 22, 2016

Can we not simply remove the source

Yes, you can remove the source, but that's unrelated to delaying the restart.

Not delaying the restart can cause a cyclic behaviour, a loop, if the root cause of metacity failure persists. The delay is so that the user can remain in control of the system. For example, in the past we have observed;

  • a metacity failure as a result of window properties set by an activity, and;
  • running out of graphics memory; metacity cannot create resources on the X server,

In both scenarios, not adding a delay would make the system uncontrollable while Sugar instantly reacts to a metacity exit by starting a new metacity instance.

@mpdmanash
Copy link
Contributor Author

@quozl , Tried implementing with threading. Please review. It will now restart metacity after a 1sec delay every time it fails.

@quozl
Copy link
Contributor

quozl commented Apr 26, 2016

I won't review a threading implementation, because I won't use threading myself, and am not practiced enough to make a review. We use the GObject event loop model in Sugar. I'm sure you can achieve the same result without using threading.

@mpdmanash
Copy link
Contributor Author

I was skeptical about it and wanted to know if it is allowed. Will change it. 👍

@mpdmanash
Copy link
Contributor Author

I have removed threading and used io_add_watch instead to reap metacity.
Please review.

@samdroid-apps
Copy link
Contributor

Looks clean, but I'd love to get @quozl back to look at this. He seems to know a lot about this area.

@quozl
Copy link
Contributor

quozl commented May 23, 2016

Thanks, but no plans to do full review. Some questions;

  • do you need to remove the watch, or will it remove itself?
  • do you need to call subprocess.poll or .wait so that the process does not Zombie?
  • is there really any value in making the timeout a named constant at the start of the file instead of a literal constant of 1000? (As it stands, you have to move up 90 lines in the file to find the timeout.)

Perhaps you might collapse the commits and force push.

@mpdmanash
Copy link
Contributor Author

Hi @quozl and @samdroid-apps ,
Sorry for the long delay, I got busy with college and internship stuff.
I finally got time to look into this and I really want to resolve this bug.

do you need to remove the watch, or will it remove itself?

I am returning False on the timeout callbacks and hence they would be removed and not be repeated.

do you need to call subprocess.poll or .wait so that the process does not Zombie?

No, I found that the process when stalls does not zombie on using GObject.io_add_watch with GLib.IO_HUP.

is there really any value in making the timeout a named constant at the start of the file instead of a literal constant of 1000? (As it stands, you have to move up 90 lines in the file to find the timeout.)

I just added _METACITY_TIMEOUT = 1000 in the beginning in case it has to be tweaked.

Once its reviewed, I would squash the commits, resolve the conflict and push.

Thanks

@quozl
Copy link
Contributor

quozl commented Oct 24, 2016

Welcome back.

do you need to remove the watch, or will it remove itself?

I am returning False on the timeout callbacks and hence they would be removed and not be repeated.

No, I'm not asking about the timeout_add, but instead about the io_add_watch. At the moment, the watch is not removed, and so a reference would persist to the subprocess object, even though the global _metacity_process will be assigned a new subprocess object during recovery.

do you need to call subprocess.poll or .wait so that the process does not Zombie?

No, I found that the process when stalls does not zombie on using GObject.io_add_watch with GLib.IO_HUP.

Good, thanks.

... you have to move up 90 lines in the file to find the timeout

I just added _METACITY_TIMEOUT = 1000 in the beginning in case it has to be tweaked.

Yes, I see what you did, but having it 90 lines away from where it is used seems very wasteful of comprehension time for the next person to look at the source. It also does not seem likely to need to be tweaked. My preference is to use a literal or a variable which is immediately above where it is used.

@quozl
Copy link
Contributor

quozl commented Nov 9, 2016

Alternate implementation at #733

@ManashRaja, please review?

@i5o
Copy link
Contributor

i5o commented Jan 2, 2017

Hello. I have merged #733 , that pull requests fixes the SL bug #4958
Thanks everyone.

@i5o i5o closed this Jan 2, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants