
Service state design flaw #24783

Closed
kaithar opened this issue Jun 18, 2015 · 17 comments
Labels: Bug, P2, Pending-Discussion, Platform, severity-medium, stale, State-Module
Milestone: Approved

Comments

@kaithar (Contributor) commented Jun 18, 2015

Ok, I get the intent of this but it's horribly flawed and I can't think how to fix it. The problem stems from these lines in states/service.py:dead()

    before_toggle_status = __salt__['service.status'](name, sig)
    before_toggle_enable_status = __salt__['service.enabled'](name)
    if not before_toggle_status:
        ret['comment'] = 'The service {0} is already dead'.format(name)

    # snip

    # only force a change state if we have explicitly detected them
    after_toggle_status = __salt__['service.status'](name)
    after_toggle_enable_status = __salt__['service.enabled'](name)
    if (
        (before_toggle_enable_status != after_toggle_enable_status) or
        (before_toggle_status != after_toggle_status)
    ) and not ret.get('changes', {}):
        ret['changes'][name] = func_ret
    return ret

This seems logical enough: get the status before, check whether the service is actually running, then get the status again afterwards and compare, to try to determine whether we made a change.

Only that doesn't work, for an obvious reason: service.status returns the pids of the processes it thinks belong to that service. That might result in sane behaviour for something with a single-process model, but if you bring something like apache prefork into the picture you have a reasonably constant failure scenario. Any service that forks off worker processes on a regular basis is going to trigger false alarms to a greater or lesser extent.

Any ideas how the logic can be adjusted to be more reliable? Parent process detection might be a workable approach?
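
To make the failure mode concrete, here is a minimal illustration (hypothetical pid values, purely illustrative, not real Salt output) of why comparing the raw return values misfires for a forking service:

    # Hypothetical pid strings -- for a forking service the pid list can change
    # between the two service.status calls even though nothing was started or
    # stopped by the state run.
    before_toggle_status = '1234 1301 1302'   # parent + workers at the first check
    after_toggle_status = '1234 1415 1416'    # workers respawned before the second check

    print(before_toggle_status != after_toggle_status)              # True: a spurious "change"
    print(bool(before_toggle_status) != bool(after_toggle_status))  # False: both simply mean "running"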

@kiorky (Contributor) commented Jun 18, 2015

service.status is dependent on the init system; your assumptions are mostly wrong.

@kiorky (Contributor) commented Jun 18, 2015

If you want to fix something, fix your init script's status command.

@kiorky (Contributor) commented Jun 18, 2015

(Really, I think your problem is a broken status command.)
Apache can prefork, no problem with that, but your init script must still keep track of your processes and answer the status command correctly.

@kaithar (Contributor, Author) commented Jun 18, 2015

@kiorky With respect, no... this has nothing to do with the init script; I wouldn't be filing a bug if it were. Please don't tell me my "assumptions" are wrong without at least checking first. It's not constructive to solving issues.

Digging in further, the problem is actually worse: it behaves the way I described inconsistently, depending on the distribution. The summary version is that unless you're running on something it recognises as Debian, BSD or Red Hat, your results are going to be wrong in some way. However, none of the service modules have the capacity to return anything from the init script beyond a strict boolean, so your comment is wrong in all cases. The init script cannot cause the behaviour I described, no matter how broken it is.

I've snipped the doc strings for sanity. However....

  • FreeBSD, NetBSD and OpenBSD, which aren't shown here as they're logically close enough to Debian, don't document the sig parameter behaviour.
  • Gentoo has its behaviour accurately documented even though it's odd.
  • Redhat, which is documented, also doesn't document sig. It also doesn't behave consistently when sig is given, since _service_is_upstart() preempts it.
  • Windows appears to have copied the doc string from the generic service.py, which is unfortunate as the behaviour is actually very different. It doesn't respect sig at all.

I can probably fix the Gentoo one, for openrc at any rate, but that doesn't solve the issue that the generic behaviour is going to break state reporting.
I would personally treat service.py as the authoritative behaviour, since it is the generic default; the terser distribution-specific behaviour is a deviation from the expected interface.

Debian (Netbsd and Openbsd are the same except in how they generate the cmd variable):

def status(name, sig=None):
    if sig:
        return bool(__salt__['status.pid'](sig))
    cmd = _service_cmd(name, 'status')
    return not __salt__['cmd.retcode'](cmd)

Gentoo:

def status(name, sig=None):
    return __salt__['status.pid'](sig if sig else name)

Redhat:

def status(name, sig=None):
    if _service_is_upstart(name):
        cmd = 'status {0}'.format(name)
        return 'start/running' in __salt__['cmd.run'](cmd, python_shell=False)
    if sig:
        return bool(__salt__['status.pid'](sig))
    cmd = '/sbin/service {0} status'.format(name)
    return __salt__['cmd.retcode'](cmd, python_shell=False, ignore_retcode=True) == 0

Service.py, for all platforms not supported elsewhere:

def status(name, sig=None):
    return __salt__['status.pid'](sig if sig else name)

Windows:

def status(name, sig=None):
    cmd = ['sc', 'query', name]
    statuses = __salt__['cmd.run'](cmd, python_shell=False).splitlines()
    for line in statuses:
        if 'RUNNING' in line:
            return True
        elif 'STOP_PENDING' in line:
            return True
    return False

@kiorky (Contributor) commented Jun 18, 2015

I will stop responding to you; you didn't read my comments correctly. @jfindlay @basepi @rallytime I think this bug is still invalid.

@kiorky (Contributor) commented Jun 18, 2015

@basepi, the problem here is clearly the status command returning wrong information (a mis-implemented status command in one of his services).

@kiorky (Contributor) commented Jun 18, 2015

@basepi it's the job of the init system (sysv, or managers like systemd/upstart and so on) to manage processes; what service.* actually does, and does correctly, is ask them via the status command. The problem is in the raw status returned by the underlying init system.

@kaithar (Contributor, Author) commented Jun 18, 2015

@kiorky I have just provided the actual code responsible for handling the service.status calls, none of which are capable of retrieving a list of pids from an init script. None.

Every single call I could find from a service.status implementation to an init script is wrapped in either bool() or a boolean casting operator (== and in for redhat_service.py, not for debian_service.py, in for win_service.py).
Even if the init script was returning the wrong information, which I already know it isn't, it's still impossible for any output from that script to travel past the service.status wrapper functions.

Further the code I provided clearly demonstrates that what I'm describing is intended behaviour under some conditions.

If you're going to accuse me of not reading your comments, the least you could do is properly read my responses to said comments before making your assessment of my ability to read. Doing otherwise is simply wasting everyone's time, your own included.

@kiorky (Contributor) commented Jun 18, 2015

The pid is unrelated to salt's status command.
status and enabled return a boolean saying whether the service is respectively running or enabled at boot, full stop.

@kiorky (Contributor) commented Jun 18, 2015

Following a discussion we had on IRC, there is still a brittle point, but not in the state; it's more in the __virtual__ functions of the modules, where we should ensure that we map to the right underlying system (like detecting systemd on any Linux that booted with it, with no hardcoding), and likewise for upstart and friends. It can be brittle, for example, on Ubuntu 15.04 and upward (this works correctly for now, but there are nuances in the grains that could break it at a later point). Maybe for the upstart module we should have something like the salt.utils.systemd.booted helper, ensuring that the process manager (PID 1) really is upstart; for example, I personally have setups where upstart is not running as PID 1 (some special containers with a patched upstart that is not the first process).
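
A rough sketch of that kind of check (hypothetical helper names, Linux-specific, and not existing Salt code) could look like this:

    import os

    def _pid1_comm():
        # /proc/1/comm holds the executable name of PID 1 on Linux.
        try:
            with open('/proc/1/comm') as fh:
                return fh.read().strip()
        except OSError:
            return ''

    def _booted_with_upstart():
        # upstart ships as /sbin/init, so the PID 1 name alone ('init') is not
        # conclusive; 'initctl version' mentioning upstart is the stronger
        # signal, mirroring what upstart.py already does for Debian.
        return _pid1_comm() != 'systemd' and os.path.isfile('/sbin/initctl')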

@kiorky (Contributor) commented Jun 18, 2015

In other words, the __virtual__ checks should rely less on grains and more on the running system.
Frankly, my one grief with that is caching: it would have to happen outside the exec modules, since modules can be loaded again and again, and otherwise there will be a huge performance hit.

@basepi (Contributor) commented Jun 18, 2015

If pids are being returned, can't we just force those before/after toggle statuses to bools? I think that's really what the function is expecting anyway.
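
Something like this in states/service.py, as a sketch of that coercion only (not the actual patch; __salt__, name and sig come from the surrounding state function, as in the snippet quoted earlier):

    # Normalise both readings to bool so a changing pid list no longer
    # registers as a state change.
    before_toggle_status = bool(__salt__['service.status'](name, sig))
    before_toggle_enable_status = __salt__['service.enabled'](name)

    # ... stop or start the service as the state requires ...

    after_toggle_status = bool(__salt__['service.status'](name))
    after_toggle_enable_status = __salt__['service.enabled'](name)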

@basepi (Contributor) commented Jun 18, 2015

@kiorky I know you're worried about this being changed and introducing regressions, but you need to be less confrontational. It's obviously not working for someone else, and we need to carefully investigate why, and fix it if we can, without causing regressions in such a popular state function.

@kaithar Thanks for the report, see my comment above.

jfindlay added the State-Module, Pending-Discussion, Platform, Bug, severity-medium and P2 labels Jun 19, 2015
jfindlay added this to the Approved milestone Jun 19, 2015
@kiorky (Contributor) commented Jun 19, 2015

@basepi there are no pids returned, and the toggle functions are already expecting booleans; the state logic, as I said above, is correct and absolutely must not be changed for this reason.

I never said that there was not a bug, but the bug is really not in the state, which I think is now in good shape. The problem is in what certain exec mods return.

@basepi (Contributor) commented Jun 22, 2015

I think I agree with you, any service-resolving execution module should return a bool from service.status. We need to fix the problem there. @kaithar what kind of host are you using (so we know which service module is being used underneath)?

@kaithar (Contributor, Author) commented Jun 30, 2015

Sorry for the delay on replying, I was away for a week.

I agree with the notion that it should be returning a bool; the disagreement arises over the status of modules/service.py ... @kiorky asserts that it shouldn't be hit, while I take the approach that shouldn't != won't when dealing with less common options.
I'm not using a host that actually suffers this issue, but I suspect the generic module will be used on things like CoreOS, TinyCore, Puppy, LFS, OpenWRT... essentially anything that modifies the OS grain return.
There's an incomplete list of distributions on wiki that illustrates just how many aren't even derived from one of the main ones.

The issue is further confused by the fact that there are both OS-specific service modules (rh_service.py for example) and init-specific modules (like upstart.py). I agree with the principle here; what I think is needed is clear documentation of which module will run under what conditions, perhaps in table form. The documentation in the modules isn't always the best...

upstart.py

def __virtual__():
    '''
    Only work on Ubuntu
    '''
    # Disable on these platforms, specific service modules exist:
    if salt.utils.systemd.booted(__context__):
        return False
    elif __grains__['os'] in ('Ubuntu', 'Linaro', 'elementary OS', 'Mint'):
        return __virtualname__
    elif __grains__['os'] in ('Debian', 'Raspbian'):
        debian_initctl = '/sbin/initctl'
        if os.path.isfile(debian_initctl):
            initctl_version = salt.modules.cmdmod._run_quiet(debian_initctl + ' version')
            if 'upstart' in initctl_version:
                return __virtualname__
    return False

debian_service.py

def __virtual__():
    '''
    Only work on Debian and when systemd isn't running
    '''
    if __grains__['os'] in ('Debian', 'Raspbian') and not salt.utils.systemd.booted(__context__):
        return __virtualname__
    return False

Something I'm not certain on is what happens when multiple modules have __virtual__ return the same string.
To use an example that I know should do this, Raspbian without systemd running could match the two tests above, but it isn't on the list of excluded platforms in modules/service.py:

    disable = set((
        'RedHat',
        'CentOS',
        'Amazon',
        'ScientificLinux',
        'CloudLinux',
        'Fedora',
        'Gentoo',
        'Ubuntu',
        'Debian',
        'Arch',
        'Arch ARM',
        'ALT',
        'SUSE  Enterprise Server',
        'OEL',
        'Linaro',
        'elementary OS',
        'McAfee  OS Server',
        'Mint'
    ))

So yeah, consistency is my first main objection to the current service implementations.

My second main objection is that the documentation for modules/service.py starts off thus:

'''
The default service module, if not otherwise specified salt will fall back
to this basic module
'''

That to me says "this is the reference module; specific service modules should provide a superset of my behaviours"... if one of the service implementations differs in behaviour from that default service module, the default has to be considered correct by definition.
I agree that returning a bool from service.status is much saner than returning a list of pids, but that doesn't change the fact that the spec is being broken and ignored. Obviously the correct thing to do in this case is to change the default module to return a bool, and ideally someone needs to make sure the behaviours are consistent across modules.
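
For illustration, a bool-returning default could be as small as this (a sketch only, not the shipped module):

    def status(name, sig=None):
        # status.pid returns the matching pids (an empty result when nothing
        # matches), so casting to bool gives the running/not-running answer
        # the state comparison actually needs.
        return bool(__salt__['status.pid'](sig if sig else name))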

stale bot commented Nov 21, 2017

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

If this issue is closed prematurely, please leave a comment and we will gladly reopen the issue.

stale bot added the stale label Nov 21, 2017
stale bot closed this as completed Nov 28, 2017