Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

service: serialize information about currently executing command #5354

Merged
merged 2 commits into from
Apr 24, 2017

Conversation

msekletar
Copy link
Contributor

@msekletar msekletar commented Feb 15, 2017

This PR is trying to address long standing issue that execution of commands is not resumed after daemon-reload.

Fixes #518

@msekletar msekletar added the pid1 label Feb 15, 2017
Copy link
Member

@poettering poettering left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks pretty good, but a few notes!

first = s->exec_command[id];
current = control ? s->control_command : s->main_command;

/* Figure out where we are in the list be walking back to the beginning */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: be → by

assert(c);

index = service_exec_command_index(u, id, control);
args = strv_join(c->argv, " ");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing OOM check...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(or maybe it's even OK if you ignore the OOM condition, as there's little you could do about it. But if you ignore it, then it shouldn't be used as-is in the fprintf() invocation below. Consider simply wrapping it in strempty() there, so that even if this fails we enter a defined state...

index = service_exec_command_index(u, id, control);
args = strv_join(c->argv, " ");

fprintf(f, "%s-command=%s %d %s %s\n", type, service_exec_command_to_string(id), index, c->path, args);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, so we might want to protect ourselves here from command lines that contain a newline. Those would result in borked serialization. I think it would be best to use something smarter than strv_join() for serializing the command line: build your own loop, and concatenate the arguments manually, after processing each individually through cescape(), escaping everything weird, including newlines and spaces.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assumed that systemd already throws away new lines and other "unimportant whitespace" while parsing unit files, so argv member never points to string that contains garbage whitespace like new lines.

Btw if above is not the case, shouldn't we fix that instead?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well, fprintf() just writes stuff to the serialization as it is, including all whitespaces and weird chars. The idea is to escape what needs escaping beforehand.

Note that the serialization really is just a text file, not more, you could view it with a text editor, and that's totally OK.

During deserialization we process line-by-line, \n is our record separator if you so will, and if you insert additional \n within one record (i.e. line), then this will be read as two records...

@@ -2441,6 +2490,109 @@ static int service_deserialize_item(Unit *u, const char *key, const char *value,
s->watchdog_override_enable = true;
s->watchdog_override_usec = watchdog_override_usec;
}
} else if (streq(key, "main-command") || streq(key, "control-command")) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

STR_IN_SET...


state = STATE_EXEC_COMMAND_TYPE;

FOREACH_WORD(word, l, value, parser_state) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please don't use FOREACH_WORD() in new code anymore. Use a look around extract_first_word() instead...


temp = strndup(word, l);
if (!temp)
return log_oom();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, so, we can't really recover from real errors here. I think the best we can do on OOM is log, stop processing this line but proceed with the next, and hence not return with -ENOMEM here.

index++;

return index;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

given that this function cannot really fail, and can't validly return a negative number I think it would be best to use an unsigned return type for this, such as simply "unsigned"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the return value is still int here....

state = STATE_EXEC_COMMAND_ID;
break;
case STATE_EXEC_COMMAND_ID:
r = sscanf(temp, "%d", &index);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please use safe_atou() instead!

temp = NULL;
state = STATE_EXEC_COMMAND_ARGS;

assert(path_is_absolute(path));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

during deserialization we try to be defensive when it comes to data we read, after all we might do live updates from an older systemd version (or even live downgrades from a newer one!), and hence parse failures should not be fatal, and not even something we log about at a level higher than debug... An assert() is definitely too hard.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well, previous systemd version either doesn't have the code for serialization of exec commands and we never reach this code, or it has that code and it better do it properly. If it is serialized "incorrectly" by previous systemd version then something is very broken.

but I can make it non fatal for sure :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well, again, in theory downgrades are possible too, and you never know what the future brings. The idea is to write this stuff as defensibly as possible and just ignore bits we can't process and proceed with the next, instead of choking on them

else if (command)
s->main_command = command;
else
log_unit_notice(u, "Previously executed command vanished from the unit file, execution of the command list won't be resumed.");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

given how complex this is I think this bit should be mostly moved to its own function.

@@ -0,0 +1,2 @@
[datadir.paths]
logs_dir = /tmp/job-results
Copy link
Member

@evverx evverx Feb 15, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need avocado-framework?

This test doesn't run tests inside VM/container (it uses the installed systemd, not the one from the built source tree). Why should we place this test in test/TEST-*-ISSUE-*?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW https://anonscm.debian.org/cgit/pkg-systemd/systemd.git/tree/debian/tests/upstream can't run this test of course. So, ubuntu-CI will fail

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CentOS CI doesn't run this test too

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need avocado-framework?

We don't necessarily. But I think that framework is actually quite nice for writing high-level "black box" tests. It would be nice to agree (or at least advise to people) on the test framework if they need to write "integration test" for systemd. Anyway, I wrote test mostly for myself, in order to make sure that patch looks as intended. We don't have to include it upstream.

This test doesn't run tests inside VM/container (it uses the installed systemd, not the one from the built source tree). Why should we place this test in test/TEST--ISSUE-?

I don't follow? AFAICT, there are no other "issue specific" tests anywhere else in systemd source tree, except test/. This test exercises new code added to fix issue #518 so I added it to test/ directory.

Yes it will use systemd from the system but I don't think it is a job of the test case to prepare test environment.

Copy link
Member

@evverx evverx Feb 15, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to agree (or at least advise to people) on the test framework if they need to write "integration test" for systemd

Well, test/networkd-test.py, test/sysv-generator-test.py use unittest

there are no other "issue specific" tests anywhere else in systemd source tree, except test/

Hm, my bad 09f6f45

Anyway, https://github.com/systemd/systemd/blob/master/test/README.testsuite mentions "the extended testsuite". I think we shouldn't mix that with other frameworks.

Yes it will use systemd from the system but I don't think it is a job of the test case to prepare test environment

Maybe, but we should not place such tests into test/TEST-*.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As another option. There are some "beakerlib"-tests https://github.com/systemd/systemd-centos-ci/tree/master/slave/tests

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd really stick to the framework we already have instead of adding new infrastructure just for one test...

@keszybz keszybz added the reviewed/needs-rework 🔨 PR has been reviewed and needs another round of reworks label Feb 17, 2017
@msekletar
Copy link
Contributor Author

@poettering hopefully I addressed all issues that you've pointed out. @evverx note that test now uses only unittest.

@msekletar msekletar removed the reviewed/needs-rework 🔨 PR has been reviewed and needs another round of reworks label Feb 20, 2017
Copy link
Member

@poettering poettering left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a couple of more notes! sorry for not noticing them earlier!

index++;

return index;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the return value is still int here....


args = strv_join(escaped_args, " ");
if (!args)
return -ENOMEM;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, I think instead of preparing a list and then calling strv_join() and throwing the list away again, it would be better to just prepare the string directly. something like this:

size_t length = 0, allocated = 0;
char *args = NULL;
…
STRV_FOREACH(arg, c->argv) {
        size_t n;
        char *e;

        e = cescape(arg);
        if (!e)
                return -ENOMEM;

        n = strlen(e);
        if (!GREEDY_REALLOC(args, allocated, length + 1 + n + 1))
                return -ENOMEM;

        if (length > 0)
                args[length++] = ' ';

        memcpy(args + length, e, n);
        length += n;
}

if (!GREEDY_REALLOC(args, allocated, length + 1)
        return -ENOMEM;
args[length++] = 0;
…

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, and I just noticed: this should be xunescape(…, WHITESPACE) and not plain cescape() so that the inner whitespace is properly escaped, so that we don't get confused by it while parsing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know that code could have been written more clever but I tried to optimize for the reader and tried to make it easy to grok what is going on. In general, only very small subset of units has main_command or control_command running at reload time hence I didn't bother optimizing too much since this code is going to be skipped for majority of units anyway.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And honestly I don't even think that putting string together as we go is necessarily faster. And again I find proposed code much more elegant and easier to read (less manual juggling with the raw memory).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the return value is still int here....

Shouldn't we turn this into error. I can't really think of situation when I'd actually want this mismatched return types and possible silent misinterpretation of data to occur.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well, the direct version isn't even that much longer... And faster too. malloc() is very expensive after all...

To me as a read this just raises all kinds of questions: why turn this into an strv first, if we never actually need that... I am pretty sure it's reasier to grok if we don't take extra detours...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and the return code of that function is still int, but it can't actually return any error, and we blindly return an "unsigned". I am pretty sure we should just stick to one type here, or make the cast explicit. But I think it's actually nice to just return "unsigned" here, simply to indicate that failure isn't possible.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think both approaches are fine. And like @msekletar said, efficiency isn't an issue here.

if (!args)
return -ENOMEM;

fprintf(f, "%s-command=%s %d %s %s\n", type, service_exec_command_to_string(id), index, c->path, args);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

c->path needs escaping too, as it might container weird chars too

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure

for (;;) {
_cleanup_free_ char *arg = NULL;

r = extract_first_word(&value, &arg, NULL, EXTRACT_CUNESCAPE_RELAX);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there any specific reason for using EXTRACT_CUNESPCAE_RELAX here? I fiugre we should use that only if we are not the ones defining the format. But in this case we are...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No specific reason, EXTRACT_CUNESCAPE_RELAX looked appropriate.

if (!candidate_args)
return -ENOMEM;

found = streq(args, candidate_args) && streq(command->path, path);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, why first format the arguments into a string? please use strv_equal() to compare the arguments directly.

return -ENOMEM;

if (streq(args, candidate_args) && streq(command->path, path))
break;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I forgot about strv_equal(). Thanks!

else if (command)
s->main_command = command;
else
log_unit_notice(u, "Previously executed command vanished from the unit file, execution of the command list won't be resumed.");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think this should actually be upgraded to warning.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, warning is more appropriate. I figure this will happen rarely anyway.

@evverx
Copy link
Member

evverx commented Feb 21, 2017

@evverx note that test now uses only unittest.

@msekletar , I see, thanks

Anyway, test/TEST-*-* are special directories:
https://github.com/systemd/systemd/blob/master/test/README.testsuite

The extended testsuite only works with uid=0. It contains of several
subdirectories named "test/TEST-??-*", which are run one by one.

https://github.com/systemd/systemd/blob/master/test/Makefile

	$(MAKE) -C .. all
	@for i in TEST-[0-9]*; do \
		[ -d $$i ] || continue ; \
		[ -f $$i/Makefile ] || continue ; \
		make -C $$i all ; \

We shouldn't place random tests there.

Try to put into test/test-reload-exec-serialization or so.
Also, it make sense to run this test for every PR. Maybe, @martinpitt can help

CI failed. See #5354 (comment)

@evverx evverx added the ci-fails/needs-rework 🔥 Please rework this, the CI noticed an issue with the PR label Feb 21, 2017
@martinpitt
Copy link
Contributor

Try to put into test/test-reload-exec-serialization or so.

Or keep it as test/TEST* and copy the boilerplate from the other tests so that it can actually run as part of the "upstream" tests. It would be the first one written in Python, but as long as it's bilingual (CentOS only has Python 2, Ubuntu minimal installations only Python 3) and you call it through $(which python3 || which python2) mytest.py it should work.

You should be able to run the upstream test with something like

sudo make -C test/TEST-01-BASIC clean setup run

Try it on an exising test first to ensure you have QEMU et al installed, then try it on your new one.

Also, it make sense to run this test for every PR. Maybe, @martinpitt can help

Yeah, if it's going to be a custom one like test/networkd-test.py I can add a new autopkgtest for it.

Copy link
Member

@keszybz keszybz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I love the testing framework. Not sure where it should live, but that's a minor issue.

I know this is a bit of a bike-shed issue, but still: why do you want to compare the command line including the command line args? In practice, command line args might include ephemeral stuff that changes after reload. For example some command line options that you changed and are reloading the unit for. This checking of arguments makes everything more complicated, but I don't see the gain. Why not just make the rules simpler and only compare the argv0?

class ExecutionResumeTest(unittest.TestCase):
def setUp(self):
self.unit = 'test-issue-518.service'
self.unitfile_path = '/run/systemd/system/{0}'.format(self.unit)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No "0" needed.

import os
import tempfile

from pydbus import SystemBus
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a new dep. Please add to README.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or perhaps use gi.repository import Gio, GLib directly, i. e. the GDBus API: like Gio.bus_get_sync(Gio.BusType.SYSTEM,) None) or Gio.DBusProxy.new_sync() etc. pydbus just wraps that, and at least for these simple cases using a wrapper doesn't really buy much.

import tempfile

from pydbus import SystemBus
from enum import Enum
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, that's an external dep in Python 2 that needs adding to the README.The python2-enum34 package.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, i thought we settled on strictly requiring python3 for everything, no? shouldn't we do the same here, too?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, right; I wrote that because there is no Python 3 in Centos 7 (where we run the "default" test), but there we just run the complete build without Python support:

checking for a Python interpreter with version >= 3... none

So I guess py 3 only is fine.


args = strv_join(escaped_args, " ");
if (!args)
return -ENOMEM;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think both approaches are fine. And like @msekletar said, efficiency isn't an issue here.

@@ -0,0 +1,186 @@
#!/usr/bin/python
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All other python scripts are using python3. #4761

msekletar added a commit to msekletar/systemd-centos-ci that referenced this pull request Mar 13, 2017
I am currently working on systemd/systemd#5354
and PR also introduces first python based test to TEST-* series. This
commit is prerequisite for running that test in Cent OS CI.

https://www.softwarecollections.org/en/scls/rhscl/rh-python35/
@msekletar
Copy link
Contributor Author

I am ready to resubmit the PR but I have a problem with new test. If we strictly require that all our python code runs under python3 then I will drop the test from PR. I am using pydbus library that requires gobject introspection python binding and this binding is not available on CentOS 7.

@msekletar
Copy link
Contributor Author

@martinpitt Any suggestions what to do about the test?

Copy link
Contributor

@martinpitt martinpitt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest to change the build system to only run the test if HAVE_PYTHON.

import os
import tempfile

from pydbus import SystemBus
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or perhaps use gi.repository import Gio, GLib directly, i. e. the GDBus API: like Gio.bus_get_sync(Gio.BusType.SYSTEM,) None) or Gio.DBusProxy.new_sync() etc. pydbus just wraps that, and at least for these simple cases using a wrapper doesn't really buy much.

self.write_unit_file(UnitFileChange.REMOVAL)
self.reload_unit()

self.assertTrue(not os.path.exists(self.output_file))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick, but there's assertFalse which avoids the not.

os.remove(self.unitfile_path)
except OSError:
# ignore error if log file doesn't exist
pass
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can it happen that output_file doesn't exist, but unitfile_path does? Then you need to catch/ignore them independently.

@evverx
Copy link
Member

evverx commented Mar 14, 2017

@evverx
Copy link
Member

evverx commented Mar 14, 2017

I am using pydbus library that requires gobject introspection python binding and this binding is not available on CentOS 7.

Well, I don't understand why we need pydbus at all. We can run

subprocess.check_call(['systemctl', 'daemon-reload'])
subprocess.check_call(['systemctl', 'start', $unit, '--job-mode', 'replace'])

. No?

https://github.com/systemd/systemd/blob/master/test/networkd-test.py does exactly that. Try git grep systemctl test/networkd-test.py. Or am I missing something?

@msekletar
Copy link
Contributor Author

https://github.com/systemd/systemd/blob/master/test/networkd-test.py does exactly that. Try git grep systemctl test/networkd-test.py. Or am I missing something?

Sure I can do that.

As you said we don't need pydbus for this specific case. But we (internally at Red Hat) started working on systemd DBus API test suite. We were looking for nice and easy to use dbus library for python and we've chosen pydbus. I figured it would be easier to bring this dependency in before we post our DBus API test suite.

@msekletar
Copy link
Contributor Author

I've force pushed my branch. I've changed the test as suggested by @evverx.

Note CentOS CI failure seems unrelated. Looks like jenkins slave was unable to communicate with Duffy API service and obtain test bed.

@msekletar msekletar removed the ci-fails/needs-rework 🔥 Please rework this, the CI noticed an issue with the PR label Mar 16, 2017
# Lesser General Public License for more details.
#
# You should have received a copy of the GNU Lesser General Public License
# along with python-systemd; If not, see <http://www.gnu.org/licenses/>.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/python-systemd/systemd/

#
# You should have received a copy of the GNU Lesser General Public License
# along with python-systemd; If not, see <http://www.gnu.org/licenses/>.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please add a comment explaining how to run the test? I have to run something like

make && make install && systemctl daemon-reexec && ./test-exec-deserialization

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw networkd-test contains

# ATTENTION: This uses the *installed* networkd, not the one from the built
# source tree.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't notice this comment. Sure I can add similar comment to my test. Even though I think it is kinda obvious that if you don't do make && sudo make install && sudo init u then of course you will be testing your old systemd version and not the git version.

self.check_output(expected_output)

def test_added_after(self):
expected_output = 'foo\nbar\n'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, I expect foo\n here. Why should we resume an execution?

Copy link
Member

@evverx evverx Mar 16, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I altered the test

diff --git a/test/test-exec-deserialization.py b/test/test-exec-deserialization.py
index 8dbd153..719101f 100755
--- a/test/test-exec-deserialization.py
+++ b/test/test-exec-deserialization.py
@@ -67,9 +67,9 @@ class ExecutionResumeTest(unittest.TestCase):
         unit_file_content = '''
         [Service]
         Type=oneshot
-        ExecStart=/bin/sleep 2
-        ExecStart=/bin/bash -c "echo foo >> {0}"
-        ExecStart=/bin/bash -c "echo bar >> {0}"
+        ExecStart=-/bin/sleep 2
+        ExecStart=-/bin/bash -c "echo foo >> {0}"
+        ExecStart=-/bin/bash -c "echo bar >> {0}"
         '''.format(self.output_file)
         self.unit_files[UnitFileChange.COMMAND_ADDED_AFTER] = unit_file_content

The test passed. Why should we ignore -?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, I expect foo\n here. Why should we resume an execution?

Sorry, I was wrong. I expect '' (not foo\nbar\n)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, I expect foo\n here. Why should we resume an execution?

In setup_unit() we start following unit,

[Service]
Type=oneshot
ExecStart=/bin/sleep 2
ExecStart=/bin/bash -c "echo foo >> /tmp/test12345"

then we immediately replace unit file with following content and reload,

[Service]
Type=oneshot
ExecStart=/bin/sleep 2
ExecStart=/bin/bash -c "echo foo >> /tmp/test12345"
ExecStart=/bin/bash -c "echo bar >> /tmp/test12345"

Before reload unit's main-command was sleep 2 we store that information reload and since command is still there after reload at expected place we then continue executing command list as usual and hence both echo commands get executed.

@evverx
Copy link
Member

evverx commented Mar 16, 2017

Well, it seems wrong to resume altered units. Has this approach been discussed somewhere?

@msekletar
Copy link
Contributor Author

msekletar commented Mar 17, 2017

Well, it seems wrong to resume altered units. Has this approach been discussed somewhere?

AFAICT, this approach was first proposed here,

#518 (comment)

Except I am not serializing line numbers, instead I am using offsets in respective command list. Since those are not affected by addition of whitespace or configuration to unrelated sections ([Unit], [Install]).

@msekletar
Copy link
Contributor Author

msekletar commented Mar 17, 2017

The test passed. Why should we ignore -?

Well - in front of the command affects how state machine works in case that command fails but has no effect on what is going to be executed next in non-failure case, right?

@msekletar
Copy link
Contributor Author

I have fixed the comments now.

@evverx
Copy link
Member

evverx commented Mar 17, 2017

AFAICT, this approached was first proposed here

Thanks for the link. I'll take a look.

Well - in front of the command affects how state machine works in case that command fails but has no effect on what is going to be executed next in non-failure case, right?

Well, my point is that I don't want to run anything if some unit is altered somehow.

Anyway, I'm reading #3202 (comment)

However, the thing is that we have tons of options on all types of units that take no immediate effect on a "systemctl daemon-reload", and if we wanted to do this we really should do this properly, and expose a new property on all units that indicates whether there are config changes detected due to a "systemctl daemon-reload" that have not taken effect because the unit was not restarted yet.

Why should we make an exception for Exec*? Should we document that?

@msekletar
Copy link
Contributor Author

Why should we make an exception for Exec*? Should we document that?

This boils down to fundamental design trait of systemd. We have two state engines, one for jobs and one for units. However job state engine doesn't work with its own view of the world instead it uses units dependency graph and unit objects in that graph are affected by daemon-reload. Maybe the other model in which jobs would see their own copy of units they operate on would be better.

However, I think we need to do something about this, at least for Exec* option because currently systemd jobs are completely unreliable and stray reload can cause issues that are very hard to debug.

Honestly I'd want not to be in shoes of a sysadmin performing data recovery just to realize that last backup that was supposed to happen actually didn't. Because systemd didn't execute half of the procedure just because puppet at the same time decided to re-enable some random service that someone else disabled by accident.

Service *s = SERVICE(u);
ServiceExecCommand id;
ExecCommand *c;
int index;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be unsigned.

if (!p)
return -ENOMEM;

fprintf(f, "%s-command=%s %d %s %s\n", type, service_exec_command_to_string(id), index, p, args);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... and %u here.


/* Let's check whether exec command on given offset matches data that we just deserialized */
for (command = s->exec_command[id], i = 0; command; command = command->command_next, i++) {
if (i != index)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not that it matters much in practice, but index could be uninitialized here. Also path could be NULL.

In fact, using a state machine for this seems overkill, since there's only a single trivial pass possible. Why not do something like this:

_cleanup_free_ char *idstr = NULL, *indexstr = NULL, *path = NULL;
char *arg;
r = extract_first_word(&value, &idstr, NULL, EXTRACT_CUNESCAPE);
if (r <= 0)
     return r < 0 ? r : -EUCLEAN;
id = service_exec_command_from_string(idstr);
if (id < 0)
       return -EINVAL;
r = extract_first_word(&value, &indexstr, NULL, EXTRACT_CUNESCAPE);
if (r <= 0)
     return r < 0 ? r : -EUCLEAN;
 r = safe_atou(indexstr, &index);
if (r < 0)
            return -EINVAL;
r = extract_first_word(&value, &path, NULL, EXTRACT_CUNESCAPE);
if (r <= 0)
     return r < 0 ? r : -EUCLEAN;
if (!path_is_absolute(path))
     return -EINVAL;
while ((r = extract_first_word(&value, &arg, NULL, EXTRACT_CUNESCAPE) > 0) {
     r = strv_push(&argv, arg);
     if (r < 0)
            break;
}
if (r < 0)
     return r;
...

That's both shorter and imho more bullet-proof.

else if (command)
s->main_command = command;
else
log_unit_warning(u, "Previously executed command vanished from the unit file, execution of the command list won't be resumed.");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Previously executed", hm, it's more like "currently executed". Maybe "Current command vanished from the unit file, execution of the command list won't be continued"?

} else if (STR_IN_SET(key, "main-command", "control-command")) {
r = service_deserialize_exec_command(u, key, value);
if (r < 0)
log_unit_debug(u, "Failed to parse previously executed command: %s", value);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

log_unit_debug_errno(u, r, "Failed to parse serialized command \"%s\": %m", value);

Makefile.am Outdated
@@ -5961,6 +5961,7 @@ EXTRA_DIST += \
units/systemd-networkd.service.m4.in \
units/systemd-networkd-wait-online.service.in \
test/networkd-test.py
test/test-exec-deserialization.py
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing backslash!

@@ -2140,6 +2141,83 @@ _pure_ static bool service_can_reload(Unit *u) {
return !!s->exec_command[SERVICE_EXEC_RELOAD];
}

static unsigned service_exec_command_index(Unit *u, ServiceExecCommand id, bool control) {
Service *s = SERVICE(u);
unsigned index = 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick, and sorry for noticing this earlier: there's a public glibc function "index()", I'd thus recommend not overloading the name locally here, but instead pick a different name, for example "idx" or just "i", or something like that. (old gcc even used to warn about this)

Copy link
Member

@poettering poettering left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

two minor other issues, you are welcome to ignore, as they are suggestions, not more.

@@ -2140,6 +2141,83 @@ _pure_ static bool service_can_reload(Unit *u) {
return !!s->exec_command[SERVICE_EXEC_RELOAD];
}

static unsigned service_exec_command_index(Unit *u, ServiceExecCommand id, bool control) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think it would be nicer to drop the "control" argument here, and instead simply take an ExecCommand* parameter, which is then either passed as s->control_command or s->main_command.

} else {
type = "main";
c = s->main_command;
id = SERVICE_EXEC_START;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

similar as above, instead of taking a "control" bool, I'd just take these three values as function parameters, so that the caller picks them the right way

@poettering poettering added the reviewed/needs-rework 🔨 PR has been reviewed and needs another round of reworks label Mar 30, 2017
Stored information will help us to resume execution after the
daemon-reload.

This commit implements following scheme,

* On serialization:
  - we count rank of the currently executing command
  - we store command type, its rank and command line arguments

* On deserialization:
  - configuration is parsed and loaded
  - we deserialize stored data, command type, rank and arguments
  - we look at the given rank in the list and if command there has same
    arguments then we restore execution at that point
  - otherwise we search respective command list and we look for command
    that has the same arguments
  - if both methods fail we do not do not resume execution at all

To better illustrate how does above scheme works, please consider
following cases (<<< denotes position where we resume execution after reload)

; Original unit file
[Service]
ExecStart=/bin/true <<<
ExecStart=/bin/false

; Swapped commands
; Second command is not going to be executed
[Service]
ExecStart=/bin/false
ExecStart=/bin/true <<<

; Commands added before
; Same commands are problematic and execution could be restarted at wrong place
[Service]
ExecStart=/bin/foo
ExecStart=/bin/bar
ExecStart=/bin/true <<<
ExecStart=/bin/false

; Commands added after
; Same commands are not an issue in this case
[Service]
ExecStart=/bin/true <<<
ExecStart=/bin/false
ExecStart=/bin/foo
ExecStart=/bin/bar

; New commands interleaved with old commands
; Some new commands will be executed while others won't
ExecStart=/bin/foo
ExecStart=/bin/true <<<
ExecStart=/bin/bar
ExecStart=/bin/false

As you can see, above scheme has some drawbacks. However, in most
cases (we assume that in most common case unit file command list is not
changed while some other command is running for the same unit) it
should cause that systemd does the right thing, which is restoring
execution exactly at the point we were before daemon-reload.

Fixes systemd#518
@msekletar
Copy link
Contributor Author

I force pushed my feature branch with new version. Please have a look. Thanks!

@keszybz I left state machine in place (I mean pass is not trivial after all since there is loop over arguments). But if you insist I can remove it. Honestly, I like it because I immediately see from the code what is a structure of serialized data.

@poettering
Copy link
Member

i think all issues raised have been addressed in some form. Merging.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Development

Successfully merging this pull request may close these issues.

6 participants