Rework unit loading to take into account all aliases #13119

Merged
merged 8 commits into from
Jul 30, 2019

keszybz
Member

@keszybz keszybz commented Jul 19, 2019

Only the last 7 patches are new, the rest is the same as in #13096. I'll keep rebasing this until the other one is merged.

For #11972.

@keszybz keszybz added the pid1 label Jul 19, 2019
@lgtm-com

lgtm-com bot commented Jul 19, 2019

This pull request introduces 4 alerts when merging cfa3958 into f7e7bb6 - view on LGTM.com

new alerts:

  • 4 for FIXME comment

@keszybz
Member Author

keszybz commented Jul 19, 2019

CentOS CI (Arch in KVM) is failing with:

$ systemctl enable rpcbind && systemctl start rpcbind
Failed to start rpcbind.service: Unit rpcbind.service not found.

It seems Arch might be doing something special with how rpcbind is defined... No idea.

@lgtm-com

lgtm-com bot commented Jul 19, 2019

This pull request introduces 4 alerts when merging 1e023f5 into a505166 - view on LGTM.com

new alerts:

  • 4 for FIXME comment

@mrc0mmand
Member

mrc0mmand commented Jul 20, 2019

@keszybz I tweaked the sanity boot check in the CentOS 7 job a little (by adding systemd.log_target=console); maybe it'll help you with debugging: https://ci.centos.org/job/systemd-pr-build/7517/artifact//systemd-centos-ci/artifacts_1U2hyV/bootstrap-logs-upstream.6iv/sanity-boot-check.log

@lgtm-com

lgtm-com bot commented Jul 20, 2019

This pull request introduces 4 alerts when merging 2b1e0f4 into a505166 - view on LGTM.com

new alerts:

  • 4 for FIXME comment

@mrc0mmand
Member

According to the CentOS CI, systemd doesn't detect newly created units without calling systemctl daemon-reload.

TEST-03-JOBS:

+ cat   # cat <<EOF >  /run/systemd/system/wait2.service...
+ cat   # cat <<EOF >  /run/systemd/system/wait5fail.service...
++ date -u +%s
+ START_SEC=1563640040
+ systemctl start --wait wait2.service
Failed to start wait2.service: Unit wait2.service not found.
[FAILED] Failed to start Testsuite service.
See 'systemctl status testsuite.service' for details.
         Starting End the test...

TEST-12-ISSUE-3171

...
U=/run/systemd/system/test.socket
cat <<'EOL' >$U
...
systemctl start test.socket
...
Jul 20 18:29:16 systemd-testsuite test-socket-group.sh[32]: Failed to start test.socket: Unit test.socket not found.
Jul 20 18:29:16 systemd-testsuite systemd[1]: test.socket: Collecting.
Jul 20 18:29:16 systemd-testsuite systemd[1]: test.socket: Failed to send unit remove signal for test.socket: Connection reset by peer
Jul 20 18:29:16 systemd-testsuite systemd[1]: Bus private-bus-connection: changing state CLOSING → CLOSED
Jul 20 18:29:16 systemd-testsuite systemd[1]: Got disconnect on private connection.
Jul 20 18:29:16 systemd-testsuite systemd[1]: Received SIGCHLD from PID 32 (test-socket-gro).
Jul 20 18:29:16 systemd-testsuite systemd[1]: Child 32 (test-socket-gro) died (code=exited, status=5/NOTINSTALLED)
Jul 20 18:29:16 systemd-testsuite systemd[1]: testsuite.service: Child 32 belongs to testsuite.service.
Jul 20 18:29:16 systemd-testsuite systemd[1]: testsuite.service: Main process exited, code=exited, status=5/NOTINSTALLED
Jul 20 18:29:16 systemd-testsuite systemd[1]: testsuite.service: Failed with result 'exit-code'.
Jul 20 18:29:16 systemd-testsuite systemd[1]: testsuite.service: Changed start -> failed

etc.

@keszybz keszybz mentioned this pull request Jul 26, 2019
@keszybz keszybz changed the title WIP: Rework unit loading to take into account all aliases Rework unit loading to take into account all aliases Jul 26, 2019
@keszybz
Member Author

keszybz commented Jul 26, 2019

> According to the CentOS CI, systemd doesn't detect newly created units without calling systemctl daemon-reload.

Duh, a typo (errno != -ENOENT) :(. Thanks for the help diagnosing this.

I'm now happy with the latest version. It seems pretty clean. There are still some minor details to work out, but in general this should be mergeable.

@lgtm-com

lgtm-com bot commented Jul 26, 2019

This pull request introduces 3 alerts when merging 6a08991 into 47685d9 - view on LGTM.com

new alerts:

  • 3 for FIXME comment

@keszybz keszybz added the ci-fails/needs-rework 🔥 Please rework this, the CI noticed an issue with the PR label Jul 26, 2019
"%s points to \"%s\" which is not a valid unit name: %m", filename, dst);
if (r2 == UNIT_NAME_INSTANCE)
return log_notice_errno(SYNTHETIC_ERRNO(EXDEV),
"%s: unit symlink target \"%s\" has instance name, rejecting.", filename, dst);
Member

Hmm, this is a new limitation that I am not sure I agree with. I mean, we previously said that as long as the instance name matches you can have aliases, and that when searching for a file we'd always look for the name with the instance first, and without it as a fallback. Why take this away? Are you sure this isn't used anywhere?

Member Author

OK, I'll remove this limitation. (BTW, thanks, this is the kind of comment I was looking for.)

Member Author

Oh, I remember why this is a problem. Normally aliases are functions (in the mathematical sense, i.e. there can be only one y for any given x). This is nicely enforced by the fact that a symlink can only point one way, and once we find the symlink, we don't look for any fragments with lower priority. But once we allow symlinks for aliases, we can have a@foo.service → b@foo.service and at the same time a@.service → c@.service, while both b@.service and c@.service can have fragments on disk. So are a@.service, b@.service, c@.service now all aliases? If yes, do we pick the fragment for b@.service or c@.service?

Member Author

@keszybz keszybz Jul 27, 2019

Actually, this can be even simpler: a@.service and b@foo.service may both exist on disk, and a@foo.service may be a symlink to b@foo.service. Now we have two candidate fragments again.

Maybe we should allow this, check that the instance alias matches the main alias, but warn?

Member

I'd say the rule is that if a@foo.service → b@foo.service exists, then a@.service itself is never read (but its drop-ins are).

i.e. in your example, I think we should read (and in this order):

  1. b@foo.service
  2. a@.service.d/*.conf
  3. a@foo.service.d/*.conf
  4. b@.service.d/*.conf
  5. b@foo.service.d/*.conf

But not:

  • a@.service (i.e. that this is a symlink is irrelevant to us, we already found a unit with the a@foo.service chain)
  • c@.service
  • c@foo.service
  • c@foo.service.d/*.conf
  • c@.service.d/*.conf

i.e. we only collect the drop-ins on the symlink chain we actually follow, and we stop searching as soon as we found the first unit file that is not a symlink. that means we'd never be tempted to resolve a@.service at all, since we already found a unit file by traversing a@foo.service.
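
The rule proposed in this comment can be sketched in a few lines of Python. This is only an illustration of the lookup order listed above, not systemd code: `template_of`, `proposed_read_order` and the `symlinks` dictionary are hypothetical names, and the on-disk state is modelled as a plain mapping from symlink name to target name.

```python
def template_of(name: str) -> str:
    """Map an instance name to its template, e.g. a@foo.service -> a@.service."""
    prefix, rest = name.split("@", 1)
    suffix = rest[rest.rindex("."):]
    return f"{prefix}@{suffix}"


def proposed_read_order(name: str, symlinks: dict[str, str]) -> list[str]:
    """List what would be read for the instance `name`, in order."""
    # Follow the symlink chain starting at the instance name only.
    chain = [name]
    while chain[-1] in symlinks:
        target = symlinks[chain[-1]]
        if target in chain:
            raise ValueError(f"symlink loop at {target}")
        chain.append(target)

    # The first non-symlink name on the chain is the fragment.
    order = [chain[-1]]
    # Drop-ins are collected only for names on the chain that was followed.
    for n in chain:
        order.append(f"{template_of(n)}.d/*.conf")
        order.append(f"{n}.d/*.conf")
    return order


# Hypothetical state from the example in this thread.
symlinks = {"a@foo.service": "b@foo.service", "a@.service": "c@.service"}
for entry in proposed_read_order("a@foo.service", symlinks):
    print(entry)
```

Because the walk starts at a@foo.service and stops at b@foo.service, the a@.service → c@.service symlink is never consulted, which is why nothing under c@ shows up in the result.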

Member Author

Dunno, the fact that we read a@.service.d/*.conf but not a@.service seems very arbitrary. Before, it didn't matter if a setting was defined in the fragment or in the drop-in (apart from ordering issues). But now the dropins start living a life independent of the main unit.

Member Author

Hmm, I can't convince myself that this is OK.
We either consider the instanced unit b@foo independent of the template a@.service, or not.
If the first, then we shouldn't load a@.service or a@.service.d. If the second, we should load both.

@mrc0mmand
Member

TEST-15-DROPIN is failing:

Jul 26 14:59:46 systemd-testsuite test-dropin.sh[29]: + check_ko a1 Wants y.service
Jul 26 14:59:46 systemd-testsuite test-dropin.sh[29]: + check_ok a1 Wants y.service
Jul 26 14:59:46 systemd-testsuite test-dropin.sh[29]: + '[' 3 -eq 3 ']'
Jul 26 14:59:46 systemd-testsuite test-dropin.sh[29]: ++ systemctl show --value -p Wants a1
Jul 26 14:59:46 systemd-testsuite test-dropin.sh[29]: + x='x.service y.service'
Jul 26 14:59:46 systemd-testsuite test-dropin.sh[29]: + case "$x" in
Jul 26 14:59:46 systemd-testsuite test-dropin.sh[29]: + return 0
Jul 26 14:59:46 systemd-testsuite systemd[1]: testsuite.service: Main process exited, code=exited, status=1/FAILURE
Jul 26 14:59:46 systemd-testsuite systemd[1]: testsuite.service: Failed with result 'exit-code'.
Jul 26 14:59:46 systemd-testsuite systemd[1]: Failed to start Testsuite service.

In this part of the code:

check_ok a1 Wants x.service # see [2]
check_ko a1 Wants y.service

With appropriate description here:

# A weird behavior: the dependencies for 'a' may vary. It can be
# changed by loading an alias...
#
# [1] 'a1' is loaded and then "renamed" into 'a'. 'a1' is therefore
# part of the names set so all its specific dropins are loaded.
#
# [2] 'a' is already loaded. 'a1' is simply only merged into 'a' so
# none of its dropins are loaded ('y' is missing from the deps).

@lgtm-com

lgtm-com bot commented Jul 27, 2019

This pull request introduces 3 alerts when merging f74e4b4 into 6fd79cc - view on LGTM.com

new alerts:

  • 3 for FIXME comment

@lgtm-com

lgtm-com bot commented Jul 27, 2019

This pull request introduces 3 alerts when merging bb0e86a into 6fd79cc - view on LGTM.com

new alerts:

  • 3 for FIXME comment

@keszybz keszybz removed the ci-fails/needs-rework 🔥 Please rework this, the CI noticed an issue with the PR label Jul 27, 2019
if (src_name_type < 0)
return log_notice_errno(src_name_type,
"%s: not a valid unit name \"%s\": %m", filename, src);
if (src_name_type == UNIT_NAME_INSTANCE)
return log_notice_errno(SYNTHETIC_ERRNO(EINVAL),
"%s: unit symlink has instance name, rejecting.", filename);

src_type = unit_name_to_type(src);
assert(src_type >= 0); /* unit_name_classify() checked the suffix already */
Member

comment needs update

@poettering poettering added this to the v243 milestone Jul 29, 2019
Without this, repeated runs of "make -C TEST/... setup" fail when trying
to create the symlink.
I adjusted the tests to pass. I don't think the behaviour makes much sense,
even if we ignore the issue with "lazy loading" of aliases. E.g. in the
last section, the fact that dropins for yup@.service and yup@3.service are
not loaded seems to be a plain old bug.
It turns out most possible symlinks are invalid, because the type has to match,
and template units can only be linked to template units.

I'm not sure if the existing code made the same checks consistently. At least
I don't see the same rules expressed in a single place.
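
As a rough illustration of the two rules named in this commit message (the type must match, and a template can only be linked to a template), here is a minimal Python sketch. The helper names are hypothetical and the name parsing is deliberately simplified; the real checks in systemd cover more cases than these two.

```python
def unit_type(name: str) -> str:
    """Return the unit type suffix, e.g. 'service' for a.service."""
    return name.rpartition(".")[2]


def classify(name: str) -> str:
    """Return 'plain', 'template' or 'instance' (simplified parsing)."""
    stem = name.rpartition(".")[0]
    if "@" not in stem:
        return "plain"
    return "template" if stem.endswith("@") else "instance"


def alias_is_valid(src: str, dst: str) -> bool:
    """Check only the two rules stated in the commit message."""
    # Rule 1: the unit type has to match.
    if unit_type(src) != unit_type(dst):
        return False
    # Rule 2: a template unit can only be linked to a template unit.
    if classify(src) == "template" and classify(dst) != "template":
        return False
    return True


print(alias_is_valid("a.service", "b.service"))    # same type
print(alias_is_valid("a.service", "b.socket"))     # type mismatch
print(alias_is_valid("a@.service", "b.service"))   # template -> non-template
```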
This reworks how we load units from disk. Instead of chasing symlinks every
time we are asked to load a unit by name, we slurp all symlinks from disk
and build two hashmaps:
1. from unit name to either alias target, or fragment on disk
   (if an alias, we put just the target name in the hashmap, if a fragment
    we put an absolute path, so we can distinguish both).
2. from a unit name to all aliases

Reading all this data can be pretty costly (40 ms) on my machine, so we keep it
around for reuse.

The advantage is that we can reliably know what all the aliases of a given unit
are. This means we can reliably load dropins under all names. This fixes systemd#11972.
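
The two maps described in this commit message could look roughly like the following Python sketch. It is an assumption-laden illustration, not the C implementation: `build_maps`, `resolve` and the sample unit names are made up, and the directory scan is replaced by a ready-made dictionary. The one detail taken from the message is that an absolute path marks a fragment and a bare name marks an alias target.

```python
def resolve(ids: dict[str, str], name: str) -> str:
    """Follow alias targets until a name that maps to a fragment path."""
    seen = set()
    while name in ids and not ids[name].startswith("/"):
        if name in seen:
            raise ValueError(f"alias loop at {name}")
        seen.add(name)
        name = ids[name]
    return name


def build_maps(entries: dict[str, str]):
    """entries: unit name -> alias target name, or absolute fragment path."""
    # Map 1: name -> alias target (bare name) or fragment (absolute path).
    ids = dict(entries)
    # Map 2: unit name -> every name that resolves to it (itself included).
    names: dict[str, set[str]] = {}
    for name in ids:
        names.setdefault(resolve(ids, name), set()).add(name)
    return ids, names


# Hypothetical scan result: one fragment and a two-step alias chain.
ids, names = build_maps({
    "a.service": "/etc/systemd/system/a.service",
    "a1.service": "a.service",
    "a2.service": "a1.service",
})
print(sorted(names["a.service"]))
```

With the second map in hand, collecting drop-ins "under all names" becomes a loop over `names[unit]` instead of a fresh symlink chase per request.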
I'm not convinced that this is useful enough to be included... But it is
certainly nice when debugging.
v2:
- do not watch mtime of transient and generated dirs

  We'd reload the map after every transient unit we created, which we don't
  need to do, since we create those units ourselves and know their fragment
  path.
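
One way to picture the caching described above is a snapshot of directory mtimes that is compared before the maps are reused. The Python sketch below is a guess at the general shape only: the function names and the `skip` parameter are hypothetical, and the thread does not say how systemd stores or compares the timestamps.

```python
import os
import tempfile


def snapshot_mtimes(dirs, skip=()):
    """Record the mtime (in ns) of every directory that is being watched."""
    snap = {}
    for d in dirs:
        if d in skip:
            continue  # e.g. transient and generated dirs are not watched
        try:
            snap[d] = os.stat(d).st_mtime_ns
        except FileNotFoundError:
            snap[d] = None
    return snap


def cache_is_valid(cached, dirs, skip=()):
    """The cached maps can be reused if no watched directory changed."""
    return cached == snapshot_mtimes(dirs, skip)


with tempfile.TemporaryDirectory() as root:
    system = os.path.join(root, "system")
    transient = os.path.join(root, "transient")
    os.mkdir(system)
    os.mkdir(transient)
    dirs = [system, transient]

    snap = snapshot_mtimes(dirs, skip={transient})
    os.utime(transient, ns=(1, 1))   # change in a skipped dir: cache stays valid
    print(cache_is_valid(snap, dirs, skip={transient}))
    os.utime(system, ns=(1, 1))      # change in a watched dir: cache is stale
    print(cache_is_valid(snap, dirs, skip={transient}))
```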
@keszybz
Member Author

keszybz commented Jul 30, 2019

Another update. I added an extensive test to verify the unit loading behaviour. It is first added in a version that passes before any changes, and then when the code is changed, the test is updated again. Somewhat surprisingly, the results for the old code don't make much sense.

The final result is not quite as you describe in #13119 (comment) (difference is crossed out):

  1. b@foo.service
  2. a@.service.d/*.conf (crossed out)
  3. a@foo.service.d/*.conf
  4. b@.service.d/*.conf
  5. b@foo.service.d/*.conf

But not:

  • a@.service (i.e. that this is a symlink is irrelevant to us, we already found a unit with the a@foo.service chain)
  • c@.service
  • c@foo.service
  • c@foo.service.d/*.conf
  • c@.service.d/*.conf

My thinking is: if we aliased a unit "away" from some template, this unit does not use this template, and we load neither the template fragment, nor any dropins declared for the template. OTOH, we load both the fragment and the dropins for the new template. Of course, we load all dropins for the instance, under all names. This means that declaring something directly in the template fragment and in the template dropin is equivalent.
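
The reasoning in this comment can be written down as a small Python sketch. It is an illustration under assumptions, not the merged code: the function names and the `symlinks` dictionary are hypothetical, and the order of the result simply follows the list above rather than any precedence systemd applies when merging settings.

```python
def template_of(name: str) -> str:
    """Map an instance name to its template, e.g. b@foo.service -> b@.service."""
    prefix, rest = name.split("@", 1)
    return f"{prefix}@{rest[rest.rindex('.'):]}"


def final_read_order(name: str, symlinks: dict[str, str]) -> list[str]:
    """List what is read for the instance `name` under the rule described here."""
    chain = [name]
    while chain[-1] in symlinks:
        target = symlinks[chain[-1]]
        if target in chain:
            raise ValueError(f"symlink loop at {target}")
        chain.append(target)

    final = chain[-1]
    order = [final]
    for n in chain:
        # Template drop-ins are loaded only for the template the unit ends up
        # using; templates it was aliased away from are skipped.
        if n == final:
            order.append(f"{template_of(n)}.d/*.conf")
        # Instance drop-ins are loaded under all names.
        order.append(f"{n}.d/*.conf")
    return order


symlinks = {"a@foo.service": "b@foo.service", "a@.service": "c@.service"}
for entry in final_read_order("a@foo.service", symlinks):
    print(entry)
```

Compared with the earlier proposal, the only difference is that a@.service.d/*.conf drops out, so a setting behaves the same whether it is declared in a template fragment or in a drop-in for that template.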

I corrected some minor bugs on the way and cleaned up the debugging statements and FIXMEs. (One remains, but that's something I'm leaving for later.) PTAL.

@lgtm-com

lgtm-com bot commented Jul 30, 2019

This pull request introduces 1 alert when merging 8027654 into 2d1b928 - view on LGTM.com

new alerts:

  • 1 for FIXME comment

@poettering poettering added the good-to-merge/waiting-for-ci 👍 PR is good to merge, but CI hasn't passed at time of review. Please merge if you see CI has passed label Jul 30, 2019
@poettering
Member

hmm, lots of CI failures?

@keszybz
Member Author

keszybz commented Jul 30, 2019

It seems to be timeouts, e.g. on systemd-update-hwdb.service. I restarted the tests.

@keszybz
Member Author

keszybz commented Jul 30, 2019

Failed to transfer image: Connection timed out — yeah, that's not likely to be related.

@keszybz
Member Author

keszybz commented Jul 30, 2019

# fix paths in manpages; manually check the remaining /usr occurrences
# occasionally, with filtering out paths which are known to be in /usr:
# grep -r /usr debian/install/deb/usr/share/man/|egrep -v '/usr/local|os.*release|factory|zoneinfo|tmpfiles|kernel|foo|machines|sysctl|dbus|include|binfmt'
find debian/install/deb/usr/share/man/ -type f | xargs sed -ri 's_/usr(/lib/systemd/system|/lib/systemd/network|/lib/udev|/lib[^/]|/lib/[^a-z])_\1_g'
find: ‘debian/install/deb/usr/share/man/’: No such file or directory
sed: no input files

@poettering poettering merged commit 5756bff into systemd:master Jul 30, 2019
@evverx
Member

evverx commented Jul 30, 2019

@keszybz @poettering could you at least cc @ddstreet when something like this happens? It looks like that particular issue can be fixed by passing -Dman=true to meson in debian/rules.

Another option (especially after #13221) would be to stop pretending anyone looks at Ubuntu CI when PRs are opened and simply turn it off.

@evverx
Member

evverx commented Jul 30, 2019

@keszybz I'm not sure how you restarted Ubuntu CI but it seems the tests were run against the "master" branch of the Debian package where @mbiebl reverted the commit where -Dman=true was passed to meson in https://salsa.debian.org/systemd-team/systemd/commit/ccf7a5dc83928beac5bdadfabef911e6082131e2. The test should be run against the "experimental" branch. That commit hasn't been reverted there.

@keszybz
Member Author

keszybz commented Jul 30, 2019

Oh, I just use the ./retry-gh-systemd-test that I got from @martinpitt ages ago. It's possible that it is somehow outdated...

@keszybz keszybz deleted the unit-loading-2 branch July 30, 2019 17:12
@evverx
Member

evverx commented Jul 30, 2019

@keszybz could you add the script to the repository (without the api keys) so that we could more or less keep it up to date? By the way, we discussed this fragile scheme (which we also rely on on Semaphore) in #12980.

@ddstreet
Contributor

> @keszybz @poettering could you at least cc @ddstreet when something like this happens?

Yes please do - I'm happy to look at any Ubuntu CI failures/problems. I'm working on getting the Ubuntu CI more stable/reliable, as well.

keszybz added a commit to keszybz/systemd that referenced this pull request Dec 21, 2019
This mostly reuses existing checkers used by pid1, so handling of aliases
should be consistent. Hopefully, with the test it'll be clearer what is
happening.

Support for .wants/.requires "aliases" is restored. Those are still used in the
wild quite a bit, so we need to support them.

See systemd#13119 for a discussion of aliases
with an instance that points to a different template: this is allowed.
@keszybz keszybz mentioned this pull request Dec 21, 2019
keszybz added a commit to keszybz/systemd that referenced this pull request Jan 9, 2020
keszybz added a commit to keszybz/systemd that referenced this pull request Jan 10, 2020
Labels
good-to-merge/waiting-for-ci 👍 PR is good to merge, but CI hasn't passed at time of review. Please merge if you see CI has passed pid1

5 participants