
sandbox/cgroup: wait for start transient unit job to finish #10860

Merged

Conversation

bboozzoo
Collaborator

The snap binary calls the appropriate systemd instance to start a transient unit
that wraps the scope of the snap application. The code used to implement a busy
loop, checking whether the current process had been moved to the new unit. However,
we should actually implement a complete job handling sequence like systemd-run
does, that is, wait for the JobRemoved signal that matches our create-transient-unit
request.

Signed-off-by: Maciej Borzecki <maciej.zenon.borzecki@canonical.com>
@codecov-commenter

codecov-commenter commented Sep 29, 2021

Codecov Report

Merging #10860 (7d38e12) into master (482f8c7) will increase coverage by 0.01%.
The diff coverage is 91.07%.

@@            Coverage Diff             @@
##           master   #10860      +/-   ##
==========================================
+ Coverage   78.18%   78.20%   +0.01%     
==========================================
  Files         910      910              
  Lines      102833   102933     +100     
==========================================
+ Hits        80402    80496      +94     
- Misses      17408    17410       +2     
- Partials     5023     5027       +4     
Flag Coverage Δ
unittests 78.20% <91.07%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
sandbox/cgroup/tracking.go 93.98% <91.07%> (+0.93%) ⬆️
osutil/synctree.go 76.41% <0.00%> (-2.84%) ⬇️
daemon/api_connections.go 93.58% <0.00%> (+0.53%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 482f8c7...7d38e12. Read the comment docs.

Contributor

@mardy mardy left a comment


LGTM, just a few minor comments here and there.

dbusutil/dbustest/dbustest.go (outdated; resolved)
// only in tests
continue
}
if sig.Name != "org.freedesktop.systemd1.Manager.JobRemoved" {
Contributor


minor: since you care only about this signal, I'd add a WithMatchMember to the subscription before, so we avoid being woken up twice.

Collaborator Author


Done

Contributor


Thanks! But is this check still needed then?

Collaborator Author


Just being a bit paranoid about the matched signal.

}
case sig := <-signals:
if sig == nil {
// only in tests
Contributor


This is not very clear to me; can't the tests be fixed not to send a null signal?

Collaborator Author


This is fixed now.

closeChan := make(chan struct{})
defer close(closeChan)
signals := make(chan *dbus.Signal, 10)
jobResultChan := make(chan string, 1)
Contributor


I'm entering uncharted (for me) golang territory here, but wouldn't it be better if this channel transmitted error types?
Then you could send back an error if operations like dbus.Store(...) below fail, send nil if the Job status is done, and create another error otherwise.

defer close(closeChan)
signals := make(chan *dbus.Signal, 10)
jobResultChan := make(chan string, 1)
jobWaitFor := make(chan string, 1)
Contributor


Can we use dbus.ObjectPath here? Then we wouldn't need the string casts.

Signed-off-by: Maciej Borzecki <maciej.zenon.borzecki@canonical.com>
The condition/lock/shared buffer thingy was failing when running tests with
-count=100 due to the lack of proper serialization. Refactor the code to use
channels, such that it is simpler to follow.

Signed-off-by: Maciej Borzecki <maciej.zenon.borzecki@canonical.com>
@bboozzoo bboozzoo force-pushed the bboozzoo/snap-run-wait-systemd-group-job-done branch from 9118ef9 to f7f4f8f Compare September 29, 2021 14:10
Signed-off-by: Maciej Borzecki <maciej.zenon.borzecki@canonical.com>
Signed-off-by: Maciej Borzecki <maciej.zenon.borzecki@canonical.com>
Signed-off-by: Maciej Borzecki <maciej.zenon.borzecki@canonical.com>
Signed-off-by: Maciej Borzecki <maciej.zenon.borzecki@canonical.com>
Signed-off-by: Maciej Borzecki <maciej.zenon.borzecki@canonical.com>
Contributor

@mardy mardy left a comment


LGTM, just please have a look at the question inline; it may be that my concern is unfounded.

// only in tests
continue
}
if sig.Name != "org.freedesktop.systemd1.Manager.JobRemoved" {
Contributor


Thanks! But is this check still needed then?

sandbox/cgroup/tracking.go (outdated; resolved)
@anonymouse64 anonymouse64 self-requested a review October 1, 2021 15:39
Member

@anonymouse64 anonymouse64 left a comment


Some things are a bit unclear to me, whether they are correct or the easiest/simplest way to approach them, but overall I think this is the right direction to move in instead of blindly doing the busy loop like we were before. Thanks for improving that, and thanks for improving the dbustest package in this way.

if err := conn.AddMatchSignal(jobRemoveMatch...); err != nil {
return fmt.Errorf("cannot subscribe to systemd signals: %v", err)
}
signals := make(chan *dbus.Signal, 10)
Member


What's special about the number 10 here?

Collaborator Author


Just leaving some room for signals in case there's a lot going on in systemd.

@@ -284,6 +296,63 @@ var doCreateTransientScope = func(conn *dbus.Conn, unitName string, pid int) err
properties,
aux,
)
var wg sync.WaitGroup
Member


A comment here about the flow we are performing would be nice to see: essentially walking through the various steps of the process, the order we expect them to happen in, and possibly even pointers to where in systemd-run this code was inspired from. That would be really helpful in verifying the correctness of the behavior here.

expectedJob := dbus.ObjectPath("")
for {
select {
case job, ok := <-jobWaitFor:
Member


One thing that is not clear from this code is how many times we expect to hit this codepath: how many jobs do we expect to be sent on jobWaitFor? It seems like just one?

Could we instead just block normally to receive the job on the jobWaitFor channel first, and then go into the infinite for loop getting results from the signals channel? That would simplify understanding the flow of things here across routines a bit.

Collaborator Author


Hm, I don't think there is a guarantee that by the time Call/Store() finishes, the signal hasn't already been sent by systemd. We could work around that by having a non-zero buffer, but we still have to inspect the job path or unit, as there may have been other jobs in progress.

Comment on lines 301 to 305
defer func() {
close(closeChan)
wg.Wait()
}()
wg.Add(1)
Member


To be clear, this code is necessary so we don't leak a goroutine when/if we have to return an error from this overall function after we have started the goroutine?

Collaborator Author


Yeah, just to make sure that it's gone as we return.

}

func (s *testDBusStream) decodeRequest() {
func (s *testDBusStream) decodeRequest(req []byte) {
buf := bytes.NewBuffer(req)
// s.m is locked
Member


the comments about s.m are no longer needed

Collaborator Author


removed

@@ -34,10 +34,10 @@ const testDBusClientName = ":test"
// DBusHandlerFunc is the type of handler function for interacting with test DBus.
//
// The handler is called for each message that arrives to the bus from the test
// client. The handler can respond by returning zero or more messages.
Member


It would have been kinda nice to have these changes in a separate PR, but I understand that's not quite how things evolved.

Collaborator Author


This one will likely be blocked, so I can look into extracting the relevant bits to a separate PR

conn, err := dbus.NewConn(newTestDBusStream(handler))
type InjectMessageFunc func(msg *dbus.Message)

func InjectableConnection(handler DBusHandlerFunc) (*dbus.Conn, InjectMessageFunc, error) {
Member


This could use a doc-comment, especially an example of what situation one would want to use InjectableConnection() for instead of Connection().

Collaborator Author


added

@bboozzoo
Collaborator Author

bboozzoo commented Oct 5, 2021

Per standup discussion, I'll push a workaround to use the new code path always on cgroup v2 systems, but leave it as it is on v1. Added a blocked label for now.

Signed-off-by: Maciej Borzecki <maciej.zenon.borzecki@canonical.com>
Some distros use an older and broken systemd, where creating user scopes always
fails. Fortunately, those systems also use cgroup v1. On cgroup v2 we absolutely
need to have a scope, as otherwise we risk manipulating the wrong cgroup. It so
happens that v2 systems also use a newer version of systemd which can create
user scopes.

Signed-off-by: Maciej Borzecki <maciej.zenon.borzecki@canonical.com>
The spread tests use that line as a canary. Make sure that it pops up for both
v2 and v1 code paths.

Signed-off-by: Maciej Borzecki <maciej.zenon.borzecki@canonical.com>
…and 21.10

Signed-off-by: Maciej Borzecki <maciej.zenon.borzecki@canonical.com>
Signed-off-by: Maciej Borzecki <maciej.zenon.borzecki@canonical.com>
…ude systems

Signed-off-by: Maciej Borzecki <maciej.zenon.borzecki@canonical.com>
@anonymouse64 anonymouse64 self-requested a review October 26, 2021 12:09
Signed-off-by: Maciej Borzecki <maciej.zenon.borzecki@canonical.com>
for {
select {
case job, ok := <-jobWaitFor:
if !ok {
Contributor


Can we reach the !ok case for this channel? That would normally happen only if it was explicitly closed, but we are not doing that. Could we close it to signify the end of processing and get rid of closeChan (moving the conn.RemoveSignal... logic here)?

Collaborator Author


Tweaked the code a bit; I also noticed that the signals channel wasn't really closed, so I fixed that too.

Contributor


Thanks for these changes.

// establishing a device cgroup filtering in the wrong group
return doCreateTransientScopeJobRemovedSync(conn, unitName, pid)
}
//
Contributor


Remove // ?

Signed-off-by: Maciej Borzecki <maciej.zenon.borzecki@canonical.com>
Contributor

@stolowski stolowski left a comment


The channel logic looks fine, a few minor suggestions. Thanks

expectedJob = job
}
case sig, ok := <-signals:
if !ok {
Contributor


We should never reach !ok since we close(signals) and exit the select above, but it's fine to have an extra check; might be worth a comment?

Collaborator Author


We also pass the signals channel as an argument to go-dbus. I saw some code there that closes the signals channel in the cleanup path.

for {
select {
case job, ok := <-jobWaitFor:
if !ok {
Contributor


Thanks for these changes.

if result != "done" {
return fmt.Errorf("transient scope could not be started, job %v finished with result %v", job, result)
}
case <-timeout.C:
Contributor


Nitpick: maybe just case <-time.After(...): and then timeout := time.NewTimer(...) is not needed?

Thanks to @stolowski for the suggestion

Signed-off-by: Maciej Borzecki <maciej.zenon.borzecki@canonical.com>
@bboozzoo
Collaborator Author

@mvo5 this is ready to land now, can you merge it?

@mvo5 mvo5 merged commit 547339e into snapcore:master Nov 2, 2021