sandbox/cgroup: wait for start transient unit job to finish #10860
Conversation
The snap binary calls the appropriate systemd instance to start a transient unit that wraps the scope of the snap application. The code used to implement a busy loop, checking whether the current process has been moved to the new unit. However, we should actually implement a complete job handling sequence like systemd-run does, that is, wait for the JobRemoved signal that matches our create-transient-unit request. Signed-off-by: Maciej Borzecki <maciej.zenon.borzecki@canonical.com>
Codecov Report
@@            Coverage Diff             @@
##           master   #10860      +/-   ##
==========================================
+ Coverage   78.18%   78.20%   +0.01%
==========================================
  Files         910      910
  Lines      102833   102933     +100
==========================================
+ Hits        80402    80496      +94
- Misses      17408    17410       +2
- Partials     5023     5027       +4
LGTM, just a few minor comments here and there.
	// only in tests
	continue
}
if sig.Name != "org.freedesktop.systemd1.Manager.JobRemoved" {
minor: since you care only about this signal, I'd add a WithMatchMember
to the subscription before, so we avoid being woken up twice.
Done
Thanks! But is this check still needed then?
Just being a bit paranoid about the matched signal
sandbox/cgroup/tracking.go
Outdated
}
case sig := <-signals:
	if sig == nil {
		// only in tests
This is not very clear to me; can't the tests be fixed not to send a null signal?
This is fixed now.
closeChan := make(chan struct{})
defer close(closeChan)
signals := make(chan *dbus.Signal, 10)
jobResultChan := make(chan string, 1)
I'm entering uncharted (for me) golang territory here, but wouldn't it be better if this channel transmitted error types? Then you could send back an error if operations like dbus.Store(...) below fail, send nil if the Job status is done, and create another error otherwise.
sandbox/cgroup/tracking.go
Outdated
defer close(closeChan)
signals := make(chan *dbus.Signal, 10)
jobResultChan := make(chan string, 1)
jobWaitFor := make(chan string, 1)
Can we use dbus.ObjectPath here? Then we wouldn't need the string casts.
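The suggestion can be illustrated with plain Go. `ObjectPath` here is a local stand-in for `dbus.ObjectPath`, which in go-dbus is itself a named string type:

```go
package main

import "fmt"

// ObjectPath is a stand-in for dbus.ObjectPath (a named string type).
type ObjectPath string

func main() {
	// Typing the channel as ObjectPath avoids string casts at every
	// send and receive site.
	jobWaitFor := make(chan ObjectPath, 1)
	jobWaitFor <- ObjectPath("/org/freedesktop/systemd1/job/42")
	job := <-jobWaitFor // already an ObjectPath, no cast needed
	fmt.Println(job)
}
```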
Signed-off-by: Maciej Borzecki <maciej.zenon.borzecki@canonical.com>
The condition/lock/shared buffer thingy was failing when running tests with -count=100 due to a lack of proper serialization. Refactor the code to use channels, so that it is simpler to follow. Signed-off-by: Maciej Borzecki <maciej.zenon.borzecki@canonical.com>
Force-pushed from 9118ef9 to f7f4f8f (compare)
Signed-off-by: Maciej Borzecki <maciej.zenon.borzecki@canonical.com>
…-wait-systemd-group-job-done
LGTM, just please have a look at the question inline, but it may be that my concern is unfounded.
	// only in tests
	continue
}
if sig.Name != "org.freedesktop.systemd1.Manager.JobRemoved" {
Thanks! But is this check still needed then?
Some things are a bit unclear to me, namely whether they are correct or the easiest/simplest way to approach them, but overall I think it is the right direction to move in instead of blindly doing the busy loop like we were before. So thanks for improving that, and thanks for improving the dbustest package in this way.
if err := conn.AddMatchSignal(jobRemoveMatch...); err != nil {
	return fmt.Errorf("cannot subscribe to systemd signals: %v", err)
}
signals := make(chan *dbus.Signal, 10)
what's special about the number 10 here?
Just making some room for signals in case there's a lot going on in systemd.
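The headroom idea can be sketched with a plain buffered channel. `trySend` is a hypothetical helper modeling non-blocking delivery; whether go-dbus actually drops or blocks on a full signals channel may differ by version:

```go
package main

import "fmt"

// trySend models a signal dispatcher that must not stall: if the
// channel is full, the "signal" is lost rather than delivered late.
func trySend(ch chan int, v int) bool {
	select {
	case ch <- v:
		return true
	default:
		return false // channel full: signal would be lost
	}
}

func main() {
	signals := make(chan int, 10)
	lost := 0
	// A burst of 12 "signals" arrives before the consumer loop runs.
	for i := 0; i < 12; i++ {
		if !trySend(signals, i) {
			lost++
		}
	}
	fmt.Println("buffered:", len(signals), "lost:", lost) // buffered: 10 lost: 2
}
```

A buffer of 10 simply gives the consumer slack for bursts of unrelated systemd activity between reads.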
@@ -284,6 +296,63 @@ var doCreateTransientScope = func(conn *dbus.Conn, unitName string, pid int) err
	properties,
	aux,
)
var wg sync.WaitGroup
A comment here about the flow we are performing would be nice to see: essentially walking through the various steps of the process, the order we expect them to happen in, and possibly even pointers to where in systemd-run this code was inspired from. That would be really helpful in verifying the correctness of the behavior here.
expectedJob := dbus.ObjectPath("")
for {
	select {
	case job, ok := <-jobWaitFor:
One thing that is not clear from this code is how many times we expect to hit this codepath: how many jobs do we expect to be sent on jobWaitFor? It seems like just one? Could we instead just block normally to receive the job on the jobWaitFor channel first, and then go into the infinite for loop getting results from the signals channel? That would simplify understanding the flow of things here across routines a bit.
Hm, I don't think there is a guarantee that by the time Call()/Store() finish, the signal couldn't have been sent by systemd already. We could work around that by having a non-zero buffer, but we would still have to inspect the job path or unit, as there may have been other jobs in progress.
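A minimal sketch of why buffering makes the "block on jobWaitFor first" idea workable: an early signal simply waits in the channel buffer, and paths must still be compared because other jobs may finish in between. String paths and the fake signal source below are stand-ins for the real dbus.ObjectPath values:

```go
package main

import "fmt"

// matchJob consumes JobRemoved-style signals until one matches the job
// path learned from jobWaitFor. Because the signals channel is buffered,
// a signal emitted before Call()/Store() return is not lost; it waits in
// the buffer until we know which job is ours.
func matchJob(jobWaitFor <-chan string, signals <-chan string) string {
	expected := <-jobWaitFor
	for sig := range signals {
		if sig == expected { // other jobs may finish in between
			return sig
		}
	}
	return ""
}

func main() {
	signals := make(chan string, 10)
	jobWaitFor := make(chan string, 1)

	// Simulate the race: JobRemoved arrives before the caller even
	// knows which job path is its own.
	signals <- "/job/6" // some unrelated job
	signals <- "/job/7" // ours, already queued
	close(signals)
	jobWaitFor <- "/job/7"

	fmt.Println("matched:", matchJob(jobWaitFor, signals))
}
```

This prints `matched: /job/7` even though the signal was "delivered" before the job path was known.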
sandbox/cgroup/tracking.go
Outdated
defer func() {
	close(closeChan)
	wg.Wait()
}()
wg.Add(1)
To be clear, this code is necessary so we don't leak a goroutine when/if we have to return an error from this overall function after we have started the goroutine?
Yeah, just to make sure that it's gone as we return.
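The cleanup pattern under discussion can be sketched in isolation. `runWithWatcher` is a hypothetical stand-in for the setup in doCreateTransientScope:

```go
package main

import (
	"fmt"
	"sync"
)

// runWithWatcher starts a helper goroutine and guarantees it has exited
// by the time runWithWatcher returns, on every return path, including
// early error returns.
func runWithWatcher() {
	var wg sync.WaitGroup
	closeChan := make(chan struct{})
	defer func() {
		close(closeChan)
		wg.Wait() // block until the watcher has actually exited
	}()
	wg.Add(1)
	go func() {
		defer wg.Done()
		// The real code also selects on a signals channel here.
		<-closeChan
	}()
	// ... the StartTransientUnit call and its error returns go here ...
}

func main() {
	runWithWatcher()
	fmt.Println("returned with no leaked goroutine")
}
```

Without the `wg.Wait()` the function could return while the goroutine is still running; without the `close(closeChan)` the goroutine would never be told to stop.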
dbusutil/dbustest/dbustest.go
Outdated
}

-func (s *testDBusStream) decodeRequest() {
+func (s *testDBusStream) decodeRequest(req []byte) {
	buf := bytes.NewBuffer(req)
	// s.m is locked
the comments about s.m are no longer needed
removed
@@ -34,10 +34,10 @@ const testDBusClientName = ":test"
// DBusHandlerFunc is the type of handler function for interacting with test DBus.
//
// The handler is called for each message that arrives to the bus from the test
// client. The handler can respond by returning zero or more messages.
it would have been kinda nice to have these changes in a separate PR but I understand that's not quite how things evolved
This one will likely be blocked, so I can look into extracting the relevant bits to a separate PR
conn, err := dbus.NewConn(newTestDBusStream(handler))

type InjectMessageFunc func(msg *dbus.Message)

func InjectableConnection(handler DBusHandlerFunc) (*dbus.Conn, InjectMessageFunc, error) {
this could use a doc-comment, especially an example of what situation one would want to use InjectableConnection() for instead of Connection()
added
Per standup discussion, I'll push a workaround to use the new code path always on cgroup v2 systems, but leave it as it is on v1. Added a blocked label for now.
Signed-off-by: Maciej Borzecki <maciej.zenon.borzecki@canonical.com>
Some distros use an older and broken systemd, where creating user scopes always fails. Fortunately, those systems also use cgroup v1. On cgroup v2 we absolutely need to have a scope, as otherwise we risk manipulating the wrong cgroup. It so happens that v2 systems also use a newer version of systemd which can create user scopes. Signed-off-by: Maciej Borzecki <maciej.zenon.borzecki@canonical.com>
…-wait-systemd-group-job-done-wip
…-wait-systemd-group-job-done
The spread tests use that line as a canary. Make sure that it pops up for both v2 and v1 code paths. Signed-off-by: Maciej Borzecki <maciej.zenon.borzecki@canonical.com>
…and 21.10 Signed-off-by: Maciej Borzecki <maciej.zenon.borzecki@canonical.com>
…-wait-systemd-group-job-done
Signed-off-by: Maciej Borzecki <maciej.zenon.borzecki@canonical.com>
…ude systems Signed-off-by: Maciej Borzecki <maciej.zenon.borzecki@canonical.com>
Signed-off-by: Maciej Borzecki <maciej.zenon.borzecki@canonical.com>
for {
	select {
	case job, ok := <-jobWaitFor:
		if !ok {
Can we reach the !ok case for this channel? That would normally happen if it was explicitly closed, but we are not doing that. Could we close it to signify the end of processing and get rid of closeChan (moving the conn.RemoveSignal.. logic here)?
Tweaked the code a bit, also noticed that the signals channel isn't really closed so I fixed that too
Thanks for these changes.
sandbox/cgroup/tracking.go
Outdated
// establishing a device cgroup filtering in the wrong group
return doCreateTransientScopeJobRemovedSync(conn, unitName, pid)
}
//
Remove // ?
Signed-off-by: Maciej Borzecki <maciej.zenon.borzecki@canonical.com>
…-wait-systemd-group-job-done
The channel logic looks fine, a few minor suggestions. Thanks
	expectedJob = job
}
case sig, ok := <-signals:
	if !ok {
We should never reach !ok since we close(signals) and exit the select above, but it's fine to have an extra check; might be worth a comment?
We also pass the signals channel as an argument to go-dbus. I saw some code there that closes the signals channel in the cleanup path.
sandbox/cgroup/tracking.go
Outdated
if result != "done" {
	return fmt.Errorf("transient scope could not be started, job %v finished with result %v", job, result)
}
case <-timeout.C:
Nitpick: maybe just case <-time.After(...): and then timeout := time.NewTimer.. is not needed?
Thanks to @stolowski for the suggestion Signed-off-by: Maciej Borzecki <maciej.zenon.borzecki@canonical.com>
@mvo5 this is ready to land now, can you merge it?