
sandbox/cgroup: wait for start transient unit job to finish #10860

Merged

Conversation

bboozzoo
Collaborator

The snap binary calls the appropriate systemd instance to start a transient unit
that wraps the scope of the snap application. The code used to implement a busy
loop, checking whether the current process had been moved to the new unit. However,
we should actually implement a complete job handling sequence like systemd-run
does, that is, wait for the JobRemoved signal that matches our create-transient-unit
request.

Signed-off-by: Maciej Borzecki <maciej.zenon.borzecki@canonical.com>
@codecov-commenter

codecov-commenter commented Sep 29, 2021

Codecov Report

Merging #10860 (7d38e12) into master (482f8c7) will increase coverage by 0.01%.
The diff coverage is 91.07%.

@@            Coverage Diff             @@
##           master   #10860      +/-   ##
==========================================
+ Coverage   78.18%   78.20%   +0.01%     
==========================================
  Files         910      910              
  Lines      102833   102933     +100     
==========================================
+ Hits        80402    80496      +94     
- Misses      17408    17410       +2     
- Partials     5023     5027       +4     
Flag Coverage Δ
unittests 78.20% <91.07%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
sandbox/cgroup/tracking.go 93.98% <91.07%> (+0.93%) ⬆️
osutil/synctree.go 76.41% <0.00%> (-2.84%) ⬇️
daemon/api_connections.go 93.58% <0.00%> (+0.53%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 482f8c7...7d38e12. Read the comment docs.

Contributor

@mardy mardy left a comment


LGTM, just a few minor comments here and there.

dbusutil/dbustest/dbustest.go (outdated; resolved)
// only in tests
continue
}
if sig.Name != "org.freedesktop.systemd1.Manager.JobRemoved" {
Contributor


minor: since you care only about this signal, I'd add a WithMatchMember to the subscription before, so we avoid being woken up twice.

Collaborator Author


Done

Contributor


Thanks! But is this check still needed then?

Collaborator Author


Just being a bit paranoid about the matched signal.

}
case sig := <-signals:
if sig == nil {
// only in tests
Contributor


This is not very clear to me; can't the tests be fixed not to send a null signal?

Collaborator Author


This is fixed now.

closeChan := make(chan struct{})
defer close(closeChan)
signals := make(chan *dbus.Signal, 10)
jobResultChan := make(chan string, 1)
Contributor


I'm entering uncharted (for me) golang territory here, but wouldn't it be better if this channel transmitted error types?
Then you could send back an error if operations like dbus.Store(...) below fail, send nil if the Job status is done, and create another error otherwise.

defer close(closeChan)
signals := make(chan *dbus.Signal, 10)
jobResultChan := make(chan string, 1)
jobWaitFor := make(chan string, 1)
Contributor


Can we use dbus.ObjectPath here? Then we wouldn't need the string casts.

Signed-off-by: Maciej Borzecki <maciej.zenon.borzecki@canonical.com>
The condition/lock/shared buffer thingy was failing when running tests with
-count=100 due to the lack of proper serialization. Refactor the code to use
channels, such that it is simpler to follow.

Signed-off-by: Maciej Borzecki <maciej.zenon.borzecki@canonical.com>
@bboozzoo bboozzoo force-pushed the bboozzoo/snap-run-wait-systemd-group-job-done branch from 9118ef9 to f7f4f8f Compare September 29, 2021 14:10
Signed-off-by: Maciej Borzecki <maciej.zenon.borzecki@canonical.com>
Signed-off-by: Maciej Borzecki <maciej.zenon.borzecki@canonical.com>
Signed-off-by: Maciej Borzecki <maciej.zenon.borzecki@canonical.com>
Signed-off-by: Maciej Borzecki <maciej.zenon.borzecki@canonical.com>
Signed-off-by: Maciej Borzecki <maciej.zenon.borzecki@canonical.com>
Contributor

@mardy mardy left a comment


LGTM, just please have a look at the question inline; it may be that my concern is unfounded.

// only in tests
continue
}
if sig.Name != "org.freedesktop.systemd1.Manager.JobRemoved" {
Contributor


Thanks! But is this check still needed then?

sandbox/cgroup/tracking.go (outdated; resolved)
@anonymouse64 anonymouse64 self-requested a review October 1, 2021 15:39
Member

@anonymouse64 anonymouse64 left a comment


Some things are a bit unclear to me, whether they are correct or the easiest/simplest way to approach them, but overall I think this is the right direction to move in instead of blindly doing the busy loop like we were before. Thanks for improving that, and thanks for improving the dbustest package in this way.

if err := conn.AddMatchSignal(jobRemoveMatch...); err != nil {
return fmt.Errorf("cannot subscribe to systemd signals: %v", err)
}
signals := make(chan *dbus.Signal, 10)
Member


What's special about the number 10 here?

Collaborator Author


Just leaving some room for signals in case there's a lot going on in systemd.

@@ -284,6 +296,63 @@ var doCreateTransientScope = func(conn *dbus.Conn, unitName string, pid int) err
properties,
aux,
)
var wg sync.WaitGroup
Member


A comment here about the flow we are performing would be nice to see: essentially walking through the various steps of the process, the order we expect them to happen in, and possibly even pointers to where in systemd-run this code was inspired from. That would be really helpful in verifying the correctness of the behavior here.

expectedJob := dbus.ObjectPath("")
for {
select {
case job, ok := <-jobWaitFor:
Member


One thing that is not clear from this code is how many times we expect to hit this codepath: how many jobs do we expect to be sent on jobWaitFor? It seems like just one?

Could we instead just block normally to receive the job on the jobWaitFor channel first, and then go into the infinite for loop getting results from the signals channel? That would simplify understanding the flow of things here across routines a bit.

Collaborator Author


Hm, I don't think there is a guarantee that by the time Call/Store() finishes, the signal hasn't already been sent by systemd. We could work around that by having a non-zero buffer, but we still have to inspect the job path or unit, as there may have been other jobs in progress.

Comment on lines 301 to 305
defer func() {
close(closeChan)
wg.Wait()
}()
wg.Add(1)
Member


To be clear, this code is necessary so we don't leak a goroutine when/if we have to return an error from this overall function after we have started the goroutine?

Collaborator Author


Yeah, just to make sure that it's gone as we return.

}

func (s *testDBusStream) decodeRequest() {
func (s *testDBusStream) decodeRequest(req []byte) {
buf := bytes.NewBuffer(req)
// s.m is locked
Member


the comments about s.m are no longer needed

Collaborator Author


removed

@@ -34,10 +34,10 @@ const testDBusClientName = ":test"
// DBusHandlerFunc is the type of handler function for interacting with test DBus.
//
// The handler is called for each message that arrives to the bus from the test
// client. The handler can respond by returning zero or more messages.
Member


It would have been kinda nice to have these changes in a separate PR, but I understand that's not quite how things evolved.

Collaborator Author


This one will likely be blocked, so I can look into extracting the relevant bits to a separate PR

conn, err := dbus.NewConn(newTestDBusStream(handler))
type InjectMessageFunc func(msg *dbus.Message)

func InjectableConnection(handler DBusHandlerFunc) (*dbus.Conn, InjectMessageFunc, error) {
Member


This could use a doc-comment, especially an example of what situation one would want to use InjectableConnection() for instead of Connection().

Collaborator Author


added

@bboozzoo
Collaborator Author

bboozzoo commented Oct 5, 2021

Per standup discussion, I'll push a workaround to use the new code path always on cgroup v2 systems, but leave it as it is on v1. Added a blocked label for now.

Signed-off-by: Maciej Borzecki <maciej.zenon.borzecki@canonical.com>
Some distros use an older and broken systemd, where creating user scopes always
fails. Fortunately, those systems also use cgroup v1. On cgroup v2 we absolutely
need to have a scope, as otherwise we risk manipulating the wrong cgroup. It so
happens that v2 systems also use a newer version of systemd which can create
user scopes.

Signed-off-by: Maciej Borzecki <maciej.zenon.borzecki@canonical.com>
The spread tests use that line as a canary. Make sure that it pops up for both
v2 and v1 code paths.

Signed-off-by: Maciej Borzecki <maciej.zenon.borzecki@canonical.com>
…and 21.10

Signed-off-by: Maciej Borzecki <maciej.zenon.borzecki@canonical.com>
Signed-off-by: Maciej Borzecki <maciej.zenon.borzecki@canonical.com>
…ude systems

Signed-off-by: Maciej Borzecki <maciej.zenon.borzecki@canonical.com>
@anonymouse64 anonymouse64 self-requested a review October 26, 2021 12:09
Signed-off-by: Maciej Borzecki <maciej.zenon.borzecki@canonical.com>
for {
select {
case job, ok := <-jobWaitFor:
if !ok {
Contributor


Can we reach the !ok case for this channel? That would normally happen only if it was explicitly closed, but we are not doing that. Could we close it to signify the end of processing and get rid of closeChan (moving the conn.RemoveSignal... logic here)?

Collaborator Author


Tweaked the code a bit; I also noticed that the signals channel wasn't really closed, so I fixed that too.

Contributor


Thanks for these changes.

// establishing a device cgroup filtering in the wrong group
return doCreateTransientScopeJobRemovedSync(conn, unitName, pid)
}
//
Contributor


Remove // ?

Signed-off-by: Maciej Borzecki <maciej.zenon.borzecki@canonical.com>
Contributor

@stolowski stolowski left a comment


The channel logic looks fine, a few minor suggestions. Thanks

expectedJob = job
}
case sig, ok := <-signals:
if !ok {
Contributor


We should never reach !ok since we close(signals) and exit the select above, but it's fine to have an extra check; might be worth a comment?

Collaborator Author


We also pass the signals channel as an argument to go-dbus. I saw some code there that closes the signals channel in the cleanup path.

for {
select {
case job, ok := <-jobWaitFor:
if !ok {
Contributor


Thanks for these changes.

if result != "done" {
return fmt.Errorf("transient scope could not be started, job %v finished with result %v", job, result)
}
case <-timeout.C:
Contributor


Nitpick: maybe just case <-time.After(...): and then timeout := time.NewTimer(...) is not needed?

Thanks to @stolowski for the suggestion

Signed-off-by: Maciej Borzecki <maciej.zenon.borzecki@canonical.com>
@bboozzoo
Collaborator Author

@mvo5 this is ready to land now, can you merge it?

@mvo5 mvo5 merged commit 547339e into snapcore:master Nov 2, 2021