Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

many: use changes + tasks for quota group operations #10420

Conversation

anonymouse64
Copy link
Member

@anonymouse64 anonymouse64 commented Jun 16, 2021

The exported methods from the servicestate package all return tasksets now,
which is expected to be put into changes that are executed by the overlord
loop. This involves changes in many parts that use quotas, such as the tests,
where a new mock function to create a quota group in state without running any
servicectl commands as well as changes to the client side.

Copy link
Collaborator

@pedronis pedronis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did a pass, found a theoretical issue but I don't think it relates to the spread tests issues, as afaict systemd uses the meters for real only for Stop/Restart

daemon/api_quotas.go Outdated Show resolved Hide resolved
daemon/api_quotas_test.go Outdated Show resolved Hide resolved
overlord/servicestate/quota_control.go Outdated Show resolved Hide resolved
overlord/servicestate/quota_handlers.go Outdated Show resolved Hide resolved
@anonymouse64
Copy link
Member Author

Unfortunately this is blocked now, after reviewing the spread test failures again, it seems that changes like those in 763dd01 are papering over a real issue.

The issue is that we now have a task which gets marked done before it is done doing things. It is okay in the case where we get rebooted in the middle after unlocking the state and come back and re-execute the task, it will not break anything, but we now have race conditions in the following situation (which is what the spread test is doing):

  1. quota-control task is executing
  2. snap set-quota ... command is looping waiting for the change with the quota-control task in it to be in the Done state
  3. quota-control task finishes running wrappers.EnsureSnapServices() and marks itself done, and releases the state lock
  4. the client now sees that the change is Done, so it returns success
  5. The client now assumes that the group is created/updated/removed and that all affected services have been updated, however they have not been and future ops executed immediately fail, such as checking that a service was restarted or killed (due to OOM with the low memory limit like in the test) or etc
  6. The quota-control task continues executing, since there is more code after wrappers.EnsureSnapServices() that needs to be executed, mainly that we restart the services with the state unlocked since restarting snap services is a snap-controlled time option (meaning it could take an arbitrary amount of time due to start-timeout and stop-timeout, etc.)
  7. If the client now tries to check / execute the things in 5) above, they pass since now the task is fully executed

There are a few ways we could solve this, but this is a bit of an unfortunate corner case. Options I can think of right now:

  1. Accept that life is hard and computers triply so and just adjust the spread test to check the things we want to check in a loop
  2. Introduce a new task type and execute the systemctl restart ... bits that we need to run with state unlocked in the second task, this ensures that the first task runs wrappers.EnsureSnapServices() and when it marks itself Done, it will not be run again, leading to problems. Then the change will need to wait for the whole thing to finish, but if the second task gets interrupted or executed again after a reboot it's not a big deal since the operations are idempotent anyways.
  3. Adjust tasks / changes in some other way so that we can mark the task as Done() but not mark the entire change as Done() until all the handler functions have actually returned, but this is a behavior change and may have other unintended effects across the codebase and also would need to be split out into it's own PR separate from this

@pedronis pedronis force-pushed the feature/quota-groups-the-prequel-spinoff-6 branch from 763dd01 to cc3c683 Compare June 25, 2021 12:39
@pedronis
Copy link
Collaborator

This is now rebased on #10458 which should solve the issues we had with the spread tests. At this point the two last commits cancel each other, I have both to keep history but could also be dropped.

anonymouse64 and others added 3 commits June 29, 2021 17:25
The exported methods from the servicestate package all return tasksets now,
which is expected to be put into changes that are executed by the overlord
loop. This involves changes in many parts that use quotas, such as the tests,
where a new mock function to create a quota group in state without running any
servicectl commands as well as changes to the client side.

Signed-off-by: Ian Johnson <ian.johnson@canonical.com>
These checks are also race conditions too

Signed-off-by: Ian Johnson <ian.johnson@canonical.com>
if the quota-control tasks don't set themselves as done prematurely
@pedronis
Copy link
Collaborator

Conflict detection is implemented in #10471 based on this one

@pedronis pedronis marked this pull request as ready for review June 29, 2021 15:31
Copy link
Collaborator

@pedronis pedronis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm but the new preconditions checks are not tested afaict

Comment on lines +102 to +107
return nil, fmt.Errorf("group %q already exists", name)
}

if memoryLimit == 0 {
return nil, fmt.Errorf("cannot create quota group with no memory limit set")
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these new precond checks are not tested

// oom killer to be invoked when a new group is added as a sub-group to the
// larger group.
if memoryLimit <= 4*quantity.SizeKiB {
return nil, fmt.Errorf("memory limit for group %q is too small: size must be larger than 4KB", name)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

// executing this directly
// make sure the specified snaps exist and aren't currently in another group
if err := validateSnapForAddingToGroup(st, snaps, name, allGrps); err != nil {
return nil, err
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

Comment on lines 161 to 167
return nil, fmt.Errorf("cannot remove non-existent quota group %q", name)
}

// XXX: remove this limitation eventually
if len(grp.SubGroups) != 0 {
return nil, fmt.Errorf("cannot remove quota group with sub-groups, remove the sub-groups first")
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

// now ensure that all of the snaps mentioned in AddSnaps exist as snaps and
// that they aren't already in an existing quota group
if err := validateSnapForAddingToGroup(st, updateOpts.AddSnaps, name, allGrps); err != nil {
return nil, err
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

@pedronis pedronis self-assigned this Jun 30, 2021
Copy link
Contributor

@stolowski stolowski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. One nitpick and a few test cases missing as pointed out by Samuele (and I think I found one more), but I'm fine if they are addressed in a followup not to block this PR (and also because it's already big).

client/quota.go Show resolved Hide resolved
// EnsureSnapServices, see comment in ensureSnapServicesForGroup for
// full details
if updateOpts.NewMemoryLimit < grp.MemoryLimit {
return nil, fmt.Errorf("cannot decrease memory limit of existing quota-group, remove and re-create it to decrease the limit")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAICT this case isn't covered either.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I now pushed tests that cover most of these

Copy link
Collaborator

@pedronis pedronis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Contributor

@stolowski stolowski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The last commit adding extra tests looks good, thank you!

@pedronis pedronis removed their assignment Jul 1, 2021
Copy link
Member Author

@anonymouse64 anonymouse64 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with Pawel's suggestion, otherwise looks good to me, thanks for the additional tests and pre-req PR's that have already landed

client/quota.go Show resolved Hide resolved
Copy link
Collaborator

@pedronis pedronis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually vote

@anonymouse64 anonymouse64 merged commit 9dfdb37 into snapcore:master Jul 1, 2021
@anonymouse64 anonymouse64 deleted the feature/quota-groups-the-prequel-spinoff-6 branch July 1, 2021 22:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants