Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/snap/quota: refactor quota CLI as per new design #10333

Conversation

anonymouse64
Copy link
Member

@anonymouse64 anonymouse64 commented Jun 1, 2021

  • Make quota command only responsible for showing/displaying quota group
    information.
  • Introduce new set-quota command which creates or updates quota groups
  • Move the display of memory limit under a map section "constraints" which then
    has a memory key to allow future resource types to live under this section.
  • Display current memory usage for quota groups under a new section "current",
    which like "constraints" can be expanded for future resource types too.

Also update the spread test to check that memory usage reported by snapd is
approximately that of what the kernel reports.

THE SPREAD TESTS ARE FINALLY HAPPY 🎉 🦖 🎉 🌮 :shipit:

Signed-off-by: Ian Johnson <ian.johnson@canonical.com>
* Make quota command only responsible for showing/displaying quota group
  information.
* Introduce new set-quota command which creates or updates quota groups
* Move the display of memory limit under a map section "constraints" which then
  has a memory key to allow future resource types to live under this section.
* Display current memory usage for quota groups under a new section "current",
  which like "constraints" can be expanded for future resource types too.

Also update the spread test and unit tests for the new output formats.

Signed-off-by: Ian Johnson <ian.johnson@canonical.com>
@anonymouse64 anonymouse64 force-pushed the feature/quota-groups-remastered-deluxe-diamond-edition-2 branch 3 times, most recently from 3de904f to 13aa951 Compare June 2, 2021 00:34
This too is a race condition, thanks to Maciej for pointing this out.

Signed-off-by: Ian Johnson <ian.johnson@canonical.com>
…kernel

This check ensures that what the kernel says the current memory usage is and
what snapd says the current memory usage is don't differ by more than 10%. In
practice they should be exactly equal if the program is doing nothing as the
go-example-webserver should be, but something does happen to change it \
shouldn't change drastically.

Signed-off-by: Ian Johnson <ian.johnson@canonical.com>
@anonymouse64 anonymouse64 force-pushed the feature/quota-groups-remastered-deluxe-diamond-edition-2 branch from 13aa951 to d4139c9 Compare June 2, 2021 01:42
@pedronis pedronis self-requested a review June 2, 2021 07:58
This ensures that we can check the `snap quota` and `snap quotas` output for
the "current" section with the memory usage in it.

Signed-off-by: Ian Johnson <ian.johnson@canonical.com>
Signed-off-by: Ian Johnson <ian.johnson@canonical.com>
These limits were too low so the service was killed and no memory usage was
reported in the output, but we want to see some usage from the output to check
that it works, so increase the limits so the server is not killed due to OOM.

Signed-off-by: Ian Johnson <ian.johnson@canonical.com>
Also fix the defer statement which was missing the snap to unset the config on.

Signed-off-by: Ian Johnson <ian.johnson@canonical.com>
…roups

We only should be checking what snapd says about memory usage for these empty
groups, we don't need to compare with what the kernel says.

Signed-off-by: Ian Johnson <ian.johnson@canonical.com>
Signed-off-by: Ian Johnson <ian.johnson@canonical.com>
…ystems

On the following systems:

- Arch Linux
- Fedora 33
- Fedora 34
- Debian sid
- Ubuntu 21.04

The memory usage of an "empty" but active cgroup ends up being 4K for some
reason, so handle this in the expected output.

Signed-off-by: Ian Johnson <ian.johnson@canonical.com>
Copy link
Contributor

@mvo5 mvo5 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, looks very good. One small idea about the test but fine for a followup.


var shortSetQuotaHelp = i18n.G(`Create or update a quota group.`)
var longSetQuotaHelp = i18n.G(`
The set-quota command updates or creates a quota group with the specified set of
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this very detailed description!

@@ -93,7 +85,7 @@ func getQuotaGroups(c *Command, r *http.Request, _ *auth.UserState) Response {
}
sort.Strings(names)

results := make([]quotaGroupResultJSON, len(quotas))
results := make([]client.QuotaGroupResult, len(quotas))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this


percentChg="$(python3 -c "import math; print(math.ceil(abs($snapdSaysMemUsage - $kernelSaysMemUsage) / $snapdSaysMemUsage * 100))")"

if [ "$percentChg" -gt 10 ]; then
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a bit funny that the margin (needs to be) this big but that's fine of course

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just kinda random picked 10% as a reasonable value, not much reason to it. Though also, it's probably a bit artificial given that systemd queries the exact same bit of information from the kernel that we are checking here, but I wanted to avoid the race anyways by giving it a margin of error here. We can start with a smaller margin of error if you like and increase it if we see it fail with differences above the margin.

exit 1
fi

snapdSaysMemUsage="$(sudo snap run http --body GET snapd:///v2/quotas/group-four | jq -r '.result."current-memory"')"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should add a test here that checks that the size that the groups reports is roughly in line with /proc/$(pidof go-webserver)/status | grep VmSize or VmRSS or similar. Not sure what exactly we need to correlate there though.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really prefer to avoid this because memory accounting at this level is rather complicated and there are too many different values that are roughly "how much memory is this thing using", and I fear on different kernels we may see different values reported by systemd for the group versus the individual process. I'm already a bit frustrated with writing this bit of the spread test as it's all very finicky and prone to silly errors requiring re-runs of the spread test.

ccc aaa memory=400B
ddd aaa memory=400B
bbb zzz memory=1000B memory=400B
`[1:])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice trick!

Signed-off-by: Ian Johnson <ian.johnson@canonical.com>
non capturing groups are not a thing in grep -E because of course they aren't

Signed-off-by: Ian Johnson <ian.johnson@canonical.com>
@anonymouse64 anonymouse64 added the Squash-merge Please squash this PR when merging. label Jun 2, 2021
Signed-off-by: Ian Johnson <ian.johnson@canonical.com>
@anonymouse64
Copy link
Member Author

I'm like 80% sure that the spread tests are 65% likelier to be happy now than they were before, with confidence level of 2%

@anonymouse64
Copy link
Member Author

somehow the quota groups without any services in them are causing oom message 🙃

Copy link
Contributor

@mardy mardy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

Just a note, mainly for future discussions: I think that this is one of those cases in which the spread tests should be allowed to use a mocked systemd: this would guarantee consistent behaviour and make it possible to easily test more corner cases.

Or actually: it's fine for the spread tests to continue be this way, since they ensure that our functionality works fine across all distros; but I see that we do not only use them for that, but also for testing stuff which could be more easily (and more reliably, and faster!) tested at component level. So, maybe we should start considering the idea of stripping the spread tests to that bare minimum which gives us confidence that every features is supported in every distro, and create a new testing level for functional tests (pytest could be a good candidate for them), where we run our components unmodified, but with everything mocked around them.


All snaps provided are appended to the group; to remove a snap from a
quota group the entire group must be removed with remove-quota and recreated
without the quota group. To remove a sub-group from the quota group, the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"without the quota group" -> "without the undesired snap"

Unless, of course, I misunderstood something :-)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes good catch this was a typo. I will adjust the wording but not in this PR since it is finally green and I would love to land it :-)

Also this reminds me when @degville is back it would be great to have him look at the help texts here as well.

Copy link
Collaborator

@pedronis pedronis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks

…least 4K

On some systems, an empty cgroup with just cgroups nested inside it (but not
necessarily any processes) will have 4K memory usage, so on these systems, we
should make sure that the nested cgroups have enough space in them.

In a follow-up, we will adjust the minimum usable memory limit for a given
cgroup to be 4K to prevent this situation from happening in practice.

Signed-off-by: Ian Johnson <ian.johnson@canonical.com>
@anonymouse64
Copy link
Member Author

after much ado, the spread tests here are finally green (well at least the snap-quota ones are green, there are some unrelated failures)

@anonymouse64
Copy link
Member Author

I think that this is one of those cases in which the spread tests should be allowed to use a mocked systemd: this would guarantee consistent behaviour and make it possible to easily test more corner cases.

@mardy As frustrating as it is to get these spread tests to work for all the corner cases on the different distros we support, I do actually think it's quite important we test with the real systemd's from real distros in the spread tests, since the sorts of bugs and weirdness we are seeing in spread tests here is actually indeed the same kind of weirdness that real users would see so it's important we iron out the experience and using the real systemd with all its faults and features is important for this. I will be the first to tell you how cool it is that we run so thorough spread tests on so many tests that are as real as we can get essentially, even though I will also be the first to loudly complain about how annoying these spread tests can be 😄

@anonymouse64
Copy link
Member Author

The only spread failures here are google:debian-10-64:tests/main/interfaces-many-core-provided and google:ubuntu-18.04-64:tests/regression/lp-1867193 (and all the tumbleweed ones that have been failing since forever), which are unrelated so this PR should be good to go in now if @pedronis wants to force merge it in the AM

Copy link
Contributor

@mardy mardy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanations, LGTM! :-)

@mvo5 mvo5 merged commit d750b7b into snapcore:master Jun 7, 2021
@anonymouse64 anonymouse64 deleted the feature/quota-groups-remastered-deluxe-diamond-edition-2 branch June 7, 2021 12:51
pedronis added a commit that referenced this pull request Jun 18, 2021
Merge pull request #10346 from anonymouse64/feature/quota-groups-remastered-deluxe-diamond-edition-2.5

This is to support nesting and to avoid confusing situations like being able to
create empty but nested quota groups which trigger oom-killer to be invoked on
the empty quota group because newer systems require at least 4K for the
accounting of a sub-group.

This came about during investigations of the spread test failures in #10333.
anonymouse64 added a commit to anonymouse64/snapd that referenced this pull request Jan 11, 2022
As requested by Alberto a long time ago: snapcore#10333 (comment)

Signed-off-by: Ian Johnson <ian.johnson@canonical.com>
anonymouse64 added a commit to anonymouse64/snapd that referenced this pull request Jan 11, 2022
As requested by Alberto a long time ago: snapcore#10333 (comment)

Signed-off-by: Ian Johnson <ian.johnson@canonical.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Squash-merge Please squash this PR when merging.
Projects
None yet
4 participants