Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

systemd-oomd #15206

Merged
merged 14 commits into from
Oct 15, 2020
Merged

systemd-oomd #15206

merged 14 commits into from
Oct 15, 2020

Conversation

anitazha
Copy link
Member

@anitazha anitazha commented Mar 23, 2020

systemd-oomd follows the current oomd model more closely than what was discussed at DevConf.CZ: oomd will periodically ask pid1 over varlink for a list of cgroups to monitor and use that to decide what/when to kill. The candidates for killing are the children of the monitored cgroups: either the recursive cgroups of a cgroup with memory.oom.group = 1, or the leaf cgroups. oomd will do the actual killing and set xattrs appropriately so the system manager can use that to determine if processes were killed by oomd. However, this limits enabling oomd to slices owned by pid1. But it does allow oomd to kill delegated cgroups.

Copy link

@dschatzberg dschatzberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good - see comments inline

src/oom/oomd-manager.c Outdated Show resolved Hide resolved
src/oom/oomd-util.c Outdated Show resolved Hide resolved
src/oom/oomd-util.c Outdated Show resolved Hide resolved
Copy link
Member

@keszybz keszybz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't look at the actual business logic at all... Some initial comments.

src/basic/cgroup-util.h Show resolved Hide resolved
src/basic/cgroup-util.h Outdated Show resolved Hide resolved
src/shared/psi-util.c Outdated Show resolved Hide resolved
src/shared/psi-util.h Show resolved Hide resolved
src/test/test-psi-util.c Show resolved Hide resolved
src/oom/oomd-manager.c Outdated Show resolved Hide resolved
meson.build Show resolved Hide resolved
src/oom/oomd.c Show resolved Hide resolved
src/oom/oomd.c Outdated Show resolved Hide resolved
src/oom/oomd.c Outdated Show resolved Hide resolved
src/shared/psi-util.c Outdated Show resolved Hide resolved
@anitazha anitazha force-pushed the systoomd-v0 branch 3 times, most recently from 2e62d95 to 2185da1 Compare March 24, 2020 23:25
Copy link
Member

@poettering poettering left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

heya! sorry for the late review. still working though a lot of stuff queued from when i was in parental leave.

it's not a complete review, but a start.

I generally like a lot what I am seeing here, good work. the stuff i found is generally minor.

src/basic/cgroup-util.c Show resolved Hide resolved

r = safe_atou64(value, &v);
if (r < 0)
return r;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as extra paranoia step we could fail here with an error if v is CGROUP_LIMIT_MAX here, to avoid ambiguity? dunno?

src/basic/cgroup-util.c Outdated Show resolved Hide resolved
src/shared/psi-util.c Outdated Show resolved Hide resolved
src/shared/psi-util.c Outdated Show resolved Hide resolved
src/oom/oomd-util.c Outdated Show resolved Hide resolved
src/oom/oomd-util.c Outdated Show resolved Hide resolved
src/oom/oomd-util.c Outdated Show resolved Hide resolved
src/oom/oomd-util.c Outdated Show resolved Hide resolved
src/oom/oomd-manager.c Outdated Show resolved Hide resolved
@poettering
Copy link
Member

oh, i see now that some of those helpers already did get merged? ignore that part of my review then

@poettering
Copy link
Member

please rebase! in particular as those helpers apparently got merged, after all

@poettering poettering added needs-rebase new-feature reviewed/needs-rework 🔨 PR has been reviewed and needs another round of reworks labels Apr 20, 2020
@anitazha
Copy link
Member Author

Hmm I thought I had already rebase-ed past the prior PR merges but I'll do that again when I make some updates. Thanks for the first round review

Copy link
Member Author

@anitazha anitazha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Working through the comments and should have another update soon but just addressing some comments

src/shared/psi-util.h Outdated Show resolved Hide resolved
@@ -107,7 +119,8 @@
<citerefentry><refentrytitle>systemd.service</refentrytitle><manvolnum>5</manvolnum></citerefentry>,
<citerefentry><refentrytitle>systemd.scope</refentrytitle><manvolnum>5</manvolnum></citerefentry>,
<citerefentry><refentrytitle>systemd.special</refentrytitle><manvolnum>7</manvolnum></citerefentry>,
<citerefentry><refentrytitle>systemd.directives</refentrytitle><manvolnum>7</manvolnum></citerefentry>
<citerefentry><refentrytitle>systemd.directives</refentrytitle><manvolnum>7</manvolnum></citerefentry>,
<citerefentry><refentrytitle>systemd-oomd.service</refentrytitle><manvolnum>8</manvolnum></citerefentry>
</para>
</refsect1>
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a simplification for this first version of the code since that's how the FB Oomd is typically set up now. We intend to support scopes and services too in a later iteration (though I don't think the code will be different at all).

src/shared/psi-util.c Outdated Show resolved Hide resolved
src/shared/psi-util.c Outdated Show resolved Hide resolved
Copy link

@hixio-mh hixio-mh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull requested

@anitazha anitazha force-pushed the systoomd-v0 branch 8 times, most recently from 5a2215d to fa4b4d0 Compare June 23, 2020 01:01
@anitazha anitazha force-pushed the systoomd-v0 branch 4 times, most recently from ba7859b to e406de7 Compare June 26, 2020 17:28
@anitazha anitazha force-pushed the systoomd-v0 branch 2 times, most recently from 70d6f60 to a382d1c Compare July 10, 2020 21:38
@anitazha anitazha removed the ci-fails/needs-rework 🔥 Please rework this, the CI noticed an issue with the PR label Oct 8, 2020
@anitazha
Copy link
Member Author

anitazha commented Oct 8, 2020

🤦 CI is greener now

@keszybz
Copy link
Member

keszybz commented Oct 13, 2020

I think we have crossed the point where subsequent versions of this PR are less useful than merging it and doing follow-up PRs.

I would propose the following plan:

  • merge this before v247-rc1
  • make oomd opt-in (i.e. default to disabled) and only available with -Dmode=developer.
  • describe the daemon and tools in NEWS, but specify that it's a preview and should not be used for production and that the options names and API details are subject to change.

Two reasons for this:

  • we want to release v247 soon and we won't have time for proper testing and responding to feedback,
  • it seems likely that some adjustments will be needed, and eschewing backwards-compatibility will allow us to make a better daemon in the long term.

I scanned the patches again, and it looks all nice. Unless @poettering has any last-minute requests, let's go ahead with the above.

systemd-oomd can be enabled when in developer mode (-Dmode=developer)
@anitazha
Copy link
Member Author

I pushed one commit to disable systemd-oomd by default and allow it to be enabled only in developer mode as suggested. I think we can add a NEWS entry in another PR.

@keszybz keszybz merged commit 69c0807 into systemd:master Oct 15, 2020
@keszybz
Copy link
Member

keszybz commented Oct 15, 2020

Yay!

@Conan-Kudo
Copy link
Contributor

Woo!

@org.freedesktop.DBus.Property.EmitsChangedSignal("false")
readonly s ManagedOOMMemoryPressure = '...';
@org.freedesktop.DBus.Property.EmitsChangedSignal("false")
readonly s ManagedOOMMemoryPressureLimitPercent = '...';
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, so in the other cases where we expose scaled values (i.e. percent) we encode things on the wire in uint64_t values that are normalized to UINT32_MAX as 100%. i.e. 150% is mapped to be 6442450944 on the wire, 100% is 4294967296 and 50% is 2147483648. I think we should follow the same logic here.

See the MemoryMinScale/MemoryLowScale/TaskMaxScale properties for exampe.

hence, please rework this to be uint64_t, also use "Scale" as suffix in the name

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm I was looking to switching this to "Scale" but it doesn't seem applicable in this case. For the existing properties that are suffixed with "Scale", the percentages are translated to a final value, i.e. number of bytes or number of tasks. But MemoryPressureLimitPercent= is operated on solely as a percentage. PSI averages are shown as a ratio in percent so there is no translation to a whole number.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, it's not really percent though, is it? it has two digits after the dot, i.e. its in units of 1/10,000th, displayed with a dot in the middle, no?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed the PSI average is displayed as a decimal but it represents a percent of time, per https://www.kernel.org/doc/html/latest/accounting/psi.html.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i am not getting what you are trying to say. the dbus interface as merged now accepts percent ranges 0%…100% as integer, i.e. 1% granularity. The kernel otoh exports 0.00%…100.00% as fixed point decimal integer with 0.01% granularity, correct? Thus the D-Bus API loses two decimals of accuracy, no?

I still think we should expose this as 64bit value, with binary fixed point normalized to 2^32. It's more finegrained than the kernel accuracy, but accurate enough to not lose any accuracy compared to what the kernel exposes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see what you mean now. Keep the finer level of granularity. Sure I can do that.

man/org.freedesktop.systemd1.xml Show resolved Hide resolved
src/core/core-varlink.c Show resolved Hide resolved
src/core/core-varlink.c Show resolved Hide resolved
src/core/core-varlink.c Show resolved Hide resolved
src/core/load-fragment-gperf.gperf.m4 Show resolved Hide resolved
src/core/load-fragment.c Show resolved Hide resolved
src/core/load-fragment.c Show resolved Hide resolved
src/core/unit.c Show resolved Hide resolved
src/oom/oomd-manager.c Show resolved Hide resolved
anitazha added a commit to anitazha/systemd that referenced this pull request Oct 19, 2020
Suggested post-merge in systemd#15206
anitazha added a commit to anitazha/systemd that referenced this pull request Oct 19, 2020
anitazha added a commit to anitazha/systemd that referenced this pull request Oct 19, 2020
@anitazha anitazha mentioned this pull request Oct 19, 2020
anitazha added a commit to anitazha/systemd that referenced this pull request Oct 19, 2020
Suggested post-merge in systemd#15206
anitazha added a commit to anitazha/systemd that referenced this pull request Oct 19, 2020
anitazha added a commit to anitazha/systemd that referenced this pull request Oct 19, 2020
lqb pushed a commit to lqb/systemd that referenced this pull request Oct 29, 2020
Currently, if an event source callback returns an error, we'll disable
the event source and continue. This adds a per-event source flag that if
turned on goes further: the event loop is also exited, propagating the
error code.

This is inspired by some patterns repeatedly seen in systemd#15206.

The idea is that event sources that server the "primary" function of a
program are marked like this, so that if they fail the failure is
instantly propagated and terminates the program.
lqb pushed a commit to lqb/systemd that referenced this pull request Oct 29, 2020
Suggested post-merge in systemd#15206
lqb pushed a commit to lqb/systemd that referenced this pull request Oct 29, 2020
lqb pushed a commit to lqb/systemd that referenced this pull request Oct 29, 2020
@anitazha anitazha deleted the systoomd-v0 branch November 18, 2020 22:50
anitazha added a commit to anitazha/systemd that referenced this pull request Feb 3, 2021
Requested in
systemd#15206 (comment),
preserve the full granularity for memory pressure limits (permyriad)
instead of capping out at percent.
Comment on lines +50 to +55
<para>The system must be running systemd with a full unified cgroup hierarchy for the expected cgroups-v2 features.
Furthermore, resource accounting must be turned on for all units monitored by <command>systemd-oomd</command>.
The easiest way to turn on resource accounting is by ensuring the values for <varname>DefaultCPUAccounting</varname>,
<varname>DefaultIOAccounting</varname>, <varname>DefaultMemoryAccounting</varname>, and
<varname>DefaultTasksAccounting</varname> are set to <constant>true</constant> in
<citerefentry><refentrytitle>systemd-system.conf</refentrytitle><manvolnum>5</manvolnum></citerefentry>.</para>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this accurate? Looking at the code, only the memory controller seems to be used.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue opened: #19331

Yamakuzure pushed a commit to elogind/elogind that referenced this pull request Aug 25, 2021
Requested in
systemd/systemd#15206 (comment),
preserve the full granularity for memory pressure limits (permyriad)
instead of capping out at percent.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

None yet