Commit de53fd7

Dave Chiluk authored and Peter Zijlstra committed
sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices

It has been observed that highly-threaded, non-cpu-bound applications running under cpu.cfs_quota_us constraints can hit a high percentage of periods throttled while simultaneously not consuming the allocated amount of quota. This use case is typical of user-interactive non-cpu bound applications, such as those running in kubernetes or mesos when run on multiple cpu cores.

This has been root caused to cpu-local run queues being allocated per-cpu bandwidth slices, and then not fully using that slice within the period, at which point the slice and quota expire. This expiration of unused slice results in applications not being able to utilize the quota which they are allocated.

The non-expiration of per-cpu slices was recently fixed by 'commit 512ac99 ("sched/fair: Fix bandwidth timer clock drift condition")'. Prior to that it appears that this had been broken since at least 'commit 51f2176 ("sched/fair: Fix unlocked reads of some cfs_b->quota/period")' which was introduced in v3.16-rc1 in 2014. That added the following conditional which resulted in slices never being expired.

	if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
		/* extend local deadline, drift is bounded above by 2 ticks */
		cfs_rq->runtime_expires += TICK_NSEC;

Because this was broken for nearly 5 years, and has recently been fixed and is now being noticed by many users running kubernetes (kubernetes/kubernetes#67577) it is my opinion that the mechanisms around expiring runtime should be removed altogether.

This allows quota already allocated to per-cpu run-queues to live longer than the period boundary. This allows threads on runqueues that do not use much CPU to continue to use their remaining slice over a longer period of time than cpu.cfs_period_us. This helps prevent the above condition of hitting throttling while also not fully utilizing your cpu quota.

This theoretically allows a machine to use slightly more than its allotted quota in some periods. This overflow would be bounded by the remaining quota left on each per-cpu runqueue. This is typically no more than min_cfs_rq_runtime=1ms per cpu. For CPU bound tasks this will change nothing, as they should theoretically fully utilize all of their quota in each period. For user-interactive tasks as described above this provides a much better user/application experience as their cpu utilization will more closely match the amount they requested when they hit throttling. This means that cpu limits no longer strictly apply per period for non-cpu bound applications, but that they are still accurate over longer timeframes.

This greatly improves performance of high-thread-count, non-cpu bound applications with low cfs_quota_us allocation on high-core-count machines. In the case of an artificial testcase (10ms/100ms of quota on 80 CPU machine), this commit resulted in almost 30x performance improvement, while still maintaining correct cpu quota restrictions. That testcase is available at https://github.com/indeedeng/fibtest.

Fixes: 512ac99 ("sched/fair: Fix bandwidth timer clock drift condition")
Signed-off-by: Dave Chiluk <chiluk+linux@indeed.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Reviewed-by: Ben Segall <bsegall@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: John Hammond <jhammond@indeed.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kyle Anderson <kwa@yelp.com>
Cc: Gabriel Munos <gmunoz@netflix.com>
Cc: Peter Oskolkov <posk@posk.io>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Brendan Gregg <bgregg@netflix.com>
Link: https://lkml.kernel.org/r/1563900266-19734-2-git-send-email-chiluk+linux@indeed.com

1 parent 139d025 commit de53fd7
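
The effect described in the commit message can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not kernel code: `simulate()`, `struct sim_result`, `SLICE_US` and `MIN_KEEP_US` are hypothetical names, the 5ms slice is an assumed value, every cpu is assumed to need the same small amount of runtime each period, and an idle cpu hands back all but 1ms of its slice as the documentation change in this commit describes. With `expire` set, whatever is left on a cpu is discarded at the period boundary; with it cleared, the leftover is kept.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_CPUS	256
#define SLICE_US	5000	/* assumed slice size, 5ms */
#define MIN_KEEP_US	1000	/* models min_cfs_rq_runtime, 1ms */

struct sim_result {
	uint64_t used_us;			/* runtime actually consumed */
	unsigned int throttled_periods;		/* periods with a refused request */
};

static uint64_t min_u64(uint64_t a, uint64_t b)
{
	return a < b ? a : b;
}

/* Hypothetical model: each cpu wants need_us of runtime in every period. */
static struct sim_result simulate(unsigned int ncpus, uint64_t quota_us,
				  uint64_t need_us, unsigned int periods,
				  int expire)
{
	uint64_t local[MAX_CPUS];
	struct sim_result res = { 0, 0 };
	unsigned int p, cpu;

	assert(ncpus > 0 && ncpus <= MAX_CPUS);
	memset(local, 0, sizeof(local));

	for (p = 0; p < periods; p++) {
		uint64_t global = quota_us;	/* refill at the period boundary */
		int throttled = 0;

		for (cpu = 0; cpu < ncpus; cpu++) {
			uint64_t want = need_us;

			while (want > 0) {
				uint64_t run;

				if (local[cpu] == 0) {
					/* ask the global pool for up to one slice */
					uint64_t grab = min_u64(SLICE_US, global);

					if (grab == 0) {
						throttled = 1;
						break;
					}
					global -= grab;
					local[cpu] += grab;
				}
				run = min_u64(want, local[cpu]);
				local[cpu] -= run;
				want -= run;
				res.used_us += run;
			}
			/* cpu goes idle: hand back all but MIN_KEEP_US */
			if (local[cpu] > MIN_KEEP_US) {
				global += local[cpu] - MIN_KEEP_US;
				local[cpu] = MIN_KEEP_US;
			}
		}
		if (throttled)
			res.throttled_periods++;
		if (expire)	/* old behaviour: leftovers are lost */
			memset(local, 0, sizeof(local));
	}
	return res;
}
```

Calling `simulate(80, 10000, 100, 50, 1)` and `simulate(80, 10000, 100, 50, 0)` compares the two behaviours for the 10ms/100ms, 80 cpu shape mentioned in the commit message. In this model the expiring variant is throttled in every period while consuming far less than its quota, because about 1ms per cpu is stranded and then discarded each period.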

3 files changed, +67 -83 lines changed

Diff for: Documentation/scheduler/sched-bwc.rst

+60 -14
@@ -9,15 +9,16 @@ CFS bandwidth control is a CONFIG_FAIR_GROUP_SCHED extension which allows the
 specification of the maximum CPU bandwidth available to a group or hierarchy.
 
 The bandwidth allowed for a group is specified using a quota and period. Within
-each given "period" (microseconds), a group is allowed to consume only up to
-"quota" microseconds of CPU time. When the CPU bandwidth consumption of a
-group exceeds this limit (for that period), the tasks belonging to its
-hierarchy will be throttled and are not allowed to run again until the next
-period.
-
-A group's unused runtime is globally tracked, being refreshed with quota units
-above at each period boundary. As threads consume this bandwidth it is
-transferred to cpu-local "silos" on a demand basis. The amount transferred
+each given "period" (microseconds), a task group is allocated up to "quota"
+microseconds of CPU time. That quota is assigned to per-cpu run queues in
+slices as threads in the cgroup become runnable. Once all quota has been
+assigned any additional requests for quota will result in those threads being
+throttled. Throttled threads will not be able to run again until the next
+period when the quota is replenished.
+
+A group's unassigned quota is globally tracked, being refreshed back to
+cfs_quota units at each period boundary. As threads consume this bandwidth it
+is transferred to cpu-local "silos" on a demand basis. The amount transferred
 within each of these updates is tunable and described as the "slice".
 
 Management
@@ -35,12 +36,12 @@ The default values are::
 
 A value of -1 for cpu.cfs_quota_us indicates that the group does not have any
 bandwidth restriction in place, such a group is described as an unconstrained
-bandwidth group.  This represents the traditional work-conserving behavior for
+bandwidth group. This represents the traditional work-conserving behavior for
 CFS.
 
 Writing any (valid) positive value(s) will enact the specified bandwidth limit.
-The minimum quota allowed for the quota or period is 1ms.  There is also an
-upper bound on the period length of 1s.  Additional restrictions exist when
+The minimum quota allowed for the quota or period is 1ms. There is also an
+upper bound on the period length of 1s. Additional restrictions exist when
 bandwidth limits are used in a hierarchical fashion, these are explained in
 more detail below.

@@ -53,8 +54,8 @@ unthrottled if it is in a constrained state.
 System wide settings
 --------------------
 For efficiency run-time is transferred between the global pool and CPU local
-"silos" in a batch fashion.  This greatly reduces global accounting pressure
-on large systems.  The amount transferred each time such an update is required
+"silos" in a batch fashion. This greatly reduces global accounting pressure
+on large systems. The amount transferred each time such an update is required
 is described as the "slice".
 
 This is tunable via procfs::
@@ -97,6 +98,51 @@ There are two ways in which a group may become throttled:
 In case b) above, even though the child may have runtime remaining it will not
 be allowed to until the parent's runtime is refreshed.
 
+CFS Bandwidth Quota Caveats
+---------------------------
+Once a slice is assigned to a cpu it does not expire. However all but 1ms of
+the slice may be returned to the global pool if all threads on that cpu become
+unrunnable. This is configured at compile time by the min_cfs_rq_runtime
+variable. This is a performance tweak that helps prevent added contention on
+the global lock.
+
+The fact that cpu-local slices do not expire results in some interesting corner
+cases that should be understood.
+
+For cgroup cpu constrained applications that are cpu limited this is a
+relatively moot point because they will naturally consume the entirety of their
+quota as well as the entirety of each cpu-local slice in each period. As a
+result it is expected that nr_periods roughly equal nr_throttled, and that
+cpuacct.usage will increase roughly equal to cfs_quota_us in each period.
+
+For highly-threaded, non-cpu bound applications this non-expiration nuance
+allows applications to briefly burst past their quota limits by the amount of
+unused slice on each cpu that the task group is running on (typically at most
+1ms per cpu or as defined by min_cfs_rq_runtime). This slight burst only
+applies if quota had been assigned to a cpu and then not fully used or returned
+in previous periods. This burst amount will not be transferred between cores.
+As a result, this mechanism still strictly limits the task group to quota
+average usage, albeit over a longer time window than a single period. This
+also limits the burst ability to no more than 1ms per cpu. This provides
+better more predictable user experience for highly threaded applications with
+small quota limits on high core count machines. It also eliminates the
+propensity to throttle these applications while simultanously using less than
+quota amounts of cpu. Another way to say this, is that by allowing the unused
+portion of a slice to remain valid across periods we have decreased the
+possibility of wastefully expiring quota on cpu-local silos that don't need a
+full slice's amount of cpu time.
+
+The interaction between cpu-bound and non-cpu-bound-interactive applications
+should also be considered, especially when single core usage hits 100%. If you
+gave each of these applications half of a cpu-core and they both got scheduled
+on the same CPU it is theoretically possible that the non-cpu bound application
+will use up to 1ms additional quota in some periods, thereby preventing the
+cpu-bound application from fully using its quota by that same amount. In these
+instances it will be up to the CFS algorithm (see sched-design-CFS.rst) to
+decide which application is chosen to run, as they will both be runnable and
+have remaining quota. This runtime discrepancy will be made up in the following
+periods when the interactive application idles.
+
 Examples
 --------
 1. Limit a group to 1 CPU worth of runtime::
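
The burst bound stated in the caveats section added by this diff can be written out as arithmetic. The helper names below (`max_burst_us`, `max_usage_us`) are hypothetical, and the sketch assumes each cpu carries at most min_cfs_rq_runtime (1ms) of leftover slice into the window being measured, which the text describes as the typical case rather than a hard guarantee.

```c
#include <assert.h>
#include <stdint.h>

#define MIN_CFS_RQ_RUNTIME_US	1000	/* 1ms, as stated in the documentation text */

/* Worst-case extra runtime carried in from earlier periods (assumed model). */
static uint64_t max_burst_us(unsigned int ncpus)
{
	assert(ncpus > 0);
	return (uint64_t)ncpus * MIN_CFS_RQ_RUNTIME_US;
}

/*
 * Upper bound on runtime consumed over nperiods consecutive periods:
 * the quota refilled in each period plus the one-off carried-in burst.
 */
static uint64_t max_usage_us(uint64_t quota_us, unsigned int nperiods,
			     unsigned int ncpus)
{
	return quota_us * nperiods + max_burst_us(ncpus);
}
```

Because the burst term does not grow with `nperiods`, average usage per period approaches the quota as the window lengthens, which is the sense in which the limit remains accurate over longer timeframes.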

Diff for: kernel/sched/fair.c

+7 -65
@@ -4371,8 +4371,6 @@ void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
 
 	now = sched_clock_cpu(smp_processor_id());
 	cfs_b->runtime = cfs_b->quota;
-	cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period);
-	cfs_b->expires_seq++;
 }
 
 static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
@@ -4394,8 +4392,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 {
 	struct task_group *tg = cfs_rq->tg;
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
-	u64 amount = 0, min_amount, expires;
-	int expires_seq;
+	u64 amount = 0, min_amount;
 
 	/* note: this is a positive sum as runtime_remaining <= 0 */
 	min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining;
@@ -4412,61 +4409,17 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 			cfs_b->idle = 0;
 		}
 	}
-	expires_seq = cfs_b->expires_seq;
-	expires = cfs_b->runtime_expires;
 	raw_spin_unlock(&cfs_b->lock);
 
 	cfs_rq->runtime_remaining += amount;
-	/*
-	 * we may have advanced our local expiration to account for allowed
-	 * spread between our sched_clock and the one on which runtime was
-	 * issued.
-	 */
-	if (cfs_rq->expires_seq != expires_seq) {
-		cfs_rq->expires_seq = expires_seq;
-		cfs_rq->runtime_expires = expires;
-	}
 
 	return cfs_rq->runtime_remaining > 0;
 }
 
-/*
- * Note: This depends on the synchronization provided by sched_clock and the
- * fact that rq->clock snapshots this value.
- */
-static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
-{
-	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
-
-	/* if the deadline is ahead of our clock, nothing to do */
-	if (likely((s64)(rq_clock(rq_of(cfs_rq)) - cfs_rq->runtime_expires) < 0))
-		return;
-
-	if (cfs_rq->runtime_remaining < 0)
-		return;
-
-	/*
-	 * If the local deadline has passed we have to consider the
-	 * possibility that our sched_clock is 'fast' and the global deadline
-	 * has not truly expired.
-	 *
-	 * Fortunately we can check determine whether this the case by checking
-	 * whether the global deadline(cfs_b->expires_seq) has advanced.
-	 */
-	if (cfs_rq->expires_seq == cfs_b->expires_seq) {
-		/* extend local deadline, drift is bounded above by 2 ticks */
-		cfs_rq->runtime_expires += TICK_NSEC;
-	} else {
-		/* global deadline is ahead, expiration has passed */
-		cfs_rq->runtime_remaining = 0;
-	}
-}
-
 static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
 {
 	/* dock delta_exec before expiring quota (as it could span periods) */
 	cfs_rq->runtime_remaining -= delta_exec;
-	expire_cfs_rq_runtime(cfs_rq);
 
 	if (likely(cfs_rq->runtime_remaining > 0))
 		return;
@@ -4661,8 +4614,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 		resched_curr(rq);
 }
 
-static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
-		u64 remaining, u64 expires)
+static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, u64 remaining)
 {
 	struct cfs_rq *cfs_rq;
 	u64 runtime;
@@ -4684,7 +4636,6 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
 		remaining -= runtime;
 
 		cfs_rq->runtime_remaining += runtime;
-		cfs_rq->runtime_expires = expires;
 
 		/* we check whether we're throttled above */
 		if (cfs_rq->runtime_remaining > 0)
@@ -4709,7 +4660,7 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
  */
 static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, unsigned long flags)
 {
-	u64 runtime, runtime_expires;
+	u64 runtime;
 	int throttled;
 
 	/* no need to continue the timer with no bandwidth constraint */
@@ -4737,8 +4688,6 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
 	/* account preceding periods in which throttling occurred */
 	cfs_b->nr_throttled += overrun;
 
-	runtime_expires = cfs_b->runtime_expires;
-
 	/*
 	 * This check is repeated as we are holding onto the new bandwidth while
 	 * we unthrottle. This can potentially race with an unthrottled group
@@ -4751,8 +4700,7 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
 		cfs_b->distribute_running = 1;
 		raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
 		/* we can't nest cfs_b->lock while distributing bandwidth */
-		runtime = distribute_cfs_runtime(cfs_b, runtime,
-						 runtime_expires);
+		runtime = distribute_cfs_runtime(cfs_b, runtime);
 		raw_spin_lock_irqsave(&cfs_b->lock, flags);
 
 		cfs_b->distribute_running = 0;
@@ -4834,8 +4782,7 @@ static void __return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 		return;
 
 	raw_spin_lock(&cfs_b->lock);
-	if (cfs_b->quota != RUNTIME_INF &&
-	    cfs_rq->runtime_expires == cfs_b->runtime_expires) {
+	if (cfs_b->quota != RUNTIME_INF) {
 		cfs_b->runtime += slack_runtime;
 
 		/* we are under rq->lock, defer unthrottling using a timer */
@@ -4868,7 +4815,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
 {
 	u64 runtime = 0, slice = sched_cfs_bandwidth_slice();
 	unsigned long flags;
-	u64 expires;
 
 	/* confirm we're still not at a refresh boundary */
 	raw_spin_lock_irqsave(&cfs_b->lock, flags);
@@ -4886,7 +4832,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
 	if (cfs_b->quota != RUNTIME_INF && cfs_b->runtime > slice)
 		runtime = cfs_b->runtime;
 
-	expires = cfs_b->runtime_expires;
 	if (runtime)
 		cfs_b->distribute_running = 1;

@@ -4895,11 +4840,10 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
 	if (!runtime)
 		return;
 
-	runtime = distribute_cfs_runtime(cfs_b, runtime, expires);
+	runtime = distribute_cfs_runtime(cfs_b, runtime);
 
 	raw_spin_lock_irqsave(&cfs_b->lock, flags);
-	if (expires == cfs_b->runtime_expires)
-		lsub_positive(&cfs_b->runtime, runtime);
+	lsub_positive(&cfs_b->runtime, runtime);
 	cfs_b->distribute_running = 0;
 	raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
 }
@@ -5056,8 +5000,6 @@ void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 
 	cfs_b->period_active = 1;
 	overrun = hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
-	cfs_b->runtime_expires += (overrun + 1) * ktime_to_ns(cfs_b->period);
-	cfs_b->expires_seq++;
 	hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
 }

Diff for: kernel/sched/sched.h

-4
@@ -335,8 +335,6 @@ struct cfs_bandwidth {
 	u64			quota;
 	u64			runtime;
 	s64			hierarchical_quota;
-	u64			runtime_expires;
-	int			expires_seq;
 
 	u8			idle;
 	u8			period_active;
@@ -557,8 +555,6 @@ struct cfs_rq {
 
 #ifdef CONFIG_CFS_BANDWIDTH
 	int			runtime_enabled;
-	int			expires_seq;
-	u64			runtime_expires;
 	s64			runtime_remaining;
 
 	u64			throttled_clock;
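
With the expiry fields gone from both structures, the runtime hand-off in kernel/sched/fair.c reduces to plain addition and subtraction. The following userspace sketch models that post-patch flow under stated assumptions: `struct pool`, `struct local_rq`, `assign_runtime()` and `return_runtime()` are hypothetical stand-ins for `struct cfs_bandwidth`, `struct cfs_rq`, `assign_cfs_rq_runtime()` and `__return_cfs_rq_runtime()`, and locking, timers and the unlimited-quota case are deliberately left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for struct cfs_bandwidth. */
struct pool {
	int64_t runtime;		/* unassigned quota left in this period */
};

/* Hypothetical stand-in for struct cfs_rq. */
struct local_rq {
	int64_t runtime_remaining;	/* <= 0 when a refill is needed */
};

static int64_t min_s64(int64_t a, int64_t b)
{
	return a < b ? a : b;
}

/*
 * Top the local queue up to one slice if the pool allows it. No expiry
 * stamp is copied any more. Returns non-zero if the queue may keep running.
 */
static int assign_runtime(struct pool *b, struct local_rq *rq, int64_t slice)
{
	int64_t min_amount, amount = 0;

	/* only requested once local runtime is used up */
	assert(rq->runtime_remaining <= 0);

	/* a positive sum, since runtime_remaining <= 0 */
	min_amount = slice - rq->runtime_remaining;
	if (b->runtime > 0) {
		amount = min_s64(b->runtime, min_amount);
		b->runtime -= amount;
	}
	rq->runtime_remaining += amount;

	return rq->runtime_remaining > 0;
}

/*
 * When the queue goes idle, hand everything above min_keep back to the
 * pool. There is no longer a check that the slice belongs to the
 * current period.
 */
static void return_runtime(struct pool *b, struct local_rq *rq,
			   int64_t min_keep)
{
	int64_t slack = rq->runtime_remaining - min_keep;

	if (slack <= 0)
		return;

	b->runtime += slack;
	rq->runtime_remaining -= slack;
}
```

In this model a queue that is 200us overdrawn and asks a 10ms pool for a 5ms slice ends up with 5ms locally, and after running 100us and going idle it keeps 1ms while the rest returns to the pool.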
