Skip to content

Commit

Permalink
add extra free kbytes tunable
Browse files Browse the repository at this point in the history
Add a userspace visible knob to tell the VM to keep an extra amount
of memory free, by increasing the gap between each zone's min and
low watermarks.

This is useful for realtime applications that call system
calls and have a bound on the number of allocations that happen
in any short time period.  In this application, extra_free_kbytes
would be left at an amount equal to or larger than than the
maximum number of allocations that happen in any burst.

It may also be useful to reduce the memory use of virtual
machines (temporarily?), in a way that does not cause memory
fragmentation like ballooning does.

[ccross]
Revived for use on old kernels where no other solution exists.
The tunable will be removed on kernels that do better at avoiding
direct reclaim.

Change-Id: I765a42be8e964bfd3e2886d1ca85a29d60c3bb3e
Signed-off-by: Rik van Riel<riel@redhat.com>
Signed-off-by: Colin Cross <ccross@android.com>
  • Loading branch information
Rik van Riel authored and steven676 committed Nov 10, 2013
1 parent a5a4f71 commit ff03ce6
Show file tree
Hide file tree
Showing 3 changed files with 50 additions and 7 deletions.
16 changes: 16 additions & 0 deletions Documentation/sysctl/vm.txt
Expand Up @@ -28,6 +28,7 @@ Currently, these files are in /proc/sys/vm:
- dirty_writeback_centisecs
- drop_caches
- extfrag_threshold
- extra_free_kbytes
- hugepages_treat_as_movable
- hugetlb_shm_group
- laptop_mode
Expand Down Expand Up @@ -168,6 +169,21 @@ fragmentation index is <= extfrag_threshold. The default value is 500.

==============================================================

extra_free_kbytes

This parameter tells the VM to keep extra free memory between the threshold
where background reclaim (kswapd) kicks in, and the threshold where direct
reclaim (by allocating processes) kicks in.

This is useful for workloads that require low latency memory allocations
and have a bounded burstiness in memory allocations, for example a
realtime application that receives and transmits network traffic
(causing in-kernel memory allocations) with a maximum total message burst
size of 200MB may need 200MB of extra free memory to avoid direct reclaim
related latencies.

==============================================================

hugepages_treat_as_movable

This parameter is only useful when kernelcore= is specified at boot time to
Expand Down
9 changes: 9 additions & 0 deletions kernel/sysctl.c
Expand Up @@ -96,6 +96,7 @@ extern char core_pattern[];
extern unsigned int core_pipe_limit;
extern int pid_max;
extern int min_free_kbytes;
extern int extra_free_kbytes;
extern int min_free_order_shift;
extern int pid_max_min, pid_max_max;
extern int sysctl_drop_caches;
Expand Down Expand Up @@ -1189,6 +1190,14 @@ static struct ctl_table vm_table[] = {
.proc_handler = min_free_kbytes_sysctl_handler,
.extra1 = &zero,
},
{
.procname = "extra_free_kbytes",
.data = &extra_free_kbytes,
.maxlen = sizeof(extra_free_kbytes),
.mode = 0644,
.proc_handler = min_free_kbytes_sysctl_handler,
.extra1 = &zero,
},
{
.procname = "min_free_order_shift",
.data = &min_free_order_shift,
Expand Down
32 changes: 25 additions & 7 deletions mm/page_alloc.c
Expand Up @@ -189,9 +189,21 @@ static char * const zone_names[MAX_NR_ZONES] = {
"Movable",
};

/*
* Try to keep at least this much lowmem free. Do not allow normal
* allocations below this point, only high priority ones. Automatically
* tuned according to the amount of memory in the system.
*/
int min_free_kbytes = 1024;
int min_free_order_shift = 1;

/*
* Extra memory for the system to try freeing. Used to temporarily
* free memory, to make space for new workloads. Anyone can allocate
* down to the min watermarks controlled by min_free_kbytes above.
*/
int extra_free_kbytes = 0;

static unsigned long __meminitdata nr_kernel_pages;
static unsigned long __meminitdata nr_all_pages;
static unsigned long __meminitdata dma_reserve;
Expand Down Expand Up @@ -5165,6 +5177,7 @@ static void setup_per_zone_lowmem_reserve(void)
void setup_per_zone_wmarks(void)
{
unsigned long pages_min = min_free_kbytes >> (PAGE_SHIFT - 10);
unsigned long pages_low = extra_free_kbytes >> (PAGE_SHIFT - 10);
unsigned long lowmem_pages = 0;
struct zone *zone;
unsigned long flags;
Expand All @@ -5176,11 +5189,14 @@ void setup_per_zone_wmarks(void)
}

for_each_zone(zone) {
u64 tmp;
u64 min, low;

spin_lock_irqsave(&zone->lock, flags);
tmp = (u64)pages_min * zone->present_pages;
do_div(tmp, lowmem_pages);
min = (u64)pages_min * zone->present_pages;
do_div(min, lowmem_pages);
low = (u64)pages_low * zone->present_pages;
do_div(low, vm_total_pages);

if (is_highmem(zone)) {
/*
* __GFP_HIGH and PF_MEMALLOC allocations usually don't
Expand All @@ -5204,11 +5220,13 @@ void setup_per_zone_wmarks(void)
* If it's a lowmem zone, reserve a number of pages
* proportionate to the zone's size.
*/
zone->watermark[WMARK_MIN] = tmp;
zone->watermark[WMARK_MIN] = min;
}

zone->watermark[WMARK_LOW] = min_wmark_pages(zone) + (tmp >> 2);
zone->watermark[WMARK_HIGH] = min_wmark_pages(zone) + (tmp >> 1);
zone->watermark[WMARK_LOW] = min_wmark_pages(zone) +
low + (min >> 2);
zone->watermark[WMARK_HIGH] = min_wmark_pages(zone) +
low + (min >> 1);
setup_zone_migrate_reserve(zone);
spin_unlock_irqrestore(&zone->lock, flags);
}
Expand Down Expand Up @@ -5306,7 +5324,7 @@ module_init(init_per_zone_wmark_min)
/*
* min_free_kbytes_sysctl_handler - just a wrapper around proc_dointvec() so
* that we can call two helper functions whenever min_free_kbytes
* changes.
* or extra_free_kbytes changes.
*/
int min_free_kbytes_sysctl_handler(ctl_table *table, int write,
void __user *buffer, size_t *length, loff_t *ppos)
Expand Down

0 comments on commit ff03ce6

Please sign in to comment.