From 365b514e06c10223cb32980aaf1ba0de386c89a0 Mon Sep 17 00:00:00 2001
From: Masahito S
Date: Sun, 7 Apr 2024 01:04:18 +0900
Subject: [PATCH] mm/vmscan: Add sysctl knobs for protecting the working set
 [le9uo-1.5]

The kernel does not provide a way to protect the working set under
memory pressure. A certain amount of anonymous and clean file pages is
required by userspace for normal operation. First of all, userspace
needs a cache of shared libraries and executable binaries. If the
amount of clean file pages falls below a certain level, thrashing and
even livelock can take place.

The patch provides sysctl knobs for protecting the working set
(anonymous and clean file pages) under memory pressure.

== Multi-Gen LRU compatibility ==

le9uo 1.3 and above come with long-awaited Multi-Gen LRU (MGLRU, or
lru_gen) compatibility, bringing the same working set protection
features to MGLRU as to the traditional LRU.

Please be aware of an MGLRU-specific limitation. As of the latest Linux
kernel (version 6.7.5 at the time of writing), Multi-Gen LRU lacks the
ability to honor the vm.swappiness sysctl knob the way it was
originally designed to. Almost regardless of the value written to
vm.swappiness (as long as it is greater than 0), it seems to evict
whatever it finds first. This behavior comes from MGLRU's page-scanner
design and implementation, and it causes the system to start thrashing
much earlier and more easily than the traditional LRU. MGLRU instead
takes a temporal approach called min_ttl, but this design has another
problem: its optimal effective value is much more difficult to estimate
for each system than with the traditional LRU plus le9's spatial
approach, and when the value falls outside the effective range, it
easily results in either a premature invocation of the OOM killer or
thrashing. le9uo does not fix this issue, but it greatly mitigates it,
so that these limitations of MGLRU's design and implementation are no
longer a problem.

[1] https://github.com/firelzrd/le9uo/blob/main/le9uo_patches/stable/0001-linux6.6-le9uo-1.5.patch

Signed-off-by: Alexandre Frade
---
 Documentation/admin-guide/sysctl/vm.rst |  72 +++++++++++
 include/linux/mm.h                      |   8 ++
 kernel/sysctl.c                         |  34 +++++
 mm/Kconfig                              |  63 ++++++++++
 mm/mm_init.c                            |   1 +
 mm/vmscan.c                             | 158 ++++++++++++++++++++++--
 6 files changed, 329 insertions(+), 7 deletions(-)

diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index 45ba1f4dc0048..5cc069c428379 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -25,6 +25,9 @@ files can be found in mm/swap.c.
 Currently, these files are in /proc/sys/vm:
 
 - admin_reserve_kbytes
+- anon_min_ratio
+- clean_low_ratio
+- clean_min_ratio
 - compact_memory
 - compaction_proactiveness
 - compact_unevictable_allowed
@@ -106,6 +109,67 @@ On x86_64 this is about 128MB.
 Changing this takes effect whenever an application requests memory.
 
 
+anon_min_ratio
+==============
+
+This knob provides *hard* protection of anonymous pages. The anonymous pages
+on the current node won't be reclaimed under any conditions when their amount
+is below vm.anon_min_ratio.
+
+This knob may be used to prevent excessive swap thrashing when anonymous
+memory is low (for example, when memory is going to be overfilled by
+compressed data of the zram module).
+
+Setting this value too high (close to 100) can result in an inability to
+swap and can lead to early OOM under memory pressure.
+
+The unit of measurement is the percentage of the total memory of the node.
+
+The default value is 15.
+
+
+clean_low_ratio
+===============
+
+This knob provides *best-effort* protection of clean file pages. The file pages
+on the current node won't be reclaimed under memory pressure when the amount of
+clean file pages is below vm.clean_low_ratio *unless* we threaten to OOM.
+
+Protection of clean file pages using this knob may be used when swapping is
+still possible to
+ - prevent disk I/O thrashing under memory pressure;
+ - improve performance in disk cache-bound tasks under memory pressure.
+
+Setting it to a high value may result in an early eviction of anonymous pages
+into the swap space by attempting to hold the protected amount of clean file
+pages in memory.
+
+The unit of measurement is the percentage of the total memory of the node.
+
+The default value is 0.
+
+
+clean_min_ratio
+===============
+
+This knob provides *hard* protection of clean file pages. The file pages on the
+current node won't be reclaimed under memory pressure when the amount of clean
+file pages is below vm.clean_min_ratio.
+
+Hard protection of clean file pages using this knob may be used to
+ - prevent disk I/O thrashing under memory pressure even with no free swap space;
+ - improve performance in disk cache-bound tasks under memory pressure;
+ - avoid high latency and prevent livelock in near-OOM conditions.
+
+Setting it to a high value may result in an early out-of-memory condition due to
+the inability to reclaim the protected amount of clean file pages when other
+types of pages cannot be reclaimed.
+
+The unit of measurement is the percentage of the total memory of the node.
+
+The default value is 15.
+
+
 compact_memory
 ==============
 
@@ -910,6 +974,14 @@ be 133 (x + 2x = 200, 2x = 133.33).
 At 0, the kernel will not initiate swap until the amount of free and
 file-backed pages is less than the high watermark in a zone.
 
+This knob has no effect if the amount of clean file pages on the current
+node is below vm.clean_low_ratio or vm.clean_min_ratio. In this case,
+only anonymous pages can be reclaimed.
+
+If the number of anonymous pages on the current node is below
+vm.anon_min_ratio, then only file pages can be reclaimed with
+any vm.swappiness value.
+
 
 unprivileged_userfaultfd
 ========================
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index bf5d0b1b16f43..9e6731543f10f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -195,6 +195,14 @@ static inline void __mm_zero_struct_page(struct page *page)
 
 extern int sysctl_max_map_count;
 
+extern bool sysctl_workingset_protection;
+extern u8 sysctl_anon_min_ratio;
+extern u8 sysctl_clean_low_ratio;
+extern u8 sysctl_clean_min_ratio;
+int vm_workingset_protection_update_handler(
+	struct ctl_table *table, int write,
+	void __user *buffer, size_t *lenp, loff_t *ppos);
+
 extern unsigned long sysctl_user_reserve_kbytes;
 extern unsigned long sysctl_admin_reserve_kbytes;
 
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index d37130095aece..128a40e0b5dd0 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -2236,6 +2236,40 @@ static struct ctl_table vm_table[] = {
 		.extra1		= SYSCTL_ZERO,
 	},
 #endif
+	{
+		.procname	= "workingset_protection",
+		.data		= &sysctl_workingset_protection,
+		.maxlen		= sizeof(bool),
+		.mode		= 0644,
+		.proc_handler	= &proc_dobool,
+	},
+	{
+		.procname	= "anon_min_ratio",
+		.data		= &sysctl_anon_min_ratio,
+		.maxlen		= sizeof(u8),
+		.mode		= 0644,
+		.proc_handler	= &vm_workingset_protection_update_handler,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_ONE_HUNDRED,
+	},
+	{
+		.procname	= "clean_low_ratio",
+		.data		= &sysctl_clean_low_ratio,
+		.maxlen		= sizeof(u8),
+		.mode		= 0644,
+		.proc_handler	= &vm_workingset_protection_update_handler,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_ONE_HUNDRED,
+	},
+	{
+		.procname	= "clean_min_ratio",
+		.data		= &sysctl_clean_min_ratio,
+		.maxlen		= sizeof(u8),
+		.mode		= 0644,
+		.proc_handler	= &vm_workingset_protection_update_handler,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_ONE_HUNDRED,
+	},
 	{
 		.procname	= "user_reserve_kbytes",
 		.data		= &sysctl_user_reserve_kbytes,
diff --git a/mm/Kconfig b/mm/Kconfig
index 264a2df5ecf5b..4c21fdb6ec833 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -509,6 +509,69 @@ config ARCH_WANT_OPTIMIZE_DAX_VMEMMAP
 config ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP
 	bool
 
+config ANON_MIN_RATIO
+	int "Default value for vm.anon_min_ratio"
+	depends on SYSCTL
+	range 0 100
+	default 15
+	help
+	  This option sets the default value for the vm.anon_min_ratio sysctl knob.
+
+	  The vm.anon_min_ratio sysctl knob provides *hard* protection of
+	  anonymous pages. The anonymous pages on the current node won't be
+	  reclaimed under any conditions when their amount is below
+	  vm.anon_min_ratio. This knob may be used to prevent excessive swap
+	  thrashing when anonymous memory is low (for example, when memory is
+	  going to be overfilled by compressed data of the zram module).
+
+	  Setting this value too high (close to 100) can result in an
+	  inability to swap and can lead to early OOM under memory pressure.
+
+config CLEAN_LOW_RATIO
+	int "Default value for vm.clean_low_ratio"
+	depends on SYSCTL
+	range 0 100
+	default 0
+	help
+	  This option sets the default value for the vm.clean_low_ratio sysctl knob.
+
+	  The vm.clean_low_ratio sysctl knob provides *best-effort*
+	  protection of clean file pages. The file pages on the current node
+	  won't be reclaimed under memory pressure when the amount of clean file
+	  pages is below vm.clean_low_ratio *unless* we threaten to OOM.
+	  Protection of clean file pages using this knob may be used when
+	  swapping is still possible to
+	    - prevent disk I/O thrashing under memory pressure;
+	    - improve performance in disk cache-bound tasks under memory
+	      pressure.
+
+	  Setting it to a high value may result in an early eviction of anonymous
+	  pages into the swap space by attempting to hold the protected amount
+	  of clean file pages in memory.
+
+config CLEAN_MIN_RATIO
+	int "Default value for vm.clean_min_ratio"
+	depends on SYSCTL
+	range 0 100
+	default 15
+	help
+	  This option sets the default value for the vm.clean_min_ratio sysctl knob.
+
+	  The vm.clean_min_ratio sysctl knob provides *hard* protection of
+	  clean file pages. The file pages on the current node won't be
+	  reclaimed under memory pressure when the amount of clean file pages is
+	  below vm.clean_min_ratio. Hard protection of clean file pages using
+	  this knob may be used to
+	    - prevent disk I/O thrashing under memory pressure even with no free
+	      swap space;
+	    - improve performance in disk cache-bound tasks under memory
+	      pressure;
+	    - avoid high latency and prevent livelock in near-OOM conditions.
+
+	  Setting it to a high value may result in an early out-of-memory condition
+	  due to the inability to reclaim the protected amount of clean file pages
+	  when other types of pages cannot be reclaimed.
+
 config HAVE_MEMBLOCK_PHYS_MAP
 	bool
 
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 77fd04c83d046..5d6f3a4dbccd6 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -2760,6 +2760,7 @@ static void __init mem_init_print_info(void)
 		, K(totalhigh_pages())
 #endif
 		);
+	printk(KERN_INFO "le9 Unofficial (le9uo) working set protection 1.5 by Masahito Suzuki (forked from hakavlad's original le9 patch)");
 }
 
 /*
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 078221bdf47a0..9f6abb36751ba 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -134,6 +134,15 @@ struct scan_control {
 	/* The file folios on the current node are dangerously low */
 	unsigned int file_is_tiny:1;
 
+	/* The anonymous pages on the current node are below vm.anon_min_ratio */
+	unsigned int anon_below_min:1;
+
+	/* The clean file pages on the current node are below vm.clean_low_ratio */
+	unsigned int clean_below_low:1;
+
+	/* The clean file pages on the current node are below vm.clean_min_ratio */
+	unsigned int clean_below_min:1;
+
 	/* Always discard instead of demoting to lower tier memory */
 	unsigned int no_demotion:1;
 
@@ -183,6 +192,15 @@ struct scan_control {
 #define prefetchw_prev_lru_folio(_folio, _base, _field) do { } while (0)
 #endif
 
+bool sysctl_workingset_protection __read_mostly = false;
+u8 sysctl_anon_min_ratio __read_mostly = CONFIG_ANON_MIN_RATIO;
+u8 sysctl_clean_low_ratio __read_mostly = CONFIG_CLEAN_LOW_RATIO;
+u8 sysctl_clean_min_ratio __read_mostly = CONFIG_CLEAN_MIN_RATIO;
+static u64 sysctl_anon_min_ratio_kb __read_mostly = 0;
+static u64 sysctl_clean_low_ratio_kb __read_mostly = 0;
+static u64 sysctl_clean_min_ratio_kb __read_mostly = 0;
+static u64 workingset_protection_prev_totalram __read_mostly = 0;
+
 /*
  * From 0 .. 200. Higher means more swappy.
  */
@@ -1752,6 +1770,9 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 		    folio_mapped(folio) && folio_test_referenced(folio))
 			goto keep_locked;
 
+		if (folio_is_file_lru(folio) ? sc->clean_below_min : sc->anon_below_min)
+			goto keep_locked;
+
 		/*
 		 * The number of dirty pages determines if a node is marked
 		 * reclaim_congested. kswapd will stall and start writing
@@ -3071,6 +3092,15 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 		goto out;
 	}
 
+	/*
+	 * Force-scan anon if clean file pages are under vm.clean_low_ratio
+	 * or vm.clean_min_ratio.
+	 */
+	if (sc->clean_below_low || sc->clean_below_min) {
+		scan_balance = SCAN_ANON;
+		goto out;
+	}
+
 	/*
 	 * If there is enough inactive page cache, we do not reclaim
 	 * anything from the anonymous working right now.
@@ -3215,6 +3245,14 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 			BUG();
 		}
 
+		/*
+		 * Hard protection of the working set.
+		 * Don't reclaim anon/file pages when the amount is
+		 * below the watermark of the same type.
+		 */
+		if (file ? sc->clean_below_min : sc->anon_below_min)
+			scan = 0;
+
 		nr[lru] = scan;
 	}
 }
@@ -4597,6 +4635,23 @@ static bool lruvec_is_reclaimable(struct lruvec *lruvec, struct scan_control *sc
 /* to protect the working set of the last N jiffies */
 static unsigned long lru_gen_min_ttl __read_mostly;
 
+static void do_invoke_oom(struct scan_control *sc, bool try_memcg) {
+	struct oom_control oc = {
+		.gfp_mask = sc->gfp_mask,
+		.order = sc->order,
+	};
+
+	if (try_memcg && mem_cgroup_oom_synchronize(true))
+		return;
+
+	if (!mutex_trylock(&oom_lock))
+		return;
+	out_of_memory(&oc);
+	mutex_unlock(&oom_lock);
+}
+#define invoke_oom(sc) do_invoke_oom(sc, true)
+#define invoke_oom_nomemcg(sc) do_invoke_oom(sc, false)
+
 static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
 {
 	struct mem_cgroup *memcg;
@@ -4625,14 +4680,96 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
 	 * younger than min_ttl. However, another possibility is all memcgs are
 	 * either too small or below min.
 	 */
-	if (mutex_trylock(&oom_lock)) {
-		struct oom_control oc = {
-			.gfp_mask = sc->gfp_mask,
-		};
+	invoke_oom_nomemcg(sc);
+}
 
-		out_of_memory(&oc);
+int vm_workingset_protection_update_handler(struct ctl_table *table, int write,
+	void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	int ret = proc_dou8vec_minmax(table, write, buffer, lenp, ppos);
+	if (ret || !write)
+		return ret;
+
+	workingset_protection_prev_totalram = 0;
 
-		mutex_unlock(&oom_lock);
+	return 0;
+}
+
+static void prepare_workingset_protection(pg_data_t *pgdat, struct scan_control *sc)
+{
+	unsigned long node_mem_total;
+	struct sysinfo i;
+
+	if (!(sysctl_workingset_protection)) {
+		sc->anon_below_min = 0;
+		sc->clean_below_low = 0;
+		sc->clean_below_min = 0;
+		return;
+	}
+
+	if (likely(sysctl_anon_min_ratio ||
+		   sysctl_clean_low_ratio ||
+		   sysctl_clean_min_ratio)) {
+#ifdef CONFIG_NUMA
+		si_meminfo_node(&i, pgdat->node_id);
+#else //CONFIG_NUMA
+		si_meminfo(&i);
+#endif //CONFIG_NUMA
+		node_mem_total = i.totalram;
+
+		if (unlikely(workingset_protection_prev_totalram != node_mem_total)) {
+			sysctl_anon_min_ratio_kb =
+				node_mem_total * sysctl_anon_min_ratio / 100;
+			sysctl_clean_low_ratio_kb =
+				node_mem_total * sysctl_clean_low_ratio / 100;
+			sysctl_clean_min_ratio_kb =
+				node_mem_total * sysctl_clean_min_ratio / 100;
+			workingset_protection_prev_totalram = node_mem_total;
+		}
+	}
+
+	/*
+	 * Check the number of anonymous pages to protect them from
+	 * reclaiming if their amount is below the specified level.
+	 */
+	if (sysctl_anon_min_ratio) {
+		unsigned long reclaimable_anon;
+
+		reclaimable_anon =
+			node_page_state(pgdat, NR_ACTIVE_ANON) +
+			node_page_state(pgdat, NR_INACTIVE_ANON) +
+			node_page_state(pgdat, NR_ISOLATED_ANON);
+
+		sc->anon_below_min = reclaimable_anon < sysctl_anon_min_ratio_kb;
+	} else
+		sc->anon_below_min = 0;
+
+	/*
+	 * Check the number of clean file pages to protect them from
+	 * reclaiming if their amount is below the specified level.
+	 */
+	if (sysctl_clean_low_ratio || sysctl_clean_min_ratio) {
+		unsigned long reclaimable_file, dirty, clean;
+
+		reclaimable_file =
+			node_page_state(pgdat, NR_ACTIVE_FILE) +
+			node_page_state(pgdat, NR_INACTIVE_FILE) +
+			node_page_state(pgdat, NR_ISOLATED_FILE);
+		dirty = node_page_state(pgdat, NR_FILE_DIRTY);
+		/*
+		 * node_page_state() sum can go out of sync since
+		 * all the values are not read at once.
+		 */
+		if (likely(reclaimable_file > dirty))
+			clean = reclaimable_file - dirty;
+		else
+			clean = 0;
+
+		sc->clean_below_low = clean < sysctl_clean_low_ratio_kb;
+		sc->clean_below_min = clean < sysctl_clean_min_ratio_kb;
+	} else {
+		sc->clean_below_low = 0;
+		sc->clean_below_min = 0;
 	}
 }
 
@@ -5141,6 +5278,8 @@ static int isolate_folios(struct lruvec *lruvec, struct scan_control *sc, int sw
 	 */
 	if (!swappiness)
 		type = LRU_GEN_FILE;
+	else if (sc->clean_below_min || sc->clean_below_low)
+		type = LRU_GEN_ANON;
 	else if (min_seq[LRU_GEN_ANON] < min_seq[LRU_GEN_FILE])
 		type = LRU_GEN_ANON;
 	else if (swappiness == 1)
@@ -5150,7 +5289,7 @@ static int isolate_folios(struct lruvec *lruvec, struct scan_control *sc, int sw
 	else
 		type = get_type_to_scan(lruvec, swappiness, &tier);
 
-	for (i = !swappiness; i < ANON_AND_FILE; i++) {
+	for (i = 0; i < ANON_AND_FILE; i++) {
 		if (tier < 0)
 			tier = get_tier_idx(lruvec, type);
 
@@ -5425,6 +5564,7 @@ static int shrink_one(struct lruvec *lruvec, struct scan_control *sc)
 	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
 	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 
+	prepare_workingset_protection(pgdat, sc);
 	mem_cgroup_calculate_protection(NULL, memcg);
 
 	if (mem_cgroup_below_min(NULL, memcg))
@@ -6572,6 +6712,8 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 
 	prepare_scan_count(pgdat, sc);
 
+	prepare_workingset_protection(pgdat, sc);
+
 	shrink_node_memcgs(pgdat, sc);
 
 	flush_reclaim_state(sc);
@@ -6660,6 +6802,8 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 	 */
 	if (reclaimable)
 		pgdat->kswapd_failures = 0;
+	else if (sc->clean_below_min && !sc->priority)
+		invoke_oom(sc);
 }
 
 /*
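---

For readers who want to experiment with the knobs after applying the patch, below is a minimal userspace sketch (not part of the patch itself) that enables the feature and writes the documented defaults through the /proc/sys/vm entries created by the new vm_table additions. The program and the values it writes are illustrative only; it assumes a kernel built with this patch and must be run as root.

/*
 * Illustrative helper: tune the le9uo working set protection sysctls
 * by writing to /proc/sys/vm/<knob>. Not part of the patch; values
 * are examples, not recommendations.
 */
#include <stdio.h>

static int write_sysctl(const char *name, const char *value)
{
	char path[256];
	FILE *f;

	snprintf(path, sizeof(path), "/proc/sys/vm/%s", name);
	f = fopen(path, "w");
	if (!f) {
		perror(path);
		return -1;
	}
	fputs(value, f);
	fclose(f);
	return 0;
}

int main(void)
{
	/* Enable the feature, then apply the documented defaults. */
	write_sysctl("workingset_protection", "1");
	write_sysctl("anon_min_ratio", "15");  /* hard anon protection, % of node memory */
	write_sysctl("clean_low_ratio", "0");  /* best-effort clean file protection */
	write_sysctl("clean_min_ratio", "15"); /* hard clean file protection */
	return 0;
}

The same tuning can of course be done with sysctl(8) at runtime or persisted in /etc/sysctl.d/, exactly as for any other vm.* knob.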