Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
mm: multigenerational lru: documentation
Add Documentation/vm/multigen_lru.rst. Signed-off-by: Yu Zhao <yuzhao@google.com> Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
- Loading branch information
1 parent
5b37aed
commit fd0956e
Showing
2 changed files
with
135 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,134 @@ | ||
.. SPDX-License-Identifier: GPL-2.0 | ||
===================== | ||
Multigenerational LRU | ||
===================== | ||
|
||
Quick Start | ||
=========== | ||
Build Configurations | ||
-------------------- | ||
:Required: Set ``CONFIG_LRU_GEN=y``. | ||
|
||
:Optional: Set ``CONFIG_LRU_GEN_ENABLED=y`` to turn the feature on by | ||
default. | ||
|
||
Runtime Configurations | ||
---------------------- | ||
:Required: Write ``1`` to ``/sys/kernel/mm/lru_gen/enable`` if the | ||
feature was not turned on by default. | ||
|
||
:Optional: Write ``N`` to ``/sys/kernel/mm/lru_gen/min_ttl_ms`` to | ||
protect the working set of ``N`` milliseconds. The OOM killer is | ||
invoked if this working set cannot be kept in memory. | ||
|
||
:Optional: Read ``/sys/kernel/debug/lru_gen`` to confirm the feature | ||
is turned on. This file has the following output: | ||
|
||
:: | ||
|
||
memcg memcg_id memcg_path | ||
node node_id | ||
min_gen birth_time anon_size file_size | ||
... | ||
max_gen birth_time anon_size file_size | ||
|
||
``min_gen`` is the oldest generation number and ``max_gen`` is the | ||
youngest generation number. ``birth_time`` is in milliseconds. | ||
``anon_size`` and ``file_size`` are in pages. | ||
|
||
Phones/Laptops/Workstations | ||
--------------------------- | ||
No additional configurations required. | ||
|
||
Servers/Data Centers | ||
-------------------- | ||
:To support more generations: Change ``CONFIG_NR_LRU_GENS`` to a | ||
larger number. | ||
|
||
:To support more tiers: Change ``CONFIG_TIERS_PER_GEN`` to a larger | ||
number. | ||
|
||
:To support full stats: Set ``CONFIG_LRU_GEN_STATS=y``. | ||
|
||
:Working set estimation: Write ``+ memcg_id node_id max_gen | ||
[swappiness]`` to ``/sys/kernel/debug/lru_gen`` to invoke the aging, | ||
which scans PTEs for accessed pages and then creates the next | ||
generation ``max_gen+1``. A swap file and a non-zero ``swappiness``, | ||
which overrides ``vm.swappiness``, are required to scan PTEs mapping | ||
anon pages. | ||
|
||
:Proactive reclaim: Write ``- memcg_id node_id min_gen [swappiness] | ||
[nr_to_reclaim]`` to ``/sys/kernel/debug/lru_gen`` to invoke the | ||
eviction, which evicts generations less than or equal to ``min_gen``. | ||
``min_gen`` should be less than ``max_gen-1`` as ``max_gen`` and | ||
``max_gen-1`` are not fully aged and therefore cannot be evicted. | ||
``nr_to_reclaim`` can be used to limit the number of pages to evict. | ||
Multiple command lines are supported, so does concatenation with | ||
delimiters ``,`` and ``;``. | ||
|
||
Framework | ||
========= | ||
For each ``lruvec``, evictable pages are divided into multiple | ||
generations. The youngest generation number is stored in | ||
``lrugen->max_seq`` for both anon and file types as they are aged on | ||
an equal footing. The oldest generation numbers are stored in | ||
``lrugen->min_seq[2]`` separately for anon and file types as clean | ||
file pages can be evicted regardless of swap and writeback | ||
constraints. These three variables are monotonically increasing. | ||
Generation numbers are truncated into | ||
``order_base_2(CONFIG_NR_LRU_GENS+1)`` bits in order to fit into | ||
``page->flags``. The sliding window technique is used to prevent | ||
truncated generation numbers from overlapping. Each truncated | ||
generation number is an index to an array of per-type and per-zone | ||
lists ``lrugen->lists``. | ||
|
||
Each generation is then divided into multiple tiers. Tiers represent | ||
levels of usage from file descriptors only. Pages accessed ``N`` times | ||
via file descriptors belong to tier ``order_base_2(N)``. Each | ||
generation contains at most ``CONFIG_TIERS_PER_GEN`` tiers, and they | ||
require additional ``CONFIG_TIERS_PER_GEN-2`` bits in ``page->flags``. | ||
In contrast to moving across generations which requires list | ||
operations, moving across tiers only involves operations on | ||
``page->flags`` and therefore has a negligible cost. A feedback loop | ||
modeled after the PID controller monitors refault rates of all tiers | ||
and decides when to protect pages from which tiers. | ||
|
||
The framework comprises two conceptually independent components: the | ||
aging and the eviction, which can be invoked separately from user | ||
space for the purpose of working set estimation and proactive reclaim. | ||
|
||
Aging | ||
----- | ||
The aging produces young generations. Given an ``lruvec``, the aging | ||
traverses ``lruvec_memcg()->mm_list`` and calls ``walk_page_range()`` | ||
to scan PTEs for accessed pages (a ``mm_struct`` list is maintained | ||
for each ``memcg``). Upon finding one, the aging updates its | ||
generation number to ``max_seq`` (modulo ``CONFIG_NR_LRU_GENS``). | ||
After each round of traversal, the aging increments ``max_seq``. The | ||
aging is due when both ``min_seq[2]`` have caught up with | ||
``max_seq-1``. | ||
|
||
Eviction | ||
-------- | ||
The eviction consumes old generations. Given an ``lruvec``, the | ||
eviction scans pages on the per-zone lists indexed by anon and file | ||
``min_seq[2]`` (modulo ``CONFIG_NR_LRU_GENS``). It first tries to | ||
select a type based on the values of ``min_seq[2]``. If they are | ||
equal, it selects the type that has a lower refault rate. The eviction | ||
sorts a page according to its updated generation number if the aging | ||
has found this page accessed. It also moves a page to the next | ||
generation if this page is from an upper tier that has a higher | ||
refault rate than the base tier. The eviction increments | ||
``min_seq[2]`` of a selected type when it finds all the per-zone lists | ||
indexed by ``min_seq[2]`` of this selected type are empty. | ||
|
||
To-do List | ||
========== | ||
KVM Optimization | ||
---------------- | ||
Support shadow page table walk. | ||
|
||
NUMA Optimization | ||
----------------- | ||
Optimize page table walk for NUMA. |