Skip to content

Commit

Permalink
mm: multigenerational lru: documentation
Browse files Browse the repository at this point in the history
Add Documentation/vm/multigen_lru.rst.

Signed-off-by: Yu Zhao <yuzhao@google.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
  • Loading branch information
yuzhaogoogle authored and xanmod committed Aug 31, 2021
1 parent 5b37aed commit fd0956e
Show file tree
Hide file tree
Showing 2 changed files with 135 additions and 0 deletions.
1 change: 1 addition & 0 deletions Documentation/vm/index.rst
Expand Up @@ -17,6 +17,7 @@ various features of the Linux memory management

swap_numa
zswap
multigen_lru

Kernel developers MM documentation
==================================
Expand Down
134 changes: 134 additions & 0 deletions Documentation/vm/multigen_lru.rst
@@ -0,0 +1,134 @@
.. SPDX-License-Identifier: GPL-2.0
=====================
Multigenerational LRU
=====================

Quick Start
===========
Build Configurations
--------------------
:Required: Set ``CONFIG_LRU_GEN=y``.

:Optional: Set ``CONFIG_LRU_GEN_ENABLED=y`` to turn the feature on by
default.

Runtime Configurations
----------------------
:Required: Write ``1`` to ``/sys/kernel/mm/lru_gen/enable`` if the
feature was not turned on by default.

:Optional: Write ``N`` to ``/sys/kernel/mm/lru_gen/min_ttl_ms`` to
protect the working set of ``N`` milliseconds. The OOM killer is
invoked if this working set cannot be kept in memory.

:Optional: Read ``/sys/kernel/debug/lru_gen`` to confirm the feature
is turned on. This file has the following output:

::

memcg memcg_id memcg_path
node node_id
min_gen birth_time anon_size file_size
...
max_gen birth_time anon_size file_size

``min_gen`` is the oldest generation number and ``max_gen`` is the
youngest generation number. ``birth_time`` is in milliseconds.
``anon_size`` and ``file_size`` are in pages.

Phones/Laptops/Workstations
---------------------------
No additional configurations required.

Servers/Data Centers
--------------------
:To support more generations: Change ``CONFIG_NR_LRU_GENS`` to a
larger number.

:To support more tiers: Change ``CONFIG_TIERS_PER_GEN`` to a larger
number.

:To support full stats: Set ``CONFIG_LRU_GEN_STATS=y``.

:Working set estimation: Write ``+ memcg_id node_id max_gen
[swappiness]`` to ``/sys/kernel/debug/lru_gen`` to invoke the aging,
which scans PTEs for accessed pages and then creates the next
generation ``max_gen+1``. A swap file and a non-zero ``swappiness``,
which overrides ``vm.swappiness``, are required to scan PTEs mapping
anon pages.

:Proactive reclaim: Write ``- memcg_id node_id min_gen [swappiness]
[nr_to_reclaim]`` to ``/sys/kernel/debug/lru_gen`` to invoke the
eviction, which evicts generations less than or equal to ``min_gen``.
``min_gen`` should be less than ``max_gen-1`` as ``max_gen`` and
``max_gen-1`` are not fully aged and therefore cannot be evicted.
``nr_to_reclaim`` can be used to limit the number of pages to evict.
Multiple command lines are supported, so does concatenation with
delimiters ``,`` and ``;``.

Framework
=========
For each ``lruvec``, evictable pages are divided into multiple
generations. The youngest generation number is stored in
``lrugen->max_seq`` for both anon and file types as they are aged on
an equal footing. The oldest generation numbers are stored in
``lrugen->min_seq[2]`` separately for anon and file types as clean
file pages can be evicted regardless of swap and writeback
constraints. These three variables are monotonically increasing.
Generation numbers are truncated into
``order_base_2(CONFIG_NR_LRU_GENS+1)`` bits in order to fit into
``page->flags``. The sliding window technique is used to prevent
truncated generation numbers from overlapping. Each truncated
generation number is an index to an array of per-type and per-zone
lists ``lrugen->lists``.

Each generation is then divided into multiple tiers. Tiers represent
levels of usage from file descriptors only. Pages accessed ``N`` times
via file descriptors belong to tier ``order_base_2(N)``. Each
generation contains at most ``CONFIG_TIERS_PER_GEN`` tiers, and they
require additional ``CONFIG_TIERS_PER_GEN-2`` bits in ``page->flags``.
In contrast to moving across generations which requires list
operations, moving across tiers only involves operations on
``page->flags`` and therefore has a negligible cost. A feedback loop
modeled after the PID controller monitors refault rates of all tiers
and decides when to protect pages from which tiers.

The framework comprises two conceptually independent components: the
aging and the eviction, which can be invoked separately from user
space for the purpose of working set estimation and proactive reclaim.

Aging
-----
The aging produces young generations. Given an ``lruvec``, the aging
traverses ``lruvec_memcg()->mm_list`` and calls ``walk_page_range()``
to scan PTEs for accessed pages (a ``mm_struct`` list is maintained
for each ``memcg``). Upon finding one, the aging updates its
generation number to ``max_seq`` (modulo ``CONFIG_NR_LRU_GENS``).
After each round of traversal, the aging increments ``max_seq``. The
aging is due when both ``min_seq[2]`` have caught up with
``max_seq-1``.

Eviction
--------
The eviction consumes old generations. Given an ``lruvec``, the
eviction scans pages on the per-zone lists indexed by anon and file
``min_seq[2]`` (modulo ``CONFIG_NR_LRU_GENS``). It first tries to
select a type based on the values of ``min_seq[2]``. If they are
equal, it selects the type that has a lower refault rate. The eviction
sorts a page according to its updated generation number if the aging
has found this page accessed. It also moves a page to the next
generation if this page is from an upper tier that has a higher
refault rate than the base tier. The eviction increments
``min_seq[2]`` of a selected type when it finds all the per-zone lists
indexed by ``min_seq[2]`` of this selected type are empty.

To-do List
==========
KVM Optimization
----------------
Support shadow page table walk.

NUMA Optimization
-----------------
Optimize page table walk for NUMA.

0 comments on commit fd0956e

Please sign in to comment.