Skip to content

Commit

Permalink
mm: multigenerational lru: documentation
Browse files Browse the repository at this point in the history
Add Documentation/vm/multigen_lru.rst.

Signed-off-by: Yu Zhao <yuzhao@google.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
  • Loading branch information
yuzhaogoogle authored and xanmod committed Jun 28, 2021
1 parent 7585161 commit 13254ef
Show file tree
Hide file tree
Showing 2 changed files with 144 additions and 0 deletions.
1 change: 1 addition & 0 deletions Documentation/vm/index.rst
Expand Up @@ -17,6 +17,7 @@ various features of the Linux memory management

swap_numa
zswap
multigen_lru

Kernel developers MM documentation
==================================
Expand Down
143 changes: 143 additions & 0 deletions Documentation/vm/multigen_lru.rst
@@ -0,0 +1,143 @@
.. SPDX-License-Identifier: GPL-2.0
=====================
Multigenerational LRU
=====================

Quick Start
===========
Build Options
-------------
:Required: Set ``CONFIG_LRU_GEN=y``.

:Optional: Set ``CONFIG_LRU_GEN_ENABLED=y`` to turn the feature on by
default.

:Optional: Change ``CONFIG_NR_LRU_GENS`` to a number ``X`` to support
a maximum of ``X`` generations.

:Optional: Change ``CONFIG_TIERS_PER_GEN`` to a number ``Y`` to
support a maximum of ``Y`` tiers per generation.

Runtime Options
---------------
:Required: Write ``1`` to ``/sys/kernel/mm/lru_gen/enable`` if the
feature was not turned on by default.

:Optional: Change ``/sys/kernel/mm/lru_gen/spread`` to a number ``N``
to spread pages out across ``N+1`` generations. ``N`` should be less
than ``X``. Larger values make the background aging more aggressive.

:Optional: Read ``/sys/kernel/debug/lru_gen`` to verify the feature.
This file has the following output:

::

memcg memcg_id memcg_path
node node_id
min_gen birth_time anon_size file_size
...
max_gen birth_time anon_size file_size

Given a memcg and a node, ``min_gen`` is the oldest generation
(number) and ``max_gen`` is the youngest. Birth time is in
milliseconds. The sizes of anon and file types are in pages.

Recipes
-------
:Android on ARMv8.1+: ``X=4``, ``Y=3`` and ``N=0``.

:Android on pre-ARMv8.1 CPUs: Not recommended due to the lack of
``ARM64_HW_AFDBM``.

:Laptops and workstations running Chrome on x86_64: Use the default
values.

:Working set estimation: Write ``+ memcg_id node_id gen [swappiness]``
to ``/sys/kernel/debug/lru_gen`` to account referenced pages to
generation ``max_gen`` and create the next generation ``max_gen+1``.
``gen`` should be equal to ``max_gen``. A swap file and a non-zero
``swappiness`` are required to scan anon type. If swapping is not
desired, set ``vm.swappiness`` to ``0``.

:Proactive reclaim: Write ``- memcg_id node_id gen [swappiness]
[nr_to_reclaim]`` to ``/sys/kernel/debug/lru_gen`` to evict
generations less than or equal to ``gen``. ``gen`` should be less
than ``max_gen-1`` as ``max_gen`` and ``max_gen-1`` are active
generations and therefore protected from the eviction. Use
``nr_to_reclaim`` to limit the number of pages to evict. Multiple
command lines are supported, so does concatenation with delimiters
``,`` and ``;``.

Framework
=========
For each ``lruvec``, evictable pages are divided into multiple
generations. The youngest generation number is stored in ``max_seq``
for both anon and file types as they are aged on an equal footing. The
oldest generation numbers are stored in ``min_seq[2]`` separately for
anon and file types as clean file pages can be evicted regardless of
swap and write-back constraints. These three variables are
monotonically increasing. Generation numbers are truncated into
``order_base_2(CONFIG_NR_LRU_GENS+1)`` bits in order to fit into
``page->flags``. The sliding window technique is used to prevent
truncated generation numbers from overlapping. Each truncated
generation number is an index to an array of per-type and per-zone
lists. Evictable pages are added to the per-zone lists indexed by
``max_seq`` or ``min_seq[2]`` (modulo ``CONFIG_NR_LRU_GENS``),
depending on their types.

Each generation is then divided into multiple tiers. Tiers represent
levels of usage from file descriptors only. Pages accessed N times via
file descriptors belong to tier order_base_2(N). Each generation
contains at most CONFIG_TIERS_PER_GEN tiers, and they require
additional CONFIG_TIERS_PER_GEN-2 bits in page->flags. In contrast to
moving across generations which requires the lru lock for the list
operations, moving across tiers only involves an atomic operation on
``page->flags`` and therefore has a negligible cost. A feedback loop
modeled after the PID controller monitors the refault rates across all
tiers and decides when to activate pages from which tiers in the
reclaim path.

The framework comprises two conceptually independent components: the
aging and the eviction, which can be invoked separately from user
space for the purpose of working set estimation and proactive reclaim.

Aging
-----
The aging produces young generations. Given an ``lruvec``, the aging
scans page tables for referenced pages of this ``lruvec``. Upon
finding one, the aging updates its generation number to ``max_seq``.
After each round of scan, the aging increments ``max_seq``.

The aging maintains either a system-wide ``mm_struct`` list or
per-memcg ``mm_struct`` lists, and it only scans page tables of
processes that have been scheduled since the last scan.

The aging is due when both of ``min_seq[2]`` reaches ``max_seq-1``,
assuming both anon and file types are reclaimable.

Eviction
--------
The eviction consumes old generations. Given an ``lruvec``, the
eviction scans the pages on the per-zone lists indexed by either of
``min_seq[2]``. It first tries to select a type based on the values of
``min_seq[2]``. When anon and file types are both available from the
same generation, it selects the one that has a lower refault rate.

During a scan, the eviction sorts pages according to their new
generation numbers, if the aging has found them referenced. It also
moves pages from the tiers that have higher refault rates than tier 0
to the next generation.

When it finds all the per-zone lists of a selected type are empty, the
eviction increments ``min_seq[2]`` indexed by this selected type.

To-do List
==========
KVM Optimization
----------------
Support shadow page table scanning.

NUMA Optimization
-----------------
Optimize page table scan for NUMA.

0 comments on commit 13254ef

Please sign in to comment.