mm: multi-gen LRU: minimal implementation
To avoid confusion, the terms "promotion" and "demotion" will be
applied to the multi-gen LRU, as a new convention; the terms
"activation" and "deactivation" will be applied to the active/inactive
LRU, as usual.

The aging produces young generations. Given an lruvec, it increments
max_seq when max_seq-min_seq+1 approaches MIN_NR_GENS. The aging
promotes hot pages to the youngest generation when it finds them
accessed through page tables; the demotion of cold pages happens
consequently when it increments max_seq. The aging has the complexity
O(nr_hot_pages), since it is only interested in hot pages. Promotion
in the aging path does not involve any LRU list operations, only the
updates of the gen counter and lrugen->nr_pages[]; demotion, unless as
the result of the increment of max_seq, requires LRU list operations,
e.g., lru_deactivate_fn().
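
The relationship between sequence numbers and generations described above can be sketched in user-space C. This is only an illustrative model: `struct toy_lruvec` and every `toy_*` helper are hypothetical names, and the `<=` threshold in `toy_should_age()` is an assumption standing in for "approaches MIN_NR_GENS". The point it shows is that incrementing max_seq ages every page that was not promoted, without touching those pages.

```c
#include <stdbool.h>

#define MIN_NR_GENS 2U
#define MAX_NR_GENS 4U

/* Hypothetical stand-in for the per-lruvec state; not the kernel struct. */
struct toy_lruvec {
	unsigned long max_seq;	/* youngest generation */
	unsigned long min_seq;	/* oldest generation */
};

/* Map a sequence number to a slot in the ring of generations. */
static unsigned int toy_gen_from_seq(unsigned long seq)
{
	return seq % MAX_NR_GENS;
}

/* Number of generations currently in use. */
static unsigned long toy_nr_gens(const struct toy_lruvec *v)
{
	return v->max_seq - v->min_seq + 1;
}

/* Assumed trigger: age once only MIN_NR_GENS generations remain. */
static bool toy_should_age(const struct toy_lruvec *v)
{
	return toy_nr_gens(v) <= MIN_NR_GENS;
}

/* Age of a page relative to the youngest generation (0 means youngest). */
static unsigned long toy_age(const struct toy_lruvec *v, unsigned long page_seq)
{
	return v->max_seq - page_seq;
}

/* Aging: create a new youngest generation if the ring has room. */
static bool toy_inc_max_seq(struct toy_lruvec *v)
{
	if (toy_nr_gens(v) >= MAX_NR_GENS)
		return false;
	v->max_seq++;
	return true;
}

/* Promotion: a hot page simply gets the youngest sequence number. */
static unsigned long toy_promote(const struct toy_lruvec *v)
{
	return v->max_seq;
}
```

In this model a page tagged with sequence number 5 has age 0 while max_seq is 5 and age 1 after one call to `toy_inc_max_seq()`, which is the implicit demotion the text refers to.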

The eviction consumes old generations. Given an lruvec, it increments
min_seq when the lists indexed by min_seq%MAX_NR_GENS become empty. A
feedback loop modeled after the PID controller monitors refaults over
anon and file types and decides which type to evict when both types
are available from the same generation.
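
How min_seq advances can be sketched in the same spirit. The model below is hypothetical (`struct toy_lrugen` and `toy_try_inc_min_seq()` are not kernel names) and collapses the per-type, per-zone lists into a single page count per generation. Keeping at least MIN_NR_GENS generations alive is an assumption here, since the eviction code itself is not part of this excerpt.

```c
#include <stdbool.h>

#define MIN_NR_GENS 2U
#define MAX_NR_GENS 4U

/* Hypothetical, simplified state: one page count per generation slot. */
struct toy_lrugen {
	unsigned long max_seq;
	unsigned long min_seq;
	long nr_pages[MAX_NR_GENS];
};

/*
 * Advance min_seq past every empty oldest generation, but never leave
 * fewer than MIN_NR_GENS generations (assumed lower bound).
 * Returns true if min_seq moved.
 */
static bool toy_try_inc_min_seq(struct toy_lrugen *g)
{
	bool advanced = false;

	while (g->min_seq + MIN_NR_GENS <= g->max_seq &&
	       g->nr_pages[g->min_seq % MAX_NR_GENS] == 0) {
		g->min_seq++;
		advanced = true;
	}
	return advanced;
}
```

With max_seq=7, min_seq=4 and the two oldest slots empty, the loop stops at min_seq=6 because going further would leave fewer than two generations.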

Each generation is divided into multiple tiers. Tiers represent
different ranges of numbers of accesses through file descriptors. A
page accessed N times through file descriptors is in tier
order_base_2(N). Tiers do not have dedicated lrugen->lists[], only
bits in folio->flags. In contrast to moving across generations, which
requires the LRU lock, moving across tiers only involves operations on
folio->flags. The feedback loop also monitors refaults over all tiers
and decides when to protect pages in which tiers (N>1), using the
first tier (N=0,1) as a baseline. The first tier contains single-use
unmapped clean pages, which are most likely the best choices for
eviction. The
eviction moves a page to the next generation, i.e., min_seq+1, if the
feedback loop decides so. This approach has the following advantages:
1. It removes the cost of activation in the buffered access path by
   inferring whether pages accessed multiple times through file
   descriptors are statistically hot and thus worth protecting in the
   eviction path.
2. It takes pages accessed through page tables into account and avoids
   overprotecting pages accessed multiple times through file
   descriptors. (Pages accessed through page tables are in the first
   tier, since N=0.)
3. More tiers provide better protection for pages accessed more than
   twice through file descriptors, when under heavy buffered I/O
   workloads.
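
The mapping from access count to tier can be checked with a small user-space sketch. `toy_order_base_2()` is a hypothetical stand-in for the kernel's order_base_2() (ceil(log2(n)), with 0 and 1 both mapping to 0), and the clamp to MAX_NR_TIERS-1 models the fact that the access counter saturates.

```c
#define MAX_NR_TIERS 4U

/* Stand-in for order_base_2(): smallest order with (1 << order) >= n. */
static unsigned int toy_order_base_2(unsigned long n)
{
	unsigned int order = 0;

	while (order < 31 && (1UL << order) < n)
		order++;
	return order;
}

/* Tier of a page accessed n times through file descriptors. */
static unsigned int toy_tier_from_accesses(unsigned long n)
{
	unsigned int tier = toy_order_base_2(n);

	/* the counter saturates, so the last tier absorbs everything above */
	return tier < MAX_NR_TIERS ? tier : MAX_NR_TIERS - 1;
}
```

N=0 and N=1 both land in tier 0, N=2 in tier 1, N=3 and N=4 in tier 2, and N>=5 in tier 3.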

Server benchmark results:
  Single workload:
    fio (buffered I/O): +[40, 42]%
                IOPS         BW
      5.18-rc1: 2463k        9621MiB/s
      patch1-6: 3484k        13.3GiB/s

  Single workload:
    memcached (anon): +[44, 46]%
                Ops/sec      KB/sec
      5.18-rc1: 771403.27    30004.17
      patch1-6: 1120643.70   43588.06

  Configurations:
    CPU: two Xeon 6154
    Mem: total 256G

    Node 1 was only used as a ram disk to reduce the variance in the
    results.

    patch drivers/block/brd.c <<EOF
    99,100c99,100
    < 	gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM;
    < 	page = alloc_page(gfp_flags);
    ---
    > 	gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM | __GFP_THISNODE;
    > 	page = alloc_pages_node(1, gfp_flags, 0);
    EOF

    cat >>/etc/systemd/system.conf <<EOF
    CPUAffinity=numa
    NUMAPolicy=bind
    NUMAMask=0
    EOF

    cat >>/etc/memcached.conf <<EOF
    -m 184320
    -s /var/run/memcached/memcached.sock
    -a 0766
    -t 36
    -B binary
    EOF

    cat fio.sh
    modprobe brd rd_nr=1 rd_size=113246208
    swapoff -a
    mkfs.ext4 /dev/ram0
    mount -t ext4 /dev/ram0 /mnt

    mkdir /sys/fs/cgroup/user.slice/test
    echo 38654705664 >/sys/fs/cgroup/user.slice/test/memory.max
    echo $$ >/sys/fs/cgroup/user.slice/test/cgroup.procs
    fio -name=mglru --numjobs=72 --directory=/mnt --size=1408m \
      --buffered=1 --ioengine=io_uring --iodepth=128 \
      --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
      --rw=randread --random_distribution=random --norandommap \
      --time_based --ramp_time=10m --runtime=5m --group_reporting

    cat memcached.sh
    modprobe brd rd_nr=1 rd_size=113246208
    swapoff -a
    mkswap /dev/ram0
    swapon /dev/ram0

    memtier_benchmark -S /var/run/memcached/memcached.sock \
      -P memcache_binary -n allkeys --key-minimum=1 \
      --key-maximum=65000000 --key-pattern=P:P -c 1 -t 36 \
      --ratio 1:0 --pipeline 8 -d 2000

    memtier_benchmark -S /var/run/memcached/memcached.sock \
      -P memcache_binary -n allkeys --key-minimum=1 \
      --key-maximum=65000000 --key-pattern=R:R -c 1 -t 36 \
      --ratio 0:1 --pipeline 8 --randomize --distinct-client-seed

Client benchmark results:
  kswapd profiles:
    5.18-rc1
      40.53%  page_vma_mapped_walk
      20.37%  lzo1x_1_do_compress (real work)
       6.99%  do_raw_spin_lock
       3.93%  _raw_spin_unlock_irq
       2.08%  vma_interval_tree_subtree_search
       2.06%  vma_interval_tree_iter_next
       1.95%  folio_referenced_one
       1.93%  anon_vma_interval_tree_iter_first
       1.51%  ptep_clear_flush
       1.35%  __anon_vma_interval_tree_subtree_search

    patch1-6
      35.99%  lzo1x_1_do_compress (real work)
      19.40%  page_vma_mapped_walk
       6.31%  _raw_spin_unlock_irq
       3.95%  do_raw_spin_lock
       2.39%  anon_vma_interval_tree_iter_first
       2.25%  ptep_clear_flush
       1.92%  __anon_vma_interval_tree_subtree_search
       1.70%  folio_referenced_one
       1.68%  __zram_bvec_write
       1.43%  anon_vma_interval_tree_iter_next

  Configurations:
    CPU: single Snapdragon 7c
    Mem: total 4G

    Chrome OS MemoryPressure [1]

[1] https://chromium.googlesource.com/chromiumos/platform/tast-tests/

Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
yuzhaogoogle authored and xanmod committed May 30, 2022
1 parent 82443b0 commit 7169860
Showing 8 changed files with 1,034 additions and 10 deletions.
36 changes: 36 additions & 0 deletions include/linux/mm_inline.h
@@ -119,6 +119,33 @@ static inline int lru_gen_from_seq(unsigned long seq)
 	return seq % MAX_NR_GENS;
 }

+static inline int lru_hist_from_seq(unsigned long seq)
+{
+	return seq % NR_HIST_GENS;
+}
+
+static inline int lru_tier_from_refs(int refs)
+{
+	VM_WARN_ON_ONCE(refs > BIT(LRU_REFS_WIDTH));
+
+	/* see the comment in folio_lru_refs() */
+	return order_base_2(refs + 1);
+}
+
+static inline int folio_lru_refs(struct folio *folio)
+{
+	unsigned long flags = READ_ONCE(folio->flags);
+	bool workingset = flags & BIT(PG_workingset);
+
+	/*
+	 * Return the number of accesses beyond PG_referenced, i.e., N-1 if the
+	 * total number of accesses is N>1, since N=0,1 both map to the first
+	 * tier. lru_tier_from_refs() will account for this off-by-one. Also see
+	 * the comment on MAX_NR_TIERS.
+	 */
+	return ((flags & LRU_REFS_MASK) >> LRU_REFS_PGOFF) + workingset;
+}
+
 static inline int folio_lru_gen(struct folio *folio)
 {
 	unsigned long flags = READ_ONCE(folio->flags);
@@ -171,6 +198,15 @@ static inline void lru_gen_update_size(struct lruvec *lruvec, struct folio *foli
 		__update_lru_size(lruvec, lru, zone, -delta);
 		return;
 	}
+
+	/* promotion */
+	if (!lru_gen_is_active(lruvec, old_gen) && lru_gen_is_active(lruvec, new_gen)) {
+		__update_lru_size(lruvec, lru, zone, -delta);
+		__update_lru_size(lruvec, lru + LRU_ACTIVE, zone, delta);
+	}
+
+	/* demotion requires isolation, e.g., lru_deactivate_fn() */
+	VM_WARN_ON_ONCE(lru_gen_is_active(lruvec, old_gen) && !lru_gen_is_active(lruvec, new_gen));
 }

 static inline bool lru_gen_add_folio(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
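The off-by-one between folio_lru_refs() and lru_tier_from_refs() is easier to follow with concrete flag words. The sketch below is a user-space model under stated assumptions: the bit positions (`TOY_PG_REFERENCED`, `TOY_PG_WORKINGSET`, `TOY_REFS_PGOFF`) are invented for illustration and do not match the real folio->flags layout, and `toy_flags_after()` is a hypothetical helper that builds the flags N file-descriptor accesses would leave behind.

```c
#include <stdbool.h>

/* Hypothetical bit positions, chosen only for this sketch. */
#define TOY_PG_REFERENCED	0
#define TOY_PG_WORKINGSET	1
#define TOY_REFS_PGOFF		2
#define TOY_REFS_WIDTH		2
#define TOY_REFS_MAX		((1UL << TOY_REFS_WIDTH) - 1)
#define TOY_REFS_MASK		(TOY_REFS_MAX << TOY_REFS_PGOFF)

/* Stand-in for order_base_2(): ceil(log2(n)), 0 for n <= 1. */
static int toy_order_base_2(unsigned long n)
{
	int order = 0;

	while (order < 31 && (1UL << order) < n)
		order++;
	return order;
}

/* Mirrors folio_lru_refs(): accesses beyond PG_referenced. */
static int toy_lru_refs(unsigned long flags)
{
	bool workingset = flags & (1UL << TOY_PG_WORKINGSET);

	return (int)((flags & TOY_REFS_MASK) >> TOY_REFS_PGOFF) + workingset;
}

/* Mirrors lru_tier_from_refs(): the +1 undoes the off-by-one. */
static int toy_tier_from_refs(int refs)
{
	return toy_order_base_2((unsigned long)refs + 1);
}

/* Flags after n accesses: referenced, then workingset, then the counter. */
static unsigned long toy_flags_after(unsigned int n)
{
	unsigned long flags = 0, extra;

	if (n >= 1)
		flags |= 1UL << TOY_PG_REFERENCED;
	if (n >= 2)
		flags |= 1UL << TOY_PG_WORKINGSET;
	if (n > 2) {
		extra = n - 2;
		if (extra > TOY_REFS_MAX)
			extra = TOY_REFS_MAX;	/* counter saturates */
		flags |= extra << TOY_REFS_PGOFF;
	}
	return flags;
}
```

For N = 0..5 this yields tiers 0, 0, 1, 2, 2, 3, which agrees with order_base_2(N) as stated in the commit message.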
42 changes: 42 additions & 0 deletions include/linux/mmzone.h
@@ -348,6 +348,29 @@ enum lruvec_flags {
 #define MIN_NR_GENS 2U
 #define MAX_NR_GENS 4U

+/*
+ * Each generation is divided into multiple tiers. Tiers represent different
+ * ranges of numbers of accesses through file descriptors. A page accessed N
+ * times through file descriptors is in tier order_base_2(N). A page in the
+ * first tier (N=0,1) is marked by PG_referenced unless it was faulted in
+ * through page tables or read ahead. A page in any other tier (N>1) is marked
+ * by PG_referenced and PG_workingset. This implies a minimum of two tiers is
+ * supported without using additional bits in folio->flags.
+ *
+ * In contrast to moving across generations which requires the LRU lock, moving
+ * across tiers only involves atomic operations on folio->flags and therefore
+ * has a negligible cost in the buffered access path. In the eviction path,
+ * comparisons of refaulted/(evicted+protected) from the first tier and the
+ * rest infer whether pages accessed multiple times through file descriptors
+ * are statistically hot and thus worth protecting.
+ *
+ * MAX_NR_TIERS is set to 4 so that the multi-gen LRU can support twice the
+ * number of categories of the active/inactive LRU when keeping track of
+ * accesses through file descriptors. It uses MAX_NR_TIERS-2 spare bits in
+ * folio->flags (LRU_REFS_MASK).
+ */
+#define MAX_NR_TIERS 4U
+
 #ifndef __GENERATING_BOUNDS_H

 struct lruvec;
@@ -362,6 +385,16 @@ enum {
 	LRU_GEN_FILE,
 };

+#define MIN_LRU_BATCH BITS_PER_LONG
+#define MAX_LRU_BATCH (MIN_LRU_BATCH * 128)
+
+/* whether to keep historical stats from evicted generations */
+#ifdef CONFIG_LRU_GEN_STATS
+#define NR_HIST_GENS MAX_NR_GENS
+#else
+#define NR_HIST_GENS 1U
+#endif
+
 /*
  * The youngest generation number is stored in max_seq for both anon and file
  * types as they are aged on an equal footing. The oldest generation numbers are
@@ -384,6 +417,15 @@ struct lru_gen_struct {
 	struct list_head lists[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
 	/* the sizes of the above lists */
 	long nr_pages[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
+	/* the exponential moving average of refaulted */
+	unsigned long avg_refaulted[ANON_AND_FILE][MAX_NR_TIERS];
+	/* the exponential moving average of evicted+protected */
+	unsigned long avg_total[ANON_AND_FILE][MAX_NR_TIERS];
+	/* the first tier doesn't need protection, hence the minus one */
+	unsigned long protected[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS - 1];
+	/* can be modified without holding the LRU lock */
+	atomic_long_t evicted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS];
+	atomic_long_t refaulted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS];
 };

 void lru_gen_init_lruvec(struct lruvec *lruvec);
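The counters added to struct lru_gen_struct feed the comparison of refaulted/(evicted+protected) between the first tier and the rest. A minimal sketch of such a comparison follows. It is a simplification under stated assumptions, not the code from mm/vmscan.c (which this excerpt does not show and which may apply gains and minimum sample sizes): `struct toy_ctrl_pos`, `toy_should_protect()` and the halve-and-add average in `toy_ema()` are all hypothetical.

```c
#include <stdbool.h>

/* Hypothetical snapshot of one tier: refaults and evicted+protected. */
struct toy_ctrl_pos {
	unsigned long refaulted;
	unsigned long total;
};

/*
 * Protect a tier when its refault ratio exceeds the baseline's, i.e.
 * tier.refaulted / tier.total > base.refaulted / base.total,
 * cross-multiplied to avoid division.
 */
static bool toy_should_protect(const struct toy_ctrl_pos *base,
			       const struct toy_ctrl_pos *tier)
{
	if (tier->total == 0)
		return false;	/* no samples, nothing to infer */

	return (unsigned long long)tier->refaulted * base->total >
	       (unsigned long long)base->refaulted * tier->total;
}

/* Assumed form of the moving average: equal weight to history and sample. */
static unsigned long toy_ema(unsigned long avg, unsigned long sample)
{
	return (avg + sample) / 2;
}
```

With a baseline of 10 refaults out of 1000 pages, a tier with 50 refaults out of 1000 would be protected in this model and a tier with 5 out of 1000 would not.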
5 changes: 4 additions & 1 deletion include/linux/page-flags-layout.h
@@ -106,7 +106,10 @@
 #error "Not enough bits in page flags"
 #endif

-#define LRU_REFS_WIDTH 0
+/* see the comment on MAX_NR_TIERS */
+#define LRU_REFS_WIDTH min(__LRU_REFS_WIDTH, BITS_PER_LONG - NR_PAGEFLAGS - \
+			    ZONES_WIDTH - LRU_GEN_WIDTH - SECTIONS_WIDTH - \
+			    NODES_WIDTH - KASAN_TAG_WIDTH - LAST_CPUPID_WIDTH)

 #endif
 #endif /* _LINUX_PAGE_FLAGS_LAYOUT */
2 changes: 2 additions & 0 deletions kernel/bounds.c
@@ -24,8 +24,10 @@ int main(void)
 	DEFINE(SPINLOCK_SIZE, sizeof(spinlock_t));
 #ifdef CONFIG_LRU_GEN
 	DEFINE(LRU_GEN_WIDTH, order_base_2(MAX_NR_GENS + 1));
+	DEFINE(__LRU_REFS_WIDTH, MAX_NR_TIERS - 2);
 #else
 	DEFINE(LRU_GEN_WIDTH, 0);
+	DEFINE(__LRU_REFS_WIDTH, 0);
 #endif
 	/* End of constants */

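The bit budget implied by kernel/bounds.c and page-flags-layout.h can be worked through in a few lines. This is a sketch under assumptions: the `toy_*` helpers are hypothetical, and `spare_bits` stands for whatever BITS_PER_LONG minus all the other widths evaluates to on a given configuration, which this excerpt does not fix. It also shows why w counter bits give w+2 tiers: the counter holds up to 2^w - 1, PG_workingset adds one more, and order_base_2(2^w + 1) = w + 1 is then the highest tier.

```c
#define MAX_NR_GENS 4U
#define MAX_NR_TIERS 4U

/* Stand-in for order_base_2(): ceil(log2(n)), 0 for n <= 1. */
static unsigned int toy_order_base_2(unsigned long n)
{
	unsigned int order = 0;

	while (order < 31 && (1UL << order) < n)
		order++;
	return order;
}

/* Bits for the generation counter: order_base_2(MAX_NR_GENS + 1). */
static unsigned int toy_lru_gen_width(void)
{
	return toy_order_base_2(MAX_NR_GENS + 1);
}

/* Bits for the access counter, clamped to what folio->flags has left. */
static unsigned int toy_lru_refs_width(unsigned int spare_bits)
{
	unsigned int wanted = MAX_NR_TIERS - 2;

	return wanted < spare_bits ? wanted : spare_bits;
}

/* Tiers distinguishable with a given counter width. */
static unsigned int toy_nr_tiers(unsigned int refs_width)
{
	return toy_order_base_2((1UL << refs_width) + 1) + 1;
}
```

In this model the generation counter needs 3 bits, the access counter 2 bits when they are available, and a configuration with no spare bits still gets the two tiers encoded by PG_referenced and PG_workingset.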
11 changes: 11 additions & 0 deletions mm/Kconfig
@@ -909,6 +909,7 @@ config ANON_VMA_NAME
 	  area from being merged with adjacent virtual memory areas due to the
 	  difference in their name.

+# multi-gen LRU {
 config LRU_GEN
 	bool "Multi-Gen LRU"
 	depends on MMU
@@ -917,6 +918,16 @@ config LRU_GEN
 	help
 	  A high performance LRU implementation to overcommit memory.

+config LRU_GEN_STATS
+	bool "Full stats for debugging"
+	depends on LRU_GEN
+	help
+	  Do not enable this option unless you plan to look at historical stats
+	  from evicted generations for debugging purposes.
+
+	  This option has a per-memcg and per-node memory overhead.
+# }
+
 source "mm/damon/Kconfig"

 endmenu
39 changes: 39 additions & 0 deletions mm/swap.c
@@ -405,6 +405,40 @@ static void __lru_cache_activate_folio(struct folio *folio)
 	local_unlock(&lru_pvecs.lock);
 }

+#ifdef CONFIG_LRU_GEN
+static void folio_inc_refs(struct folio *folio)
+{
+	unsigned long new_flags, old_flags = READ_ONCE(folio->flags);
+
+	if (folio_test_unevictable(folio))
+		return;
+
+	if (!folio_test_referenced(folio)) {
+		folio_set_referenced(folio);
+		return;
+	}
+
+	if (!folio_test_workingset(folio)) {
+		folio_set_workingset(folio);
+		return;
+	}
+
+	/* see the comment on MAX_NR_TIERS */
+	do {
+		new_flags = old_flags & LRU_REFS_MASK;
+		if (new_flags == LRU_REFS_MASK)
+			break;
+
+		new_flags += BIT(LRU_REFS_PGOFF);
+		new_flags |= old_flags & ~LRU_REFS_MASK;
+	} while (!try_cmpxchg(&folio->flags, &old_flags, new_flags));
+}
+#else
+static void folio_inc_refs(struct folio *folio)
+{
+}
+#endif /* CONFIG_LRU_GEN */
+
 /*
  * Mark a page as having seen activity.
  *
@@ -417,6 +451,11 @@ static void __lru_cache_activate_folio(struct folio *folio)
  */
 void folio_mark_accessed(struct folio *folio)
 {
+	if (lru_gen_enabled()) {
+		folio_inc_refs(folio);
+		return;
+	}
+
 	if (!folio_test_referenced(folio)) {
 		folio_set_referenced(folio);
 	} else if (folio_test_unevictable(folio)) {
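The lock-free saturating counter in folio_inc_refs() can be tried out in user space with C11 atomics. The sketch below is a model, not the kernel code: the bit layout (`TOY_REFERENCED`, `TOY_WORKINGSET`, `TOY_REFS_PGOFF`) is hypothetical, `atomic_compare_exchange_weak()` stands in for try_cmpxchg(), and the unevictable check is left out.

```c
#include <stdatomic.h>

/* Hypothetical flag layout, chosen only for this sketch. */
#define TOY_REFERENCED	(1UL << 0)
#define TOY_WORKINGSET	(1UL << 1)
#define TOY_REFS_PGOFF	2
#define TOY_REFS_MASK	(3UL << TOY_REFS_PGOFF)

/* Record one more access through a file descriptor. */
static void toy_inc_refs(_Atomic unsigned long *flags)
{
	unsigned long new_flags, old_flags = atomic_load(flags);

	/* first access: only set the referenced bit */
	if (!(old_flags & TOY_REFERENCED)) {
		atomic_fetch_or(flags, TOY_REFERENCED);
		return;
	}

	/* second access: set the workingset bit */
	if (!(old_flags & TOY_WORKINGSET)) {
		atomic_fetch_or(flags, TOY_WORKINGSET);
		return;
	}

	/* later accesses: bump the counter field until it saturates */
	do {
		if ((old_flags & TOY_REFS_MASK) == TOY_REFS_MASK)
			return;

		/* cannot carry out of the field, since it is not yet full */
		new_flags = old_flags + (1UL << TOY_REFS_PGOFF);
	} while (!atomic_compare_exchange_weak(flags, &old_flags, new_flags));
}
```

On a failed exchange `old_flags` is reloaded with the current value, so concurrent updates to unrelated bits are preserved and the loop retries with fresh state, which is the same property the kernel loop relies on.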
