Skip to content

Commit

Permalink
Merge tag 'mm-stable-2023-04-27-15-30' of git://git.kernel.org/pub/sc…
Browse files Browse the repository at this point in the history
…m/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - Nick Piggin's "shoot lazy tlbs" series, to improve the peformance of
   switching from a user process to a kernel thread.

 - More folio conversions from Kefeng Wang, Zhang Peng and Pankaj
   Raghav.

 - zsmalloc performance improvements from Sergey Senozhatsky.

 - Yue Zhao has found and fixed some data race issues around the
   alteration of memcg userspace tunables.

 - VFS rationalizations from Christoph Hellwig:
     - removal of most of the callers of write_one_page()
     - make __filemap_get_folio()'s return value more useful

 - Luis Chamberlain has changed tmpfs so it no longer requires swap
   backing. Use `mount -o noswap'.

 - Qi Zheng has made the slab shrinkers operate locklessly, providing
   some scalability benefits.

 - Keith Busch has improved dmapool's performance, making part of its
   operations O(1) rather than O(n).

 - Peter Xu adds the UFFD_FEATURE_WP_UNPOPULATED feature to userfaultd,
   permitting userspace to wr-protect anon memory unpopulated ptes.

 - Kirill Shutemov has changed MAX_ORDER's meaning to be inclusive
   rather than exclusive, and has fixed a bunch of errors which were
   caused by its unintuitive meaning.

 - Axel Rasmussen give userfaultfd the UFFDIO_CONTINUE_MODE_WP feature,
   which causes minor faults to install a write-protected pte.

 - Vlastimil Babka has done some maintenance work on vma_merge():
   cleanups to the kernel code and improvements to our userspace test
   harness.

 - Cleanups to do_fault_around() by Lorenzo Stoakes.

 - Mike Rapoport has moved a lot of initialization code out of various
   mm/ files and into mm/mm_init.c.

 - Lorenzo Stoakes removd vmf_insert_mixed_prot(), which was added for
   DRM, but DRM doesn't use it any more.

 - Lorenzo has also coverted read_kcore() and vread() to use iterators
   and has thereby removed the use of bounce buffers in some cases.

 - Lorenzo has also contributed further cleanups of vma_merge().

 - Chaitanya Prakash provides some fixes to the mmap selftesting code.

 - Matthew Wilcox changes xfs and afs so they no longer take sleeping
   locks in ->map_page(), a step towards RCUification of pagefaults.

 - Suren Baghdasaryan has improved mmap_lock scalability by switching to
   per-VMA locking.

 - Frederic Weisbecker has reworked the percpu cache draining so that it
   no longer causes latency glitches on cpu isolated workloads.

 - Mike Rapoport cleans up and corrects the ARCH_FORCE_MAX_ORDER Kconfig
   logic.

 - Liu Shixin has changed zswap's initialization so we no longer waste a
   chunk of memory if zswap is not being used.

 - Yosry Ahmed has improved the performance of memcg statistics
   flushing.

 - David Stevens has fixed several issues involving khugepaged,
   userfaultfd and shmem.

 - Christoph Hellwig has provided some cleanup work to zram's IO-related
   code paths.

 - David Hildenbrand has fixed up some issues in the selftest code's
   testing of our pte state changing.

 - Pankaj Raghav has made page_endio() unneeded and has removed it.

 - Peter Xu contributed some rationalizations of the userfaultfd
   selftests.

 - Yosry Ahmed has fixed an issue around memcg's page recalim
   accounting.

 - Chaitanya Prakash has fixed some arm-related issues in the
   selftests/mm code.

 - Longlong Xia has improved the way in which KSM handles hwpoisoned
   pages.

 - Peter Xu fixes a few issues with uffd-wp at fork() time.

 - Stefan Roesch has changed KSM so that it may now be used on a
   per-process and per-cgroup basis.

* tag 'mm-stable-2023-04-27-15-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (369 commits)
  mm,unmap: avoid flushing TLB in batch if PTE is inaccessible
  shmem: restrict noswap option to initial user namespace
  mm/khugepaged: fix conflicting mods to collapse_file()
  sparse: remove unnecessary 0 values from rc
  mm: move 'mmap_min_addr' logic from callers into vm_unmapped_area()
  hugetlb: pte_alloc_huge() to replace huge pte_alloc_map()
  maple_tree: fix allocation in mas_sparse_area()
  mm: do not increment pgfault stats when page fault handler retries
  zsmalloc: allow only one active pool compaction context
  selftests/mm: add new selftests for KSM
  mm: add new KSM process and sysfs knobs
  mm: add new api to enable ksm per process
  mm: shrinkers: fix debugfs file permissions
  mm: don't check VMA write permissions if the PTE/PMD indicates write permissions
  migrate_pages_batch: fix statistics for longterm pin retry
  userfaultfd: use helper function range_in_vma()
  lib/show_mem.c: use for_each_populated_zone() simplify code
  mm: correct arg in reclaim_pages()/reclaim_clean_pages_from_list()
  fs/buffer: convert create_page_buffers to folio_create_buffers
  fs/buffer: add folio_create_empty_buffers helper
  ...
  • Loading branch information
torvalds committed Apr 28, 2023
2 parents 91ec4b0 + 4d4b6d6 commit 7fa8a8e
Show file tree
Hide file tree
Showing 306 changed files with 11,767 additions and 8,185 deletions.
8 changes: 8 additions & 0 deletions Documentation/ABI/testing/sysfs-kernel-mm-ksm
Original file line number Diff line number Diff line change
Expand Up @@ -51,3 +51,11 @@ Description: Control merging pages across different NUMA nodes.

When it is set to 0 only pages from the same node are merged,
otherwise pages from all nodes can be merged together (default).

What: /sys/kernel/mm/ksm/general_profit
Date: April 2023
KernelVersion: 6.4
Contact: Linux memory management mailing list <linux-mm@kvack.org>
Description: Measure how effective KSM is.
general_profit: how effective is KSM. The formula for the
calculation is in Documentation/admin-guide/mm/ksm.rst.
6 changes: 3 additions & 3 deletions Documentation/admin-guide/kdump/vmcoreinfo.rst
Original file line number Diff line number Diff line change
Expand Up @@ -172,7 +172,7 @@ variables.
Offset of the free_list's member. This value is used to compute the number
of free pages.

Each zone has a free_area structure array called free_area[MAX_ORDER].
Each zone has a free_area structure array called free_area[MAX_ORDER + 1].
The free_list represents a linked list of free page blocks.

(list_head, next|prev)
Expand All @@ -189,8 +189,8 @@ Offsets of the vmap_area's members. They carry vmalloc-specific
information. Makedumpfile gets the start address of the vmalloc region
from this.

(zone.free_area, MAX_ORDER)
---------------------------
(zone.free_area, MAX_ORDER + 1)
-------------------------------

Free areas descriptor. User-space tools use this value to iterate the
free_area ranges. MAX_ORDER is used by the zone buddy allocator.
Expand Down
2 changes: 1 addition & 1 deletion Documentation/admin-guide/kernel-parameters.txt
Original file line number Diff line number Diff line change
Expand Up @@ -4012,7 +4012,7 @@
[KNL] Minimal page reporting order
Format: <integer>
Adjust the minimal page reporting order. The page
reporting is disabled when it exceeds (MAX_ORDER-1).
reporting is disabled when it exceeds MAX_ORDER.

panic= [KNL] Kernel behaviour on panic: delay <timeout>
timeout > 0: seconds before rebooting
Expand Down
5 changes: 4 additions & 1 deletion Documentation/admin-guide/mm/ksm.rst
Original file line number Diff line number Diff line change
Expand Up @@ -157,6 +157,8 @@ stable_node_chains_prune_millisecs

The effectiveness of KSM and MADV_MERGEABLE is shown in ``/sys/kernel/mm/ksm/``:

general_profit
how effective is KSM. The calculation is explained below.
pages_shared
how many shared pages are being used
pages_sharing
Expand Down Expand Up @@ -207,7 +209,8 @@ several times, which are unprofitable memory consumed.
ksm_rmap_items * sizeof(rmap_item).

where ksm_merging_pages is shown under the directory ``/proc/<pid>/``,
and ksm_rmap_items is shown in ``/proc/<pid>/ksm_stat``.
and ksm_rmap_items is shown in ``/proc/<pid>/ksm_stat``. The process profit
is also shown in ``/proc/<pid>/ksm_stat`` as ksm_process_profit.

From the perspective of application, a high ratio of ``ksm_rmap_items`` to
``ksm_merging_pages`` means a bad madvise-applied policy, so developers or
Expand Down
25 changes: 25 additions & 0 deletions Documentation/admin-guide/mm/userfaultfd.rst
Original file line number Diff line number Diff line change
Expand Up @@ -219,6 +219,31 @@ former will have ``UFFD_PAGEFAULT_FLAG_WP`` set, the latter
you still need to supply a page when ``UFFDIO_REGISTER_MODE_MISSING`` was
used.

Userfaultfd write-protect mode currently behave differently on none ptes
(when e.g. page is missing) over different types of memories.

For anonymous memory, ``ioctl(UFFDIO_WRITEPROTECT)`` will ignore none ptes
(e.g. when pages are missing and not populated). For file-backed memories
like shmem and hugetlbfs, none ptes will be write protected just like a
present pte. In other words, there will be a userfaultfd write fault
message generated when writing to a missing page on file typed memories,
as long as the page range was write-protected before. Such a message will
not be generated on anonymous memories by default.

If the application wants to be able to write protect none ptes on anonymous
memory, one can pre-populate the memory with e.g. MADV_POPULATE_READ. On
newer kernels, one can also detect the feature UFFD_FEATURE_WP_UNPOPULATED
and set the feature bit in advance to make sure none ptes will also be
write protected even upon anonymous memory.

When using ``UFFDIO_REGISTER_MODE_WP`` in combination with either
``UFFDIO_REGISTER_MODE_MISSING`` or ``UFFDIO_REGISTER_MODE_MINOR``, when
resolving missing / minor faults with ``UFFDIO_COPY`` or ``UFFDIO_CONTINUE``
respectively, it may be desirable for the new page / mapping to be
write-protected (so future writes will also result in a WP fault). These ioctls
support a mode flag (``UFFDIO_COPY_MODE_WP`` or ``UFFDIO_CONTINUE_MODE_WP``
respectively) to configure the mapping this way.

QEMU/KVM
========

Expand Down
16 changes: 11 additions & 5 deletions Documentation/core-api/printk-formats.rst
Original file line number Diff line number Diff line change
Expand Up @@ -575,20 +575,26 @@ The field width is passed by value, the bitmap is passed by reference.
Helper macros cpumask_pr_args() and nodemask_pr_args() are available to ease
printing cpumask and nodemask.

Flags bitfields such as page flags, gfp_flags
---------------------------------------------
Flags bitfields such as page flags, page_type, gfp_flags
--------------------------------------------------------

::

%pGp 0x17ffffc0002036(referenced|uptodate|lru|active|private|node=0|zone=2|lastcpupid=0x1fffff)
%pGt 0xffffff7f(buddy)
%pGg GFP_USER|GFP_DMA32|GFP_NOWARN
%pGv read|exec|mayread|maywrite|mayexec|denywrite

For printing flags bitfields as a collection of symbolic constants that
would construct the value. The type of flags is given by the third
character. Currently supported are [p]age flags, [v]ma_flags (both
expect ``unsigned long *``) and [g]fp_flags (expects ``gfp_t *``). The flag
names and print order depends on the particular type.
character. Currently supported are:

- p - [p]age flags, expects value of type (``unsigned long *``)
- t - page [t]ype, expects value of type (``unsigned int *``)
- v - [v]ma_flags, expects value of type (``unsigned long *``)
- g - [g]fp_flags, expects value of type (``gfp_t *``)

The flag names and print order depends on the particular type.

Note that this format should not be used directly in the
:c:func:`TP_printk()` part of a tracepoint. Instead, use the show_*_flags()
Expand Down
4 changes: 2 additions & 2 deletions Documentation/filesystems/locking.rst
Original file line number Diff line number Diff line change
Expand Up @@ -645,7 +645,7 @@ ops mmap_lock PageLocked(page)
open: yes
close: yes
fault: yes can return with page locked
map_pages: yes
map_pages: read
page_mkwrite: yes can return with page locked
pfn_mkwrite: yes
access: yes
Expand All @@ -661,7 +661,7 @@ locked. The VM will unlock the page.

->map_pages() is called when VM asks to map easy accessible pages.
Filesystem should find and map pages associated with offsets from "start_pgoff"
till "end_pgoff". ->map_pages() is called with page table locked and must
till "end_pgoff". ->map_pages() is called with the RCU lock held and must
not block. If it's not possible to reach a page without blocking,
filesystem should skip it. Filesystem should use do_set_pte() to setup
page table entry. Pointer to entry associated with the page is passed in
Expand Down
8 changes: 8 additions & 0 deletions Documentation/filesystems/proc.rst
Original file line number Diff line number Diff line change
Expand Up @@ -996,6 +996,7 @@ Example output. You may not have all of these fields.
VmallocUsed: 40444 kB
VmallocChunk: 0 kB
Percpu: 29312 kB
EarlyMemtestBad: 0 kB
HardwareCorrupted: 0 kB
AnonHugePages: 4149248 kB
ShmemHugePages: 0 kB
Expand Down Expand Up @@ -1146,6 +1147,13 @@ VmallocChunk
Percpu
Memory allocated to the percpu allocator used to back percpu
allocations. This stat excludes the cost of metadata.
EarlyMemtestBad
The amount of RAM/memory in kB, that was identified as corrupted
by early memtest. If memtest was not run, this field will not
be displayed at all. Size is never rounded down to 0 kB.
That means if 0 kB is reported, you can safely assume
there was at least one pass of memtest and none of the passes
found a single faulty byte of RAM.
HardwareCorrupted
The amount of RAM/memory in KB, the kernel identifies as
corrupted.
Expand Down
66 changes: 55 additions & 11 deletions Documentation/filesystems/tmpfs.rst
Original file line number Diff line number Diff line change
Expand Up @@ -13,17 +13,29 @@ everything stored therein is lost.

tmpfs puts everything into the kernel internal caches and grows and
shrinks to accommodate the files it contains and is able to swap
unneeded pages out to swap space. It has maximum size limits which can
be adjusted on the fly via 'mount -o remount ...'

If you compare it to ramfs (which was the template to create tmpfs)
you gain swapping and limit checking. Another similar thing is the RAM
disk (/dev/ram*), which simulates a fixed size hard disk in physical
RAM, where you have to create an ordinary filesystem on top. Ramdisks
cannot swap and you do not have the possibility to resize them.

Since tmpfs lives completely in the page cache and on swap, all tmpfs
pages will be shown as "Shmem" in /proc/meminfo and "Shared" in
unneeded pages out to swap space, if swap was enabled for the tmpfs
mount. tmpfs also supports THP.

tmpfs extends ramfs with a few userspace configurable options listed and
explained further below, some of which can be reconfigured dynamically on the
fly using a remount ('mount -o remount ...') of the filesystem. A tmpfs
filesystem can be resized but it cannot be resized to a size below its current
usage. tmpfs also supports POSIX ACLs, and extended attributes for the
trusted.* and security.* namespaces. ramfs does not use swap and you cannot
modify any parameter for a ramfs filesystem. The size limit of a ramfs
filesystem is how much memory you have available, and so care must be taken if
used so to not run out of memory.

An alternative to tmpfs and ramfs is to use brd to create RAM disks
(/dev/ram*), which allows you to simulate a block device disk in physical RAM.
To write data you would just then need to create an regular filesystem on top
this ramdisk. As with ramfs, brd ramdisks cannot swap. brd ramdisks are also
configured in size at initialization and you cannot dynamically resize them.
Contrary to brd ramdisks, tmpfs has its own filesystem, it does not rely on the
block layer at all.

Since tmpfs lives completely in the page cache and optionally on swap,
all tmpfs pages will be shown as "Shmem" in /proc/meminfo and "Shared" in
free(1). Notice that these counters also include shared memory
(shmem, see ipcs(1)). The most reliable way to get the count is
using df(1) and du(1).
Expand Down Expand Up @@ -72,6 +84,8 @@ nr_inodes The maximum number of inodes for this instance. The default
is half of the number of your physical RAM pages, or (on a
machine with highmem) the number of lowmem RAM pages,
whichever is the lower.
noswap Disables swap. Remounts must respect the original settings.
By default swap is enabled.
========= ============================================================

These parameters accept a suffix k, m or g for kilo, mega and giga and
Expand All @@ -85,6 +99,36 @@ mount with such options, since it allows any user with write access to
use up all the memory on the machine; but enhances the scalability of
that instance in a system with many CPUs making intensive use of it.

tmpfs also supports Transparent Huge Pages which requires a kernel
configured with CONFIG_TRANSPARENT_HUGEPAGE and with huge supported for
your system (has_transparent_hugepage(), which is architecture specific).
The mount options for this are:

====== ============================================================
huge=0 never: disables huge pages for the mount
huge=1 always: enables huge pages for the mount
huge=2 within_size: only allocate huge pages if the page will be
fully within i_size, also respect fadvise()/madvise() hints.
huge=3 advise: only allocate huge pages if requested with
fadvise()/madvise()
====== ============================================================

There is a sysfs file which you can also use to control system wide THP
configuration for all tmpfs mounts, the file is:

/sys/kernel/mm/transparent_hugepage/shmem_enabled

This sysfs file is placed on top of THP sysfs directory and so is registered
by THP code. It is however only used to control all tmpfs mounts with one
single knob. Since it controls all tmpfs mounts it should only be used either
for emergency or testing purposes. The values you can set for shmem_enabled are:

== ============================================================
-1 deny: disables huge on shm_mnt and all mounts, for
emergency use
-2 force: enables huge on shm_mnt and all mounts, w/o needing
option, for testing
== ============================================================

tmpfs has a mount option to set the NUMA memory allocation policy for
all files in that instance (if CONFIG_NUMA is enabled) - which can be
Expand Down
6 changes: 6 additions & 0 deletions Documentation/mm/active_mm.rst
Original file line number Diff line number Diff line change
Expand Up @@ -2,6 +2,12 @@
Active MM
=========

Note, the mm_count refcount may no longer include the "lazy" users
(running tasks with ->active_mm == mm && ->mm == NULL) on kernels
with CONFIG_MMU_LAZY_TLB_REFCOUNT=n. Taking and releasing these lazy
references must be done with mmgrab_lazy_tlb() and mmdrop_lazy_tlb()
helpers, which abstract this config option.

::

List: linux-kernel
Expand Down
2 changes: 1 addition & 1 deletion Documentation/mm/arch_pgtable_helpers.rst
Original file line number Diff line number Diff line change
Expand Up @@ -214,7 +214,7 @@ HugeTLB Page Table Helpers
+---------------------------+--------------------------------------------------+
| pte_huge | Tests a HugeTLB |
+---------------------------+--------------------------------------------------+
| pte_mkhuge | Creates a HugeTLB |
| arch_make_huge_pte | Creates a HugeTLB |
+---------------------------+--------------------------------------------------+
| huge_pte_dirty | Tests a dirty HugeTLB |
+---------------------------+--------------------------------------------------+
Expand Down
44 changes: 39 additions & 5 deletions Documentation/mm/multigen_lru.rst
Original file line number Diff line number Diff line change
Expand Up @@ -103,7 +103,8 @@ moving across tiers only involves atomic operations on
``folio->flags`` and therefore has a negligible cost. A feedback loop
modeled after the PID controller monitors refaults over all the tiers
from anon and file types and decides which tiers from which types to
evict or protect.
evict or protect. The desired effect is to balance refault percentages
between anon and file types proportional to the swappiness level.

There are two conceptually independent procedures: the aging and the
eviction. They form a closed-loop system, i.e., the page reclaim.
Expand Down Expand Up @@ -156,6 +157,27 @@ This time-based approach has the following advantages:
and memory sizes.
2. It is more reliable because it is directly wired to the OOM killer.

``mm_struct`` list
------------------
An ``mm_struct`` list is maintained for each memcg, and an
``mm_struct`` follows its owner task to the new memcg when this task
is migrated.

A page table walker iterates ``lruvec_memcg()->mm_list`` and calls
``walk_page_range()`` with each ``mm_struct`` on this list to scan
PTEs. When multiple page table walkers iterate the same list, each of
them gets a unique ``mm_struct``, and therefore they can run in
parallel.

Page table walkers ignore any misplaced pages, e.g., if an
``mm_struct`` was migrated, pages left in the previous memcg will be
ignored when the current memcg is under reclaim. Similarly, page table
walkers will ignore pages from nodes other than the one under reclaim.

This infrastructure also tracks the usage of ``mm_struct`` between
context switches so that page table walkers can skip processes that
have been sleeping since the last iteration.

Rmap/PT walk feedback
---------------------
Searching the rmap for PTEs mapping each page on an LRU list (to test
Expand All @@ -170,7 +192,7 @@ promotes hot pages. If the scan was done cacheline efficiently, it
adds the PMD entry pointing to the PTE table to the Bloom filter. This
forms a feedback loop between the eviction and the aging.

Bloom Filters
Bloom filters
-------------
Bloom filters are a space and memory efficient data structure for set
membership test, i.e., test if an element is not in the set or may be
Expand All @@ -186,6 +208,18 @@ is false positive, the cost is an additional scan of a range of PTEs,
which may yield hot pages anyway. Parameters of the filter itself can
control the false positive rate in the limit.

PID controller
--------------
A feedback loop modeled after the Proportional-Integral-Derivative
(PID) controller monitors refaults over anon and file types and
decides which type to evict when both types are available from the
same generation.

The PID controller uses generations rather than the wall clock as the
time domain because a CPU can scan pages at different rates under
varying memory pressure. It calculates a moving average for each new
generation to avoid being permanently locked in a suboptimal state.

Memcg LRU
---------
An memcg LRU is a per-node LRU of memcgs. It is also an LRU of LRUs,
Expand Down Expand Up @@ -223,9 +257,9 @@ parts:

* Generations
* Rmap walks
* Page table walks
* Bloom filters
* PID controller
* Page table walks via ``mm_struct`` list
* Bloom filters for rmap/PT walk feedback
* PID controller for refault feedback

The aging and the eviction form a producer-consumer model;
specifically, the latter drives the former by the sliding window over
Expand Down
2 changes: 2 additions & 0 deletions Documentation/mm/unevictable-lru.rst
Original file line number Diff line number Diff line change
Expand Up @@ -42,6 +42,8 @@ The unevictable list addresses the following classes of unevictable pages:

* Those owned by ramfs.

* Those owned by tmpfs with the noswap mount option.

* Those mapped into SHM_LOCK'd shared memory regions.

* Those mapped into VM_LOCKED [mlock()ed] VMAs.
Expand Down
5 changes: 4 additions & 1 deletion MAINTAINERS
Original file line number Diff line number Diff line change
Expand Up @@ -13457,13 +13457,14 @@ F: arch/powerpc/include/asm/membarrier.h
F: include/uapi/linux/membarrier.h
F: kernel/sched/membarrier.c

MEMBLOCK
MEMBLOCK AND MEMORY MANAGEMENT INITIALIZATION
M: Mike Rapoport <rppt@kernel.org>
L: linux-mm@kvack.org
S: Maintained
F: Documentation/core-api/boot-time-mm.rst
F: include/linux/memblock.h
F: mm/memblock.c
F: mm/mm_init.c
F: tools/testing/memblock/

MEMORY CONTROLLER DRIVERS
Expand Down Expand Up @@ -13498,6 +13499,7 @@ F: include/linux/memory_hotplug.h
F: include/linux/mm.h
F: include/linux/mmzone.h
F: include/linux/pagewalk.h
F: include/trace/events/ksm.h
F: mm/
F: tools/mm/
F: tools/testing/selftests/mm/
Expand All @@ -13506,6 +13508,7 @@ VMALLOC
M: Andrew Morton <akpm@linux-foundation.org>
R: Uladzislau Rezki <urezki@gmail.com>
R: Christoph Hellwig <hch@infradead.org>
R: Lorenzo Stoakes <lstoakes@gmail.com>
L: linux-mm@kvack.org
S: Maintained
W: http://www.linux-mm.org
Expand Down
Loading

0 comments on commit 7fa8a8e

Please sign in to comment.