pidref: use fd_inode_same to compare pidfds #31713

YHNdnzj · 2024-03-11T08:57:16Z

No description provided.

src/basic/pidref.c

YHNdnzj · 2024-03-11T10:21:51Z

Hmm, anon inodes has st_mode & S_IFMT == 0... This breaks the assumption in stat_inode_same

src/basic/stat-util.c

For anonymous inodes, the result would be 0, but the struct stat is initialized obviously. So let's switch to st_dev for the check, which is guaranteed to be non-zero. Also this is completely unnecessary for statx(), since we check stx_mask first and that on its own denotes that the struct is initialized.

YHNdnzj · 2024-03-12T06:15:15Z

Noble CI failures are in TEST-46-HOMED, shouldn't be related.

Enable pidfs unconditionally. There's no real reason not do to it. For 32bit systems we add a simple inode allocation mechanism that still guarantees that userspace can compare processes by inode number which they already do as I found out in [1]. If they also need the uniqueness property that we get by default on 64bit systems they should simply parse the contents of /proc/<pid>/fd/<nr>. On 64bit we don't have to deal with any of this and things are nice and simple. Link: systemd/systemd#31713 [1] Signed-off-by: Christian Brauner <brauner@kernel.org>

Enable pidfs unconditionally. There's no real reason not do to it. For 32bit systems we add a simple inode allocation mechanism that still guarantees that userspace can compare processes by inode number which they already do as I found out in [1]. If they also need the uniqueness property that we get by default on 64bit systems they should simply parse the contents of /proc/<pid>/fd/<nr>. On 64bit we don't have to deal with any of this and things are nice and simple. Link: systemd/systemd#31713 [1] Link: https://lore.kernel.org/r/20240312-dingo-sehnlich-b3ecc35c6de7@brauner Signed-off-by: Christian Brauner <brauner@kernel.org>

As Linus suggested this enables pidfs unconditionally. A key property to retain is the ability to compare pidfds by inode number (cf. [1]). That's extremely helpful just as comparing namespace file descriptors by inode number is. They are used in a variety of scenarios where they need to be compared, e.g., when receiving a pidfd via SO_PEERPIDFD from a socket to trivially authenticate a the sender and various other use-cases. For 64bit systems this is pretty trivial to do. For 32bit it's slightly more annoying as we discussed but we simply add a dumb ida based allocator that gets used on 32bit. This gives the same guarantees about inode numbers on 64bit without any overflow risk. Practically, we'll never run into overflow issues because we're contstrained by the number of processes that can exist on 32bit and by the number of open files that can exist on a 32bit system. On 64bit none of this matters and things are very simple. If 32bit also needs the uniqueness guarantee they can simply parse the contents of /proc/<pid>/fd/<nr>. The uniqueness guarantees have a variety of use-cases. One of the most obvious ones is that they will make pidfiles (or "pidfdfiles", I guess) reliable as the unique identifier can be placed into there that won't be reycled. Also a frequent request. Note, I took the chance and simplified path_from_stashed() even further. Instead of passing the inode number explicitly to path_from_stashed() we let the filesystem handle that internally. So path_from_stashed() ends up even simpler than it is now. This is also a good solution allowing the cleanup code to be clean and consistent between 32bit and 64bit. The cleanup path in prepare_anon_dentry() is also switched around so we put the inode before the dentry allocation. This means we only have to call the cleanup handler for the filesystem's inode data once and can rely ->evict_inode() otherwise. Aside from having to have a bit of extra code for 32bit it actually ends up a nice cleanup for path_from_stashed() imho. Tested on both 32 and 64bit including error injection. Link: systemd/systemd#31713 [1] Link: https://lore.kernel.org/r/20240312-dingo-sehnlich-b3ecc35c6de7@brauner Signed-off-by: Christian Brauner <brauner@kernel.org>

As Linus suggested this enables pidfs unconditionally. A key property to retain is the ability to compare pidfds by inode number (cf. [1]). That's extremely helpful just as comparing namespace file descriptors by inode number is. They are used in a variety of scenarios where they need to be compared, e.g., when receiving a pidfd via SO_PEERPIDFD from a socket to trivially authenticate a the sender and various other use-cases. For 64bit systems this is pretty trivial to do. For 32bit it's slightly more annoying as we discussed but we simply add a dumb ida based allocator that gets used on 32bit. This gives the same guarantees about inode numbers on 64bit without any overflow risk. Practically, we'll never run into overflow issues because we're constrained by the number of processes that can exist on 32bit and by the number of open files that can exist on a 32bit system. On 64bit none of this matters and things are very simple. If 32bit also needs the uniqueness guarantee they can simply parse the contents of /proc/<pid>/fd/<nr>. The uniqueness guarantees have a variety of use-cases. One of the most obvious ones is that they will make pidfiles (or "pidfdfiles", I guess) reliable as the unique identifier can be placed into there that won't be reycled. Also a frequent request. Note, I took the chance and simplified path_from_stashed() even further. Instead of passing the inode number explicitly to path_from_stashed() we let the filesystem handle that internally. So path_from_stashed() ends up even simpler than it is now. This is also a good solution allowing the cleanup code to be clean and consistent between 32bit and 64bit. The cleanup path in prepare_anon_dentry() is also switched around so we put the inode before the dentry allocation. This means we only have to call the cleanup handler for the filesystem's inode data once and can rely ->evict_inode() otherwise. Aside from having to have a bit of extra code for 32bit it actually ends up a nice cleanup for path_from_stashed() imho. Tested on both 32 and 64bit including error injection. Link: systemd/systemd#31713 [1] Link: https://lore.kernel.org/r/20240312-dingo-sehnlich-b3ecc35c6de7@brauner Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

* mm: remove folio from deferred split list before uncharging it When freeing a large folio, we must remove it from the deferred split list before we uncharge it as each memcg has its own deferred split list (with associated lock) and removing a folio from the deferred split list while holding the wrong lock will corrupt that list and cause various related problems. Link: https://lore.kernel.org/linux-mm/367a14f7-340e-4b29-90ae-bc3fcefdd5f4@arm.com/ Link: https://lkml.kernel.org/r/20240311191835.312162-1-willy@infradead.org Fixes: f77171d241e3 (mm: allow non-hugetlb large folios to be batch processed) Fixes: 29f3843026cf (mm: free folios directly in move_folios_to_lru()) Fixes: bc2ff4cbc329 (mm: free folios in a batch in shrink_folio_list()) Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Debugged-by: Ryan Roberts <ryan.roberts@arm.com> Tested-by: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> * mm: fix list corruption in put_pages_list My recent change to put_pages_list() dereferences folio->lru.next after returning the folio to the page allocator. Usually this is now on the pcp list with other free folios, so we try to free an already-free folio. This only happens with lists that have more than 15 entries, so it wasn't immediately discovered. Revert to using list_for_each_safe() so we dereference lru.next before disposing of the folio. Link: https://lkml.kernel.org/r/20240306212749.1823380-1-willy@infradead.org Fixes: 24835f899c01 ("mm: use free_unref_folios() in put_pages_list()") Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reported-by: "Borah, Chaitanya Kumar" <chaitanya.kumar.borah@intel.com> Closes: https://lore.kernel.org/intel-gfx/SJ1PR11MB61292145F3B79DA58ADDDA63B9232@SJ1PR11MB6129.namprd11.prod.outlook.com/ Signed-off-by: Andrew Morton <akpm@linux-foundation.org> * mm: add an explicit smp_wmb() to UFFDIO_CONTINUE Users of UFFDIO_CONTINUE may reasonably assume that a write memory barrier is included as part of UFFDIO_CONTINUE. That is, a user may believe that all writes it has done to a page that it is now UFFDIO_CONTINUE'ing are guaranteed to be visible to anyone subsequently reading the page through the newly mapped virtual memory region. Today, such a user happens to be correct. mmget_not_zero(), for example, is called as part of UFFDIO_CONTINUE (and comes before any PTE updates), and it implicitly gives us a write barrier. To be resilient against future changes, include an explicit smp_wmb(). While we're at it, optimize the smp_wmb() that is already incidentally present for the HugeTLB case. Merely making a syscall does not generally imply the memory ordering constraints that we need (including on x86). Link: https://lkml.kernel.org/r/20240307010250.3847179-1-jthoughton@google.com Signed-off-by: James Houghton <jthoughton@google.com> Reviewed-by: Peter Xu <peterx@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> * mm, vmscan: retry kswapd's priority loop with cache_trim_mode off on failure With cache_trim_mode on, reclaim logic doesn't bother reclaiming anon pages. However, it should be more careful to use the mode because it's going to prevent anon pages from being reclaimed even if there are a huge number of anon pages that are cold and should be reclaimed. Even worse, that leads kswapd_failures to reach MAX_RECLAIM_RETRIES and stopping kswapd from functioning until direct reclaim eventually works to resume kswapd. So kswapd needs to retry its scan priority loop with cache_trim_mode off again if the mode doesn't work for reclaim. The problematic behavior can be reproduced by: CONFIG_NUMA_BALANCING enabled sysctl_numa_balancing_mode set to NUMA_BALANCING_MEMORY_TIERING numa node0 (8GB local memory, 16 CPUs) numa node1 (8GB slow tier memory, no CPUs) Sequence: 1) echo 3 > /proc/sys/vm/drop_caches 2) To emulate the system with full of cold memory in local DRAM, run the following dummy program and never touch the region: mmap(0, 8 * 1024 * 1024 * 1024, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE | MAP_POPULATE, -1, 0); 3) Run any memory intensive work e.g. XSBench. 4) Check if numa balancing is working e.i. promotion/demotion. 5) Iterate 1) ~ 4) until numa balancing stops. With this, you could see that promotion/demotion are not working because kswapd has stopped due to ->kswapd_failures >= MAX_RECLAIM_RETRIES. Interesting vmstat delta's differences between before and after are like: +-----------------------+-------------------------------+ | interesting vmstat | before | after | +-----------------------+-------------------------------+ | nr_inactive_anon | 321935 | 1664772 | | nr_active_anon | 1780700 | 437834 | | nr_inactive_file | 30425 | 40882 | | nr_active_file | 14961 | 3012 | | pgpromote_success | 356 | 1293122 | | pgpromote_candidate | 21953245 | 1824148 | | pgactivate | 1844523 | 3311907 | | pgdeactivate | 50634 | 1554069 | | pgfault | 31100294 | 6518806 | | pgdemote_kswapd | 30856 | 2230821 | | pgscan_kswapd | 1861981 | 7667629 | | pgscan_anon | 1822930 | 7610583 | | pgscan_file | 39051 | 57046 | | pgsteal_anon | 386 | 2192033 | | pgsteal_file | 30470 | 38788 | | pageoutrun | 30 | 412 | | numa_hint_faults | 27418279 | 2875955 | | numa_pages_migrated | 356 | 1293122 | +-----------------------+-------------------------------+ Link: https://lkml.kernel.org/r/20240304082118.20499-1-byungchul@sk.com Signed-off-by: Byungchul Park <byungchul@sk.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> * mm/huge_memory: check new folio order when split a folio A folio can only be split into lower orders. Since there are no new_order checks in debugfs, any new_order can be passed via debugfs into split_huge_page_to_list_to_order(). Check new_order to make sure it is smaller than input folio order. Link: https://lkml.kernel.org/r/20240307181854.138928-1-zi.yan@sent.com Fixes: c010d47f107f ("mm: thp: split huge page to any lower order pages") Signed-off-by: Zi Yan <ziy@nvidia.com> Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/linux-mm/7dda9283-b437-4cf8-ab0d-83c330deb9c0@moroto.mountain/ Cc: David Hildenbrand <david@redhat.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> * mm/huge_memory: skip invalid debugfs new_order input for folio split User can put arbitrary new_order via debugfs for folio split test. Although new_order check is added to split_huge_page_to_list_order() in the prior commit, these two additional checks can avoid unnecessary folio locking and split_folio_to_order() calls. Link: https://lkml.kernel.org/r/20240307181854.138928-2-zi.yan@sent.com Signed-off-by: Zi Yan <ziy@nvidia.com> Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/linux-mm/7dda9283-b437-4cf8-ab0d-83c330deb9c0@moroto.mountain/ Cc: David Hildenbrand <david@redhat.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> * selftests/mm: dont fail testsuite due to a lack of hugepages Patch series "selftests/mm: Improve Hugepage Test Handling in MM Selftests", v2. This series addresses issues related to hugepage requirements in the MM selftests, ensuring tests are skipped rather than failing when the necessary hugepage count is not met. This adjustment allows for a more graceful handling for systems with insufficient hugepages, preventing unnecessary test failures and improving the overall robustness of the test suite. This patch (of 3): On systems that have large core counts and large page sizes, but limited memory, the userfaultfd test hugepage requirement is too large. Exiting early due to missing one test's requirements is a rather aggressive strategy, and prevents a lot of other tests from running. Remove the early exit to prevent this. Link: https://lkml.kernel.org/r/20240306223714.320681-1-npache@redhat.com Link: https://lkml.kernel.org/r/20240306223714.320681-2-npache@redhat.com Fixes: ee00479d6702 ("selftests: vm: Try harder to allocate huge pages") Signed-off-by: Nico Pache <npache@redhat.com> Cc: Ben Hutchings <ben@decadent.org.uk> Cc: Muhammad Usama Anjum <usama.anjum@collabora.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> * selftests/mm: skip uffd hugetlb tests with insufficient hugepages Now that run_vmtests.sh does not guarantee that the correct hugepage count is available, add a check inside the userfaultfd hugetlb test to verify the nr_hugepages count before continuing. Link: https://lkml.kernel.org/r/20240306223714.320681-3-npache@redhat.com Signed-off-by: Nico Pache <npache@redhat.com> Cc: Ben Hutchings <ben@decadent.org.uk> Cc: Muchun Song <muchun.song@linux.dev> Cc: Muhammad Usama Anjum <usama.anjum@collabora.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> * selftests/mm: skip the hugetlb-madvise tests on unmet hugepage requirements Now that run_vmtests.sh does not guarantee that the correct hugepage count is available, skip the hugetlb-madvise test if the requirements are not met rather than failing. Link: https://lkml.kernel.org/r/20240306223714.320681-4-npache@redhat.com Signed-off-by: Nico Pache <npache@redhat.com> Cc: Ben Hutchings <ben@decadent.org.uk> Cc: Muchun Song <muchun.song@linux.dev> Cc: Muhammad Usama Anjum <usama.anjum@collabora.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> * mul_u64_u64_div_u64: increase precision by conditionally swapping a and b As indicated in the added comment, the algorithm works better if b is big. As multiplication is commutative, a and b can be swapped. Do this if a is bigger than b. Link: https://lkml.kernel.org/r/20240303092408.662449-2-u.kleine-koenig@pengutronix.de Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Tested-by: Biju Das <biju.das.jz@bp.renesas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> * nilfs2: use div64_ul() instead of do_div() Fixes Coccinelle/coccicheck warnings reported by do_div.cocci. Compared to do_div(), div64_ul() does not implicitly cast the divisor and does not unnecessarily calculate the remainder. Link: https://lkml.kernel.org/r/20240229210456.63234-2-thorsten.blum@toblux.com Link: https://lkml.kernel.org/r/20240306142547.4612-1-konishi.ryusuke@gmail.com Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> * watchdog/core: remove sysctl handlers from public header The functions are only used in the file where they are defined. Remove them from the header and make them static. Also guard proc_soft_watchdog with a #define-guard as it is not used otherwise. Link: https://lkml.kernel.org/r/20240306-const-sysctl-prep-watchdog-v1-1-bd45da3a41cf@weissschuh.net Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> * buildid: use kmap_local_page() Use kmap_local_page() instead of kmap_atomic() which has been deprecated. Link: https://lkml.kernel.org/r/20240306034804.62087-1-flyingpeng@tencent.com Signed-off-by: Peng Hao <flyingpeng@tencent.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> * assoc_array: fix the return value in assoc_array_insert_mid_shortcut() Returning the edit variable is redundant because it is dereferenced right before it is returned. It would be better to return true. Found by Linux Verification Center (linuxtesting.org) with Svace. Link: https://lkml.kernel.org/r/20240307071717.5318-1-r.smirnov@omp.ru Signed-off-by: Roman Smirnov <r.smirnov@omp.ru> Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> * Revert "crypto: remove CONFIG_CRYPTO_STATS" This reverts commit 2beb81fbf0c01a62515a1bcef326168494ee2bd0. While removing CONFIG_CRYPTO_STATS is a worthy goal, this also removed unrelated infrastructure such as crypto_comp_alg_common. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> * mm, slab: remove last vestiges of SLAB_MEM_SPREAD Yes, yes, I know the slab people were planning on going slow and letting every subsystem fight this thing on their own. But let's just rip off the band-aid and get it over and done with. I don't want to see a number of unnecessary pull requests just to get rid of a flag that no longer has any meaning. This was mainly done with a couple of 'sed' scripts and then some manual cleanup of the end result. Link: https://lore.kernel.org/all/CAHk-=wji0u+OOtmAOD-5JV3SXcRJF___k_+8XNKmak0yd5vW1Q@mail.gmail.com/ Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> * ALSA: hda/realtek - ALC236 fix volume mute & mic mute LED on some HP models Some HP laptops have received revisions that altered their board IDs and therefore the current patches/quirks do not apply to them. Specifically, for my Probook 440 G8, I have a board ID of 8a74. It is necessary to add a line for that specific model. Signed-off-by: Valentine Altair <faetalize@proton.me> Cc: <stable@vger.kernel.org> Message-ID: <kOqXRBcxkKt6m5kciSDCkGqMORZi_HB3ZVPTX5sD3W1pKxt83Pf-WiQ1V1pgKKI8pYr4oGvsujt3vk2zsCE-DDtnUADFG6NGBlS5N3U4xgA=@proton.me> Signed-off-by: Takashi Iwai <tiwai@suse.de> * ALSA: hda/tas2781: remove unnecessary runtime_pm calls The runtime_pm handling seems to have been loosely inspired by the cs32l41 driver, but in this case the get_noresume/put sequence is not required. Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com> Message-ID: <20240312161217.79510-1-pierre-louis.bossart@linux.intel.com> Signed-off-by: Takashi Iwai <tiwai@suse.de> * Revert "arm64: mm: add support for WXN memory translation attribute" This reverts commit 50e3ed0f93f4f62ed2aa83de5db6cb84ecdd5707. The SCTLR_EL1.WXN control forces execute-never when a page has write permissions. While the idea of hardening such write/exec combinations is good, with permissions indirection enabled (FEAT_PIE) this control becomes RES0. FEAT_PIE introduces a slightly different form of WXN which only has an effect when the base permission is RWX and the write is toggled by the permission overlay (FEAT_POE, not yet supported by the arm64 kernel). Revert the patch for now. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Link: https://lore.kernel.org/r/ZfGESD3a91lxH367@arm.com * Revert "mm: add arch hook to validate mmap() prot flags" This reverts commit cb1a393c40eee2f1692c995ea0cc6e45bfccde4d. Since the arm64 WXN patch has been reverted, remove this hook as it would not have any users. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Link: https://lore.kernel.org/r/ZfGESD3a91lxH367@arm.com * ALSA: usb-audio: Stop parsing channels bits when all channels are found. If a usb audio device sets more bits than the amount of channels it could write outside of the map array. Signed-off-by: Johan Carlsson <johan.carlsson@teenage.engineering> Fixes: 04324ccc75f9 ("ALSA: usb-audio: add channel map support") Message-ID: <20240313081509.9801-1-johan.carlsson@teenage.engineering> Signed-off-by: Takashi Iwai <tiwai@suse.de> * mm: recover pud_leaf() definitions in nopmd case This reverts one change in commit 924bd6a8c967 ("mm/x86: drop two unnecessary pud_leaf() definitions"). One issue with that is it broke nopmd builds for at least both arm64 and riscv (CONFIG_PGTABLE_LEVELS=2). The other issue is it was overlooked that it's a common change rather than x86 specific (relevant to the commit message of the commit). Normally there's no need for empty definition of pXd_leaf() because of the fallback functions, however this logic may not apply to pgtable-nopmd.h, because that's a header that can even be used by arch *pgtable.h headers, which can use the *_leaf() definitions _before_ the fallback functions are defined. Leave it there to pass PGTABLE_LEVELS=2 builds. Link: https://lkml.kernel.org/r/Ze8vFNV9YSdgC2S7@x1n Fixes: 924bd6a8c967 ("mm/x86: drop two unnecessary pud_leaf() definitions") Signed-off-by: Peter Xu <peterx@redhat.com> Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202403090900.OwPqmRuI-lkp@intel.com/ Closes: https://lore.kernel.org/oe-kbuild-all/202403101607.a42gaLOS-lkp@intel.com/ Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> * mm: prohibit the last subpage from reusing the entire large folio In a Copy-on-Write (CoW) scenario, the last subpage will reuse the entire large folio, resulting in the waste of (nr_pages - 1) pages. This wasted memory remains allocated until it is either unmapped or memory reclamation occurs. The following small program can serve as evidence of this behavior main() { #define SIZE 1024 * 1024 * 1024UL void *p = malloc(SIZE); memset(p, 0x11, SIZE); if (fork() == 0) _exit(0); memset(p, 0x12, SIZE); printf("done\n"); while(1); } For example, using a 1024KiB mTHP by: echo always > /sys/kernel/mm/transparent_hugepage/hugepages-1024kB/enabled (1) w/o the patch, it takes 2GiB, Before running the test program, / # free -m total used free shared buff/cache available Mem: 5754 84 5692 0 17 5669 Swap: 0 0 0 / # /a.out & / # done After running the test program, / # free -m total used free shared buff/cache available Mem: 5754 2149 3627 0 19 3605 Swap: 0 0 0 (2) w/ the patch, it takes 1GiB only, Before running the test program, / # free -m total used free shared buff/cache available Mem: 5754 89 5687 0 17 5664 Swap: 0 0 0 / # /a.out & / # done After running the test program, / # free -m total used free shared buff/cache available Mem: 5754 1122 4655 0 17 4632 Swap: 0 0 0 This patch migrates the last subpage to a small folio and immediately returns the large folio to the system. It benefits both memory availability and anti-fragmentation. Link: https://lkml.kernel.org/r/20240308092721.144735-1-21cnbao@gmail.com Signed-off-by: Barry Song <v-songbaohua@oppo.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Lance Yang <ioworker0@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> * memtest: use {READ,WRITE}_ONCE in memory scanning memtest failed to find bad memory when compiled with clang. So use {WRITE,READ}_ONCE to access memory to avoid compiler over optimization. Link: https://lkml.kernel.org/r/20240312080422.691222-1-qiang4.zhang@intel.com Signed-off-by: Qiang Zhang <qiang4.zhang@intel.com> Cc: Bill Wendling <morbo@google.com> Cc: Justin Stitt <justinstitt@google.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> * crypto: introduce: acomp_is_async to expose if comp drivers might sleep acomp's users might want to know if acomp is really async to optimize themselves. One typical user which can benefit from exposed async stat is zswap. In zswap, zsmalloc is the most commonly used allocator for (and perhaps the only one). For zsmalloc, we cannot sleep while we map the compressed memory, so we copy it to a temporary buffer. By knowing the alg won't sleep can help zswap to avoid the need for a buffer. This shows noticeable improvement in load/store latency of zswap. Link: https://lkml.kernel.org/r/20240222081135.173040-2-21cnbao@gmail.com Signed-off-by: Barry Song <v-songbaohua@oppo.com> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: Chris Li <chrisl@kernel.org> Cc: Chengming Zhou <zhouchengming@bytedance.com> Cc: Dan Streetman <ddstreet@ieee.org> Cc: David S. Miller <davem@davemloft.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Seth Jennings <sjenning@redhat.com> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> * mm/zswap: remove the memcpy if acomp is not sleepable Most compressors are actually CPU-based and won't sleep during compression and decompression. We should remove the redundant memcpy for them. This patch checks if the algorithm is sleepable by testing the CRYPTO_ALG_ASYNC algorithm flag. Generally speaking, async and sleepable are semantically similar but not equal. But for compress drivers, they are basically equal at least due to the below facts. Firstly, scompress drivers - crypto/deflate.c, lz4.c, zstd.c, lzo.c etc have no sleep. Secondly, zRAM has been using these scompress drivers for years in atomic contexts, and never worried those drivers going to sleep. One exception is that an async driver can sometimes still return synchronously per Herbert's clarification. In this case, we are still having a redundant memcpy. But we can't know if one particular acomp request will sleep or not unless crypto can expose more details for each specific request from offload drivers. Link: https://lkml.kernel.org/r/20240222081135.173040-3-21cnbao@gmail.com Signed-off-by: Barry Song <v-songbaohua@oppo.com> Tested-by: Chengming Zhou <zhouchengming@bytedance.com> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Acked-by: Yosry Ahmed <yosryahmed@google.com> Reviewed-by: Chengming Zhou <zhouchengming@bytedance.com> Acked-by: Chris Li <chrisl@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Dan Streetman <ddstreet@ieee.org> Cc: David S. Miller <davem@davemloft.net> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Seth Jennings <sjenning@redhat.com> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> * pidfs: remove config option As Linus suggested this enables pidfs unconditionally. A key property to retain is the ability to compare pidfds by inode number (cf. [1]). That's extremely helpful just as comparing namespace file descriptors by inode number is. They are used in a variety of scenarios where they need to be compared, e.g., when receiving a pidfd via SO_PEERPIDFD from a socket to trivially authenticate a the sender and various other use-cases. For 64bit systems this is pretty trivial to do. For 32bit it's slightly more annoying as we discussed but we simply add a dumb ida based allocator that gets used on 32bit. This gives the same guarantees about inode numbers on 64bit without any overflow risk. Practically, we'll never run into overflow issues because we're constrained by the number of processes that can exist on 32bit and by the number of open files that can exist on a 32bit system. On 64bit none of this matters and things are very simple. If 32bit also needs the uniqueness guarantee they can simply parse the contents of /proc/<pid>/fd/<nr>. The uniqueness guarantees have a variety of use-cases. One of the most obvious ones is that they will make pidfiles (or "pidfdfiles", I guess) reliable as the unique identifier can be placed into there that won't be reycled. Also a frequent request. Note, I took the chance and simplified path_from_stashed() even further. Instead of passing the inode number explicitly to path_from_stashed() we let the filesystem handle that internally. So path_from_stashed() ends up even simpler than it is now. This is also a good solution allowing the cleanup code to be clean and consistent between 32bit and 64bit. The cleanup path in prepare_anon_dentry() is also switched around so we put the inode before the dentry allocation. This means we only have to call the cleanup handler for the filesystem's inode data once and can rely ->evict_inode() otherwise. Aside from having to have a bit of extra code for 32bit it actually ends up a nice cleanup for path_from_stashed() imho. Tested on both 32 and 64bit including error injection. Link: https://github.com/systemd/systemd/pull/31713 [1] Link: https://lore.kernel.org/r/20240312-dingo-sehnlich-b3ecc35c6de7@brauner Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> * block: limit block time caching to in_task() context We should not have any callers of this from non-task context, but Jakub ran [1] into one from blk-iocost. Rather than risk running into others, or future ones, just limit blk_time_get_ns() to when it is called from a task. Any other usage is invalid. [1] https://lore.kernel.org/lkml/CAHk-=wiOaBLqarS2uFhM1YdwOvCX4CZaWkeyNDY1zONpbYw2ig@mail.gmail.com/ Fixes: da4c8c3d0975 ("block: cache current nsec time in struct blk_plug") Reported-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> * Revert "block/mq-deadline: use correct way to throttling write requests" The code "max(1U, 3 * (1U << shift) / 4)" comes from the Kyber I/O scheduler. The Kyber I/O scheduler maintains one internal queue per hwq and hence derives its async_depth from the number of hwq tags. Using this approach for the mq-deadline scheduler is wrong since the mq-deadline scheduler maintains one internal queue for all hwqs combined. Hence this revert. Cc: stable@vger.kernel.org Cc: Damien Le Moal <dlemoal@kernel.org> Cc: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com> Cc: Zhiguo Niu <Zhiguo.Niu@unisoc.com> Fixes: d47f9717e5cf ("block/mq-deadline: use correct way to throttling write requests") Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20240313214218.1736147-1-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk> * mtd: spi-nor: core: correct type of i The i should be signed to find out the end of the loop. Otherwise, i >= 0 is always true and loop becomes infinite. Make its type to be int. Fixes: 6a9eda34418f ("mtd: spi-nor: core: set mtd->eraseregions for non-uniform erase map") Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com> Reviewed-by: Tudor Ambarus <tudor.ambarus@linaro.org> Reviewed-by: Michael Walle <mwalle@kernel.org> Reviewed-by: Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com> Link: https://lore.kernel.org/linux-mtd/20240304090103.818092-1-usama.anjum@collabora.com * mempool: kvmalloc pool Add mempool_init_kvmalloc_pool() and mempool_create_kvmalloc_pool(), which wrap kvmalloc() instead of kmalloc() - kmalloc() with a vmalloc() fallback. This is part of a bcachefs cleanup - dropping an internal kvpmalloc() helper (which predates kvmalloc()) along with mempool helpers; this replaces the bcachefs-private kvpmalloc_pool. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Cc: linux-mm@kvack.org * bcachefs: kill kvpmalloc() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> * bcachefs: thread_with_stdio: eliminate double buffering The output buffer lock has to be a spinlock so that we can write to it from interrupt context, so we can't use a direct copy_to_user; this switches thread_with_file_read() to use fault_in_writeable() and copy_to_user_nofault(), similar to how thread_with_file_write() works. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> * bcachefs: thread_with_stdio: convert to darray - eliminate the dependency on printbufs, so that we can lift thread_with_file for use in xfs - add a nonblocking parameter to stdio_redirect_printf(), and either block if the buffer is full or drop it on the floor - don't buffer infinitely Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> * bcachefs: thread_with_stdio: kill thread_with_stdio_done() Move the cleanup code to a wrapper function, where we can call it after the thread_with_stdio fn exits. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> * bcachefs: thread_with_stdio: fix bch2_stdio_redirect_readline() This fixes a bug where we'd return data without waiting for a newline, if data was present but a newline was not. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> * bcachefs: Thread with file documentation Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> * of: Add cleanup.h based auto release via __free(device_node) markings The recent addition of scope based cleanup support to the kernel provides a convenient tool to reduce the chances of leaking reference counts where of_node_put() should have been called in an error path. This enables struct device_node *child __free(device_node) = NULL; for_each_child_of_node(np, child) { if (test) return test; } with no need for a manual call of of_node_put(). A following patch will reduce the scope of the child variable to the for loop, to avoid an issues with ordering of autocleanup, and make it obvious when this assigned a non NULL value. In this simple example the gains are small but there are some very complex error handling cases buried in these loops that will be greatly simplified by enabling early returns with out the need for this manual of_node_put() call. Note that there are coccinelle checks in scripts/coccinelle/iterators/for_each_child.cocci to detect a failure to call of_node_put(). This new approach does not cause false positives. Longer term we may want to add scripting to check this new approach is done correctly with no double of_node_put() calls being introduced due to the auto cleanup. It may also be useful to script finding places this new approach is useful. Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Rob Herring <robh@kernel.org> Link: https://lore.kernel.org/r/20240225142714.286440-2-jic23@kernel.org Signed-off-by: Rob Herring <robh@kernel.org> * of: Introduce for_each_*_child_of_node_scoped() to automate of_node_put() handling To avoid issues with out of order cleanup, or ambiguity about when the auto freed data is first instantiated, do it within the for loop definition. The disadvantage is that the struct device_node *child variable creation is not immediately obvious where this is used. However, in many cases, if there is another definition of struct device_node *child; the compiler / static analysers will notify us that it is unused, or uninitialized. Note that, in the vast majority of cases, the _available_ form should be used and as code is converted to these scoped handers, we should confirm that any cases that do not check for available have a good reason not to. Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Rob Herring <robh@kernel.org> Link: https://lore.kernel.org/r/20240225142714.286440-3-jic23@kernel.org Signed-off-by: Rob Herring <robh@kernel.org> * of: unittest: Use for_each_child_of_node_scoped() A simple example of the utility of this autocleanup approach to handling of_node_put(). In this particular case some of the nodes needed for the test are not available and the _available_ version would cause them to be skipped resulting in a test failure. Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Rob Herring <robh@kernel.org> Link: https://lore.kernel.org/r/20240225142714.286440-4-jic23@kernel.org Signed-off-by: Rob Herring <robh@kernel.org> * bcachefs: thread_with_stdio: Mark completed in ->release() This fixes stdio_redirect_read() getting stuck, not noticing that the pipe has been closed. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> * kernel/hung_task.c: export sysctl_hung_task_timeout_secs needed for thread_with_file; also rare but not unheard of to need this in module code, when blocking on user input. one workaround used by some code is wait_event_interruptible() - but that can be buggy if the outer context isn't expecting unwinding. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: fuyuanli <fuyuanli@didiglobal.com> * bcachefs: thread_with_stdio: suppress hung task warning Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> * bcachefs: thread_with_file: allow creation of readonly files Create a new run_thread_with_stdout function that opens a file in O_RDONLY mode so that the kernel can write things to userspace but userspace cannot write to the kernel. This will be used to convey xfs health event information to userspace. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> * bcachefs: thread_with_file: fix various printf problems Experimentally fix some problems with stdio_redirect_vprintf by creating a MOO variant with which we can experiment. We can't do a GFP_KERNEL allocation while holding the spinlock, and I don't like how the printf function can silently truncate the output if memory allocation fails. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> * bcachefs: thread_with_file: create ops structure for thread_with_stdio Create an ops structure so we can add more file-based functionality in the next few patches. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> * bcachefs: thread_with_file: allow ioctls against these files Make it so that a thread_with_stdio user can handle ioctls against the file descriptor. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> * bcachefs: thread_with_file: Fix missing va_end() Fixes: https://lore.kernel.org/linux-bcachefs/202402131603.E953E2CF@keescook/T/#u Reported-by: coverity scan Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> * bcachefs: thread_with_file: add f_ops.flush Add a flush op, to return the exit code via close(). Also update bcachefs usage to use this to return fsck exit codes. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> * bcachefs: Kill more -EIO error codes This converts -EIOs related to btree node errors to private error codes, which will help with some ongoing debugging by giving us better error messages. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> * bcachefs: Check subvol <-> inode pointers in check_subvol() Subvolumes and subvolume root inodes point to each other: this verifies the subvolume -> inode -> subvolme path. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> * bcachefs: Check subvol <-> inode pointers in check_inode() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> * bcachefs: check_inode_dirent_inode() check that if an inode has a backpointer, the dirent it points to points back to it. We do this in check_dirent_inode_dirent(), but only for inodes that have dirents that point to them - we also have to do the check starting from the inode to catch inodes that don't have dirents that point to them. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> * bcachefs: better log message in lookup_inode_for_snapshot() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> * bcachefs: check bi_parent_subvol in check_inode() check for inodes with a nonzero bi_parent_subvol field that aren't actually subvolume roots Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> * bcachefs: simplify check_dirent_inode_dirent() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> * bcachefs: delete duplicated checks in check_dirent_to_subvol() these were already checked in check_subvol() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> * bcachefs: check inode->bi_parent_subvol against dirent Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> * bcachefs: check dirent->d_parent_subvol Check that d_parent_subvol makes sense - the dirent's snapshot must be visible in d_parent_subvol (i.e. an ancestor of d_parent_subvol's snapshot) in order to be visible. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> * bcachefs: Repair subvol dirents that point to non subvols when repair switches d_type to or from DT_SUBVOL, we need to update the target accordingly Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> * bcachefs: bch_subvolume::parent -> creation_parent bit of renaming, prep for adding a fs path parent Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> * bcachefs: Fix path where dirent -> subvol missing and we don't fix Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> * bcachefs: Pass inode bkey to check_path() prep work for improving logging/error messages Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> * bcachefs: check_path() now prints full inode when reattaching Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> * bcachefs: Correctly reattach subvolumes Subvolumes need special handling to reattach - we always reattach them in the root subvolume's lost+found, and they need a slightly different kind of dirent. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> * bcachefs: bch2_btree_bit_mod -> bch2_btree_bit_mod_buffered Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> * bcachefs: bch2_btree_bit_mod() Provide a non-write buffer version of bch2_btree_bit_mod_buffered(), for the subvolume children btree. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> * bcachefs: bch_subvolume::fs_path_parent Record the filesystem path heirarchy for subvolumes in bch_subvolume Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> * bcachefs: BTREE_ID_subvolume_children Add a btree to record a parent -> child subvolume relationships, according to the filesystem heirarchy. The subvolume_children btree is a bitset btree: if a bit is set at pos p, that means p.offset is a child of subvolume p.inode. This will be used for efficiently listing subvolumes, as well as recursive deletion. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> * bcachefs: Check for subvolume children when deleting subvolumes Recursively destroying subvolumes isn't allowed yet. Fixes: https://github.com/koverstreet/bcachefs/issues/634 Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> * bcachefs: Pin btree cache in ram for random access in fsck Various phases of fsck involve checking references from one btree to another: this means doing a sequential scan of one btree, and then mostly random access into the second. This is particularly painful for checking extents <-> backpointers; we can prefetch btree node access on the sequential scan, but not on the random access portion, and this is particularly painful on spinning rust, where we'd like to keep the pipeline fairly full of btree node reads so that the elevator can reduce seeking. This patch implements prefetching and pinning of the portion of the btree that we'll be doing random access to. We already calculate how much of the random access btree will fit in memory so it's a fairly straightforward change. This will put more pressure on system memory usage, so we introduce a new option, fsck_memory_usage_percent, which is the percentage of total system ram that fsck is allowed to pin. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> * bcachefs: Save key_cache_path in peek_slot() When bch2_btree_iter_peek_slot() clones the iterator to search for the next key, and then discovers that the key from the cloned iterator is the key we want to return - we also want to save the iter->key_cache_path as well, for the update path. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> * bcachefs: Track iter->ip_allocated at bch2_trans_copy_iter() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> * bcachefs: Use kvzalloc() when dynamically allocating btree paths THis silences a mm/page_alloc.c warning about allocating more than a page with GFP_NOFAIL - and there's no reason for this to not have a vmalloc fallback anyways. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> * bcachefs: Improve error messages in device remove path Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> * bcachefs: bch2_print_opts() Make sure early error messages get redirected, for kernel-fsck-from-userland. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> * bcachefs: bch2_trigger_alloc() handles state changes better bch2_trigger_alloc() kicks off certain tasks on bucket state changes; e.g. triggering the bucket discard worker and the invalidate worker. We've observed the discard worker running too often - most runs it doesn't do any work, according to the tracepoint - so clearly, we're kicking it off too often. This adds an explicit statechange() macro to make these checks more precise. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> * bcachefs: omit alignment attribute on big endian struct bkey This is needed for building Rust bindings on big endian architectures like s390x. Currently this is only done in userspace, but it might happen in-kernel in the future. When creating a Rust binding for struct bkey, the "packed" attribute is needed to get a type with the correct member offsets in the big endian case. However, rustc does not allow types to have both a "packed" and "align" attribute. Thus, in order to get a Rust type compatible with the C type, we must omit the "aligned" attribute in C. This does not affect the struct's size or member offsets, only its toplevel alignment, which should be an acceptable impact. The little endian version can have the "align" attribute because the "packed" attr is redundant, and rust-bindgen will omit the "packed" attr when an "align" attr is present and it can do so without changing a type's layout Signed-off-by: Thomas Bertschinger <tahbertschinger@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> * bcachefs: bch2_check_subvolume_structure() Now that we've got bch_subvolume.fs_path_parent, it's easy to write subvolume Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> * bcachefs: check_path() now only needs to walk up to subvolume root Now that checking subvolume structure is a separate pass, the main check_directory_connectivity() pass only needs to walk up to a given inode's subvolume root. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> * bcachefs: more informative write path error message Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> * bcachefs: rebalance_status now shows correct units Signed-off-by: Daniel Hill <daniel@gluo.nz> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> * bcachefs: Drop redundant btree_path_downgrade()s If a path doesn't have any active references, we shouldn't downgrade it; it'll either be reused, possibly with intent refs again, or dropped at bch2_trans_begin() time. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> * bcachefs: improve bch2_journal_buf_to_text() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> * bcachefs: Split out discard fastpath Buckets usually can't be discarded until the transaction that made them empty has been committed in the journal. Tracing has indicated that we're queuing the discard worker excessively, only for it to skip over many buckets that are still waiting on a journal commit, discarding only one or two buckets per iteration. We want to switch to only queuing the discard worker after a journal flush write, but there's an important optimization we need to preserve: if a bucket becomes empty and it was never committed in the journal while it was in use, we want to discard it and reuse it right away - since overwriting it before the previous writes are flushed from the device cache eans those writes only cost bus bandwidth. So, this patch implements a fast path for buckets that can be discarded right away. We need new locking between the two discard workers; the new list of buckets being discarded provides that locking. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> * bcachefs: Fix journal_buf bitfield accesses All jounal_buf bitfield updates must happen under the journal lock - perhaps we should just switch these to atomic bit flags. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> * bcachefs: Add journal.blocked to journal_debug_to_text() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> * bcachefs: Silence gcc warnings about arm arch ABI drift 32-bit arm builds emit a lot of spam like this: fs/bcachefs/backpointers.c: In function ‘extent_matches_bp’: fs/bcachefs/backpointers.c:15:13: note: parameter passing for argument of type ‘struct bch_backpointer’ changed in GCC 9.1 Apply the change from commit ebcc5928c5d9 ("arm64: Silence gcc warnings about arch ABI drift") to fs/bcachefs/ to silence them. Signed-off-by: Calvin Owens <jcalvinowens@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> * bcachefs: remove redundant assignment to variable ret Variable ret is being assigned a value that is never read, it is being re-assigned a couple of statements later on. The assignment is redundant and can be removed. Cleans up clang scan build warning: fs/bcachefs/super-io.c:806:2: warning: Value stored to 'ret' is never read [deadcode.DeadStores] Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> * bcachefs: Errcode tracepoint, documentation Add a tracepoint for downcasting private errors to standard errors, so they can be recovered even when not logged; also, add some documentation. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> * bcachefs: jset_entry for loops declare loop iter Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> * bcachefs: Rename journal_keys.d -> journal_keys.data This will let us use some darray helpers in the next patch. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> * bcachefs: journal_keys now uses darray helpers nice bit of code cleanup Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> * bcachefs: improve move_gap() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> * bcachefs: split out ignore_blacklisted, ignore_not_dirty prep work for replaying the journal backwards Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> * bcachefs: fix the error code when mounting with incorrect options. When mount with incorrect options such as: "mount -t bcachefs -o errors=back /dev/loop1 /mnt/bcachefs/". It rebacks the error "mount: /mnt/bcachefs: permission denied." cause bch2_parse_mount_opts returns -1 and bch2_mount throws it up. This is unreasonable. The real error message should be like this: "mount: /mnt/bcachefs: wrong fs type, bad option, bad superblock on /dev/loop1, missing codepage or helper program, or other error." Adding three private error codes for mounting error. Here are: - BCH_ERR_mount_option as the parent class for option error. - BCH_ERR_option_name represents the invalid option name. - BCH_ERR_option_value represents the invalid option value. Signed-off-by: Hongbo Li <lihongbo22@huawei.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> * bcachefs: Fix bch2_journal_noflush_seq() Improved journal pipelining broke journal_noflush_seq(); it implicitly assumed only the oldest outstanding journal buf could be in flight, but that's no longer true. Make this more straightforward by just setting buf->must_flush whenever we know a journal buf is going to be flush. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> * fs: file_remove_privs_flags() Rename and export __file_remove_privs(); for a buffered write path that doesn't take the inode lock we need to be able to check if the operation needs to do work first. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> * bcachefs: Buffered write path now can avoid the inode lock Non append, non extending buffered writes can now avoid taking the inode lock. To ensure atomicity of writes w.r.t. other writes, we lock every folio that we'll be writing to, and if this fails we fall back to taking the inode lock. Extensive comments are provided as to corner cases. Link: https://lore.kernel.org/linux-fsdevel/Zdkxfspq3urnrM6I@bombadil.infradead.org/ Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> * bcachefs: avoid returning private error code in bch2_xattr_bcachefs_set Avoid the private error code return to caller. The error code should be transformed into genernal error code. Signed-off-by: Hongbo Li <lihongbo22@huawei.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> * bcachefs: intercept mountoption value for bool type For mount option with bool type, the value must be 0 or 1 (See bch2_opt_parse). But this seems does not well intercepted cause for other value(like 2...), it returns the unexpect return code with error message printed. Signed-off-by: Hongbo Li <lihongbo22@huawei.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> * bcachefs: fix lost journal buf wakeup due to improved pipelining The journal_write_done() handler was reworked into a loop in commit 746a33c96b7a ("bcachefs: better journal pipelining"). As part of this, the journal buffer wake was factored into a post-loop branch that executes if at least one journal buffer has completed. The journal buffer processing loop iterates on the journal buffer pointer, however. This means that w refers to the last buffer processed by the loop, which may or may not be done. This also means that if multiple buffers are processed by the loop, only the last is awoken. This lost wakeup behavior has lead to stalling problems in various CI and fstests, such as generic/703. Lift the wake into the loop so each done buffer sees a wake call as it is processed. Signed-off-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> * bcachefs: Split out bkey_types.h We're going to need bkey_types.h in bcachefs_ioctl.h in a future patch. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> * bcachefs: copy_(to|from)_user_errcode() we've got some helpers that return errors sanely, move them to a more common location for use in fs-ioctl.c Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> * lib/generic-radix-tree.c: Make nodes more reasonably sized this code originally used the page allocator directly, but most code shouldn't do that - PAGE_SIZE varies with architecture, and slab is faster. 4k is also on the large side for typical usage, 512 bytes is a better choice for typical usage that might be somewhat sparse. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> * bcachefs: fix bch2_journal_buf_to_text() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> * bcachefs: Check for writing superblocks with nonsense member seq fields We're seeing some unmountable filesystems due to split brain detection going awry; it seems we somehow wrote out superblocks where we updated the superblock seq without updating any member seq fields. A given device's superblock should always have the main seq equal to it's member seq field, so this is easy to check for. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> * bcachefs: Kill unused flags argument to btree_split() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> * bcachefs: Prefer struct_size over open coded arithmetic This is an effort to get rid of all multiplications from allocation functions in order to prevent integer overflows [1][2]. As the "op" variable is a pointer to "struct promote_op" and this structure ends in a flexible array: struct promote_op { [...] struct bio_vec bi_inline_vecs[]; }; and the "t" variable is a pointer to "struct journal_seq_blacklist_table" and this structure also ends in a flexible array: struct journal_seq_blacklist_table { [...] struct journal_seq_blacklist_table_entry { u64 start; u64 end; bool dirty; } entries[]; }; the preferred way in the kernel is to use the struct_size() helper to do the arithmetic instead of the argument "size + size * count" in the kzalloc() functions. This way, the code is more readable and safer. Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#open-coded-arithmetic-in-allocator-arguments [1] Link: https://github.com/KSPP/linux/issues/160 [2] Signed-off-by: Erick Archer <erick.archer@gmx.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> * bcachefs: fix deletion of indirect extents in btree_gc we need to run the normal extent update path on deletion - bch2_bkey_make_mut() is incorrect when key type is changing. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> * bcachefs: Fix order of gc_done passes gc_stripes_done() and gc_reflink_done() may do alloc btree updates (i.e. when deleting an indirect extent) - we need bucket gens to be fixed by then. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> * bcachefs: Always flush write buffer in delete_dead_inodes() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> * bcachefs: Fix btree key cache coherency during replay Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> * bcachefs: fix bch_folio_sector padding Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> * bcachefs: reconstruct_alloc cleanup Now that we've got the errors_silent mechanism, we don't have to check if the reconstruct_alloc option is set all over the place. Also - users no longer have to explicitly select fsck and fix_errors. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> * bcachefs: pull out time_stats.[ch] prep work for lifting out of fs/bcachefs/ Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> * bcachefs: time_stats: add larger units Filesystems can stay mounted for a very long time, so add some larger units. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> * bcachefs: mean_and_variance: put struct mean_and_variance_weighted on a diet The only caller of this code (time_stats) always knows the weights and whether or not any information has been collected. Pass this information into the mean and variance code so that it doesn't have to store that information. This reduces the structure size from 24 to 16 bytes, which shrinks each time_stats counter to 192 bytes from 208. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> * bcachefs: time_stats: split stats-with-quantiles into a separate structure Currently, struct time_stats has the optional ability to quantize the information that it collects. This is /probably/ useful for callers who want to see quantized information, but it more than doubles the size of the structure from 224 bytes to 464. For users who don't care about that (e.g. upcoming xfs patches) and want to avoid wasting 240 bytes per counter, split the two into separate pieces. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> * bcachefs: time_stats: shrink time_stat_buffer for better alignment Shrink this percpu object by one array element so that the object size becomes exactly 512 bytes. This will lead to more efficient memory use, hopefully. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> * Revert "blk-lib: check for kill signal" This reverts commit 8a08c5fd89b447a7de7eb293a7a274c46b932ba2. It turns out while this is a perfectly valid and long overdue thing to do for user initiated discards / zeroing from the ioctl handler, it actually breaks file system use of the discard helper by interrupting in places the file system doesn't expect, and by leaving the bio chain in a state that the file system callers of (at least) __blkdev_issue_discard do not expect. Revert the change for now, we'll redo it for the next merge window after refactoring the code to better split the file system vs ioctl callers and cleaning up a few other loose ends. Reported-by: Chandan Babu R <chandanbabu@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Keith Busch <kbusch@kernel.org> Link: https://lore.kernel.org/r/20240314021623.1908895-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk> * lsm: use 32-bit compatible data types in LSM syscalls Change the size parameters in lsm_list_modules(), lsm_set_self_attr() and lsm_get_self_attr() from size_t to u32. This avoids the need to have different interfaces for 32 and 64 bit systems. Cc: stable@vger.kernel.org Fixes: a04a1198088a ("LSM: syscalls for current process attributes") Fixes: ad4aff9ec25f ("LSM: Create lsm_list_modules system call") Signed-off-by: Casey Schaufler <casey@schaufler-ca.com> Reported-and-reviewed-by: Dmitry V. Levin <ldv@strace.io> [PM: subject and metadata tweaks, syscall.h fixes] Signed-off-by: Paul Moore <paul@paul-moore.com> * lsm: handle the NULL buffer case in lsm_fill_user_ctx() Passing a NULL buffer into the lsm_get_self_attr() syscall is a valid way to quickly determine the minimum size of the buffer needed to for the syscall to return all of the LSM attributes to the caller. Unfortunately we/I broke that behavior in commit d7cf3412a9f6 ("lsm: consolidate buffer size handling into lsm_fill_user_ctx()") such that it returned an error to the caller; this patch restores the original desired behavior of using the NULL buffer as a quick way to correctly size the attribute buffer. Cc: stable@vger.kernel.org Fixes: d7cf3412a9f6 ("lsm: consolidate buffer size handling into lsm_fill_user_ctx()") Signed-off-by: Paul Moore <paul@paul-moore.com> * block: fix mismatched kerneldoc function name No functional modification involved. block/blk-settings.c:281: warning: expecting prototype for queue_limits_commit_set(). Prototype was for queue_limits_set() instead. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=8539 Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Link: https://lore.kernel.org/r/20240314025615.71269-1-jiapeng.chong@linux.alibaba.com Signed-off-by: Jens Axboe <axboe@kernel.dk> * ocfs2: remove SLAB_MEM_SPREAD flag usage The SLAB_MEM_SPREAD flag is already a no-op as of 6.8-rc1, remove its usage so we can delete it from slab. No functional change. Link: https://lkml.kernel.org/r/20240224135008.829878-1-chengming.zhou@linux.dev Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> * ocfs2: enable ocfs2_listxattr for special files For special files in S_IFBLK/S_IFCHR/S_IFIFO type, we already have ocfs2_setattr and ocfs2_getattr enabled. It's confusing for user space if it can use setattr/getattr to control one attribute appointed but can not list attributes using listxattr for above type files: $ mknod /mnt/b b 0 0 $ setfattr -h -n trusted.name -v 0xbabe /mnt/b $ getfattr -n trusted.name /mnt/b getfattr: Removing leading '/' from absolute path names trusted.name=0sur4= $ getfattr -m trusted /mnt/b $ Fix it by enabling ocfs2_listxattr for ocfs2_special_file_iops. After the commit, fstests/generic/062 will pass. Link: https://lkml.kernel.org/r/20240312042908.8889-1-l@damenly.org Signed-off-by: Su Yue <glass.su@suse.com> Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> * nilfs2: fix failure to detect DAT corruption in btree and direct mappings Patch series "nilfs2: fix kernel bug at submit_bh_wbc()". This resolves a kernel BUG reported by syzbot. Since there are two flaws involved, I've made each one a separate patch. The first patch alone resolves the syzbot-reported bug, but I think both fixes should be sent to stable, so I've tagged them as such. This patch (of 2): Syzbot has reported a kernel bug in submit_bh_wbc() when writing file data to a nilfs2 file system whose metadata is corrupted. There are two flaws involved in this issue. The first flaw is that when nilfs_get_block() locates a data block using btree or direct mapping, if the disk address translation routine nilfs_dat_translate() fails with internal code -ENOENT due to DAT metadata corruption, it can be passed back to nilfs_get_block(). This causes nilfs_get_block() to misidentify an existing block as non-existent, causing both data block lookup and insertion to fail inconsistently. The second flaw is that nilfs_get_block() returns a successful status in this inconsistent state. This causes the caller __block_write_begin_int() or others to request a read even though the buffer is not mapped, resulting in a BUG_ON check for the BH_Mapped flag in submit_bh_wbc() failing. This fixes the first issue by changing the return value to code -EINVAL when a conversion using DAT fails with code -ENOENT, avoiding the conflicting condition that leads to the kernel bug described above. Here, code -EINVAL indicates that metadata corruption was detected during the block lookup, which will be properly handled as a file system error and converted to -EIO when passing through the nilfs2 bmap layer. Link: https://lkml.kernel.org/r/20240313105827.5296-1-konishi.ryusuke@gmail.com Link: https://lkml.kernel.org/r/20240313105827.5296-2-konishi.ryusuke@gmail.com Fixes: c3a7abf06ce7 ("nilfs2: support contiguous lookup of blocks") Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Reported-by: syzbot+cfed5b56649bddf80d6e@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=cfed5b56649bddf80d6e Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> *…

github-actions bot added util-lib please-review PR is ready for (re-)review by a maintainer labels Mar 11, 2024

poettering reviewed Mar 11, 2024

View reviewed changes

src/basic/pidref.c Outdated Show resolved Hide resolved

poettering added reviewed/needs-rework 🔨 PR has been reviewed and needs another round of reworks and removed please-review PR is ready for (re-)review by a maintainer labels Mar 11, 2024

YHNdnzj force-pushed the pidref-equal branch from f83833f to 1e55bf9 Compare March 11, 2024 09:57

YHNdnzj added ci-fails/needs-rework 🔥 Please rework this, the CI noticed an issue with the PR and removed reviewed/needs-rework 🔨 PR has been reviewed and needs another round of reworks labels Mar 11, 2024

github-actions bot added please-review PR is ready for (re-)review by a maintainer and removed ci-fails/needs-rework 🔥 Please rework this, the CI noticed an issue with the PR labels Mar 11, 2024

YHNdnzj force-pushed the pidref-equal branch 2 times, most recently from b457e6d to 3e55cfc Compare March 11, 2024 12:23

poettering reviewed Mar 11, 2024

View reviewed changes

src/basic/stat-util.c Show resolved Hide resolved

poettering reviewed Mar 11, 2024

View reviewed changes

src/basic/stat-util.c Show resolved Hide resolved

poettering added good-to-merge/with-minor-suggestions and removed please-review PR is ready for (re-)review by a maintainer labels Mar 11, 2024

YHNdnzj added 3 commits March 11, 2024 22:53

stat-util: introduce fd_inode_same

0cdb8df

pidref: use fd_inode_same to compare pidfds

2f41f10

YHNdnzj force-pushed the pidref-equal branch from 3e55cfc to 2f41f10 Compare March 11, 2024 14:58

YHNdnzj added good-to-merge/waiting-for-ci 👍 PR is good to merge, but CI hasn't passed at time of review. Please merge if you see CI has passed and removed good-to-merge/with-minor-suggestions labels Mar 11, 2024

YHNdnzj added the ci-failure-appears-unrelated label Mar 12, 2024

YHNdnzj merged commit 18eebde into systemd:main Mar 12, 2024
44 of 48 checks passed

YHNdnzj deleted the pidref-equal branch March 12, 2024 06:15

github-actions bot removed the good-to-merge/waiting-for-ci 👍 PR is good to merge, but CI hasn't passed at time of review. Please merge if you see CI has passed label Mar 12, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

pidref: use fd_inode_same to compare pidfds #31713

pidref: use fd_inode_same to compare pidfds #31713

YHNdnzj commented Mar 11, 2024

YHNdnzj commented Mar 11, 2024

YHNdnzj commented Mar 12, 2024

pidref: use fd_inode_same to compare pidfds #31713

pidref: use fd_inode_same to compare pidfds #31713

Conversation

YHNdnzj commented Mar 11, 2024

YHNdnzj commented Mar 11, 2024

YHNdnzj commented Mar 12, 2024