[PATCH] mm: Protect clean file pages under memory pressure to prevent thrashing, avoid high latency and prevent livelock in near-OOM conditions #218

Closed
hakavlad opened this issue Jul 17, 2021 · 10 comments

@damentz
Member

damentz commented Jul 17, 2021

Thanks, this patchset looks safe and straightforward to add. I ended up using the 5.14 release-candidate version for the 5.13 branch, since 5.13's memory subsystem has changed quite a bit since rc2.

Also, with both 5.12 and 5.13, I set defaults of effectively 256 MB soft and 64 MB hard cache protection. That should make it easier for package maintainers to pick up this change without needing to research it too much.
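For reference, a minimal sketch of setting those defaults at runtime, assuming the vm.clean_low_kbytes (soft) and vm.clean_min_kbytes (hard) sysctl names used by hakavlad's le9-patch; check your patched tree's documentation if your revision names them differently:

# Hedged sketch: 256 MB soft / 64 MB hard clean-cache protection, expressed in kbytes.
sudo sysctl -w vm.clean_low_kbytes=262144   # 256 * 1024 KB, soft protection
sudo sysctl -w vm.clean_min_kbytes=65536    # 64 * 1024 KB, hard protection (later dropped to 0, see below)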

@damentz damentz closed this as completed Jul 17, 2021
@hakavlad
Author

64 MB hard cache protection

Note that hard protection may cause these issues with DRM/i915:

https://github.com/hakavlad/le9-patch#warnings

Soft protection should be safe, and 256 MB is OK.

damentz added a commit that referenced this issue Jul 17, 2021
Per @hakavlad [1], enabling hard cache protection breaks i915.  Keep it
at zero so we don't run into the same issues previously reported.

[1] #218 (comment)
@damentz
Member

damentz commented Jul 17, 2021

@hakavlad thanks for the warning; I dropped CLEAN_MIN_KBYTES to zero.

@hakavlad
Author

Also note that le9 doesn't work with mg-LRU [1]. If you are using mg-LRU, le9 will have no effect, because mg-LRU doesn't use get_scan_count().

[1] https://forum.xanmod.org/thread-4102-post-7599.html#pid7599
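A quick, hedged check (not from the thread) for whether mg-LRU is active on a given kernel; the path assumes the lru_gen sysfs interface that ships with the multigenerational LRU patches:

# If this file is absent, mg-LRU is not built in and the le9 knobs apply.
# Accepted values differ between patch versions (0/1, y/n, or a bitmask);
# a non-"off" value means mg-LRU is active and le9's get_scan_count() path is bypassed.
cat /sys/kernel/mm/lru_gen/enabled 2>/dev/null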

@travankor

@yuzhaogoogle "Protecting file pages with mgLRU will require a completely different approach than in le9. I do not yet know how to implement protection of file pages when using mgLRU."

Do you plan to implement this for mgLRU?

@yuzhaogoogle

Yes, sometime next week.

@yuzhaogoogle

Circling back on this: echo 1000 > /sys/kernel/mm/lru_gen/min_ttl_ms now protects both anon and file memory used within the last 1000 ms. Feel free to adjust this value to fit your use case. @travankor
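A minimal sketch of applying and persisting that value on a systemd-based system; the tmpfiles.d file name is hypothetical and 1000 ms is just the starting point suggested above:

# Apply the suggested starting value at runtime.
echo 1000 | sudo tee /sys/kernel/mm/lru_gen/min_ttl_ms

# Persist it across reboots with systemd-tmpfiles, e.g. in /etc/tmpfiles.d/mglru.conf:
#   w /sys/kernel/mm/lru_gen/min_ttl_ms - - - - 1000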

damentz added a commit that referenced this issue Aug 11, 2021
Although not identical to the le9 patches that protect a byte-amount of
cache through tunables, multigenerational LRU now supports protecting
cache accessed in the last X milliseconds.

In #218, Yu recommends starting with 1000 ms and tuning as needed.  This
looks like a safe default, and turning this feature on should help users
who don't know they need it.
@travankor

I will try this out soon, thanks. What are the trade-offs of tuning this value larger or smaller via the sysfs knob?

@yuzhaogoogle

A larger value OOM-kills more aggressively (more processes get killed) but gives a more responsive user experience; a smaller value OOM-kills less aggressively (fewer processes get killed) but gives a less responsive user experience.

Does that make sense?
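To make the trade-off concrete, two illustrative settings (examples only, not recommendations):

# Illustrative only: the two ends of the trade-off described above.
echo 0    | sudo tee /sys/kernel/mm/lru_gen/min_ttl_ms   # no working-set guarantee: fewest OOM kills, may thrash under pressure
echo 5000 | sudo tee /sys/kernel/mm/lru_gen/min_ttl_ms   # protect the last 5 s of working set: earlier OOM kills, stays responsive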

@travankor

Yes, I understand. So userspace OOM daemons, like earlyoom, are no longer needed with mgLRU?
