
Possible VC4 memory leak introduced in 5.10.47/5.10.48 #4474

Closed
kwirk opened this issue Jul 25, 2021 · 14 comments
@kwirk

kwirk commented Jul 25, 2021

Describe the bug
After upgrading from 5.10.46 to 5.10.48, I had issues with the oom-killer killing Kodi within about 20 minutes of starting to watch a video. The system then became unstable, with other processes being randomly killed, and a system reboot was required.

To reproduce
Play a video in Kodi and monitor the kmalloc-128 slab increasing via slabtop -s c.
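
A quick way to watch the growth between full slabtop runs (assuming root, since /proc/slabinfo is root-only on recent kernels):

# watch -n 10 'grep kmalloc-128 /proc/slabinfo'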

Expected behaviour
Kodi is not oom-killed.

Actual behaviour
Kodi is oom-killed, along with other processes. The system becomes unstable, requiring a reboot.

System

Model: Raspberry Pi 4
OS: Arch Linux ARM
Firmware: Jul 21 2021 16:21:46 version 6a796bb0062a6c75191c57cba1c13f9300076d02
Kernel: 5.10.52 (issue seen since updating to 5.10.48, but it may also have been present in a slightly earlier version)

Logs

$ sudo slabtop -s c
Active / Total Objects (% used)    : 266001 / 289256 (92.0%)
 Active / Total Slabs (% used)      : 13992 / 13992 (100.0%)
 Active / Total Caches (% used)     : 137 / 265 (51.7%)
 Active / Total Size (% used)       : 93490.38K / 103413.99K (90.4%)
 Minimum / Average / Maximum Object : 0.17K / 0.36K / 16.00K

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME                   
 83628  83586  99%    0.38K   3987       21     31896K kmalloc-128
 15316  14367  93%    0.55K    547       28      8752K inode_cache
 30720  30691  99%    0.27K   1024       30      8192K ext4_groupinfo_4k
 31424  28348  90%    0.25K   1964       16      7856K kmalloc-64
 24975  17937  71%    0.29K    925       27      7400K dentry
 26192  26153  99%    0.24K   1637       16      6548K kernfs_node_cache
 11169  11152  99%    0.45K    657       17      5256K radix_tree_node
  3728   1177  31%    0.95K    233       16      3728K ext4_inode_cache
 11222  10974  97%    0.26K    362       31      2896K vm_area_struct
  1168   1127  96%    2.00K     73       16      2336K kmalloc-1k
  1872   1779  95%    1.00K    117       16      1872K kmalloc-512
   392    378  96%    3.94K     49        8      1568K task_struct

I noted a high number of allocations from drm_atomic_helper_setup_commit:

$ sudo sort -rn /sys/kernel/slab/kmalloc-128/alloc_calls | head
  80462 drm_atomic_helper_setup_commit+0x21c/0x7bc [drm_kms_helper] age=3/84268/167031 pid=276-602 cpus=0-3
    153 drm_property_create+0xa0/0x190 [drm] age=167039/167040/167044 pid=276 cpus=1
     93 snd_info_create_entry+0x28/0xe0 [snd] age=167039/167148/167184 pid=244-276 cpus=1
     47 drm_mode_duplicate+0x20/0x4c [drm] age=167032/167032/167032 pid=276 cpus=1
     28 drm_universal_plane_init+0x118/0x604 [drm] age=167039/167039/167039 pid=276 cpus=1
     17 __ipv6_dev_mc_inc+0x154/0x3c4 [ipv6] age=165848/166512/167270 pid=1-594 cpus=0-3
     14 v3d_fence_create+0x28/0x9c [v3d] age=3768/64040/165436 pid=322-323 cpus=0-2
     11 devm_snd_soc_register_component+0x2c/0x80 [snd_soc_core] age=167040/167054/167107 pid=261-276 cpus=1-2
     10 snd_ctl_new+0x34/0x8c [snd] age=167039/167068/167182 pid=244-276 cpus=1
      9 __drmm_add_action_or_reset+0x14/0x30 [drm] age=167038/167047/167078 pid=271-276 cpus=1

I downgraded to 5.10.46, which resolved the issue:

$ sudo grep drm_atomic_helper_setup_commit /sys/kernel/slab/kmalloc-128/alloc_calls
      9 drm_atomic_helper_setup_commit+0x214/0x790 [drm_kms_helper] age=1/236240/305496 pid=113-590 cpus=0-3
      1 drm_atomic_helper_setup_commit+0x538/0x790 [drm_kms_helper] age=1 pid=590 cpus=1
      1 drm_atomic_helper_setup_commit+0x6f4/0x790 [drm_kms_helper] age=304206 pid=590 cpus=0

Additional context
Also reported on anholt#156
Forum post: https://www.raspberrypi.org/forums/viewtopic.php?f=53&t=316394

@kwirk
Author

kwirk commented Jul 25, 2021

I've also built the latest rpi-5.10.y (7aa66c0) with #4470 applied as a patch, which results in the same behaviour: an increasing number of allocation calls from drm_atomic_helper_setup_commit in kmalloc-128.

@popcornmix
Collaborator

I would suspect one of these: #4441
Can you try building at 3282562~ (i.e. one commit before that PR) and at 3d7507e (i.e. the end of the PR) and check whether each is okay?

@kwirk
Author

kwirk commented Jul 25, 2021

So I'm currently running 3282562~ (64f5ffe) without issue (it was packaged as 5.10.42-2 on Arch Linux ARM).

Currently building 3d7507e

@kwirk
Author

kwirk commented Jul 25, 2021

Looks like 3d7507e has the issue.

I saw on #4441 that f14d057 introduced an atomic_commit_setup function and 2f3e2d8 switched to using drm_atomic_helper_commit; drm_atomic_helper_setup_commit is where I'm seeing the constant allocations coming from…
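
For context, a minimal sketch of where those allocations come from, paraphrasing drm_atomic_helper_setup_commit() from drm_kms_helper (abbreviated; names as in mainline):

/* drm_atomic_helper_setup_commit() allocates one struct drm_crtc_commit
 * per CRTC per atomic commit; the struct is small enough to land in the
 * kmalloc-128 slab and is refcounted rather than freed directly. */
for_each_oldnew_crtc_in_state(state, crtc, old_crtc_state, new_crtc_state, i) {
        commit = kzalloc(sizeof(*commit), GFP_KERNEL);
        if (!commit)
                return -ENOMEM;

        init_commit(commit, crtc); /* kref_init(&commit->ref) + completions */
        new_crtc_state->commit = commit;
        /* ... */
}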

@popcornmix
Collaborator

Sounds likely. Are you able to confirm the memory leak starts with that commit?

@kwirk
Author

kwirk commented Jul 25, 2021

So the issue was introduced with 135abe8. I also applied e621e4e as a patch on top of this since, as noted, that commit relates to the same section of code (I had previously tested e621e4e on the latest rpi-5.10.y (7aa66c0), but wanted to double-check); the issue persists.

@popcornmix
Collaborator

@mripard any thoughts on a memory leak introduced with "drm/vc4: kms: Wait on previous FIFO users before a commit"?

@mripard
Contributor

mripard commented Jul 26, 2021

I think we have the drm_crtc_commit refcounting wrong somehow and thus never free it for each commit.

However, I started running an application that does a commit each frame, and while I can definitely see the kmalloc-128 slab increase over time, I'm not seeing the drm_atomic_helper_setup_commit call at all, which is fairly weird.
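
A minimal sketch of the refcount contract at issue, assuming the standard drm_crtc_commit kref API (the wait shown is illustrative, not the exact vc4 code):

/* The commit kzalloc'd by drm_atomic_helper_setup_commit() is only freed
 * when its kref drops to zero (drm_crtc_commit_put() ->
 * __drm_crtc_commit_free()). Any code that takes an extra reference must
 * balance it: */
struct drm_crtc_commit *commit = drm_crtc_commit_get(old_state->commit);

/* e.g. wait for the previous FIFO user's hardware commit to complete */
wait_for_completion_timeout(&commit->hw_done, msecs_to_jiffies(1000));

/* Without this put, every display commit pins one ~128-byte object,
 * matching the drm_atomic_helper_setup_commit growth in alloc_calls above. */
drm_crtc_commit_put(commit);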

@kwirk
Author

kwirk commented Jul 26, 2021

Built and tested the latest rpi-5.10.y (9a22b00) now that #4478 is merged. Working fine so far, with no sign of kmalloc-128 increasing 👍

@Marcool04

I have been experiencing out-of-memory issues on a Raspberry Pi 4 (8 GB RAM, armv7h) running kernel 5.10.52. In my case, the reverts in pull request #4478 have not solved the issue. The Arch Linux ARM linux-raspberrypi4 package at version 5.10.52-5 is built from commit 6495fa3, and therefore includes the reverts from #4478 that are supposed to have solved this issue according to the comment above, but it still shows the oom behaviour on my system.

The change occurred somewhere between 2697f74 (which I'm running now, with no oom issues) and 86729e7 (when they started, and were not fixed by #4478). At least one other user has reported a similar situation on the Arch Linux ARM forum: https://archlinuxarm.org/forum/viewtopic.php?f=60&t=15460&p=67115#p67115

It would seem that on aarch64 systems the reverts did indeed fix the problem, though.

@kwirk
Author

kwirk commented Aug 3, 2021

@Marcool04: It would be interesting to see whether you have the memory leak in the same place, as I don't believe the code which caused this memory leak should be reachable on your setup (not 100% confident, however…).

Could you add slub_debug=U to your boot parameters (in /boot/cmdline.txt) and look at slabtop -s c and alloc_calls, similar to my original comment?
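
For reference, a hypothetical /boot/cmdline.txt with the debug flag appended (everything before slub_debug=U here is a placeholder; keep whatever your install already has):

console=serial0,115200 root=/dev/mmcblk0p2 rw rootwait slub_debug=U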

@kwirk
Author

kwirk commented Aug 3, 2021

I've been using 5.10.52-4 (86729e7) without issue, and just upgraded to 5.10.52-6 (4e5d621); again, no issues.

$ uname -mr
5.10.52-6-ARCH armv7l

@Marcool04

Indeed @kwirk, there would appear to be some error on my part. Either this was resolved between 5.10.52-5 and 5.10.52-6 (though no commit in between seems to indicate it should have been), or I have an issue somewhere else, as I am now running the latest Arch Linux ARM kernel just fine…

# uname -mr
5.10.52-6-ARCH armv7l
# slabtop -s c
Active / Total Objects (% used)    : 304838 / 309615 (98.5%)
 Active / Total Slabs (% used)      : 15203 / 15203 (100.0%)
 Active / Total Caches (% used)     : 141 / 270 (52.2%)
 Active / Total Size (% used)       : 103991.14K / 106731.19K (97.4%)
 Minimum / Average / Maximum Object : 0.17K / 0.34K / 16.00K

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
 13776  13766  99%    0.95K    861       16     13776K ext4_inode_cache
 37557  37535  99%    0.29K   1391       27     11128K dentry
 21828  21813  99%    0.45K   1284       17     10272K radix_tree_node
 16016  15974  99%    0.55K    572       28      9152K inode_cache
 31651  31627  99%    0.26K   1021       31      8168K vm_area_struct
  1576   1571  99%    3.94K    197        8      6304K task_struct
 23856  23802  99%    0.24K   1491       16      5964K kernfs_node_cache
 31086  31073  99%    0.18K   1413       22      5652K dm_exception
 27048  27047  99%    0.19K   1288       21      5152K anon_vma_chain
 20512  20199  98%    0.25K   1282       16      5128K kmalloc-64
 18090  18073  99%    0.22K   1005       18      4020K buffer_head
 14652  14646  99%    0.21K    814       18      3256K anon_vma
 14070  14055  99%    0.19K    670       21      2680K ext4_extent_status
  1808   1661  91%    1.00K    113       16      1808K kmalloc-512
   816    772  94%    2.00K     51       16      1632K kmalloc-1k
  2919   2886  98%    0.38K    139       21      1112K kmalloc-128
# sort -rn /sys/kernel/slab/kmalloc-128/alloc_calls | head
    897 nf_ct_ext_add+0x88/0x104 [nf_conntrack] age=5/3598/9854 pid=0-2533 cpus=0-3
    153 drm_property_create+0xa0/0x190 [drm] age=9740/9744/9757 pid=81 cpus=3
     49 init_rescuer+0x3c/0xe0 age=9766/12459/13848 pid=1-479 cpus=0-3
     42 create_worker+0x58/0x1e0 age=7772/12630/13848 pid=1-560 cpus=0-3
     39 add_sysfs_param+0x3c/0x1d0 age=9778/12740/13827 pid=1-480 cpus=0-3
     33 snd_info_create_entry+0x28/0xe0 [snd] age=9749/9772/9815 pid=81-471 cpus=1,3
     30 __devm_request_region+0x34/0xac age=9741/12089/13831 pid=1-480 cpus=0-3
     28 drm_universal_plane_init+0x118/0x604 [drm] age=9740/9740/9747 pid=81 cpus=3
     14 __ipv6_dev_mc_inc+0x154/0x3c4 [ipv6] age=9540/10017/10340 pid=1-574 cpus=1,3

This time, the difference is that I did not upgrade my AUR packages (which I usually do at the same time as I upgrade with pacman)…

Maybe one of electron12-bin-12.0.15-1 highway-0.14.0-1 moarvm-2021.07-1 nqp-2021.07-1 pbget-2021.7-1 pm2ml-2021.8-1 python-mautrix-0.9.10-1 python-tifffile-2021.7.30-1 python3-xcpf-2021-1 rakudo-2021.07-1 rompr-1.59.1-1 mx-puppet-discord-git-r248.c17384a-3 is causing a memory leak… All of those packages are used by systemd services that start on boot, so they are all candidates. I'll have to upgrade them one by one and see, I guess!
Thanks for taking the time to help.
Regards,
Mark.

@telsch

telsch commented Aug 4, 2021

It looks like 5.10.52-6 doesn't invoke the oom-killer for Docker anymore.

# uname -mr
5.10.52-6-ARCH armv7l
# slabtop -s c
 Active / Total Objects (% used)    : 601513 / 662340 (90.8%)
 Active / Total Slabs (% used)      : 32688 / 32688 (100.0%)
 Active / Total Caches (% used)     : 144 / 267 (53.9%)
 Active / Total Size (% used)       : 199446.96K / 220504.55K (90.5%)
 Minimum / Average / Maximum Object : 0.16K / 0.33K / 16.00K

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
 34512  29219  84%    0.95K   2157       16     34512K ext4_inode_cache
126450  94694  74%    0.22K   7025       18     28100K buffer_head
101703 100428  98%    0.19K   4843       21     19372K anon_vma_chain
 71703  70724  98%    0.26K   2313       31     18504K vm_area_struct
 58725  56818  96%    0.29K   2175       27     17400K dentry
 21710  21426  98%    0.59K    835       26     13360K proc_inode_cache
 21420  19534  91%    0.45K   1260       17     10080K radix_tree_node
 40064  33325  83%    0.25K   2504       16     10016K kmalloc-64
 17332  17181  99%    0.55K    619       28      9904K inode_cache
 35696  35694  99%    0.24K   2231       16      8924K kernfs_node_cache
 33642  33136  98%    0.21K   1869       18      7476K anon_vma
 15150  15140  99%    0.27K    505       30      4040K ext4_groupinfo_4k
 20000  19331  96%    0.20K   1000       20      4000K jbd2_inode
  6422   3127  48%    0.61K    247       26      3952K ovl_inode
  1840   1821  98%    2.00K    115       16      3680K kmalloc-1k
   872    830  95%    3.94K    109        8      3488K task_struct
   416    405  97%    8.00K    104        4      3328K kmalloc-4k
  2832   2782  98%    1.00K    177       16      2832K kmalloc-512
   704    689  97%    4.00K     88        8      2816K kmalloc-2k
  5544   5029  90%    0.38K    264       21      2112K filp
  5040   4598  91%    0.38K    240       21      1920K kmalloc-128
  2652   2632  99%    0.61K    102       26      1632K shmem_inode_cache
  1302   1221  93%    0.75K     62       21       992K sock_inode_cache
  5019   5019 100%    0.19K    239       21       956K ftrace_event_field
  2142   2123  99%    0.38K    102       21       816K kmalloc-192
    48     47  97%   16.00K     24        2       768K kmalloc-8k
  1848   1781  96%    0.38K     88       21       704K mnt_cache
  2436   2436 100%    0.28K     87       28       696K proc_dir_entry
  1312   1303  99%    0.50K     82       16       656K kmalloc-256
   374    349  93%    1.44K     17       22       544K sighand_cache
   442    381  86%    0.94K     26       17       416K UNIX
  1900   1900 100%    0.20K    100       19       400K trace_event_file
  1869   1849  98%    0.19K     89       21       356K ext4_extent_status
   396    349  88%    0.88K     22       18       352K signal_cache
   110     94  85%    3.19K     11       10       352K biovec-max
   950    585  61%    0.31K     38       25       304K cred_jar
   475    206  43%    0.62K     19       25       304K mm_struct
   777    637  81%    0.38K     37       21       296K kmalloc-rcl-128
# sort -rn /sys/kernel/slab/kmalloc-128/alloc_calls | head
    261 nf_ct_ext_add+0x88/0x104 [nf_conntrack] age=98/41581/231480 pid=0-31581 cpus=0-3
    237 __ipv6_dev_mc_inc+0x154/0x3c4 [ipv6] age=775/211370/234159 pid=1-31481 cpus=0-3
    153 drm_property_create+0xa0/0x190 [drm] age=233509/233511/233517 pid=34 cpus=0
     33 snd_info_create_entry+0x28/0xe0 [snd] age=233515/233687/234032 pid=34-241 cpus=0,3
     31 __devm_request_region+0x34/0xac age=233510/234181/234883 pid=1-287 cpus=0-1
     28 drm_universal_plane_init+0x118/0x604 [drm] age=233509/233509/233514 pid=34 cpus=0
     23 seg6_net_init+0x20/0x84 [ipv6] age=851/206764/234159 pid=1-31481 cpus=0-3
     23 fib6_net_init+0xf8/0x1c8 [ipv6] age=851/206764/234159 pid=1-31481 cpus=0-3
     23 fib6_net_init+0xac/0x1c8 [ipv6] age=851/206764/234159 pid=1-31481 cpus=0-3
     22 xfrm6_net_init+0x7c/0x100 [ipv6] age=851/205519/233972 pid=255-31481 cpus=0-3

@kwirk kwirk closed this as completed Aug 4, 2021