Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Workqueue: events_freezable mmc_rescan crash with Raspbian kernel 4.14.79-v7+ #2810

Open
op-ecoprintq opened this issue Jan 14, 2019 · 64 comments

Comments

@op-ecoprintq
Copy link

I have a few Raspberry Pi 3 B+ exhibiting the same problem. They crash after 2-3 days of uptime with the following error:

[169451.220021] INFO: task kworker/0:3:10949 blocked for more than 120 seconds.
[169451.220036] Tainted: G C 4.14.79-v7+ #1159
[169451.220041] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[169451.220048] kworker/0:3 D 0 10949 2 0x00000000
[169451.220077] Workqueue: events_freezable mmc_rescan
[169451.220110] [<8079ef70>] (__schedule) from [<8079f5d8>] (schedule+0x50/0xa8)
[169451.220130] [<8079f5d8>] (schedule) from [<8061a2d0>] (__mmc_claim_host+0xb8/0x1cc)
[169451.220147] [<8061a2d0>] (__mmc_claim_host) from [<8061a414>] (mmc_get_card+0x30/0x34)
[169451.220163] [<8061a414>] (mmc_get_card) from [<80623010>] (mmc_sd_detect+0x20/0x74)
[169451.220179] [<80623010>] (mmc_sd_detect) from [<8061ccdc>] (mmc_rescan+0x1c8/0x394)
[169451.220197] [<8061ccdc>] (mmc_rescan) from [<801379b4>] (process_one_work+0x158/0x454)
[169451.220212] [<801379b4>] (process_one_work) from [<80137d14>] (worker_thread+0x64/0x5b8)
[169451.220227] [<80137d14>] (worker_thread) from [<8013dd98>] (kthread+0x13c/0x16c)
[169451.220246] [<8013dd98>] (kthread) from [<801080ac>] (ret_from_fork+0x14/0x28)

The machines are running Rasbian Stretch

@ldrolez
Copy link

ldrolez commented Jan 18, 2019

Same problem on a 3B after upgrading to 4.14.79-v7+
Crash every 1-3 days. Now downgrading to 4.14.52-v7+

@JamesH65
Copy link
Contributor

Hmm, what are the Pi's doing during that time? We haven't seen anything like this in the office, so would be interested to know what they are doing that might cause this.

@JamesH65
Copy link
Contributor

JamesH65 commented Jan 18, 2019

Actually, this is a duplicate of an issue on our Linux tracker which is the correct place for it to be. Please continue conversation there...#2810

@JamesH65 JamesH65 reopened this Jan 18, 2019
@JamesH65
Copy link
Contributor

Bum, closed wrong one. Reopened. Will now attempt to close the one in firmware...

@op-ecoprintq
Copy link
Author

So the machines in question are running LXDE, a Java application, and the onboard keyboard.

@pelwell
Copy link
Contributor

pelwell commented Jan 18, 2019

That doesn't tell me anything about their level of activity - CPU, file system, network etc.

@Alexurb
Copy link

Alexurb commented Jan 19, 2019

Same issue here. Attached is my kernel log.
kern.log
The issue occurs around Jan 18 19:57:02
This is on a headless install. The green light is constant, the PI responds to pings but is unable to open up an ssh session.
This actually occurs for me during a movie being streamed by plex , around an hour and a half in.
I left the pi to see if it would recover overnight, sadly it didn't.

@Alexurb
Copy link

Alexurb commented Jan 21, 2019

Happened again mid movie. Attached is another kernel log
kern.log

@Alexurb
Copy link

Alexurb commented Jan 23, 2019

I've so far found some advice from https://www.blackmoreops.com/2014/09/22/linux-kernel-panic-issue-fix-hung_task_timeout_secs-blocked-120-seconds-problem/
summary:
sudo sysctl -w vm.dirty_ratio=10
sudo sysctl -w vm.dirty_background_ratio=5
sudo sysctl -p
seems to do the trick so far, still testing (3 movies down)
I'll comment if it occurs again.
To make it permanent(after reboot) read the link

@Alexurb
Copy link

Alexurb commented Jan 23, 2019

I've so far found some advice from https://www.blackmoreops.com/2014/09/22/linux-kernel-panic-issue-fix-hung_task_timeout_secs-blocked-120-seconds-problem/
summary:
sudo sysctl -w vm.dirty_ratio=10
sudo sysctl -w vm.dirty_background_ratio=5
sudo sysctl -p
seems to do the trick so far, still testing (3 movies down)
I'll comment if it occurs again.
To make it permanent(after reboot) read the link

So far so good.

@peabody
Copy link

peabody commented Aug 29, 2019

Just going say this thread saved my pi 3b+. Was hanging terrible on any intense IO. The solution by Alexurb fixed the problem for me.

@jdrch
Copy link

jdrch commented Apr 4, 2020

Same problem here on my Pi 3B+ running Raspbian Buster. Gonna try that edit and then reboot. Fingers crossed!

@paralin
Copy link

paralin commented Sep 22, 2020

Just ran into this with kernel 5.4.51-v8 aarch64 on Pi 4.

Ill try the fix recommended above in this thread:

vm.dirty_background_ratio = 5
vm.dirty_ratio = 10

Here's the dmesg errors:

[11722.001585] INFO: task kworker/1:0:1663 blocked for more than 120 seconds.
[11722.005139]       Tainted: G         C        5.4.51-v8 #1
[11722.008672] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11722.012222] kworker/1:0     D    0  1663      2 0x00000028
[11722.015773] Workqueue: events_freezable mmc_rescan
[11722.019269] Call trace:
[11722.022724]  __switch_to+0x110/0x180
[11722.026149]  __schedule+0x2f4/0x750
[11722.029523]  schedule+0x44/0xe0
[11722.032807]  __mmc_claim_host+0xb8/0x210
[11722.036111]  mmc_get_card+0x38/0x50
[11722.039407]  mmc_sd_detect+0x24/0x90
[11722.042671]  mmc_rescan+0xc8/0x390
[11722.045892]  process_one_work+0x1c0/0x470
[11722.049078]  worker_thread+0x50/0x430
[11722.052405]  kthread+0x100/0x130
[11722.055223]  ret_from_fork+0x10/0x18

paralin added a commit to skiffos/SkiffOS that referenced this issue Sep 22, 2020
May fix a lockup observed against Pi 4 64 bit:

INFO: task kworker/1:0:1663 blocked for more than 120 seconds.
Workqueue: events_freezable mmc_rescan
Call trace:
  __switch_to+0x110/0x180
  __schedule+0x2f4/0x750
  schedule+0x44/0xe0
  __mmc_claim_host+0xb8/0x210
  mmc_get_card+0x38/0x50
  mmc_sd_detect+0x24/0x90
  mmc_rescan+0xc8/0x390
  process_one_work+0x1c0/0x470
  worker_thread+0x50/0x430
  kthread+0x100/0x130

sets syctl values for pi boards:

+vm.dirty_background_ratio = 5
+vm.dirty_ratio = 10

Reference: raspberrypi/linux#2810

Signed-off-by: Christian Stewart <christian@paral.in>
paralin added a commit to skiffos/SkiffOS that referenced this issue Sep 22, 2020
May fix a lockup observed against Pi 4 64 bit:

INFO: task kworker/1:0:1663 blocked for more than 120 seconds.
Workqueue: events_freezable mmc_rescan
Call trace:
  __switch_to+0x110/0x180
  __schedule+0x2f4/0x750
  schedule+0x44/0xe0
  __mmc_claim_host+0xb8/0x210
  mmc_get_card+0x38/0x50
  mmc_sd_detect+0x24/0x90
  mmc_rescan+0xc8/0x390
  process_one_work+0x1c0/0x470
  worker_thread+0x50/0x430
  kthread+0x100/0x130

sets syctl values for pi boards:

+vm.dirty_background_ratio = 5
+vm.dirty_ratio = 10

Reference: raspberrypi/linux#2810

Signed-off-by: Christian Stewart <christian@paral.in>
paralin added a commit to skiffos/SkiffOS that referenced this issue Sep 22, 2020
May fix a lockup observed against Pi 4 64 bit:

INFO: task kworker/1:0:1663 blocked for more than 120 seconds.
Workqueue: events_freezable mmc_rescan
Call trace:
  __switch_to+0x110/0x180
  __schedule+0x2f4/0x750
  schedule+0x44/0xe0
  __mmc_claim_host+0xb8/0x210
  mmc_get_card+0x38/0x50
  mmc_sd_detect+0x24/0x90
  mmc_rescan+0xc8/0x390
  process_one_work+0x1c0/0x470
  worker_thread+0x50/0x430
  kthread+0x100/0x130

sets syctl values for pi boards:

+vm.dirty_background_ratio = 5
+vm.dirty_ratio = 10

Reference: raspberrypi/linux#2810

Signed-off-by: Christian Stewart <christian@paral.in>
@jdrch
Copy link

jdrch commented Nov 22, 2020

will Too much of write can reduce life of SD CARDS?
vm.dirty_background_ratio = 5
vm.dirty_ratio = 80
Can we try this setting.

My card hasn't failed since I made the change on April 4 of this year. I've been using the same SanDisk microSD card I bought new since mid-2018. I think you only need to worry about microSD card life if you're using super cheap or no name cards. If you use a high tier SanDisk you should be just fine.

@ranma
Copy link
Contributor

ranma commented Feb 23, 2021

Seeing the same, seems to be a memory leak in lxpanel causing it to run out of memory and start swapping until it dies.
htop output from ssh session I left running:

  Mem[||||||||||||||||||||||||||||||||||||||||||||882M/924M]   Uptime: 07:48:17
[...]
  719 pi         20   0  882M  488M   204 D  0.7 52.9 35:19.23 lxpanel --profile LXDE-pi
[...]

Straight after reboot:

  Mem[||||||||||||||||||||||||||||||||||||||||||||390M/924M]   Uptime: 00:03:50
[...]
 726 pi         20   0  401M 30128 23492 S  5.1  3.2  0:05.68 lxpanel --profile LXDE-pi
[...]

See also https://www.raspberrypi.org/forums/viewtopic.php?t=267015

@jdrch
Copy link

jdrch commented Feb 23, 2021

I haven't noticed this problem on my 3B+ since July 21 of last year, which is just before The Foundation engineer announced a fix had been committed.

Are you fully patched and updated on Buster and still seeing the issue?

One thing I'm thinking based on both my repo issue thread and the Raspberry Pi Forum one you linked to (thanks!) is this could be triggered by having the CPU % lxpanel plugin AND a process that causes CPU % to spike for a significant length of time, e.g. backintime as I did. If you read through my issue thread you'll notice the whole thing went away once I removed backintime.

@ranma
Copy link
Contributor

ranma commented Mar 6, 2021

This is on a 1GB 3B+ with a freshly installed and updated raspbian image, so should it is buster:
Raspberry Pi 3 Model B Plus Rev 1.3
Raspberry Pi reference 2021-01-11
Generated using pi-gen, https://github.com/RPi-Distro/pi-gen, 21090519d85bdaa1615d5d5057d37b09368ea5d2, stage4
Linux raspberrypi 5.10.11-v7+ #1399 SMP Thu Jan 28 12:06:05 GMT 2021 armv7l GNU/Linux
Videocore Jan 27 2021 22:26:53

It was crashing every ~6 hours from high lxpanel memory usage until I set up a script to restart lxpanel once an hour.

I also added:
vm.dirty_ratio = 10
vm.dirty_background_ratio = 5

Besides lxde there's influxdb and a autostart chrome with grafana running, plus I'm using the official 7" DSI touch display.
It's actually still likes to crash but less often, I haven't investigated the more recent ones yet.

@ranma
Copy link
Contributor

ranma commented Mar 6, 2021

Actually it's still crashing from lxpanel it seems, so the hourly restart isn't good enough to prevent whatever lxpanel is doing.
I also would have expected the OOM-killer to kick in, but somehow the system seems to lock up instead?

htop excerpt:

  1  [||||||||                                        13.0%]   Tasks: 78, 157 thr; 3 running
  2  [|||||||||                                       15.3%]   Load average: 39.39 40.12 37.97 
  3  [|||||||||                                       14.3%]   Uptime: 07:53:55
  4  [||||||||||                                      16.8%]
  Mem[||||||||||||||||||||||||||||||||||||||||||||881M/924M]
  Swp[||||||||||||||||||||||||||||||||||||||||100.0M/100.0M]

  PID USER      PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command                                                         
  457 grafana    20   0  934M 10728     0 S 32.2  1.1 29:40.85 /usr/sbin/grafana-server --config=/etc/grafana/grafana.ini --pid
  451 influxdb   20   0  978M  9476     0 D 28.4  1.0 35:29.89 /usr/bin/influxd -config /etc/influxdb/influxdb.conf
  494 grafana    20   0  934M 10728     0 R 20.4  1.1 15:36.50 /usr/sbin/grafana-server --config=/etc/grafana/grafana.ini --pid
  731 grafana    20   0  934M 10728     0 S 11.6  1.1  9:18.06 /usr/sbin/grafana-server --config=/etc/grafana/grafana.ini --pid
 1347 influxdb   20   0  978M  9476     0 D  6.1  1.0  5:24.33 /usr/bin/influxd -config /etc/influxdb/influxdb.conf
[...]
  795 pi         20   0  636M 39568     0 D  1.0  4.2  3:43.82 /usr/lib/chromium-browser/chromium-browser-v7 --force-renderer-a
 1282 pi         20   0  429M 72396   372 D  0.8  7.6  4:00.12 /usr/lib/chromium-browser/chromium-browser-v7 --type=renderer --
[...]
 760 pi         20   0 1012M  591M     0 D  0.4 64.0 28:24.12 lxpanel --profile LXDE-pi

Kernel messages excerpt:

[  119.755918] logitech-hidpp-device 0003:046D:404D.0004: HID++ 4.1 device connected.
[14620.750775] INFO: task kworker/2:0:7862 blocked for more than 122 seconds.
[14620.750798]       Tainted: G         C        5.10.11-v7+ #1399
[14620.750807] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[14620.750818] task:kworker/2:0     state:D stack:    0 pid: 7862 ppid:     2 flags:0x00000000
[14620.750859] Workqueue: events_freezable mmc_rescan
[14620.750876] Backtrace:
[14620.750906] [<809e5190>] (__schedule) from [<809e5b60>] (schedule+0x68/0xe4)
[14620.750924]  r10:822b1800 r9:ffffe000 r8:00000000 r7:00000000 r6:40000113 r5:9d8b2e80
[14620.750934]  r4:ffffe000
[14620.750951] [<809e5af8>] (schedule) from [<807f72f8>] (__mmc_claim_host+0xe0/0x238)
[14620.750962]  r5:822b1a18 r4:00000002
[14620.750978] [<807f7218>] (__mmc_claim_host) from [<807f7488>] (mmc_get_card+0x38/0x3c)
[14620.750994]  r10:b6b53640 r9:00000000 r8:b6b56c00 r7:00000080 r6:822b1a18 r5:00000000
[14620.751005]  r4:822b2800
[14620.751041] [<807f7450>] (mmc_get_card) from [<808010a0>] (mmc_sd_detect+0x24/0x7c)
[14620.751053]  r5:822b1800 r4:822b1800
[14620.751071] [<8080107c>] (mmc_sd_detect) from [<807f9cc0>] (mmc_rescan+0xdc/0x3b0)
[14620.751083]  r5:822b1800 r4:822b1a7c
[14620.751103] [<807f9be4>] (mmc_rescan) from [<8013b78c>] (process_one_work+0x254/0x584)
[14620.751119]  r9:00000000 r8:b6b56c00 r7:00000080 r6:b6b53640 r5:89128200 r4:822b1a7c
[14620.751137] [<8013b538>] (process_one_work) from [<8013bb1c>] (worker_thread+0x60/0x5c4)
[14620.751153]  r10:b6b53640 r9:80e03d00 r8:b6b53658 r7:00000008 r6:b6b53640 r5:89128214
[14620.751164]  r4:89128200
[14620.751182] [<8013babc>] (worker_thread) from [<801435a4>] (kthread+0x170/0x174)
[14620.751198]  r10:adf9be74 r9:89128200 r8:8013babc r7:9b4ba000 r6:00000000 r5:ade02c80
[14620.751209]  r4:9170bf80
[14620.751225] [<80143434>] (kthread) from [<801000ec>] (ret_from_fork+0x14/0x28)
[14620.751236] Exception stack(0x9b4bbfb0 to 0x9b4bbff8)
[14620.751249] bfa0:                                     00000000 00000000 00000000 00000000
[14620.751264] bfc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[14620.751277] bfe0: 00000000 00000000 00000000 00000000 00000013 00000000  
[14620.751293]  r10:00000000 r9:00000000 r8:00000000 r7:00000000 r6:00000000 r5:80143434
[14620.751303]  r4:ade02c80

@jdrch
Copy link

jdrch commented Mar 6, 2021

Hmmm ... I would suggest disabling cron jobs and/or processes and services one by one and seeing how that goes. As I said, once I removed backintime the issue went away (or at least is now sufficiently infrequent for me to not notice it at all). While lxpanel is crashing, I believe in some cases the crash is being prompted by something else.

@ranma
Copy link
Contributor

ranma commented Mar 7, 2021

So I've enabled memory cgroups + swap on zram and put lxpanel into a limited cgroup.
As expected, this morning I found it almost maxing out it's allocation limit and heavily swapping (but only the lxpanel process, thanks to the cgroup).

Time for some investigation:

(gdb) attach 3924
Attaching to program: /usr/bin/lxpanel, process 3924
[New LWP 3928]
[New LWP 3942]
[New LWP 3948]
[New LWP 3988]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/arm-linux-gnueabihf/libthread_db.so.1".
0x73822510 in g_dbus_object_get_object_path@plt () from /usr/lib/arm-linux-gnueabihf/lxpanel/plugins/bluetooth.so
(gdb) bt
#0  0x73822510 in g_dbus_object_get_object_path@plt () from /usr/lib/arm-linux-gnueabihf/lxpanel/plugins/bluetooth.so
#1  0x73824ae0 in ?? () from /usr/lib/arm-linux-gnueabihf/lxpanel/plugins/bluetooth.so
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) gcore lxpanel.core
warning: target file /proc/3924/cmdline contained unexpected null characters
warning: Memory read failed for corefile section, 1048576 bytes at 0x60aff000.
warning: Memory read failed for corefile section, 1048576 bytes at 0x64aff000.
warning: Memory read failed for corefile section, 1048576 bytes at 0x68aff000.
warning: Memory read failed for corefile section, 1048576 bytes at 0x6cb00000.
warning: Memory read failed for corefile section, 135168 bytes at 0x6dd00000.
warning: Memory read failed for corefile section, 1048576 bytes at 0x6df00000.
warning: Memory read failed for corefile section, 135168 bytes at 0x6e700000.
warning: Memory read failed for corefile section, 1048576 bytes at 0x6e8ff000.
warning: Memory read failed for corefile section, 1048576 bytes at 0x6f100000.
warning: Memory read failed for corefile section, 135168 bytes at 0x6f900000.
warning: Memory read failed for corefile section, 393216 bytes at 0x6fa30000.
warning: Memory read failed for corefile section, 237568 bytes at 0x6fac5000.
warning: Memory read failed for corefile section, 135168 bytes at 0x70300000.
warning: Memory read failed for corefile section, 4096 bytes at 0x70465000.
warning: Memory read failed for corefile section, 1048576 bytes at 0x70468000.
warning: Memory read failed for corefile section, 135168 bytes at 0x74600000.
warning: Memory read failed for corefile section, 8192 bytes at 0x75b32000.
warning: Memory read failed for corefile section, 12288 bytes at 0x7639d000.
warning: Memory read failed for corefile section, 131072 bytes at 0x76ed6000.
warning: Memory read failed for corefile section, 135168 bytes at 0x7ea79000.
Saved corefile lxpanel.core

Let's try looking at that core file

$ ls -lh lxpanel.core 
-rw-r--r-- 1 pi pi 432M Mar  7 10:37 lxpanel.core
$ strings lxpanel.core  | sort | uniq -c | sort -n -k1 -r | head -20
   6272 :1.18
   4376 daRvPaRv
   2569 org.bluez.Device1
   2414 org.freedesktop.DBus.Properties
   2080 org.freedesktop.DBus.Introspectable
   1595 RSSI
   1094 UUIDs
   1094 Alias
   1094 AddressType
   1094 Address
   1092 Trusted
   1092 ServicesResolved
   1092 Paired
   1092 LegacyPairing
   1092 Connected
   1092 Blocked
   1092 Adapter
    980 MXo=
    972 ManufacturerData
    888 /org/bluez/hci0

So I think it's likely that it's the bluetooth lxpanel plugin that is leaking memory in my case, I'll try disabling the bluetooth plugin and see how it behaves after that.

@ranma
Copy link
Contributor

ranma commented Mar 7, 2021

While lxpanal is crashing, I believe in some cases the crash is being prompted by something else.

Technically it looks more like a lockup from out-of-memory, not a crash.
lxpanel is the clear culprit because its resident set size is initially small, but is at 591M out of ~1GB total at the point of the lockup.
Also putting lxpanel into a cgroup prevents the lockup for me.

There's actually two separate issues here:

  1. The system seems to lock up instead of the OOM-killer triggering and killing the offending lxpanel process (at which point the system would recover and continue working since lxpanel is auto-respawned).
    It seems it might be that swapping on SD card is running into some kernel issue with the default kernel settings?

  2. The memory leak in lxpanel (or rather most likely the raspi-specific lxplug-bluetooth)
    I'll open a new bug at https://github.com/raspberrypi-ui/lxplug-bluetooth/ for this one, I think there are missing calls to g_variant_unref in cb_interface_properties. I'll also need to see if heaptrack is usable, I tried valgrinds massif yesterday and it was just too slow.

@samyk
Copy link

samyk commented Dec 4, 2021

@kmark in earlier testing (vmstat / iostat), I never saw memory run out. I can crash it by simply exhausting memory but I don't think that's what's happening, at least in my case.

Another interesting thing to note is that I can sometimes tell if it's going to happen as load starts to increase higher than usual. The interesting part is I can start killing off processes and that will often resolve it, load will go back down. This does make me think it could relate to your thought of hitting it with a fan. I may try getting a thermal camera on it and seeing if there's some external temp at which it seems to crash (though I'll likely need a blackbody / thermal reference). I'm also running /opt/vc/bin/vcgencmd measure_temp in a loop as well.

Regardless I now have a test running spitting out meminfo, load averages, and internal CPU temp in a loop and will report its output upon the last crash.

@ProactiveServices
Copy link

ProactiveServices commented Dec 4, 2021

For others who are experiencing this regularly, does taking the Pi out of its case, pointing a fan at it, etc. do anything to decrease the regularity of this issue? Trying to figure out if the heat is just an artifact of the lockup or if it's actually causing it somehow.

On my affected device it's kept in a cool room, no case, and with very little CPU or RAM used. Also, related to your summary post, I'm running a Pi 4 4GB from a USB SSD; no SD card in use. Apologies, wrong issue. Disregard.

@samyk
Copy link

samyk commented Dec 4, 2021

Here is the last 500 samples. I was sampling at 2 times per second, but the last 10 samples started slowing down and there was a 10 minute gap between the last 2 samples (simply due to load). Interestingly temperature was pretty steady at 80C the entire time but went down during the issue (where load spikes), leading me to believe it's more related to a race condition or SD/IO issue.

image

Here were some initial temp, load, and meminfo values
Sat Dec  4 06:33:07 2021:
load: 9.76 9.82 9.85 7/239 2008
temp=80.6'C

MemTotal:         947804 kB
MemFree:          322828 kB
MemAvailable:     350916 kB
Buffers:            8640 kB
Cached:           115572 kB
SwapCached:            0 kB
Active:           470436 kB
Inactive:          83028 kB
Active(anon):     431144 kB
Inactive(anon):    54424 kB
Active(file):      39292 kB
Inactive(file):    28604 kB
Unevictable:        1736 kB
Mlocked:            1736 kB
SwapTotal:             0 kB
SwapFree:              0 kB
Dirty:               568 kB
Writeback:             0 kB
AnonPages:        431148 kB
Mapped:            36852 kB
Shmem:             54872 kB
Slab:              32924 kB
SReclaimable:      10480 kB
SUnreclaim:        22444 kB
KernelStack:        1952 kB
PageTables:         3872 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:      473900 kB
Committed_AS:    1469692 kB
VmallocTotal:    1114112 kB
VmallocUsed:           0 kB
VmallocChunk:          0 kB
Percpu:              416 kB
CmaTotal:         262144 kB
CmaFree:          242436 kB
And here were the last values
Sat Dec  4 08:43:16 2021:
load: 47.24 26.27 16.08 1/274 4977
temp=60.1'C

MemTotal:         947804 kB
MemFree:          258376 kB
MemAvailable:     246276 kB
Buffers:            1236 kB
Cached:            70400 kB
SwapCached:            0 kB
Active:           532660 kB
Inactive:          66548 kB
Active(anon):     529432 kB
Inactive(anon):    54388 kB
Active(file):       3228 kB
Inactive(file):    12160 kB
Unevictable:        1736 kB
Mlocked:            1736 kB
SwapTotal:             0 kB
SwapFree:              0 kB
Dirty:                 0 kB
Writeback:             0 kB
AnonPages:        529336 kB
Mapped:            16448 kB
Shmem:             54816 kB
Slab:              29964 kB
SReclaimable:       9548 kB
SUnreclaim:        20416 kB
KernelStack:        2184 kB
PageTables:         4968 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:      473900 kB
Committed_AS:    1532524 kB
VmallocTotal:    1114112 kB
VmallocUsed:           0 kB
VmallocChunk:          0 kB
Percpu:              416 kB
CmaTotal:         262144 kB
CmaFree:          242460 kB

@jdrch
Copy link

jdrch commented Dec 4, 2021

SD/IO issue

Hmmmm .... is anyone with this problem not running the OS on an SD card? Maybe that's the common denominator here.

@paralin
Copy link

paralin commented Dec 4, 2021

@jdrch the issue is when the SD card stops responding or otherwise has some kind of error, and mmc_rescan gets stuck.

@samyk
Copy link

samyk commented Dec 5, 2021

@ProactiveServices to confirm, you have NOT experienced this using a USB SSD? You are the only user who has stated using a USB SSD but I see you were discussing a different issue. Curious if you have had this issue using an SSD?

Looks like this user resolved the issue by switching to USB SSD, but what is strange is that the issue seems kernel version dependent. raspberrypi/firmware#1522

@ProactiveServices
Copy link

@ProactiveServices to confirm, you have NOT experienced this using a USB SSD? You are the only user who has stated using a USB SSD but I see you were discussing a different issue. Curious if you have had this issue using an SSD?

I have not seen this issue since moving to SSDs. I mistook this issue for another. I moved away from SD cards because they were too unreliable!

@strasharo
Copy link

@jdrch Yup, running HASS with Unifi addon here on SD card, hitting it on 5.15.5-1-ARCH

@kmark
Copy link

kmark commented Dec 21, 2021

I have switched the Pi I've been hitting this with over to a USB SSD boot drive (no microSD present). I've also reinstalled Ubuntu 20.04.3 since I had low confidence the filesystem was not corrupt after all those deadlocks. I switched the rootfs to btrfs too but I don't expect that to impact anything as the issues we were seeing seemed below the filesystem level. I am planning on running a similar workload on it. If anything, it'll be used more.

Will report back. Hopefully this is a usable workaround.

@nickidw
Copy link

nickidw commented Dec 24, 2021

I'm running homeassistant on Pi4, after 2 SD cards destroyed in relatively short time I changed to USB SSD, and zero issues after months. If the distro used does a lot of writing (log files, db updates, etc) expect the sd cards not to last (that's been my experience).

@joetriskaide
Copy link

joetriskaide commented Apr 10, 2022

A few observations from my side to the mmc_rescan event:
I had this issue yesterday (again). What strikes me is the fact that the PI is running for days, getting into trouble and after having the trouble the first time, even a reboot leads more or less immediately into the same trouble after a few minutes.

Here is my setup:
/proc/cpuinfo:
Hardware	: BCM2835
Revision	: a020d3
Serial		: 0000000040ea362b
Model		: Raspberry Pi 3 Model B Plus Rev 1.3

I started with a traditional sd card, but recognized the speed would be to slow to run iobroker. So I decided to run the entire OS and everything else from an external USB-SSD. I created a 1:1 copy of the SD card on the USB.
I did not use the fuse in the RPi to let the Pi boot for USB, I did change the "root=" command line parameter in the boot sector on the SD card to

/boot/cmdline.txt
console=serial0,115200 console=tty1 root=/dev/sda2 rootfstype=ext4 elevator=deadline fsck.repair=yes rootwait dwc_otg.lpm_enable=0 

consequently, /boot was mounted via fstab to the boot sector on the SD-Card

Filesystem     1K-blocks    Used Available Use% Mounted on
/dev/mmcblk0p1     97070   49522     47548  52% /boot
/dev/root        4159192 3317080    825728  81% /

Given this setup, there should be no or very low file activity on the SD-Card. When I initially created the USB-SDD disk setup of my iobroker Instance, I used the remaining space on the SD-Card as swap space. I assumed swap to become slow, but keeping an eye on it should be sufficient.
When I encountered the first "Workqueue: events_freezable mmc_rescan" event, I changed from SD-Card based swap device to an USB-SDD based swap device. I got rid of the /var/swap swap file at all.
I need to mention that I use LVM for my files except /boot and / file systems and my swap device is a logical volume on the USB-SDD. No PV on the SD Card at all.
What I do not understand is the fact, that with only /boot mounted and nothing else accessing the SD-Card, my PI yesterday ran into the unresponsive mode. Even watchdog seem to have stopped:

2022-04-09 17:58:34 test        addr b8:27:eb:xx:xx:xx  ssid <ssid> 
2022-04-09 17:58:45 test        addr b8:27:eb:xx:xx:xx  ssid <ssid> 
2022-04-09 17:58:52 test        addr b8:27:eb:xx:xx:xx  ssid <ssid> 
2022-04-09 17:59:08 test        addr b8:27:eb:xx:xx:xx  ssid <ssid> 
2022-04-09 20:27:35 test        addr b8:27:eb:xx:xx:xx  ssid <ssid> 
2022-04-09 20:27:45 test        addr b8:27:eb:xx:xx:xx  ssid <ssid> 

I just started to use watchdog in a test mode and as you can see, watchdog did not run the test script after 17:59. The entry at 20:27 was created after the reboot of the Pi. But that might (or might not) be another issue.
Another strange observation:
After running for three weeks without issues, the Pi became unresponsive and when I powered off the Pi (by pulling the plug!), it needs three or four reboot (=pulling the plug) before I could access the Pi long enough to make changes to the configuration.
Another observation, not sure it has to do with this rescan_mmc issue, is that my Pi from time to time is not reachable by Wifi, but usually comes back online after some time (tens of minutes).
Finally I have to mention that I run a 2nd Pi, same model, same setup with USB-SDD with kodi media center and I do not yet see any issues with mmc_rescan as I can see in my Pi running iobroker.
I changed system settings a proposed by samyk, keep fingers crossed and let you know, hat happens - or not.

@jdrch
Copy link

jdrch commented Apr 10, 2022

FWIW I haven't seen this problem on my 3 B+ since I updated to Raspberry Pi OS 11.x. Currently:

$ cat /etc/debian_version
11.3

I would encourage everyone to update and see what happens. And yes, I updated in place with no issues at all using this method.

@All3xJ
Copy link

All3xJ commented Apr 24, 2022

I still have this bug with 5.15.30-v8+ 64 bit Raspberry Pi 3B+.... the system freezed and resumed after 3 hours of freeze.
Here are journalctl logs:
journalctl.txt

popcornmix pushed a commit that referenced this issue Apr 25, 2022
…mpound

Huge vmalloc higher-order backing pages were allocated with __GFP_COMP
in order to allow the sub-pages to be refcounted by callers such as
"remap_vmalloc_page [sic]" (remap_vmalloc_range).

However a similar problem exists for other struct page fields callers
use, for example fb_deferred_io_fault() takes a vmalloc'ed page and
not only refcounts it but uses ->lru, ->mapping, ->index.

This is not compatible with compound sub-pages, and can cause bad page
state issues like

  BUG: Bad page state in process swapper/0  pfn:00743
  page:(____ptrval____) refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x743
  flags: 0x7ffff000000000(node=0|zone=0|lastcpupid=0x7ffff)
  raw: 007ffff000000000 c00c00000001d0c8 c00c00000001d0c8 0000000000000000
  raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
  page dumped because: corrupted mapping in tail page
  Modules linked in:
  CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.18.0-rc3-00082-gfc6fff4a7ce1-dirty #2810
  Call Trace:
    dump_stack_lvl+0x74/0xa8 (unreliable)
    bad_page+0x12c/0x170
    free_tail_pages_check+0xe8/0x190
    free_pcp_prepare+0x31c/0x4e0
    free_unref_page+0x40/0x1b0
    __vunmap+0x1d8/0x420
    ...

The correct approach is to use split high-order pages for the huge
vmalloc backing. These allow callers to treat them in exactly the same
way as individually-allocated order-0 pages.

Link: https://lore.kernel.org/all/14444103-d51b-0fb3-ee63-c3f182f0b546@molgen.mpg.de/
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Cc: Paul Menzel <pmenzel@molgen.mpg.de>
Cc: Song Liu <songliubraving@fb.com>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
@delgurth
Copy link

delgurth commented May 8, 2022

...

hardware issue, removed comment since it's not relevant.

@jdrch
Copy link

jdrch commented May 8, 2022

FWIW I haven't seen this problem on my 3 B+ since I updated to Raspberry Pi OS 11.x. Currently:

$ cat /etc/debian_version
11.3

I would encourage everyone to update and see what happens. And yes, I updated in place with no issues at all using this method.

Welp, within the past 2 days I had my 1st recurrence of this issue.

@paralin
Copy link

paralin commented May 8, 2022

If this happens your SD card is probably bad. Best replace it.

Even relatively new SD cards can go bad, because they are not designed for repeated continuous writes to the same location over & over, nor continuous 24/7 operation.

I have had the best luck with SanDisk Ultra microSDHC UHS-I Memory Card DSQUA4-032G-GN6MT which rarely fail. Regardless I would recommend swapping out SD cards once every year to be safe.

The kernel should be updated nevertheless with better error handling, in my opinion.

@delgurth
Copy link

delgurth commented May 8, 2022

If this happens your SD card is probably bad. Best replace it.

Even relatively new SD cards can go bad, because they are not designed for repeated continuous writes to the same location over & over, nor continuous 24/7 operation.

I have had the best luck with SanDisk Ultra microSDHC UHS-I Memory Card DSQUA4-032G-GN6MT which rarely fail. Regardless I would recommend swapping out SD cards once every year to be safe.

The kernel should be updated nevertheless with better error handling, in my opinion.

Seems you are right. Can't even clear the partition table on the darn thing now. Will order some new cards. Thanks for the advice on what works well for you. Will also update my comment.

@All3xJ
Copy link

All3xJ commented May 24, 2022

I did change the sysctl setting as described here: https://www.blackmoreops.com/2014/09/22/linux-kernel-panic-issue-fix-hung_task_timeout_secs-blocked-120-seconds-problem/
but the freeze still happens. It happens ALWAYS in this situation:

I noticed this when I used VS Code's SSH Remote Extension which install a bunch of stuff in the remote server. This cause a HUGE use of CPU by "node" program, and then everything freezes, I couldn't even connect via SSH because the server is freezed. My remote server is a simple Raspberry Pi 3 B+, so not very powerful.

Something curious: when all this happens, I noticed in both my host machine and remote server a process "kswapd0" that consumes much CPU too, can this be related?

Also, if the scheduler sends interrupts constantly to context switch and to pass to another process, so why a certain process that consumes too much CPU can freeze the pi? Shouldn't scheduler go on with other processes equally? Why can it monopolize the CPU and freeze the pi?

@paralin
Copy link

paralin commented May 24, 2022

@All3xJ I'm regularly running entire Gentoo builds on the pi itself that completely saturate all cpu cores, and yet never run into this error after replacing the faulty microsd cards.

Try a brand new card.

@JamesH65
Copy link
Contributor

I did change the sysctl setting as described here: https://www.blackmoreops.com/2014/09/22/linux-kernel-panic-issue-fix-hung_task_timeout_secs-blocked-120-seconds-problem/ but the freeze still happens. It happens ALWAYS in this situation:

I noticed this when I used VS Code's SSH Remote Extension which install a bunch of stuff in the remote server. This cause a HUGE use of CPU by "node" program, and then everything freezes, I couldn't even connect via SSH because the server is freezed. My remote server is a simple Raspberry Pi 3 B+, so not very powerful.

Something curious: when all this happens, I noticed in both my host machine and remote server a process "kswapd0" that consumes much CPU too, can this be related?

Also, if the scheduler sends interrupts constantly to context switch and to pass to another process, so why a certain process that consumes too much CPU can freeze the pi? Shouldn't scheduler go on with other processes equally? Why can it monopolize the CPU and freeze the pi?

I believe you are running out of memory, which is why kswapd is going mad, rather than running out of CPU power.

@paralin
Copy link

paralin commented May 24, 2022

@JamesH65 is probably right, you're out of memory.

My bad, I assumed that the comment was related to this issue and not some other unrelated thing :)

@jdrch
Copy link

jdrch commented May 24, 2022

If this happens your SD card is probably bad. Best replace it.

Even relatively new SD cards can go bad, because they are not designed for repeated continuous writes to the same location over & over, nor continuous 24/7 operation.

I have had the best luck with SanDisk Ultra microSDHC UHS-I Memory Card DSQUA4-032G-GN6MT which rarely fail. Regardless I would recommend swapping out SD cards once every year to be safe.

The kernel should be updated nevertheless with better error handling, in my opinion.

All storage devices fail eventually ;) But yes, I have had the best luck with SanDisk, and their warranty (both length and replacement policy) is the best in the business.

Last Friday I retired the 32 GB card I'd been using since I got the Pi in 2018 and replaced it with a 200 GB card. We'll see what happens from there.

For those wondering about the process for doing so:

  1. Use SD Card Copier (Raspberry Pi OS GUI app) to copy existing card to new card
  2. Physically swap in new card
  3. (If desired) Use GParted to:
    • resize the shared data partition to its minimum size at the end of the partition table
    • extend the root partition into the unallocated space resulting from the above

@All3xJ
Copy link

All3xJ commented May 24, 2022

I believe you are running out of memory, which is why kswapd is going mad, rather than running out of CPU power.

@JamesH65 @paralin
Is there a way to fix this? Or to workaround this?

@JamesH65
Copy link
Contributor

Use less memory? Use Pi4 with more memory? Use a USB SSD and put a swap file on it, that might be faster than swap on the SD card?

What is the Pi doing that is using all the RAM?

@All3xJ
Copy link

All3xJ commented May 25, 2022

What is the Pi doing that is using all the RAM?

I noticed this when I used VS Code's SSH Remote Extension which install and runs a bunch of stuff in the remote server. This cause a HUGE use of memory, and then everything freezes, I couldn't even connect via SSH because the server is freezed.

EDIT: I fixed by increasing the swap partition size!

@shakin89
Copy link

I would like to thanks everyone for all the comments i've found on this page.
I've an rpi4 4G memory, and from this summer i've began to have this issue too frequently.

My system is a plain debian 11, with /boot/firmware and / mounted on sdcard; /home, /var and /tmp are mounted on ssd.
I think i'll move my / to ssd to and activate the swap, and first of all i'll replace my sdcard!!! 😄

@jdrch
Copy link

jdrch commented Oct 26, 2022

If this happens your SD card is probably bad. Best replace it.
Even relatively new SD cards can go bad, because they are not designed for repeated continuous writes to the same location over & over, nor continuous 24/7 operation.
I have had the best luck with SanDisk Ultra microSDHC UHS-I Memory Card DSQUA4-032G-GN6MT which rarely fail. Regardless I would recommend swapping out SD cards once every year to be safe.
The kernel should be updated nevertheless with better error handling, in my opinion.

All storage devices fail eventually ;) But yes, I have had the best luck with SanDisk, and their warranty (both length and replacement policy) is the best in the business.

Last Friday I retired the 32 GB card I'd been using since I got the Pi in 2018 and replaced it with a 200 GB card. We'll see what happens from there.

For those wondering about the process for doing so:

1. Use SD Card Copier (Raspberry Pi OS GUI app) to copy existing card to new card

2. Physically swap in new card

3. (If desired) Use GParted to:
   
   * resize the shared data partition to its minimum size at the end of the partition table
   * extend the root partition into the unallocated space resulting from the above

FWIW replacing the microSD card has fixed the problem. I think I've experienced only 1 crash since.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests