mmc0 Timeout waiting for hardware interrupt #2392

Closed
stschake opened this Issue Feb 21, 2018 · 16 comments

6 participants
@stschake
Contributor

stschake commented Feb 21, 2018

Hey,

we've been seeing the above error occur randomly on Raspberry Pi 3s. Neither the SD card manufacturer nor the card's age seems to make any difference. I've attached the full message log for the error.

Is this a known issue that can be mitigated somehow? We can't use the alternate SD card controller since we need the WLAN connectivity.

mmc_timeout.txt

@pelwell

Contributor

pelwell commented Feb 21, 2018

This is not a common failure - after a difficult development and some initial glitches the sdhost driver has been stable for at least a year now. There could well be bugs, but if so they must have been hiding well.

The log shows the last transfer (prior to this the interface was idle, periodically checking that the card is still present):

[13587.156458] [e97700ea] PRD< c2231240 0
[13587.162954] [e97700eb] PRD1 dea95b10 0
[13587.169461] [e97700f3] PRD2 a 0
[13587.175348] [e9770104] PRD3 cab66184 0
[13587.181863] [e9770105] PDM> c2231240 0
[13587.188327] [e9770106] REQ< c2231168 10801
[13587.195172] [e9770107] CMD< 17 50
[13587.201200] [e977010b] FCM< c2231168 c22311a4
[13587.208322] [e977010d] RSP  900 0
[13587.214378] [e977010e] CMD< 12 19c380
[13587.220771] [e9770110] CMDD 50 200
[13587.226933] [e9770110] SDMA c2231240 dea95b10
[13587.234098] [e9770112] FCM< c2231168 c22311d8
[13587.241258] [e9770113] RSP  900 0
[13587.247337] [e9770115] FCM> c2231168 0
[13587.253826] [e9770116] FCM> c2231168 0
[13587.260321] [e9770116] CMD  12 200
[13587.266457] [e9770116] REQ> c2231168 0
[13587.272942] [ea16f1e2] TIM< 0 0

It's a CMD23 (0x17 - Set block count) indicating an 80-sector transfer, followed by a CMD18 (0x12 - Read multiple blocks) starting at sector 0x19c380 (about 824 MiB, or 864 MB, from the start). The necessary DMA setup precedes it, but for some reason the DMA never completes, so the transfer times out.
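
The arithmetic behind that reading of the trace can be checked with a short sketch. It assumes 512-byte sectors (the `CMDD 50 200` line appears to show 0x50 blocks of 0x200 bytes) and that the CMD18 argument is a sector number, as the comment states; the helper name `decode` is made up for illustration.

```python
# Values copied from the trace above; the log prints them in hexadecimal.
SECTOR_SIZE = 512  # assumption: 512-byte sectors (0x200 in the "CMDD 50 200" line)


def decode(opcode_hex: str, arg_hex: str) -> tuple[int, int]:
    """Return (command index, argument) as decimal integers."""
    return int(opcode_hex, 16), int(arg_hex, 16)


# "CMD< 17 50" and "CMD< 12 19c380" from the log
set_count_cmd, block_count = decode("17", "50")
read_cmd, start_sector = decode("12", "19c380")

start_bytes = start_sector * SECTOR_SIZE
length_bytes = block_count * SECTOR_SIZE

print(f"CMD{set_count_cmd}: {block_count} sectors ({length_bytes} bytes)")
print(f"CMD{read_cmd}: start sector {start_sector} = byte offset {start_bytes}")
print(f"offset: {start_bytes / 1e6:.1f} MB, {start_bytes / 2**20:.1f} MiB")
```

Under those assumptions the read covers 40960 bytes and starts 864485376 bytes into the card, which is 864.5 MB or 824.4 MiB.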

How are you powering the Pi3? Is the power supply capable of delivering 2.5A? Do you ever see the yellow lightning signal appear on the display? What does running vcgencmd get_throttled report during typical activity?
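
For anyone following along, the value printed by `vcgencmd get_throttled` is a bit field. The sketch below decodes it using the bit positions given in the Raspberry Pi documentation; the function name `decode_throttled` and the sample value are illustrative and not taken from this thread.

```python
# Bit positions as documented for `vcgencmd get_throttled`:
# low bits describe the current state, bits 16 and up record events since boot.
FLAGS = {
    0: "under-voltage detected",
    1: "ARM frequency capped",
    2: "currently throttled",
    3: "soft temperature limit active",
    16: "under-voltage has occurred",
    17: "ARM frequency capping has occurred",
    18: "throttling has occurred",
    19: "soft temperature limit has occurred",
}


def decode_throttled(output: str) -> list[str]:
    """Decode a line such as 'throttled=0x50005' into flag descriptions."""
    value = int(output.strip().split("=", 1)[1], 16)
    return [name for bit, name in FLAGS.items() if value & (1 << bit)]


# Hypothetical sample value, not reported by anyone in this issue.
for flag in decode_throttled("throttled=0x50005"):
    print(flag)
```

A result of `throttled=0x0` decodes to an empty list, meaning no under-voltage or throttling event has been recorded since boot.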

@stschake

Contributor

stschake commented Feb 22, 2018

We don't have the warning signs enabled since we use the full KMS. I have just tested under high CPU load for a good while and observed no undervoltage event. The power supplies are rated for 2.5A.

I'll see if I can get some more traces. Unfortunately, there sometimes seem to be I/O-related lockups that don't even trigger the timeout.

@pelwell

Contributor

pelwell commented Feb 22, 2018

You say you've tried multiple Pi3s and multiple SD cards, so there must be something different about your environment; otherwise lots of people would be reporting similar problems. Can you post your config.txt and cmdline.txt, and tell me a bit about the software you are running?

@stschake

Contributor

stschake commented Feb 22, 2018

I did find one recent report of similar issues: https://patchwork.kernel.org/patch/10219145/
From what I can tell, the upstream sdhost driver mentioned there is identical to the downstream one, minus the extensive logging/tracing.

I've attached config/cmdline. The software is indeed a bit unusual. The kernel is rpi-4.9.y with a few choice cherry-picks from AOSP kernel common, and in userland we run Android on a mesa3d v18 graphics stack.

cmdline.txt
config.txt

@pelwell

Contributor

pelwell commented Feb 22, 2018

That all looks reasonable, as do the two patches you referenced. I'll do something similar in the downstream version of the driver (with a hat-tip to the original poster).

@pelwell

Contributor

pelwell commented Feb 22, 2018

I've got a patch which seems to work, although I've only been able to test it by faking a timeout. If I upload the patch somewhere will you be able to try it and test it?

@pelwell

Contributor

pelwell commented Feb 22, 2018

The patch can be downloaded here.

@stschake

Contributor

stschake commented Feb 23, 2018

Thanks, I'll test that. It is unfortunately difficult to reproduce reliably.

@pelwell

Contributor

pelwell commented Feb 23, 2018

Any evidence of a single timeout from which it recovers would be great, but since the timeouts are basically fatal at the moment the required confidence level for merging is pretty low.

@oniongarlic

Contributor

oniongarlic commented Feb 26, 2018

Just a note: I've started to experience this with an SD card (Samsung 16 EVO, the exact same model as mentioned in the Patchwork thread linked above) that was previously running wheezy on a Pi1B without any issues and has since been rewritten with the latest Raspbian. I tested on both a Pi2 B and an original Pi1; both fail. A pretty reliable way to reproduce it is running badblocks over the card.

@lategoodbye

Contributor

lategoodbye commented Mar 1, 2018

@stschake:

Are you able to reproduce this issue with the bcm2835.c instead of bcm2835-sdhost.c (they are more different than you think)?

@oniongarlic

Are you able to reproduce this issue with a different SD card than Samsung 16 EVO?
Which parameters did you use for badblocks to reproduce the issue?

@JamesH65

Contributor

JamesH65 commented Apr 23, 2018

@stschake Did @pelwell's patch help in any way, or are you still seeing the issue?

@oniongarlic Any further comment?

@stschake

Contributor

stschake commented Apr 23, 2018

Sorry, I haven't seen the issue since - but since the patch would only trigger on an actual timeout, I can't say either way.

@nullr0ute

Contributor

nullr0ute commented Apr 23, 2018

I've seen some issues like this on Fedora. I had one specific RPi3 that showed similar issues and went through a number of good-quality cards (SanDisk Ultra and Samsung EVO) in the process, but it seems to have settled with 4.15+ (although maybe that's wishful thinking). I wonder if the move of the MMC stack to MQ and some of the improvements there have helped this problem, or maybe masked it in different ways?

@JamesH65

Contributor

JamesH65 commented Jun 27, 2018

@pelwell Thoughts on closing this?

@pelwell

Contributor

pelwell commented Jun 27, 2018

OK with me. Closed issues are still searchable and can be re-opened as needed, but this has been quiet for long enough.

@pelwell pelwell closed this Jun 27, 2018
