New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
rpi-5.4.y: Fixes for V3D DRM timeout issues #3816
base: rpi-5.4.y
Are you sure you want to change the base?
Conversation
Two of these patches appear to be from upstream Please add the line @naushir Could you ping Igalia over the two v3d patches please? They really ought to be upstreamed too if correct. |
We were dealing with the piglit timeouts errors in the kernel since we upgraded to 5.4,y and reached to a similar fix to I was about to create a pull request to cherry-pick torvalds/linux@135517d and I've just found this recent Issue. @6by9 We will have a look to the patches, and check if they fix some of the errors we are still seeing when running the full-testsuites that didn't happen on 4.19.y @Terminus-IMRC do you have a particular test to reproduce the behaviour you describe at 2) |
69af0ae
to
f04b87f
Compare
Thank you for the responses! @6by9 I see. We've done editing the commit messages. @txenoo You may use our 32-bit build linux-image-5.4.58-v7l-idein-109 for ease (you will need The current sgemm.py runs in about 0.55 seconds. Tweaking |
@Terminus-IMRC, I've been testing your branch, and the CSD timeouts are notified and treated properly, piglit run exposes one of that situations, so I'm going to check it. But test run time execution continues to be higher that with 4.19.y, I've been checking the difference in the tests executions time and test from piglit under the group spec/arb_depth_buffer_float/depthstencil-render-miplevels are terrible slow. I think that the cause of this is related to MMU errors, that are hidden in the log of rpi-5.4.y kernel because of f23e93b ""drm/v3d: Suppress all but the first MMU error". I've checked and there is a reported issue #3574 "RPi4 2Gb MMU crash V3d " related to this in 5.4 |
@txenoo Thank you for testing! I've run the tests ( |
@Terminus-IMRC I've confirmed that theese slow tests are HW dependent, I've was launching the runs in a RPI4-8Gb and the MMU errors are hundreds. But in a RPI4-4Gb, my numbers are similar to yours with fewer MMU errors. And time execution is similar to 4.19.y with 20 minuties with my default test selection. But the RPI4-8Gb times are 1h30m with ~200 tests just returning timeout (90s). All timeout tests are also reporting MMU errors that are not reported in the 4Gb RPI4. |
@Terminus-IMRC . I couldn't find why torvalds/linux@135517d is required? I think that torvalds/linux@d7c5782 is just a fix-up of the regression introduced at 167bf96014a09. I think that "
I've seen that at dri-devel mailing list etnaviv has recently send their fix for this issue. So it would probably make sense to check their solution and submit a fix upstream too for v3d too. |
I was checking this issue and finally it was an issue with the deployment. The PRI4-8Gb I was using remotely had installed the xscreensaver that was activated during test runs. And some of the executed screensavers are generating extra MMU errors and that made the tests to timeout. So after a reinstall of RaspbianOS in one RPI4-8Gb and just run piglit times were in the range of 22 minutes. In any case we need to check what are this screensavers doing but that's a different Issue. Mystery solved. Sorry for the noise with my time executions. |
f04b87f
to
7e190e9
Compare
@txenoo Thank you for the reports and for letting me know the Etnaviv patch! Sorry, I was confusing about torvalds/linux@135517d -- I thought it was needed for torvalds/linux@d7c5782, but it turned out that the commit is needed to prevent a kernel NULL pointer dereference when many CSD timeouts occur behind a Vulkan program:
However, this issue is not related to this PR, so now I removed the relevant commits from the branch. |
Comment here when it gets merged upstream. |
The patchset waiting for comments is here: https://lists.freedesktop.org/archives/dri-devel/2020-September/278609.html . |
So, no activity since Sept as far as I can see? Is there something that is blocking this patchset from being merged? |
No, with a reason I don't know... A few weeks later from the patchset I sent a reminder to @anholt, the sole maintainer of the driver, but currently, I've received no responses. Do you see something wrong in my patchset? |
I don't know that section of the kernel... so can't comment at the moment, but was wanting to try out qmkl6 on the 64bit Raspbian kernel... which is 5.4.79* based |
The previous code misses a check for the timeout error set by drm_sched_resubmit_jobs(), which results in an infinite GPU reset loop if once a timeout occurs: [ 178.799106] v3d fec00000.v3d: [drm:v3d_reset [v3d]] *ERROR* Resetting GPU for hang. [ 178.807836] v3d fec00000.v3d: [drm:v3d_reset [v3d]] *ERROR* V3D_ERR_STAT: 0x00001000 [ 179.839132] v3d fec00000.v3d: [drm:v3d_reset [v3d]] *ERROR* Resetting GPU for hang. [ 179.847865] v3d fec00000.v3d: [drm:v3d_reset [v3d]] *ERROR* V3D_ERR_STAT: 0x00001000 [ 180.879146] v3d fec00000.v3d: [drm:v3d_reset [v3d]] *ERROR* Resetting GPU for hang. [ 180.887925] v3d fec00000.v3d: [drm:v3d_reset [v3d]] *ERROR* V3D_ERR_STAT: 0x00001000 [ 181.919188] v3d fec00000.v3d: [drm:v3d_reset [v3d]] *ERROR* Resetting GPU for hang. [ 181.928002] v3d fec00000.v3d: [drm:v3d_reset [v3d]] *ERROR* V3D_ERR_STAT: 0x00001000 ... This commit adds the check for timeout as in v3d_{bin,render}_job_run(): [ 66.408962] v3d fec00000.v3d: [drm:v3d_reset [v3d]] *ERROR* Resetting GPU for hang. [ 66.417734] v3d fec00000.v3d: [drm:v3d_reset [v3d]] *ERROR* V3D_ERR_STAT: 0x00001000 [ 66.428296] [drm] Skipping CSD job resubmission due to previous error (-125) , where -125 is -ECANCELED, though users currently have no way other than inspecting the dmesg to check if the timeout has occurred. Signed-off-by: Yukimasa Sugizaki <ysugi@idein.jp>
The default timeout is 500 ms which is too short for some workloads including Piglit. Adding this parameter will help users to run heavier tasks. Signed-off-by: Yukimasa Sugizaki <ysugi@idein.jp>
7e190e9
to
fb3a322
Compare
Thank you for the interest! If you trust us, then you can use our build for aarch64. You can install this by running Otherwise, you can build the Idein/linux:rpi-5.4.y-v3d-timeout tree by yourself. Adding |
The patches are blocked waiting for Reviewed-by responses. Eric has moved on to pastures new, although is still listed as the supporter for V3D. Kernel patch reviewing generally requires some knowledge of the system, and certainly in the 3D land that's not me. |
Hi there, the issue that this commit is attempting to fix is hurting my project. I would love to see it merged and fixed. May I enquire kindly when it will be released? |
I've been knee-deep in |
@wimrijnders I'm happy if you reply to my patch mail with Tested-by or Reviewed-by tag with a description of your tests (see https://www.kernel.org/doc/html/latest/process/submitting-patches.html#using-reported-by-tested-by-reviewed-by-suggested-by-and-fixes ). I don't know who is responsible for the employment of the maintainer. |
@wimrijnders Anyone can send a Tested-by. Looking at "drm/v3d: Add job timeout module param", whilst module parameters aren't nice, the patch is fine.
would presumably get checkpatch complaining about the indentation not matching the opening parenthesis. The other patch does require some knowledge of v3d. If you feel you have sufficient knowledge then feel free to send a Reviewed-by. Terminus-IMRC has done a good job of describing the issue and solution in the commit text, and it does appear to do what it says. |
Thank you for the review! I'll try to move the timeout module parameter to debugfs. |
Don't bother - debugfs is overkill for this. I'm happy with it being a module parameter, although perhaps the default just needs to be increased. |
@Terminus-IMRC Now's there's a handle which I've encountered often in my search for information on Since my project depends on this bug fixed, I'll put the time in it for a proper test. I'll read up and then see what I can do. |
OK, I said that I will look into confirming the fix and am gearing up to it. This is the approach that looks most logical to me:
Is this the proper and accepted way of doing this? This looks like it will take a long time. But, whatever it takes. |
@Terminus-IMRC I've devoted some time to review the two patches currently included in this PR. I have still pending to review |
@wimrijnders Yes, please go ahead. @txenoo Thank you for reviewing! Please feel free to point out my mistakes, because I have not so much experience in making a patch. |
Is there a plan for this to be merged? |
The current V3D DRM driver in
rpi-5.4.y
kernel has two issues:NULL
fromv3d_cache_clean_job_run()
is mis-treated as an error by the DRM scheduler, which results in the following warning:It takes some time to print the warning and thus the execution time of CL/CSD programs will get longer.
This warning can be ceased with 3c37926, and this commit requires 93db7d6.
However, 93db7d6 is known to break the behavior where
v3d_{cl,csd}_job_timedout()
expect the timer to be rearmed.Since workarounds for this issue are not available yet, we changed the codes not to expect the timer rearming in b465bea.
The non-rearming timer issue should be fixed in the DRM scheduler itself (as said by Daniel Vetter in the thread) because the Etnaviv DRM driver also expects the behavior.
Therefore, we think these commits are appropriate only for the rpi tree.
v3d_csd_job_run()
ignores timeout error flag which results in an infinite GPU reset loop if once a timeout occurs:This issue is fixed with 410a046 by correctly referring to the timed-out flag.
We also added a function to set the timeout milliseconds through kernel's cmdline in 69af0ae.
Because these patches depend on the timeout behavior described in 1, and
bcm2711_defconfig
is not available in drm-next for now, we choose not to post these patches to dri-devel yet.We tested these patches with Piglit and SaschaWillems/Vulkan with Igalia's Vulkan driver, and our CSD programs in Idein/py-videocore6.
The programs run peacefully with each other even if a timeout occurs.
Note that the first two drm/sched commits are from linux-5.6.y tree, so you may cherry-pick the other three commits by us to rpi-5.8.y if we are moving to there.
@anholt Is there any reason that the timeout flag is referred to in
v3d_{bin,render}_job_run()
but is ignored inv3d_csd_job_run()
?