EGL Context creations hangs after VideoCore crash #254

Open
DjPale opened this issue Sep 3, 2015 · 20 comments

@DjPale DjPale commented Sep 3, 2015

I have seen similar issues reported, but they are quite old and may not be related to the same bug. I compiled SDL2 according to https://solarianprogrammer.com/2015/01/22/raspberry-pi-raspbian-getting-started-sdl-2/.

At some point the display freezes and the app has to be terminated with SIGKILL.
After this, any application that tries to create an EGL context will not start at all. The only way to recover that I have found is to reboot the system.

This has happened on both an RPi2 with the newest Raspbian distro (Linux rpi2 4.1.6-v7+ #810 SMP PREEMPT Tue Aug 18 15:32:12 BST 2015 armv7l GNU/Linux) and a version 1 B with the RetroPie distro (Linux retropie 3.18.11+ #781 PREEMPT Tue Apr 21 18:02:18 BST 2015 armv6l GNU/Linux).

I tested a very simple program found here: http://pastebin.com/Vnje5sEe which uses the Pi GL API directly (not SDL2), and from what I can see the call to eglCreateContext never returns.

I do not have exact steps to reproduce this error yet, but in my opinion a call that never returns should not happen under any circumstances.

Contributor

@popcornmix popcornmix commented Sep 4, 2015

The problem isn't that eglCreateContext doesn't return - it sounds like the gpu has crashed.
I suspect that video playback (e.g. hello_video) and quite possibly vcgencmd will also be failing at this point.

It might be worth setting start_debug=1 in config.txt and after the crash running:

sudo vcdbg log msg
sudo vcdbg log assert
sudo vcdbg malloc
sudo vcdbg reloc

Ideally run vcgencmd cache_flush before the malloc/reloc commands, although that command may fail depending on how crashed the gpu is.
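
The commands above could be wrapped in a small helper so that everything is captured in one pass after a crash. This is only a sketch: the output directory, the file names, the 10-second limit, and the VCDBG/VCGENCMD override variables are assumptions made for illustration and do not come from this thread. It relies on timeout(1) from coreutils so that a stuck cache_flush cannot block the rest of the collection.

```shell
#!/bin/sh
# Sketch: gather the VideoCore logs suggested above into one directory.
# Assumptions (not from the thread): directory and file names, the
# 10-second limit, and the VCDBG / VCGENCMD override variables.
# Requires start_debug=1 to have been set in config.txt before the crash.

collect_vc_logs() {
    out_dir=${1:-vc-crash-logs}
    vcdbg=${VCDBG:-sudo vcdbg}
    vcgencmd=${VCGENCMD:-vcgencmd}

    mkdir -p "$out_dir" || return 1

    # The message and assert logs do not depend on the cache flush.
    $vcdbg log msg    > "$out_dir/log_msg.txt"    2>&1
    $vcdbg log assert > "$out_dir/log_assert.txt" 2>&1

    # cache_flush may fail or hang on a badly crashed GPU, so bound it
    # with timeout(1) and carry on either way.
    if ! timeout 10 $vcgencmd cache_flush > "$out_dir/cache_flush.txt" 2>&1; then
        echo "cache_flush failed or timed out" >> "$out_dir/cache_flush.txt"
    fi

    $vcdbg malloc > "$out_dir/malloc.txt" 2>&1
    $vcdbg reloc  > "$out_dir/reloc.txt"  2>&1
}
```

Calling `collect_vc_logs /tmp/vc-logs` after the freeze would leave one text file per command, which can then be zipped and attached to the issue.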

Really you need to provide a test app that I can run that provokes the gpu crash. That way I can get the gpu debugger connected and see what the problem is.

Just stating the obvious, but if you are having any stability issues, then disable overclocking before running any tests.

@bluefishisme bluefishisme commented Oct 14, 2015

I have seen a similar issue: after the program freezes and I kill it, it cannot be run again, and only a reboot solves the problem. I have a small program, with source, that can reproduce this problem.
I posted it at this link:
https://www.raspberrypi.org/forums/viewtopic.php?f=67&t=121267

But no one has commented.

@ykram ykram commented May 23, 2016

@bluefishisme did you ever figure out how to fix your issue? I ask because I'm experiencing the exact same symptoms you are, in that even vcgencmd freezes after OpenVG calls occur:

ioctl(3, 0xc01cc402

Appears to hang there and all subsequent openvg calls fail.

Also,
$ sudo vcdbg log msg
shows:
412414.170: vcos_abort: Halting

Contributor

@popcornmix popcornmix commented May 23, 2016

@ykram Do you have an application I can run on raspbian that provokes the vcos_abort?
I could at least then determine the backtrace that resulted in that.

@ykram ykram commented May 23, 2016

I can upload the source that triggers the issue, although it depends on the OpenVG wrapper (ajstarks/openvg repo) and also requires input, as it uses IPC to dictate how things get drawn. I can provide a dummy app that sends data so it would work. What's the best way to get those to you?

Contributor

@popcornmix popcornmix commented May 23, 2016

Just zip/tar up the files I need to run and give me a link (e.g. to dropbox/google drive).
I don't need the source just something that when run provokes a vcos_abort.

@ykram ykram commented May 23, 2016

I'll try to get this archived and sent to you today. I have to recompile some network-specific parts so that you'll be able to reproduce sending/receiving the data that the OpenVG calls depend on.

@ykram ykram commented May 24, 2016

So I've been trying to reproduce this using an application that reads network data and replays it back to the server, so that the OpenVG client can read, interpret, and display the data, but I can't get it to crash this way. If I use the application as intended, however, it crashes randomly (vcos_abort()). Is there any way I can generate a stacktrace/coredump and get you those files to debug?

Contributor

@popcornmix popcornmix commented May 25, 2016

It's the gpu that is calling vcos_abort, so arm stacktrace/coredump won't help.
It's not possible to capture a gpu stacktrace/coredump.

@ykram ykram commented May 25, 2016

Ah, bummer. I'll work on creating a POC that reproduces the issue and will reply here as soon as I have something that reliably reproduces the bug.

@ykram ykram commented Jun 3, 2016

I'm still trying to find a way to reproduce this reliably, but in the meantime I did find where the loop/wait seems to occur, if this is helpful:
#0 0x76d8ba40 in do_futex_wait (isem=isem@entry=0x76c29a40 <khrn_queue+76>)
at ../nptl/sysdeps/unix/sysv/linux/sem_wait.c:48
#1 0x76d8baf4 in __new_sem_wait (sem=0x76c29a40 <khrn_queue+76>)
at ../nptl/sysdeps/unix/sysv/linux/sem_wait.c:69
#2 0x76b51aa4 in vchiu_queue_pop () from /opt/vc/lib/libvchiq_arm.so
#3 0x76c02be8 in rpc_recv () from /opt/vc/lib/libEGL.so
#4 0x76c132dc in vguLine () from /opt/vc/lib/libEGL.so
#5 0x76da9920 in Line () from /usr/lib/libshapes.so
#6 0x43b66666 in ?? ()

As I said, I'm still working on getting something you can run that will reproduce this for you.
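
For anyone wanting to capture a comparable ARM-side backtrace from a hung client, a minimal sketch using gdb in batch mode is shown here. The function name and the GDB override variable are hypothetical, and attaching may not work if the process is stuck in an uninterruptible kernel call rather than in a userspace wait like the one above. As noted earlier in the thread, this only shows where the ARM side is waiting and says nothing about the GPU state.

```shell
#!/bin/sh
# Sketch: print backtraces for all threads of a running (hung) process.
# Assumptions: gdb is installed; dump_backtrace and the GDB override
# variable are illustrative names, not part of any Raspberry Pi tool.

dump_backtrace() {
    if [ -z "$1" ]; then
        echo "usage: dump_backtrace PID" >&2
        return 2
    fi
    # -p attaches to the PID, -batch exits when done,
    # -ex runs the given gdb command after attaching.
    ${GDB:-gdb} -p "$1" -batch -ex "thread apply all bt"
}
```

Something like `dump_backtrace "$(pidof hello_triangle.bin)"` would attach to the stuck process; root privileges may be needed depending on the system's ptrace settings.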

@Ruffio Ruffio commented Sep 4, 2016

@ykram any progress on the POC?

@Ruffio Ruffio commented Dec 30, 2016

@ykram any progress on the POC? (This is second ping...)

Contributor

@julianscheel julianscheel commented Jan 20, 2017

We saw a very similar (possibly the same) issue in our firmware. We could reproduce it using hello_triangle.bin: start it, then call tvservice -p and restart hello_triangle.bin. After a few cycles hello_triangle.bin would not start again, and the stack shows that it is hanging in eglCreateContext.

After a lot of digging around we realized that we had enable_hdmi_status=1 set in config.txt. After removing that the issue did not appear again. Do you possibly have that option set as well @DjPale?
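
A quick way to check for that option might look like the sketch below. It assumes the file is at /boot/config.txt unless another path is given, and the function name is made up for illustration. It only looks for an uncommented line and does not evaluate conditional sections such as [pi2].

```shell
#!/bin/sh
# Sketch: succeed if an uncommented enable_hdmi_status=1 line is present.
# Assumptions: default path /boot/config.txt; hdmi_status_enabled is an
# illustrative name. Conditional [section] filters are not evaluated.

hdmi_status_enabled() {
    config=${1:-/boot/config.txt}
    # Lines starting with '#' do not match because the pattern is
    # anchored at the start of the line (optional whitespace allowed).
    grep -Eq '^[[:space:]]*enable_hdmi_status[[:space:]]*=[[:space:]]*1[[:space:]]*$' "$config"
}
```

For example, `hdmi_status_enabled && echo "enable_hdmi_status=1 is set"` prints the message only when the option is active.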

@popcornmix Any thoughts about this?

Contributor

@popcornmix popcornmix commented Jan 20, 2017

@julianscheel I've just tried:

while : ; do (./hello_triangle.bin &); sleep 2; tvservice -p; sleep 2; killall hello_triangle.bin; done

with and without enable_hdmi_status=1 and it seems to be running okay. Is that what you meant?

@dennis-hamester dennis-hamester commented Jan 23, 2017

@popcornmix: Can you try again with this script?

#!/bin/sh
while : ; do
        tvservice -p
        ./hello_triangle.bin &
        PID=$!
        tvservice -p
        sleep 5
        kill $PID

        ./hello_triangle.bin &
        PID=$!
        sleep 5
        kill $PID
done

Starting tvservice immediately before hello_triangle seems to be necessary. With this script, I can reliably trigger the bug in a fully updated raspbian and with enable_hdmi_status=1. It usually takes about 10 iterations of the loop to actually happen.

The second invocation of hello_triangle exists just so that it is easier to check whether or not the bug triggered.

edit: It can also take many more iterations than just 10, but so far, the bug always triggers here eventually.
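
Since the number of iterations varies, a variant of the loop that counts iterations and stops once the GPU appears hung may be convenient. The sketch below replaces the second hello_triangle run with a health probe. Treating a `vcgencmd version` call that does not finish within 5 seconds as a sign of the hang is an assumption based on earlier comments that vcgencmd also stops responding; it has not been confirmed for this particular reproduction. The function names and the APP, TVSERVICE, RUN_SECS, PROBE_CMD, and PROBE_TIMEOUT variables are hypothetical.

```shell
#!/bin/sh
# Sketch: repeat the tvservice / hello_triangle cycle and stop when a
# probe command no longer completes in time.
# Assumptions: a vcgencmd timeout indicates the hang (unconfirmed);
# all function and variable names here are illustrative.

gpu_responds() {
    # Succeeds if the probe finishes within the time limit.
    timeout "${PROBE_TIMEOUT:-5}" ${PROBE_CMD:-vcgencmd version} > /dev/null 2>&1
}

run_repro() {
    max_iter=${1:-100}
    app=${APP:-./hello_triangle.bin}
    tvs=${TVSERVICE:-tvservice}
    i=0
    while [ "$i" -lt "$max_iter" ]; do
        i=$((i + 1))
        # Same ordering as the script above: tvservice immediately
        # before and after starting the app.
        $tvs -p
        $app &
        pid=$!
        $tvs -p
        sleep "${RUN_SECS:-5}"
        kill "$pid" 2>/dev/null

        if ! gpu_responds; then
            echo "GPU stopped responding at iteration $i"
            return 1
        fi
    done
    echo "no hang detected in $max_iter iterations"
    return 0
}
```

Running `run_repro 200` with enable_hdmi_status=1 set would then report the iteration at which the probe first failed, which makes it easier to compare firmware versions.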

@camthesaxman camthesaxman commented Jun 11, 2018

Any status on resolving this bug? I'm currently being affected by it, even in 2018 with Raspbian Stretch.

Collaborator

@JamesH65 JamesH65 commented Jun 12, 2018

I doubt anyone is looking at it. Unfortunately it's very low priority, and we have oodles of higher-priority stuff to fix/develop.

Contributor

@6by9 6by9 commented Jun 12, 2018

@camthesaxman If you have a simple test case you can share that triggers the lockup, then we can investigate the issue.

Collaborator

@JamesH65 JamesH65 commented Jan 8, 2019

This issue will be closed within 30 days unless further interactions are posted. If you wish this issue to remain open, please add a comment. A closed issue may be reopened if requested.
