GT 540M (nvc1): thread lockup, high CPU utilization #5

Closed
kris7t opened this issue Jul 9, 2012 · 7 comments

Comments

@kris7t
Contributor

kris7t commented Jul 9, 2012

I am trying to get gdev running on my laptop, which has an nVidia GT 540M with Optimus.

I had some success porting the gdev-nouveau patch to a recent kernel (v3.5-rc5), as some older kernels seem to suffer from a regression that prevents nouveau from correctly loading the VBIOS of Optimus cards.

I can get gdev to load and initialize scheduling on my card.

[19399.565352] [drm] nouveau 0000:01:00.0: Detected an NVc0 generation card (0x0c1a00a1)
[19399.571400] [drm] nouveau 0000:01:00.0: Checking PRAMIN for VBIOS
[19399.581142] [drm] nouveau 0000:01:00.0: ... BIOS signature not found
[19399.581145] [drm] nouveau 0000:01:00.0: Checking PROM for VBIOS
[19399.581206] [drm] nouveau 0000:01:00.0: ... BIOS signature not found
[19399.581207] [drm] nouveau 0000:01:00.0: Checking ACPI for VBIOS
[19400.109459] [drm] nouveau 0000:01:00.0: ... appears to be valid
[19400.109463] [drm] nouveau 0000:01:00.0: Using VBIOS from ACPI
[19400.109465] [drm] nouveau 0000:01:00.0: BIT BIOS found
[19400.109467] [drm] nouveau 0000:01:00.0: Bios version 70.08.55.00
[19400.109469] [drm] nouveau 0000:01:00.0: TMDS table version 2.0
[19400.109804] [drm] nouveau 0000:01:00.0: MXM: no VBIOS data, nothing to do
[19400.109806] [drm] nouveau 0000:01:00.0: DCB version 4.0
[19400.109808] [drm] nouveau 0000:01:00.0: DCB outp 00: 02000300 00000000
[19400.109810] [drm] nouveau 0000:01:00.0: DCB conn 00: 00000000
[19400.109824] [drm] nouveau 0000:01:00.0: Adaptor not initialised, running VBIOS init tables.
[19400.109826] [drm] nouveau 0000:01:00.0: Parsing VBIOS init table 0 at offset 0xD5E7
[19400.131225] [drm] nouveau 0000:01:00.0: 0xD591: i2c wr fail: -6
[19400.171601] [drm] nouveau 0000:01:00.0: Parsing VBIOS init table 1 at offset 0xDC43
[19400.198693] [drm] nouveau 0000:01:00.0: Parsing VBIOS init table 2 at offset 0xEE49
[19400.198699] [drm] nouveau 0000:01:00.0: Parsing VBIOS init table 3 at offset 0xEE4D
[19400.198754] [drm] nouveau 0000:01:00.0: Parsing VBIOS init table 4 at offset 0xEF35
[19400.198756] [drm] nouveau 0000:01:00.0: Parsing VBIOS init table at offset 0xEF9A
[19400.221127] [TTM] Zone  kernel: Available graphics memory: 4045146 kiB
[19400.221129] [TTM] Zone   dma32: Available graphics memory: 2097152 kiB
[19400.221130] [TTM] Initializing pool allocator
[19400.221133] [TTM] Initializing DMA pool allocator
[19400.221150] [drm] nouveau 0000:01:00.0: Detected 512MiB VRAM (DDR3)
[19400.225943] [drm] nouveau 0000:01:00.0: 512 MiB GART (aperture)
[19400.231122] [drm] Supports vblank timestamp caching Rev 1 (10.10.2010).
[19400.231123] [drm] No driver support for vblank timestamp query.
[19400.231125] [drm] nouveau 0000:01:00.0: ACPI backlight interface available, not registering our own
[19400.236646] [drm] nouveau 0000:01:00.0: 3 available performance level(s)
[19400.236649] [drm] nouveau 0000:01:00.0: 0: core 50MHz shader 101MHz memory 135MHz voltage 830mV
[19400.236651] [drm] nouveau 0000:01:00.0: 1: core 202MHz shader 405MHz memory 324MHz voltage 830mV
[19400.236653] [drm] nouveau 0000:01:00.0: 3: core 672MHz shader 1344MHz memory 900MHz voltage 980mV
[19400.236655] [drm] nouveau 0000:01:00.0: c: core 202MHz shader 405MHz memory 324MHz voltage 980mV
[19400.241476] [drm] nouveau 0000:01:00.0: MM: using COPY1 for buffer copies
[19400.318223] [drm] nouveau 0000:01:00.0: allocated 1024x768 fb: 0x120000, bo ffff88023e7e6c00
[19400.318307] fb1: nouveaufb frame buffer device
[19400.318310] [drm] Initialized nouveau 1.0.0 20120316 for 0000:01:00.0 on minor 1
[19400.794484] [drm] nouveau 0000:01:00.0: no native mode, forcing panel scaling
[19421.057854] [gdev] Loading module...
[19421.057864] [gdev] Found 1 physical device(s).
[19421.057869] [gdev] Configured 4 virtual device(s).
[19421.058054] [gdev] Gdev#0 compute scheduler running
[19421.058093] [gdev] Gdev#0 memory scheduler running
[19421.058132] [gdev] Gdev#0 compute reserve running
[19421.058172] [gdev] Gdev#0 memory reserve running
[19421.058209] [gdev] Gdev#1 compute scheduler running
[19421.058247] [gdev] Gdev#1 memory scheduler running
[19421.058286] [gdev] Gdev#1 compute reserve running
[19421.058298] [gdev] Gdev#1 memory reserve running
[19421.058321] [gdev] Gdev#2 compute scheduler running
[19421.058336] [gdev] Gdev#2 memory scheduler running
[19421.058351] [gdev] Gdev#2 compute reserve running
[19421.058367] [gdev] Gdev#2 memory reserve running
[19421.358632] [gdev] Gdev#3 compute scheduler running
[19421.358663] [gdev] Gdev#3 memory scheduler running
[19421.358695] [gdev] Gdev#3 compute reserve running
[19421.358725] [gdev] Gdev#3 memory reserve running

I added some printf statements to the (user-mode) madd test in order to trace execution without interfering through a debugger or similar tools.
When I try to run it, cuInit and cuDeviceGet run successfully. Things start to get interesting with cuCtxCreate, the first call that actually "does something" with the GPU. Although it returns successfully, the following dmesg output is produced:

*** [19435.092642] [gdev] Created context object on gdev0
[19435.092727] [drm] nouveau 0000:01:00.0: PFIFO: read fault at 0x0000000000 [PT_NOT_PRESENT] from PGRAPH/CTXCTL on channel 0x00004de000
[19435.092743] [drm] nouveau 0000:01:00.0: PFIFO: unknown status 0x40000000
*** [19435.092964] [gdev] Created DMA object on gdev0
*** [19435.092977] [gdev] Created scheduling entity on gdev0
[19435.092980] [gdev] Opened gdev0

Lines marked with *** were added by me and complement the messages that would be printed in case of an error. To sum up, the card gets into some unknown state, but some further memory allocations succeed and the CUDA context can be created.
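
A minimal sketch of this kind of printf tracing, written so that the last marker survives even when a call never returns. The `TRACE` macro, `trace_done`, and the two `fake_*` functions are hypothetical names introduced for illustration; the stubs stand in for driver API calls such as cuInit and cuCtxCreate so that the sketch compiles without a CUDA toolchain. It assumes markers go to stderr, which is unbuffered by default, so nothing is lost in a stdio buffer if the process hangs.

```c
#include <stdio.h>

/* Print the result of a traced call and pass the return code through. */
static int trace_done(const char *what, int rc)
{
    fprintf(stderr, "[trace] <- %s = %d\n", what, rc);
    return rc;
}

/*
 * Log a marker before evaluating `expr` and another one after it returns.
 * The comma operator guarantees the first marker is written before the call.
 */
#define TRACE(expr) \
    (fprintf(stderr, "[trace] -> %s\n", #expr), trace_done(#expr, (int)(expr)))

/* Hypothetical stand-ins for driver API calls; 0 plays the role of success. */
static int fake_init(unsigned int flags) { (void)flags; return 0; }
static int fake_ctx_create(void) { return 0; }
```

Each call in a test program can be wrapped the same way, for example `TRACE(fake_init(0))`. If the process hangs, the last `->` line without a matching `<-` line names the call that did not return.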

The test program continues from this point. However, cuModuleLoad never returns, and the CPU core on which madd runs gets stuck at 100% utilization. I cannot kill the test application with SIGTERM or SIGKILL, so I think it can be safely concluded that the kernel is stuck waiting for some spinlock.
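
One way to check the spinlock guess from user space is to look at the task state the kernel reports. The following is a sketch assuming Linux procfs: the field after the parenthesized command name in /proc/<pid>/stat is the state, where R means running and D means uninterruptible sleep. A task pinned at 100% CPU in state R that ignores SIGKILL is consistent with busy-waiting in kernel mode, since signals are only acted on when the task returns to user mode. The function names are illustrative.

```c
#include <stdio.h>
#include <string.h>

/*
 * Extract the state character from the contents of /proc/<pid>/stat.
 * The layout is "pid (comm) state ..."; comm may itself contain spaces
 * or parentheses, so the last ')' is used as the anchor.
 * Returns '?' if the text does not have the expected shape.
 */
static char parse_proc_state(const char *stat_text)
{
    const char *p = strrchr(stat_text, ')');
    if (p == NULL || p[1] != ' ' || p[2] == '\0')
        return '?';
    return p[2];
}

/* Read /proc/<pid>/stat and return the state character, or '?' on failure. */
static char read_proc_state(long pid)
{
    char path[64];
    char buf[512];
    size_t n;
    FILE *f;

    snprintf(path, sizeof path, "/proc/%ld/stat", pid);
    f = fopen(path, "r");
    if (f == NULL)
        return '?';
    n = fread(buf, 1, sizeof buf - 1, f);
    fclose(f);
    buf[n] = '\0';
    return parse_proc_state(buf);
}
```

Calling `read_proc_state` with the PID of the stuck madd process should distinguish the two cases. On kernels that provide it, /proc/<pid>/stack (readable by root) additionally shows where in the kernel the task currently is.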

Because gdev_raw_ctx_new basically just allocates some buffer objects with gdev_drv_bo_alloc, I tried fiddling with that function a bit. Setting an alignment value (0x1000) does not help. I don't know much about the inner workings of nouveau or the implementation details of nVidia cards, and gdev_drv_bo_alloc otherwise looked idiomatic to me compared with other code in v3.5-rc5's nouveau, so I cannot think of anything else that could be done to fix this problem.

Do you have any idea how I could get gdev to work on my machine? The other card I have access to is a GTX 560 Ti (nvc3), which is even newer and, I imagine, has even... lighter support in nouveau. Of course I will try gdev on that machine too: the lack of Optimus might help a bit. But currently I only have remote access to that machine, and random kernel panics could annoy the system administrators deeply.

I also tried the pscnv driver, but had even less success: it loads an invalid VBIOS from PCIROM, thus a default: case with BUG(1) gets hit when pscnv tries to set the clock frequency for the card. (I wrote a patch for Bumblebee, the script that makes use of Optimus under Linux, to load pscnv. I will, for obvious reasons, only submit that patch to the Bumblebee git repo when I can actually load pscnv without a kernel panic.)

@shinpei0208
Owner

[19435.092727] [drm] nouveau 0000:01:00.0: PFIFO: read fault at 0x0000000000 [PT_NOT_PRESENT] from PGRAPH/CTXCTL on channel 0x00004de000
[19435.092743] [drm] nouveau 0000:01:00.0: PFIFO: unknown status 0x40000000

This is bad. I suspect that Gdev accesses Nouveau in a wrong way. It's certainly fixable, but I'm wondering how to fix it. I don't have a 540M at hand. However, I have a GTX 560 Ti with me. I can try to install your v3.5-rc5. Is it possible for you to put your kernel somewhere online where I can access it?

@kris7t
Contributor Author

kris7t commented Jul 10, 2012

The patch is on my GitHub fork of gdev, in the new-nouveau-patch branch, for which I submitted a pull request that I closed immediately after I ran into this bug.

The patch is identical to the one in your repository, with two exceptions. First, nouveau_bo_new got a new parameter that should be set to NULL if the buffer object is not intended to be shared between multiple drivers (for example, nouveau and intel for Optimus OpenGL rendering support). Second, the primary indexes now seem to start from 1, not 0, so 1 needs to be subtracted from them before they can be used as an array index.

Anyways, I compiled the kernel with that patch on the machine with GTX 560 Ti, and the following appeared in dmesg when I tried to start the X server:

[  153.825331] [drm] nouveau 0000:01:00.0: PFIFO - playlist update failed
[  163.949164] [drm] nouveau 0000:01:00.0: Failed to idle channel 1.
[  165.946184] [drm] nouveau 0000:01:00.0: 0x2634 != chid: 0x00100001
[  165.946287] [drm] nouveau 0000:01:00.0: PFIFO: unknown status 0x00000100
[  165.946288] [drm] nouveau 0000:01:00.0: PFIFO: unknown status 0x40000000
[  167.941972] [drm] nouveau 0000:01:00.0: PFIFO - playlist update failed
[  168.232173] [drm] nouveau 0000:01:00.0: PFIFO: unknown status 0x00000100
[  170.227812] [drm] nouveau 0000:01:00.0: PFIFO - playlist update failed
[  179.482762] [drm] nouveau 0000:01:00.0: Failed to idle channel 1.
[  181.479832] [drm] nouveau 0000:01:00.0: 0x2634 != chid: 0x00100001
[  181.479936] [drm] nouveau 0000:01:00.0: PFIFO: unknown status 0x00000100
[  183.475619] [drm] nouveau 0000:01:00.0: PFIFO - playlist update failed
[  183.768107] [drm] nouveau 0000:01:00.0: PFIFO: unknown status 0x00000100
[  185.763748] [drm] nouveau 0000:01:00.0: PFIFO - playlist update failed
[  194.993081] [drm] nouveau 0000:01:00.0: Failed to idle channel 1.
[  196.990235] [drm] nouveau 0000:01:00.0: 0x2634 != chid: 0x00100001
[  196.990381] [drm] nouveau 0000:01:00.0: PFIFO: unknown status 0x00000100
[  198.986023] [drm] nouveau 0000:01:00.0: PFIFO - playlist update failed
[  199.276366] [drm] nouveau 0000:01:00.0: PFIFO: unknown status 0x00000100
[  201.272005] [drm] nouveau 0000:01:00.0: PFIFO - playlist update failed
[  210.506718] [drm] nouveau 0000:01:00.0: Failed to idle channel 1.
[  212.503890] [drm] nouveau 0000:01:00.0: 0x2634 != chid: 0x00100001
[  212.504036] [drm] nouveau 0000:01:00.0: PFIFO: unknown status 0x00000100
[  214.499678] [drm] nouveau 0000:01:00.0: PFIFO - playlist update failed

[  403.803142] [drm] nouveau 0000:01:00.0: PFIFO: read fault at 0x0008010000 [INVALID_STORAGE_TYPE] from PFIFO/PFIFO on channel 0x00000cb000
[  403.803147] [drm] nouveau 0000:01:00.0: PFIFO: unknown status 0x40000000

The X server was completely unable to start with xf86-video-nouveau, even without experimental DRI support installed.
I was also unable to proceed and load the gdev kernel module, because the console framebuffer was also corrupted.

It seems some change in upstream nouveau causes the gdev patch to misbehave. Looking at the git log of the nouveau drivers in the kernel tree, there are several recent commits concerning PFIFO behaviour, the biggest of which is c420b2dc8dc3cdd507214f4df5c5f96f08812cbe.

I find it a bit strange that the unknown PFIFO status is triggered even without loading the gdev module (on my laptop, it only occurs after context creation).
The only non-trivial part of the gdev nouveau patch that may get executed in this situation is the added if statement in nvc0_graph_isr. However, I do not understand what is happening in that function, except that some magic numbers get read from and written to the card...

@shinpei0208
Owner

Hi Kris,

Thanks for the patch. I'll take a look. Regarding the error you encountered above, I would appreciate it if you could post it on #nouveau on IRC. Do you have an IRC account? Almost all Nouveau matters are discussed there.
See also http://nouveau.freedesktop.org/wiki/ about Nouveau. If you are not familiar with IRC, please let me know.

Shinpei

@mharsch
Collaborator

mharsch commented Jul 22, 2012

I've been able to reproduce the issue initially described here with my GTX 550 Ti card. Working backwards, I've determined that the breakage occurred between v3.3 and v3.4 (v3.3 works, v3.4 behaves as described above).

@shinpei0208
Owner

Actually, I'm quite surprised that v3.3 even works, as the latest version that I confirmed working with Gdev was v3.0. v3.4 must have introduced some features that changed how Nouveau behaves. I'll look into the change history and update Gdev to work with it. Thanks!

Shinpei

@shinpei0208
Owner

Fixed.

@yuhc

yuhc commented Feb 13, 2017

Could you post the commit that fixed this issue? I ran into the same problem when running an old version of gdev (605e69e70ce7b4c505be91696612e98649ec383f). For some reason, I can't move to the latest version.

More specifically, gmemcpy_to_device() hangs in cuda/driver_api/modules.c . gmemcpy_to_device() calls gdev_poll() to wait for the resource, but fence_read() never returns a correct seq number.

Thanks!
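
The hang described in the last comment has the shape of an unbounded wait on a fence value. The sketch below is not gdev's code: `poll_fence`, `fence_read_fn`, and the two demo fence readers are hypothetical names, and the wraparound-safe "has reached" comparison is an assumption about what counts as a correct sequence number. It only illustrates how bounding the number of reads turns a silent hang into an error that can be reported.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback that returns the current fence value. */
typedef uint32_t (*fence_read_fn)(void *ctx);

/*
 * Poll until the fence has reached `seq`.
 * Returns 0 on success, or -1 if `max_polls` reads pass without progress.
 */
static int poll_fence(fence_read_fn read_fence, void *ctx,
                      uint32_t seq, unsigned long max_polls)
{
    unsigned long i;

    for (i = 0; i < max_polls; i++) {
        /* Signed difference keeps the test valid across counter wraparound. */
        if ((int32_t)(read_fence(ctx) - seq) >= 0)
            return 0;
    }
    return -1;
}

/* Demo reader: a fence that never advances, as in the reported hang. */
static uint32_t stuck_fence(void *ctx)
{
    (void)ctx;
    return 0;
}

/* Demo reader: a fence that advances by one on every read. */
static uint32_t counting_fence(void *ctx)
{
    uint32_t *counter = ctx;
    return ++(*counter);
}
```

With `stuck_fence`, `poll_fence` gives up after `max_polls` reads instead of spinning forever; with `counting_fence`, it returns as soon as the counter reaches the requested value. A real driver would more likely bound the wait by elapsed time than by a read count.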
