Skip to content

HTTPS clone URL

Subversion checkout URL

You can clone with HTTPS or Subversion.

Download ZIP

Loading…

User-level madd test program hungs after several repeated executions #8

Closed
wkbjerry opened this Issue · 10 comments

3 participants

@wkbjerry
Collaborator

My machine has GTX480 GPU (1536MB), and linux 3.3.0 installed. Both gdev/mod/nouveau and gdev/mod/pscnv compile successfully. Each of them loads well too.

I tried the user-level test program madd by repeated executing it several times, and it executes successfully initially. Then when I execute it for around the 20+th time, the program hungs. Here's dmesg output when pscnv was used:

[drm] Initialized drm 1.1.0 20060810
[drm] pscnv 0000:01:00.0: Detected an NVc0 generation card (0x0c0000a3)
[drm] pscnv 0000:01:00.0: Initializing card...
[drm] pscnv 0000:01:00.0: Attempting to load BIOS image from PRAMIN
[drm] pscnv 0000:01:00.0: ... appears to be valid
[drm] pscnv 0000:01:00.0: BIT BIOS found
[drm] pscnv 0000:01:00.0: Bios version 70.00.35.00
[drm] pscnv 0000:01:00.0: TMDS table version 2.0
[drm] pscnv 0000:01:00.0: Found Display Configuration Block version 4.0
[drm] pscnv 0000:01:00.0: Raw DCB entry 0: 02000300 00000000
[drm] pscnv 0000:01:00.0: Raw DCB entry 1: 01000302 00020030
[drm] pscnv 0000:01:00.0: Raw DCB entry 2: 04011380 00000000
[drm] pscnv 0000:01:00.0: Raw DCB entry 3: 08011382 00020030
[drm] pscnv 0000:01:00.0: Raw DCB entry 4: 02022362 00020010
[drm] pscnv 0000:01:00.0: DCB connector table: VHER 0x40 5 16 4
[drm] pscnv 0000:01:00.0: 0: 0x00001030: type 0x30 idx 0 tag 0x07
[drm] pscnv 0000:01:00.0: 1: 0x00010130: type 0x30 idx 1 tag 0x51
[drm] pscnv 0000:01:00.0: 2: 0x00002261: type 0x61 idx 2 tag 0x08
[drm] pscnv 0000:01:00.0: Parsing VBIOS init table 0 at offset 0x6ABA
[drm] pscnv 0000:01:00.0: Parsing VBIOS init table 1 at offset 0x7151
[drm] pscnv 0000:01:00.0: Parsing VBIOS init table 2 at offset 0x83C0
[drm] pscnv 0000:01:00.0: Parsing VBIOS init table 3 at offset 0x83CA
[drm] pscnv 0000:01:00.0: Parsing VBIOS init table 4 at offset 0x85D3
[drm] pscnv 0000:01:00.0: Parsing VBIOS init table at offset 0x8638
[drm] pscnv 0000:01:00.0: 0x8638: Condition still not met after 20ms, skipping following opcodes
[drm] pscnv 0000:01:00.0: VRAM: size 0x60000000, 6 controllers
adt7475 1-002e: ADT7473 device, revision 1
[drm] pscnv 0000:01:00.0: Detected monitoring device: adt7473
[drm] pscnv 0000:01:00.0: 4 available performance level(s)
[drm] pscnv 0000:01:00.0: 0: core 50MHz shader 101MHz memory 135MHz voltage 962mV
[drm] pscnv 0000:01:00.0: 1: core 405MHz shader 810MHz memory 324MHz voltage 962mV
[drm] pscnv 0000:01:00.0: 2: core 405MHz shader 810MHz memory 1848MHz voltage 987mV-1125mV
[drm] pscnv 0000:01:00.0: 3: core 700MHz shader 1401MHz memory 1848MHz voltage 987mV-1125mV
[drm] pscnv 0000:01:00.0: c: core 50MHz shader 101MHz memory 135MHz
[drm] pscnv 0000:01:00.0: VM: Allocating vspace -3
[drm] pscnv 0000:01:00.0: CHAN: Allocating channel -3
[drm] pscnv 0000:01:00.0: VM: Allocating vspace -1
[drm] pscnv 0000:01:00.0: CHAN: Allocating channel -1
[drm] pscnv 0000:01:00.0: PGRAPH: Initializing...
[drm] pscnv 0000:01:00.0: PCOPY0: Initializing...
[drm] pscnv 0000:01:00.0: PCOPY1: Initializing...
[drm] Supports vblank timestamp caching Rev 1 (10.10.2010).
[drm] No driver support for vblank timestamp query.
[drm] pscnv 0000:01:00.0: allocated 1600x900 fb: 0x200000 0x80000, bo ffff8802246072c0
fbcon: nouveaufb (fb0) is primary device
Console: switching to colour frame buffer device 200x56
fb0: nouveaufb frame buffer device
drm: registered panic notifier
[drm] pscnv 0000:01:00.0: Card initialized.
[drm] Initialized pscnv 0.0.0 20120123 for 0000:01:00.0 on minor 0
[gdev] Loading module...
[gdev] Found 1 physical device(s).
[gdev] Configured 4 virtual device(s).
[gdev] Gdev#0 memory scheduler running
[gdev] Gdev#0 compute reserve running
[gdev] Gdev#0 memory reserve running
[gdev] Gdev#1 compute scheduler running
[gdev] Gdev#1 memory scheduler running
[gdev] Gdev#1 compute reserve running
[gdev] Gdev#1 memory reserve running
[gdev] Gdev#2 compute scheduler running
[gdev] Gdev#2 memory scheduler running
[gdev] Gdev#2 compute reserve running
[gdev] Gdev#2 memory reserve running
[gdev] Gdev#3 compute scheduler running
[gdev] Gdev#3 memory scheduler running
[gdev] Gdev#3 compute reserve running
[gdev] Gdev#3 memory reserve running
[gdev] Gdev#0 compute scheduler running
[drm] pscnv 0000:01:00.0: VM: Allocating vspace 1
[drm] pscnv 0000:01:00.0: CHAN: Allocating channel 1
[drm] pscnv 0000:01:00.0: PGRAPH: generating default grctx
[gdev] Opened gdev0
[drm] pscnv 0000:01:00.0: CHAN: Freeing channel 1
[drm] pscnv 0000:01:00.0: VM: Freeing vspace 1
[gdev] Closed gdev0
[drm] pscnv 0000:01:00.0: VM: Allocating vspace 1
[drm] pscnv 0000:01:00.0: CHAN: Allocating channel 1
[gdev] Opened gdev0
[drm] pscnv 0000:01:00.0: CHAN: Freeing channel 1
[drm] pscnv 0000:01:00.0: VM: Freeing vspace 1
[gdev] Closed gdev0
[drm] pscnv 0000:01:00.0: VM: Allocating vspace 1
[drm] pscnv 0000:01:00.0: CHAN: Allocating channel 1
[gdev] Opened gdev0
[drm] pscnv 0000:01:00.0: CHAN: Freeing channel 1
[drm] pscnv 0000:01:00.0: VM: Freeing vspace 1
[gdev] Closed gdev0
...... ...... ......
[drm] pscnv 0000:01:00.0: VM: Allocating vspace 1
[drm] pscnv 0000:01:00.0: CHAN: Allocating channel 1
[gdev] Opened gdev0
[drm] pscnv 0000:01:00.0: CHAN: Freeing channel 1
[drm] pscnv 0000:01:00.0: VM: Freeing vspace 1
[gdev] Closed gdev0
[drm] pscnv 0000:01:00.0: VM: Allocating vspace 1
[drm] pscnv 0000:01:00.0: CHAN: Allocating channel 1
[gdev] Opened gdev0
[drm] pscnv 0000:01:00.0: CHAN: Freeing channel 1
[drm] pscnv 0000:01:00.0: VM: Freeing vspace 1
[gdev] Closed gdev0
[drm] pscnv 0000:01:00.0: VM: Allocating vspace 1
[drm] pscnv 0000:01:00.0: CHAN: Allocating channel 1
[gdev] Opened gdev0
[drm] pscnv 0000:01:00.0: PGRAPH PAGE FAULT at 0x0000001000 (r, PTE not present)

The screen froze when nouveau was used, and I was not able to get the dmesg output. I also tried the nouveau patch with the nouveau source code bundled with linux 3.3.0 kernel. The same problem happened.

I remember I also met similar problem when I used GTX580, linux 3.3.8, and nouveau earlier this week. In that case, madd hang when it executed for the 20th time, but the screen did not freeze. I was able to switch to another console and get the dmesg output as follows:

[drm] Initialized drm 1.1.0 20060810
VGA switcheroo: detected Optimus DSM method \ handle
[drm] nouveau 0000:01:00.0: Detected an NVc0 generation card (0x0c8000a1)
checking generic (d7000000 e00000) vs hw (d8000000 8000000)
checking generic (d7000000 e00000) vs hw (d6000000 2000000)
fb: conflicting fb hw usage nouveaufb vs VESA VGA - removing generic driver
Console: switching to colour dummy device 80x25
[drm] nouveau 0000:01:00.0: Attempting to load BIOS image from PRAMIN
[drm] nouveau 0000:01:00.0: ... appears to be valid
[drm] nouveau 0000:01:00.0: BIT BIOS found
[drm] nouveau 0000:01:00.0: Bios version 70.10.48.00
[drm] nouveau 0000:01:00.0: TMDS table version 2.0
[drm] nouveau 0000:01:00.0: MXM: no VBIOS data, nothing to do
[drm] nouveau 0000:01:00.0: DCB version 4.0
[drm] nouveau 0000:01:00.0: DCB outp 00: 02000300 00000000
[drm] nouveau 0000:01:00.0: DCB outp 01: 01000302 00020030
[drm] nouveau 0000:01:00.0: DCB outp 02: 04011380 00000000
[drm] nouveau 0000:01:00.0: DCB outp 03: 08011382 00020030
[drm] nouveau 0000:01:00.0: DCB outp 04: 02022362 00020010
[drm] nouveau 0000:01:00.0: DCB conn 00: 00001030
[drm] nouveau 0000:01:00.0: DCB conn 01: 00010130
[drm] nouveau 0000:01:00.0: DCB conn 02: 00002261
[drm] nouveau 0000:01:00.0: Parsing VBIOS init table 0 at offset 0x6BD4
[drm] nouveau 0000:01:00.0: Parsing VBIOS init table 1 at offset 0x7296
[drm] nouveau 0000:01:00.0: Parsing VBIOS init table 2 at offset 0x878C
[drm] nouveau 0000:01:00.0: Parsing VBIOS init table 3 at offset 0x8796
[drm] nouveau 0000:01:00.0: Parsing VBIOS init table 4 at offset 0x8A0A
[drm] nouveau 0000:01:00.0: Parsing VBIOS init table at offset 0x8A6F
[drm] nouveau 0000:01:00.0: 0x8A6F: Condition still not met after 20ms, skipping following opcodes
[TTM] Zone kernel: Available graphics memory: 4027584 kiB.
[TTM] Zone dma32: Available graphics memory: 2097152 kiB.
[TTM] Initializing pool allocator.
[TTM] Initializing DMA pool allocator.
[drm] nouveau 0000:01:00.0: Detected 1536MiB VRAM (GDDR5)
[drm] nouveau 0000:01:00.0: 512 MiB GART (aperture)
[drm] Supports vblank timestamp caching Rev 1 (10.10.2010).
[drm] No driver support for vblank timestamp query.
[drm] nouveau 0000:01:00.0: 0 available performance level(s)
[drm] nouveau 0000:01:00.0: c: core 50MHz shader 101MHz memory 135MHz voltage 963mV fanspeed 41%
[drm] nouveau 0000:01:00.0: allocated 1600x900 fb: 0x1e0000, bo ffff880232311400
fbcon: nouveaufb (fb0) is primary device
Console: switching to colour frame buffer device 200x56
fb0: nouveaufb frame buffer device
drm: registered panic notifier
[drm] Initialized nouveau 0.0.16 20090420 for 0000:01:00.0 on minor 0
[gdev] Loading module...
[gdev] Found 1 physical device(s).
[gdev] Configured 4 virtual device(s).
[gdev] Opened gdev0
[gdev] Closed gdev0
[gdev] Opened gdev0
[gdev] Closed gdev0
...... ...... ......
[gdev] Opened gdev0
[gdev] Closed gdev0
[gdev] Opened gdev0
[gdev] Closed gdev0
[drm] nouveau 0000:01:00.0: PFIFO: write fault at 0x00000b9000 [PAGE_NOT_PRESENT] from BAR3/BAR_WRITE on channel 0x0000058000
[drm] nouveau 0000:01:00.0: PFIFO: unknown status 0x40000000
[drm] nouveau 0000:01:00.0: PFIFO: read fault at 0x0000000000 [PT_NOT_PRESENT] from PGRAPH/CTXCTL on channel 0x00215bf000
[gdev] Opened gdev0
[drm] nouveau 0000:01:00.0: GPU lockup - switching to software fbcon

The problem with 3.3.8 and GTX580 seemed to be about PFIFO, while the problem with 3.3.0 and GTX480 seemed to be about PGRAPH. Any ideas?

@shinpei0208
Owner
@GoelDeepak

I am not sure if the issue mentioned here is same as what I am observing. I am running linux 3.5.0 using nouveau for GTX480. I have applied the gdev-nouveau-3.5.patch to my kernel. Gdev module got loaded successfully. Then, I tried memcpy test given under test directory. The test passed but I observed the following messages in dmesg:

......
......
[gdev] Loading module...
[gdev] Found 1 physical device(s).
[gdev] Configured 4 virtual device(s).
[gdev] Gdev#0 compute scheduler running
[gdev] Gdev#0 memory scheduler running
[gdev] Gdev#0 compute reserve running
[gdev] Gdev#0 memory reserve running
[gdev] Gdev#1 compute scheduler running
[gdev] Gdev#1 memory scheduler running
[gdev] Gdev#1 compute reserve runnining
[gdev] Gdev#1 memory reserve running
[gdev] Gdev#2 compute scheduler running
[gdev] Gdev#2 memory scheduler running
[gdev] Gdev#2 compute reserve running
[gdev] Gdev#2 memory reserve running
[gdev] Gdev#3 compute scheduler running
[gdev] Gdev#3 memory scheduler running
[gdev] Gdev#3 compute reserve running
[gdev] Gdev#3 memory reserve running
INFO: task gschedc0:1607 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
gschedc0 D 0000000000000000 0 1607 2 0x00000000
ffff88041c355ea0 0000000000000046 ffff8804049a5b40 00000000000136c0
ffff88041c355fd8 ffff88041c354010 00000000000136c0 00000000000136c0
ffff88041c355fd8 00000000000136c0 ffff8804055516d0 ffff8804049a5b40
Call Trace:
[ffffffffa0275890] ? __gdev_sched_mem_thread+0x90/0x90 [gdev]
[ffffffff816248b9] schedule+0x29/0x70
[ffffffffa02758ee] __gdev_sched_com_thread+0x5e/0x90 [gdev]
[ffffffffa0275890] ? __gdev_sched_mem_thread+0x90/0x90 [gdev]
[ffffffff81075d0e] kthread+0x9e/0xb0
[ffffffff8162ebe4] kernel_thread_helper+0x4/0x10
[ffffffff81075c70] ? kthread_freezable_should_stop+0x70/0x70
[ffffffff8162ebe0] ? gs_change+0x13/0x13
[drm] nouveau 0000:01:00.0: PFIFO: read fault at 0x0000000000 [PT_NOT_PRESENT] from PGRAPH/CTXCTL on channel 0x0000132000
[drm] nouveau 0000:01:00.0: PFIFO: unknown status 0x40000000
[gdev] Opened gdev0
[drm] nouveau 0000:01:00.0: GPU lockup - switching to software fbcon
.......
.......

Also, only the memcpy tests seem to be passing for me and all the others in the repo, namely madd, memcpy_2step etc., hung with the above crash. Any ideas on what could be going wrong?

Thanks for your help

@shinpei0208
Owner

The patch for 3.5.0 is provided by mharsch, but I have not yet looked into its behavior. Using 3.3.x is safer. In fact, Nouveau changes the kernel interface very frequently. Some functions may not work as expected by Gdev. Do you have any reason that you have to stick to 3.5.0?
If the same problem happens with 3.3.x, then it's a real issue though...

@GoelDeepak

Thanks for your response. I went back to 3.3.x and the behavior seems much better now. I could run many example test cases, both in user and kernel modes. Though the tests are passing, I still see some messages in dmesg:

INFO: task gschedm0:1551 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
gschedm0 D ffffffff8180c4a0 0 1551 2 0x00000000

ffff8804060b5ea0 0000000000000046 ffff8804189cc470 00000000000134c0
ffff8804060b5fd8 ffff8804060b4010 00000000000134c0 00000000000134c0
ffff8804060b5fd8 00000000000134c0 ffff88041d554470 ffff8804189cc470
Call Trace:
[ffffffffa0220800] ? gdev_sched_create_scheduler+0x1a0/0x1a0 [gdev]
[ffffffff8160a33f] schedule+0x3f/0x60
[ffffffffa022085e] __gdev_sched_mem_thread+0x5e/0x90 [gdev]
[ffffffffa0220800] ? gdev_sched_create_scheduler+0x1a0/0x1a0 [gdev]
[ffffffff810711fe] kthread+0x9e/0xb0
[ffffffff81614664] kernel_thread_helper+0x4/0x10
[ffffffff81071160] ? kthread_freezable_should_stop+0x70/0x70
[ffffffff81614660] ? gs_change+0x13/0x13
[gdev] Opened gdev0
[gdev] Closed gdev0
[HtoD: 0
DtoH: 0
Test passed

Below are the steps I followed to setup gdev :

  1. Compile and run upstream linux 3.3.x
  2. Compiled nouveau-3.3.0 under gdev and followed the steps mentioned in github
  3. On running kernel/memcpy test case, I got the above messages in dmesg.

Any thoughts on why the process might be getting hung ?

@wkbjerry
Collaborator

It's due to the scheduler kthreads of virtual gdev devices 1-3 idle for longer than 120s. This is normal since virtual gdev devices 1-3 never get used in current implementation. Just ignore it.

Alternatively, you can define GDEV_SCHED_DISABLED during compiling the gdev kernel module. In this way, scheduling of memory copy and kernel launch will be disabled, and scheduler kthreads won't be created. You won't see the dmesg again.

@shinpei0208
Owner

Hmm, perhaps we should set GDEV_SCHED_DISABLED by default... I don't have a git client atm.

wkbjerry, do you think you can change the default setting?

@wkbjerry
Collaborator
@wkbjerry
Collaborator

The default setting has been changed. Now compute and memory scheduling are disabled by default when compiling gdev.ko. You can remove the -D GDEV_SCHED_DISABLED flag in mod/gdev/Makefile to re-enable scheduling.

@GoelDeepak

I don't see those messages in dmesg any more. Thanks for your help. I could now run all the tests under gdev except two loop_repeat and memcpy_async.

While running memcpy_async I got the following crash:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
IP: [] nouveau_bo_vma_find+0x28/0x50 [nouveau]
PGD 408b21067 PUD 408a21067 PMD 0
Oops: 0000 [#1] SMP
CPU 0
Modules linked in: kcuda(O) gdev(O) adt7475 hwmon_vid nouveau(O) bnep rfcomm bluetooth kvm_amd kvm parport_pc ppdev snd_hda_codec_hdmi binfmt_misc snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_pcm snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq snd_timer snd_seq_device ttm(O) drm_kms_helper(O) drm(O) snd mxm_wmi soundcore video snd_page_alloc wmi psmouse i2c_piix4 k10temp serio_raw dm_multipath lp parport sdhci_pci sdhci pata_atiixp r8169 ahci libahci xhci_hcd [last unloaded: nouveau]

Pid: 1614, comm: user_test Tainted: G O 3.3.0 #9 MSI MS-7695/A75A-G55 (MS-7695)
RIP: 0010:[] [] nouveau_bo_vma_find+0x28/0x50 [nouveau]
RSP: 0018:ffff88041c68bda8 EFLAGS: 00010207
RAX: 0000000000000000 RBX: ffff8804059ef3c0 RCX: ffff8804059ef250
RDX: ffff8804059ef250 RSI: ffff880405f42800 RDI: ffff8804059ef5d0
RBP: ffff88041c68bda8 R08: 0000000000016290 R09: ffff88041f002700
R10: ffffffff81148a7d R11: 0000000000000000 R12: 0000000008ca5000
R13: ffff880408b57800 R14: 0000000000001000 R15: 0000000000001000
FS: 00007f4640177720(0000) GS:ffff88042fc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000018 CR3: 000000041c0d3000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process user_test (pid: 1614, threadinfo ffff88041c68a000, task ffff880409d3c470)
Stack:
ffff88041c68bde8 ffffffffa0421e20 ffff88041c68bdd8 ffffffff8160b09f
ffff88041c68be10 ffff8804059ef240 ffff880408b53800 ffff880408b53970
ffff88041c68be38 ffffffffa0168589 ffff8804059ef3c0 0000000008ca5000
Call Trace:
[ffffffffa0421e20] gdev_drv_bo_free+0x50/0x80 [nouveau]
[ffffffff8160b09f] ? _raw_spin_lock_irqsave+0x2f/0x40
[ffffffffa0168589] gdev_raw_mem_free+0x49/0x70 [gdev]
[ffffffffa016fbe3] gdev_mem_free+0xc3/0xd0 [gdev]
[ffffffffa016fc69] gdev_mem_gc+0x79/0xc0 [gdev]
[ffffffffa016c6cb] gclose+0x5b/0xa0 [gdev]
[ffffffffa0168d0a] gdev_release+0x1a/0x40 [gdev]
[ffffffff8117023a] fput+0xea/0x270
[ffffffff8116c313] filp_close+0x63/0x90
[ffffffff8116c3f5] sys_close+0xb5/0x120
[ffffffff816132a9] system_call_fastpath+0x16/0x1b
Code: 0f 1f 00 55 48 89 e5 66 66 66 66 90 48 8b 87 10 02 00 00 48 81 c7 10 02 00 00 48 39 f8 75 0c eb 1a 66 90 48 8b 00 48 39 f8 74 10 <48> 39 70 18 0f 1f 40 00 75 ee c9 c3 0f 1f 40 00 31 c0 c9 0f 1f
RIP [ffffffffa0390798] nouveau_bo_vma_find+0x28/0x50 [nouveau]
RSP
CR2: 0000000000000018
---[ end trace d523fd33992227ab ]---

I didn't get a chance to look at the crash more closely, but thought of sharing it with you to check if you have seen it before or are aware of what could be causing it.

@shinpei0208
Owner

This is a bug. Thanks for the report. Gdev uses Nouveau's function (nouveau_bo_vma_find) in a way that Nouveau does not expect... I have seen this problem before, though it was not in the memcpy_async test. This is one of the big issues that we have to look into.

@wkbjerry wkbjerry closed this
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Something went wrong with that request. Please try again.