Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kernel oops when running hip kernel with dev branch ROCR/ROCK #15

Closed
mattmacy opened this issue Mar 24, 2016 · 26 comments
Closed

kernel oops when running hip kernel with dev branch ROCR/ROCK #15

mattmacy opened this issue Mar 24, 2016 · 26 comments

Comments

@mattmacy
Copy link

I was able to do the tutorial on gpuopen.com but found that hipGetDeviceCount was only returning 1 so the examples would only run on my primary GPU a GTX 980Ti. I also have an R9 Nano and an R9 Fury. The kfd driver exports 3 nodes under topology so the runtime should let me talk to them. I'm running Ubuntu 15. I was hoping to instrument hip_hcc.cpp to see what it was doing right here:

/*
  * Build a table of valid compute devices.
  */
 auto accs = hc::accelerator::get_all();
 int deviceCnt = 0;
 for (int i=0; i<accs.size(); i++) {
     if (! accs[i].get_is_emulated()) {
         deviceCnt++;
     }
 };
 -
 +    printf("actual device count is %d\n", deviceCnt);
 // Make sure the hip visible devices are within the deviceCnt range
 for (int i = 0; i < g_hip_visible_devices.size(); i++) {
     if(g_hip_visible_devices[i] >= deviceCnt){
         // Make sure any DeviceID after invalid DeviceID will be erased.
         g_hip_visible_devices.resize(i);
         break;
     }
 }

But I can't even get it to compile:
~/devel/HIP2$ make
./bin/hipcc -I/opt/hcc/include -std=c++11 -I/opt/hsa/include src/hip_hcc.cpp -c -O3 -o src/hip_hcc.o
src/hip_hcc.cpp:52:2: error: #error (USE_AM_TRACKER requries HCC version of 16074 or newer)
#error (USE_AM_TRACKER requries HCC version of 16074 or newer)
^
Died at ./bin/hipcc line 208.
Makefile:20: recipe for target 'src/hip_hcc.o' failed
make: *** [src/hip_hcc.o] Error 1

I made the following change to the Makefile in response to complaints. But it's still not doing anything. And it looks like it's trying to compile the code with nvcc:
mmacy@pandemonium:~/devel/HIP2$ hipcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2015 NVIDIA Corporation
Built on Tue_Aug_11_14:27:32_CDT_2015
Cuda compilation tools, release 7.5, V7.5.17

@bensander
Copy link
Contributor

Hi Matt, please try setting env var HIP_PLATFORM to "hcc" so hip will recognize the nanos.

On Mar 23, 2016, at 11:36 PM, Matthew Macy <notifications@github.commailto:notifications@github.com> wrote:

I was able to do the tutorial on gpuopen.comhttp://gpuopen.com but found that hipGetDeviceCount was only returning 1 so the examples would only run on my primary GPU a GTX 980Ti. I also have an R9 Nano and an R9 Fury. The kfd driver exports 3 nodes under topology so the runtime should let me talk to them. I'm running Ubuntu 15. I was hoping to instrument hip_hcc.cpp to see what it was doing right here:

/*

  • Build a table of valid compute devices.
    */
    auto accs = hc::accelerator::get_all();
    int deviceCnt = 0;
    for (int i=0; i<accs.size(); i++) {
    if (! accs[i].get_is_emulated()) {
    deviceCnt++;
    }
    };
  • printf("actual device count is %d\n", deviceCnt); // Make sure the hip visible devices are within the deviceCnt range for (int i = 0; i < g_hip_visible_devices.size(); i++) { if(g_hip_visible_devices[i] >= deviceCnt){ // Make sure any DeviceID after invalid DeviceID will be erased. g_hip_visible_devices.resize(i); break; } }

But I can't even get it to compile:
~/devel/HIP2$ make
./bin/hipcc -I/opt/hcc/include -std=c++11 -I/opt/hsa/include src/hip_hcc.cpp -c -O3 -o src/hip_hcc.o
src/hip_hcc.cpp:52:2: error: #error (USE_AM_TRACKER requries HCC version of 16074 or newer)
#error (USE_AM_TRACKER requries HCC version of 16074 or newer)
^
Died at ./bin/hipcc line 208.
Makefile:20: recipe for target 'src/hip_hcc.o' failed
make: *** [src/hip_hcc.o] Error 1

I made the following change to the Makefile in response to complaints. But it's still not doing anything. And it looks like it's trying to compile the code with nvcc:
mmacy@pandemonium:~/devel/HIP2$ hipcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2015 NVIDIA Corporation
Built on Tue_Aug_11_14:27:32_CDT_2015
Cuda compilation tools, release 7.5, V7.5.17

You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHubhttps://github.com//issues/15

@mattmacy
Copy link
Author

I see - that tells it which compiler to use.

hipcc square.cpp
In file included from /home/mmacy/devel/HIP/src/hip_hcc.cpp:42:
In file included from /home/mmacy/devel/HIP/include/hip_runtime.h:54:
In file included from /home/mmacy/devel/HIP/include/hcc_detail/hip_runtime.h:41:
In file included from /home/mmacy/devel/HIP/include/hip_runtime_api.h:196:
/home/mmacy/devel/HIP/include/hcc_detail/hip_runtime_api.h:35:2: error: ("This version of HIP requires a newer version of HCC.");
#error("This version of HIP requires a newer version of HCC.");
^
/home/mmacy/devel/HIP/src/hip_hcc.cpp:2354:17: warning: unused variable 'stream' [-Wunused-variable]
hipStream_t stream = ihipSyncAndResolveStream(hipStreamNull);
^
1 warning and 1 error generated.
remake-deps failed at /home/mmacy/devel/HIP/bin/hipcc line 179.

That doesn't work so well.

I've installed the most recent .deb from https://bitbucket.org/multicoreware/hcc/downloads.
I take it I need to download the hcc sources as well?

I see. Their latest .deb is 16045. Your sources require 16074 or later.

I'm trying the following to see if I get a working hcc:
https://bitbucket.org/multicoreware/hcc/wiki/Developer%20Information

@mattmacy
Copy link
Author

Progress. I'm running 16124. It looks like you're out of sync with hsa:
mmacy@pandemonium:/devel/hcc/build$ hcc --version
HCC clang version 3.5.0 (based on HCC 0.10.16124-89bbf6f-7e4cd9e LLVM 3.5.0svn)
Target: x86_64-unknown-linux-gnu
Thread model: posix
mmacy@pandemonium:
/devel/hcc/build$ cd ../..
mmacy@pandemonium:/devel$ cd HIP/samples/0_Intro/
bit_extract/ square/
mmacy@pandemonium:
/devel$ cd HIP/samples/0_Intro/square/
mmacy@pandemonium:~/devel/HIP/samples/0_Intro/square$ hipcc square.cpp
/home/mmacy/devel/HIP/src/hip_hcc.cpp:2093:22: error: no matching function for call to 'hsa_amd_memory_async_copy'
hsa_status = hsa_amd_memory_async_copy(dstp, _device->_hsa_agent, locked_srcp, _device->_hsa_agent, theseBytes, waitFor ? 1:0, waitFor, _completion_signal[bufferIndex]);
^~~~~~~~~~~~~~~~~~~~~~~~~
/opt/hsa/include/hsa_ext_amd.h:452:5: note: candidate function not viable: requires 7 arguments, but 8 were provided
hsa_amd_memory_async_copy(void* dst, const void* src, size_t size,
^
/home/mmacy/devel/HIP/src/hip_hcc.cpp:2155:35: error: no matching function for call to 'hsa_amd_memory_async_copy'
hsa_status_t hsa_status = hsa_amd_memory_async_copy(dstp, _device->_hsa_agent, _pinnedStagingBuffer[bufferIndex], _device->_hsa_agent, theseBytes, waitFor ? 1:0, waitFor, _completion_signal[bufferIndex]);
^~~~~~~~~~~~~~~~~~~~~~~~~
/opt/hsa/include/hsa_ext_amd.h:452:5: note: candidate function not viable: requires 7 arguments, but 8 were provided
hsa_amd_memory_async_copy(void* dst, const void* src, size_t size,
^
/home/mmacy/devel/HIP/src/hip_hcc.cpp:2208:39: error: no matching function for call to 'hsa_amd_memory_async_copy'
hsa_status_t hsa_status = hsa_amd_memory_async_copy(_pinnedStagingBuffer[bufferIndex], _device->_hsa_agent, srcp0, _device->_hsa_agent, theseBytes, waitFor ? 1:0, waitFor, _completion_signal[bufferIndex]);
^~~~~~~~~~~~~~~~~~~~~~~~~
/opt/hsa/include/hsa_ext_amd.h:452:5: note: candidate function not viable: requires 7 arguments, but 8 were provided
hsa_amd_memory_async_copy(void* dst, const void* src, size_t size,
^
/home/mmacy/devel/HIP/src/hip_hcc.cpp:2333:35: error: no matching function for call to 'hsa_amd_memory_async_copy'
hsa_status_t hsa_status = hsa_amd_memory_async_copy(dst, device->_hsa_agent, src, device->_hsa_agent, sizeBytes, depSignalCnt, depSignalCnt ? &depSignal:0x0, device->_copy_signal);
^~~~~~~~~~~~~~~~~~~~~~~~~
/opt/hsa/include/hsa_ext_amd.h:452:5: note: candidate function not viable: requires 7 arguments, but 8 were provided
hsa_amd_memory_async_copy(void* dst, const void* src, size_t size,
^
/home/mmacy/devel/HIP/src/hip_hcc.cpp:2463:39: error: no matching function for call to 'hsa_amd_memory_async_copy'
hsa_status_t hsa_status = hsa_amd_memory_async_copy(dst, device->_hsa_agent, src, device->_hsa_agent, sizeBytes, depSignalCnt, depSignalCnt ? &depSignal:0x0, ihip_signal->_hsa_signal);
^~~~~~~~~~~~~~~~~~~~~~~~~
/opt/hsa/include/hsa_ext_amd.h:452:5: note: candidate function not viable: requires 7 arguments, but 8 were provided
hsa_amd_memory_async_copy(void* dst, const void* src, size_t size,
^
5 errors generated.
remake-deps failed at /home/mmacy/devel/HIP/bin/hipcc line 179.

@mattmacy
Copy link
Author

I don't know what the situation is with the ROCR_V2 API. The async memcpy in what I assume is the canonical hsa_ext_amd.h:
https://github.com/RadeonOpenCompute/ROCR-Runtime/blob/master/src/inc/hsa_ext_amd.h
looks more like the old one.

I made the following changes to hip_hcc.cpp to get my square.cpp to compile using hcc as the HIP_PLATFORM:

index 57d55a1..776e7c6 100644
--- a/src/hip_hcc.cpp
+++ b/src/hip_hcc.cpp
@@ -2090,7 +2090,7 @@ void StagingBuffer::CopyHostToDevicePinInPlace(void* dst, const void* src, size_
         hsa_signal_store_relaxed(_completion_signal[bufferIndex], 1);

 #if USE_ROCR_V2
-        hsa_status = hsa_amd_memory_async_copy(dstp, _device->_hsa_agent, locked_srcp, _device->_hsa_agent, theseBytes, waitFor ? 1:0, waitFor, _completion_signal[bufferIndex]);
+        hsa_status = hsa_amd_memory_async_copy(dstp, locked_srcp, theseBytes, _device->_hsa_agent, waitFor ? 1:0, waitFor, _completion_signal[bufferIndex]);
 #else
         assert(0);
 #endif
@@ -2152,7 +2152,7 @@ void StagingBuffer::CopyHostToDevice(void* dst, const void* src, size_t sizeByte
         hsa_signal_store_relaxed(_completion_signal[bufferIndex], 1);

 #if USE_ROCR_V2
-        hsa_status_t hsa_status = hsa_amd_memory_async_copy(dstp, _device->_hsa_agent, _pinnedStagingBuffer[bufferIndex], _device->_hsa_agent, theseBytes, waitFor ? 1:0, waitFor, _completion_signal[bufferIndex]);
+        hsa_status_t hsa_status = hsa_amd_memory_async_copy(dstp,  _pinnedStagingBuffer[bufferIndex], theseBytes, _device->_hsa_agent, waitFor ? 1:0, waitFor, _completion_signal[bufferIndex]);
 #else
         hsa_status_t hsa_status = hsa_amd_memory_async_copy(dstp, _pinnedStagingBuffer[bufferIndex], theseBytes, _device->_hsa_agent, 0, NULL, _completion_signal[bufferIndex]);
 #endif
@@ -2205,7 +2205,7 @@ void StagingBuffer::CopyDeviceToHost(void* dst, const void* src, size_t sizeByte
             tprintf (TRACE_COPY2, "D2H: bytesRemaining0=%zu  async_copy %zu bytes src:%p to staging:%p\n", bytesRemaining0, theseBytes, srcp0, _pinnedStagingBuffer[bufferIndex]);
             hsa_signal_store_relaxed(_completion_signal[bufferIndex], 1);
 #if USE_ROCR_V2
-            hsa_status_t hsa_status = hsa_amd_memory_async_copy(_pinnedStagingBuffer[bufferIndex], _device->_hsa_agent, srcp0, _device->_hsa_agent, theseBytes, waitFor ? 1:0, waitFor, _completion_signal[bufferIndex]);
+            hsa_status_t hsa_status = hsa_amd_memory_async_copy(_pinnedStagingBuffer[bufferIndex], srcp0, theseBytes, _device->_hsa_agent, waitFor ? 1:0, waitFor, _completion_signal[bufferIndex]);
 #else
             hsa_status_t hsa_status = hsa_amd_memory_async_copy(_pinnedStagingBuffer[bufferIndex], srcp0, theseBytes, _device->_hsa_agent, 0, NULL, _completion_signal[bufferIndex]);
 #endif
@@ -2330,7 +2330,7 @@ void ihipSyncCopy(ihipStream_t *stream, void* dst, const void* src, size_t sizeB


 #if USE_ROCR_V2
-        hsa_status_t hsa_status = hsa_amd_memory_async_copy(dst, device->_hsa_agent, src, device->_hsa_agent, sizeBytes, depSignalCnt, depSignalCnt ? &depSignal:0x0, device->_copy_signal);
+        hsa_status_t hsa_status = hsa_amd_memory_async_copy(dst, src, sizeBytes, device->_hsa_agent, depSignalCnt, depSignalCnt ? &depSignal:0x0, device->_copy_signal);
 #else
         hsa_status_t hsa_status = hsa_amd_memory_async_copy(dst, src, sizeBytes, device->_hsa_agent, 0, NULL, device->_copy_signal);
 #endif
@@ -2460,7 +2460,7 @@ hipError_t hipMemcpyAsync(void* dst, const void* src, size_t sizeBytes, hipMemcp

             tprintf (TRACE_SYNC, " copy-async, waitFor=%lu completion=#%lu(%lu)\n", depSignalCnt? depSignal.handle:0x0, ihip_signal->_sig_id, ihip_signal->_hsa_signal.handle);

-            hsa_status_t hsa_status = hsa_amd_memory_async_copy(dst, device->_hsa_agent, src, device->_hsa_agent, sizeBytes, depSignalCnt, depSignalCnt ? &depSignal:0x0, ihip_signal->_hsa_signal);
+            hsa_status_t hsa_status = hsa_amd_memory_async_copy(dst, src, sizeBytes, device->_hsa_agent, depSignalCnt, depSignalCnt ? &depSignal:0x0, ihip_signal->_hsa_signal);
 #else
             hsa_status_t hsa_status = hsa_amd_memory_async_copy(dst, src, sizeBytes, device->_hsa_agent, 0, NULL, ihip_signal->_hsa_signal);


@whchung
Copy link
Contributor

whchung commented Mar 24, 2016

Hi Matthew, can you try switch to "dev" branch on both ROCK-Kernel-Driver and ROCR-Runtime? You shall be able to find newer async_copy API which works with HIP over there.

@mattmacy
Copy link
Author

What do I do to just re-build the driver?
Thanks.

@mattmacy
Copy link
Author

And for that matter - how do I rebuild the runtime. There's no makefile in the root.

@whchung
Copy link
Contributor

whchung commented Mar 24, 2016

Hi Matthew, you don't need to build them. On "dev" branch of ROCK-Kernel-Driver you can find a "package" directory which has ubuntu & fedora packages inside. And you can also find pre-built packages under "package" directory in ROCR-Runtime. Please do remember to switch to "dev" branch on both repositories though.

@mattmacy
Copy link
Author

OK. Great. Thanks. I'll do that in the morning and let you know how that goes. In the meantime the patched version works for me.

I do notice that AMD kernels are much slower than Nvidia kernels:

mmacy@pandemonium:~/devel/HIP/samples/0_Intro/square$ time !!
time ./a.out
deviceCount: 2
info: running on device Fiji
info: allocate host mem ( 7.63 MB)
info: allocate device mem ( 7.63 MB)
info: copy Host2Device
info: launch 'vector_square' kernel
info: copy Device2Host
info: check result
PASSED!

real 0m1.203s
user 0m0.088s
sys 0m0.184s

mmacy@pandemonium:~/devel/HIP.old/samples/0_Intro/square$ time ./square.hip.out
deviceCount: 1
info: running on device GeForce GTX 980 Ti
info: allocate host mem ( 7.63 MB)
info: allocate device mem ( 7.63 MB)
info: copy Host2Device
info: launch 'vector_square' kernel
info: copy Device2Host
info: check result
PASSED!

real 0m0.273s
user 0m0.028s
sys 0m0.244s

Is that fundamental? Or does your job dispatch interface just need refinement?

Thanks.

@whchung
Copy link
Contributor

whchung commented Mar 24, 2016

Hi Matthew, there are many ongoing works to optimize all aspects of the stack. Please stay tuned. :)

@mattmacy
Copy link
Author

OK. I updated both the kernel and the runtime to the 316 build. When running the square.cpp example with HIP_PLATFORM=hcc (nvcc still works fine) I now get a kernel oops:

Mar 24 11:28:34 pandemonium kernel: [ 639.895604] nvidia_uvm: Loaded the UVM driver, major device number 245
Mar 24 11:29:06 pandemonium kernel: [ 671.693636] amdgpu: vram aperture is out of 40bit address base: 0x383fc0000000 limit 0x383fd0000000
Mar 24 11:29:06 pandemonium kernel: [ 671.693749] amdgpu: vram aperture is out of 40bit address base: 0x383fe0000000 limit 0x383ff0000000
Mar 24 11:29:06 pandemonium kernel: [ 671.696239] amdgpu: vram aperture is out of 40bit address base: 0x383fc0000000 limit 0x383fd0000000
Mar 24 11:29:06 pandemonium kernel: [ 671.734321] amdgpu: vram aperture is out of 40bit address base: 0x383fe0000000 limit 0x383ff0000000
Mar 24 11:29:06 pandemonium kernel: [ 671.776858] BUG: unable to handle kernel paging request at ffffc90019ecd000
Mar 24 11:29:06 pandemonium kernel: [ 671.776863] IP: [] set_trap_handler+0x1a/0x30 [amdkfd]
Mar 24 11:29:06 pandemonium kernel: [ 671.776879] PGD ffec8f067 PUD ffeca0067 PMD fcf5f2067 PTE 0
Mar 24 11:29:06 pandemonium kernel: [ 671.776883] Oops: 0002 [#1] SMP
Mar 24 11:29:06 pandemonium kernel: [ 671.776886] Modules linked in: nvidia_uvm(POE) vmw_vsock_vmci_transport vsock vmw_vmci rfcomm bnep binfmt_misc hid_logitech_hidpp btusb btbcm btintel bluetooth b43 mac80211 nls_iso8859_1 cfg80211 ssb intel_rapl iosf_mbi x86_pkg_temp_thermal eeepc_wmi intel_powerclamp coretemp asus_wmi sparse_keymap video mxm_wmi kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel nvidia(POE) aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd serio_raw sb_edac edac_core snd_hda_codec_realtek snd_usb_audio snd_hda_codec_generic snd_usbmidi_lib snd_seq_midi hid_logitech_dj snd_seq_midi_event snd_hda_codec_hdmi snd_rawmidi snd_hda_intel snd_hda_controller snd_hda_codec snd_seq snd_hda_core snd_hwdep snd_seq_device bcma snd_pcm snd_timer snd mei_me lpc_ich soundcore mei shpchp wmi tpm_infineon mac_hid parport_pc ppdev lp parport autofs4 hid_generic usbhid hid amdkfd amd_iommu_v2 amdgpu psmouse amd_gnb_bus i2c_algo_bit e1000e ttm drm_kms_helper ahci ptp libahci drm pps_core
Mar 24 11:29:06 pandemonium kernel: [ 671.776940] CPU: 0 PID: 4040 Comm: a.out Tainted: P OE 4.1.0-201603162000-kfd-build-obsidian-82-generic #82
Mar 24 11:29:06 pandemonium kernel: [ 671.776943] Hardware name: iXsystems CSE-COR-AIR540/RAMPAGE V EXTREME, BIOS 1902 12/18/2015
Mar 24 11:29:06 pandemonium kernel: [ 671.776944] task: ffff880e71d93250 ti: ffff880ea4020000 task.ti: ffff880ea4020000
Mar 24 11:29:06 pandemonium kernel: [ 671.776946] RIP: 0010:[] [] set_trap_handler+0x1a/0x30 [amdkfd]
Mar 24 11:29:06 pandemonium kernel: [ 671.776955] RSP: 0018:ffff880ea4023d48 EFLAGS: 00010286
Mar 24 11:29:06 pandemonium kernel: [ 671.776956] RAX: ffffc90019ecd000 RBX: ffff880ff0f92e00 RCX: 0000000000000000
Mar 24 11:29:06 pandemonium kernel: [ 671.776957] RDX: 0000000002400000 RSI: ffff880ff3251e20 RDI: ffff880ff0f92a00
Mar 24 11:29:06 pandemonium kernel: [ 671.776959] RBP: ffff880ea4023d48 R08: ffff880ff0f92a00 R09: 0000000000000000
Mar 24 11:29:06 pandemonium kernel: [ 671.776960] R10: ffff880f962af800 R11: 00007ffc39269e80 R12: ffff880ea4023dc0
Mar 24 11:29:06 pandemonium kernel: [ 671.776961] R13: ffff880fb1275018 R14: ffff880fb1275000 R15: ffff880ea4023dc0
Mar 24 11:29:06 pandemonium kernel: [ 671.776963] FS: 00007f8005ebc740(0000) GS:ffff880fff200000(0000) knlGS:0000000000000000
Mar 24 11:29:06 pandemonium kernel: [ 671.776965] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 24 11:29:06 pandemonium kernel: [ 671.776966] CR2: ffffc90019ecd000 CR3: 0000000f3f50b000 CR4: 00000000001407f0
Mar 24 11:29:06 pandemonium kernel: [ 671.776968] Stack:
Mar 24 11:29:06 pandemonium kernel: [ 671.776969] ffff880ea4023d78 ffffffffc03d8883 ffff880ea4023dc0 fffffffffffffff2
Mar 24 11:29:06 pandemonium kernel: [ 671.776972] 000000000000001a 00000000fffffff2 ffff880ea4023e78 ffffffffc03d9ebf
Mar 24 11:29:06 pandemonium kernel: [ 671.776975] ffff880f962af800 ffffffffc03d8810 ffff880fb1275000 00007ffc39269e80
Mar 24 11:29:06 pandemonium kernel: [ 671.776977] Call Trace:
Mar 24 11:29:06 pandemonium kernel: [ 671.776986] [] kfd_ioctl_set_trap_handler+0x73/0xc0 [amdkfd]
Mar 24 11:29:06 pandemonium kernel: [ 671.776994] [] kfd_ioctl+0x2bf/0x4d0 [amdkfd]
Mar 24 11:29:06 pandemonium kernel: [ 671.777001] [] ? kfd_ioctl_get_process_apertures+0x2e0/0x2e0 [amdkfd]
Mar 24 11:29:06 pandemonium kernel: [ 671.777010] [] ? pte_alloc_one+0x30/0x50
Mar 24 11:29:06 pandemonium kernel: [ 671.777015] [] ? __pte_alloc+0xcc/0x180
Mar 24 11:29:06 pandemonium kernel: [ 671.777019] [] do_vfs_ioctl+0x2f8/0x510
Mar 24 11:29:06 pandemonium kernel: [ 671.777023] [] ? __do_page_fault+0x1b6/0x450
Mar 24 11:29:06 pandemonium kernel: [ 671.777026] [] SyS_ioctl+0x81/0xa0
Mar 24 11:29:06 pandemonium kernel: [ 671.777028] [] ? do_page_fault+0x30/0x80
Mar 24 11:29:06 pandemonium kernel: [ 671.777032] [] system_call_fastpath+0x16/0x75
Mar 24 11:29:06 pandemonium kernel: [ 671.777034] Code: 00 0f 1f 44 00 00 55 31 c0 48 89 e5 5d c3 0f 1f 00 0f 1f 44 00 00 55 48 8b 46 f0 48 89 e5 8b 80 f4 01 00 00 48 03 86 e0 00 00 00 <48> 89 10 48 89 48 08 31 c0 5d c3 66 66 2e 0f 1f 84 00 00 00 00
Mar 24 11:29:06 pandemonium kernel: [ 671.777061] RIP [] set_trap_handler+0x1a/0x30 [amdkfd]
Mar 24 11:29:06 pandemonium kernel: [ 671.777068] RSP
Mar 24 11:29:06 pandemonium kernel: [ 671.777069] CR2: ffffc90019ecd000
Mar 24 11:29:06 pandemonium kernel: [ 671.777072] ---[ end trace 53807749a7eb2ed3 ]---

Should I go back to the 1/25 version of driver/runtime with my local patch or is this likely to be fixed?
I'm happy to provide more info if need be.

@mattmacy mattmacy changed the title HIP only able to see Nvidia GPU0 and unable to build hip_hcc.cpp to instrument kernel oops when running hip kernel with dev branch ROCR/ROCK Mar 24, 2016
@mattmacy
Copy link
Author

I created an issue in with ROCK as that is probably where the current problem belongs.

@aditya4d1
Copy link
Contributor

Hi,
Make sure you install debian files for ROCK dev branch.
For runtime, make sure to install ROCR dev branch.
Test the sample in /opt/hsa/sample.

If your sample is not passing, ROCR or ROCK is not working as it should be.

If it pass, get compiler (HCC and LLVM), follow https://github.com/RadeonOpenCompute/LLVM-AMDGPU-Assembler-Extra. Make sure you run conformance test given in the wiki for the repo.

Then, add /opt/hsa to HSA_PATH, /opt/hcc to HCC_PATH. Do the same for adding bin directories to PATH and lib to LD_LIBRARY_PATH.

Get hip and add its project directory to HIP_PATH and hipcc directory to PATH.

@mattmacy
Copy link
Author

See previous comment "OK. I updated both the kernel and the runtime to the 316 build." That's the dev kernel. I also installed the dev runtime so that hip_hcc.cpp will compile with the ROCR_V2 copy interface. And that is what is causing this panic.

@mattmacy
Copy link
Author

My sample passed fine until I tried the latest kernel and runtime. So all the other options are correct.

@aditya4d1
Copy link
Contributor

Can you try running hsa sample?

@mattmacy
Copy link
Author

I'm no longer able to boot the dev kernel. It also complains of not properly detecting my graphics hardware - so needs to run in low-resolution, but instead never displays a login prompt. I'm not sure what I need to do to recover at this point. The default ubuntu kernel still works OK.

@mattmacy
Copy link
Author

Looking at the logs It seems I'm seeing further OOPS at boot now:
Mar 24 11:53:42 pandemonium rsyslogd: rsyslogd's userid changed to 104
Mar 24 11:53:43 pandemonium kernel: [ 13.184444] NVRM: Your system is not currently configured to drive a VGA console
Mar 24 11:53:43 pandemonium kernel: [ 13.184446] NVRM: on the primary VGA device. The NVIDIA Linux graphics driver
Mar 24 11:53:43 pandemonium kernel: [ 13.184447] NVRM: requires the use of a text-mode VGA console. Use of other console
Mar 24 11:53:43 pandemonium kernel: [ 13.184448] NVRM: drivers including, but not limited to, vesafb, may result in
Mar 24 11:53:43 pandemonium kernel: [ 13.184449] NVRM: corruption and stability problems, and is not supported.
Mar 24 11:53:43 pandemonium kernel: [ 13.312533] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
Mar 24 11:53:43 pandemonium kernel: [ 13.312537] IP: [] amdgpu_bo_list_set+0x196/0x3d0 [amdgpu]
Mar 24 11:53:43 pandemonium kernel: [ 13.312554] PGD 0
Mar 24 11:53:43 pandemonium kernel: [ 13.312556] Oops: 0000 [#1] SMP
Mar 24 11:53:43 pandemonium kernel: [ 13.312558] Modules linked in: vmw_vsock_vmci_transport vsock vmw_vmci b43 mac80211 intel_rapl iosf_mbi x86_pkg_temp_thermal intel_powerclamp coretemp eeepc_wmi asus_wmi sparse_keymap cfg80211 video kvm ssb mxm_wmi crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel snd_hda_codec_realtek snd_hda_codec_generic aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd snd_hda_codec_hdmi serio_raw snd_hda_intel sb_edac snd_hda_controller nvidia(POE) snd_hda_codec edac_core snd_hda_core snd_hwdep snd_pcm snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq snd_seq_device snd_timer snd lpc_ich bcma mei_me soundcore mei shpchp bnep wmi bluetooth tpm_infineon mac_hid binfmt_misc parport_pc ppdev lp parport nls_iso8859_1 amdkfd amd_iommu_v2 amdgpu psmouse amd_gnb_bus i2c_algo_bit ttm drm_kms_helper e1000e ahci libahci drm ptp pps_core
Mar 24 11:53:43 pandemonium kernel: [ 13.312584] CPU: 1 PID: 1335 Comm: Xorg Tainted: P OE 4.1.0-201603162000-kfd-build-obsidian-82-generic #82
Mar 24 11:53:43 pandemonium kernel: [ 13.312586] Hardware name: iXsystems CSE-COR-AIR540/RAMPAGE V EXTREME, BIOS 1902 12/18/2015
Mar 24 11:53:43 pandemonium kernel: [ 13.312587] task: ffff880fcf760000 ti: ffff880ff34d4000 task.ti: ffff880ff34d4000
Mar 24 11:53:43 pandemonium kernel: [ 13.312587] RIP: 0010:[] [] amdgpu_bo_list_set+0x196/0x3d0 [amdgpu]
Mar 24 11:53:43 pandemonium kernel: [ 13.312595] RSP: 0018:ffff880ff34d7c68 EFLAGS: 00010286
Mar 24 11:53:43 pandemonium kernel: [ 13.312596] RAX: ffff880ff17a83c0 RBX: ffff880ff0d7d400 RCX: 0000000000000001
Mar 24 11:53:43 pandemonium kernel: [ 13.312597] RDX: 0000000000000000 RSI: 0000000000000002 RDI: ffff880ff17a8cc0
Mar 24 11:53:43 pandemonium kernel: [ 13.312597] RBP: ffff880ff34d7cd8 R08: 000000000001a920 R09: ffff880ff17a83c0
Mar 24 11:53:43 pandemonium kernel: [ 13.312598] R10: ffffffffc01cc11d R11: 00000000c0186443 R12: ffff880ff4a239c0
Mar 24 11:53:43 pandemonium kernel: [ 13.312599] R13: ffff880ff31ee110 R14: ffff880ff5c16200 R15: ffff880ff06f0000
Mar 24 11:53:43 pandemonium kernel: [ 13.312599] FS: 00007f705b545980(0000) GS:ffff880fff240000(0000) knlGS:0000000000000000
Mar 24 11:53:43 pandemonium kernel: [ 13.312600] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 24 11:53:43 pandemonium kernel: [ 13.312601] CR2: 0000000000000010 CR3: 0000000ff6cd5000 CR4: 00000000001407e0
Mar 24 11:53:43 pandemonium kernel: [ 13.312602] Stack:
Mar 24 11:53:43 pandemonium kernel: [ 13.312602] ffff880ff34d7cd8 ffff880fe0b0c400 ffff880fe0b0c000 ffff880fe0b0c800
Mar 24 11:53:43 pandemonium kernel: [ 13.312604] 0000000200000000 ffff880ff382ee00 ffff880ff17a83c0 0000000000000002
Mar 24 11:53:43 pandemonium kernel: [ 13.312605] 0000000000000000 ffff880ff31ee110 ffff880ff382ee00 ffff880ff4a239c0
Mar 24 11:53:43 pandemonium kernel: [ 13.312606] Call Trace:
Mar 24 11:53:43 pandemonium kernel: [ 13.312615] [] amdgpu_bo_list_ioctl+0x279/0x3f0 [amdgpu]
Mar 24 11:53:43 pandemonium kernel: [ 13.312623] [] drm_ioctl+0x379/0x6a0 [drm]
Mar 24 11:53:43 pandemonium kernel: [ 13.312630] [] ? amdgpu_bo_list_free+0x90/0x90 [amdgpu]
Mar 24 11:53:43 pandemonium kernel: [ 13.312635] [] amdgpu_drm_ioctl+0x4b/0x80 [amdgpu]
Mar 24 11:53:43 pandemonium kernel: [ 13.312638] [] do_vfs_ioctl+0x2f8/0x510
Mar 24 11:53:43 pandemonium kernel: [ 13.312640] [] ? __do_page_fault+0x1b6/0x450
Mar 24 11:53:43 pandemonium kernel: [ 13.312642] [] SyS_ioctl+0x81/0xa0
Mar 24 11:53:43 pandemonium kernel: [ 13.312643] [] ? do_page_fault+0x30/0x80
Mar 24 11:53:43 pandemonium kernel: [ 13.312645] [] system_call_fastpath+0x16/0x75
Mar 24 11:53:43 pandemonium kernel: [ 13.312646] Code: 00 00 48 8b bb 08 01 00 00 e8 57 67 fe ff 84 c0 0f 85 2f ff ff ff 8b 45 b0 8d 48 01 48 8d 04 80 48 c1 e0 04 48 03 45 c0 48 8b 10 <8b> 52 10 83 fa 04 89 50 30 0f 84 2b 01 00 00 89 50 34 48 89 18
Mar 24 11:53:43 pandemonium kernel: [ 13.312659] RIP [] amdgpu_bo_list_set+0x196/0x3d0 [amdgpu]
Mar 24 11:53:43 pandemonium kernel: [ 13.312665] RSP
Mar 24 11:53:43 pandemonium kernel: [ 13.312666] CR2: 0000000000000010
Mar 24 11:53:43 pandemonium kernel: [ 13.312667] ---[ end trace df8106f7c32c327f ]---
Mar 24 11:53:43 pandemonium nvidia-persistenced: The daemon no longer has permission to remove its runtime data directory /var/run/nvidia-persistenced
Mar 24 11:53:43 pandemonium nvidia-persistenced: Shutdown (1372)
Again moments later:

Mar 24 11:53:48 pandemonium nvidia-persistenced: Started (1640)
Mar 24 11:53:49 pandemonium kernel: [ 19.067885] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
Mar 24 11:53:49 pandemonium kernel: [ 19.067889] IP: [] amdgpu_bo_list_set+0x196/0x3d0 [amdgpu]
Mar 24 11:53:49 pandemonium kernel: [ 19.067905] PGD 0
Mar 24 11:53:49 pandemonium kernel: [ 19.067907] Oops: 0000 [#2] SMP
Mar 24 11:53:49 pandemonium kernel: [ 19.067908] Modules linked in: hid_logitech_hidpp snd_usb_audio hid_logitech_dj snd_usbmidi_lib btusb btbcm btintel hid_generic usbhid hid vmw_vsock_vmci_transport vsock vmw_vmci b43
mac80211 intel_rapl iosf_mbi x86_pkg_temp_thermal intel_powerclamp coretemp eeepc_wmi asus_wmi sparse_keymap cfg80211 video kvm ssb mxm_wmi crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel snd_hda_codec_real
tek snd_hda_codec_generic aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd snd_hda_codec_hdmi serio_raw snd_hda_intel sb_edac snd_hda_controller nvidia(POE) snd_hda_codec edac_core snd_hda_core snd_hwdep snd_pcm snd
_seq_midi snd_seq_midi_event snd_rawmidi snd_seq snd_seq_device snd_timer snd lpc_ich bcma mei_me soundcore mei shpchp bnep wmi bluetooth tpm_infineon mac_hid binfmt_misc parport_pc ppdev lp parport nls_iso8859_1 amdkfd a
md_iommu_v2 amdgpu psmouse amd_gnb_bus i2c_algo_bit ttm drm_kms_helper e1000e ahci libahci drm ptp pps_core
Mar 24 11:53:49 pandemonium kernel: [ 19.067935] CPU: 0 PID: 1637 Comm: Xorg Tainted: P D OE 4.1.0-201603162000-kfd-build-obsidian-82-generic #82
Mar 24 11:53:49 pandemonium kernel: [ 19.067937] Hardware name: iXsystems CSE-COR-AIR540/RAMPAGE V EXTREME, BIOS 1902 12/18/2015
Mar 24 11:53:49 pandemonium kernel: [ 19.067938] task: ffff880fcf79c670 ti: ffff880ff6354000 task.ti: ffff880ff6354000
Mar 24 11:53:49 pandemonium kernel: [ 19.067939] RIP: 0010:[] [] amdgpu_bo_list_set+0x196/0x3d0 [amdgpu]
Mar 24 11:53:49 pandemonium kernel: [ 19.067946] RSP: 0018:ffff880ff6357c68 EFLAGS: 00010286
Mar 24 11:53:49 pandemonium kernel: [ 19.067947] RAX: ffff880ff14b9cc0 RBX: ffff880ff61dc800 RCX: 0000000000000001
Mar 24 11:53:49 pandemonium kernel: [ 19.067948] RDX: 0000000000000000 RSI: 0000000000000002 RDI: ffff880ff14b9d80
Mar 24 11:53:49 pandemonium kernel: [ 19.067948] RBP: ffff880ff6357cd8 R08: 000000000001a920 R09: ffff880ff14b9cc0
Mar 24 11:53:49 pandemonium kernel: [ 19.067949] R10: ffffffffc01cc11d R11: 00000000c0186443 R12: ffff880ff4370780
Mar 24 11:53:49 pandemonium kernel: [ 19.067950] R13: ffff880fefd27d90 R14: ffff880ff0ef0700 R15: ffff880ff06f0000
Mar 24 11:53:49 pandemonium kernel: [ 19.067951] FS: 00007fb06dc05980(0000) GS:ffff880fff200000(0000) knlGS:0000000000000000
Mar 24 11:53:49 pandemonium kernel: [ 19.067952] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 24 11:53:49 pandemonium kernel: [ 19.067952] CR2: 0000000000000010 CR3: 0000000ff61e0000 CR4: 00000000001407f0
Mar 24 11:53:49 pandemonium kernel: [ 19.067953] Stack:
Mar 24 11:53:49 pandemonium kernel: [ 19.067954] ffff880ff6357cd8 ffff880fe0b0c400 ffff880fe0b0c000 ffff880fe0b0c800
Mar 24 11:53:49 pandemonium kernel: [ 19.067955] 0000000200000000 ffff880ff2704e00 ffff880ff14b9cc0 0000000000000002
Mar 24 11:53:49 pandemonium kernel: [ 19.067956] 0000000000000000 ffff880fefd27d90 ffff880ff2704e00 ffff880ff4370780
Mar 24 11:53:49 pandemonium kernel: [ 19.067958] Call Trace:
Mar 24 11:53:49 pandemonium kernel: [ 19.067967] [] amdgpu_bo_list_ioctl+0x279/0x3f0 [amdgpu]
Mar 24 11:53:49 pandemonium kernel: [ 19.067976] [] drm_ioctl+0x379/0x6a0 [drm]
Mar 24 11:53:49 pandemonium kernel: [ 19.067983] [] ? amdgpu_bo_list_free+0x90/0x90 [amdgpu]
Mar 24 11:53:49 pandemonium kernel: [ 19.067988] [] amdgpu_drm_ioctl+0x4b/0x80 [amdgpu]
Mar 24 11:53:49 pandemonium kernel: [ 19.067991] [] do_vfs_ioctl+0x2f8/0x510
Mar 24 11:53:49 pandemonium kernel: [ 19.067993] [] ? __do_page_fault+0x1b6/0x450
Mar 24 11:53:49 pandemonium kernel: [ 19.067995] [] SyS_ioctl+0x81/0xa0
Mar 24 11:53:49 pandemonium kernel: [ 19.067996] [] ? do_page_fault+0x30/0x80
Mar 24 11:53:49 pandemonium kernel: [ 19.067998] [] system_call_fastpath+0x16/0x75
Mar 24 11:53:49 pandemonium kernel: [ 19.067999] Code: 00 00 48 8b bb 08 01 00 00 e8 57 67 fe ff 84 c0 0f 85 2f ff ff ff 8b 45 b0 8d 48 01 48 8d 04 80 48 c1 e0 04 48 03 45 c0 48 8b 10 <8b> 52 10 83 fa 04 89 50 30 0f 84 2b 01 00 00 89 50 34 48 89 18
Mar 24 11:53:49 pandemonium kernel: [ 19.068012] RIP [] amdgpu_bo_list_set+0x196/0x3d0 [amdgpu]
Mar 24 11:53:49 pandemonium kernel: [ 19.068019] RSP
Mar 24 11:53:49 pandemonium kernel: [ 19.068019] CR2: 0000000000000010
Mar 24 11:53:49 pandemonium kernel: [ 19.068031] ---[ end trace df8106f7c32c3280 ]---
Mar 24 11:53:49 pandemonium nvidia-persistenced: The daemon no longer has permission to remove its runtime data directory /var/run/nvidia-persistenced
Mar 24 11:53:49 pandemonium nvidia-persistenced: Shutdown (1640)
Mar 24 11:53:54 pandemonium nvidia-persistenced: Started (1703)
Mar 24 11:53:54 pandemonium nvidia-persistenced: Failed to open PID file: File exists
Mar 24 11:53:54 pandemonium nvidia-persistenced: Shutdown (1710)
Mar 24 11:53:54 pandemonium nvidia-persistenced: The daemon no longer has permission to remove its runtime data directory /var/run/nvidia-persistenced
Mar 24 11:53:54 pandemonium nvidia-persistenced: Shutdown (1703)
Mar 24 11:53:54 pandemonium nvidia-persistenced: Started (1741)
Mar 24 11:53:55 pandemonium kernel: [ 25.105818] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
Mar 24 11:53:55 pandemonium kernel: [ 25.105822] IP: [] amdgpu_bo_list_set+0x196/0x3d0 [amdgpu]
Mar 24 11:53:55 pandemonium kernel: [ 25.105838] PGD 0
Mar 24 11:53:55 pandemonium kernel: [ 25.105839] Oops: 0000 [#3] SMP
Mar 24 11:53:55 pandemonium kernel: [ 25.105841] Modules linked in: hid_logitech_hidpp snd_usb_audio hid_logitech_dj snd_usbmidi_lib btusb btbcm btintel hid_generic usbhid hid vmw_vsock_vmci_transport vsock vmw_vmci b43 mac80211 intel_rapl iosf_mbi x86_pkg_temp_thermal intel_powerclamp coretemp eeepc_wmi asus_wmi sparse_keymap cfg80211 video kvm ssb mxm_wmi crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel snd_hda_codec_realtek snd_hda_codec_generic aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd snd_hda_codec_hdmi serio_raw snd_hda_intel sb_edac snd_hda_controller nvidia(POE) snd_hda_codec edac_core snd_hda_core snd_hwdep snd_pcm snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq snd_seq_device snd_timer snd lpc_ich bcma mei_me soundcore mei shpchp bnep wmi bluetooth tpm_infineon mac_hid binfmt_misc parport_pc ppdev lp parport nls_iso8859_1 amdkfd amd_iommu_v2 amdgpu psmouse amd_gnb_bus i2c_algo_bit ttm drm_kms_helper e1000e ahci libahci drm ptp pps_core
Mar 24 11:53:55 pandemonium kernel: [ 25.105869] CPU: 0 PID: 1738 Comm: Xorg Tainted: P D OE 4.1.0-201603162000-kfd-build-obsidian-82-generic #82
Mar 24 11:53:55 pandemonium kernel: [ 25.105870] Hardware name: iXsystems CSE-COR-AIR540/RAMPAGE V EXTREME, BIOS 1902 12/18/2015
Mar 24 11:53:55 pandemonium kernel: [ 25.105871] task: ffff880034a11e30 ti: ffff8800a6aec000 task.ti: ffff8800a6aec000
Mar 24 11:53:55 pandemonium kernel: [ 25.105872] RIP: 0010:[] [] amdgpu_bo_list_set+0x196/0x3d0 [amdgpu]
Mar 24 11:53:55 pandemonium kernel: [ 25.105880] RSP: 0018:ffff8800a6aefc68 EFLAGS: 00010286
Mar 24 11:53:55 pandemonium kernel: [ 25.105881] RAX: ffff880fefd90f00 RBX: ffff880ff0e2d800 RCX: 0000000000000001
Mar 24 11:53:55 pandemonium kernel: [ 25.105881] RDX: 0000000000000000 RSI: 0000000000000002 RDI: ffff880fefd90840
Mar 24 11:53:55 pandemonium kernel: [ 25.105882] RBP: ffff8800a6aefcd8 R08: 000000000001a920 R09: ffff880fefd90f00
Mar 24 11:53:55 pandemonium kernel: [ 25.105883] R10: ffffffffc01cc11d R11: 00000000c0186443 R12: ffff880ff47c9120
Mar 24 11:53:55 pandemonium kernel: [ 25.105883] R13: ffff880fefd27e90 R14: ffff880ff0d68100 R15: ffff880ff06f0000
Mar 24 11:53:55 pandemonium kernel: [ 25.105884] FS: 00007fbc416d6980(0000) GS:ffff880fff200000(0000) knlGS:0000000000000000
Mar 24 11:53:55 pandemonium kernel: [ 25.105885] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 24 11:53:55 pandemonium kernel: [ 25.105886] CR2: 0000000000000010 CR3: 0000000ff6ea4000 CR4: 00000000001407f0
Mar 24 11:53:55 pandemonium kernel: [ 25.105886] Stack:
Mar 24 11:53:55 pandemonium kernel: [ 25.105887] ffff8800a6aefcd8 ffff880fe0b0c400 ffff880fe0b0c000 ffff880fe0b0c800
Mar 24 11:53:55 pandemonium kernel: [ 25.105889] 0000000200000000 ffff880ff363b400 ffff880fefd90f00 0000000000000002
Mar 24 11:53:55 pandemonium kernel: [ 25.105890] 0000000000000000 ffff880fefd27e90 ffff880ff363b400 ffff880ff47c9120
Mar 24 11:53:55 pandemonium kernel: [ 25.105891] Call Trace:
Mar 24 11:53:55 pandemonium kernel: [ 25.105900] [] amdgpu_bo_list_ioctl+0x279/0x3f0 [amdgpu]

And so on for all cpus.

@jedwards-AMD
Copy link
Contributor

Do you have the GTX 980Ti, the R9 Nano and the R9 Fury all installed in the same system? If so, did you install the drivers for the GTX card before or after you installed the ROCK packages?

@mattmacy
Copy link
Author

They're all in the same system. I installed the GTX card a couple of weeks ago. The R9s date back to yesterday. I have made no changes to the Nvidia software/hardware configuration in a couple of weeks - i.e. well before doing anything with AMD.

@mattmacy
Copy link
Author

Attaching system profile info.

lspci.txt
dmesg.txt
lsmod.txt

@mattmacy
Copy link
Author

The current status AFAICT is that the development driver won't work except in console-mode because Xorg's probing causes it to crash. So can anyone give me an ETA on when that will be fixed on github?

Thanks.

@aditya4d1
Copy link
Contributor

Hi,
You can revert back to a previous release commit.

@mattmacy
Copy link
Author

It's not clear to me where the problem was introduced. Can you hazard a guess at which changeset to try? The last time packages were updated was Jan 26th which corresponds to what's in master. So I'll need to build my own kernel - which is fine with me provided Kconfig is complete.

@aditya4d1
Copy link
Contributor

Hi,
You can try master branch package. obsidian 62 (if you want to run hcc badly).

@mangupta
Copy link
Contributor

Closing this since the original issue should be occurring anymore.

@mattmacy Please try with a clean setup and reopen the issue if you face any problems.

MikalaiDrabovich added a commit to MikalaiDrabovich/HIP that referenced this issue Aug 4, 2017
Is 'new' keyword supported? Malloc/free way work fine, but not new/delete.

If lines 45-46 added, it compiler error is the following:

ndr@ndr-ROCM16:~/Desktop/square/new$ make clean && make
rm -f *.o square
/opt/rocm/hip/bin/hipcc --amdgpu-target=gfx900 square.cpp -o square
Referencing function in another module!
  %call6.i.i = tail call i8* @_Znam(i64 1024) ROCm#3
; ModuleID = '<stdin>'
i8* (i64)* @_Znam
; ModuleID = '#0 0x000000000142b5ea llvm::sys::PrintStackTrace(llvm::raw_ostream&) (/opt/rocm/hcc-1.0/compiler/bin/opt+0x142b5ea)
ROCm#1 0x000000000142968e llvm::sys::RunSignalHandlers() (/opt/rocm/hcc-1.0/compiler/bin/opt+0x142968e)
ROCm#2 0x00000000014297dc SignalHandler(int) (/opt/rocm/hcc-1.0/compiler/bin/opt+0x14297dc)
ROCm#3 0x00007f22f9e4c390 __restore_rt (/lib/x86_64-linux-gnu/libpthread.so.0+0x11390)
ROCm#4 0x0000000000f81eb9 void llvm::VerifierSupport::CheckFailed<llvm::Instruction*, llvm::Module const*, llvm::GlobalValue*, llvm::Module*>(llvm::Twine const&, llvm::Instruction* const&, llvm::Module const* const&, llvm::GlobalValue* const&, llvm::Module* const&) (/opt/rocm/hcc-1.0/compiler/bin/opt+0xf81eb9)
ROCm#5 0x0000000000f8c8bc (anonymous namespace)::Verifier::visitInstruction(llvm::Instruction&) (/opt/rocm/hcc-1.0/compiler/bin/opt+0xf8c8bc)
ROCm#6 0x0000000000f8f7b2 (anonymous namespace)::Verifier::verifyCallSite(llvm::CallSite) (/opt/rocm/hcc-1.0/compiler/bin/opt+0xf8f7b2)
#7 0x0000000000f919f5 (anonymous namespace)::Verifier::visitCallInst(llvm::CallInst&) (/opt/rocm/hcc-1.0/compiler/bin/opt+0xf919f5)
#8 0x0000000000f95381 llvm::InstVisitor<(anonymous namespace)::Verifier, void>::visit(llvm::Function&) (/opt/rocm/hcc-1.0/compiler/bin/opt+0xf95381)
#9 0x0000000000f97264 (anonymous namespace)::Verifier::verify(llvm::Function const&) (/opt/rocm/hcc-1.0/compiler/bin/opt+0xf97264)
ROCm#10 0x0000000000f9831d (anonymous namespace)::VerifierLegacyPass::runOnFunction(llvm::Function&) (/opt/rocm/hcc-1.0/compiler/bin/opt+0xf9831d)
ROCm#11 0x0000000000f4459a llvm::FPPassManager::runOnFunction(llvm::Function&) (/opt/rocm/hcc-1.0/compiler/bin/opt+0xf4459a)
ROCm#12 0x0000000000f44643 llvm::FPPassManager::runOnModule(llvm::Module&) (/opt/rocm/hcc-1.0/compiler/bin/opt+0xf44643)
ROCm#13 0x0000000000f44104 llvm::legacy::PassManagerImpl::run(llvm::Module&) (/opt/rocm/hcc-1.0/compiler/bin/opt+0xf44104)
ROCm#14 0x0000000000643b74 main (/opt/rocm/hcc-1.0/compiler/bin/opt+0x643b74)
ROCm#15 0x00007f22f8ba9830 __libc_start_main /build/glibc-bfm8X4/glibc-2.23/csu/../csu/libc-start.c:325:0
ROCm#16 0x000000000068f729 _start (/opt/rocm/hcc-1.0/compiler/bin/opt+0x68f729)
Stack dump:
0.	Program arguments: /opt/rocm/hcc-1.0/compiler/bin/opt -load /opt/rocm/hcc-1.0/compiler/bin/../lib/LLVMEraseNonkernel.so -inline -inline-threshold=1048576 -erase-nonkernels -dce -globaldce -o /tmp/tmp.vqMJlUNjk9/kernel-gfx900.hsaco.promote.bc 
1.	Running pass 'Function Pass Manager' on module '<stdin>'.
2.	Running pass 'Module Verifier' on function '@_ZZ4mainEN67HIP_kernel_functor_name_begin_unnamed_HIP_kernel_functor_name_end_419__cxxamp_trampolineEPfS0_m'
/opt/rocm/hcc-1.0/compiler/bin/clamp-device: line 140: 18412 Segmentation fault      (core dumped) $OPT -load $LIB/LLVMEraseNonkernel.so -inline -inline-threshold=1048576 -erase-nonkernels -dce -globaldce -o $2.promote.bc < $1
Generating AMD GCN kernel failed in HCC-specific opt passes for target: gfx900
/opt/rocm/hcc/bin/hcc(_ZN4llvm3sys15PrintStackTraceERNS_11raw_ostreamE+0x2a)[0x1674f1a]
/opt/rocm/hcc/bin/hcc(_ZN4llvm3sys17RunSignalHandlersEv+0x3e)[0x1672fbe]
/opt/rocm/hcc/bin/hcc[0x167310c]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x11390)[0x7f69bbc98390]
[0x7f69bc0c8a10]
Stack dump:
0.	Program arguments: /opt/rocm/hcc/bin/hcc -hc -D__HIPCC__ -I/opt/rocm/hcc/include -I/opt/rocm/hip/include/hip/hcc_detail/cuda -I/opt/rocm/hsa/include -Wno-deprecated-register -I/opt/rocm/profiler/CXLActivityLogger/include -I/opt/rocm/hip/include -DHIP_VERSION_MAJOR=1 -DHIP_VERSION_MINOR=2 -DHIP_VERSION_PATCH=17284 -D__HIP_ARCH_GFX900__=1 -Wl,--rpath=/opt/rocm/hip/lib /opt/rocm/hip/lib/libhip_hcc.so /opt/rocm/hip/lib/libhip_device.a -hc -std=c++amp -L/opt/rocm/hcc-1.0/lib -Wl,--rpath=/opt/rocm/hcc-1.0/lib -ldl -lm -lpthread -lunwind -Wl,--whole-archive -lmcwamp -Wl,--no-whole-archive -lsupc++ -L/opt/rocm/hsa/lib -L/opt/rocm/lib -lhsa-runtime64 -lhc_am -lhsakmt -L/opt/rocm/profiler/CXLActivityLogger/bin/x86_64 -lCXLActivityLogger -Wl,--rpath=/opt/rocm/profiler/CXLActivityLogger/bin/x86_64 -lm --amdgpu-target=gfx900 --amdgpu-target=gfx900 square.cpp -o square 
Died at /opt/rocm/hip/bin/hipcc line 452.
Makefile:19: recipe for target 'square' failed
make: *** [square] Error 255

 With delete [] , the error is 


ndr@ndr-ROCM16:~/Desktop/square/new$ make clean && make
rm -f *.o square
/opt/rocm/hip/bin/hipcc --amdgpu-target=gfx900 square.cpp -o square
Referencing function in another module!
  %call6.i.i = tail call i8* @_Znam(i64 1024) ROCm#3
; ModuleID = '<stdin>'
i8* (i64)* @_Znam
; ModuleID = '#0 0x000000000142b5ea llvm::sys::PrintStackTrace(llvm::raw_ostream&) (/opt/rocm/hcc-1.0/compiler/bin/opt+0x142b5ea)
ROCm#1 0x000000000142968e llvm::sys::RunSignalHandlers() (/opt/rocm/hcc-1.0/compiler/bin/opt+0x142968e)
ROCm#2 0x00000000014297dc SignalHandler(int) (/opt/rocm/hcc-1.0/compiler/bin/opt+0x14297dc)
ROCm#3 0x00007f84d4a09390 __restore_rt (/lib/x86_64-linux-gnu/libpthread.so.0+0x11390)
ROCm#4 0x0000000000f81eb9 void llvm::VerifierSupport::CheckFailed<llvm::Instruction*, llvm::Module const*, llvm::GlobalValue*, llvm::Module*>(llvm::Twine const&, llvm::Instruction* const&, llvm::Module const* const&, llvm::GlobalValue* const&, llvm::Module* const&) (/opt/rocm/hcc-1.0/compiler/bin/opt+0xf81eb9)
ROCm#5 0x0000000000f8c8bc (anonymous namespace)::Verifier::visitInstruction(llvm::Instruction&) (/opt/rocm/hcc-1.0/compiler/bin/opt+0xf8c8bc)
ROCm#6 0x0000000000f8f7b2 (anonymous namespace)::Verifier::verifyCallSite(llvm::CallSite) (/opt/rocm/hcc-1.0/compiler/bin/opt+0xf8f7b2)
#7 0x0000000000f919f5 (anonymous namespace)::Verifier::visitCallInst(llvm::CallInst&) (/opt/rocm/hcc-1.0/compiler/bin/opt+0xf919f5)
#8 0x0000000000f95381 llvm::InstVisitor<(anonymous namespace)::Verifier, void>::visit(llvm::Function&) (/opt/rocm/hcc-1.0/compiler/bin/opt+0xf95381)
#9 0x0000000000f97264 (anonymous namespace)::Verifier::verify(llvm::Function const&) (/opt/rocm/hcc-1.0/compiler/bin/opt+0xf97264)
ROCm#10 0x0000000000f9831d (anonymous namespace)::VerifierLegacyPass::runOnFunction(llvm::Function&) (/opt/rocm/hcc-1.0/compiler/bin/opt+0xf9831d)
ROCm#11 0x0000000000f4459a llvm::FPPassManager::runOnFunction(llvm::Function&) (/opt/rocm/hcc-1.0/compiler/bin/opt+0xf4459a)
ROCm#12 0x0000000000f44643 llvm::FPPassManager::runOnModule(llvm::Module&) (/opt/rocm/hcc-1.0/compiler/bin/opt+0xf44643)
ROCm#13 0x0000000000f44104 llvm::legacy::PassManagerImpl::run(llvm::Module&) (/opt/rocm/hcc-1.0/compiler/bin/opt+0xf44104)
ROCm#14 0x0000000000643b74 main (/opt/rocm/hcc-1.0/compiler/bin/opt+0x643b74)
ROCm#15 0x00007f84d3766830 __libc_start_main /build/glibc-bfm8X4/glibc-2.23/csu/../csu/libc-start.c:325:0
ROCm#16 0x000000000068f729 _start (/opt/rocm/hcc-1.0/compiler/bin/opt+0x68f729)
Stack dump:
0.	Program arguments: /opt/rocm/hcc-1.0/compiler/bin/opt -load /opt/rocm/hcc-1.0/compiler/bin/../lib/LLVMEraseNonkernel.so -inline -inline-threshold=1048576 -erase-nonkernels -dce -globaldce -o /tmp/tmp.LeiH3VuY4Q/kernel-gfx900.hsaco.promote.bc 
1.	Running pass 'Function Pass Manager' on module '<stdin>'.
2.	Running pass 'Module Verifier' on function '@_ZZ4mainEN67HIP_kernel_functor_name_begin_unnamed_HIP_kernel_functor_name_end_419__cxxamp_trampolineEPfS0_m'
/opt/rocm/hcc-1.0/compiler/bin/clamp-device: line 140: 18860 Segmentation fault      (core dumped) $OPT -load $LIB/LLVMEraseNonkernel.so -inline -inline-threshold=1048576 -erase-nonkernels -dce -globaldce -o $2.promote.bc < $1
Generating AMD GCN kernel failed in HCC-specific opt passes for target: gfx900
clang-5.0: error:  command failed with exit code 139 (use -v to see invocation)
Died at /opt/rocm/hip/bin/hipcc line 452.
Makefile:19: recipe for target 'square' failed
make: *** [square] Error 139
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants