kernel oops when running hip kernel with dev branch ROCR/ROCK #15
Comments
Hi Matt, please try setting env var HIP_PLATFORM to "hcc" so hip will recognize the nanos.
On Mar 23, 2016, at 11:36 PM, Matthew Macy <notifications@github.com> wrote:
I was able to do the tutorial on gpuopen.com but found that hipGetDeviceCount was only returning 1 so the examples would only run on my primary GPU a GTX 980Ti. I also have an R9 Nano and an R9 Fury. The kfd driver exports 3 nodes under topology so the runtime should let me talk to them. I'm running Ubuntu 15. I was hoping to instrument hip_hcc.cpp to see what it was doing right here: /*
But I can't even get it to compile: I made the following change to the Makefile in response to complaints. But it's still not doing anything. And it looks like it's trying to compile the code with nvcc: |
I see - that tells it which compiler to use. hipcc square.cpp That doesn't work so well. I've installed the most recent .deb from https://bitbucket.org/multicoreware/hcc/downloads. I see. Their latest .deb is 16045. Your sources require 16074 or later. I'm trying the following to see if I get a working hcc: |
Progress. I'm running 16124. It looks like you're out of sync with hsa: |
I don't know what the situation is with the ROCR_V2 API. The async memcpy in what I assume is the canonical hsa_ext_amd.h: I made the following changes to hip_hcc.cpp to get my square.cpp to compile using hcc as the HIP_PLATFORM:
|
Hi Matthew, can you try switching to the "dev" branch on both ROCK-Kernel-Driver and ROCR-Runtime? You should be able to find the newer async_copy API, which works with HIP, over there. |
What do I do to just re-build the driver? |
And for that matter - how do I rebuild the runtime? There's no makefile in the root. |
Hi Matthew, you don't need to build them. On "dev" branch of ROCK-Kernel-Driver you can find a "package" directory which has ubuntu & fedora packages inside. And you can also find pre-built packages under "package" directory in ROCR-Runtime. Please do remember to switch to "dev" branch on both repositories though. |
OK. Great. Thanks. I'll do that in the morning and let you know how that goes. In the meantime the patched version works for me. I do notice that AMD kernels are much slower than Nvidia kernels:
mmacy@pandemonium:~/devel/HIP/samples/0_Intro/square$ time !!
real 0m1.203s
mmacy@pandemonium:~/devel/HIP.old/samples/0_Intro/square$ time ./square.hip.out
real 0m0.273s
Is that fundamental? Or does your job dispatch interface just need refinement? Thanks. |
Hi Matthew, there is a lot of ongoing work to optimize all aspects of the stack. Please stay tuned. :) |
OK. I updated both the kernel and the runtime to the 316 build. When running the square.cpp example with HIP_PLATFORM=hcc (nvcc still works fine) I now get a kernel oops: Mar 24 11:28:34 pandemonium kernel: [ 639.895604] nvidia_uvm: Loaded the UVM driver, major device number 245 Should I go back to the 1/25 version of driver/runtime with my local patch or is this likely to be fixed? |
I created an issue with ROCK as that is probably where the current problem belongs. |
Hi, if your sample is not passing, ROCR or ROCK is not working as it should be. If it passes, get the compiler (HCC and LLVM) by following https://github.com/RadeonOpenCompute/LLVM-AMDGPU-Assembler-Extra. Make sure you run the conformance test given in the wiki for the repo. Then, add /opt/hsa to HSA_PATH and /opt/hcc to HCC_PATH. Do the same for adding the bin directories to PATH and the lib directories to LD_LIBRARY_PATH. Get hip and add its project directory to HIP_PATH and the hipcc directory to PATH. |
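For anyone following along, here is a minimal sketch of the environment setup described in the comment above. It assumes the /opt/hsa and /opt/hcc prefixes named there, and that each has bin/ and lib/ subdirectories. The HIP checkout location ($HOME/HIP) is a hypothetical placeholder, so adjust it to wherever you cloned hip. HIP_PLATFORM=hcc is the setting suggested earlier in this thread.

```shell
# Install prefixes as named in the comment; change them if yours differ.
export HSA_PATH=/opt/hsa
export HCC_PATH=/opt/hcc

# Assumption: hip was cloned to $HOME/HIP (hypothetical location).
export HIP_PATH="${HIP_PATH:-$HOME/HIP}"

# Put the HSA, HCC and hipcc executables on PATH.
export PATH="$HSA_PATH/bin:$HCC_PATH/bin:$HIP_PATH/bin:$PATH"

# Make the HSA and HCC libraries visible to the dynamic loader,
# appending any existing LD_LIBRARY_PATH only if it is set.
export LD_LIBRARY_PATH="$HSA_PATH/lib:$HCC_PATH/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"

# Ask hipcc to use the hcc backend instead of nvcc.
export HIP_PLATFORM=hcc
```

Putting these lines in a file and sourcing it from the shell you build in keeps the settings out of your global profile, which may matter on a machine that also has the CUDA toolchain installed.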
See previous comment "OK. I updated both the kernel and the runtime to the 316 build." That's the dev kernel. I also installed the dev runtime so that hip_hcc.cpp will compile with the ROCR_V2 copy interface. And that is what is causing this panic. |
My sample passed fine until I tried the latest kernel and runtime. So all the other options are correct. |
Can you try running an hsa sample? |
I'm no longer able to boot the dev kernel. It also complains of not properly detecting my graphics hardware - so needs to run in low-resolution, but instead never displays a login prompt. I'm not sure what I need to do to recover at this point. The default ubuntu kernel still works OK. |
Looking at the logs, it seems I'm seeing further OOPS at boot now: Mar 24 11:53:48 pandemonium nvidia-persistenced: Started (1640) And so on for all cpus. |
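A small sketch for pulling the oops-related lines out of a saved kernel log so they can be attached to a report. The default path /var/log/kern.log and the three search patterns are assumptions about a typical syslog setup, not something confirmed in this thread, and `find_oops` is a hypothetical helper name.

```shell
find_oops() {
    # $1: kernel log file to scan (assumed default: /var/log/kern.log)
    log="${1:-/var/log/kern.log}"
    if [ ! -r "$log" ]; then
        echo "cannot read $log" >&2
        return 1
    fi
    # Print matching lines with their line numbers; the patterns are
    # common markers of a kernel oops, matched case-insensitively.
    grep -n -i -E 'oops|BUG:|Call Trace' "$log"
}

# Scan the default log; ignore the non-zero status when nothing matches.
find_oops || true
```

The line numbers printed by `grep -n` make it easier to go back to the full log and copy the complete trace around each hit.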
Do you have the GTX 980Ti, the R9 Nano and the R9 Fury all installed in the same system? If so, did you install the drivers for the GTX card before or after you installed the ROCK packages? |
They're all in the same system. I installed the GTX card a couple of weeks ago. The R9s date back to yesterday. I have made no changes to the Nvidia software/hardware configuration in a couple of weeks - i.e. well before doing anything with AMD. |
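To double-check what the kfd driver itself sees on a mixed system like this, the topology nodes can be counted directly. This is only a sketch: it assumes the topology is exposed under /sys/class/kfd/kfd/topology/nodes with one subdirectory per node, and `count_kfd_nodes` and the `KFD_TOPOLOGY` override are hypothetical names. Note that the count may include a CPU node as well as the GPUs.

```shell
# Assumption: KFD exposes its topology under this sysfs path.
topo="${KFD_TOPOLOGY:-/sys/class/kfd/kfd/topology/nodes}"

count_kfd_nodes() {
    # $1: directory holding one subdirectory per topology node
    if [ -d "$1" ]; then
        # Count only the immediate subdirectories.
        find "$1" -mindepth 1 -maxdepth 1 -type d | wc -l | tr -d ' '
    else
        # Directory missing: the kfd driver is probably not loaded.
        echo 0
    fi
}

echo "KFD topology nodes: $(count_kfd_nodes "$topo")"
```

If this reports fewer nodes than expected, the problem is likely below the HIP layer, in the kernel driver or its configuration.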
The current status AFAICT is that the development driver won't work except in console-mode because Xorg's probing causes it to crash. So can anyone give me an ETA on when that will be fixed on github? Thanks. |
Hi, |
It's not clear to me where the problem was introduced. Can you hazard a guess at which changeset to try? The last time packages were updated was Jan 26th which corresponds to what's in master. So I'll need to build my own kernel - which is fine with me provided Kconfig is complete. |
Hi, |
Closing this since the original issue should not be occurring anymore. @mattmacy Please try with a clean setup and reopen the issue if you face any problems. |
Is the 'new' keyword supported? The malloc/free approach works fine, but new/delete does not. If lines 45-46 are added, the compiler error is the following:
ndr@ndr-ROCM16:~/Desktop/square/new$ make clean && make
rm -f *.o square
/opt/rocm/hip/bin/hipcc --amdgpu-target=gfx900 square.cpp -o square
Referencing function in another module!
%call6.i.i = tail call i8* @_Znam(i64 1024) #3
; ModuleID = '<stdin>'
i8* (i64)* @_Znam
; ModuleID = '
#0 0x000000000142b5ea llvm::sys::PrintStackTrace(llvm::raw_ostream&) (/opt/rocm/hcc-1.0/compiler/bin/opt+0x142b5ea)
#1 0x000000000142968e llvm::sys::RunSignalHandlers() (/opt/rocm/hcc-1.0/compiler/bin/opt+0x142968e)
#2 0x00000000014297dc SignalHandler(int) (/opt/rocm/hcc-1.0/compiler/bin/opt+0x14297dc)
#3 0x00007f22f9e4c390 __restore_rt (/lib/x86_64-linux-gnu/libpthread.so.0+0x11390)
#4 0x0000000000f81eb9 void llvm::VerifierSupport::CheckFailed<llvm::Instruction*, llvm::Module const*, llvm::GlobalValue*, llvm::Module*>(llvm::Twine const&, llvm::Instruction* const&, llvm::Module const* const&, llvm::GlobalValue* const&, llvm::Module* const&) (/opt/rocm/hcc-1.0/compiler/bin/opt+0xf81eb9)
#5 0x0000000000f8c8bc (anonymous namespace)::Verifier::visitInstruction(llvm::Instruction&) (/opt/rocm/hcc-1.0/compiler/bin/opt+0xf8c8bc)
#6 0x0000000000f8f7b2 (anonymous namespace)::Verifier::verifyCallSite(llvm::CallSite) (/opt/rocm/hcc-1.0/compiler/bin/opt+0xf8f7b2)
#7 0x0000000000f919f5 (anonymous namespace)::Verifier::visitCallInst(llvm::CallInst&) (/opt/rocm/hcc-1.0/compiler/bin/opt+0xf919f5)
#8 0x0000000000f95381 llvm::InstVisitor<(anonymous namespace)::Verifier, void>::visit(llvm::Function&) (/opt/rocm/hcc-1.0/compiler/bin/opt+0xf95381)
#9 0x0000000000f97264 (anonymous namespace)::Verifier::verify(llvm::Function const&) (/opt/rocm/hcc-1.0/compiler/bin/opt+0xf97264)
#10 0x0000000000f9831d (anonymous namespace)::VerifierLegacyPass::runOnFunction(llvm::Function&) (/opt/rocm/hcc-1.0/compiler/bin/opt+0xf9831d)
#11 0x0000000000f4459a llvm::FPPassManager::runOnFunction(llvm::Function&) (/opt/rocm/hcc-1.0/compiler/bin/opt+0xf4459a)
#12 0x0000000000f44643 llvm::FPPassManager::runOnModule(llvm::Module&) (/opt/rocm/hcc-1.0/compiler/bin/opt+0xf44643)
#13 0x0000000000f44104 llvm::legacy::PassManagerImpl::run(llvm::Module&) (/opt/rocm/hcc-1.0/compiler/bin/opt+0xf44104)
#14 0x0000000000643b74 main (/opt/rocm/hcc-1.0/compiler/bin/opt+0x643b74)
#15 0x00007f22f8ba9830 __libc_start_main /build/glibc-bfm8X4/glibc-2.23/csu/../csu/libc-start.c:325:0
#16 0x000000000068f729 _start (/opt/rocm/hcc-1.0/compiler/bin/opt+0x68f729)
Stack dump:
0. Program arguments: /opt/rocm/hcc-1.0/compiler/bin/opt -load /opt/rocm/hcc-1.0/compiler/bin/../lib/LLVMEraseNonkernel.so -inline -inline-threshold=1048576 -erase-nonkernels -dce -globaldce -o /tmp/tmp.vqMJlUNjk9/kernel-gfx900.hsaco.promote.bc
1. Running pass 'Function Pass Manager' on module '<stdin>'.
2. Running pass 'Module Verifier' on function '@_ZZ4mainEN67HIP_kernel_functor_name_begin_unnamed_HIP_kernel_functor_name_end_419__cxxamp_trampolineEPfS0_m'
/opt/rocm/hcc-1.0/compiler/bin/clamp-device: line 140: 18412 Segmentation fault (core dumped) $OPT -load $LIB/LLVMEraseNonkernel.so -inline -inline-threshold=1048576 -erase-nonkernels -dce -globaldce -o $2.promote.bc < $1
Generating AMD GCN kernel failed in HCC-specific opt passes for target: gfx900
/opt/rocm/hcc/bin/hcc(_ZN4llvm3sys15PrintStackTraceERNS_11raw_ostreamE+0x2a)[0x1674f1a]
/opt/rocm/hcc/bin/hcc(_ZN4llvm3sys17RunSignalHandlersEv+0x3e)[0x1672fbe]
/opt/rocm/hcc/bin/hcc[0x167310c]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x11390)[0x7f69bbc98390]
[0x7f69bc0c8a10]
Stack dump:
0. Program arguments: /opt/rocm/hcc/bin/hcc -hc -D__HIPCC__ -I/opt/rocm/hcc/include -I/opt/rocm/hip/include/hip/hcc_detail/cuda -I/opt/rocm/hsa/include -Wno-deprecated-register -I/opt/rocm/profiler/CXLActivityLogger/include -I/opt/rocm/hip/include -DHIP_VERSION_MAJOR=1 -DHIP_VERSION_MINOR=2 -DHIP_VERSION_PATCH=17284 -D__HIP_ARCH_GFX900__=1 -Wl,--rpath=/opt/rocm/hip/lib /opt/rocm/hip/lib/libhip_hcc.so /opt/rocm/hip/lib/libhip_device.a -hc -std=c++amp -L/opt/rocm/hcc-1.0/lib -Wl,--rpath=/opt/rocm/hcc-1.0/lib -ldl -lm -lpthread -lunwind -Wl,--whole-archive -lmcwamp -Wl,--no-whole-archive -lsupc++ -L/opt/rocm/hsa/lib -L/opt/rocm/lib -lhsa-runtime64 -lhc_am -lhsakmt -L/opt/rocm/profiler/CXLActivityLogger/bin/x86_64 -lCXLActivityLogger -Wl,--rpath=/opt/rocm/profiler/CXLActivityLogger/bin/x86_64 -lm --amdgpu-target=gfx900 --amdgpu-target=gfx900 square.cpp -o square
Died at /opt/rocm/hip/bin/hipcc line 452.
Makefile:19: recipe for target 'square' failed
make: *** [square] Error 255
With delete [], the error is:
ndr@ndr-ROCM16:~/Desktop/square/new$ make clean && make
rm -f *.o square
/opt/rocm/hip/bin/hipcc --amdgpu-target=gfx900 square.cpp -o square
Referencing function in another module!
%call6.i.i = tail call i8* @_Znam(i64 1024) #3
; ModuleID = '<stdin>'
i8* (i64)* @_Znam
; ModuleID = '
#0 0x000000000142b5ea llvm::sys::PrintStackTrace(llvm::raw_ostream&) (/opt/rocm/hcc-1.0/compiler/bin/opt+0x142b5ea)
#1 0x000000000142968e llvm::sys::RunSignalHandlers() (/opt/rocm/hcc-1.0/compiler/bin/opt+0x142968e)
#2 0x00000000014297dc SignalHandler(int) (/opt/rocm/hcc-1.0/compiler/bin/opt+0x14297dc)
#3 0x00007f84d4a09390 __restore_rt (/lib/x86_64-linux-gnu/libpthread.so.0+0x11390)
#4 0x0000000000f81eb9 void llvm::VerifierSupport::CheckFailed<llvm::Instruction*, llvm::Module const*, llvm::GlobalValue*, llvm::Module*>(llvm::Twine const&, llvm::Instruction* const&, llvm::Module const* const&, llvm::GlobalValue* const&, llvm::Module* const&) (/opt/rocm/hcc-1.0/compiler/bin/opt+0xf81eb9)
#5 0x0000000000f8c8bc (anonymous namespace)::Verifier::visitInstruction(llvm::Instruction&) (/opt/rocm/hcc-1.0/compiler/bin/opt+0xf8c8bc)
#6 0x0000000000f8f7b2 (anonymous namespace)::Verifier::verifyCallSite(llvm::CallSite) (/opt/rocm/hcc-1.0/compiler/bin/opt+0xf8f7b2)
#7 0x0000000000f919f5 (anonymous namespace)::Verifier::visitCallInst(llvm::CallInst&) (/opt/rocm/hcc-1.0/compiler/bin/opt+0xf919f5)
#8 0x0000000000f95381 llvm::InstVisitor<(anonymous namespace)::Verifier, void>::visit(llvm::Function&) (/opt/rocm/hcc-1.0/compiler/bin/opt+0xf95381)
#9 0x0000000000f97264 (anonymous namespace)::Verifier::verify(llvm::Function const&) (/opt/rocm/hcc-1.0/compiler/bin/opt+0xf97264)
#10 0x0000000000f9831d (anonymous namespace)::VerifierLegacyPass::runOnFunction(llvm::Function&) (/opt/rocm/hcc-1.0/compiler/bin/opt+0xf9831d)
#11 0x0000000000f4459a llvm::FPPassManager::runOnFunction(llvm::Function&) (/opt/rocm/hcc-1.0/compiler/bin/opt+0xf4459a)
#12 0x0000000000f44643 llvm::FPPassManager::runOnModule(llvm::Module&) (/opt/rocm/hcc-1.0/compiler/bin/opt+0xf44643)
#13 0x0000000000f44104 llvm::legacy::PassManagerImpl::run(llvm::Module&) (/opt/rocm/hcc-1.0/compiler/bin/opt+0xf44104)
#14 0x0000000000643b74 main (/opt/rocm/hcc-1.0/compiler/bin/opt+0x643b74)
#15 0x00007f84d3766830 __libc_start_main /build/glibc-bfm8X4/glibc-2.23/csu/../csu/libc-start.c:325:0
#16 0x000000000068f729 _start (/opt/rocm/hcc-1.0/compiler/bin/opt+0x68f729)
Stack dump:
0. Program arguments: /opt/rocm/hcc-1.0/compiler/bin/opt -load /opt/rocm/hcc-1.0/compiler/bin/../lib/LLVMEraseNonkernel.so -inline -inline-threshold=1048576 -erase-nonkernels -dce -globaldce -o /tmp/tmp.LeiH3VuY4Q/kernel-gfx900.hsaco.promote.bc
1. Running pass 'Function Pass Manager' on module '<stdin>'.
2. Running pass 'Module Verifier' on function '@_ZZ4mainEN67HIP_kernel_functor_name_begin_unnamed_HIP_kernel_functor_name_end_419__cxxamp_trampolineEPfS0_m'
/opt/rocm/hcc-1.0/compiler/bin/clamp-device: line 140: 18860 Segmentation fault (core dumped) $OPT -load $LIB/LLVMEraseNonkernel.so -inline -inline-threshold=1048576 -erase-nonkernels -dce -globaldce -o $2.promote.bc < $1
Generating AMD GCN kernel failed in HCC-specific opt passes for target: gfx900
clang-5.0: error: command failed with exit code 139 (use -v to see invocation)
Died at /opt/rocm/hip/bin/hipcc line 452.
Makefile:19: recipe for target 'square' failed
make: *** [square] Error 139
I was able to do the tutorial on gpuopen.com but found that hipGetDeviceCount was only returning 1 so the examples would only run on my primary GPU a GTX 980Ti. I also have an R9 Nano and an R9 Fury. The kfd driver exports 3 nodes under topology so the runtime should let me talk to them. I'm running Ubuntu 15. I was hoping to instrument hip_hcc.cpp to see what it was doing right here:
But I can't even get it to compile:
~/devel/HIP2$ make
./bin/hipcc -I/opt/hcc/include -std=c++11 -I/opt/hsa/include src/hip_hcc.cpp -c -O3 -o src/hip_hcc.o
src/hip_hcc.cpp:52:2: error: #error (USE_AM_TRACKER requries HCC version of 16074 or newer)
#error (USE_AM_TRACKER requries HCC version of 16074 or newer)
^
Died at ./bin/hipcc line 208.
Makefile:20: recipe for target 'src/hip_hcc.o' failed
make: *** [src/hip_hcc.o] Error 1
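The #error above is a plain version gate, so the same comparison can be sketched in shell before starting a build. The threshold 16074 comes from the error text, and the two sample build numbers (16045 and 16124) are the ones mentioned in the comments. `hcc_build_ok` is a hypothetical helper, and how you obtain the installed build number is left open here.

```shell
# Minimum HCC build that src/hip_hcc.cpp accepts, per the #error text.
MIN_HCC_BUILD=16074

hcc_build_ok() {
    # $1: HCC build number of the installed package
    [ "$1" -ge "$MIN_HCC_BUILD" ]
}

# Try the two builds discussed in this thread.
for build in 16045 16124; do
    if hcc_build_ok "$build"; then
        echo "HCC $build: new enough"
    else
        echo "HCC $build: too old, need $MIN_HCC_BUILD or newer"
    fi
done
```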
I made the following change to the Makefile in response to complaints. But it's still not doing anything. And it looks like it's trying to compile the code with nvcc:
mmacy@pandemonium:~/devel/HIP2$ hipcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2015 NVIDIA Corporation
Built on Tue_Aug_11_14:27:32_CDT_2015
Cuda compilation tools, release 7.5, V7.5.17
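A quick way to see which backend hipcc picks once HIP_PLATFORM is set, as suggested in the comments. This is a sketch only: `check_hip_backend` is a hypothetical helper, and it just reruns the same `hipcc --version` command shown above with the variable set for that single invocation.

```shell
check_hip_backend() {
    # $1: the hipcc executable to query (defaults to the one on PATH)
    hipcc_bin="${1:-hipcc}"
    if command -v "$hipcc_bin" >/dev/null 2>&1; then
        # Set HIP_PLATFORM for this one invocation only.
        HIP_PLATFORM=hcc "$hipcc_bin" --version
    else
        echo "hipcc not found: $hipcc_bin" >&2
        return 1
    fi
}

# Query the hipcc on PATH; carry on if it is not installed.
check_hip_backend || true
```

If the banner still reports nvcc with the variable set, the hipcc being picked up from PATH is probably not the one from the checkout you are building.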