
Support CUDA for LLVM 5 and 6 using LLVM's NVPTX backend #299

Merged (11 commits) on Sep 20, 2018

Conversation

@ProfFan (Contributor) commented Aug 24, 2018

This PR should be able to enable CUDA for LLVM 5 and 6 :)

cudaagg.t:
[CUDA] Result size: 2352

cudaaggregate.t:
[CUDA] Result size: 2420

cudaatomic.t:
[CUDA] Result size: 336

CUDA Module:
; ModuleID = 'terra'
source_filename = "terra"
target datalayout = "e-i64:64-i128:128-v16:16-v32:32-n16:32:64"
target triple = "nvptx64-unknown-cuda"

; Function Attrs: alwaysinline
define void @bar(i32*) #0 {
entry:
  %1 = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
  call void asm sideeffect "red.global.max.u32 [$0], $1;", "l,r"(i32* %0, i32 %1)
  ret void
}

; Function Attrs: nounwind readnone
declare i32 @llvm.nvvm.read.ptx.sreg.tid.x() #1

; Function Attrs: nounwind
declare void @llvm.stackprotector(i8*, i8**) #2

attributes #0 = { alwaysinline }
attributes #1 = { nounwind readnone }
attributes #2 = { nounwind }

!nvvm.annotations = !{!0}

!0 = !{void (i32*)* @bar, !"kernel", i32 1}
Generated PTX:
//
// Generated by LLVM NVPTX Back-End
//

.version 5.0
.target sm_61
.address_size 64

	// .globl	bar

.visible .entry bar(
	.param .u64 bar_param_0
)
{
	.reg .b32 	%r<2>;
	.reg .b64 	%rd<2>;

	ld.param.u64 	%rd1, [bar_param_0];
	mov.u32 	%r1, %tid.x;
	// begin inline asm
	red.global.max.u32 [%rd1], %r1;
	// end inline asm
	ret;
}



cudaconst2.t:
[CUDA] Result size: 214

cudaglobal.t:
[CUDA] Result size: 708

cudahello.t:
[CUDA] Result size: 953

0
1
2
and were done
cudaoffline.t:
[CUDA] Result size: 708

cudaoo.t:
[CUDA] Result size: 144

cudaprintf.t:
[CUDA] Result size: 1249

a = 0, b = 1.000000, c = 2
a = 1, b = 2.000000, c = 3
a = 2, b = 3.000000, c = 4
cudashared.t:
[CUDA] Result size: 690

CUDA Module:
; ModuleID = 'terra'
source_filename = "terra"
target datalayout = "e-i64:64-i128:128-v16:16-v32:32-n16:32:64"
target triple = "nvptx64-unknown-cuda"

@"_$60_global_$62_" = addrspace(3) global [1024 x i32] undef

; Function Attrs: alwaysinline
define void @bar(i32*) #0 {
entry:
  %1 = addrspacecast i32* %0 to i32 addrspace(1)*
  %2 = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
  %3 = call i8* @llvm.nvvm.ptr.shared.to.gen.p0i8.p3i8(i8 addrspace(3)* bitcast ([1024 x i32] addrspace(3)* @"_$60_global_$62_" to i8 addrspace(3)*))
  %4 = bitcast i8* %3 to i32*
  %5 = sext i32 %2 to i64
  %6 = getelementptr i32, i32* %4, i64 %5
  store i32 %2, i32* %6
  call void @llvm.nvvm.barrier0()
  %7 = sub i32 1023, %2
  %8 = sext i32 %7 to i64
  %9 = getelementptr i32, i32* %4, i64 %8
  %10 = load i32, i32* %9
  %11 = getelementptr i32, i32 addrspace(1)* %1, i64 %5
  store i32 %10, i32 addrspace(1)* %11
  ret void
}

; Function Attrs: nounwind readnone
declare i32 @llvm.nvvm.read.ptx.sreg.tid.x() #1

; Function Attrs: nounwind readnone
declare i8* @llvm.nvvm.ptr.shared.to.gen.p0i8.p3i8(i8 addrspace(3)*) #1

; Function Attrs: convergent nounwind
declare void @llvm.nvvm.barrier0() #2

; Function Attrs: nounwind
declare void @llvm.stackprotector(i8*, i8**) #3

attributes #0 = { alwaysinline }
attributes #1 = { nounwind readnone }
attributes #2 = { convergent nounwind }
attributes #3 = { nounwind }

!nvvm.annotations = !{!0}

!0 = !{void (i32*)* @bar, !"kernel", i32 1}
Generated PTX:
//
// Generated by LLVM NVPTX Back-End
//

.version 5.0
.target sm_61
.address_size 64

	// .globl	bar
.visible .shared .align 4 .b8 _$60_global_$62_[4096];

.visible .entry bar(
	.param .u64 bar_param_0
)
{
	.reg .b32 	%r<5>;
	.reg .b64 	%rd<10>;

	ld.param.u64 	%rd1, [bar_param_0];
	cvta.to.global.u64 	%rd2, %rd1;
	mov.u32 	%r1, %tid.x;
	mov.u64 	%rd3, _$60_global_$62_;
	cvta.shared.u64 	%rd4, %rd3;
	mul.wide.s32 	%rd5, %r1, 4;
	add.s64 	%rd6, %rd4, %rd5;
	st.u32 	[%rd6], %r1;
	bar.sync 	0;
	mov.u32 	%r2, 1023;
	sub.s32 	%r3, %r2, %r1;
	mul.wide.s32 	%rd7, %r3, 4;
	add.s64 	%rd8, %rd4, %rd7;
	ld.u32 	%r4, [%rd8];
	add.s64 	%rd9, %rd2, %rd5;
	st.global.u32 	[%rd9], %r4;
	ret;
}



cudatest.t:
[CUDA] Result size: 930

cudatex.t:
[CUDA] Result size: 1147

0.000000
1.000000
2.000000
3.000000
4.000000
5.000000
6.000000
7.000000
8.000000
9.000000
and were done
=================

521 tests passed. 0 tests failed.

Tested on my Mac running:

LLVM_CONFIG=/usr/local/Cellar/llvm/6.0.1/bin/llvm-config
ENABLE_CUDA=1
CUDA_HOME=/Developer/NVIDIA/CUDA-9.2

@elliottslaughter (Member)

Dude, you're awesome. Will try to get around to testing this in the next day or so.

In the meantime, would it be possible to add LLVM 5 to the Travis tests (without CUDA), assuming that's a configuration that's supposed to work now? I tried LLVM 5 on the previous PR and it didn't quite seem to work.

@elliottslaughter (Member)

All tests pass on Ubuntu 16.04 with CUDA 9.2 and LLVM 6.0. Very nice.

@elliottslaughter (Member)

LLVM 5.0 does not build for me, getting errors like:

src/llvmheaders.h:70:2: error: #error "unsupported LLVM version"
 #error "unsupported LLVM version"
  ^

Is it hard to get LLVM 5 working? I'm not sure how much it matters, but if it's easy and you don't mind doing it, it would be nice to have.

@ProfFan (Contributor Author) commented Aug 25, 2018

Sorry, forgot to cherry-pick the commit into the PR branch :) Should be fixed now :)

@ProfFan (Contributor Author) commented Aug 25, 2018

LLVM 5.0 is producing segfaults on macOS; the reference counting logic appears to be broken. Blacklisted for now.

@elliottslaughter (Member)

Sorry for the delay, I'm hoping to get someone familiar with Regent CUDA support to look at this, and the people I've contacted are busy until later this week. (Regent is relevant here because it's just one of the bigger users of Terra, and it happens to provide CUDA support.)

At the moment if I run a simple CUDA example with Regent I get the following. It may be that this is entirely Regent's fault, but I'd just like to double check in case there might be some problem with how the new CUDA support works in Terra.

$ LLVM_CONFIG=llvm-config-6.0 CLANG=clang-6.0 CUDA="$CUDA_HOME" ./install.py --debug --cuda
$ ./regent.py examples/circuit_bishop.rg -fcuda 1 -ll:gpu 1
warning: Linking two modules of different data layouts: '/usr/local/cuda-9.2/nvvm/libdevice/libdevice.10.bc' is '' whereas 'external' is 'e-m:e-i64:64-f80:128-n8:16:32:64-S128'

terra: src/tcompiler.cpp:642: llvm::StructType* Types::CreateStruct(Obj*): Assertion `df' failed.

And the backtrace is:

#0  0x00007ffff6931428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
#1  0x00007ffff693302a in __GI_abort () at abort.c:89
#2  0x00007ffff6929bd7 in __assert_fail_base (fmt=<optimized out>, assertion=assertion@entry=0x32399e0 "df", file=file@entry=0x32396ee "src/tcompiler.cpp", 
    line=line@entry=642, function=function@entry=0x323ca00 <Types::CreateStruct(Obj*)::__PRETTY_FUNCTION__> "llvm::StructType* Types::CreateStruct(Obj*)")
    at assert.c:92
#3  0x00007ffff6929c82 in __GI___assert_fail (assertion=0x32399e0 "df", file=0x32396ee "src/tcompiler.cpp", line=642, 
    function=0x323ca00 <Types::CreateStruct(Obj*)::__PRETTY_FUNCTION__> "llvm::StructType* Types::CreateStruct(Obj*)") at assert.c:101
#4  0x00000000009b956a in Types::CreateStruct (this=0x7fffffffd930, typ=0x7fffffffcb10) at src/tcompiler.cpp:642
#5  0x00000000009b8f66 in Types::GetIncomplete (this=0x7fffffffd930, typ=0x7fffffffcb10) at src/tcompiler.cpp:564
#6  0x00000000009ba072 in Types::Get (this=0x7fffffffd930, typ=0x7fffffffcb10) at src/tcompiler.cpp:722
#7  0x00000000009b9b7b in Types::LayoutStruct (this=0x7fffffffd930, st=0x58ffe30, typ=0x7fffffffcc90) at src/tcompiler.cpp:676
#8  0x00000000009ba0fa in Types::Get (this=0x7fffffffd930, typ=0x7fffffffcc90) at src/tcompiler.cpp:727
#9  0x00000000009b9b7b in Types::LayoutStruct (this=0x7fffffffd930, st=0x58ffe10, typ=0x7fffffffcfb0) at src/tcompiler.cpp:676
#10 0x00000000009ba0fa in Types::Get (this=0x7fffffffd930, typ=0x7fffffffcfb0) at src/tcompiler.cpp:727
#11 0x00000000009bd73d in FunctionEmitter::getType (this=0x7fffffffd770, v=0x7fffffffcfb0) at src/tcompiler.cpp:1472
#12 0x00000000009c0bc8 in FunctionEmitter::emitExpRaw (this=0x7fffffffd770, exp=0x7fffffffd0f0) at src/tcompiler.cpp:2032
#13 0x00000000009bfb5c in FunctionEmitter::emitExp (this=0x7fffffffd770, exp=0x7fffffffd0f0, loadlvalue=true) at src/tcompiler.cpp:1836
#14 0x00000000009c2de9 in FunctionEmitter::emitExpressionList (this=0x7fffffffd770, exps=0x7fffffffd290, loadlvalue=true, results=0x7fffffffd300)
    at src/tcompiler.cpp:2380
#15 0x00000000009c4845 in FunctionEmitter::emitStmt (this=0x7fffffffd770, stmt=0x7fffffffd360) at src/tcompiler.cpp:2616
#16 0x00000000009c304c in FunctionEmitter::emitStmtList (this=0x7fffffffd770, stmts=0x7fffffffd550) at src/tcompiler.cpp:2398
#17 0x00000000009c36dd in FunctionEmitter::emitStmt (this=0x7fffffffd770, stmt=0x7fffffffd600) at src/tcompiler.cpp:2467
#18 0x00000000009bd50a in FunctionEmitter::emitBody (this=0x7fffffffd770) at src/tcompiler.cpp:1444
#19 0x00000000009bcb69 in FunctionEmitter::emitFunction (this=0x7fffffffd770, funcobj_=0x7fffffffd760) at src/tcompiler.cpp:1355
#20 0x00000000009ae21b in EmitFunction (CU=0x5771d80, funcdecl=0x7fffffffd920, user=0x0) at src/tcompiler.cpp:2656
#21 0x00000000009ae553 in terra_compilationunitaddvalue (L=0x40000378) at src/tcompiler.cpp:2685
#22 0x0000000003235c36 in lj_BC_FUNCC ()
#23 0x00000000031f6210 in lua_pcall ()
#24 0x00000000009aa378 in docall (L=0x40000378, narg=8, clear=0) at src/main.cpp:332
#25 0x00000000009a997a in main (argc=10, argv=0x7fffffffdbd8) at src/main.cpp:109

@ProfFan (Contributor Author) commented Aug 29, 2018

@elliottslaughter I observed the same on Opt. It seems to be an issue with anonymous functions. Here is a (poorly formatted) log:

$ INCLUDE_PATH="/Developer/NVIDIA/CUDA-9.2/include" LD_LIBRARY_PATH=/usr/local/cuda/lib:$LD_LIBRARY_PATH ./image_warping

width 512, height 512
numActivePixels: 101406
[dofile] 0 CODE #include <stdio.h>
#include <stdlib.h>

[mangle] fopen=_fopen
[mangle] fputs=_fputs
[mangle] freopen=_freopen
[mangle] fwrite=_fwrite
[mangle] fdopen=_fdopen
[mangle] popen=_popen
[mangle] tempnam=_tempnam
[mangle] getrlimit=_getrlimit
[mangle] setrlimit=_setrlimit
[mangle] wait=_wait
[mangle] waitpid=_waitpid
[mangle] waitid=_waitid
[mangle] strtod=_strtod
[mangle] strtof=_strtof
[mangle] system=_system
[mangle] putenv=_putenv
[mangle] realpath=_realpath$DARWIN_EXTSN
[mangle] setenv=_setenv
[mangle] setkey=_setkey
[mangle] unsetenv=_unsetenv
[mangle] daemon=_daemon$1050
[dofile] 1 CODE #include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <math.h>
#include <cuda_runtime.h>
#ifdef _WIN32
	#include <io.h>
#endif

[mangle] fopen=_fopen
[mangle] fputs=_fputs
[mangle] freopen=_freopen
[mangle] fwrite=_fwrite
[mangle] fdopen=_fdopen
[mangle] popen=_popen
[mangle] tempnam=_tempnam
[mangle] strerror=_strerror
[mangle] getrlimit=_getrlimit
[mangle] setrlimit=_setrlimit
[mangle] wait=_wait
[mangle] waitpid=_waitpid
[mangle] waitid=_waitid
[mangle] strtod=_strtod
[mangle] strtof=_strtof
[mangle] system=_system
[mangle] putenv=_putenv
[mangle] realpath=_realpath$DARWIN_EXTSN
[mangle] setenv=_setenv
[mangle] setkey=_setkey
[mangle] unsetenv=_unsetenv
[mangle] daemon=_daemon$1050
[LUA AddValue]
Key: 	Opt/API/src/util.t:118: terra deviceMajorComputeCapability() : int32
Opt/API/src/util.t:119:     var deviceID : int32
Opt/API/src/util.t:120:     cudaGetDevice(&deviceID)
Opt/API/src/util.t:121:     var majorComputeCapability : int32
Opt/API/src/util.t:122:     cudaDeviceGetAttribute(&majorComputeCapability, [uint32](75), deviceID)
Opt/API/src/util.t:123:     return majorComputeCapability
Opt/API/src/util.t:118: end

Val: 	nil
[CUAddValue] Obj 63 emitted as function
[emitFunction] fstate: nildeviceMajorComputeCapability
[emitFunction] fstate: extern: nilcudaGetDevice
[emitFunction] fstate: extern: nilcudaDeviceGetAttribute
warning: Linking two modules of different data layouts: '/usr/local/cuda/nvvm/libdevice/libdevice.compute_30.10.bc' is '' whereas 'external' is 'e-m:o-i64:64-f80:128-n8:16:32:64-S128'

[dofile] 2 CODE #include "math.h"

[LUA AddValue]
Key: 	g/Opt/API/src/ad.t:391: terra self.impl($232 : float) : float
                            return [float](log([double]($232)))
                        end

Val: 	nil
[CUAddValue] Obj 63 emitted as function
[emitFunction] fstate: nilself.impl
[emitFunction] fstate: extern: nillog
Using Opt v0.2.2
[LUA AddValue]
Key: 	g/Opt/API/src/o.t:2536: terra opt.PlanFree(plan : &opt.Plan) : {}
g/Opt/API/src/o.t:2537:     (@plan).free((@plan).data)
g/Opt/API/src/o.t:2538:     <constant:&{&opt.Plan} -> {}>(plan)
[string "/std.t"]:67:       anon ([string "/std.t"]:78)(&@plan)
g/Opt/API/src/o.t:2536: end

Val: 	nil
[CUAddValue] Obj 63 emitted as function
[emitFunction] fstate: nilopt.PlanFree
[emitFunction] fstate: nilanon ([string "/std.t"]:78)
[emitFunction] fstate: nilanon ([string "/std.t"]:95)
[emitFunction] fstate: extern: nilfree
[LUA AddValue]
Key: 	g/Opt/API/src/o.t:2556: terra opt.SetSolverParameter(plan : &opt.Plan,name : &int8,value : &opaque) : {}
g/Opt/API/src/o.t:2557:     return (@plan).setsolverparameter((@plan).data, name, value)
g/Opt/API/src/o.t:2556: end

Val: 	nil
[CUAddValue] Obj 63 emitted as function
[emitFunction] fstate: nilopt.SetSolverParameter
[LUA AddValue]
Key: 	g/Opt/API/src/o.t:2552: terra opt.ProblemCurrentCost(plan : &opt.Plan) : double
g/Opt/API/src/o.t:2553:     return (@plan).cost((@plan).data)
g/Opt/API/src/o.t:2552: end

Val: 	nil
[CUAddValue] Obj 63 emitted as function
[emitFunction] fstate: nilopt.ProblemCurrentCost
[LUA AddValue]
Key: 	g/Opt/API/src/o.t:2526: terra opt.ProblemDelete(p : &opt.Problem) : {}
g/Opt/API/src/o.t:2527:     var id : int32 = [int32]([int64](p))
g/Opt/API/src/o.t:2528:     <constant:&{int32} -> {}>(id)
g/Opt/API/src/o.t:2526: end

Val: 	nil
[CUAddValue] Obj 63 emitted as function
[emitFunction] fstate: nilopt.ProblemDelete
[LUA AddValue]
Key: 	g/Opt/API/src/o.t:2542: terra opt.ProblemInit(plan : &opt.Plan,params : &&opaque) : {}
g/Opt/API/src/o.t:2543:     return (@plan).init((@plan).data, params)
g/Opt/API/src/o.t:2542: end

Val: 	nil
[CUAddValue] Obj 63 emitted as function
[emitFunction] fstate: nilopt.ProblemInit
[LUA AddValue]
Key: 	g/Opt/API/src/o.t:2548: terra opt.ProblemSolve(plan : &opt.Plan,params : &&opaque) : {}
g/Opt/API/src/o.t:2549:     opt.ProblemInit(plan, params)
g/Opt/API/src/o.t:2550:     while opt.ProblemStep(plan, params) ~= 0 do
                            end
g/Opt/API/src/o.t:2548: end

Val: 	nil
[CUAddValue] Obj 63 emitted as function
[emitFunction] fstate: nilopt.ProblemSolve
[emitFunction] [emitFunction] fstate: nilopt.ProblemStep
[LUA AddValue]
Key: 	g/Opt/API/src/o.t:2545: terra opt.ProblemStep(plan : &opt.Plan,params : &&opaque) : int32
g/Opt/API/src/o.t:2546:     return (@plan).step((@plan).data, params)
g/Opt/API/src/o.t:2545: end

Val: 	nil
[CUAddValue] Obj 63 emitted as function
[emitFunction] [LUA AddValue]
Key: 	g/Opt/API/src/o.t:2530: terra opt.ProblemPlan(problem : &opt.Problem,dimensions : &uint32) : &opt.Plan
g/Opt/API/src/o.t:2531:     var p : &opt.Plan = [&opt.Plan](nil)
g/Opt/API/src/o.t:2532:     <constant:&{int32,&uint32,&&opt.Plan} -> {}>([int32]([int64](problem)), dimensions, &p)
g/Opt/API/src/o.t:2533:     return p
g/Opt/API/src/o.t:2530: end

Val: 	nil
[CUAddValue] Obj 63 emitted as function
[emitFunction] fstate: nilopt.ProblemPlan
[LUA AddValue]
Key: 	g/Opt/API/src/o.t:2521: terra opt.ProblemDefine(filename : &int8,kind : &int8) : &opt.Problem
g/Opt/API/src/o.t:2522:     var id : int32
g/Opt/API/src/o.t:2523:     <constant:&{&int8,&int8,&int32} -> {}>(filename, kind, &id)
g/Opt/API/src/o.t:2524:     return [&opt.Problem](id)
g/Opt/API/src/o.t:2521: end

Val: 	nil
[CUAddValue] Obj 63 emitted as function
[emitFunction] fstate: nilopt.ProblemDefine
[LUA AddValue]
Key: 	g/Opt/API/src/ad.t:391: terra self.impl($234 : float,$235 : float) : float
                            return [float](pow([double]($234), [double]($235)))
                        end

Val: 	nil
[CUAddValue] Obj 63 emitted as function
[emitFunction] fstate: nilself.impl
[emitFunction] fstate: extern: nilpow
nUnknowns = 	786432
nResiduals = 	0 + 262144 * 10

nnz = 	0 + 262144 * 26

[LUA AddValue]
Key: 	string "<string>"]:116: terra initcuda(CX : &&CUctx_st,D : &int32,version : &uint64,$145 : &int8,$146 : uint64) : uint32
string "<string>"]:118:     if $146 > [uint64](0) then
                                $145[0] = [int8](0)
                            end
string "<string>"]:102:     var r : uint32 = cuInit([uint32](0))
string "<string>"]:103:     if r ~= [uint32](0) then
string "<string>"]:104:         if $145 ~= [&int8](nil) then
string "<string>"]:105:             var start : uint64 = strlen($145)
string "<string>"]:106:             if $146 - start > [uint64](0) then
string "<string>"]:107:                 var s : &int8
string "<string>"]:108:                 cuGetErrorString(r, &s)
string "<string>"]:109:                 snprintf($145 + start, $146 - start, "%s: cuda reported error %d: %s", "cuInit", r, s)
string "<string>"]:106:             end
string "<string>"]:104:         end
string "<string>"]:112:         return r
string "<string>"]:103:     end
string "<string>"]:102:     var r$1 : uint32 = cuCtxGetCurrent(CX)
string "<string>"]:103:     if r$1 ~= [uint32](0) then
string "<string>"]:104:         if $145 ~= [&int8](nil) then
string "<string>"]:105:             var start : uint64 = strlen($145)
string "<string>"]:106:             if $146 - start > [uint64](0) then
string "<string>"]:107:                 var s : &int8
string "<string>"]:108:                 cuGetErrorString(r$1, &s)
string "<string>"]:109:                 snprintf($145 + start, $146 - start, "%s: cuda reported error %d: %s", "cuCtxGetCurrent", r$1, s)
string "<string>"]:106:             end
string "<string>"]:104:         end
string "<string>"]:112:         return r$1
string "<string>"]:103:     end
string "<string>"]:121:     if @CX ~= [&CUctx_st](nil) then
string "<string>"]:102:         var r$2 : uint32 = cuCtxGetDevice(D)
string "<string>"]:103:         if r$2 ~= [uint32](0) then
string "<string>"]:104:             if $145 ~= [&int8](nil) then
string "<string>"]:105:                 var start : uint64 = strlen($145)
string "<string>"]:106:                 if $146 - start > [uint64](0) then
string "<string>"]:107:                     var s : &int8
string "<string>"]:108:                     cuGetErrorString(r$2, &s)
string "<string>"]:109:                     snprintf($145 + start, $146 - start, "%s: cuda reported error %d: %s", "cuCtxGetDevice", r$2, s)
string "<string>"]:106:                 end
string "<string>"]:104:             end
string "<string>"]:112:             return r$2
string "<string>"]:103:         end
string "<string>"]:125:     else
string "<string>"]:102:         var r$2 : uint32 = cuDeviceGet(D, 0)
string "<string>"]:103:         if r$2 ~= [uint32](0) then
string "<string>"]:104:             if $145 ~= [&int8](nil) then
string "<string>"]:105:                 var start : uint64 = strlen($145)
string "<string>"]:106:                 if $146 - start > [uint64](0) then
string "<string>"]:107:                     var s : &int8
string "<string>"]:108:                     cuGetErrorString(r$2, &s)
string "<string>"]:109:                     snprintf($145 + start, $146 - start, "%s: cuda reported error %d: %s", "cuDeviceGet", r$2, s)
string "<string>"]:106:                 end
string "<string>"]:104:             end
string "<string>"]:112:             return r$2
string "<string>"]:103:         end
string "<string>"]:102:         var r$3 : uint32 = cuCtxCreate_v2(CX, [uint32](0), @D)
string "<string>"]:103:         if r$3 ~= [uint32](0) then
string "<string>"]:104:             if $145 ~= [&int8](nil) then
string "<string>"]:105:                 var start : uint64 = strlen($145)
string "<string>"]:106:                 if $146 - start > [uint64](0) then
string "<string>"]:107:                     var s : &int8
string "<string>"]:108:                     cuGetErrorString(r$3, &s)
string "<string>"]:109:                     snprintf($145 + start, $146 - start, "%s: cuda reported error %d: %s", "cuCtxCreate_v2", r$3, s)
string "<string>"]:106:                 end
string "<string>"]:104:             end
string "<string>"]:112:             return r$3
string "<string>"]:103:         end
string "<string>"]:121:     end
string "<string>"]:128:     var major : int32
                            var minor : int32
string "<string>"]:102:     var r$2 : uint32 = cuDeviceComputeCapability(&major, &minor, @D)
string "<string>"]:103:     if r$2 ~= [uint32](0) then
string "<string>"]:104:         if $145 ~= [&int8](nil) then
string "<string>"]:105:             var start : uint64 = strlen($145)
string "<string>"]:106:             if $146 - start > [uint64](0) then
string "<string>"]:107:                 var s : &int8
string "<string>"]:108:                 cuGetErrorString(r$2, &s)
string "<string>"]:109:                 snprintf($145 + start, $146 - start, "%s: cuda reported error %d: %s", "cuDeviceComputeCapability", r$2, s)
string "<string>"]:106:             end
string "<string>"]:104:         end
string "<string>"]:112:         return r$2
string "<string>"]:103:     end
string "<string>"]:130:     @version = [uint64](major * 10 + minor)
string "<string>"]:131:     return [uint32](0)
string "<string>"]:116: end

Val: 	nil
[CUAddValue] Obj 63 emitted as function
[emitFunction] fstate: nilinitcuda
[emitFunction] fstate: extern: nilcuInit
[emitFunction] fstate: extern: nilstrlen
[emitFunction] fstate: extern: nilcuGetErrorString
[emitFunction] fstate: extern: nilsnprintf
[emitFunction] fstate: extern: nilcuCtxGetCurrent
[emitFunction] [emitFunction] [emitFunction] [emitFunction] fstate: extern: nilcuCtxGetDevice
[emitFunction] [emitFunction] [emitFunction] [emitFunction] fstate: extern: nilcuDeviceGet
[emitFunction] [emitFunction] [emitFunction] [emitFunction] fstate: extern: nilcuCtxCreate_v2
[emitFunction] [emitFunction] [emitFunction] [emitFunction] fstate: extern: nilcuDeviceComputeCapability
[emitFunction] [emitFunction] [emitFunction] [CUDA AddValue Kern]
[LUA AddValue]
Key: 	PCGInit1_Finish_W_H_1535503461
Val: 	[string "<string>"]:72: terra anon ([string "<string>"]:72)($2728 : &{&opaque,&&opaque} -> {},$2729 : &{&opaque,&int8,&opaque} -> {},$2730 : &{&opaque} -> {},$2731 : &{&opaque,&&opaque} -> int32,$2732 : &{&opaque} -> double,$2733 : &opaque,$2734 : &float_2,$2735 : uint64,$2736 : &float_1,$2737 : uint64,$2738 : &float_2,$2739 : uint64,$2740 : &float_2,$2741 : uint64,$2742 : &float_1,$2743 : uint64,$2744 : float,$2745 : float,$2746 : float,$2747 : float,$2748 : float,$2749 : float,$2750 : float,$2751 : float,$2752 : float,$2753 : float,$2754 : float,$2755 : int32,$2756 : int32,$2757 : int32,$2758 : int32,$2759 : &float,$2760 : &float_2,$2761 : uint64,$2762 : &float_1,$2763 : uint64,$2764 : &float_2,$2765 : uint64,$2766 : &float_1,$2767 : uint64,$2768 : &float_2,$2769 : uint64,$2770 : &float_1,$2771 : uint64,$2772 : &float_2,$2773 : uint64,$2774 : &float_1,$2775 : uint64,$2776 : &float_2,$2777 : uint64,$2778 : &float_1,$2779 : uint64,$2780 : &float_2,$2781 : uint64,$2782 : &float_1,$2783 : uint64,$2784 : &float_2,$2785 : uint64,$2786 : &float_1,$2787 : uint64,$2788 : &float_2,$2789 : uint64,$2790 : &float_1,$2791 : uint64,$2792 : &float_2,$2793 : uint64,$2794 : &float_1,$2795 : uint64,$2796 : &float_2,$2797 : uint64,$2798 : &float_1,$2799 : uint64,$2800 : &float_2,$2801 : uint64,$2802 : &float_1,$2803 : uint64,$2804 : &float_2,$2805 : uint64,$2806 : &float_1,$2807 : uint64,$2808 : &float,$2809 : &float,$2810 : &float,$2811 : &float,$2812 : &float,$2813 : &Array(util.TimingInfo),$2814 : &CUevent_st,$2815 : float,$2816 : &float,$2817 : &int32,$2818 : &int32,$2819 : &float,$2820 : &int32,$2821 : &int32,$2822 : &float,$2823 : &int32,$2824 : &int32,$2825 : int32,$2826 : &float) : {}
[string "<string>"]:73:     var $2727 : PlanData
[string "<string>"]:74:     $2727.plan.init, $2727.plan.setsolverparameter, $2727.plan.free, $2727.plan.step, $2727.plan.cost, $2727.plan.data, $2727.parameters.X.Offset.data, $2727.parameters.X.Offset.tex, $2727.parameters.X.Angle.data, $2727.parameters.X.Angle.tex, $2727.parameters.UrShape.data, $2727.parameters.UrShape.tex, $2727.parameters.Constraints.data, $2727.parameters.Constraints.tex, $2727.parameters.Mask.data, $2727.parameters.Mask.tex, $2727.parameters.w_fitSqrt, $2727.parameters.w_regSqrt, $2727.solverparameters.min_relative_decrease, $2727.solverparameters.min_trust_region_radius, $2727.solverparameters.max_trust_region_radius, $2727.solverparameters.q_tolerance, $2727.solverparameters.function_tolerance, $2727.solverparameters.trust_region_radius, $2727.solverparameters.radius_decrease_factor, $2727.solverparameters.min_lm_diagonal, $2727.solverparameters.max_lm_diagonal, $2727.solverparameters.residual_reset_period, $2727.solverparameters.nIter, $2727.solverparameters.nIterations, $2727.solverparameters.lIterations, $2727.scratch, $2727.delta.Offset.data, $2727.delta.Offset.tex, $2727.delta.Angle.data, $2727.delta.Angle.tex, $2727.r.Offset.data, $2727.r.Offset.tex, $2727.r.Angle.data, $2727.r.Angle.tex, $2727.b.Offset.data, $2727.b.Offset.tex, $2727.b.Angle.data, $2727.b.Angle.tex, $2727.Adelta.Offset.data, $2727.Adelta.Offset.tex, $2727.Adelta.Angle.data, $2727.Adelta.Angle.tex, $2727.z.Offset.data, $2727.z.Offset.tex, $2727.z.Angle.data, $2727.z.Angle.tex, $2727.p.Offset.data, $2727.p.Offset.tex, $2727.p.Angle.data, $2727.p.Angle.tex, $2727.Ap_X.Offset.data, $2727.Ap_X.Offset.tex, $2727.Ap_X.Angle.data, $2727.Ap_X.Angle.tex, $2727.CtC.Offset.data, $2727.CtC.Offset.tex, $2727.CtC.Angle.data, $2727.CtC.Angle.tex, $2727.preconditioner.Offset.data, $2727.preconditioner.Offset.tex, $2727.preconditioner.Angle.data, $2727.preconditioner.Angle.tex, $2727.SSq.Offset.data, $2727.SSq.Offset.tex, $2727.SSq.Angle.data, $2727.SSq.Angle.tex, $2727.g.Offset.data, $2727.g.Offset.tex, $2727.g.Angle.data, $2727.g.Angle.tex, $2727.prevX.Offset.data, $2727.prevX.Offset.tex, $2727.prevX.Angle.data, $2727.prevX.Angle.tex, $2727.scanAlphaNumerator, $2727.scanAlphaDenominator, $2727.scanBetaNumerator, $2727.modelCost, $2727.q, $2727.timer.timingInfo, $2727.endSolver, $2727.prevCost, $2727.J_csrValA, $2727.J_csrColIndA, $2727.J_csrRowPtrA, $2727.JT_csrValA, $2727.JT_csrRowPtrA, $2727.JT_csrColIndA, $2727.JTJ_csrValA, $2727.JTJ_csrRowPtrA, $2727.JTJ_csrColIndA, $2727.JTJ_nnz, $2727.Jp = $2728, $2729, $2730, $2731, $2732, $2733, $2734, $2735, $2736, $2737, $2738, $2739, $2740, $2741, $2742, $2743, $2744, $2745, $2746, $2747, $2748, $2749, $2750, $2751, $2752, $2753, $2754, $2755, $2756, $2757, $2758, $2759, $2760, $2761, $2762, $2763, $2764, $2765, $2766, $2767, $2768, $2769, $2770, $2771, $2772, $2773, $2774, $2775, $2776, $2777, $2778, $2779, $2780, $2781, $2782, $2783, $2784, $2785, $2786, $2787, $2788, $2789, $2790, $2791, $2792, $2793, $2794, $2795, $2796, $2797, $2798, $2799, $2800, $2801, $2802, $2803, $2804, $2805, $2806, $2807, $2808, $2809, $2810, $2811, $2812, $2813, $2814, $2815, $2816, $2817, $2818, $2819, $2820, $2821, $2822, $2823, $2824, $2825, $2826
[string "<string>"]:75:     return kernels.PCGInit1_Finish($2727)
[string "<string>"]:72: end

[CUAddValue] Obj 63 emitted as function
[emitFunction] fstate: nilanon ([string "<string>"]:72)
__makeeverythinginclanglive_1
stack trace:
0   image_warping                       0x0000000107bfdb5a _ZL15printStackTraceP7__sFILEj + 122
1   image_warping                       0x0000000107bfc912 _ZN5Types12CreateStructEP3Obj + 162
2   image_warping                       0x0000000107bfb2c8 _ZN5Types13GetIncompleteEP3Obj + 792
3   image_warping                       0x0000000107bfb121 _ZN5Types13GetIncompleteEP3Obj + 369
4   image_warping                       0x0000000107bfa513 _ZN5Types3GetEP3Obj + 115
5   image_warping                       0x0000000107bf8de7 _ZN12CCallingConv16ClassifyArgumentEP3ObjPiS2_ + 103
6   image_warping                       0x0000000107bf8976 _ZN12CCallingConv8ClassifyEP3ObjS1_PNS_14ClassificationE + 422
7   image_warping                       0x0000000107bf84a0 _ZN12CCallingConv16ClassifyFunctionEP3Obj + 192
8   image_warping                       0x0000000107bf780c _ZN12CCallingConv14CreateFunctionEPN4llvm6ModuleEP3ObjRKNS0_5TwineE + 44
9   image_warping                       0x0000000107befc71 _ZN15FunctionEmitter12emitFunctionEP3Obj + 1201
10  image_warping                       0x0000000107bef591 _Z12EmitFunctionP20TerraCompilationUnitP3ObjP18TerraFunctionState + 113
11  image_warping                       0x0000000107bec06f _ZL29terra_compilationunitaddvalueP9lua_State + 1151
12  image_warping                       0x00000001094f0db6 lj_BC_FUNCC + 52
Assertion failed: (df), function CreateStruct, file src/tcompiler.cpp, line 680.

@ProfFan (Contributor Author) commented Aug 29, 2018

@elliottslaughter Any idea? It only seems to crash when both k and v in compilationunit:addvalue(k,v) are non-nil, so I suspect this may be another issue, not limited to CUDA.

@ProfFan (Contributor Author) commented Aug 29, 2018

Also note that the line __makeeverythinginclanglive_1 is printed by:

StructType * CreateStruct(Obj * typ) {
        //check to see if it was initialized externally first
        if(typ->boolean("llvm_definingfunction")) {
            const char * name = typ->string("llvm_definingfunction");
            llvm::errs() << name << "\n"; // HERE
            printStackTrace();
            Function * df = CU->TT->external->getFunction(name); assert(df);
            int argpos = typ->number("llvm_argumentposition");
            StructType * st = cast<StructType>(df->getFunctionType()->getParamType(argpos)->getPointerElementType());
            assert(st);
            return st;
        }

What is the purpose of this __makeeverythinginclanglive_1? @zdevito

@ProfFan (Contributor Author) commented Aug 29, 2018

@zdevito I dumped the offending module:

; ModuleID = 'external'
source_filename = "external"
target triple = "nvptx64-nvidia-cuda"

And it is empty. Needs more investigation though.

@elliottslaughter (Member)

For what it's worth, without CUDA the Regent test suite passes on LLVM 6. So I'm not finding any evidence of a problem without CUDA at the moment.

@ProfFan (Contributor Author) commented Sep 18, 2018

I finally found the cause of this issue! In tcompiler.cpp:

        if(typ->boolean("llvm_definingfunction")) {
            const char * name = typ->string("llvm_definingfunction");
            llvm::errs() << name << "\n";
            printStackTrace();
            CU->TT->external->print(llvm::errs(), nullptr);
            llvm::errs().flush();
            Function * df = CU->TT->external->getFunction(name); assert(df);
            int argpos = typ->number("llvm_argumentposition");
            StructType * st = cast<StructType>(df->getFunctionType()->getParamType(argpos)->getPointerElementType());
            assert(st);
            return st;
        }

This is not needed at all, as the symbols we need may not be in the external module for the current session. If I remove this if, Terra no longer crashes at the df assertion. More testing is needed, as I think the LLVM NVPTX backend has some incompatibility issues with the NVCC backend used by things like Regent and Opt.

@ProfFan (Contributor Author) commented Sep 18, 2018

Also, I tested Opt with CUDA on LLVM 3.9, which runs fine; however, it does not for 5 and 6. This suggests a problem in LLVM's PTX backend.

Opt/API/src/util.t:873: cuModuleLoadData: cuda reported error 218: a PTX JIT compilation failed
stack traceback:
	[C]: in function 'error'
	[string "<string>"]:250: in function 'cudacompile'
	...Projects/Development/Robotics/Optlang/Opt/API/src/util.t:873: in function 'makeGPUFunctions'
	...ment/Robotics/Optlang/Opt/API/src/solverGPUGaussNewton.t:751: in function 'compilePlan'
	...an/Projects/Development/Robotics/Optlang/Opt/API/src/o.t:870: in function <...an/Projects/Development/Robotics/Optlang/Opt/API/src/o.t:862>
	[C]: in function 'xpcall'
	...an/Projects/Development/Robotics/Optlang/Opt/API/src/o.t:862: in function <...an/Projects/Development/Robotics/Optlang/Opt/API/src/o.t:861>
Assertion failed: (m_plan), function OptSolver, file src/../../shared/OptSolver.h, line 56.

@ProfFan (Contributor Author) commented Sep 18, 2018

After solving the df assert problem, the only (and biggest) outstanding problem surfaced:

LLVM's NVPTX backend is not able to handle relocatable code (RDC)!

We would probably have to wait for upstream (hopefully LLVM 9).

========= Program hit CUDA_ERROR_INVALID_PTX (error 218) due to "a PTX JIT compilation failed" on CUDA API call to cuModuleLoadDataEx.
Loading PTX...Projects/Development/Robotics/Optlang/Opt/API/src/util.t:873: ptxas fatal   : Unresolved extern function '__nv_sqrtf'cuModuleLoadDataEx: cuda reported error 218: a PTX JIT compilation failed
stack traceback:
	[C]: in function 'error'
	[string "<string>"]:256: in function 'cudacompile'
	...Projects/Development/Robotics/Optlang/Opt/API/src/util.t:873: in function 'makeGPUFunctions'
	...ment/Robotics/Optlang/Opt/API/src/solverGPUGaussNewton.t:751: in function 'compilePlan'
	...an/Projects/Development/Robotics/Optlang/Opt/API/src/o.t:870: in function <...an/Projects/Development/Robotics/Optlang/Opt/API/src/o.t:862>
	[C]: in function 'xpcall'
	...an/Projects/Development/Robotics/Optlang/Opt/API/src/o.t:862: in function <...an/Projects/Development/Robotics/Optlang/Opt/API/src/o.t:861>
Assertion failed: (m_plan), function OptSolver, file src/../../shared/OptSolver.h, line 56.
========= Error: process didn't terminate successfully

See http://lists.llvm.org/pipermail/llvm-dev/2017-August/116868.html

This would essentially prevent any form of dynamic linking with pre-compiled code (i.e. all code has to be generated at run time in Terra).

@ProfFan (Contributor Author) commented Sep 18, 2018

The good news is that I got the NVVM backend to work with Opt, but only partly. There are 3 possible situations:

  1. I make every kernel non-inline, which fixes all CUDA tests except cudaoffline and cudashared
  2. I make every kernel inline (the default), and Optlang works, but all CUDA tests fail
  3. I switch to the NVPTX backend, and all CUDA tests pass, but Optlang fails

The problem with NVCC is that by default all kernels are inline, which causes them to be optimized out by NVVM, since nothing else uses them.
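
For reference, a minimal sketch of what option 1 above means at the Terra level (illustrative only, not the actual change in this PR; it assumes the usual cudalib intrinsic wrappers and the setinlined method on terra functions):

-- Sketch: mark a kernel non-inline before handing it to cudacompile.
-- Kernel body and names are made up for illustration.
local tid = cudalib.nvvm_read_ptx_sreg_tid_x

terra saxpy(y : &float, x : &float, a : float)
  var i = tid()
  y[i] = y[i] + a * x[i]
end
saxpy:setinlined(false) -- keep NVVM from inlining and then discarding the kernel

local K = terralib.cudacompile({ saxpy = saxpy })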

Awwwww.

@manopapad (Contributor) commented Sep 18, 2018

I'm trying to follow the discussion on this issue and I'm a bit lost. Do you mind explaining some things for me @ProfFan?

Regarding #299 (comment), I am not familiar with an NVCC backend for LLVM; could you please explain further what you're doing?

Regarding #299 (comment), I don't understand why LLVM's support for relocatable CUDA code matters for us, if we're not trying to link against externally compiled CUDA libraries. The particular function you're trying to link against, __nv_sqrtf, should be present in libdevice, which is distributed as LLVM bitcode, and should be getting linked in automatically (see https://github.com/zdevito/terra/blob/master/src/tcuda.cpp#L135-L144 and https://github.com/zdevito/terra/blob/master/src/cudalib.lua#L77-L97).

Regarding #299 (comment), I remember coming across this code in the past. I had concluded that the reason for having llvm_definingfunction was to make sure all externally-defined types were visible, by creating a fake function that carried one argument of each such type. I also hit a missing-symbol issue that surfaced as the same df assertion failure you came across, but the underlying reason was that I had just changed Terra to use a different compilation module for CUDA code vs. native code (previously Terra used the same module for both, which happened to work, until you wanted to run specialized NVPTX passes, specifically to desugar some fake instructions from libdevice). Any types declared in a header included through terra.includec are by default added only to the native code module, so you can't use them in CUDA code.

Edit: wording fix

@ProfFan (Contributor Author) commented Sep 19, 2018

@manopapad Sorry for the confusion in my wording :D

For your first question, it is actually the NVVM backend (edited).

For the 2nd question, it will be linked by the NVVM backend; however, when using the NVPTX backend, I am still finding a way to do the linking. There are SO posts indicating that this is an LLVM issue, which made me a little bit confused. Sorry again for the bad phrasing. (edited)

For the last question, I understand the purpose of having an extern function that uses all the symbols to keep them from being optimized out; however, that function is really just a no-op, so we only need it to be present in the module, not at runtime, am I right?

Thank you for your kind comments.

@ProfFan (Contributor Author) commented Sep 19, 2018

@manopapad So I rechecked and added the Linking pass. Now it works flawlessly :) (Tested with Optlang at current master, LLVM 6.0).

In retrospect: I initially missed this because Terra's test suite does not use the libdevice math functions, so the tests pass fine even when there is no linking step at all!

@elliottslaughter Please test against other programs as well :)

@elliottslaughter (Member)

@ProfFan Will test soon, but is there a simple test we could do to catch the cases that weren't being tested before? (Is it just testing anything in libdevice at all?)

Thanks again for working on this!

@elliottslaughter (Member)

I do see some warnings when I run some simple Regent programs with CUDA, but despite these everything seems to work.

warning: Linking two modules of different data layouts: '/usr/local/cuda-9.2/nvvm/libdevice/libdevice.10.bc' is '' whereas 'external' is 'e-m:e-i64:64-f80:128-n8:16:32:64-S128'

warning: Linking two modules of different target triples: /usr/local/cuda-9.2/nvvm/libdevice/libdevice.10.bc' is 'nvptx-unknown-unknown' whereas 'terra' is 'nvptx64-nvidia-cuda'

Ubuntu 16.04, CUDA 9.2, LLVM 6.0.

@elliottslaughter (Member)

I'm happy to merge this, or if there's an easy fix for the warnings that's ok too. As far as the code goes everything looks fine to me.

@ProfFan (Contributor Author) commented Sep 19, 2018

@elliottslaughter Yes it is easy to fix and harmless :) I just want to first make sure that it works though.

Just need to add some math functions to the current cuda*.t tests.
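
For example, a test along these lines would exercise the libdevice link step, since the generated PTX then needs __nv_sqrtf (a sketch only; the binding via terralib.externfunction and the names are illustrative, not the exact code added to the tests):

local nv_sqrtf = terralib.externfunction("__nv_sqrtf", {float} -> float)
local tid = cudalib.nvvm_read_ptx_sreg_tid_x

terra kern(result : &float)
  var i = tid()
  -- resolves only if the libdevice bitcode was actually linked into the CUDA module
  result[i] = nv_sqrtf([float](i))
end

local R = terralib.cudacompile({ kern = kern })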

I will now start fixing the warnings. In the meantime, please check for performance regressions if possible.

@ProfFan (Contributor Author) commented Sep 19, 2018

Fixed in 7740d0f. Please test :)

> I do see some warnings when I run some simple Regent programs with CUDA, but despite these everything seems to work.
>
> warning: Linking two modules of different data layouts: '/usr/local/cuda-9.2/nvvm/libdevice/libdevice.10.bc' is '' whereas 'external' is 'e-m:e-i64:64-f80:128-n8:16:32:64-S128'
>
> warning: Linking two modules of different target triples: /usr/local/cuda-9.2/nvvm/libdevice/libdevice.10.bc' is 'nvptx-unknown-unknown' whereas 'terra' is 'nvptx64-nvidia-cuda'
>
> Ubuntu 16.04, CUDA 9.2, LLVM 6.0.

@ProfFan (Contributor Author) commented Sep 19, 2018

Some performance results on Opt:

//////////// ITERATION2  (Opt(GN)) ///////////////
cost: 15346.887695 -> 13415.850586
final cost=13415.850586
--------------------------------------------------------
        Kernel        |   Count  |   Total   | Average
----------------------+----------+-----------+----------
----------------------+----------+-----------+----------
 overall              |      1   |  257.675ms| 257.6747ms
----------------------+----------+-----------+----------
 computeCost_W_H      |      2   |    1.860ms|  0.9298ms
----------------------+----------+-----------+----------
 PCGInit1_W_H         |      1   |    1.963ms|  1.9628ms
----------------------+----------+-----------+----------
 PCGStep1_W_H         |     50   |   92.558ms|  1.8512ms
----------------------+----------+-----------+----------
 PCGStep2_W_H         |     50   |   81.790ms|  1.6358ms
----------------------+----------+-----------+----------
 PCGStep3_W_H         |     50   |   77.120ms|  1.5424ms
----------------------+----------+-----------+----------
 PCGLinearUpdate_W_H  |      1   |    1.520ms|  1.5196ms
--------------------------------------------------------
TIMING 257.674652 1.962848 92.558372
Per-iter times ms (nonlinear,linear): 261.1571	251.4680
===Optical Flow===
**Final Costs**
Opt GN,Opt LM,CERES
1.34158505859375000000e+04,,
GPU memory usage: used = 6337.781250, free = 1854.031250 MB, total = 8191.812500 MB
plan free complete
GPU memory usage: used = 6337.781250, free = 1854.031250 MB, total = 8191.812500 MB

@elliottslaughter (Member)

@manopapad Would you be willing to run Soleil and check to see if there are any performance regressions?

@elliottslaughter (Member)

@ProfFan Thanks, that fixed the first warning. I still see the second one, but again this doesn't appear to prevent it from actually running anything.

warning: Linking two modules of different data layouts: '/usr/local/cuda-9.2/nvvm/libdevice/libdevice.10.bc' is '' whereas 'external' is 'e-m:e-i64:64-f80:128-n8:16:32:64-S128'

@ProfFan (Contributor Author) commented Sep 20, 2018

@elliottslaughter This is because some libraries that use Terra actually link libdevice themselves by calling Terra's terralib.linkllvm (Opt, for example, does this).

We could add an optional parameter here indicating that the library being linked is a CUDA library, or we could just add a linknvvm method and have all libraries switch to the new API.
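
For context, the pattern in question looks roughly like this on the library side (a sketch; the path and the particular libdevice function are illustrative, and the real Opt code differs):

-- Library-side linking of libdevice via terralib.linkllvm.
-- CUDA_HOME and the libdevice file name depend on the local CUDA install.
local libdevice = os.getenv("CUDA_HOME") .. "/nvvm/libdevice/libdevice.10.bc"
local lib = terralib.linkllvm(libdevice)
local nv_cosf = lib:extern("__nv_cosf", {float} -> float)

Because that bitcode gets pulled into the host-side 'external' module, whose data layout differs from libdevice's, the warning above is produced.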

@elliottslaughter (Member)

Given that it's just a warning, it's really not that big a deal. We can always fix it later if someone cares.

Thanks again for working on this, this is a really big help to those of us who depend on Terra!

@elliottslaughter (Member)

@ProfFan Now that you've done all this work, LLVM went and released version 7. You've already done more than enough for us, but would you have any interest in checking out 7? I figure it's better to work on it now, while everything is still fresh, rather than waiting to be 3 or 4 versions behind again.

Either way, thanks again for all you've done.

@ProfFan (Contributor Author) commented Sep 21, 2018

@elliottslaughter It's my great pleasure :) And thanks to all the contributors to Terra, this is really an amazing piece of software!

For LLVM 7.0, let's keep up the pace and I will open a PR to get CI working first.

@Mx7f (Contributor) commented Oct 16, 2018

I just want to chime in (as the maintainer of Opt) and say that this is absolutely awesome, and to thank @ProfFan profusely.

@elliottslaughter: It'd be great for Opt if there were binary releases of Terra that can handle newer versions of CUDA. If I compiled a new version using LLVM 6.0 on the three major platforms, and wrote a summary of all changes since the last release, would you cut a new release? I'll have to track down an OS X machine with an NVIDIA graphics card...

@elliottslaughter (Member)

@Mx7f I agree a release would be good. My initial inclination was to build and package the release directly from Travis. I don't think it's a requirement to have a GPU, since you can install the CUDA compiler without any physical hardware. I just haven't gotten around to figuring out how to have Travis do the packaging yet.

There is also the issue of building and running on Windows. I have an AppVeyor build, but it only works on LLVM 3.5 and Visual Studio 2013. I'd really love to figure out how to get it working on more LLVM and Visual Studio versions, but haven't been able to spend any time on it. Based on the regular stream of issues we've been collecting from Windows users, I don't think this is just an AppVeyor problem; I think it's a real issue with our Windows build.

I had made partial progress on a "one CMake build system to rule them all" over at this PR: #297. But it still has bugs I need to work out, and again, I keep running out of time to actually do things.

If you'd like to take up one or more of those tasks, I'd be all for it. Even if not, we could potentially think about a release anyway, but we'd have to decide what level of support we can afford to provide moving forward if we can't put in the time to make all the platforms work well.

@Mx7f (Contributor) commented Oct 16, 2018

The current releases have a fixed LLVM version number; having a full (Travis+AppVeyor) matrix of options would be great, but I think having a canonical release (LLVM 6.0, VS2015) makes sense if you are fine with it. If that's the case, the main obstacle to a release is an LLVM 6.0 VS2015 Windows build. I don't mind taking a crack at it on my home Windows machine.

@elliottslaughter (Member)

I agree, and I'm fine with using LLVM 6.0 exclusively for the next binary release. Travis and AppVeyor are mostly conveniences so that no one has to maintain build machines, and so that the release process is documented and repeatable.

If you're willing to look at LLVM 6.0 on Windows with any reasonably recent VS version, that would be a big step towards helping us get there.

@ProfFan (Contributor Author) commented Oct 17, 2018

@Mx7f Thank you for your appreciation :) Opt is a great piece of software and I have enjoyed it so much in my work. Cheers!
