Consider using build tags instead of a bash script #125

Open
oliverdain opened this issue Jan 17, 2024 · 6 comments

@oliverdain

This is a really helpful project! I'm trying to get it working with our build, but the install is very non-standard, so I can't just go get it and be done. We'd need every developer to run the https://github.com/sugarme/gotch/releases/download/v0.9.0/setup-gotch.sh script in order to build the rest of our code base, even if they're not working with the PyTorch parts of it.

I took a look at the script, and the bulk of what it does seems to be changing the cgo compiler flags depending on whether the build targets GPU or CPU. It might be easier for end users if you had two versions of lib.go and used build tags to determine which one gets built, as described here. That way users could use your library like any other (assuming they have libtorch installed), and all they'd have to do is add a -tags option to their go build command. For example, the go-sqlite3 library uses this approach.
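
To illustrate the proposal, here is a minimal sketch of how a build tag would choose between two files. The file names lib_cpu.go and lib_gpu.go and the tag name gpu are assumptions for illustration, not the project's actual layout. The evaluator below only looks at custom -tags values; the real go tool also takes GOOS, GOARCH, and similar implicit tags into account.

```go
package main

import (
	"fmt"
	"go/build/constraint"
)

// variants pairs each hypothetical file with the build constraint it would
// carry on its first line. In a real cgo package each file would also hold
// its own "#cgo LDFLAGS" / "#cgo CFLAGS" lines for CPU or GPU libtorch.
var variants = []struct{ file, line string }{
	{"lib_cpu.go", "//go:build !gpu"},
	{"lib_gpu.go", "//go:build gpu"},
}

// selectedFiles reports which of the variant files would be compiled when
// the given custom tags are passed via "go build -tags ...".
func selectedFiles(tags ...string) ([]string, error) {
	set := make(map[string]bool, len(tags))
	for _, t := range tags {
		set[t] = true
	}
	var out []string
	for _, v := range variants {
		expr, err := constraint.Parse(v.line)
		if err != nil {
			return nil, err
		}
		// Eval asks our callback whether each tag in the expression is set.
		if expr.Eval(func(tag string) bool { return set[tag] }) {
			out = append(out, v.file)
		}
	}
	return out, nil
}

func main() {
	for _, tags := range [][]string{nil, {"gpu"}} {
		files, err := selectedFiles(tags...)
		if err != nil {
			panic(err)
		}
		fmt.Printf("-tags %v -> %v\n", tags, files)
	}
}
```

With no tags only the CPU file is selected, and with -tags gpu only the GPU file is, so exactly one lib.go variant is ever compiled and no script has to rewrite source files.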

If you like the idea I could send you a PR for it.

@sugarme
Owner

sugarme commented Jan 17, 2024

@oliverdain,

Thanks for the suggestion. I was wondering how it would work in combination with the libtorch installation for CPU and GPU. Or would it handle only the gotch installation?

Feel free to open a PR. Please update the installation guide in the README.md file as well. Thanks.

@oliverdain
Author

Hi @sugarme. I've started on this. I got libtorch installed for CPU and exported all the env vars, etc. Running go test ./... in a clone of this repo builds some, but not all, packages correctly and throws what look to me like real compilation errors. This happens with both libtorch 2.1.2 and libtorch 2.1.0 (the latter being what's in your setup.sh files).

I see that you automatically generated many of the bindings but didn't see much info on how that was done. I think maybe they just need to be re-generated?

There are actually quite a few more failures than this, but this should give you the idea:

$ go test ./...
?       github.com/sugarme/gotch        [no test files]
ok      github.com/sugarme/gotch/dutil  0.678s
?       github.com/sugarme/gotch/example/augmentation   [no test files]
?       github.com/sugarme/gotch/example/basic  [no test files]
?       github.com/sugarme/gotch/example/char-rnn       [no test files]
?       github.com/sugarme/gotch/example/cifar  [no test files]
?       github.com/sugarme/gotch/example/convert-model  [no test files]
?       github.com/sugarme/gotch/example/debug-memory   [no test files]
?       github.com/sugarme/gotch/example/jit    [no test files]
?       github.com/sugarme/gotch/example/jit-train      [no test files]
?       github.com/sugarme/gotch/example/mem    [no test files]
# github.com/sugarme/gotch/example/neural-style-transfer
example/neural-style-transfer/main.go:156:19: cannot use &inputLayers[idx] (value of type **ts.Tensor) as *ts.Tensor value in argument to styleLoss
example/neural-style-transfer/main.go:156:38: cannot use &styleLayers[idx] (value of type **ts.Tensor) as *ts.Tensor value in argument to styleLoss
example/neural-style-transfer/main.go:162:38: cannot use &contentLayers[idx] (value of type **ts.Tensor) as *ts.Tensor value in argument to inputLayers[idx].MustMseLoss
?       github.com/sugarme/gotch/example/mnist  [no test files]
?       github.com/sugarme/gotch/example/mnist-fp16     [no test files]
# github.com/sugarme/gotch/example/scheduler
example/scheduler/main.go:39:18: cannot use []ts.Tensor{…} (value of type []ts.Tensor) as []*ts.Tensor value in argument to o.AddParamGroup
?       github.com/sugarme/gotch/example/pickle [no test files]
?       github.com/sugarme/gotch/example/pretrained-model       [no test files]
?       github.com/sugarme/gotch/example/tensor-grad    [no test files]
# github.com/sugarme/gotch/example/translation
example/translation/main.go:84:22: cannot use []ts.Tensor{…} (value of type []ts.Tensor) as []*ts.Tensor value in argument to ts.MustCat
example/translation/main.go:106:29: cannot use []ts.Tensor{…} (value of type []ts.Tensor) as []*ts.Tensor value in argument to ts.MustCat
example/translation/main.go:110:57: attnWeights.MustBmm(encOutputsTs, true).MustSqueeze1 undefined (type *ts.Tensor has no field or method MustSqueeze1)
example/translation/main.go:113:20: cannot use []ts.Tensor{…} (value of type []ts.Tensor) as []*ts.Tensor value in argument to ts.MustCat
example/translation/main.go:159:26: cannot use encOutputs (variable of type []ts.Tensor) as []*ts.Tensor value in argument to ts.MustStack
example/translation/main.go:219:26: cannot use encOutputs (variable of type []ts.Tensor) as []*ts.Tensor value in argument to ts.MustStack
# github.com/sugarme/gotch/example/yolo
example/yolo/darknet.go:401:29: cannot use []ts.Tensor{…} (value of type []ts.Tensor) as []*ts.Tensor value in argument to ts.MustCat
example/yolo/darknet.go:515:23: cannot use layers (variable of type []ts.Tensor) as []*ts.Tensor value in argument to ts.MustCat
example/yolo/darknet.go:543:21: cannot use detections (variable of type []ts.Tensor) as []*ts.Tensor value in argument to ts.MustCat
example/yolo/main.go:193:20: imageTmp.MustDiv1 undefined (type *ts.Tensor has no field or method MustDiv1)
?       github.com/sugarme/gotch/example/tensor-io      [no test files]
?       github.com/sugarme/gotch/example/transfer-learning      [no test files]
?       github.com/sugarme/gotch/example/yolo/freetype  [no test files]
?       github.com/sugarme/gotch/libtch [no test files]
ok      github.com/sugarme/gotch/half   (cached)
?       github.com/sugarme/gotch/vision [no test files]
?       github.com/sugarme/gotch/vision/aug     [no test files]
+---------------------------------------------------------------------------+
| Memory Stats: Start                                                       |
+---------------------------------------------------------------------------+
|  Allocated heap objects                                              773  |
|  Released heap objects                                                19  |
|  Living heap objects                                                 754  |
|  Memory in use by heap objects (bytes)                            298272  |
|  Reserved memory (by Go runtime for heap, stack,...) (bytes)    11926544  |
|  Total pause time by GC (nanoseconds)                                  0  |
|  Number of GC called                                                   0  |
+---------------------------------------------------------------------------+
vs created...
vs deleted...
+---------------------------------------------------------------------------+
| Memory Stats: Final                                                       |
+---------------------------------------------------------------------------+
|  Allocated heap objects                                             6726  |
|  Released heap objects                                              5977  |
|  Living heap objects                                                 749  |
|  Memory in use by heap objects (bytes)                            325360  |
|  Reserved memory (by Go runtime for heap, stack,...) (bytes)    83575064  |
|  Total pause time by GC (nanoseconds)                             542866  |
|  Number of GC called                                                  10  |
+---------------------------------------------------------------------------+
Loss: 23.000
Loss: 0.336
Loss: 0.307
Loss: 0.281
Loss: 0.257
2024/01/17 17:16:12 Libtorch API Error: element 0 of tensors does not require grad and does not have a grad_fn
Exception raised from run_backward at ../torch/csrc/autograd/autograd.cpp:109 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6b (0x7fe8eb5ac1fb in /home/oliver/Documents/code/main/go/dist/libs/libtorch_cpu/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xbf (0x7fe8eb5a6d6f in /home/oliver/Documents/code/main/go/dist/libs/libtorch_cpu/lib/libc10.so)
frame #2: <unknown function> + 0x45f4db8 (0x7fe8d9bf4db8 in /home/oliver/Documents/code/main/go/dist/libs/libtorch_cpu/lib/libtorch_cpu.so)
frame #3: torch::autograd::backward(std::vector<at::Tensor, std::allocator<at::Tensor> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, c10::optional<bool>, bool, std::vector<at::Tensor, std::allocator<at::Tensor> > const&) + 0x6a (0x7fe8d9bf82ea in /home/oliver/Documents/code/main/go/dist/libs/libtorch_cpu/lib/libtorch_cpu.so)
frame #4: <unknown function> + 0x46599bd (0x7fe8d9c599bd in /home/oliver/Documents/code/main/go/dist/libs/libtorch_cpu/lib/libtorch_cpu.so)
frame #5: at::Tensor::_backward(c10::ArrayRef<at::Tensor>, c10::optional<at::Tensor> const&, c10::optional<bool>, bool) const + 0x49 (0x7fe8d69a09f9 in /home/oliver/Documents/code/main/go/dist/libs/libtorch_cpu/lib/libtorch_cpu.so)
frame #6: at_backward + 0x4f (0x742edf in /tmp/go-build428359870/b198/nn.test)
frame #7: /tmp/go-build428359870/b198/nn.test() [0x55d2c4]

goroutine 34 [running]:
runtime/debug.Stack()
        /usr/local/go/src/runtime/debug/stack.go:24 +0x65
github.com/sugarme/gotch/ts.TorchErr()
        /home/oliver/Documents/code/gotch/ts/error.go:45 +0x4b
github.com/sugarme/gotch/ts.(*Tensor).Backward(0x616159?)
        /home/oliver/Documents/code/gotch/ts/tensor.go:813 +0x2a
github.com/sugarme/gotch/ts.(*Tensor).MustBackward(0x681ef9?)
        /home/oliver/Documents/code/gotch/ts/tensor.go:821 +0x19
github.com/sugarme/gotch/nn.(*Optimizer).BackwardStep(0xc00007d3e0, 0xc000014018?)
        /home/oliver/Documents/code/gotch/nn/optimizer.go:316 +0x7b
github.com/sugarme/gotch/nn_test.TestOptimizer(0xc00050ed00)
        /home/oliver/Documents/code/gotch/nn/optimizer_test.go:53 +0x57e
testing.tRunner(0xc00050ed00, 0x83ad18)
        /usr/local/go/src/testing/testing.go:1576 +0x10b
created by testing.(*T).Run
        /usr/local/go/src/testing/testing.go:1629 +0x3ea

FAIL    github.com/sugarme/gotch/nn     17.413s
--- FAIL: ExampleLoadInfo (55.28s)

@sugarme
Owner

sugarme commented Jan 18, 2024

@oliverdain,

From the log, I think gotch itself actually compiled. Some tests failed as a result of the API change from []ts.Tensor (older API) to []*ts.Tensor, which I thought had already been fixed. As long as you can create a simple example with some tensor operations and get the results you expect, I think gotch is okay with the bindings and APIs; just leave out the unit tests in the subpackages.
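
The compile errors in the log all follow one pattern: callers still build []ts.Tensor where the functions now expect []*ts.Tensor. A minimal sketch of adapting such a caller is below. Tensor, cat, and toPtrs are hypothetical stand-ins so the example runs without libtorch; they are not the real ts package API.

```go
package main

import (
	"fmt"
	"strings"
)

// Tensor is a stand-in for ts.Tensor (hypothetical, for illustration only).
type Tensor struct{ Name string }

// cat mimics the shape of a function that takes a slice of pointers,
// as ts.MustCat does according to the errors in the log.
func cat(tensors []*Tensor) string {
	names := make([]string, len(tensors))
	for i, t := range tensors {
		names[i] = t.Name
	}
	return strings.Join(names, "+")
}

// toPtrs converts the older []Tensor form into []*Tensor.
func toPtrs(values []Tensor) []*Tensor {
	out := make([]*Tensor, len(values))
	for i := range values {
		// Index into the slice so each pointer refers to its own element
		// rather than to a copy held in a loop variable.
		out[i] = &values[i]
	}
	return out
}

func main() {
	old := []Tensor{{Name: "a"}, {Name: "b"}}
	// cat(old) would not compile: []Tensor is not assignable to []*Tensor.
	fmt.Println(cat(toPtrs(old)))

	// New code can build the pointer slice directly.
	fresh := []*Tensor{{Name: "c"}, {Name: "d"}}
	fmt.Println(cat(fresh))
}
```

The **ts.Tensor errors are the mirror image: once a slice already holds pointers, passing &slice[idx] adds a second level of indirection, so the & should be dropped.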

Please stick to libtorch 2.1.0 as the latest supported version.

I created a branch, https://github.com/sugarme/gotch/tree/buildtag, that you can open your PR against for tracking.

Cheers,

@oliverdain
Author

Just sent you a PR that fixes the unit tests. I'll work on the build tag thing tomorrow.

@sugarme
Owner

sugarme commented Jan 23, 2024

@oliverdain,

Thanks for the fix.

@nullbull

I strongly agree with this suggestion to use build tags instead of having a script produce the code. I plan to use this library in a production environment, but because lib.go is generated and our company's build platform has no way to run the script that generates it, my only option is to fork the repo and hardcode lib.go for CPU use.
