GTXDirectSamplerEngine with large sample sizes crashes #7

gurvesh · 2019-07-17T11:35:23Z

Hi Dragan. The implementation of GTXDirectSamplerEngine in uncomplicate.bayadera.internal.device seems a bit off. Basically if you pass in a sample size more than a certain number, it crashes due to some Cuda error (CUDA error: CUDA_ERROR_ILLEGAL_ADDRESS). This is only true for the Gaussian and Uniform distributions, which use this implementation . I know you've implemented Uniform and Gaussian directly into the Neanderthal library v. 25, but Bayadera is incompatible with it. Would you mind please taking a look, when you can? Thanks

blueberry · 2019-07-17T13:00:14Z

yes. can you please create a midje test that demonstrates the error?

gurvesh · 2019-07-17T17:08:22Z

Sure. Here you go:

`
(ns testing-uniform
(:use (clojure repl))
(:require [uncomplicate.neanderthal
[core :refer :all]
[native :refer :all]]
[midje.sweet :refer :all]
[uncomplicate.bayadera
[core :refer [sampler dataset density distribution evidence sample! sample]]
[library :as nl]
[cuda :refer [with-default-bayadera]]]
[uncomplicate.commons.core :refer [with-release]]))

(defn check-sample [n]
(facts
(with-default-bayadera
(with-release [uniform-dist (nl/uniform 1 2)
test-sampler (sampler uniform-dist)
test-sample (sample test-sampler n)]

   ((native (row test-sample 0))) => n ))))

`

Note that:

The above will pass for small values of n (on my computer, n <= 1,000,000 works fine) - but anything higher will kill it (and once the error is triggered, any lower of value of n will still keep triggering it - you need to restart lein).

Exception thrown: clojure.lang.ExceptionInfo (CUDA error: CUDA_ERROR_ILLEGAL_ADDRESS.)

error - uncomplicate.clojurecuda.internal.utils - (utils.clj:37)
fn - uncomplicate.clojurecuda.internal.impl.CUModule - (impl.clj:91)
release - uncomplicate.clojurecuda.internal.impl.CUModule - (impl.clj:91)
release - uncomplicate.bayadera.internal.device.nvidia-gtx.GTXDatasetEngine - (nvidia_gtx.clj:155)
release - uncomplicate.bayadera.internal.device.nvidia-gtx.GTXBayaderaFactory - (nvidia_gtx.clj:763)
check-sample/fn/fn - testing-uniform - (form-init10386720772629001284.clj:14)
with-altered-roots* - midje.util.thread-safe-var-nesting - (thread_safe_var_nesting.clj:32)
check-sample/fn - testing-uniform - (form-init10386720772629001284.clj:14)
applyToHelper - (AFn.java:152)
applyTo - (AFn.java:144)
doInvoke - (AFunction.java:31)
invoke - (RestFn.java:397)
check-one/fn - midje.checking.facts - (facts.clj:32)
check-one - midje.checking.facts - (facts.clj:31)
creation-time-check - midje.checking.facts - (facts.clj:35)
check-sample - testing-uniform - (form-init10386720772629001284.clj:13)
eval60767 - testing-uniform - (form-init10386720772629001284.clj:1)
eval60767 - testing-uniform - (form-init10386720772629001284.clj:1)
eval - (Compiler.java:7176)
eval - (Compiler.java:7131)
eval - clojure.core - (core.clj:3214)
eval - clojure.core - (core.clj:3210)
repl/read-eval-print/fn - clojure.main - (main.clj:414)
repl/read-eval-print - clojure.main - (main.clj:414)
repl/fn - clojure.main - (main.clj:435)
repl - clojure.main - (main.clj:435)
evaluate - nrepl.middleware.interruptible-eval - (interruptible_eval.clj:106)
interruptible-eval/fn/fn - nrepl.middleware.interruptible-eval - (interruptible_eval.clj:140)
run - (AFn.java:22)
session-exec/main-loop/fn - nrepl.middleware.session - (session.clj:171)
session-exec/main-loop - nrepl.middleware.session - (session.clj:170)
run - (AFn.java:22)
run - (Thread.java:748)

If you replace nl/uniform with nl/beta or any other distribution that doesn't use the GTXDirectSampler, it will work for values of n upto around 400,000,000 (provided the code in the pull request I created is implemented first - otherwise it will fail for other reasons). After that it gives another error of the form:
Actual:
java.lang.IllegalArgumentException: Value out of range for int: 4000000000

Hope this helps!

blueberry · 2019-07-17T21:41:52Z

OK. Let me finish a book chapter (1-2 days) and then I'll take a look at this and your pull request, and probably update bayadera to the latest neanderthal, too.

gurvesh · 2019-07-18T08:40:27Z

Look forward to it!

blueberry · 2019-07-21T16:06:28Z

I'm working on this. Currently, there are many updates that I need to add to bayadera to make it keep up to the latest neanderthal.

Particularly, switching my AMD GPU from 290X to Vega64 a couple months ago required switching the drivers from old proprietary catalyst to the newest open-source ROCm raised many OpenCL compatibility breakages (their new OpenCL C compiler does not swallow many things that were allowed by catalyst).

This error might be caused by many changes that I made to bayadera in this migration process months ago. Now I'll see to make a thorough pass and update everything. I hope that it will reveal what makes this particular error. Anyway, this might take more than a few afternoons, but expect updates in a few days.

gurvesh · 2019-07-21T16:42:54Z

Ah OK. That would be great (the "upgrade everything" part). But I'll just say that my error was on Cuda, not OpenCL. And if you look at the GTXDirectSamplerEngine code in the uncomplicate.bayadera.internal.device.nvidia_gtx.clj, it just seems like an older version of the code (lacks with-release etc), while the OpenCL version in the Amd_gcn.clj file (GCNDirectSamplerEngine) has a lot more idiomatic code - as in, similar to the rest of of the library. But yes, getting everything to work with the latest Neanderthal would be awesome!

Btw - any plans of making v1.0.0 of everything anytime soon? :)

blueberry · 2019-07-21T18:49:54Z

The CUDA/OpenCL discrepancy is not due to and older version of code, but by constraints on what is possible with cuda and OpenCL. That is exactly why I need to tidy everything up first and then fix this error. Otherwise, CUDA and OpenCL implementation would diverge, which would make them much harder to properly maintain. I am already using some advanced code there, and the behavior is highly dependent on OpenCL drivers, that I regularly do not have anywhere to look up for a solution, and am left to experimenting. In fact, this migration to ROCm caused a block where some AMD code won't run as expected even with fixes that run in other places, and there are zero results for the error and the wierd behavior that I get...

blueberry · 2019-07-23T16:48:18Z

Update: I've solved other showstoppers, so this issue will be next to look at (but not today).

blueberry · 2019-07-24T21:07:41Z

About this particular issue: there is a bug in the GTXSamplerEngines that launches too many threads. Instead of (dim res), the number of threads should be 4 times less (let's assume that you use power of two size increments, so this does not support dim=1234567). So, instead of:

bayadera/src/clojure/uncomplicate/bayadera/internal/device/nvidia_gtx.clj

Line 57 in a5e087d

(do (launch! sample-kernel (grid-1d (dim res) WGS) hstream

(do (launch! sample-kernel (grid-1d (/ (dim res) 4) WGS) hstream

As I decided to change direct sampler substantially, the update will not come that soon, so please make this change in your own copy of the project.

I am not sure why this error does not manifest with smaller arrays. It should fail with any size, but it doesn't.

gurvesh · 2019-07-25T09:18:00Z

Thanks a lot for your time, and it works!

blueberry · 2019-07-29T22:49:52Z

Fixed in latest commits.

blueberry closed this as completed Jul 29, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GTXDirectSamplerEngine with large sample sizes crashes #7

GTXDirectSamplerEngine with large sample sizes crashes #7

gurvesh commented Jul 17, 2019

blueberry commented Jul 17, 2019

gurvesh commented Jul 17, 2019 •

edited

blueberry commented Jul 17, 2019

gurvesh commented Jul 18, 2019

blueberry commented Jul 21, 2019

gurvesh commented Jul 21, 2019

blueberry commented Jul 21, 2019

blueberry commented Jul 23, 2019

blueberry commented Jul 24, 2019 •

edited

gurvesh commented Jul 25, 2019

blueberry commented Jul 29, 2019

GTXDirectSamplerEngine with large sample sizes crashes #7

GTXDirectSamplerEngine with large sample sizes crashes #7

Comments

gurvesh commented Jul 17, 2019

blueberry commented Jul 17, 2019

gurvesh commented Jul 17, 2019 • edited

blueberry commented Jul 17, 2019

gurvesh commented Jul 18, 2019

blueberry commented Jul 21, 2019

gurvesh commented Jul 21, 2019

blueberry commented Jul 21, 2019

blueberry commented Jul 23, 2019

blueberry commented Jul 24, 2019 • edited

gurvesh commented Jul 25, 2019

blueberry commented Jul 29, 2019

gurvesh commented Jul 17, 2019 •

edited

blueberry commented Jul 24, 2019 •

edited