GTXDirectSamplerEngine with large sample sizes crashes #7
Comments
Yes. Can you please create a Midje test that demonstrates the error? |
Sure. Here you go: ` (defn check-sample [n]
` Note that:
Exception thrown: clojure.lang.ExceptionInfo (CUDA error: CUDA_ERROR_ILLEGAL_ADDRESS.)
Hope this helps! |
OK. Let me finish a book chapter (1-2 days) and then I'll take a look at this and your pull request, and probably update bayadera to the latest neanderthal, too. |
Look forward to it! |
I'm working on this. Currently, there are many updates that I need to add to Bayadera to make it keep up with the latest Neanderthal. In particular, switching my AMD GPU from the 290X to the Vega 64 a couple of months ago required switching the drivers from the old proprietary Catalyst to the newest open-source ROCm, which caused many OpenCL compatibility breakages (the new OpenCL C compiler rejects many things that Catalyst allowed). This error might be caused by one of the many changes that I made to Bayadera during that migration months ago. I'll now make a thorough pass and update everything. I hope that it will reveal what causes this particular error. Anyway, this might take more than a few afternoons, but expect updates in a few days. |
Ah, OK. That would be great (the "upgrade everything" part). But I'll just say that my error was on CUDA, not OpenCL. And if you look at the GTXDirectSamplerEngine code in uncomplicate.bayadera.internal.device.nvidia_gtx.clj, it just seems like an older version of the code (it lacks with-release, etc.), while the OpenCL version in the Amd_gcn.clj file (GCNDirectSamplerEngine) has much more idiomatic code, as in, similar to the rest of the library. But yes, getting everything to work with the latest Neanderthal would be awesome! Btw, any plans of making v1.0.0 of everything anytime soon? :) |
The CUDA/OpenCL discrepancy is not due to an older version of the code, but to constraints on what is possible with CUDA and OpenCL. That is exactly why I need to tidy everything up first and then fix this error. Otherwise, the CUDA and OpenCL implementations would diverge, which would make them much harder to maintain properly. I am already using some advanced code there, and its behavior depends so heavily on the OpenCL drivers that I regularly have nowhere to look up a solution and am left to experiment. In fact, this migration to ROCm caused a block where some AMD code won't run as expected even with fixes that work in other places, and there are zero search results for the error and the weird behavior that I get... |
Update: I've solved other showstoppers, so this issue will be next to look at (but not today). |
About this particular issue: there is a bug in the GTXSamplerEngines that launches too many threads. Instead of (dim res), the number of threads should be four times smaller (let's assume that you use power-of-two size increments, so this does not support dim=1234567). So, the launch should use (/ (dim res) 4) in place of (dim res):
(do (launch! sample-kernel (grid-1d (/ (dim res) 4) WGS) hstream
As I decided to change the direct sampler substantially, the update will not come that soon, so please make this change in your own copy of the project. I am not sure why this error does not manifest with smaller arrays. It should fail with any size, but it doesn't. |
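The thread-count arithmetic behind this fix can be sketched in plain Clojure, with no GPU involved. This is only an illustration under one assumption: that each thread of the sampling kernel writes four values, which the division by 4 suggests but the thread does not state outright. The helper names `threads-needed` and `values-written` are hypothetical and are not part of Bayadera or ClojureCUDA.

```clojure
;; Hypothetical helpers for illustration; not part of Bayadera or ClojureCUDA.
;; Assumption: every kernel thread writes `per-thread` values to the result.

(defn threads-needed
  "Number of threads required to fill `n` values when each thread
  writes `per-thread` values. Rounds up when n is not a multiple."
  [n per-thread]
  (quot (+ n (dec per-thread)) per-thread))

(defn values-written
  "Total values written when `threads` threads each write `per-thread`
  values and the kernel performs no bounds check."
  [threads per-thread]
  (* threads per-thread))

;; One thread per element: the launch would write four times the buffer size.
(values-written 1048576 4)                    ; => 4194304

;; A quarter as many threads: the writes match the buffer size exactly.
(values-written (threads-needed 1048576 4) 4) ; => 1048576
```

Under the same assumption, a size that is not a multiple of four still overshoots: `(threads-needed 1234567 4)` is 308642, and 308642 threads would write 1234568 values, one past the end of the buffer. That is consistent with the remark that a size such as dim=1234567 is not supported.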
Thanks a lot for your time, and it works! |
Fixed in latest commits. |
Hi Dragan. The implementation of GTXDirectSamplerEngine in uncomplicate.bayadera.internal.device seems a bit off. Basically, if you pass in a sample size larger than a certain number, it crashes with a CUDA error (CUDA error: CUDA_ERROR_ILLEGAL_ADDRESS). This only affects the Gaussian and Uniform distributions, which use this implementation. I know you've implemented Uniform and Gaussian directly in the Neanderthal library v. 25, but Bayadera is incompatible with it. Would you mind taking a look when you can? Thanks