Template picking crashes with CUFFT_INTERNAL_ERROR #44
Comments
No, still crashes the same way if the warp_build conda environment is using CUDA 11.8. I've tried disabling the firewall temporarily as well, since it appears to be complaining about a "connection refused" on localhost, but the firewall has never stopped any other local programs from running (CryoSPARC, mainly).
Thanks for the detailed report! The "connection refused" is interesting and I haven't come across it before. The master process communicates with the worker processes over a REST API, and it seems like this communication is what's failing. You mentioned disabling the firewall; did this resolve the issue? cc @dtegunov
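For anyone debugging the same symptom, here's a minimal sketch to check whether anything is actually listening on the port the master is trying to reach (the port number below is a placeholder, substitute the one from your log):

```python
# Quick check: is anything listening on the port the master is trying to reach?
# The port number is a hypothetical example; substitute the one from your log.
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers connection refused and timeouts
        return False

print(port_open("localhost", 5000))  # False => worker died before binding the port
```

If this returns False right after the worker starts, the worker process most likely crashed before it could bind the port, which would make the "connection refused" a downstream symptom rather than the cause.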
Thanks for Linux Warp! 😀 Also, I tried installing CUDA 11.8 as the system CUDA and setting that as the default, which also doesn't fix this.
It did not. The port changes, but it changes every time it's run. I've set it up on my main account as well (just in case). I think the port is a bit of a red herring, if I'm honest, as it later reports a core dump. So I think the worker fails (the CUFFT error), but the master process can't find it and is much more verbose about it. After the CUFFT error is thrown, there is no output for 4-5 seconds before the (.NET?) output. I'm definitely scratching my head over why template matching makes it faceplant rather than earlier steps... Output from a firewall-disabled run:
And one with the default EMDB entry (1 GPU process) (firewall still disabled):
@rbs-sci is there anything in […]?
You mean in, e.g., […]? There are no errors listed.
Sorry for the delay replying. Public holiday here, and I was trying to grab a little downtime.
OK, a little experimentation shows that it might be memory usage after all. What prompted that was the "15770" vmem above. I ran template picking with […]. So the CUFFT/CUIFFT crash is reproducible (though whether the forward or inverse FFT runs out of VRAM seems random: on a system with 6x A4000 GPUs, five were used; four crashed with CUFFT errors and one with a CUIFFT error). Running […] I'll experiment a little and see whether reconstructing a more highly binned tomogram influences picking VRAM usage so that it can be done on A4000s.
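In case it helps anyone else chasing this, a minimal sketch for logging per-process VRAM once a second while a run is going (the nvidia-smi query flags are standard; the log file name is just an example):

```python
# Poll nvidia-smi and log per-process GPU memory once a second.
# Ctrl-C to stop. The output file name is just an example.
import subprocess, time

QUERY = ["nvidia-smi",
         "--query-compute-apps=gpu_uuid,pid,process_name,used_gpu_memory",
         "--format=csv,noheader"]

with open("vram_log.csv", "w") as log:
    while True:
        stamp = time.strftime("%H:%M:%S")
        out = subprocess.run(QUERY, capture_output=True, text=True).stdout
        for line in out.splitlines():
            log.write(f"{stamp},{line}\n")
        log.flush()
        time.sleep(1)
```

Watching the last few entries before each crash should show whether the failing worker is at the card's limit when the CUFFT error is thrown.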
This sounds like a good reason to add a parameter to influence the memory footprint. I'll look into it.
@rbs-sci thanks for getting back to us and hope you enjoyed the downtime! 🙂 Some extra context: subvolumes are batched along the Y dimension of the tomogram for matching. If your 2D data have a large pixel size (e.g. 2.5-3 Å) then 10 Å/px is not very downsampled, which yields large tomograms and thus a large number of subvolumes in that batch. I assume @dtegunov is going to try to make the memory requirements independent of tomogram size so we can avoid this in the future 🙂
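To make the scaling concrete, a back-of-envelope sketch (all of the sizes, the box, and the overlap below are illustrative assumptions, not Warp's actual defaults):

```python
# Back-of-envelope: how tomogram size drives the number of subvolumes
# in one Y-slab batch. All numbers here are illustrative assumptions.
raw_angpix = 2.5          # 2D pixel size in Angstroms
match_angpix = 10.0       # matching pixel size in Angstroms
binning = match_angpix / raw_angpix   # only 4x for large-pixel data

tomo_x = 4096 // int(binning)         # slab extent in X, in voxels
tomo_z = 2048 // int(binning)         # slab extent in Z, in voxels
box = 64                  # hypothetical subvolume box size in voxels
step = box // 2           # hypothetical 50% overlap between subvolumes

subvols_per_slab = (tomo_x // step) * (tomo_z // step)
print(f"binning {binning:.0f}x -> {tomo_x} x {tomo_z} slab, "
      f"~{subvols_per_slab} subvolumes per Y batch")
```

The point is just that halving the binning quadruples the subvolume count per slab, so data with a large raw pixel size ends up with much bigger batches at the same matching resolution.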
Thanks @alisterburt! 😄 Also thanks @dtegunov for looking into this further. I was testing this with the EMPIAR script you provided; now it's all working, I'll be applying it to my data. Looking forward to it. Might it be related to the number of projections generated and tested against? 1536 is a lot of views for an octahedral template, so the ability to control how many orientations are tested might be advantageous... It's interesting because all the earlier stages (e.g. motion correction) will happily run four processes on a single 16 GB A4000 and not run out of memory.
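For a sense of where a number like 1536 comes from, here's a rough orientation-count estimate (the 7.5° step and the 8π²/δ³ rule of thumb are my assumptions for illustration, not Warp's actual sampling scheme):

```python
# Rough estimate of how many orientations an angular step implies,
# and how point-group symmetry shrinks it. The 8*pi^2 / delta^3 rule
# of thumb and the 7.5 degree step are assumptions for illustration.
import math

def n_orientations(step_deg: float, sym_order: int = 1) -> int:
    delta = math.radians(step_deg)
    total = 8 * math.pi ** 2 / delta ** 3   # ~uniform SO(3) sampling density
    return round(total / sym_order)

print(n_orientations(7.5))       # C1: tens of thousands of views
print(n_orientations(7.5, 24))   # octahedral (order 24): ~1.5k views
```

So octahedral symmetry already cuts the search by a factor of 24; ~1500 views is roughly what a ~7-8° step leaves over after that reduction.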
I also ran into this error during the EMPIAR script with 4x 24 GB RTX 3090s. After changing to […]
Going to close here: the script is not designed to be run on everyone's machines; it was for us to test on our infrastructure. The docs (current version at https://warpem.github.io/warp) don't default to […]
Thanks, I should have marked this as closed earlier; apologies.
aa63b22 reduces the default VRAM consumption to below 8 GB and adds a --batch_angles parameter to ts_template_match to regulate it.
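For intuition, a back-of-envelope sketch of why the number of orientations held on the GPU at once drives the footprint (the box size, buffer count, and complex-float cost below are illustrative assumptions, not the actual implementation; aa63b22 is the authoritative change):

```python
# Why batching orientations bounds VRAM: correlation buffers scale with
# the number of orientations resident on the GPU at once. Box size,
# dtype cost, and buffer count below are illustrative assumptions.
def batch_vram_gb(batch_angles: int, box: int = 128, n_buffers: int = 2) -> float:
    bytes_per_voxel = 8                    # complex64: 2 x float32
    per_orientation = box ** 3 * bytes_per_voxel * n_buffers
    return batch_angles * per_orientation / 1024 ** 3

print(f"{batch_vram_gb(64):.1f} GB")    # smaller batch -> fits a 16 GB A4000
print(f"{batch_vram_gb(512):.1f} GB")   # large batch -> blows past it
```

Under those assumptions the footprint is linear in the batch size, which is why exposing it as a parameter lets smaller cards trade speed for memory.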
System:
Ubuntu 22.04 (latest updates)
CUDA 12.4 (driver 550.54.15)
Ryzen 5700G
64GB RAM
Quadro A4000 (16GB)
Linux Warp/M appears to build correctly using the provided scripts; a few warnings, but nothing looks related.
The test script runs great until template picking, at which point the following error occurs:
Initially I thought it was running out of VRAM due to running two processes on a single GPU while sampling so many orientations, but the same thing happens if I set processes to 1 and use a 48 GB A6000 on a 128-core, 1 TB RAM server (also bare-metal Linux, not WSL). I also tested a different, smaller EMDB map (hence the log above). No joy there either.
Appears to be similar to the issue here: pytorch/pytorch#88038
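As a side note, here's a minimal way to watch large GPU FFTs fall over as VRAM runs out (assumes a PyTorch + CUDA install; the volume size and batch growth are arbitrary):

```python
# Minimal probe: grow a batch of 3D FFTs on the GPU until cuFFT or the
# allocator fails. Requires PyTorch with CUDA; sizes are arbitrary.
import torch

assert torch.cuda.is_available()
box = 512
batch = 1
try:
    while True:
        x = torch.randn(batch, box, box, box, device="cuda")
        y = torch.fft.fftn(x, dim=(-3, -2, -1))
        torch.cuda.synchronize()  # surface async CUDA/cuFFT errors here
        print(f"batch {batch} OK, "
              f"{torch.cuda.memory_allocated() / 1024**3:.1f} GB allocated")
        del x, y
        batch *= 2
except RuntimeError as e:  # OOM and CUFFT_INTERNAL_ERROR both land here
    print(f"failed at batch {batch}: {e}")
```

On cards where the plan itself can't be allocated, cuFFT tends to report an internal error rather than a clean out-of-memory, which would match the symptom here.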
Haven't tried CUDA 11.8 yet; it's next on my to-do list. If CUDA 11.8 works, I'll update accordingly.