Template picking crashes with CUFFT_INTERNAL_ERROR #44

rbs-sci · 2024-05-04T06:34:43Z

System:
Ubuntu 22.04 (latest updates)
CUDA 12.4 (driver 550.54.15)
Ryzen 5700G
64GB RAM
Quadro A4000 (16GB)

Linux Warp/M appears to build correctly using provided scripts - a few warnings, but nothing looks related.

Test script runs great until template picking, at which point the following error occurs:

File search will be relative to /data/warp/r5_apoF_test/tomostar
5 files found
Parsing previous results for each item, if available...
5/5, previous metadata found for 5                                                                                                
Downloading map from EMDB: 100.00%                                                                                                
Extracting downloaded map... Done
Setting --template_angpix to 1.912 based on template map
Using 1536 orientations for matching
Connecting to workers...
Connected to 1 workers
0/5terminate called after throwing an instance of 'std::runtime_error'
  what():  cuFFT error: CUFFT_INTERNAL_ERROR at /home/warp/build/warp/NativeAcceleration/gtom/src/FFT/IFFT.cu:24

Failed to process /data/warp/r5_apoF_test/warp_tiltseries/TS_11.tomostar, marked as unselected                                     
Unhandled exception. System.AggregateException: One or more errors occurred. (Connection refused (localhost:39631))
 ---> System.Net.Http.HttpRequestException: Connection refused (localhost:39631)
 ---> System.Net.Sockets.SocketException (111): Connection refused
   at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.ThrowException(SocketError error, CancellationToken cancellationToken)
   at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(Int16 token)
   at System.Net.Sockets.Socket.<ConnectAsync>g__WaitForConnectWithCancellation|285_0(AwaitableSocketAsyncEventArgs saea, ValueTask connectTask, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.ConnectToTcpHostAsync(String host, Int32 port, HttpRequestMessage initialRequest, Boolean async, CancellationToken cancellationToken)
   --- End of inner exception stack trace ---
   at System.Net.Http.HttpConnectionPool.ConnectToTcpHostAsync(String host, Int32 port, HttpRequestMessage initialRequest, Boolean async, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.ConnectAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.CreateHttp11ConnectionAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.AddHttp11ConnectionAsync(QueueItem queueItem)
   at System.Threading.Tasks.TaskCompletionSourceWithCancellation`1.WaitWithCancellationAsync(CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.SendWithVersionDetectionAndRetryAsync(HttpRequestMessage request, Boolean async, Boolean doRequestAuth, CancellationToken cancellationToken)
   at System.Net.Http.RedirectHandler.SendAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
   at System.Net.Http.HttpClient.<SendAsync>g__Core|83_0(HttpRequestMessage request, HttpCompletionOption completionOption, CancellationTokenSource cts, Boolean disposeCts, CancellationTokenSource pendingRequestsCts, CancellationToken originalCancellationToken)
   --- End of inner exception stack trace ---
   at System.Threading.Tasks.Task.ThrowIfExceptional(Boolean includeTaskCanceledExceptions)
   at System.Threading.Tasks.Task`1.GetResultCore(Boolean waitCompletionNotification)
   at Warp.WorkerConsole.SetFileOutput(String path) in /home/warp/build/warp/WarpLib/WorkerWrapper.cs:line 540
   at WarpTools.Commands.BaseCommand.<>c__DisplayClass1_0.<IterateOverItems>b__0(Int32 iitem, Int32 threadID) in /home/warp/build/warp/WarpTools/Commands/BaseCommand.cs:line 103
   at Warp.Tools.Helper.ForCPUGreedy(Int32 fromInclusive, Int32 toExclusive, Int32 nThreads, Action`1 funcSetup, Action`2 funcIterator, Action`1 funcTeardown) in /home/warp/build/warp/WarpLib/Tools/Helper.cs:line 786
   at WarpTools.Commands.BaseCommand.IterateOverItems(WorkerWrapper[] workers, BaseOptions cli, Action`2 body, Int32 oversubscribe) in /home/warp/build/warp/WarpTools/Commands/BaseCommand.cs:line 67
   at WarpTools.Commands.TemplateMatchTiltseries.Run(Object options) in /home/warp/build/warp/WarpTools/Commands/Tiltseries/TemplateMatchTiltseries.cs:line 301
   at WarpTools.WarpTools.Run(Object options) in /home/warp/build/warp/WarpTools/Program.cs:line 30
   at Warp.Tools.CommandLineParserHelper.ParseAndRun(String[] args, Func`2 run, Type[] verbs, String appName) in /home/warp/build/warp/WarpLib/Tools/CommandLineParserHelper.cs:line 26
   at WarpTools.WarpTools.Main(String[] args) in /home/warp/build/warp/WarpTools/Program.cs:line 17
   at WarpTools.WarpTools.<Main>(String[] args)

Initially I thought it was running out of VRAM, since I was running two processes on a single GPU while sampling so many orientations, but the same thing happens if I set processes to 1 and use a 48GB A6000 on a 128-core, 1TB RAM server (also bare-metal Linux, not WSL). I also tested a different, smaller EMDB map (hence the log above). No joy there either.

Appears to be similar to issue here: pytorch/pytorch#88038

Not tried using CUDA 11.8 yet, it's next on my to do list.

If CUDA 11.8 works, I will update accordingly.


rbs-sci · 2024-05-04T06:57:20Z

No, still crashes the same way if the warp_build conda environment is using CUDA 11.8.

I've tried disabling the firewall temporarily as well, since it appears to be complaining about connection refused localhost, but the firewall has never stopped any other local programs from running (CryoSPARC, mainly).

alisterburt · 2024-05-04T08:33:18Z

Thanks for the detailed report!

The "connection refused" is interesting and I haven't come across it before. The master process communicates with the worker processes over a REST API and it seems like this communication is what's failing

You mentioned disabling the firewall, did this resolve the issue?

cc @dtegunov

rbs-sci · 2024-05-04T10:48:28Z

> Thanks for the detailed report!

Thanks for the Linux Warp! 😀

Also, I tried installing CUDA 11.8 as the system CUDA and setting that as default, which also doesn't fix this.

> You mentioned disabling the firewall, did this resolve the issue?

It did not. Port changes, but it changes every time it's run.

I've set it up on my main account as well (just in case).

I think the port is a bit of a red herring if I'm honest as it later reports a core dump. So I think the worker fails (the CUFFT error) but the master process is complaining it can't find it and is much more verbose. After the CUFFT error is thrown, there is no output for 4-5 seconds before the (.NET?) output. I'm definitely scratching my head over why template matching makes it faceplant rather than earlier steps...
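The "red herring" reading above can be illustrated with a minimal sketch, assuming nothing about Warp's actual worker code: a plain TCP listener stands in for the worker, and `probe` is a hypothetical helper. Once the listening process is gone, any further connection attempt to its port is refused by the OS, which matches the symptom the master process reports after the worker has already aborted.

```python
import socket


def probe(host: str, port: int, timeout: float = 1.0) -> str:
    """Try a TCP connection and report what happened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"
    except ConnectionRefusedError:
        return "refused"
    except OSError as exc:
        return f"error: {exc}"


# Stand-in for a worker: listen on an ephemeral localhost port.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)
port = listener.getsockname()[1]

while_alive = probe("127.0.0.1", port)  # listener is still up
listener.close()                        # the "worker" dies
after_exit = probe("127.0.0.1", port)   # nothing listens on the port any more

print(f"while alive: {while_alive}, after exit: {after_exit}")
```

The firewall plays no part in this sketch, which is consistent with disabling it making no difference.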

Output from a firewall disabled run:

Running command ts_template_match with:
tomo_angpix = 10
template_path = null
template_emdb = 37176
template_angpix = null
template_diameter = 130
template_flip = False
symmetry = O
subdivisions = 3
peak_distance = null
npeaks = 2000
dont_normalize = False
dont_whiten = False
reuse_results = False
check_hand = 0
subvolume_size = 192
device_list = {  }
perdevice = 1
workers = {  }
settings = warp_tiltseries.settings
input_data = {  }
input_data_recursive = False
input_processing = null
output_processing = null

No alternative input specified, will use input parameters from warp_tiltseries.settings
File search will be relative to /data/ray/warp_apoF_test/tomostar
5 files found
Parsing previous results for each item, if available...
5/5, previous metadata found for 5                                                                                                
EMD-37176 already exists in /data/ray/warp_apoF_test/warp_tiltseries/template/emd_37176.mrc, skipping download
Setting --template_angpix to 1.912 based on template map
Using 1536 orientations for matching
Connecting to workers...
Connected to 1 workers
0/5terminate called after throwing an instance of 'std::runtime_error'
  what():  cuFFT error: CUFFT_INTERNAL_ERROR at /home/ray/build/warp/NativeAcceleration/gtom/src/FFT/IFFT.cu:24

Failed to process /data/ray/warp_apoF_test/warp_tiltseries/TS_11.tomostar, marked as unselected                                   
Unhandled exception. System.AggregateException: One or more errors occurred. (Connection refused (localhost:42641))
 ---> System.Net.Http.HttpRequestException: Connection refused (localhost:42641)
 ---> System.Net.Sockets.SocketException (111): Connection refused
   at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.ThrowException(SocketError error, CancellationToken cancellationToken)
   at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(Int16 token)
   at System.Net.Sockets.Socket.<ConnectAsync>g__WaitForConnectWithCancellation|285_0(AwaitableSocketAsyncEventArgs saea, ValueTask connectTask, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.ConnectToTcpHostAsync(String host, Int32 port, HttpRequestMessage initialRequest, Boolean async, CancellationToken cancellationToken)
   --- End of inner exception stack trace ---
   at System.Net.Http.HttpConnectionPool.ConnectToTcpHostAsync(String host, Int32 port, HttpRequestMessage initialRequest, Boolean async, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.ConnectAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.CreateHttp11ConnectionAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.AddHttp11ConnectionAsync(QueueItem queueItem)
   at System.Threading.Tasks.TaskCompletionSourceWithCancellation`1.WaitWithCancellationAsync(CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.SendWithVersionDetectionAndRetryAsync(HttpRequestMessage request, Boolean async, Boolean doRequestAuth, CancellationToken cancellationToken)
   at System.Net.Http.RedirectHandler.SendAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
   at System.Net.Http.HttpClient.<SendAsync>g__Core|83_0(HttpRequestMessage request, HttpCompletionOption completionOption, CancellationTokenSource cts, Boolean disposeCts, CancellationTokenSource pendingRequestsCts, CancellationToken originalCancellationToken)
   --- End of inner exception stack trace ---
   at System.Threading.Tasks.Task.ThrowIfExceptional(Boolean includeTaskCanceledExceptions)
   at System.Threading.Tasks.Task`1.GetResultCore(Boolean waitCompletionNotification)
   at Warp.WorkerConsole.SetFileOutput(String path) in /home/ray/build/warp/WarpLib/WorkerWrapper.cs:line 540
   at WarpTools.Commands.BaseCommand.<>c__DisplayClass1_0.<IterateOverItems>b__0(Int32 iitem, Int32 threadID) in /home/ray/build/warp/WarpTools/Commands/BaseCommand.cs:line 103
   at Warp.Tools.Helper.ForCPUGreedy(Int32 fromInclusive, Int32 toExclusive, Int32 nThreads, Action`1 funcSetup, Action`2 funcIterator, Action`1 funcTeardown) in /home/ray/build/warp/WarpLib/Tools/Helper.cs:line 786
   at WarpTools.Commands.BaseCommand.IterateOverItems(WorkerWrapper[] workers, BaseOptions cli, Action`2 body, Int32 oversubscribe) in /home/ray/build/warp/WarpTools/Commands/BaseCommand.cs:line 67
   at WarpTools.Commands.TemplateMatchTiltseries.Run(Object options) in /home/ray/build/warp/WarpTools/Commands/Tiltseries/TemplateMatchTiltseries.cs:line 301
   at WarpTools.WarpTools.Run(Object options) in /home/ray/build/warp/WarpTools/Program.cs:line 30
   at Warp.Tools.CommandLineParserHelper.ParseAndRun(String[] args, Func`2 run, Type[] verbs, String appName) in /home/ray/build/warp/WarpLib/Tools/CommandLineParserHelper.cs:line 26
   at WarpTools.WarpTools.Main(String[] args) in /home/ray/build/warp/WarpTools/Program.cs:line 17
   at WarpTools.WarpTools.<Main>(String[] args)
[1]    89077 IOT instruction (core dumped)  WarpTools ts_template_match --settings warp_tiltseries.settings --tomo_angpix

And one with the default EMDB entry (1 GPU process) (firewall still disabled):

Running command ts_template_match with:
tomo_angpix = 10
template_path = null
template_emdb = 15854
template_angpix = null
template_diameter = 130
template_flip = False
symmetry = O
subdivisions = 3
peak_distance = null
npeaks = 2000
dont_normalize = False
dont_whiten = False
reuse_results = False
check_hand = 0
subvolume_size = 192
device_list = {  }
perdevice = 1
workers = {  }
settings = warp_tiltseries.settings
input_data = {  }
input_data_recursive = False
input_processing = null
output_processing = null

No alternative input specified, will use input parameters from warp_tiltseries.settings
File search will be relative to /data/ray/warp_apoF_test/tomostar
5 files found
Parsing previous results for each item, if available...
5/5, previous metadata found for 5                                                                                                
Downloading map from EMDB: 100.00%                                                                                                
Extracting downloaded map... Done
Setting --template_angpix to 0.7289999 based on template map
Using 1536 orientations for matching
Connecting to workers...
Connected to 1 workers
0/5terminate called after throwing an instance of 'std::runtime_error'
  what():  cuFFT error: CUFFT_INTERNAL_ERROR at /home/ray/build/warp/NativeAcceleration/gtom/src/FFT/IFFT.cu:24

Failed to process /data/ray/warp_apoF_test/warp_tiltseries/TS_11.tomostar, marked as unselected                                   
Unhandled exception. System.AggregateException: One or more errors occurred. (Connection refused (localhost:37821))
 ---> System.Net.Http.HttpRequestException: Connection refused (localhost:37821)
 ---> System.Net.Sockets.SocketException (111): Connection refused
   at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.ThrowException(SocketError error, CancellationToken cancellationToken)
   at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(Int16 token)
   at System.Net.Sockets.Socket.<ConnectAsync>g__WaitForConnectWithCancellation|285_0(AwaitableSocketAsyncEventArgs saea, ValueTask connectTask, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.ConnectToTcpHostAsync(String host, Int32 port, HttpRequestMessage initialRequest, Boolean async, CancellationToken cancellationToken)
   --- End of inner exception stack trace ---
   at System.Net.Http.HttpConnectionPool.ConnectToTcpHostAsync(String host, Int32 port, HttpRequestMessage initialRequest, Boolean async, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.ConnectAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.CreateHttp11ConnectionAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.AddHttp11ConnectionAsync(QueueItem queueItem)
   at System.Threading.Tasks.TaskCompletionSourceWithCancellation`1.WaitWithCancellationAsync(CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.SendWithVersionDetectionAndRetryAsync(HttpRequestMessage request, Boolean async, Boolean doRequestAuth, CancellationToken cancellationToken)
   at System.Net.Http.RedirectHandler.SendAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
   at System.Net.Http.HttpClient.<SendAsync>g__Core|83_0(HttpRequestMessage request, HttpCompletionOption completionOption, CancellationTokenSource cts, Boolean disposeCts, CancellationTokenSource pendingRequestsCts, CancellationToken originalCancellationToken)
   --- End of inner exception stack trace ---
   at System.Threading.Tasks.Task.ThrowIfExceptional(Boolean includeTaskCanceledExceptions)
   at System.Threading.Tasks.Task`1.GetResultCore(Boolean waitCompletionNotification)
   at Warp.WorkerConsole.SetFileOutput(String path) in /home/ray/build/warp/WarpLib/WorkerWrapper.cs:line 540
   at WarpTools.Commands.BaseCommand.<>c__DisplayClass1_0.<IterateOverItems>b__0(Int32 iitem, Int32 threadID) in /home/ray/build/warp/WarpTools/Commands/BaseCommand.cs:line 103
   at Warp.Tools.Helper.ForCPUGreedy(Int32 fromInclusive, Int32 toExclusive, Int32 nThreads, Action`1 funcSetup, Action`2 funcIterator, Action`1 funcTeardown) in /home/ray/build/warp/WarpLib/Tools/Helper.cs:line 786
   at WarpTools.Commands.BaseCommand.IterateOverItems(WorkerWrapper[] workers, BaseOptions cli, Action`2 body, Int32 oversubscribe) in /home/ray/build/warp/WarpTools/Commands/BaseCommand.cs:line 67
   at WarpTools.Commands.TemplateMatchTiltseries.Run(Object options) in /home/ray/build/warp/WarpTools/Commands/Tiltseries/TemplateMatchTiltseries.cs:line 301
   at WarpTools.WarpTools.Run(Object options) in /home/ray/build/warp/WarpTools/Program.cs:line 30
   at Warp.Tools.CommandLineParserHelper.ParseAndRun(String[] args, Func`2 run, Type[] verbs, String appName) in /home/ray/build/warp/WarpLib/Tools/CommandLineParserHelper.cs:line 26
   at WarpTools.WarpTools.Main(String[] args) in /home/ray/build/warp/WarpTools/Program.cs:line 17
   at WarpTools.WarpTools.<Main>(String[] args)
[1]    90022 IOT instruction (core dumped)  WarpTools ts_template_match --settings warp_tiltseries.settings --tomo_angpix

alisterburt · 2024-05-04T21:13:21Z

@rbs-sci is there anything in <processing_dir>/logs that might tell us what's going on? The tracebacks from each worker process should be written into log files in that directory

rbs-sci · 2024-05-07T01:04:11Z

You mean in, e.g.: warp_tiltseries? There is no log directory in the main processing directory.

There are no errors listed:

2024-05-02 09:03:18.552 Received "TomoMatch", with 3 arguments, for GPU #0, 15770 MB free:
2024-05-02 09:03:18.773 Loading...
2024-05-02 09:03:18.774 0%
2024-05-02 09:03:18.969 Preparing template...
2024-05-02 09:03:18.969 0%
2024-05-02 09:03:24.361 Matching...
2024-05-02 09:03:24.361 0%

Sorry for the delay in replying. It was a public holiday here and I was trying to grab a little downtime.

rbs-sci · 2024-05-07T02:35:34Z

OK, a little experimentation shows that it might be memory usage after all. What prompted that is the "15770 MB free" figure in the log above.

I ran template picking with --perdevice 1 on a new build directly on our A6000 box. Template picking seems to spike VRAM usage when starting, although it might be hard to log (?): I only caught it on one tomogram (TS_23), and on trying again I couldn't catch it with nvidia-smi.
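A short spike like this is easier to catch with a polling loop than by watching nvidia-smi by hand. The following is a minimal sketch, not part of Warp: the function names are hypothetical, and it only relies on the standard `nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits` query, which prints one value in MiB per GPU. Each nvidia-smi invocation takes some time to start, so very brief spikes can still slip between samples.

```python
import subprocess
import time


def parse_memory_used(csv_text: str) -> list[int]:
    """Parse one MiB value per GPU from nvidia-smi CSV output."""
    return [int(line.strip()) for line in csv_text.splitlines() if line.strip()]


def query_memory_used() -> list[int]:
    """Ask nvidia-smi for the memory currently in use on every GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_memory_used(result.stdout)


def track_peak(duration_s: float, interval_s: float = 0.1, query=query_memory_used) -> list[int]:
    """Poll repeatedly and keep the highest reading seen for each GPU."""
    peak: list[int] = []
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        used = query()
        peak = used if not peak else [max(p, u) for p, u in zip(peak, used)]
        time.sleep(interval_s)
    return peak


# Example (needs an NVIDIA driver): start this in a second terminal just
# before launching template matching, then read off the per-GPU peaks.
# print(track_peak(duration_s=120.0))
```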

So the CUFFT/CUIFFT crash is reproducible, but whether the FFT or the IFFT runs out of VRAM seems random: on a system with 6x A4000 GPUs, five were used, of which four crashed with CUFFT errors and one with a CUIFFT error. Running --perdevice 2 on an A5000 gave the same crash. One process on the A6000 spiked VRAM to 16.88GB, which would explain the crashes on the A4000 and on the A5000 with two processes. As I said in the first post, I also tried one process on the A6000, but I guess I screwed something up then, because the fresh build works.
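The arithmetic behind that explanation can be written out as a rough sketch. It assumes the 16.88GB peak measured on the A6000 applies unchanged to every card and ignores driver overhead and GB/GiB differences; the card sizes are nominal, and `max_processes` is a hypothetical helper.

```python
PEAK_GB = 16.88  # peak VRAM per process, as measured on the A6000 above


def max_processes(total_gb: float, peak_per_process_gb: float = PEAK_GB) -> int:
    """How many matching processes fit on one card at the observed peak."""
    return int(total_gb // peak_per_process_gb)


# Nominal VRAM sizes in GB for the cards mentioned in this thread.
cards = {"A4000": 16, "A5000": 24, "RTX 3090": 24, "A6000": 48}

for name, total_gb in cards.items():
    print(f"{name}: {max_processes(total_gb)} process(es)")
```

Under these assumptions the 16GB card cannot hold even one process, and the 24GB cards hold one but not two.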

I'll experiment a little and see whether reconstructing a higher binned tomogram influences picking VRAM usage so that it can be done on A4000s.

dtegunov · 2024-05-07T03:05:10Z

This sounds like a good reason to add a parameter to influence the memory footprint. I'll look into it.

alisterburt · 2024-05-07T07:55:31Z

@rbs-sci thanks for getting back to us and hope you enjoyed the downtime! 🙂

Some extra context: subvolumes are batched along the Y dimension of the tomogram for matching. If your 2D data have a large pixel size (e.g. 2.5-3 Å), then 10 Å/px is not very downsampled and yields large tomograms, and thus a large number of subvolumes in that batch. I assume @dtegunov is going to try to make the memory requirements independent of tomogram size so we can avoid this in the future 🙂
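To put rough numbers on this, here is a small sketch. The 4096 px detector width is a made-up example value, the tiling is a naive non-overlapping split using the subvolume_size of 192 from the log, and both helper names are hypothetical; how Warp actually tiles and batches subvolumes may differ.

```python
import math


def binned_extent(n_pixels: int, raw_angpix: float, tomo_angpix: float) -> int:
    """Tomogram extent in voxels along one axis after rescaling."""
    return round(n_pixels * raw_angpix / tomo_angpix)


def naive_tiles(extent: int, tile: int) -> int:
    """Non-overlapping tiles needed to cover one axis (assumed tiling)."""
    return math.ceil(extent / tile)


DETECTOR_PX = 4096  # hypothetical detector width
TOMO_ANGPIX = 10.0  # --tomo_angpix used in this thread
SUBVOLUME = 192     # subvolume_size from the log

for raw_angpix in (1.0, 2.5):
    extent = binned_extent(DETECTOR_PX, raw_angpix, TOMO_ANGPIX)
    tiles = naive_tiles(extent, SUBVOLUME)
    print(f"{raw_angpix} A/px raw: {extent} voxels, {tiles} tiles along one axis")
```

With these example values the same 10 Å/px target gives 410 voxels along one axis for 1.0 Å/px data and 1024 voxels for 2.5 Å/px data, so the coarser raw pixel size produces a much larger tomogram to tile.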

rbs-sci · 2024-05-07T08:15:06Z

Thanks @alisterburt! 😄 Also thanks @dtegunov for looking into this further.

I was testing this with the EMPIAR script you provided. Now that it's all working, I'll be applying it to my own data. Looking forward to it.

Might it be related to the number of projections generated and tested against? 1536 is a lot of views for an octahedral template, so the ability to control how many orientations are used might be advantageous...

It's interesting because all the earlier stages (e.g.: motion correction) will happily run four processes on a single 16GB A4000 and not run out of memory.

dmichalak · 2024-05-09T19:57:43Z

I also ran into this error while running the EMPIAR script on 4x 24GB RTX 3090s. After changing to --perdevice 1, template matching worked and each GPU reached 16.5/24 GB during the process.

alisterburt · 2024-05-10T21:48:55Z

Going to close here - the script is not designed to be run on everyone's machines; it was for us to test on our infrastructure. The docs (current version at https://warpem.github.io/warp) don't default to --perdevice 2 and explain the --perdevice mechanism for people 🙂

rbs-sci · 2024-05-10T22:16:31Z

Thanks, I should have marked as closed earlier, apologies.

dtegunov · 2024-05-18T01:07:12Z

aa63b22 reduces the default consumption below 8 GB, and adds a --batch_angles parameter to ts_template_match to regulate it.

alisterburt closed this as completed May 10, 2024
