Template picking crashes with CUFFT_INTERNAL_ERROR #44

Closed
rbs-sci opened this issue May 4, 2024 · 13 comments

Comments

@rbs-sci

rbs-sci commented May 4, 2024

System:
Ubuntu 22.04 (latest updates)
CUDA 12.4 (driver 550.54.15)
Ryzen 5700G
64GB RAM
Quadro A4000 (16GB)

Linux Warp/M appears to build correctly using provided scripts - a few warnings, but nothing looks related.

Test script runs great until template picking, at which point the following error occurs:

File search will be relative to /data/warp/r5_apoF_test/tomostar
5 files found
Parsing previous results for each item, if available...
5/5, previous metadata found for 5                                                                                                
Downloading map from EMDB: 100.00%                                                                                                
Extracting downloaded map... Done
Setting --template_angpix to 1.912 based on template map
Using 1536 orientations for matching
Connecting to workers...
Connected to 1 workers
0/5terminate called after throwing an instance of 'std::runtime_error'
  what():  cuFFT error: CUFFT_INTERNAL_ERROR at /home/warp/build/warp/NativeAcceleration/gtom/src/FFT/IFFT.cu:24

Failed to process /data/warp/r5_apoF_test/warp_tiltseries/TS_11.tomostar, marked as unselected                                     
Unhandled exception. System.AggregateException: One or more errors occurred. (Connection refused (localhost:39631))
 ---> System.Net.Http.HttpRequestException: Connection refused (localhost:39631)
 ---> System.Net.Sockets.SocketException (111): Connection refused
   at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.ThrowException(SocketError error, CancellationToken cancellationToken)
   at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(Int16 token)
   at System.Net.Sockets.Socket.<ConnectAsync>g__WaitForConnectWithCancellation|285_0(AwaitableSocketAsyncEventArgs saea, ValueTask connectTask, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.ConnectToTcpHostAsync(String host, Int32 port, HttpRequestMessage initialRequest, Boolean async, CancellationToken cancellationToken)
   --- End of inner exception stack trace ---
   at System.Net.Http.HttpConnectionPool.ConnectToTcpHostAsync(String host, Int32 port, HttpRequestMessage initialRequest, Boolean async, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.ConnectAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.CreateHttp11ConnectionAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.AddHttp11ConnectionAsync(QueueItem queueItem)
   at System.Threading.Tasks.TaskCompletionSourceWithCancellation`1.WaitWithCancellationAsync(CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.SendWithVersionDetectionAndRetryAsync(HttpRequestMessage request, Boolean async, Boolean doRequestAuth, CancellationToken cancellationToken)
   at System.Net.Http.RedirectHandler.SendAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
   at System.Net.Http.HttpClient.<SendAsync>g__Core|83_0(HttpRequestMessage request, HttpCompletionOption completionOption, CancellationTokenSource cts, Boolean disposeCts, CancellationTokenSource pendingRequestsCts, CancellationToken originalCancellationToken)
   --- End of inner exception stack trace ---
   at System.Threading.Tasks.Task.ThrowIfExceptional(Boolean includeTaskCanceledExceptions)
   at System.Threading.Tasks.Task`1.GetResultCore(Boolean waitCompletionNotification)
   at Warp.WorkerConsole.SetFileOutput(String path) in /home/warp/build/warp/WarpLib/WorkerWrapper.cs:line 540
   at WarpTools.Commands.BaseCommand.<>c__DisplayClass1_0.<IterateOverItems>b__0(Int32 iitem, Int32 threadID) in /home/warp/build/warp/WarpTools/Commands/BaseCommand.cs:line 103
   at Warp.Tools.Helper.ForCPUGreedy(Int32 fromInclusive, Int32 toExclusive, Int32 nThreads, Action`1 funcSetup, Action`2 funcIterator, Action`1 funcTeardown) in /home/warp/build/warp/WarpLib/Tools/Helper.cs:line 786
   at WarpTools.Commands.BaseCommand.IterateOverItems(WorkerWrapper[] workers, BaseOptions cli, Action`2 body, Int32 oversubscribe) in /home/warp/build/warp/WarpTools/Commands/BaseCommand.cs:line 67
   at WarpTools.Commands.TemplateMatchTiltseries.Run(Object options) in /home/warp/build/warp/WarpTools/Commands/Tiltseries/TemplateMatchTiltseries.cs:line 301
   at WarpTools.WarpTools.Run(Object options) in /home/warp/build/warp/WarpTools/Program.cs:line 30
   at Warp.Tools.CommandLineParserHelper.ParseAndRun(String[] args, Func`2 run, Type[] verbs, String appName) in /home/warp/build/warp/WarpLib/Tools/CommandLineParserHelper.cs:line 26
   at WarpTools.WarpTools.Main(String[] args) in /home/warp/build/warp/WarpTools/Program.cs:line 17
   at WarpTools.WarpTools.<Main>(String[] args)

Initially I thought it was running out of VRAM due to running two processes on a single GPU while sampling so many orientations, but the same thing happens if I set processes to 1 and use a 48GB A6000 on a 128-core, 1TB RAM server (also bare-metal Linux, not WSL). I also tested a different, smaller EMDB map (hence the log above). No joy there either.

Appears to be similar to issue here: pytorch/pytorch#88038

I have not tried CUDA 11.8 yet; it's next on my to-do list. If CUDA 11.8 works, I will update accordingly.

@rbs-sci
Author

rbs-sci commented May 4, 2024

No, still crashes the same way if the warp_build conda environment is using CUDA 11.8.

I've tried disabling the firewall temporarily as well, since it appears to be complaining about connection refused localhost, but the firewall has never stopped any other local programs from running (CryoSPARC, mainly).

@alisterburt
Contributor

Thanks for the detailed report!

The "connection refused" is interesting and I haven't come across it before. The master process communicates with the worker processes over a REST API, and it seems like this communication is what's failing.

You mentioned disabling the firewall, did this resolve the issue?

cc @dtegunov

@rbs-sci
Author

rbs-sci commented May 4, 2024

> Thanks for the detailed report!

Thanks for the Linux Warp! 😀

Also, I tried installing CUDA 11.8 as the system CUDA and setting that as default, which also doesn't fix this.

> You mentioned disabling the firewall, did this resolve the issue?

It did not. The port changes, but it changes every time it's run.

I've set it up on my main account as well (just in case).

I think the port is a bit of a red herring if I'm honest as it later reports a core dump. So I think the worker fails (the CUFFT error) but the master process is complaining it can't find it and is much more verbose. After the CUFFT error is thrown, there is no output for 4-5 seconds before the (.NET?) output. I'm definitely scratching my head over why template matching makes it faceplant rather than earlier steps...
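The "red herring" reading can be reproduced in isolation: once a process that was listening on a local port has exited, the next connection attempt to that port is refused, whatever the firewall is doing. The following is a minimal Python sketch that assumes nothing about Warp's actual worker code; the `probe` helper and the throwaway listening socket are illustrative stand-ins for the master and the worker.

```python
import errno
import socket


def probe(port: int, host: str = "127.0.0.1", timeout: float = 1.0) -> str:
    """Try to open a TCP connection and report what happened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "listening"
    except ConnectionRefusedError:
        # Nothing is bound to the port any more (e.g. the process died)
        return "refused"
    except OSError as exc:
        return f"error: {errno.errorcode.get(exc.errno, exc)}"


# Hypothetical stand-in for a worker: listen on an OS-assigned local port
worker = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
worker.bind(("127.0.0.1", 0))
worker.listen(1)
port = worker.getsockname()[1]

print(port, probe(port))  # while the stand-in worker is alive
worker.close()            # simulate the worker process dying
print(port, probe(port))  # the same port now refuses connections
```

On Linux the refused connection corresponds to errno 111 (ECONNREFUSED), the same code that appears in the .NET trace, so the master's error is consistent with the worker having already gone away.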

Output from a firewall disabled run:

Running command ts_template_match with:
tomo_angpix = 10
template_path = null
template_emdb = 37176
template_angpix = null
template_diameter = 130
template_flip = False
symmetry = O
subdivisions = 3
peak_distance = null
npeaks = 2000
dont_normalize = False
dont_whiten = False
reuse_results = False
check_hand = 0
subvolume_size = 192
device_list = {  }
perdevice = 1
workers = {  }
settings = warp_tiltseries.settings
input_data = {  }
input_data_recursive = False
input_processing = null
output_processing = null

No alternative input specified, will use input parameters from warp_tiltseries.settings
File search will be relative to /data/ray/warp_apoF_test/tomostar
5 files found
Parsing previous results for each item, if available...
5/5, previous metadata found for 5                                                                                                
EMD-37176 already exists in /data/ray/warp_apoF_test/warp_tiltseries/template/emd_37176.mrc, skipping download
Setting --template_angpix to 1.912 based on template map
Using 1536 orientations for matching
Connecting to workers...
Connected to 1 workers
0/5terminate called after throwing an instance of 'std::runtime_error'
  what():  cuFFT error: CUFFT_INTERNAL_ERROR at /home/ray/build/warp/NativeAcceleration/gtom/src/FFT/IFFT.cu:24

Failed to process /data/ray/warp_apoF_test/warp_tiltseries/TS_11.tomostar, marked as unselected                                   
Unhandled exception. System.AggregateException: One or more errors occurred. (Connection refused (localhost:42641))
 ---> System.Net.Http.HttpRequestException: Connection refused (localhost:42641)
 ---> System.Net.Sockets.SocketException (111): Connection refused
   at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.ThrowException(SocketError error, CancellationToken cancellationToken)
   at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(Int16 token)
   at System.Net.Sockets.Socket.<ConnectAsync>g__WaitForConnectWithCancellation|285_0(AwaitableSocketAsyncEventArgs saea, ValueTask connectTask, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.ConnectToTcpHostAsync(String host, Int32 port, HttpRequestMessage initialRequest, Boolean async, CancellationToken cancellationToken)
   --- End of inner exception stack trace ---
   at System.Net.Http.HttpConnectionPool.ConnectToTcpHostAsync(String host, Int32 port, HttpRequestMessage initialRequest, Boolean async, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.ConnectAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.CreateHttp11ConnectionAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.AddHttp11ConnectionAsync(QueueItem queueItem)
   at System.Threading.Tasks.TaskCompletionSourceWithCancellation`1.WaitWithCancellationAsync(CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.SendWithVersionDetectionAndRetryAsync(HttpRequestMessage request, Boolean async, Boolean doRequestAuth, CancellationToken cancellationToken)
   at System.Net.Http.RedirectHandler.SendAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
   at System.Net.Http.HttpClient.<SendAsync>g__Core|83_0(HttpRequestMessage request, HttpCompletionOption completionOption, CancellationTokenSource cts, Boolean disposeCts, CancellationTokenSource pendingRequestsCts, CancellationToken originalCancellationToken)
   --- End of inner exception stack trace ---
   at System.Threading.Tasks.Task.ThrowIfExceptional(Boolean includeTaskCanceledExceptions)
   at System.Threading.Tasks.Task`1.GetResultCore(Boolean waitCompletionNotification)
   at Warp.WorkerConsole.SetFileOutput(String path) in /home/ray/build/warp/WarpLib/WorkerWrapper.cs:line 540
   at WarpTools.Commands.BaseCommand.<>c__DisplayClass1_0.<IterateOverItems>b__0(Int32 iitem, Int32 threadID) in /home/ray/build/warp/WarpTools/Commands/BaseCommand.cs:line 103
   at Warp.Tools.Helper.ForCPUGreedy(Int32 fromInclusive, Int32 toExclusive, Int32 nThreads, Action`1 funcSetup, Action`2 funcIterator, Action`1 funcTeardown) in /home/ray/build/warp/WarpLib/Tools/Helper.cs:line 786
   at WarpTools.Commands.BaseCommand.IterateOverItems(WorkerWrapper[] workers, BaseOptions cli, Action`2 body, Int32 oversubscribe) in /home/ray/build/warp/WarpTools/Commands/BaseCommand.cs:line 67
   at WarpTools.Commands.TemplateMatchTiltseries.Run(Object options) in /home/ray/build/warp/WarpTools/Commands/Tiltseries/TemplateMatchTiltseries.cs:line 301
   at WarpTools.WarpTools.Run(Object options) in /home/ray/build/warp/WarpTools/Program.cs:line 30
   at Warp.Tools.CommandLineParserHelper.ParseAndRun(String[] args, Func`2 run, Type[] verbs, String appName) in /home/ray/build/warp/WarpLib/Tools/CommandLineParserHelper.cs:line 26
   at WarpTools.WarpTools.Main(String[] args) in /home/ray/build/warp/WarpTools/Program.cs:line 17
   at WarpTools.WarpTools.<Main>(String[] args)
[1]    89077 IOT instruction (core dumped)  WarpTools ts_template_match --settings warp_tiltseries.settings --tomo_angpix

And one with the default EMDB entry (1 GPU process) (firewall still disabled):

Running command ts_template_match with:
tomo_angpix = 10
template_path = null
template_emdb = 15854
template_angpix = null
template_diameter = 130
template_flip = False
symmetry = O
subdivisions = 3
peak_distance = null
npeaks = 2000
dont_normalize = False
dont_whiten = False
reuse_results = False
check_hand = 0
subvolume_size = 192
device_list = {  }
perdevice = 1
workers = {  }
settings = warp_tiltseries.settings
input_data = {  }
input_data_recursive = False
input_processing = null
output_processing = null

No alternative input specified, will use input parameters from warp_tiltseries.settings
File search will be relative to /data/ray/warp_apoF_test/tomostar
5 files found
Parsing previous results for each item, if available...
5/5, previous metadata found for 5                                                                                                
Downloading map from EMDB: 100.00%                                                                                                
Extracting downloaded map... Done
Setting --template_angpix to 0.7289999 based on template map
Using 1536 orientations for matching
Connecting to workers...
Connected to 1 workers
0/5terminate called after throwing an instance of 'std::runtime_error'
  what():  cuFFT error: CUFFT_INTERNAL_ERROR at /home/ray/build/warp/NativeAcceleration/gtom/src/FFT/IFFT.cu:24

Failed to process /data/ray/warp_apoF_test/warp_tiltseries/TS_11.tomostar, marked as unselected                                   
Unhandled exception. System.AggregateException: One or more errors occurred. (Connection refused (localhost:37821))
 ---> System.Net.Http.HttpRequestException: Connection refused (localhost:37821)
 ---> System.Net.Sockets.SocketException (111): Connection refused
   at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.ThrowException(SocketError error, CancellationToken cancellationToken)
   at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(Int16 token)
   at System.Net.Sockets.Socket.<ConnectAsync>g__WaitForConnectWithCancellation|285_0(AwaitableSocketAsyncEventArgs saea, ValueTask connectTask, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.ConnectToTcpHostAsync(String host, Int32 port, HttpRequestMessage initialRequest, Boolean async, CancellationToken cancellationToken)
   --- End of inner exception stack trace ---
   at System.Net.Http.HttpConnectionPool.ConnectToTcpHostAsync(String host, Int32 port, HttpRequestMessage initialRequest, Boolean async, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.ConnectAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.CreateHttp11ConnectionAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.AddHttp11ConnectionAsync(QueueItem queueItem)
   at System.Threading.Tasks.TaskCompletionSourceWithCancellation`1.WaitWithCancellationAsync(CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.SendWithVersionDetectionAndRetryAsync(HttpRequestMessage request, Boolean async, Boolean doRequestAuth, CancellationToken cancellationToken)
   at System.Net.Http.RedirectHandler.SendAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
   at System.Net.Http.HttpClient.<SendAsync>g__Core|83_0(HttpRequestMessage request, HttpCompletionOption completionOption, CancellationTokenSource cts, Boolean disposeCts, CancellationTokenSource pendingRequestsCts, CancellationToken originalCancellationToken)
   --- End of inner exception stack trace ---
   at System.Threading.Tasks.Task.ThrowIfExceptional(Boolean includeTaskCanceledExceptions)
   at System.Threading.Tasks.Task`1.GetResultCore(Boolean waitCompletionNotification)
   at Warp.WorkerConsole.SetFileOutput(String path) in /home/ray/build/warp/WarpLib/WorkerWrapper.cs:line 540
   at WarpTools.Commands.BaseCommand.<>c__DisplayClass1_0.<IterateOverItems>b__0(Int32 iitem, Int32 threadID) in /home/ray/build/warp/WarpTools/Commands/BaseCommand.cs:line 103
   at Warp.Tools.Helper.ForCPUGreedy(Int32 fromInclusive, Int32 toExclusive, Int32 nThreads, Action`1 funcSetup, Action`2 funcIterator, Action`1 funcTeardown) in /home/ray/build/warp/WarpLib/Tools/Helper.cs:line 786
   at WarpTools.Commands.BaseCommand.IterateOverItems(WorkerWrapper[] workers, BaseOptions cli, Action`2 body, Int32 oversubscribe) in /home/ray/build/warp/WarpTools/Commands/BaseCommand.cs:line 67
   at WarpTools.Commands.TemplateMatchTiltseries.Run(Object options) in /home/ray/build/warp/WarpTools/Commands/Tiltseries/TemplateMatchTiltseries.cs:line 301
   at WarpTools.WarpTools.Run(Object options) in /home/ray/build/warp/WarpTools/Program.cs:line 30
   at Warp.Tools.CommandLineParserHelper.ParseAndRun(String[] args, Func`2 run, Type[] verbs, String appName) in /home/ray/build/warp/WarpLib/Tools/CommandLineParserHelper.cs:line 26
   at WarpTools.WarpTools.Main(String[] args) in /home/ray/build/warp/WarpTools/Program.cs:line 17
   at WarpTools.WarpTools.<Main>(String[] args)
[1]    90022 IOT instruction (core dumped)  WarpTools ts_template_match --settings warp_tiltseries.settings --tomo_angpix

@alisterburt
Contributor

@rbs-sci is there anything in <processing_dir>/logs that might tell us what's going on? The tracebacks from each worker process should be written into log files in that directory
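A quick way to check those worker logs is to print the last few lines of every file in the directory. This is a minimal sketch, not part of Warp: the `tail_logs` helper is hypothetical, and it assumes only that the logs are plain-text files sitting directly in the directory passed on the command line.

```python
import sys
from collections import deque
from pathlib import Path
from typing import Dict, List


def tail_logs(log_dir: str, n_lines: int = 20) -> Dict[str, List[str]]:
    """Return the last n_lines of every regular file in log_dir, keyed by file name."""
    tails: Dict[str, List[str]] = {}
    for path in sorted(Path(log_dir).iterdir()):
        if not path.is_file():
            continue
        with path.open(errors="replace") as fh:
            # deque with maxlen keeps only the final n_lines while streaming the file
            tails[path.name] = [line.rstrip("\n") for line in deque(fh, maxlen=n_lines)]
    return tails


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python tail_logs.py <processing_dir>/logs
    for name, lines in tail_logs(sys.argv[1]).items():
        print(f"==> {name} <==")
        print("\n".join(lines))
```

If the native code aborts the worker process, as the `terminate called` message suggests, the last log line may simply be the step that was running at the time rather than an error message.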

@rbs-sci
Author

rbs-sci commented May 7, 2024

You mean in, e.g.: warp_tiltseries? There is no log directory in the main processing directory.

There are no errors listed.

2024-05-02 09:03:18.552 Received "TomoMatch", with 3 arguments, for GPU #0, 15770 MB free:
2024-05-02 09:03:18.773 Loading...
2024-05-02 09:03:18.774 0%
2024-05-02 09:03:18.969 Preparing template...
2024-05-02 09:03:18.969 0%
2024-05-02 09:03:24.361 Matching...
2024-05-02 09:03:24.361 0%

Sorry for the delay in replying. It was a public holiday here and I was trying to grab a little downtime.

@rbs-sci
Author

rbs-sci commented May 7, 2024

OK, a little experimentation shows that it might be memory usage after all. What prompted that was the "15770 MB free" figure in the log above.

I ran template picking with --perdevice 1 on a new build straight on our A6000 box. Template picking seems to spike VRAM when starting, although it might be hard to log (?), as I only caught it on one tomogram (TS_23) and, trying again, I can't catch it with nvidia-smi.

So the CUFFT/CUIFFT crash is reproducible, but whether FFT or IFFT runs out of VRAM seems random: on a system with 6x A4000 GPUs, five were used; four crashed with CUFFT errors and one crashed with a CUIFFT error. Running --perdevice 2 on an A5000 gave the same crash. One process on the A6000 spiked VRAM to 16.88GB, which would explain the crash on the A4000 and on the A5000 with 2 processes. As I said in the first post, I also tried 1 process on the A6000, but I guess I screwed something up because the fresh build works.

I'll experiment a little and see whether reconstructing a higher binned tomogram influences picking VRAM usage so that it can be done on A4000s.
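Short VRAM spikes like the one described here are easy to miss when watching nvidia-smi by hand. One option is to poll it in a loop and keep only the per-GPU maximum. The sketch below is an illustration under stated assumptions: `parse_memory_used`, `sample_gpus` and `track_peak` are hypothetical helper names, while the `--query-gpu=memory.used --format=csv,noheader,nounits` query is the standard nvidia-smi interface, which reports MiB.

```python
import subprocess
import time
from typing import Callable, List

# Used memory in MiB, one line per GPU, without header or unit suffix
QUERY = ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"]


def parse_memory_used(output: str) -> List[int]:
    """Turn nvidia-smi CSV output into a list of MiB values, one per GPU."""
    return [int(line) for line in output.splitlines() if line.strip()]


def sample_gpus() -> List[int]:
    """Run nvidia-smi once and return the used memory of every visible GPU."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_memory_used(result.stdout)


def track_peak(sampler: Callable[[], List[int]], n_samples: int,
               interval_s: float = 0.1) -> List[int]:
    """Call sampler n_samples times and keep the per-GPU maximum seen so far."""
    peak: List[int] = []
    for _ in range(n_samples):
        used = sampler()
        peak = used if not peak else [max(p, u) for p, u in zip(peak, used)]
        time.sleep(interval_s)
    return peak
```

Running `track_peak(sample_gpus, n_samples=600)` in a second terminal while template matching starts would record the highest value observed over roughly a minute. Each nvidia-smi call takes some time to start, so a very brief spike can still fall between samples.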

@dtegunov
Contributor

dtegunov commented May 7, 2024

This sounds like a good reason to add a parameter to influence the memory footprint. I'll look into it.

@alisterburt
Contributor

@rbs-sci thanks for getting back to us and hope you enjoyed the downtime! 🙂

Some extra context: subvolumes are batched along the Y dimension of the tomogram for matching. If your 2D data have a large pixel size (e.g. 2.5-3 Å), then 10 Å/px is not very downsampled and yields large tomograms, and thus a large number of subvolumes in that batch. I assume @dtegunov is going to try to make the memory requirements independent of tomogram size so we can avoid this in the future 🙂
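The pixel-size point can be made concrete with a little arithmetic. The sketch below uses a hypothetical unbinned volume of 4000 x 4000 x 1000 px (not taken from this thread) and compares how many voxels it has at 10 Å/px when the raw pixel size is 1.0 Å versus 2.5 Å; `binned_shape` and `voxel_count` are illustrative helpers, not Warp functions.

```python
from typing import Tuple


def binned_shape(unbinned_shape_px: Tuple[int, ...], raw_angpix: float,
                 tomo_angpix: float) -> Tuple[int, ...]:
    """Voxel dimensions after resampling a volume from raw_angpix to tomo_angpix."""
    scale = raw_angpix / tomo_angpix
    return tuple(round(n * scale) for n in unbinned_shape_px)


def voxel_count(shape: Tuple[int, ...]) -> int:
    """Total number of voxels in a volume of the given shape."""
    count = 1
    for n in shape:
        count *= n
    return count


# Hypothetical unbinned tomogram dimensions in pixels (an assumption for illustration)
unbinned = (4000, 4000, 1000)

for raw_angpix in (1.0, 2.5):
    shape = binned_shape(unbinned, raw_angpix, tomo_angpix=10.0)
    print(f"{raw_angpix} A/px raw -> {shape} at 10 A/px, {voxel_count(shape):,} voxels")
```

With the same pixel count, going from 1.0 Å to 2.5 Å raw pixels multiplies each dimension of the 10 Å/px tomogram by 2.5 and the voxel count by 2.5 cubed, about 15.6 times, which is why the same settings can fit on one dataset and not on another.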

@rbs-sci
Author

rbs-sci commented May 7, 2024

Thanks @alisterburt! 😄 Also thanks @dtegunov for looking into this further.

I was testing this with the EMPIAR script you provided; now that it's all working, I'll be applying it to my data. Looking forward to it.

It might be related to the number of projections generated and tested against? 1536 is a lot of views for an octahedral template; the ability to control how many templates are used might be advantageous...

It's interesting because all the earlier stages (e.g.: motion correction) will happily run four processes on a single 16GB A4000 and not run out of memory.

@dmichalak

dmichalak commented May 9, 2024

I also ran into this error when running the EMPIAR script on 4x 24GB RTX 3090s. After changing to --perdevice 1, template matching worked and each GPU reached 16.5/24 GB during the process.
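The numbers reported in this thread line up with a simple back-of-the-envelope estimate. The sketch below assumes that each process needs roughly the 16.88 GB peak observed on the A6000, that the peaks of concurrent processes simply add up, and that GB and GiB can be treated alike; these are simplifications, and `max_processes` is a hypothetical helper rather than anything Warp provides.

```python
import math

# Peak VRAM for one template-matching process, as reported earlier in this thread
PEAK_GB = 16.88


def max_processes(vram_gb: float, peak_gb: float = PEAK_GB) -> int:
    """How many processes fit on one GPU if each needs peak_gb at the same time."""
    return math.floor(vram_gb / peak_gb)


# Nominal VRAM sizes of the cards mentioned in this thread
for name, vram_gb in [("A4000", 16), ("A5000", 24), ("RTX 3090", 24), ("A6000", 48)]:
    print(f"{name} ({vram_gb} GB): up to {max_processes(vram_gb)} process(es)")
```

Under these assumptions a 16 GB card cannot hold even one process, a 24 GB card holds one, and the 48 GB card holds two, which is consistent with the crashes and the successful --perdevice 1 runs described here.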

@alisterburt
Contributor

alisterburt commented May 10, 2024

Going to close here - the script is not designed to be run on everyone's machines, it was for us to test on our infrastructure. The docs (current version at https://warpem.github.io/warp) don't default to --perdevice 2 and explain the --perdevice mechanism for people 🙂

@rbs-sci
Author

rbs-sci commented May 10, 2024

Thanks. I should have marked this as closed earlier, apologies.

@dtegunov
Contributor

aa63b22 reduces the default consumption below 8 GB, and adds a --batch_angles parameter to ts_template_match to regulate it.
