terminate called after throwing an instance of 'c10::Error' #2526

thies1006 · 2020-08-25T15:42:26Z

I am trying to train a transformer model with model parallelism, closely following the Megatron example from fairseq. The only difference is that I use a complete transformer model instead of GPT; the options are the same, including --fp16.
https://github.com/pytorch/fairseq/tree/master/examples/megatron_11b

My setup is: two nodes with 6 GPUs (Titan RTX) each.
PyTorch 1.6
CUDA 10.1.243
Ubuntu 18.04 LTS

The model trains, but only with 4 GPUs per node. When I switch to 6 GPUs per node (and adjust the model and dictionary sizes so they are divisible by the number of GPUs), I get the following error on the second node, right when training should start:

terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: misaligned address
Exception raised from create_event_internal at /pytorch/c10/cuda/CUDACachingAllocator.cpp:687 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7fc06b9c91e2 in /secondary/thies/.virtualenvs/pytorch-1.6/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0xad2 (0x7fc06bc17f92 in /secondary/thies/.virtualenvs/pytorch-1.6/lib/python3.6/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7fc06b9b79cd in /secondary/thies/.virtualenvs/pytorch-1.6/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #3: std::vector<at::Tensor, std::allocator<at::Tensor> >::~vector() + 0x5c (0x7fc0b3262d1c in /secondary/thies/.virtualenvs/pytorch-1.6/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #4: torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&, std::shared_ptr<torch::autograd::ReadyQueue> const&) + 0x16b2 (0x7fc0a5d8f6b2 in /secondary/thies/.virtualenvs/pytorch-1.6/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #5: torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&) + 0x451 (0x7fc0a5d8ffa1 in /secondary/thies/.virtualenvs/pytorch-1.6/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #6: torch::autograd::Engine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) + 0x89 (0x7fc0a5d88119 in /secondary/thies/.virtualenvs/pytorch-1.6/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #7: torch::autograd::python::PythonEngine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) + 0x4a (0x7fc0b352834a in /secondary/thies/.virtualenvs/pytorch-1.6/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #8: + 0xbd6ef (0x7fc0b46826ef in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #9: + 0x76db (0x7fc0b85746db in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #10: clone + 0x3f (0x7fc0b88ad88f in /lib/x86_64-linux-gnu/libc.so.6)

The model trains fine when I remove the --fp16 option.
Combining --fp16 with --ddp-backend=no_c10d does not work either.
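
For reference, the divisibility adjustment mentioned above (making the model and dictionary sizes divisible by the number of GPUs) can be sketched roughly as follows. This is only an illustration, not fairseq code: the helper names and all sizes are hypothetical, and it assumes the model-parallel size equals the 6 GPUs per node.

```python
# Minimal sketch: check which sizes are not divisible by the
# model-parallel size and round them up to the next multiple.
# All names and numbers here are hypothetical, not taken from fairseq.


def round_up_to_multiple(value, multiple):
    """Return the smallest multiple of `multiple` that is >= `value`."""
    return ((value + multiple - 1) // multiple) * multiple


def find_indivisible(dims, model_parallel_size):
    """Return the names of all sizes not divisible by `model_parallel_size`."""
    return [name for name, size in dims.items() if size % model_parallel_size != 0]


# Hypothetical sizes, chosen only for illustration.
dims = {
    "vocab_size": 50264,
    "embed_dim": 1024,
    "ffn_embed_dim": 4096,
    "attention_heads": 16,
}

model_parallel_size = 6  # assumption: one model-parallel rank per GPU on a node

# Pad every offending size up to the next multiple of the parallel size.
padded = dict(dims)
for name in find_indivisible(dims, model_parallel_size):
    padded[name] = round_up_to_multiple(dims[name], model_parallel_size)

print(padded)
```

With these example sizes nothing needs adjusting for 4 GPUs, while every entry has to be padded for 6, which is why the switch required changes to both the model and the dictionary.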

thies1006 closed this as completed Aug 25, 2020

yumi-cn mentioned this issue Dec 11, 2020

About apex and the args "--fp16" facebookresearch/NSVF#33

Closed