
RuntimeError: CUDA out of memory while training the recognition model #95

Closed
light201212 opened this issue Sep 10, 2020 · 3 comments

Comments

@light201212
Contributor

Could anyone help me out? When I try to train the recognition model, GPU memory seems to grow continuously until I hit RuntimeError: CUDA out of memory. How can I fix this?
(An old GTX 1080; bs=16. Only about 3 GB of VRAM is used at startup, then it rises gradually.)
2020-09-10 06:55:41,145 - torchocr - INFO - [0/200] - [7650/85482] - lr:0.001 - loss:0.8423 - acc:0.2500 - norm_edit_dis:0.8858 - time:3.2507
2020-09-10 06:55:44,284 - torchocr - INFO - [0/200] - [7660/85482] - lr:0.001 - loss:0.8961 - acc:0.1875 - norm_edit_dis:0.8449 - time:3.1395
2020-09-10 06:55:47,402 - torchocr - INFO - [0/200] - [7670/85482] - lr:0.001 - loss:0.5101 - acc:0.5625 - norm_edit_dis:0.9381 - time:3.1169
2020-09-10 06:55:49,011 - torchocr - ERROR - Traceback (most recent call last):
File "tools/rec_train.py", line 237, in train
loss_dict['loss'].backward()
File "/opt/conda/lib/python3.7/site-packages/torch/tensor.py", line 185, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/opt/conda/lib/python3.7/site-packages/torch/autograd/__init__.py", line 127, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: CUDA out of memory. Tried to allocate 232.00 MiB (GPU 0; 7.93 GiB total capacity; 6.42 GiB already allocated; 110.19 MiB free; 7.22 GiB reserved in total by PyTorch)
Exception raised from malloc at /pytorch/c10/cuda/CUDACachingAllocator.cpp:272 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f19c980e1e2 in /opt/conda/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: + 0x1e64b (0x7f19c9a6464b in /opt/conda/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: + 0x1f464 (0x7f19c9a65464 in /opt/conda/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #3: + 0x1faa1 (0x7f19c9a65aa1 in /opt/conda/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #4: at::native::empty_cuda(c10::ArrayRef, c10::TensorOptions const&, c10::optional<c10::MemoryFormat>) + 0x11e (0x7f19cc78c52e in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #5: + 0xf51329 (0x7f19cabc8329 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #6: + 0xf6b157 (0x7f19cabe2157 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0x10e9c7d (0x7f1a0194cc7d in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #8: + 0x10e9f97 (0x7f1a0194cf97 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #9: at::empty(c10::ArrayRef, c10::TensorOptions const&, c10::optional<c10::MemoryFormat>) + 0xfa (0x7f1a01a57a1a in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #10: at::native::empty_like(at::Tensor const&, c10::TensorOptions const&, c10::optional<c10::MemoryFormat>) + 0x49e (0x7f1a016d5c3e in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #11: + 0x12880c1 (0x7f1a01aeb0c1 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #12: + 0x12c3863 (0x7f1a01b26863 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #13: at::empty_like(at::Tensor const&, c10::TensorOptions const&, c10::optional<c10::MemoryFormat>) + 0x101 (0x7f1a01a3ab31 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #14: at::native::contiguous(at::Tensor const&, c10::MemoryFormat) + 0x89 (0x7f1a016f2469 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #15: + 0x1290470 (0x7f1a01af3470 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #16: + 0x12c351f (0x7f1a01b2651f in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #17: at::Tensor::contiguous(c10::MemoryFormat) const + 0xe8 (0x7f1a01b912e8 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #18: at::Tensor at::native::(anonymous namespace)::host_softmax_backward<at::native::(anonymous namespace)::LogSoftMaxBackwardEpilogue, true>(at::Tensor const&, at::Tensor const&, long, bool) + 0x14b (0x7f19cc01826b in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #19: at::native::log_softmax_backward_cuda(at::Tensor const&, at::Tensor const&, long, at::Tensor const&) + 0x65a (0x7f19cc0026da in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #20: + 0xf3efa0 (0x7f19cabb5fa0 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #21: + 0x11141d6 (0x7f1a019771d6 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #22: at::_log_softmax_backward_data(at::Tensor const&, at::Tensor const&, long, at::Tensor const&) + 0x119 (0x7f1a01a05649 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #23: + 0x2ec639f (0x7f1a0372939f in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #24: + 0x11141d6 (0x7f1a019771d6 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #25: at::_log_softmax_backward_data(at::Tensor const&, at::Tensor const&, long, at::Tensor const&) + 0x119 (0x7f1a01a05649 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #26: torch::autograd::generated::LogSoftmaxBackward::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) + 0x1d7 (0x7f1a035a5057 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #27: + 0x3375bb7 (0x7f1a03bd8bb7 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #28: torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&, std::shared_ptr<torch::autograd::ReadyQueue> const&) + 0x1400 (0x7f1a03bd4400 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #29: torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&) + 0x451 (0x7f1a03bd4fa1 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #30: torch::autograd::Engine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) + 0x89 (0x7f1a03bcd119 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #31: torch::autograd::python::PythonEngine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) + 0x4a (0x7f1a1136dc8a in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #32: + 0xc70f (0x7f1a10a3070f in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #33: + 0x76ba (0x7f1a1454a6ba in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #34: clone + 0x6d (0x7f1a1428041d in /lib/x86_64-linux-gnu/libc.so.6)
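For context on why memory can climb like this with a fixed batch size: in a CNN recognition backbone, activation memory scales linearly with the input image width, so a batch that happens to contain unusually wide images allocates far more than the steady-state footprint. A rough back-of-envelope sketch (all shapes hypothetical, not taken from this repo's config):

```python
def activation_bytes(batch, channels, height, width, dtype_bytes=4):
    """Rough size of one float32 feature map: batch * C * H * W * 4 bytes."""
    return batch * channels * height * width * dtype_bytes

# A typical recognition crop vs. an unusually wide one (hypothetical shapes):
normal = activation_bytes(16, 512, 8, 80)    # ~20 MiB
wide = activation_bytes(16, 512, 8, 1600)    # ~400 MiB, 20x larger
```

A single such layer jumping from ~20 MiB to ~400 MiB, multiplied across the network's layers plus their gradients, is enough to exhaust an 8 GB card even though earlier batches fit comfortably.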

@light201212
Contributor Author

Solved.

@garspace

Some of your images are too wide. Either filter that data out, or set a smaller batch_size.
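A minimal sketch of the first suggestion, pre-filtering overly wide samples before building the dataset. The `(width, height, label)` tuple layout and the 800 px cap are assumptions for illustration, not this repo's actual label format:

```python
def filter_wide_samples(samples, max_width=800):
    """Keep only samples whose image width is within the cap.

    `samples` is assumed to be an iterable of (width, height, label) tuples;
    adapt the predicate to however your label file stores image sizes.
    """
    return [s for s in samples if s[0] <= max_width]

samples = [(120, 32, "cat"), (2400, 32, "very long banner"), (640, 32, "sign")]
print(filter_wide_samples(samples))  # the 2400 px sample is dropped
```

Running this filter once over the label list (or inside the dataset's constructor) keeps the worst-case batch shape bounded, so peak memory stays predictable without shrinking batch_size.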

