
RuntimeError: CUDA out of memory while training the recognition model #95

Closed
light201212 opened this issue Sep 10, 2020 · 3 comments

Comments

@light201212
Contributor

Could anyone help me out? When I try to train the recognition model, GPU memory seems to grow continuously until I hit RuntimeError: CUDA out of memory. How can I fix this?
(An old GTX 1080; bs=16. Only about 3 GB of VRAM is used at startup, then it rises gradually.)
2020-09-10 06:55:41,145 - torchocr - INFO - [0/200] - [7650/85482] - lr:0.001 - loss:0.8423 - acc:0.2500 - norm_edit_dis:0.8858 - time:3.2507
2020-09-10 06:55:44,284 - torchocr - INFO - [0/200] - [7660/85482] - lr:0.001 - loss:0.8961 - acc:0.1875 - norm_edit_dis:0.8449 - time:3.1395
2020-09-10 06:55:47,402 - torchocr - INFO - [0/200] - [7670/85482] - lr:0.001 - loss:0.5101 - acc:0.5625 - norm_edit_dis:0.9381 - time:3.1169
2020-09-10 06:55:49,011 - torchocr - ERROR - Traceback (most recent call last):
File "tools/rec_train.py", line 237, in train
loss_dict['loss'].backward()
File "/opt/conda/lib/python3.7/site-packages/torch/tensor.py", line 185, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/opt/conda/lib/python3.7/site-packages/torch/autograd/__init__.py", line 127, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: CUDA out of memory. Tried to allocate 232.00 MiB (GPU 0; 7.93 GiB total capacity; 6.42 GiB already allocated; 110.19 MiB free; 7.22 GiB reserved in total by PyTorch)
Exception raised from malloc at /pytorch/c10/cuda/CUDACachingAllocator.cpp:272 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f19c980e1e2 in /opt/conda/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: + 0x1e64b (0x7f19c9a6464b in /opt/conda/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: + 0x1f464 (0x7f19c9a65464 in /opt/conda/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #3: + 0x1faa1 (0x7f19c9a65aa1 in /opt/conda/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #4: at::native::empty_cuda(c10::ArrayRef, c10::TensorOptions const&, c10::optional<c10::MemoryFormat>) + 0x11e (0x7f19cc78c52e in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #5: + 0xf51329 (0x7f19cabc8329 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #6: + 0xf6b157 (0x7f19cabe2157 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0x10e9c7d (0x7f1a0194cc7d in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #8: + 0x10e9f97 (0x7f1a0194cf97 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #9: at::empty(c10::ArrayRef, c10::TensorOptions const&, c10::optional<c10::MemoryFormat>) + 0xfa (0x7f1a01a57a1a in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #10: at::native::empty_like(at::Tensor const&, c10::TensorOptions const&, c10::optional<c10::MemoryFormat>) + 0x49e (0x7f1a016d5c3e in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #11: + 0x12880c1 (0x7f1a01aeb0c1 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #12: + 0x12c3863 (0x7f1a01b26863 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #13: at::empty_like(at::Tensor const&, c10::TensorOptions const&, c10::optional<c10::MemoryFormat>) + 0x101 (0x7f1a01a3ab31 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #14: at::native::contiguous(at::Tensor const&, c10::MemoryFormat) + 0x89 (0x7f1a016f2469 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #15: + 0x1290470 (0x7f1a01af3470 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #16: + 0x12c351f (0x7f1a01b2651f in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #17: at::Tensor::contiguous(c10::MemoryFormat) const + 0xe8 (0x7f1a01b912e8 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #18: at::Tensor at::native::(anonymous namespace)::host_softmax_backward<at::native::(anonymous namespace)::LogSoftMaxBackwardEpilogue, true>(at::Tensor const&, at::Tensor const&, long, bool) + 0x14b (0x7f19cc01826b in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #19: at::native::log_softmax_backward_cuda(at::Tensor const&, at::Tensor const&, long, at::Tensor const&) + 0x65a (0x7f19cc0026da in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #20: + 0xf3efa0 (0x7f19cabb5fa0 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #21: + 0x11141d6 (0x7f1a019771d6 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #22: at::_log_softmax_backward_data(at::Tensor const&, at::Tensor const&, long, at::Tensor const&) + 0x119 (0x7f1a01a05649 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #23: + 0x2ec639f (0x7f1a0372939f in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #24: + 0x11141d6 (0x7f1a019771d6 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #25: at::_log_softmax_backward_data(at::Tensor const&, at::Tensor const&, long, at::Tensor const&) + 0x119 (0x7f1a01a05649 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #26: torch::autograd::generated::LogSoftmaxBackward::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) + 0x1d7 (0x7f1a035a5057 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #27: + 0x3375bb7 (0x7f1a03bd8bb7 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #28: torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&, std::shared_ptr<torch::autograd::ReadyQueue> const&) + 0x1400 (0x7f1a03bd4400 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #29: torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&) + 0x451 (0x7f1a03bd4fa1 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #30: torch::autograd::Engine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) + 0x89 (0x7f1a03bcd119 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #31: torch::autograd::python::PythonEngine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) + 0x4a (0x7f1a1136dc8a in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #32: + 0xc70f (0x7f1a10a3070f in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #33: + 0x76ba (0x7f1a1454a6ba in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #34: clone + 0x6d (0x7f1a1428041d in /lib/x86_64-linux-gnu/libc.so.6)
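For context on why memory can climb like this with a fixed batch size: in a CNN recognition backbone, activation memory scales linearly with the input image width, so a batch that happens to contain unusually wide images allocates far more than the steady-state footprint. A rough back-of-envelope sketch (all shapes hypothetical, not taken from this repo's config):

```python
def activation_bytes(batch, channels, height, width, dtype_bytes=4):
    """Rough size of one float32 feature map: batch * C * H * W * 4 bytes."""
    return batch * channels * height * width * dtype_bytes

# A typical recognition crop vs. an unusually wide one (hypothetical shapes):
normal = activation_bytes(16, 512, 8, 80)    # ~20 MiB
wide = activation_bytes(16, 512, 8, 1600)    # ~400 MiB, 20x larger
```

A single such layer jumping from ~20 MiB to ~400 MiB, multiplied across the network's layers plus their gradients, is enough to exhaust an 8 GB card even though earlier batches fit comfortably.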

@light201212
Contributor Author

Solved.

@garspace

Some of your images are too wide. Either filter that data out, or set a smaller batch_size.
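A minimal sketch of the first suggestion, pre-filtering overly wide samples before building the dataset. The `(width, height, label)` tuple layout and the 800 px cap are assumptions for illustration, not this repo's actual label format:

```python
def filter_wide_samples(samples, max_width=800):
    """Keep only samples whose image width is within the cap.

    `samples` is assumed to be an iterable of (width, height, label) tuples;
    adapt the predicate to however your label file stores image sizes.
    """
    return [s for s in samples if s[0] <= max_width]

samples = [(120, 32, "cat"), (2400, 32, "very long banner"), (640, 32, "sign")]
print(filter_wide_samples(samples))  # the 2400 px sample is dropped
```

Running this filter once over the label list (or inside the dataset's constructor) keeps the worst-case batch shape bounded, so peak memory stays predictable without shrinking batch_size.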

