Eval always fails with: (External) CUDA error(700), an illegal memory access was encountered. #9333

Open
2 of 3 tasks
heyangdev opened this issue Mar 21, 2025 · 4 comments
@heyangdev

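Search before asking

  • I have searched the issues and found no similar bug report.

Bug Component

Validation

Describe the Bug

  1. The train stage runs without problems, but as soon as eval starts it fails with: (External) CUDA error(700), an illegal memory access was encountered.
  2. The error occurs when training mask_rcnn and cascade_mask_rcnn; training solov2 does not show this problem.
  3. When training mask_rcnn and cascade_mask_rcnn with the learning rate raised to 0.1/0.01, the loss oscillates severely but eval runs normally; with a smaller learning rate the loss decreases normally, but eval fails.
  4. Error details: training uses two GPUs (A100 40G) with a batch size of 1 for both train and eval. Even so, during eval the memory usage on GPU 0 keeps rising until it runs out of memory, while the memory usage on GPU 1 does not change.
  5. I tried paddle2.6.2 cuda11.2 / paddle2.5.2 cuda11.2 / paddle2.3.2 cuda11.2 / paddle2.2.2 cuda11.2; none of them resolved the problem.
  6. The error details under paddle2.3.2 cuda11.2 are as follows: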

I0321 16:33:57.914856 26795 nccl_context.cc:83] init nccl context nranks: 2 local rank: 0 gpu id: 0 ring id: 0
W0321 16:33:59.374146 26795 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 8.0, Driver API Version: 12.1, Runtime API Version: 11.2
W0321 16:33:59.379195 26795 gpu_resources.cc:91] device: 0, cuDNN Version: 8.5.
loading annotations into memory...
Done (t=0.12s)
creating index...
index created!
[03/21 16:34:02] ppdet.utils.checkpoint INFO: Finish loading model weights: /home/hy/.cache/paddle/weights/ResNet50_cos_pretrained.pdparams
Traceback (most recent call last):
  File "tools/train.py", line 172, in <module>
    main()
  File "tools/train.py", line 168, in main
    run(FLAGS, cfg)
  File "tools/train.py", line 132, in run
    trainer.train(FLAGS.eval)
  File "/home/hy/PaddleDetection/ppdet/engine/trainer.py", line 506, in train
    outputs = model(data)
  File "/home/hy/anaconda3/envs/ppdet/lib/python3.8/site-packages/paddle/fluid/dygraph/layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "/home/hy/anaconda3/envs/ppdet/lib/python3.8/site-packages/paddle/fluid/dygraph/layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "/home/hy/anaconda3/envs/ppdet/lib/python3.8/site-packages/paddle/fluid/dygraph/parallel.py", line 752, in forward
    outputs = self._layers(*inputs, **kwargs)
  File "/home/hy/anaconda3/envs/ppdet/lib/python3.8/site-packages/paddle/fluid/dygraph/layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "/home/hy/anaconda3/envs/ppdet/lib/python3.8/site-packages/paddle/fluid/dygraph/layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "/home/hy/PaddleDetection/ppdet/modeling/architectures/meta_arch.py", line 59, in forward
    out = self.get_loss()
  File "/home/hy/PaddleDetection/ppdet/modeling/architectures/cascade_rcnn.py", line 125, in get_loss
    rpn_loss, bbox_loss, mask_loss = self._forward()
  File "/home/hy/PaddleDetection/ppdet/modeling/architectures/cascade_rcnn.py", line 87, in _forward
    body_feats = self.backbone(self.inputs)
  File "/home/hy/anaconda3/envs/ppdet/lib/python3.8/site-packages/paddle/fluid/dygraph/layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "/home/hy/anaconda3/envs/ppdet/lib/python3.8/site-packages/paddle/fluid/dygraph/layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "/home/hy/PaddleDetection/ppdet/modeling/backbones/resnet.py", line 582, in forward
    x = stage(x)
  File "/home/hy/anaconda3/envs/ppdet/lib/python3.8/site-packages/paddle/fluid/dygraph/layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "/home/hy/anaconda3/envs/ppdet/lib/python3.8/site-packages/paddle/fluid/dygraph/layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "/home/hy/PaddleDetection/ppdet/modeling/backbones/resnet.py", line 423, in forward
    block_out = block(block_out)
  File "/home/hy/anaconda3/envs/ppdet/lib/python3.8/site-packages/paddle/fluid/dygraph/layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "/home/hy/anaconda3/envs/ppdet/lib/python3.8/site-packages/paddle/fluid/dygraph/layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "/home/hy/PaddleDetection/ppdet/modeling/backbones/resnet.py", line 362, in forward
    out = self.branch2c(out)
  File "/home/hy/anaconda3/envs/ppdet/lib/python3.8/site-packages/paddle/fluid/dygraph/layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "/home/hy/anaconda3/envs/ppdet/lib/python3.8/site-packages/paddle/fluid/dygraph/layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "/home/hy/PaddleDetection/ppdet/modeling/backbones/resnet.py", line 118, in forward
    out = self.conv(inputs)
  File "/home/hy/anaconda3/envs/ppdet/lib/python3.8/site-packages/paddle/fluid/dygraph/layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "/home/hy/anaconda3/envs/ppdet/lib/python3.8/site-packages/paddle/fluid/dygraph/layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "/home/hy/anaconda3/envs/ppdet/lib/python3.8/site-packages/paddle/nn/layer/conv.py", line 666, in forward
    out = F.conv._conv_nd(
  File "/home/hy/anaconda3/envs/ppdet/lib/python3.8/site-packages/paddle/nn/functional/conv.py", line 144, in _conv_nd
    pre_bias = getattr(_C_ops, op_type)(x, weight, *attrs)
SystemError: (Fatal) Operator conv2d raises an paddle::memory::allocation::BadAlloc exception.
The exception content is
:ResourceExhaustedError:

Out of memory error on GPU 0. Cannot allocate 128.000000MB memory on GPU 0, 39.401306GB memory has been allocated and available memory is only 30.312500MB.
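
The sizes in the out-of-memory message above can be extracted and compared mechanically. The sketch below is only an illustration: the helper name `parse_oom_message` and the regular expression are assumptions fitted to the exact wording of this one message and are not part of Paddle.

```python
import re

# Hypothetical helper: the pattern is fitted to the wording of the message
# quoted above and may not match messages from other Paddle versions.
OOM_PATTERN = re.compile(
    r"Cannot allocate (?P<req>[\d.]+)(?P<req_unit>[KMG]B) memory on GPU (?P<gpu>\d+), "
    r"(?P<used>[\d.]+)(?P<used_unit>[KMG]B) memory has been allocated and "
    r"available memory is only (?P<free>[\d.]+)(?P<free_unit>[KMG]B)"
)

# Conversion factors from the unit suffix to megabytes.
UNIT_TO_MB = {"KB": 1 / 1024, "MB": 1.0, "GB": 1024.0}


def parse_oom_message(text):
    """Return the GPU id and the requested/allocated/available sizes in MB, or None."""
    m = OOM_PATTERN.search(text)
    if m is None:
        return None
    return {
        "gpu": int(m["gpu"]),
        "requested_mb": float(m["req"]) * UNIT_TO_MB[m["req_unit"]],
        "allocated_mb": float(m["used"]) * UNIT_TO_MB[m["used_unit"]],
        "available_mb": float(m["free"]) * UNIT_TO_MB[m["free_unit"]],
    }


message = (
    "Out of memory error on GPU 0. Cannot allocate 128.000000MB memory on GPU 0, "
    "39.401306GB memory has been allocated and available memory is only 30.312500MB."
)
info = parse_oom_message(message)
shortfall = info["requested_mb"] - info["available_mb"]
print(f"GPU {info['gpu']}: short by {shortfall:.4f} MB")
```

For the message in this report, the 128 MB request exceeds the 30.3125 MB still available by 97.6875 MB, with roughly 39.4 GB already allocated on GPU 0.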

Environment

Linux 18.04
paddlepaddle-gpu 2.2.2.post11.2
cuda 11.2
cudnn 8.5
python 3.8

Bug description confirmation

  • I confirm that the bug replication steps, code change instructions, and environment information have been provided, and the problem can be reproduced.

Are you willing to submit a PR?

  • I'd like to help by submitting a PR!
@zhangyubo0722
Collaborator

You have run out of GPU memory; try reducing batch_size.

@heyangdev
Author

You have run out of GPU memory; try reducing batch_size.

The setup when the error occurred was two A100 40G GPUs with a batch_size of 1.

@zhangyubo0722
Collaborator

Please try Paddle 3.0, and also confirm whether the batch size for evaluation has actually been applied.
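
To confirm which batch size each reader ends up with, the merged configuration can be inspected directly. The sketch below works on a plain dictionary. The `TrainReader` and `EvalReader` sections with a `batch_size` field follow the usual PaddleDetection reader layout, while the helper `reader_batch_sizes` and the sample values are hypothetical.

```python
def reader_batch_sizes(cfg, readers=("TrainReader", "EvalReader", "TestReader")):
    """Map each reader section to its batch_size, or None when it is not set."""
    sizes = {}
    for name in readers:
        section = cfg.get(name) or {}
        sizes[name] = section.get("batch_size")
    return sizes


# Hypothetical merged config; in practice this would come from the loaded YAML files.
sample_cfg = {
    "TrainReader": {"batch_size": 1, "shuffle": True},
    "EvalReader": {"batch_size": 1},
}

for name, size in reader_batch_sizes(sample_cfg).items():
    print(name, "batch_size =", size)
```

A value of `None` means the section or field is missing from the dictionary, so whatever default the framework applies would be in effect rather than an explicit setting.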

@heyangdev
Author

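Please try Paddle 3.0, and also confirm whether the batch size for evaluation has actually been applied.

I tried paddle 3.0.0.post118 (batch size 1) and still get the same error.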
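
Because the report describes memory on GPU 0 growing during eval while GPU 1 stays flat, logging per-GPU usage over time could show when the growth starts. This is a minimal sketch, assuming `nvidia-smi` is on the PATH; the function names and the sample CSV text are hypothetical.

```python
import subprocess
import time

# Ask nvidia-smi for one "index, memory.used" row per GPU, in MiB, without headers.
QUERY = ["nvidia-smi", "--query-gpu=index,memory.used", "--format=csv,noheader,nounits"]


def parse_memory_used(csv_text):
    """Parse 'index, memory.used' CSV rows into {gpu_index: used_MiB}."""
    usage = {}
    for line in csv_text.strip().splitlines():
        index, used = (field.strip() for field in line.split(","))
        usage[int(index)] = int(used)
    return usage


def poll_gpu_memory(interval_s=1.0, samples=10):
    """Print per-GPU memory usage every interval_s seconds, samples times."""
    for _ in range(samples):
        out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
        print(time.strftime("%H:%M:%S"), parse_memory_used(out))
        time.sleep(interval_s)


# Made-up sample text in the shape nvidia-smi prints for two GPUs.
print(parse_memory_used("0, 1000\n1, 200\n"))
```

Calling `poll_gpu_memory()` in a second terminal while training runs would print one line per sample, which makes it easier to see whether usage on GPU 0 starts climbing when evaluation begins.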
