
RuntimeError: CUDA error: no kernel image is available for execution on the device #2

Open
lsj1111 opened this issue Mar 12, 2024 · 11 comments


lsj1111 commented Mar 12, 2024

When I ran the code in the environment described above, I got this error. Has the author encountered it, and how was it solved?

@yuhongtian17 (Owner)

Check that your GPU model, the CUDA version reported by `nvidia-smi`, the CUDA version reported by `nvcc -V`, and your PyTorch version all match (for example, the RTX 4090 requires CUDA 11.6 or later). If you are not sure, you can also send us this information and we will help you check.
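For a quick local check, a minimal Python sketch like the following works (nothing here is specific to this repo; the printed values are yours to interpret):

```python
# Minimal sanity check that the GPU, driver, and PyTorch build agree.
import torch

print(torch.__version__)                    # installed PyTorch version
print(torch.version.cuda)                   # CUDA version this PyTorch build was compiled for
print(torch.cuda.is_available())            # False usually means a driver/build mismatch
print(torch.cuda.get_device_name(0))        # GPU model
print(torch.cuda.get_device_capability(0))  # compute capability, e.g. (8, 9) for an RTX 4090
```

If `is_available()` returns True but kernels still fail with "no kernel image is available", the installed PyTorch build does not include compiled kernels for your GPU's compute capability, and you need a build targeting a newer CUDA version.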


lsj1111 commented Mar 13, 2024

But once I upgrade both CUDA and torch, I can no longer install the MMCV version your source code depends on, which leads to many bugs at runtime, because MMCV 2.0 introduced too many breaking changes.

@yuhongtian17 (Owner)

Our code is written against mmrotate-0.3.3/0.3.4. In other words, if the mmrotate-0.3.3/0.3.4 example code (e.g. Rotated Faster R-CNN with an r50 backbone) runs, our code will run as well; if even the official code fails, please report an issue in the mmrotate repository. You should choose CUDA and PyTorch versions that suit your GPU model, not necessarily the latest ones.
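As a sanity check (a sketch; the exact versions that work for your GPU may differ), you can verify that the installed OpenMMLab stack is the one mmrotate-0.3.3/0.3.4 expects, i.e. mmcv-full 1.x and mmdet 2.x rather than the MMCV 2.0 line:

```python
# Verify the installed stack matches what mmrotate 0.3.3/0.3.4 targets:
# mmcv-full 1.x (not mmcv 2.x) and mmdet 2.x.
import mmcv
import mmdet
import mmrotate

print('mmcv:    ', mmcv.__version__)
print('mmdet:   ', mmdet.__version__)
print('mmrotate:', mmrotate.__version__)
assert mmcv.__version__.startswith('1.'), 'mmcv 2.x is incompatible with mmrotate 0.3.x'
```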


lsj1111 commented Mar 22, 2024

Everything works fine when I load the MAE-pretrained ViT weights, but when I load your pretrained weights they do not seem to load successfully, and the results are very poor. The code is shown below:
pretrained = r'D:\Tool\Datasets\pretrained_weight\Spatial-Transform-Decoupling-Models\pretrained\mae_pretrain_vit_base_full.pth'

angle_version = 'le90'
norm_cfg = dict(type='LN', requires_grad=True)
model = dict(
    type='RotatedimTED',
    # pretrained=pretrained,
    proposals_dim=6,
    backbone=dict(
        type='VisionTransformer',
        init_cfg=dict(type='Pretrained', checkpoint=pretrained),
        img_size=224,

@yuhongtian17 (Owner)

This description is not enough to locate the error. Could you provide the full config file and the error message?


lsj1111 commented Mar 28, 2024

Let me summarize the problem. The downloadable pretrained weights you provide include several files: the first is the MAE-pretrained ViT weights -> mae_pretrain_vit_base_full.pth, and the second is the weights trained for 12 epochs with your STD module -> epoch_12.pth. But something strange happens: when I train from the first weights, the mAP I get is far higher than with the second. In theory, shouldn't training from the second weights give better results? I am confused; please explain.


lsj1111 commented Mar 28, 2024

When I load your epoch_12 weights, I get the following error message:

missing keys in source state_dict: fc_cls.weight, fc_cls.bias, decoder_blocks.2.layer_reg.norms.0.weight, decoder_blocks.2.layer_reg.norms.0.bias, decoder_blocks.2.layer_reg.norms.1.weight, decoder_blocks.2.layer_reg.norms.1.bias, decoder_blocks.2.layer_reg.norms.2.weight, decoder_blocks.2.layer_reg.norms.2.bias, decoder_blocks.2.layer_reg.convs.0.weight, decoder_blocks.2.layer_reg.convs.0.bias, decoder_blocks.2.layer_reg.convs.1.weight, decoder_blocks.2.layer_reg.convs.1.bias, decoder_blocks.2.layer_reg.convs.2.weight, decoder_blocks.2.layer_reg.convs.2.bias, decoder_blocks.2.layer_reg.norm_reg.weight, decoder_blocks.2.layer_reg.norm_reg.bias, decoder_blocks.2.layer_reg.fc_reg.weight, decoder_blocks.2.layer_reg.fc_reg.bias, decoder_blocks.3.layer_reg.norms.0.weight, decoder_blocks.3.layer_reg.norms.0.bias, decoder_blocks.3.layer_reg.norms.1.weight, decoder_blocks.3.layer_reg.norms.1.bias, decoder_blocks.3.layer_reg.norms.2.weight, decoder_blocks.3.layer_reg.norms.2.bias, decoder_blocks.3.layer_reg.convs.0.weight, decoder_blocks.3.layer_reg.convs.0.bias, decoder_blocks.3.layer_reg.convs.1.weight, decoder_blocks.3.layer_reg.convs.1.bias, decoder_blocks.3.layer_reg.convs.2.weight, decoder_blocks.3.layer_reg.convs.2.bias, decoder_blocks.3.layer_reg.norm_reg.weight, decoder_blocks.3.layer_reg.norm_reg.bias, decoder_blocks.3.layer_reg.fc_reg.weight, decoder_blocks.3.layer_reg.fc_reg.bias, decoder_blocks.4.layer_reg.norms.0.weight, decoder_blocks.4.layer_reg.norms.0.bias, decoder_blocks.4.layer_reg.norms.1.weight, decoder_blocks.4.layer_reg.norms.1.bias, decoder_blocks.4.layer_reg.convs.0.weight, decoder_blocks.4.layer_reg.convs.0.bias, decoder_blocks.4.layer_reg.convs.1.weight, decoder_blocks.4.layer_reg.convs.1.bias, decoder_blocks.4.layer_reg.norm_reg.weight, decoder_blocks.4.layer_reg.norm_reg.bias, decoder_blocks.4.layer_reg.fc_reg.weight, decoder_blocks.4.layer_reg.fc_reg.bias, decoder_blocks.5.layer_reg.norms.0.weight, decoder_blocks.5.layer_reg.norms.0.bias, decoder_blocks.5.layer_reg.norms.1.weight, decoder_blocks.5.layer_reg.norms.1.bias, decoder_blocks.5.layer_reg.convs.0.weight, decoder_blocks.5.layer_reg.convs.0.bias, decoder_blocks.5.layer_reg.convs.1.weight, decoder_blocks.5.layer_reg.convs.1.bias, decoder_blocks.5.layer_reg.norm_reg.weight, decoder_blocks.5.layer_reg.norm_reg.bias, decoder_blocks.5.layer_reg.fc_reg.weight, decoder_blocks.5.layer_reg.fc_reg.bias, decoder_blocks.6.layer_reg.norms.0.weight, decoder_blocks.6.layer_reg.norms.0.bias, decoder_blocks.6.layer_reg.convs.0.weight, decoder_blocks.6.layer_reg.convs.0.bias, decoder_blocks.6.layer_reg.norm_reg.weight, decoder_blocks.6.layer_reg.norm_reg.bias, decoder_blocks.6.layer_reg.fc_reg.weight, decoder_blocks.6.layer_reg.fc_reg.bias, decoder_blocks.7.layer_reg.norms.0.weight, decoder_blocks.7.layer_reg.norms.0.bias, decoder_blocks.7.layer_reg.convs.0.weight, decoder_blocks.7.layer_reg.convs.0.bias, decoder_blocks.7.layer_reg.norm_reg.weight, decoder_blocks.7.layer_reg.norm_reg.bias, decoder_blocks.7.layer_reg.fc_reg.weight, decoder_blocks.7.layer_reg.fc_reg.bias, decoder_box_norm.weight, decoder_box_norm.bias

I do not know where the problem is.

@yuhongtian17 (Owner)

Here is what is happening: epoch_12.pth stores its weights under the variable names defined by the detector, so its state-dict keys do not match the keys in the MAE-pretrained checkpoint, while the detector's init_weight function looks up values and loads weights using the MAE-pretrained keys. Therefore epoch_12.pth cannot be used directly as the detector's pretrained model to start fully supervised training from scratch; if you do, the weights fail to load and the model is effectively randomly initialized. If you want to fine-tune from these weights, load them the way the whole model is loaded at test time, via load_state_dict, and then start training.
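A minimal sketch of that fine-tuning route, assuming the mmcv 1.x / mmrotate 0.3.x APIs and a placeholder config path:

```python
# Fine-tune from epoch_12.pth by loading the *whole* detector state_dict,
# as test-time loading does, instead of passing the file as the backbone's
# MAE-style init_cfg (whose keys will not match).
from mmcv import Config
from mmcv.runner import load_checkpoint
from mmrotate.models import build_detector

cfg = Config.fromfile('configs/my_std_config.py')  # placeholder: your STD config
model = build_detector(cfg.model)
# strict=False reports missing/unexpected keys instead of raising.
load_checkpoint(model, 'epoch_12.pth', map_location='cpu', strict=False)
```

Equivalently, setting `load_from = 'epoch_12.pth'` in the training config (rather than pointing the backbone's init_cfg at this file) loads the full-model checkpoint before training starts.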

@quxianjiuguo

Can I use a single 4090 with this repo to train on my own dataset?

@yuhongtian17 (Owner)

See issue #7

@WenLinLliu

> Can I use a single 4090 with this repo to train on my own dataset?

Yes, you can, no problem.
