CUDA Out of Memory Error During ESAM Training
Issue Description
I encountered a CUDA out of memory error while training the ESAM model. I strictly followed the installation and dataset preparation instructions and did not modify any code. The training losses were NaN from the first epoch, and after a few epochs the process crashed with the out-of-memory error below.
Environment Information
I have completed all dependency installations as specified in installation.md:
- Python 3.8
- PyTorch 1.13.1
- CUDA 11.6
- mmengine 0.10.3
- mmdet3d 1.4.0
- mmcv 2.0.0
- mmdet 3.2.0
- MinkowskiEngine 0.5.4
Dataset Preparation
I have completed the ScanNet200 dataset preparation according to dataset_preparation.md, including all ScanNet200-SV related processing.
Reproduction Steps
I executed the following command to train the ESAM model without modifying any code:
CUDA_VISIBLE_DEVICES=0 python tools/train.py configs/ESAM_CA/ESAM_sv_scannet200_CA.py --work-dir work_dirs/ESAM_sv_scannet200_CA/
Error Information
CUDA Out of Memory
After running approximately 4 epochs, the following CUDA memory error occurred:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 MiB
(GPU 0; 23.70 GiB total capacity; 4.84 GiB already allocated; 3.56 MiB free;
4.94 GiB reserved in total by PyTorch)
The error stack trace shows the issue occurred at:
File "tools/train.py", line 95, in <module>
runner.train()
File ".../mmengine/runner/runner.py", line 1238, in train
self.train_loop.run()
...
File ".../torch/nn/functional.py", line 3136, in binary_cross_entropy_with_logits
return torch.binary_cross_entropy_with_logits(input, target, weight, pos_weight, reduction_enum)
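The stack trace ends in this loss, and the losses have been NaN throughout (see the next section). For reference, a minimal standalone check (my own, not from the ESAM code) confirms that a single NaN logit is enough to make this loss return NaN under the default mean reduction:

```python
# Standalone sanity check (my own, not from the ESAM repo): one NaN logit
# makes binary_cross_entropy_with_logits return NaN for the whole batch
# under the default reduction='mean'.
import torch
import torch.nn.functional as F

logits = torch.randn(4, 8)
targets = torch.randint(0, 2, (4, 8)).float()
print(F.binary_cross_entropy_with_logits(logits, targets))  # finite value

logits[0, 0] = float('nan')
print(F.binary_cross_entropy_with_logits(logits, targets))  # tensor(nan)
```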
NaN Value Issues
NaN values appeared from the first epoch in almost every iteration:
05/15 13:42:18 - mmengine - INFO - Epoch(train) [1][50/320] lr: 1.0000e-04 eta: 5:12:28 time: 0.6014 data_time: 0.0326 memory: 4838 loss: nan inst_loss: nan
05/15 13:42:49 - mmengine - INFO - Epoch(train) [1][100/320] lr: 1.0000e-04 eta: 5:05:17 time: 0.6110 data_time: 0.0331 memory: 4838 loss: nan inst_loss: nan
05/15 13:43:20 - mmengine - INFO - Epoch(train) [1][150/320] lr: 1.0000e-04 eta: 5:01:29 time: 0.6224 data_time: 0.0340 memory: 4838 loss: nan inst_loss: nan
05/15 13:43:51 - mmengine - INFO - Epoch(train) [1][200/320] lr: 1.0000e-04 eta: 4:58:42 time: 0.6191 data_time: 0.0333 memory: 4838 loss: nan inst_loss: nan
...
These NaN values persisted from the beginning until the memory overflow occurred.
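To narrow down where the NaNs first appear, I plan to enable PyTorch's autograd anomaly detection. The call itself is standard PyTorch; placing it in tools/train.py before runner.train() is my own assumption:

```python
# Debugging sketch (my own addition, not part of the repo): make autograd
# raise at the first backward op that produces NaN/Inf instead of letting
# NaN losses propagate silently. I would put this near the top of
# tools/train.py, before runner.train() is called.
import torch

torch.autograd.set_detect_anomaly(True)  # note: slows training down considerably
```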
Model Loading Mismatch
The initial training log reported a mismatch between the model and the loaded checkpoint (a small inspection sketch follows the key lists below):
The model and loaded state dict do not match exactly
unexpected key in source state_dict:
- mask_features_head.kernel_weight
- mask_features_head.kernel_bias
- mask_features_head.decoder.0.weight
- mask_features_head.decoder.0.bias
- mask_features_head.decoder.2.weight
- mask_features_head.decoder.2.bias
- mask_features_head.mask_features_head.weight
- mask_features_head.mask_features_head.bias
- ...
missing keys in source state_dict:
- pool.pts_proj1.0.weight
- pool.pts_proj1.1.weight
- pool.pts_proj1.1.bias
- pool.pts_proj1.1.running_mean
- pool.pts_proj1.1.running_var
- pool.pts_proj1.1.num_batches_tracked
- ...
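To see which naming scheme the checkpoint actually uses, I am inspecting its keys with the short script below; the checkpoint path is a placeholder for wherever the Mask3D weights live on my machine:

```python
# Inspection sketch (my own script, placeholder path): list the keys stored
# in the pretrained checkpoint so they can be compared against the keys the
# ESAM model expects (model.state_dict().keys() once the model is built).
import torch

ckpt = torch.load('path/to/mask3d_checkpoint.pth', map_location='cpu')
state = ckpt.get('state_dict', ckpt)  # some checkpoints nest weights under 'state_dict'

print('checkpoint keys matching the "unexpected" prefix:')
print([k for k in state if k.startswith('mask_features_head')][:10])

print('checkpoint keys matching the "missing" prefix (expected to be empty):')
print([k for k in state if k.startswith('pool.pts_proj1')][:10])
```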
Training Configuration
I used the default configuration file configs/ESAM_CA/ESAM_sv_scannet200_CA.py with the following main parameters (a sketch for overriding them without editing the file follows this list):
- batch_size: 8
- learning rate: 0.0001
- total epochs: 128
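If the batch size does need to be lowered, I assume it can be overridden without touching the repo's config file, for example via mmengine's Config API. The train_dataloader.batch_size key is my reading of the standard mmengine/mmdet3d config layout, so please correct me if ESAM names it differently:

```python
# Sketch (my assumption, not tested against this repo): lower the batch size
# by loading the config with mmengine and dumping a modified copy, instead of
# editing configs/ESAM_CA/ESAM_sv_scannet200_CA.py in place.
from mmengine.config import Config

cfg = Config.fromfile('configs/ESAM_CA/ESAM_sv_scannet200_CA.py')
print('current batch size:', cfg.train_dataloader.batch_size)

cfg.train_dataloader.batch_size = 4  # half of the default 8
cfg.dump('work_dirs/ESAM_sv_scannet200_CA_bs4.py')
```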
GPU and CUDA Toolkit Information
The GPU is a 24 GB card (GPU 0 in the error above). Output from nvcc --version:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Fri_Dec_17_18:16:03_PST_2021
Cuda compilation tools, release 11.6, V11.6.55
Build cuda_11.6.r11.6/compiler.30794723_0
Attempted Solutions
- Ensured sufficient GPU memory (my GPU has 24576MiB total capacity)
- Verified correct download and placement of Mask3D checkpoints
- Tried clearing the CUDA cache, but the issue persisted (the exact calls are sketched after this list)
- Confirmed all code is unmodified from the original repository
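For completeness, this is roughly what I ran between attempts to clear the cache and check how much memory PyTorch itself was holding (all standard PyTorch utilities):

```python
# Memory-diagnostic sketch (standard PyTorch utilities, nothing ESAM-specific):
# clear the caching allocator and print how much memory is allocated/reserved
# on GPU 0, plus the detailed allocator summary.
import torch

torch.cuda.empty_cache()  # return cached, unused blocks to the driver

print(torch.cuda.memory_allocated(0) / 2**20, 'MiB allocated')
print(torch.cuda.memory_reserved(0) / 2**20, 'MiB reserved')
print(torch.cuda.memory_summary(device=0, abbreviated=True))
```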
Questions
- Could the NaN values be related to the model loading mismatch?
- Is it necessary to modify the batch size or other training parameters?
- How should the model and loaded state dictionary mismatch warnings be addressed?
- Should I switch to using a free GPU (GPU 1 or GPU 6) by changing the CUDA_VISIBLE_DEVICES value?
Thank you for your assistance!