
CUDA Out of Memory Error During ESAM Training #46

@Xin200203

Description


Issue Description

I encountered a CUDA out-of-memory error while training the ESAM model. I strictly followed the installation and dataset preparation instructions and did not modify any code. From the start of training the losses were NaN in almost every iteration, and after roughly 4 epochs the run crashed with an out-of-memory error.

Environment Information

I have completed all dependency installations as specified in installation.md:

  • Python 3.8
  • PyTorch 1.13.1
  • CUDA 11.6
  • mmengine 0.10.3
  • mmdet3d 1.4.0
  • mmcv 2.0.0
  • mmdet 3.2.0
  • MinkowskiEngine 0.5.4

Dataset Preparation

I have completed the ScanNet200 dataset preparation according to dataset_preparation.md, including all ScanNet200-SV related processing.

Reproduction Steps

I executed the following command to train the ESAM model without modifying any code:

CUDA_VISIBLE_DEVICES=0 python tools/train.py configs/ESAM_CA/ESAM_sv_scannet200_CA.py --work-dir work_dirs/ESAM_sv_scannet200_CA/

Error Information

CUDA Out of Memory

After running approximately 4 epochs, the following CUDA memory error occurred:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 MiB 
(GPU 0; 23.70 GiB total capacity; 4.84 GiB already allocated; 3.56 MiB free; 
4.94 GiB reserved in total by PyTorch)

The error stack trace shows the issue occurred at:

File "tools/train.py", line 95, in <module>
  runner.train()
File ".../mmengine/runner/runner.py", line 1238, in train
  self.train_loop.run()
...
File ".../torch/nn/functional.py", line 3136, in binary_cross_entropy_with_logits
  return torch.binary_cross_entropy_with_logits(input, target, weight, pos_weight, reduction_enum)
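If a fuller memory breakdown would help, this is the kind of probe I can add around the training step (plain PyTorch calls; where exactly to hook it inside the ESAM code is just my guess):

import torch

def log_gpu_memory(tag: str) -> None:
    # Short summary of the current GPU memory state for this process.
    allocated = torch.cuda.memory_allocated() / 1024 ** 2
    reserved = torch.cuda.memory_reserved() / 1024 ** 2
    print(f"[{tag}] allocated: {allocated:.1f} MiB | reserved: {reserved:.1f} MiB")
    # Uncomment for the full per-block report:
    # print(torch.cuda.memory_summary())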

NaN Value Issues

NaN values appeared from the first epoch in almost every iteration:

05/15 13:42:18 - mmengine - INFO - Epoch(train) [1][50/320]  lr: 1.0000e-04  eta: 5:12:28  time: 0.6014  data_time: 0.0326  memory: 4838  loss: nan  inst_loss: nan
05/15 13:42:49 - mmengine - INFO - Epoch(train) [1][100/320]  lr: 1.0000e-04  eta: 5:05:17  time: 0.6110  data_time: 0.0331  memory: 4838  loss: nan  inst_loss: nan
05/15 13:43:20 - mmengine - INFO - Epoch(train) [1][150/320]  lr: 1.0000e-04  eta: 5:01:29  time: 0.6224  data_time: 0.0340  memory: 4838  loss: nan  inst_loss: nan
05/15 13:43:51 - mmengine - INFO - Epoch(train) [1][200/320]  lr: 1.0000e-04  eta: 4:58:42  time: 0.6191  data_time: 0.0333  memory: 4838  loss: nan  inst_loss: nan
...

These NaN values persisted from the beginning until the memory overflow occurred.
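To narrow down where the NaNs first appear, I plan to enable PyTorch's anomaly detection and check the individual loss terms before backward(); this is standard PyTorch and nothing ESAM-specific, so please tell me if there is a better place to hook it in:

import torch

# Make autograd raise at the first backward op that produces NaN/Inf
torch.autograd.set_detect_anomaly(True)

def check_losses(losses: dict) -> None:
    # Fail fast if any individual loss term is already non-finite before backward().
    for name, value in losses.items():
        if torch.is_tensor(value) and not torch.isfinite(value).all():
            raise RuntimeError(f"non-finite loss term: {name} = {value}")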

Model Loading Mismatch

The initial training log showed model loading mismatches:

The model and loaded state dict do not match exactly

unexpected key in source state_dict: 
  - mask_features_head.kernel_weight
  - mask_features_head.kernel_bias
  - mask_features_head.decoder.0.weight
  - mask_features_head.decoder.0.bias
  - mask_features_head.decoder.2.weight
  - mask_features_head.decoder.2.bias
  - mask_features_head.mask_features_head.weight
  - mask_features_head.mask_features_head.bias
  - ...

missing keys in source state_dict: 
  - pool.pts_proj1.0.weight
  - pool.pts_proj1.1.weight
  - pool.pts_proj1.1.bias
  - pool.pts_proj1.1.running_mean
  - pool.pts_proj1.1.running_var
  - pool.pts_proj1.1.num_batches_tracked
  - ...
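If it helps, I can dump the key lists on both sides to see which of these the checkpoint actually contains; a minimal sketch (the checkpoint path below is a placeholder for whatever the config points at):

import torch

ckpt = torch.load("path/to/mask3d_checkpoint.pth", map_location="cpu")
state_dict = ckpt.get("state_dict", ckpt)

print("checkpoint keys under mask_features_head:")
print([k for k in state_dict if k.startswith("mask_features_head")])

print("checkpoint keys under pool.pts_proj1:")
print([k for k in state_dict if k.startswith("pool.pts_proj1")])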

Training Configuration

I used the default configuration file configs/ESAM_CA/ESAM_sv_scannet200_CA.py with the following main parameters:

  • batch_size: 8
  • learning rate: 0.0001
  • total epochs: 128
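If reducing memory pressure or stabilizing training turns out to be necessary, my plan is to override these in a small local config rather than editing the original. A sketch in mmengine's Python config style, assuming the usual train_dataloader / optim_wrapper layout of mmdet3d configs (I have not verified the exact key names against this file):

# Hypothetical local override, e.g. configs/ESAM_CA/ESAM_sv_scannet200_CA_debug.py
_base_ = ['./ESAM_sv_scannet200_CA.py']

# Halve the batch size to lower peak GPU memory
train_dataloader = dict(batch_size=4)

# Clip gradients in case the NaNs come from exploding gradients
optim_wrapper = dict(clip_grad=dict(max_norm=10, norm_type=2))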

CUDA Toolkit Information

Output from nvcc --version:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Fri_Dec_17_18:16:03_PST_2021
Cuda compilation tools, release 11.6, V11.6.55
Build cuda_11.6.r11.6/compiler.30794723_0

Attempted Solutions

  1. Ensured sufficient GPU memory (my GPU has 24576 MiB / 24 GiB total capacity)
  2. Verified correct download and placement of the Mask3D checkpoints
  3. Tried clearing the CUDA cache between runs (see the snippet after this list), but the issue persisted
  4. Confirmed all code is unmodified from the original repository
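For reference, this is all I did to clear the cache (it only releases PyTorch's own cached blocks, so it would not help if other processes hold the memory):

import torch

# Release cached-but-unused blocks held by PyTorch back to the driver
torch.cuda.empty_cache()
torch.cuda.ipc_collect()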

Questions

  1. Could the NaN values be related to the model loading mismatch?
  2. Is it necessary to modify the batch size or other training parameters?
  3. How should the model and loaded state dictionary mismatch warnings be addressed?
  4. Should I switch to a free GPU (GPU 1 or GPU 6) by changing CUDA_VISIBLE_DEVICES, given that the error report suggests only about 5 GiB of GPU 0 was actually available to my process?

Thank you for your assistance!
