
CUDA Out of Memory Error During ESAM Training #46

@Xin200203

Description


Issue Description

I encountered a CUDA out-of-memory error while training the ESAM model. I strictly followed the installation and dataset preparation instructions and did not modify any code. From the start of training the losses were NaN in almost every iteration, and after roughly 4 epochs the run crashed with an out-of-memory error.

Environment Information

I have completed all dependency installations as specified in installation.md:

  • Python 3.8
  • PyTorch 1.13.1
  • CUDA 11.6
  • mmengine 0.10.3
  • mmdet3d 1.4.0
  • mmcv 2.0.0
  • mmdet 3.2.0
  • MinkowskiEngine 0.5.4

Dataset Preparation

I have completed the ScanNet200 dataset preparation according to dataset_preparation.md, including all ScanNet200-SV related processing.

Reproduction Steps

I executed the following command to train the ESAM model without modifying any code:

CUDA_VISIBLE_DEVICES=0 python tools/train.py configs/ESAM_CA/ESAM_sv_scannet200_CA.py --work-dir work_dirs/ESAM_sv_scannet200_CA/

Error Information

CUDA Out of Memory

After running approximately 4 epochs, the following CUDA memory error occurred:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 MiB 
(GPU 0; 23.70 GiB total capacity; 4.84 GiB already allocated; 3.56 MiB free; 
4.94 GiB reserved in total by PyTorch)

The error stack trace shows the issue occurred at:

File "tools/train.py", line 95, in <module>
  runner.train()
File ".../mmengine/runner/runner.py", line 1238, in train
  self.train_loop.run()
...
File ".../torch/nn/functional.py", line 3136, in binary_cross_entropy_with_logits
  return torch.binary_cross_entropy_with_logits(input, target, weight, pos_weight, reduction_enum)
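If a fuller memory breakdown would help, this is the kind of probe I can add around the training step (plain PyTorch calls; where exactly to hook it inside the ESAM code is just my guess):

import torch

def log_gpu_memory(tag: str) -> None:
    # Short summary of the current GPU memory state for this process.
    allocated = torch.cuda.memory_allocated() / 1024 ** 2
    reserved = torch.cuda.memory_reserved() / 1024 ** 2
    print(f"[{tag}] allocated: {allocated:.1f} MiB | reserved: {reserved:.1f} MiB")
    # Uncomment for the full per-block report:
    # print(torch.cuda.memory_summary())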

NaN Value Issues

NaN values appeared from the first epoch in almost every iteration:

05/15 13:42:18 - mmengine - INFO - Epoch(train) [1][50/320]  lr: 1.0000e-04  eta: 5:12:28  time: 0.6014  data_time: 0.0326  memory: 4838  loss: nan  inst_loss: nan
05/15 13:42:49 - mmengine - INFO - Epoch(train) [1][100/320]  lr: 1.0000e-04  eta: 5:05:17  time: 0.6110  data_time: 0.0331  memory: 4838  loss: nan  inst_loss: nan
05/15 13:43:20 - mmengine - INFO - Epoch(train) [1][150/320]  lr: 1.0000e-04  eta: 5:01:29  time: 0.6224  data_time: 0.0340  memory: 4838  loss: nan  inst_loss: nan
05/15 13:43:51 - mmengine - INFO - Epoch(train) [1][200/320]  lr: 1.0000e-04  eta: 4:58:42  time: 0.6191  data_time: 0.0333  memory: 4838  loss: nan  inst_loss: nan
...

These NaN values persisted from the beginning until the memory overflow occurred.
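To narrow down where the NaNs first appear, I plan to enable PyTorch's anomaly detection and check the individual loss terms before backward(); this is standard PyTorch and nothing ESAM-specific, so please tell me if there is a better place to hook it in:

import torch

# Make autograd raise at the first backward op that produces NaN/Inf
torch.autograd.set_detect_anomaly(True)

def check_losses(losses: dict) -> None:
    # Fail fast if any individual loss term is already non-finite before backward().
    for name, value in losses.items():
        if torch.is_tensor(value) and not torch.isfinite(value).all():
            raise RuntimeError(f"non-finite loss term: {name} = {value}")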

Model Loading Mismatch

The initial training log showed model loading mismatches:

The model and loaded state dict do not match exactly

unexpected key in source state_dict: 
  - mask_features_head.kernel_weight
  - mask_features_head.kernel_bias
  - mask_features_head.decoder.0.weight
  - mask_features_head.decoder.0.bias
  - mask_features_head.decoder.2.weight
  - mask_features_head.decoder.2.bias
  - mask_features_head.mask_features_head.weight
  - mask_features_head.mask_features_head.bias
  - ...

missing keys in source state_dict: 
  - pool.pts_proj1.0.weight
  - pool.pts_proj1.1.weight
  - pool.pts_proj1.1.bias
  - pool.pts_proj1.1.running_mean
  - pool.pts_proj1.1.running_var
  - pool.pts_proj1.1.num_batches_tracked
  - ...
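If it helps, I can dump the key lists on both sides to see which of these the checkpoint actually contains; a minimal sketch (the checkpoint path below is a placeholder for whatever the config points at):

import torch

ckpt = torch.load("path/to/mask3d_checkpoint.pth", map_location="cpu")
state_dict = ckpt.get("state_dict", ckpt)

print("checkpoint keys under mask_features_head:")
print([k for k in state_dict if k.startswith("mask_features_head")])

print("checkpoint keys under pool.pts_proj1:")
print([k for k in state_dict if k.startswith("pool.pts_proj1")])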

Training Configuration

I used the default configuration file configs/ESAM_CA/ESAM_sv_scannet200_CA.py with the following main parameters:

  • batch_size: 8
  • learning rate: 0.0001
  • total epochs: 128
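If reducing memory pressure or stabilizing training turns out to be necessary, my plan is to override these in a small local config rather than editing the original. A sketch in mmengine's Python config style, assuming the usual train_dataloader / optim_wrapper layout of mmdet3d configs (I have not verified the exact key names against this file):

# Hypothetical local override, e.g. configs/ESAM_CA/ESAM_sv_scannet200_CA_debug.py
_base_ = ['./ESAM_sv_scannet200_CA.py']

# Halve the batch size to lower peak GPU memory
train_dataloader = dict(batch_size=4)

# Clip gradients in case the NaNs come from exploding gradients
optim_wrapper = dict(clip_grad=dict(max_norm=10, norm_type=2))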

CUDA Toolkit Information

Output from nvcc --version:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Fri_Dec_17_18:16:03_PST_2021
Cuda compilation tools, release 11.6, V11.6.55
Build cuda_11.6.r11.6/compiler.30794723_0

Attempted Solutions

  1. Ensured sufficient GPU memory (my GPU has 24576 MiB / 24 GiB total capacity)
  2. Verified correct download and placement of the Mask3D checkpoints
  3. Tried clearing the CUDA cache between runs (see the snippet after this list), but the issue persisted
  4. Confirmed all code is unmodified from the original repository
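For reference, this is all I did to clear the cache (it only releases PyTorch's own cached blocks, so it would not help if other processes hold the memory):

import torch

# Release cached-but-unused blocks held by PyTorch back to the driver
torch.cuda.empty_cache()
torch.cuda.ipc_collect()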

Questions

  1. Could the NaN values be related to the model loading mismatch?
  2. Is it necessary to modify the batch size or other training parameters?
  3. How should the model and loaded state dictionary mismatch warnings be addressed?
  4. Should I switch to a free GPU (GPU 1 or GPU 6) by changing CUDA_VISIBLE_DEVICES, given that the error report suggests only about 5 GiB of GPU 0 was actually available to my process?

Thank you for your assistance!
