
Densenet121: Training with batch size 256 will encounter CUDA OOM on CI's Tesla T4 #598

Closed
aaronenyeshi opened this issue Nov 30, 2021 · 2 comments


aaronenyeshi commented Nov 30, 2021

In the original paper, densenet121 is trained on the ImageNet dataset with a batch size of 256, on TitanX GPUs with 12GB of device memory. However, CI's Tesla T4 runs out of memory despite having 16GB of device memory.
Paper: https://arxiv.org/pdf/1608.06993.pdf

In the original repo, DenseNet has an -optMemory 4 option which significantly reduces the memory footprint and makes training on a TitanX possible. TorchVision's version of densenet121 has a memory_efficient option, but it still fails to run on the T4 (16GB memory).

Repo: https://github.com/liuzhuang13/DenseNet#memory-efficient-implementation-newly-added-feature-on-june-6-2017
Densenet121 Torchvision: https://github.com/pytorch/vision/blob/d367a01a18a3ae6bee13d8be3b63fd6a581ea46f/torchvision/models/densenet.py#L162
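
For concreteness, here is a minimal sketch (not the torchbench test itself) of the failing configuration: a single forward/backward pass over a batch of 256 ImageNet-sized inputs, with memory_efficient=True enabling torchvision's checkpointed implementation. This assumes a torchvision version where densenet121 forwards memory_efficient to the DenseNet constructor, as in the linked commit.

```python
# Minimal repro sketch of the failing configuration (not the torchbench test).
# memory_efficient=True enables torchvision's checkpointed DenseNet layers.
import torch
import torchvision.models as models

model = models.densenet121(memory_efficient=True).cuda()
criterion = torch.nn.CrossEntropyLoss()

# Batch of 256 ImageNet-sized inputs, matching the paper's setting.
images = torch.randn(256, 3, 224, 224, device="cuda")
targets = torch.randint(0, 1000, (256,), device="cuda")

loss = criterion(model(images), targets)
loss.backward()  # raises CUDA out-of-memory on a 16GB T4
```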

Densenet121 train/example on T4 fails with CUDA OOM:

```
(py37) aaronshi@pytorch-benchmark-dev-0:~/data/as-use/benchmark$ python test.py -k densenet121 --verbose
test_densenet121_check_device_cpu (__main__.TestBenchmark) ... ok
test_densenet121_check_device_cuda (__main__.TestBenchmark) ... ok
test_densenet121_eval_cpu (__main__.TestBenchmark) ... ok
test_densenet121_eval_cuda (__main__.TestBenchmark) ... ok
test_densenet121_example_cpu (__main__.TestBenchmark) ... ok
test_densenet121_example_cuda (__main__.TestBenchmark) ... ERROR
test_densenet121_train_cpu (__main__.TestBenchmark) ... ok
test_densenet121_train_cuda (__main__.TestBenchmark) ... ERROR
```

Torchvision authors may want to implement the -optMemory 4 optimization to allow training on a single device. A rough sketch of the general idea follows.
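
For illustration, one way to trade compute for memory in PyTorch is activation checkpointing via torch.utils.checkpoint, recomputing intermediate activations during backward instead of storing them. Note this is only a sketch of the general trade-off; -optMemory 4 in the original Torch code relies on pre-allocated shared buffers, which has no direct PyTorch equivalent.

```python
# Sketch only: segment-wise activation checkpointing over the DenseNet feature
# extractor. Activations inside each segment are recomputed during backward,
# cutting peak memory at the cost of extra forward compute. This is NOT the
# buffer-sharing scheme of -optMemory 4; it just illustrates the trade-off.
import torch
import torch.nn.functional as F
from torch.utils.checkpoint import checkpoint_sequential
from torchvision.models import densenet121

model = densenet121().cuda()
images = torch.randn(256, 3, 224, 224, device="cuda", requires_grad=True)

# model.features is an nn.Sequential, so it can be checkpointed in segments.
features = checkpoint_sequential(model.features, 4, images)

# Replicate DenseNet.forward's classifier head on the checkpointed features.
out = F.relu(features)
out = F.adaptive_avg_pool2d(out, (1, 1))
logits = model.classifier(torch.flatten(out, 1))
logits.sum().backward()
```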

aaronenyeshi commented:

Hi @fmassa, author of the memory-efficient densenet121 in torchvision. Do you know if anyone would be interested in taking a look at implementing this -optMemory 4 feature in torchvision's densenet121?

For reference, Memory Efficient PR: pytorch/vision#1003

facebook-github-bot pushed a commit that referenced this issue Mar 10, 2022
…city (#781)

Summary:
When a test is flagged as "NotImplemented", there are actually two cases:
1. The test itself doesn't implement or handle the configs, e.g., unsupervised-learning models like pytorch_struct don't have `eval()` tests, and the pyhpc models don't have `train()` tests.
2. The test doesn't support running on our T4 CI GPU machine, but it runs fine on other GPUs, such as `V100` or `A100`.

This PR eliminates the second case so that the test can still run through the `run.py` or `run_sweep.py` interfaces. Instead, we flag the test as `not_implemented` in the `metadata.yaml`, and the CI scripts `test.py` and `test_bench.py` will read the metadata and determine which tests are not suitable to run on the CI machine.

This fixes #688, #626, and #598

Pull Request resolved: #781

Reviewed By: aaronenyeshi

Differential Revision: D34786277

Pulled By: xuzhao9

fbshipit-source-id: d5d3d884839345f4fcad21ccf541a02d8e705f5f
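
To make the mechanism in the commit above concrete, here is a hypothetical sketch of the flow it describes: the CI test reads the model's metadata.yaml and skips configurations flagged as not_implemented. The file layout and key names are illustrative guesses, not torchbench's exact schema.

```python
# Hypothetical sketch of the skip flow described above; the metadata.yaml
# schema and paths here are illustrative, not torchbench's actual layout.
import unittest
import yaml

def flagged_not_implemented(model_dir, test, device):
    """Return True if metadata.yaml marks this (test, device) as not implemented."""
    with open(f"{model_dir}/metadata.yaml") as f:
        meta = yaml.safe_load(f) or {}
    return any(e.get("test") == test and e.get("device") == device
               for e in meta.get("not_implemented", []))

class TestBenchmark(unittest.TestCase):
    @unittest.skipIf(
        flagged_not_implemented("torchbenchmark/models/densenet121", "train", "cuda"),
        "not_implemented for this device per metadata.yaml")
    def test_densenet121_train_cuda(self):
        ...  # run the model's train() on CUDA
```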

xuzhao9 commented Mar 10, 2022

Fixed by #781

xuzhao9 closed this as completed Mar 10, 2022