
Densenet121: Training with batch size 256 will encounter CUDA OOM on CI's Tesla T4 #598

Closed
aaronenyeshi opened this issue Nov 30, 2021 · 2 comments


aaronenyeshi commented Nov 30, 2021

In the original paper, densenet121 is trained on the ImageNet dataset with a batch size of 256, on TitanX GPUs with 12GB of device memory. However, CI's Tesla T4 runs out of memory despite having 16GB of device memory.
Paper: https://arxiv.org/pdf/1608.06993.pdf

In the original repo, DenseNet has an -optMemory 4 option which significantly reduces the memory footprint and makes training on a TitanX possible. TorchVision's version of densenet121 has a memory_efficient option, but it still fails to run on the T4 (16GB memory).

Repo: https://github.com/liuzhuang13/DenseNet#memory-efficient-implementation-newly-added-feature-on-june-6-2017
Densenet121 Torchvision: https://github.com/pytorch/vision/blob/d367a01a18a3ae6bee13d8be3b63fd6a581ea46f/torchvision/models/densenet.py#L162
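
For concreteness, here is a minimal sketch (not the torchbench test itself) of the failing configuration: a single forward/backward pass over a batch of 256 ImageNet-sized inputs, with memory_efficient=True enabling torchvision's checkpointed implementation. This assumes a torchvision version where densenet121 forwards memory_efficient to the DenseNet constructor, as in the linked commit.

```python
# Minimal repro sketch of the failing configuration (not the torchbench test).
# memory_efficient=True enables torchvision's checkpointed DenseNet layers.
import torch
import torchvision.models as models

model = models.densenet121(memory_efficient=True).cuda()
criterion = torch.nn.CrossEntropyLoss()

# Batch of 256 ImageNet-sized inputs, matching the paper's setting.
images = torch.randn(256, 3, 224, 224, device="cuda")
targets = torch.randint(0, 1000, (256,), device="cuda")

loss = criterion(model(images), targets)
loss.backward()  # raises CUDA out-of-memory on a 16GB T4
```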

Densenet121 train/example on T4 fails with CUDA OOM:

```
(py37) aaronshi@pytorch-benchmark-dev-0:~/data/as-use/benchmark$ python test.py -k densenet121 --verbose
test_densenet121_check_device_cpu (__main__.TestBenchmark) ... ok
test_densenet121_check_device_cuda (__main__.TestBenchmark) ... ok
test_densenet121_eval_cpu (__main__.TestBenchmark) ... ok
test_densenet121_eval_cuda (__main__.TestBenchmark) ... ok
test_densenet121_example_cpu (__main__.TestBenchmark) ... ok
test_densenet121_example_cuda (__main__.TestBenchmark) ... ERROR
test_densenet121_train_cpu (__main__.TestBenchmark) ... ok
test_densenet121_train_cuda (__main__.TestBenchmark) ... ERROR
```

Torchvision authors may want to implement the -optMemory 4 optimization to allow training on a single device. A rough sketch of the general idea follows.
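
For illustration, one way to trade compute for memory in PyTorch is activation checkpointing via torch.utils.checkpoint, recomputing intermediate activations during backward instead of storing them. Note this is only a sketch of the general trade-off; -optMemory 4 in the original Torch code relies on pre-allocated shared buffers, which has no direct PyTorch equivalent.

```python
# Sketch only: segment-wise activation checkpointing over the DenseNet feature
# extractor. Activations inside each segment are recomputed during backward,
# cutting peak memory at the cost of extra forward compute. This is NOT the
# buffer-sharing scheme of -optMemory 4; it just illustrates the trade-off.
import torch
import torch.nn.functional as F
from torch.utils.checkpoint import checkpoint_sequential
from torchvision.models import densenet121

model = densenet121().cuda()
images = torch.randn(256, 3, 224, 224, device="cuda", requires_grad=True)

# model.features is an nn.Sequential, so it can be checkpointed in segments.
features = checkpoint_sequential(model.features, 4, images)

# Replicate DenseNet.forward's classifier head on the checkpointed features.
out = F.relu(features)
out = F.adaptive_avg_pool2d(out, (1, 1))
logits = model.classifier(torch.flatten(out, 1))
logits.sum().backward()
```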

aaronenyeshi commented:

Hi @fmassa, author of the memory-efficient densenet121 in torchvision. Do you know if anyone would be interested in taking a look at implementing this -optMemory 4 feature in torchvision's densenet121?

For reference, Memory Efficient PR: pytorch/vision#1003

facebook-github-bot pushed a commit that referenced this issue Mar 10, 2022
…city (#781)

Summary:
When a test is flagged as "NotImplemented", there are actually two cases:
1. The test itself doesn't implement or handle the configs, e.g., unsupervised-learning models like pytorch_struct don't have `eval()` tests, and the pyhpc models don't have `train()` tests.
2. The test doesn't support running on our T4 CI GPU machine, but it runs fine on other GPUs, such as `V100` or `A100`.

This PR eliminates the second case so that the test can still run through the `run.py` or `run_sweep.py` interfaces. Instead, we flag the test as `not_implemented` in the `metadata.yaml`, and the CI scripts `test.py` and `test_bench.py` will read the metadata and determine which tests are not suitable to run on the CI machine.

This fixes #688, #626, and #598

Pull Request resolved: #781

Reviewed By: aaronenyeshi

Differential Revision: D34786277

Pulled By: xuzhao9

fbshipit-source-id: d5d3d884839345f4fcad21ccf541a02d8e705f5f
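
To make the mechanism in the commit above concrete, here is a hypothetical sketch of the flow it describes: the CI test reads the model's metadata.yaml and skips configurations flagged as not_implemented. The file layout and key names are illustrative guesses, not torchbench's exact schema.

```python
# Hypothetical sketch of the skip flow described above; the metadata.yaml
# schema and paths here are illustrative, not torchbench's actual layout.
import unittest
import yaml

def flagged_not_implemented(model_dir, test, device):
    """Return True if metadata.yaml marks this (test, device) as not implemented."""
    with open(f"{model_dir}/metadata.yaml") as f:
        meta = yaml.safe_load(f) or {}
    return any(e.get("test") == test and e.get("device") == device
               for e in meta.get("not_implemented", []))

class TestBenchmark(unittest.TestCase):
    @unittest.skipIf(
        flagged_not_implemented("torchbenchmark/models/densenet121", "train", "cuda"),
        "not_implemented for this device per metadata.yaml")
    def test_densenet121_train_cuda(self):
        ...  # run the model's train() on CUDA
```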

xuzhao9 commented Mar 10, 2022

Fixed by #781

xuzhao9 closed this as completed Mar 10, 2022