
How do you measure train memory? #2

Closed
merrymercy opened this issue Jan 18, 2022 · 2 comments
Labels
question Further information is requested

Comments

merrymercy commented Jan 18, 2022

Thanks for your great work! I have a question about the memory usage.

How do you get the "train memory" in this table https://github.com/zhuang-group/Mesa#results-on-imagenet?
Is it the total memory required for training? Do you get it analytically or empirically using torch.cuda.memory_allocated?

If I understand correctly, Mesa compresses fp16 activations to int8, so it can reduce the activation memory by at most 2x.
The total memory also includes other components, so Mesa should not be able to reduce the total memory by 2x.
However, in your results, Mesa reduces memory by more than 2x for some models. How is this possible?
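
To make my reasoning concrete, here is a rough back-of-the-envelope with made-up numbers (illustrative only, not measured):

```python
# Illustrative arithmetic only (hypothetical sizes, not real measurements):
# if activations take A MB in fp16 and everything else (weights, gradients,
# optimizer states, workspaces) takes W MB, then halving only the activations
# should give a total reduction strictly below 2x.
A, W = 3000, 1000                  # hypothetical MB
total_before = A + W               # 4000 MB
total_after = A / 2 + W            # 2500 MB
print(total_before / total_after)  # 1.6 < 2.0
```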

HubHop added the question label on Jan 18, 2022

HubHop commented Jan 18, 2022

Hi @merrymercy, thanks for your interest!

This is a very good question. Let me explain here.

In our experiments, we measure the training memory usage with torch.cuda.max_memory_allocated(). As a concrete example, you may refer to our provided project for training DeiT here.
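
As a minimal sketch (not our actual training script, just an assumed standalone example), the peak counter can be queried around an AMP training step like this:

```python
import torch
import torch.nn.functional as F

# Minimal sketch: measure peak GPU memory for one AMP training step.
# The model, batch size and shapes are placeholders, not the DeiT setup.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(128, 1024, device="cuda")
target = torch.randn(128, 1024, device="cuda")

torch.cuda.reset_peak_memory_stats()        # start from a clean peak counter
with torch.cuda.amp.autocast():             # PyTorch AMP, as in our default setting
    loss = F.mse_loss(model(x), target)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
optimizer.zero_grad()

peak_mb = torch.cuda.max_memory_allocated() / (1024 ** 2)
print(f"max mem: {peak_mb:.0f} MB")         # same counter behind the "max mem" field below
```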

With the default setting (PyTorch AMP, batch size 128, and image resolution 224 on ImageNet, on a single 3090 GPU) and Mesa disabled, you will see a log like this:

Epoch: [0]  [   10/10008]  eta: 0:58:21  lr: 0.000001  loss: 6.9468 (6.9380)  time: 0.3503  data: 0.1429  max mem: 4171

With the same setting but Mesa enabled, you will see a clear memory reduction:

Epoch: [0]  [   10/10008]  eta: 1:16:11  lr: 0.000001  loss: 6.9406 (6.9372)  time: 0.4572  data: 0.1298  max mem: 1859

As you can see, the maximum allocated memory is reduced by more than 2x. The reason is that PyTorch AMP performs mixed-precision training, which means not all activations are stored in torch.float16. If you print the data type of the input here, you will find that the inputs at some layers are actually torch.float32. Therefore, default AMP training stores a mix of FP16 and FP32 activations, whereas Mesa saves only int8 activations, which is why we can achieve more than a 2x memory reduction in practice.
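
You can check this with a tiny standalone snippet (an assumed example, not part of Mesa): under autocast, matmul-heavy layers such as nn.Linear produce float16 activations, while ops like LayerNorm run in float32:

```python
import torch

# Assumed minimal example: inspect activation dtypes under PyTorch AMP autocast.
linear = torch.nn.Linear(512, 512).cuda()
norm = torch.nn.LayerNorm(512).cuda()
x = torch.randn(8, 512, device="cuda")

with torch.cuda.amp.autocast():
    y = linear(x)
    z = norm(y)

print(y.dtype)  # torch.float16 -> Linear is autocast to half precision
print(z.dtype)  # torch.float32 -> LayerNorm stays in float32 under autocast
```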

Cheers,
Zizheng

merrymercy (Author) commented

Makes sense. Thanks for the explanation.
