Thanks for your great work! I have a question about the memory usage.
How do you get the "train memory" in this table https://github.com/zhuang-group/Mesa#results-on-imagenet?
Is it the total memory required for training? Do you get it analytically or empirically using torch.cuda.memory_allocated?
If I understand correctly, Mesa compresses fp16 activations to int8, so it can reduce activation memory by at most 2x.
Total training memory also includes other components (weights, gradients, optimizer states), so Mesa should reduce the total memory by less than 2x.
However, in your results, Mesa reduces memory by more than 2x for some models. How is this possible?
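For reference, the byte-count arithmetic behind the 2x bound (using a hypothetical ViT-like activation shape, not a number from the Mesa repo):

```python
# Back-of-the-envelope check: fp16 uses 2 bytes/element, int8 uses 1 byte/element,
# so compressing fp16 activations to int8 saves at most 2x on activation memory.
import torch

act_fp16 = torch.randn(128, 197, 768).half()             # hypothetical ViT activation
act_int8 = torch.zeros_like(act_fp16, dtype=torch.int8)  # same shape, int8

fp16_mb = act_fp16.numel() * act_fp16.element_size() / 1024 ** 2
int8_mb = act_int8.numel() * act_int8.element_size() / 1024 ** 2
print(f"fp16: {fp16_mb:.1f} MB, int8: {int8_mb:.1f} MB, ratio: {fp16_mb / int8_mb:.0f}x")
```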
This is a very good question. Let me explain here.
In our experiments, we measure the training memory usage with torch.cuda.max_memory_allocated(). As a concrete example, you may refer to our provided project for training DeiT here.
With the default setting (PyTorch AMP, batch size 128, and image resolution 224 on ImageNet with a single RTX 3090 GPU) and Mesa disabled, you can see the peak allocated memory in the training log.
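For illustration, here is a minimal sketch (not our actual training script; the toy model, shapes, and loss are placeholders) of how peak training memory can be measured empirically under AMP:

```python
# Minimal sketch: empirically measuring peak training memory under PyTorch AMP.
# The model and tensor shapes here are placeholders, not the DeiT setup.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(1024, 1024), nn.GELU(), nn.Linear(1024, 1024)).cuda()
optimizer = torch.optim.AdamW(model.parameters())
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(128, 1024, device="cuda")
target = torch.randn(128, 1024, device="cuda")

torch.cuda.reset_peak_memory_stats()  # reset the peak counter before the step

with torch.cuda.amp.autocast():       # mixed-precision forward pass
    loss = nn.functional.mse_loss(model(x), target)

scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()

peak_mb = torch.cuda.max_memory_allocated() / 1024 ** 2
print(f"peak allocated memory: {peak_mb:.1f} MB")
```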
As the log shows, the maximum allocated memory is actually reduced by more than 2x. The reason is that PyTorch AMP performs mixed-precision training, which means not all activations are torch.float16. If you print the data type of the input here, you will find that the inputs at some layers are actually torch.float32. Therefore, default AMP training stores a mix of FP16 and FP32 activations, while with Mesa we only save int8 activations, which is why we can achieve more than 2x memory reduction in practice.
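To see the dtype mixing concretely, here is a small sketch (a toy model, not DeiT) that registers forward hooks and prints each layer's input/output dtypes under autocast; matmul-heavy ops like Linear are cast to float16, while ops like LayerNorm run in float32:

```python
# Sketch: under torch.cuda.amp.autocast, not every activation is float16.
# Linear layers are autocast to float16, while LayerNorm runs in float32,
# so AMP training ends up storing a mix of FP16 and FP32 activations.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 512), nn.LayerNorm(512), nn.Linear(512, 512)).cuda()

def report_dtypes(module, inputs, output):
    print(f"{module.__class__.__name__:<10} in: {inputs[0].dtype}  out: {output.dtype}")

for layer in model:
    layer.register_forward_hook(report_dtypes)

x = torch.randn(8, 512, device="cuda")
with torch.cuda.amp.autocast():
    model(x)

# Expected output (may vary slightly by PyTorch version):
#   Linear     in: torch.float32  out: torch.float16
#   LayerNorm  in: torch.float16  out: torch.float32
#   Linear     in: torch.float32  out: torch.float16
```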