
How do you measure train memory? #2

Closed
merrymercy opened this issue Jan 18, 2022 · 2 comments
Labels
question Further information is requested

Comments

merrymercy commented Jan 18, 2022

Thanks for your great work! I have a question about the memory usage.

How do you get the "train memory" in this table https://github.com/zhuang-group/Mesa#results-on-imagenet?
Is it the total memory required for training? Do you get it analytically or empirically using torch.cuda.memory_allocated?

If I understand correctly, Mesa compresses fp16 activations to int8, so it can reduce the activation memory by at most 2x.
The total memory also includes other components, so Mesa should not be able to reduce the total memory by 2x.
However, in your results, Mesa reduces memory by more than 2x for some models. How is this possible?
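
To make my reasoning concrete, here is a rough back-of-the-envelope with made-up numbers (illustrative only, not measured):

```python
# Illustrative arithmetic only (hypothetical sizes, not real measurements):
# if activations take A MB in fp16 and everything else (weights, gradients,
# optimizer states, workspaces) takes W MB, then halving only the activations
# should give a total reduction strictly below 2x.
A, W = 3000, 1000                  # hypothetical MB
total_before = A + W               # 4000 MB
total_after = A / 2 + W            # 2500 MB
print(total_before / total_after)  # 1.6 < 2.0
```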

HubHop added the question label on Jan 18, 2022

HubHop commented Jan 18, 2022

Hi @merrymercy, thanks for your interest!

This is a very good question. Let me explain here.

In our experiments, we measure the training memory usage with torch.cuda.max_memory_allocated(). As a concrete example, you may refer to our provided project for training DeiT here.
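
As a minimal sketch (not our actual training script, just an assumed standalone example), the peak counter can be queried around an AMP training step like this:

```python
import torch
import torch.nn.functional as F

# Minimal sketch: measure peak GPU memory for one AMP training step.
# The model, batch size and shapes are placeholders, not the DeiT setup.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(128, 1024, device="cuda")
target = torch.randn(128, 1024, device="cuda")

torch.cuda.reset_peak_memory_stats()        # start from a clean peak counter
with torch.cuda.amp.autocast():             # PyTorch AMP, as in our default setting
    loss = F.mse_loss(model(x), target)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
optimizer.zero_grad()

peak_mb = torch.cuda.max_memory_allocated() / (1024 ** 2)
print(f"max mem: {peak_mb:.0f} MB")         # same counter behind the "max mem" field below
```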

With the default setting (PyTorch AMP, batch size 128, and image resolution 224 on ImageNet, on a single 3090 GPU) and Mesa disabled, you will see a log like this:

Epoch: [0]  [   10/10008]  eta: 0:58:21  lr: 0.000001  loss: 6.9468 (6.9380)  time: 0.3503  data: 0.1429  max mem: 4171

With the same setting but Mesa enabled, you will see a clear memory reduction:

Epoch: [0]  [   10/10008]  eta: 1:16:11  lr: 0.000001  loss: 6.9406 (6.9372)  time: 0.4572  data: 0.1298  max mem: 1859

As you can see, the maximum allocated memory is reduced by more than 2x. The reason is that PyTorch AMP performs mixed-precision training, which means not all activations are stored in torch.float16. If you print the data type of the input here, you will find that the inputs at some layers are actually torch.float32. Therefore, default AMP training stores a mix of FP16 and FP32 activations, whereas Mesa saves only int8 activations, which is why we can achieve more than a 2x memory reduction in practice.
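
You can check this with a tiny standalone snippet (an assumed example, not part of Mesa): under autocast, matmul-heavy layers such as nn.Linear produce float16 activations, while ops like LayerNorm run in float32:

```python
import torch

# Assumed minimal example: inspect activation dtypes under PyTorch AMP autocast.
linear = torch.nn.Linear(512, 512).cuda()
norm = torch.nn.LayerNorm(512).cuda()
x = torch.randn(8, 512, device="cuda")

with torch.cuda.amp.autocast():
    y = linear(x)
    z = norm(y)

print(y.dtype)  # torch.float16 -> Linear is autocast to half precision
print(z.dtype)  # torch.float32 -> LayerNorm stays in float32 under autocast
```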

Cheers,
Zizheng

merrymercy (Author) commented

Makes sense. Thanks for the explanation.
