
On small models, the actual GPU memory usage of Mamba2 is much higher than that of Mamba1 #439

Open
AlwaysFHao opened this issue Jul 2, 2024 · 17 comments

AlwaysFHao commented Jul 2, 2024

The parameters of my Mamba2 model are d_state=32, d_conv=4, expand=2, and head_dim=32 (using nn.Conv1d with padding instead of causal_conv1d, so without the d_model / head_dim % 8 == 0 constraint). Mamba1 keeps the same parameters, minus head_dim. Although Mamba2's inference speed has almost doubled relative to Mamba1, its actual memory usage increased from 4.82 GB to 7.55 GB (in my task). Is this due to the base computational cost of Mamba2's semiseparable matrices, which would be a disadvantage in small-scale models? I see in your paper that at larger model scales the actual memory usage of Mamba2 is lower.

tridao commented Jul 2, 2024

nn.Conv1d is probably not great for memory usage. You should try to use causal_conv1d.
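(For context: a causal depthwise 1-D convolution is just a left-padded "valid" convolution, which is what causal_conv1d fuses into a single kernel, whereas the nn.Conv1d fallback pads both sides' worth of positions and then slices off the tail, materializing extra activations. A minimal NumPy sketch of the causal-padding semantics, illustrative only and not the library's code:)

```python
import numpy as np

def causal_conv1d_ref(x, w):
    """Depthwise causal 1-D convolution (illustrative reference).

    x: (dim, seqlen), w: (dim, width). y[:, t] depends only on
    x[:, t - width + 1 : t + 1], so no future information leaks in.
    """
    dim, seqlen = x.shape
    width = w.shape[1]
    x_pad = np.pad(x, ((0, 0), (width - 1, 0)))  # left-pad with zeros only
    y = np.empty_like(x)
    for t in range(seqlen):
        y[:, t] = (x_pad[:, t:t + width] * w).sum(axis=1)
    return y

x = np.arange(6, dtype=float).reshape(1, 6)  # one channel, seqlen 6
w = np.array([[0.0, 0.0, 0.0, 1.0]])         # width 4, only the "current" tap
print(causal_conv1d_ref(x, w))               # reproduces x: a causality check
```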

@AlwaysFHao (Author)

> nn.Conv1d is probably not great for memory usage. You should try to use causal_conv1d.

Okay, thank you for your reply. I am trying causal_conv1d at the same scale, keeping all other parameters unchanged and only changing head_dim to 4 (to meet the requirement of d_state / head_dim % 8 == 0). In my understanding this should actually reduce the parameter count (similar to a depthwise convolution?), but the actual GPU memory usage is now 8.12 GB, still far more than Mamba1's 4.82 GB (in my task). I am not sure whether this is due to the warm-up optimization or the base cost of the SSD's semiseparable matrices. Could you please give me some guidance? Thank you very much!

tridao commented Jul 3, 2024

huh there's no requirement d_state / head_dim % 8 == 0
there's d_model / head_dim % 8 == 0
you can try the dimensions similar to the language models we've released (e.g. d_model = 1024)
I don't have much experience with the kind of small dimensions you're working with

@AlwaysFHao (Author)

> huh there's no requirement d_state / head_dim % 8 == 0, there's d_model / head_dim % 8 == 0 […]

Oh! Yes, it should be d_model / head_dim % 8 == 0. I looked at the source code again and found that I had confused d_model with d_state; I'm sorry. I will try a higher-dimensional experiment next and report back. Additionally, there is an issue in the Mamba2 source code path that uses nn.Conv1d with padding (as the fallback for causal_conv1d); please see the other issue I raised, #437.
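(The constraint discussed above can be written down as a small sanity check. This is a hypothetical helper, not part of the repo; it only restates the rule as clarified in this thread.)

```python
def check_mamba2_head_dim(d_model, head_dim):
    """Check the constraint as stated in this thread:
    d_model / head_dim must be a multiple of 8 (note: d_model, not d_state).
    Returns the resulting ratio if valid."""
    if d_model % head_dim != 0:
        raise ValueError("head_dim must divide d_model")
    if (d_model // head_dim) % 8 != 0:
        raise ValueError("d_model / head_dim must be a multiple of 8")
    return d_model // head_dim

print(check_mamba2_head_dim(1024, 32))  # 1024 / 32 = 32, and 32 % 8 == 0
# check_mamba2_head_dim(128, 32) would raise: 128 / 32 = 4 is not a multiple of 8
```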

@AlwaysFHao (Author)

> huh there's no requirement d_state / head_dim % 8 == 0, there's d_model / head_dim % 8 == 0 […]

Hello, I have continued the experiment at a small dimension: d_model=128, d_state=32, d_conv=4, expand=2, head_dim=32, with 4 encoder layers stacked (this time using causal_conv1d rather than nn.Conv1d; due to device limitations I cannot run experiments at d_model=1024). With the parameters matched as closely as possible, in my task:

- Mamba2: 60083328.0 FLOPs, 876256 parameters, 16.32 GB actual memory usage, 265.22 s training time
- Mamba1: 281728.0 FLOPs, 953728 parameters, 13.35 GB actual memory usage, 586.78 s training time

I don't quite understand why Mamba2 has fewer parameters but higher memory usage, while Mamba1 has fewer FLOPs but slower training. From your experiments at high dimensions, Mamba2's memory usage should be smaller. Is it because Mamba2's Triton-based kernels require some extra memory, or is it the base cost of the semiseparable matrices in the SSD architecture?

@AlwaysFHao (Author)

> huh there's no requirement d_state / head_dim % 8 == 0, there's d_model / head_dim % 8 == 0 […]

If possible, could you release an official pure-PyTorch (rather than Triton) version of SSD? It seems that ssd_minimal does not include the discretization step, and due to my own limitations I cannot guarantee an equivalent reimplementation of your Triton version. Thank you very much!

tridao commented Jul 4, 2024

We already have a reference implementation:

```python
def ssd_chunk_scan_combined_ref(x, dt, A, B, C, chunk_size, D=None, z=None, dt_bias=None, dt_softplus=False):
```

@AlwaysFHao (Author)

> We already have a reference implementation: ssd_chunk_scan_combined_ref […]

Thank you for your great work. I am currently trying ssd_chunk_scan_combined_ref, but chunk_scan_ref (line 1806 in ssd_chunk_state.py) contains the assertion:

assert seqlen == nchunks * chunk_size

I don't quite understand why this restriction does not seem to exist in the Triton version. According to the SSD architecture there should indeed be such a limitation, yet I never ran into it when using the Triton version. Did you add some handling in the Triton version? Could you please help clear up my confusion? Thank you very much!

tridao commented Jul 5, 2024

You can always pad the seqlen. The `assert seqlen == nchunks * chunk_size` in the reference is there for simplicity of implementation. This ref version is not used to train models, only for testing.
The Triton version implicitly pads things inside the kernel, so it supports all kinds of seqlen.
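(The padding described here just rounds seqlen up to the next multiple of chunk_size; because the model is causal, right-padding with zeros does not change the outputs at the original positions. A NumPy sketch, illustrative only; the Triton kernel does this implicitly:)

```python
import numpy as np

def pad_to_chunk_multiple(x, chunk_size):
    """Right-pad (batch, seqlen, dim) with zeros so that seqlen becomes a
    multiple of chunk_size, satisfying the reference implementation's
    `assert seqlen == nchunks * chunk_size`."""
    batch, seqlen, dim = x.shape
    nchunks = -(-seqlen // chunk_size)        # ceil(seqlen / chunk_size)
    pad = nchunks * chunk_size - seqlen
    x_pad = np.pad(x, ((0, 0), (0, pad), (0, 0)))
    return x_pad, seqlen                      # keep seqlen to trim the output

x = np.ones((2, 100, 16))
x_pad, orig_len = pad_to_chunk_multiple(x, chunk_size=64)
print(x_pad.shape, orig_len)                  # (2, 128, 16) 100
```

After running the scan on the padded input, slice the result back with `y[:, :orig_len]`.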

@AlwaysFHao (Author)

> You can always pad the seqlen. […] The Triton version implicitly pads things inside the kernel, so it supports all kinds of seqlen.

Thank you for your prompt reply! I have seen the seqlen-padding code and will keep studying it. In addition, in my task it seems that at the same parameter count Mamba1 always performs better than Mamba2. However, as the feature dimension increases, Mamba2's training and inference speed advantage grows, and the memory-usage gap between Mamba2 and Mamba1 shrinks (Mamba2 uses more memory at low dimensions). By a feature dimension of 512, the memory usage of the two is almost equal. By adjusting chunk_size I found that memory usage changes significantly with the number of chunks, so I think Mamba2's disadvantage at low dimensions involves the base cost of the semiseparable matrices. If you agree with this tentative idea, could Mamba2 be partially optimized in this regard?

@TimothyChen225
> Hello, I have continued the experiment at a small dimension […]

How did you accelerate Mamba2? A warm-up strategy?

@AlwaysFHao (Author)

> How did you accelerate Mamba2? A warm-up strategy?

yes

@dumpmemory

> > How did you accelerate Mamba2? A warm-up strategy?
>
> yes

How can you calculate Mamba2's FLOPs?

@AlwaysFHao (Author)

> How can you calculate Mamba2's FLOPs?

With the recbole.utils.get_flops method in the RecBole framework:
https://github.com/RUCAIBox/RecBole/blob/2b6e209372a1a666fe7207e6c2a96c7c3d49b427/recbole/utils/utils.py#L250

@dumpmemory

> With the recbole.utils.get_flops method in the RecBole framework. https://github.com/RUCAIBox/RecBole/blob/2b6e209372a1a666fe7207e6c2a96c7c3d49b427/recbole/utils/utils.py#L250

Did RecBole take the Triton ops' FLOPs into account?

@chairman-lu

> If possible, could you release an official version of an SSD based on Python instead of Triton? […]

Actually, I'm not quite sure why it is said that ssd_minimal has no discrete implementation. Additionally, I tested the accuracy difference between ssd_minimal_discrete and mamba_chunk_scan_combined, and the maximum difference can reach 0.05. Why is that?

AlwaysFHao commented Aug 28, 2024

> Actually, I'm not quite sure why it is said that ssd_minimal has no discrete implementation. Additionally, I tested the accuracy difference between ssd_minimal_discrete and mamba_chunk_scan_combined, and the maximum difference can reach 0.05. Why is that?

From the implementation in the official blog post and code, ssd_minimal only performs the computation of the SSD kernel itself; it does not include the step that discretizes the A and B matrices. So ssd_minimal assumes discretization has already been applied to its inputs.
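(Concretely, the discretization that mamba_chunk_scan_combined performs internally but ssd_minimal_discrete leaves to the caller can be sketched as below. Shapes and the calling convention follow the Mamba-2 blog post's usage `ssd_minimal_discrete(x * dt.unsqueeze(-1), A * dt, B, C, chunk_size)`; this is my paraphrase in NumPy, not the released code.)

```python
import numpy as np

# Illustrative small shapes: batch=1, seqlen=4, nheads=2, head_dim=3.
rng = np.random.default_rng(0)
x  = rng.standard_normal((1, 4, 2, 3))        # (b, l, h, p)
dt = rng.uniform(0.001, 0.1, size=(1, 4, 2))  # step sizes, e.g. post-softplus
A  = -np.exp(rng.standard_normal(2))          # negative per-head A

# The discretization step that ssd_minimal assumes has already happened:
x_discrete = x * dt[..., None]  # B-side: dt is folded into the input (Euler step)
A_discrete = A * dt             # A-side: per-step log-decay; the kernel exponentiates it

print(x_discrete.shape, A_discrete.shape)     # (1, 4, 2, 3) (1, 4, 2)
```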
