On small models, the actual GPU memory usage of Mamba2 is much higher than that of Mamba1 #439
Comments
nn.Conv1d is probably not great for memory usage. You should try to use causal_conv1d.
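For reference, a minimal sketch of the two depthwise-convolution paths being compared here, assuming the causal-conv1d package is installed; exact keyword names may vary slightly across causal_conv1d versions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from causal_conv1d import causal_conv1d_fn  # fused causal depthwise conv kernel

batch, dim, seqlen, d_conv = 2, 256, 1024, 4
x = torch.randn(batch, dim, seqlen, device="cuda", dtype=torch.float16)

# Fallback path: depthwise nn.Conv1d with left padding, then truncate back to seqlen.
# The padded intermediate of length seqlen + d_conv - 1 is materialized in memory.
conv = nn.Conv1d(dim, dim, d_conv, groups=dim, padding=d_conv - 1,
                 device="cuda", dtype=torch.float16)
y_ref = F.silu(conv(x)[..., :seqlen])

# Fused path: the same causal depthwise convolution (weight reshaped to (dim, d_conv)),
# computed by the fused kernel recommended above, with the SiLU activation fused in.
y_fused = causal_conv1d_fn(x, conv.weight.squeeze(1), conv.bias, activation="silu")

print((y_ref - y_fused).abs().max())
```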
Okay, thank you for your reply. I am trying causal_conv1d at the same scale, keeping all other parameters unchanged and only changing head_dim to 4 (to meet the requirement d_state / head_dim % 8 == 0). In my understanding this should actually reduce the number of parameters (similar to a depthwise conv, DWConv?), but the actual GPU memory usage is now 8.12 GB, far more than Mamba1's 4.82 GB (in my task). I am not sure whether this comes from the warm-up optimization or from the baseline cost of SSD's semiseparable matrices. Could you please give me some guidance? Thank you very much!
huh there's no requirement d_state / head_dim % 8 == 0
Oh! Yes, it should be d_model / head_dim % 8 == 0. I looked at the source code again and found that I had confused d_model with d_state; I'm really sorry. I will try a higher-dimensional experiment next and give you feedback later. Additionally, there is an issue with the code path in the Mamba2 source that uses nn.Conv1d with padding (as a fallback for causal_conv1d); please see the other issue I raised, #437, for details.
Hello, I have continued the experiment at a small dimension, with model parameters d_model=128, d_state=32, d_conv=4, expand=2, and head_dim=32, stacking a total of 4 encoder layers (this time using causal_conv1d instead of nn.Conv1d; due to device limitations I am unable to run experiments at dimension 1028). With the parameters kept as close as possible, in my task the FLOPs of the Mamba2 model are 60083328.0 with a parameter count of 876256; Mamba2's actual memory usage is 16.32 GB and its training time is 265.22 s. The FLOPs of the Mamba1 model are 281728.0 with a parameter count of 953728; Mamba1's actual memory usage is 13.35 GB and its training time is 586.78 s. I don't quite understand why Mamba2 has a smaller parameter count but higher memory usage, while Mamba1 has fewer FLOPs but trains more slowly. From your experiments at high dimensions it seems that Mamba2's memory usage should be smaller. I suspect Mamba2 may need some Triton-level optimization, resulting in higher memory usage here? Or is it caused by the baseline cost of the semiseparable matrices in the SSD architecture?
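As an aside, a minimal, framework-agnostic sketch of how peak GPU memory and training time can be measured for either block, using only standard PyTorch utilities (the loss below is a placeholder for whatever the actual task computes):

```python
import time
import torch

def measure(model, x, n_steps=100):
    """Return (peak GPU memory in GB, elapsed seconds) for a short training loop."""
    opt = torch.optim.AdamW(model.parameters())
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()
    t0 = time.time()
    for _ in range(n_steps):
        out = model(x)
        loss = out.float().mean()      # placeholder loss for the sketch
        loss.backward()
        opt.step()
        opt.zero_grad(set_to_none=True)
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated() / 1e9, time.time() - t0
```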
If possible, could you release an official pure-Python version of SSD instead of the Triton one? It seems that ssd_minimal does not include a discretization-based implementation, and due to my own limitations I cannot guarantee that I can reproduce an equivalent version of your Triton implementation. Thank you very much!
We already have a reference implementation: mamba/mamba_ssm/ops/triton/ssd_combined.py, line 621 (commit 8ffd905).
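A hedged sketch of calling the Triton kernel and that reference side by side; the shape conventions below (x: (batch, seqlen, nheads, headdim), dt: (batch, seqlen, nheads), A: (nheads,), B/C: (batch, seqlen, ngroups, d_state)) and the reference function's signature are assumptions taken from the docstrings and should be double-checked against the linked file:

```python
import torch
from mamba_ssm.ops.triton.ssd_combined import (
    mamba_chunk_scan_combined,    # fused Triton path
    ssd_chunk_scan_combined_ref,  # pure-PyTorch reference path linked above
)

batch, seqlen, nheads, headdim = 1, 512, 8, 32
ngroups, d_state, chunk_size = 1, 32, 128

x  = torch.randn(batch, seqlen, nheads, headdim, device="cuda")
dt = torch.rand(batch, seqlen, nheads, device="cuda")      # positive step sizes
A  = -torch.rand(nheads, device="cuda")                    # negative per-head decay
B  = torch.randn(batch, seqlen, ngroups, d_state, device="cuda")
C  = torch.randn(batch, seqlen, ngroups, d_state, device="cuda")

y_ref = ssd_chunk_scan_combined_ref(x, dt, A, B, C, chunk_size)
y_tri = mamba_chunk_scan_combined(x, dt, A, B, C, chunk_size)
print((y_ref - y_tri).abs().max())
```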
Thank you for your great work. I am currently trying to use ssd_chunk_scan_combined_ref, but there is an assertion in chunk_scan_ref (line 1806 in ssd_chunk_state.py):
You can always pad the seqlen. The assert |
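For concreteness, a minimal sketch of that padding workaround, assuming the seqlen dimension is dim 1 of every input tensor; the padded tail is sliced off the output afterwards, so its values don't matter:

```python
import torch
import torch.nn.functional as F

def pad_seqlen(t, chunk_size):
    """Right-pad dim 1 (seqlen) of tensor t with zeros up to a multiple of chunk_size."""
    seqlen = t.shape[1]
    pad = (-seqlen) % chunk_size
    if pad:
        # F.pad's spec runs from the last dim backwards; only dim 1 gets padded here.
        t = F.pad(t, (0, 0) * (t.dim() - 2) + (0, pad))
    return t, seqlen

# usage sketch: pad every (batch, seqlen, ...) input, run the reference kernel,
# then slice the output back to the original length:
#   x_p, seqlen = pad_seqlen(x, chunk_size)
#   ...
#   y = y_p[:, :seqlen]
```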
Thank you for your prompt reply! I have seen the code related to padding the seqlen and will continue to study it. In addition, in my task it seems that at the same parameter count Mamba1 always performs better than Mamba2. However, as the feature dimension increases, Mamba2's training and inference speed advantage grows, and the gap in memory usage between Mamba2 and Mamba1 shrinks (Mamba2's memory usage is larger at low dimensions). I have tested up to a feature dimension of 512, where the memory usage of Mamba2 and Mamba1 becomes almost equal. By adjusting chunk_size, I found that GPU memory usage changes significantly with the number of chunks, so I think Mamba2's disadvantage at low dimensions comes from the baseline cost of the semiseparable matrices. If you agree with this rough idea, could Mamba2 be partially optimized in this regard?
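To make the chunk_size observation concrete, a rough back-of-envelope sketch: one of the larger intermediates in the chunked algorithm is the per-chunk state buffer, whose shape is roughly (batch, nchunks, nheads, headdim, d_state), so halving chunk_size doubles nchunks and roughly doubles that buffer. This is an estimate of one intermediate under that shape assumption, not a profile of the actual kernels:

```python
def chunk_state_bytes(batch, seqlen, nheads, headdim, d_state, chunk_size,
                      bytes_per_elem=2):
    """Rough size of a (batch, nchunks, nheads, headdim, d_state) chunk-state buffer."""
    nchunks = -(-seqlen // chunk_size)  # ceil division
    return batch * nchunks * nheads * headdim * d_state * bytes_per_elem

# e.g. with batch=32, seqlen=2048, nheads=8, headdim=32, d_state=32 in fp16,
# chunk_size=64 yields 4x the chunk-state memory of chunk_size=256:
print(chunk_state_bytes(32, 2048, 8, 32, 32, 64) / 1e6, "MB")
print(chunk_state_bytes(32, 2048, 8, 32, 32, 256) / 1e6, "MB")
```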
how did you accelerate mamba2? warm up strategy? |
yes |
how can you calculate mamba2 flops?
The recbole.utils.get_flops method in the recbole framework. |
did recbole take the triton ops flops into account? |
Actually, I'm not quite sure why it's said that ssd_minimal doesn't have a discretization-based implementation. Additionally, I tested the accuracy difference between ssd_minimal_discrete and mamba_chunk_scan_combined, and the maximum difference can reach 0.05. Why is that?
From the implementation given in the official blog and code, ssd_minimal only covers the computation of the SSD kernel itself; it does not include the step that discretizes the A and B matrices. In that sense, ssd_minimal has no discretization process.
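For concreteness, a sketch of that caller-side discretization as it appears in the official minimal example, where X is scaled by dt and A is multiplied by dt before the kernel is called; the shapes follow the minimal example's per-head layout, and the commented-out import is a placeholder since the file location may differ across checkouts:

```python
import torch
import torch.nn.functional as F
# from ssd_minimal import ssd_minimal_discrete  # pure-PyTorch SSD kernel (placeholder import)

batch, seqlen, nheads, headdim, d_state, chunk_size = 1, 512, 4, 32, 32, 64

x  = torch.randn(batch, seqlen, nheads, headdim)
dt = F.softplus(torch.randn(batch, seqlen, nheads))   # positive step sizes
A  = -torch.exp(torch.randn(nheads))                  # negative per-head scalar
B  = torch.randn(batch, seqlen, nheads, d_state)
C  = torch.randn(batch, seqlen, nheads, d_state)

# Discretization happens here, outside the kernel (the step referred to above):
# the step size dt is folded into both the input and the decay before the SSD scan.
X_disc = x * dt.unsqueeze(-1)   # (batch, seqlen, nheads, headdim)
A_disc = A * dt                 # broadcasts to (batch, seqlen, nheads)

# y = ssd_minimal_discrete(X_disc, A_disc, B, C, chunk_size)
```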
The parameters of the Mamba2 model are d_state=32, d_conv=4, expand=2, and head_dim=32 (using nn.Conv1d with padding, so without the d_model / head_dim % 8 == 0 constraint). Mamba1 uses the same parameters except that it has no head_dim. Although Mamba2's inference speed is almost double Mamba1's, its actual memory usage increases from 4.82 GB to 7.55 GB (in my task). I would like to ask whether this is due to the baseline computational cost of Mamba2's semiseparable matrices, which would put it at a disadvantage in small-scale models? I see in your paper that on larger-scale models the actual memory usage of Mamba2 is lower.
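For anyone trying to reproduce this comparison, a minimal sketch of instantiating the two blocks side by side and comparing peak memory for one forward/backward pass; note that the released Mamba2 module spells the argument headdim rather than head_dim, and the batch size, sequence length, and dtype below are placeholders, so absolute numbers will differ from the ones reported here:

```python
import torch
from mamba_ssm import Mamba, Mamba2

d_model, batch, seqlen = 128, 16, 2048
x = torch.randn(batch, seqlen, d_model, device="cuda")

def peak_mem_gb(block):
    """Peak GPU memory (GB) for a single forward/backward through the block."""
    torch.cuda.reset_peak_memory_stats()
    block(x).sum().backward()
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated() / 1e9

m1 = Mamba(d_model=d_model, d_state=32, d_conv=4, expand=2).cuda()
m2 = Mamba2(d_model=d_model, d_state=32, d_conv=4, expand=2, headdim=32).cuda()
print("Mamba1:", peak_mem_gb(m1), "GB;  Mamba2:", peak_mem_gb(m2), "GB")
```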