Description
Variable-length generation requires `seq_idx` and `cu_seqlens` to span the full input sequence length, but in some cases I want to include padding tokens (e.g. to keep static shapes for CUDA graphs, or to ensure power-of-two input dimensions). If I try to work around this limitation by treating the padding tokens as a dummy request, I see that the outputs for the non-padded tokens are affected by the padded tokens. I have included a simple test script to verify this behavior; it checks that 1) omitting padding tokens from `seq_idx` and `cu_seqlens` throws a shape error, and 2) performing computation on padding tokens can influence the results for non-padding tokens (a sketch of the setup is included after the Scenario 1 output below). Here is the output produced by the test script (using weights from NVIDIA-Nemotron-Nano-12B-v2-Base):
==================== 🧪 SCENARIO 1: Ignoring Padding Tokens ====================
Goal: Show that an error occurs if `hidden_states` contains padding tokens that are not accounted for in `cu_seqlens`.
Input `hidden_states` shape: torch.Size([1, 128, 5120])
`cu_seqlens` created for real requests: tensor([ 0, 10, 15, 37], device='cuda:0', dtype=torch.int32)
The last value in `cu_seqlens` is 37, but the input tensor has 128 tokens. This mismatch is expected to cause an error.
Attempting forward pass with mismatched metadata...
🔴 FAILED: The forward pass threw an error.
Error Type: RuntimeError
Error Message: seq_idx must have shape (batch_size, seqlen)
This confirms that the total length of the input tensor must exactly match the token count described by `cu_seqlens`.
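For concreteness, here is a minimal sketch of the Scenario 1 setup, using the request lengths (10, 5, 22 tokens) and the padded length of 128 from the output above. The layer construction and the forward call with `seq_idx`/`cu_seqlens` keywords are assumptions modeled on mamba_ssm's `Mamba2` module, not the attached script verbatim (the real test loads weights from NVIDIA-Nemotron-Nano-12B-v2-Base):

```python
import torch
from mamba_ssm.modules.mamba2 import Mamba2  # assumed import; varlen path via seq_idx/cu_seqlens

device, dtype = "cuda", torch.bfloat16
d_model, padded_len = 5120, 128
request_lens = [10, 5, 22]  # three real requests, 37 tokens total

# Cumulative request boundaries over the real tokens only: [0, 10, 15, 37]
cu_seqlens = torch.tensor(
    [0] + torch.cumsum(torch.tensor(request_lens), dim=0).tolist(),
    dtype=torch.int32, device=device,
)

# seq_idx maps each real token to its request id, shape (1, 37)
seq_idx = torch.repeat_interleave(
    torch.arange(len(request_lens), dtype=torch.int32, device=device),
    torch.tensor(request_lens, device=device),
).unsqueeze(0)

# Padded input: 37 real tokens followed by 91 padding tokens
hidden_states = torch.randn(1, padded_len, d_model, device=device, dtype=dtype)

# Hypothetical stand-in for a layer loaded from the pretrained checkpoint
layer = Mamba2(d_model=d_model, device=device, dtype=dtype)

# Scenario 1: metadata describes only the 37 real tokens while the input has 128
try:
    out = layer(hidden_states, seq_idx=seq_idx, cu_seqlens=cu_seqlens)
except RuntimeError as e:
    print(f"Forward pass failed: {e}")  # "seq_idx must have shape (batch_size, seqlen)"
```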
==================== 🧪 SCENARIO 2: Padding Tokens Influence Output ====================
Goal: Compare the model's output for a set of requests with and without an additional padding sequence to show that padding influences the result.
--- 🔬 Run 1: Forward pass without any padding ---
Input shape (no padding): torch.Size([1, 37, 5120])
`cu_seqlens` (no padding): tensor([ 0, 10, 15, 37], device='cuda:0', dtype=torch.int32)
Forward pass without padding complete.
--- 🔭 Run 2: Forward pass with a padding sequence ---
Input shape (with padding): torch.Size([1, 128, 5120])
`cu_seqlens` (with padding): tensor([ 0, 10, 15, 37, 128], device='cuda:0', dtype=torch.int32)
Forward pass with padding complete.
--- ✅ Comparison Results ---
🔴 ERROR: The outputs for the real tokens are DIFFERENT.
-> Max absolute difference: 0.003906
This demonstrates that the presence of a padding sequence (even when treated as a separate request) influences the computation for the real tokens.
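Scenario 2 extends the same sketch by registering the padding block as a fourth dummy request so the metadata covers all 128 tokens, then compares the outputs for the real tokens (again assuming the hypothetical `layer` and forward signature above):

```python
total_real = cu_seqlens[-1].item()  # 37

# Run 1: real tokens only, metadata exactly as built above
out_no_pad = layer(hidden_states[:, :total_real], seq_idx=seq_idx, cu_seqlens=cu_seqlens)

# Run 2: full padded tensor, with the padding block added as an extra request
cu_seqlens_pad = torch.cat([
    cu_seqlens,
    torch.tensor([padded_len], dtype=torch.int32, device=device),
])
seq_idx_pad = torch.cat([
    seq_idx,
    torch.full((1, padded_len - total_real), len(request_lens),
               dtype=torch.int32, device=device),
], dim=1)
out_pad = layer(hidden_states, seq_idx=seq_idx_pad, cu_seqlens=cu_seqlens_pad)

# The outputs for the 37 real tokens should be identical, but they differ
diff = (out_no_pad[:, :total_real] - out_pad[:, :total_real]).abs().max().item()
print(f"Max absolute difference on real tokens: {diff:.6f}")
```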
You can run the test script as follows:
HF_HOME=/path/to/hf_home python mamba2_padding_test.py
I tested on a single H100 GPU with mamba-ssm version 2.2.5.