
Variable length generation does not support padding tokens #783

@santhnm2

Description


Variable-length generation requires `seq_idx` and `cu_seqlens` to span the full input sequence length, but in some cases I would like to include padding tokens (e.g. to maintain static shapes for CUDA graphs or to ensure power-of-two input dimensions). If I try to get around this limitation by treating the padding tokens as a dummy request, I see that the outputs of the non-padded tokens are affected by the padded tokens. I have included a simple test script to verify this behavior; the script checks that 1) ignoring padding tokens in `seq_idx` and `cu_seqlens` throws a shape error, and 2) performing computation on padding tokens can influence the results of non-padding tokens. Here is the output produced by the test script (using weights from NVIDIA-Nemotron-Nano-12B-v2-Base):
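For context, here is a minimal sketch of how the inputs and metadata for the two scenarios are constructed. The shapes and sequence lengths match the output below; the commented-out forward call is an assumption about the varlen interface (`seq_idx` and `cu_seqlens` keyword arguments), and `mamba2_layer` is a placeholder for the layer under test.

```python
import torch

# Shapes mirror the output below: three real requests (10, 5, and 22 tokens,
# 37 real tokens total) padded out to a static length of 128, hidden size 5120.
batch, padded_len, d_model = 1, 128, 5120
hidden_states = torch.randn(batch, padded_len, d_model, device="cuda", dtype=torch.bfloat16)

# Scenario 1: metadata covers only the real tokens, so cu_seqlens[-1] (37)
# does not match the padded input length (128).
cu_seqlens_real = torch.tensor([0, 10, 15, 37], device="cuda", dtype=torch.int32)

# Scenario 2: the padding region is treated as one extra dummy request so that
# the metadata spans the full 128 tokens.
cu_seqlens_padded = torch.tensor([0, 10, 15, 37, 128], device="cuda", dtype=torch.int32)

# seq_idx maps every token position to its request index, shape (batch, seqlen).
lengths = torch.diff(cu_seqlens_padded).to(torch.int64)
seq_idx = torch.repeat_interleave(
    torch.arange(len(lengths), device="cuda", dtype=torch.int32), lengths
).unsqueeze(0)

# Forward call (signature assumed; mamba2_layer is a placeholder):
# out = mamba2_layer(hidden_states, seq_idx=seq_idx, cu_seqlens=cu_seqlens_padded)
```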

==================== 🧪 SCENARIO 1: Ignoring Padding Tokens ====================
Goal: Show that an error occurs if `hidden_states` contains padding tokens that are not accounted for in `cu_seqlens`.

Input `hidden_states` shape: torch.Size([1, 128, 5120])
`cu_seqlens` created for real requests: tensor([ 0, 10, 15, 37], device='cuda:0', dtype=torch.int32)
The last value in `cu_seqlens` is 37, but the input tensor has 128 tokens. This mismatch is expected to cause an error.

Attempting forward pass with mismatched metadata...
🔴 FAILED: The forward pass threw an error.
   Error Type: RuntimeError
   Error Message: seq_idx must have shape (batch_size, seqlen)

This confirms that the total length of the input tensor must exactly match the token count described by `cu_seqlens`.

==================== 🧪 SCENARIO 2: Padding Tokens Influence Output ====================
Goal: Compare the model's output for a set of requests with and without an additional padding sequence to show that padding influences the result.


--- 🔬 Run 1: Forward pass without any padding ---
Input shape (no padding): torch.Size([1, 37, 5120])
`cu_seqlens` (no padding): tensor([ 0, 10, 15, 37], device='cuda:0', dtype=torch.int32)
Forward pass without padding complete.

--- 🔭 Run 2: Forward pass with a padding sequence ---
Input shape (with padding): torch.Size([1, 128, 5120])
`cu_seqlens` (with padding): tensor([  0,  10,  15,  37, 128], device='cuda:0', dtype=torch.int32)
Forward pass with padding complete.

--- ✅ Comparison Results ---
🔴 ERROR: The outputs for the real tokens are DIFFERENT.
   -> Max absolute difference: 0.003906

This demonstrates that the presence of a padding sequence (even when treated as a separate request) influences the computation for the real tokens.
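For reference, the comparison in Scenario 2 is over the real-token slice only, roughly as follows (variable names are illustrative; `out_no_pad` and `out_padded` are the outputs of the two runs):

```python
# out_no_pad has shape (1, 37, 5120); out_padded has shape (1, 128, 5120).
# Only the real tokens are compared; the padding positions are ignored.
num_real = cu_seqlens_real[-1].item()  # 37
diff = (out_no_pad[:, :num_real] - out_padded[:, :num_real]).abs().max().item()
print(f"Max absolute difference: {diff:.6f}")
# Expected 0 if the dummy padding request were fully isolated from the real
# requests; the run above reports ~0.0039 instead.
```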

You can run the test script as follows:

HF_HOME=/path/to/hf_home python mamba2_padding_test.py

I tested on a single H100 GPU with version 2.2.5.
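As a side note, the Scenario 1 mismatch can be caught up front with a plain PyTorch check before calling the layer; this is not a library API, it just restates the invariant that the shape error points to:

```python
# The varlen metadata must account for every token in the input, padding included.
assert cu_seqlens[-1].item() == hidden_states.shape[1], (
    f"cu_seqlens covers {cu_seqlens[-1].item()} tokens, "
    f"but the input has {hidden_states.shape[1]}"
)
# seq_idx must likewise span the full (batch_size, seqlen) input.
assert seq_idx.shape == hidden_states.shape[:2]
```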
