
Variable length generation does not support padding tokens #783

@santhnm2

Description


Variable-length generation requires `seq_idx` and `cu_seqlens` to span the full input sequence length, but in some cases I would like to include padding tokens (e.g. to maintain static shapes for CUDA graphs or to ensure power-of-two input dimensions). If I try to get around this limitation by treating the padding tokens as a dummy request, I see that the outputs of the non-padded tokens are affected by the padded tokens. I have included a simple test script to verify this behavior; the script checks that 1) ignoring padding tokens in `seq_idx` and `cu_seqlens` throws a shape error, and 2) performing computation on padding tokens can influence the results of non-padding tokens. Here is the output produced by the test script (using weights from NVIDIA-Nemotron-Nano-12B-v2-Base):
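For context, here is a minimal sketch of how the inputs and metadata for the two scenarios are constructed. The shapes and sequence lengths match the output below; the commented-out forward call is an assumption about the varlen interface (`seq_idx` and `cu_seqlens` keyword arguments), and `mamba2_layer` is a placeholder for the layer under test.

```python
import torch

# Shapes mirror the output below: three real requests (10, 5, and 22 tokens,
# 37 real tokens total) padded out to a static length of 128, hidden size 5120.
batch, padded_len, d_model = 1, 128, 5120
hidden_states = torch.randn(batch, padded_len, d_model, device="cuda", dtype=torch.bfloat16)

# Scenario 1: metadata covers only the real tokens, so cu_seqlens[-1] (37)
# does not match the padded input length (128).
cu_seqlens_real = torch.tensor([0, 10, 15, 37], device="cuda", dtype=torch.int32)

# Scenario 2: the padding region is treated as one extra dummy request so that
# the metadata spans the full 128 tokens.
cu_seqlens_padded = torch.tensor([0, 10, 15, 37, 128], device="cuda", dtype=torch.int32)

# seq_idx maps every token position to its request index, shape (batch, seqlen).
lengths = torch.diff(cu_seqlens_padded).to(torch.int64)
seq_idx = torch.repeat_interleave(
    torch.arange(len(lengths), device="cuda", dtype=torch.int32), lengths
).unsqueeze(0)

# Forward call (signature assumed; mamba2_layer is a placeholder):
# out = mamba2_layer(hidden_states, seq_idx=seq_idx, cu_seqlens=cu_seqlens_padded)
```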

==================== 🧪 SCENARIO 1: Ignoring Padding Tokens ====================
Goal: Show that an error occurs if `hidden_states` contains padding tokens that are not accounted for in `cu_seqlens`.

Input `hidden_states` shape: torch.Size([1, 128, 5120])
`cu_seqlens` created for real requests: tensor([ 0, 10, 15, 37], device='cuda:0', dtype=torch.int32)
The last value in `cu_seqlens` is 37, but the input tensor has 128 tokens. This mismatch is expected to cause an error.

Attempting forward pass with mismatched metadata...
🔴 FAILED: The forward pass threw an error.
   Error Type: RuntimeError
   Error Message: seq_idx must have shape (batch_size, seqlen)

This confirms that the total length of the input tensor must exactly match the token count described by `cu_seqlens`.

==================== 🧪 SCENARIO 2: Padding Tokens Influence Output ====================
Goal: Compare the model's output for a set of requests with and without an additional padding sequence to show that padding influences the result.


--- 🔬 Run 1: Forward pass without any padding ---
Input shape (no padding): torch.Size([1, 37, 5120])
`cu_seqlens` (no padding): tensor([ 0, 10, 15, 37], device='cuda:0', dtype=torch.int32)
Forward pass without padding complete.

--- 🔭 Run 2: Forward pass with a padding sequence ---
Input shape (with padding): torch.Size([1, 128, 5120])
`cu_seqlens` (with padding): tensor([  0,  10,  15,  37, 128], device='cuda:0', dtype=torch.int32)
Forward pass with padding complete.

--- ✅ Comparison Results ---
🔴 ERROR: The outputs for the real tokens are DIFFERENT.
   -> Max absolute difference: 0.003906

This demonstrates that the presence of a padding sequence (even when treated as a separate request) influences the computation for the real tokens.
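For reference, the comparison in Scenario 2 is over the real-token slice only, roughly as follows (variable names are illustrative; `out_no_pad` and `out_padded` are the outputs of the two runs):

```python
# out_no_pad has shape (1, 37, 5120); out_padded has shape (1, 128, 5120).
# Only the real tokens are compared; the padding positions are ignored.
num_real = cu_seqlens_real[-1].item()  # 37
diff = (out_no_pad[:, :num_real] - out_padded[:, :num_real]).abs().max().item()
print(f"Max absolute difference: {diff:.6f}")
# Expected 0 if the dummy padding request were fully isolated from the real
# requests; the run above reports ~0.0039 instead.
```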

You can run the test script as follows:

HF_HOME=/path/to/hf_home python mamba2_padding_test.py

I tested on a single H100 GPU with version 2.2.5.
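As a side note, the Scenario 1 mismatch can be caught up front with a plain PyTorch check before calling the layer; this is not a library API, it just restates the invariant that the shape error points to:

```python
# The varlen metadata must account for every token in the input, padding included.
assert cu_seqlens[-1].item() == hidden_states.shape[1], (
    f"cu_seqlens covers {cu_seqlens[-1].item()} tokens, "
    f"but the input has {hidden_states.shape[1]}"
)
# seq_idx must likewise span the full (batch_size, seqlen) input.
assert seq_idx.shape == hidden_states.shape[:2]
```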
