
Device-side assertion triggered on Batch.prepare_for_decode, release v0.1.16 #461

Closed
noah-kim-theori opened this issue May 22, 2024 · 3 comments

noah-kim-theori commented May 22, 2024

In sglang.srt.managers.router.infer_batch, Batch.prepare_for_decode triggers a device-side assertion.

  • model=Command-R-v01, AWQ 4bit quantized
  • max_new_tokens=32384
  • mem_fraction_static=0.6 (on single A100, 81920MiB VRAM)
  • on Regex-constrained decoding

The failing line is:

self.req_to_token_pool.req_to_token[

Below is the crash log; apologies that a reproduction is not provided.

new fill batch. #seq: 1. #cached_token: 8110. #new_token: 3. #remaining_req: 0. #running_req: 0. tree_cache_hit_rate: 99.66%.
new fill batch. #seq: 1. #cached_token: 8153. #new_token: 3. #remaining_req: 0. #running_req: 0. tree_cache_hit_rate: 99.66%.
new fill batch. #seq: 1. #cached_token: 8162. #new_token: 3. #remaining_req: 0. #running_req: 0. tree_cache_hit_rate: 99.66%.
new fill batch. #seq: 1. #cached_token: 8177. #new_token: 4. #remaining_req: 0. #running_req: 0. tree_cache_hit_rate: 99.66%.
new fill batch. #seq: 1. #cached_token: 8182. #new_token: 3. #remaining_req: 0. #running_req: 0. tree_cache_hit_rate: 99.67%.
new fill batch. #seq: 1. #cached_token: 8189. #new_token: 3. #remaining_req: 0. #running_req: 0. tree_cache_hit_rate: 99.67%.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [0,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
Exception in ModelRpcClient:
Traceback (most recent call last):
  File "/home/noah/sglang/python/sglang/srt/managers/router/model_rpc.py", line 213, in exposed_step
    self.forward_step()
  File "/home/noah/miniconda3/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/noah/sglang/python/sglang/srt/managers/router/model_rpc.py", line 248, in forward_step
    self.forward_decode_batch(self.running_batch)
  File "/home/noah/sglang/python/sglang/srt/managers/router/model_rpc.py", line 566, in forward_decode_batch
    batch.prepare_for_decode()
  File "/home/noah/sglang/python/sglang/srt/managers/router/infer_batch.py", line 432, in prepare_for_decode
    self.req_to_token_pool.req_to_token[
RuntimeError: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
noah-kim-theori commented:

I discovered that in sglang.srt.model_config.ModelConfig, the default maximum context length is taken from the HuggingFace configuration.

The function sglang.srt.hf_transformers_utils.get_context_length checks the candidate fields in the following order: max_sequence_length > seq_length > max_position_embeddings > max_seq_len > model_max_length. Since Cohere Command-R v01 has a 128k context length but only 8k positional embeddings, sglang concludes the model's context length is 8k.

Consequently, mis-indexing of Batch.req_to_token_pool.req_to_token occurred when it tried to generate more than 8k tokens.

To fix this issue, the order of candidates should be reconsidered.
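For illustration, here is a hypothetical reimplementation of the candidate-order lookup described above (not sglang's actual code; the config values are illustrative), showing why a Command-R-style config resolves to 8k instead of 128k:

```python
# Candidate keys probed in order; the first one present in the config wins.
CONTEXT_LENGTH_KEYS = [
    "max_sequence_length",
    "seq_length",
    "max_position_embeddings",
    "max_seq_len",
    "model_max_length",
]

def get_context_length(config: dict) -> int:
    """Return the first matching context-length field, with a fallback default."""
    for key in CONTEXT_LENGTH_KEYS:
        if key in config:
            return config[key]
    return 2048

# Command-R-v01-style config: 8k positional embeddings but a much larger
# usable context length (illustrative values).
command_r_config = {
    "max_position_embeddings": 8192,
    "model_max_length": 131072,
}

print(get_context_length(command_r_config))  # 8192, not 131072
```

Because max_position_embeddings is probed before model_max_length, the smaller 8192 value wins, and any request generating past 8k tokens indexes req_to_token out of bounds.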

noah-kim-theori commented May 25, 2024

I think this code should also be updated:

        if server_args.context_length is not None:
            self.context_len = server_args.context_length
        else:
            self.context_len = get_context_length(self.hf_config)
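A minimal standalone sketch of that fallback (mock ServerArgs and config, not sglang's actual classes), showing how an explicit context_length override takes precedence over the HF-derived value:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ServerArgs:
    # User-supplied override; None means "derive from the HF config".
    context_length: Optional[int] = None

def resolve_context_len(server_args: ServerArgs, hf_config: dict) -> int:
    # Prefer the explicit override; otherwise fall back to the (possibly
    # too small) value read from the HF config.
    if server_args.context_length is not None:
        return server_args.context_length
    return hf_config.get("max_position_embeddings", 2048)

hf_config = {"max_position_embeddings": 8192}
print(resolve_context_len(ServerArgs(), hf_config))                      # 8192
print(resolve_context_len(ServerArgs(context_length=131072), hf_config)) # 131072
```

Until the candidate order is fixed, passing the context length explicitly works around the mis-detection.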

This issue has been automatically closed due to inactivity. Please feel free to reopen it if needed.
