[Bug] save_sharded_state failed #4633

@AllenXu93

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose instead. Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

When I run save_sharded_state.py with the --max-file-size argument, it fails with the following error:

#  python3 examples/runtime/engine/save_sharded_state.py --model-path /data/modelload-test/Qwen2.5-32B-Instruct/ -o /data-local/Qwen2.5-32B-Instruct/ --max-file-size 10737418240

.....

[rank0]:[W320 16:07:23.590363940 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 0]  using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
Traceback (most recent call last):
  File "/sgl-workspace/sglang/examples/runtime/engine/save_sharded_state.py", line 74, in <module>
    main(args)
  File "/sgl-workspace/sglang/examples/runtime/engine/save_sharded_state.py", line 57, in main
    llm.save_sharded_model(
  File "/sgl-workspace/sglang/python/sglang/srt/entrypoints/engine.py", line 395, in save_sharded_model
    self.collective_rpc("save_sharded_model", **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/entrypoints/engine.py", line 389, in collective_rpc
    assert recv_req.success, recv_req.message
AssertionError: '>' not supported between instances of 'int' and 'str'
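
A possible explanation, offered as a sketch rather than a confirmed diagnosis: the message matches what Python raises when an integer is compared with a command-line value that was never converted from a string. The example below assumes the script declares --max-file-size with argparse and without type=int. The helper names parse and exceeds_limit and the default value are hypothetical and are not taken from the sglang source.

```python
import argparse


def parse(argv, typed):
    """Parse --max-file-size with or without an explicit int conversion."""
    parser = argparse.ArgumentParser()
    if typed:
        # Suggested form: argparse converts the string to an int.
        parser.add_argument("--max-file-size", type=int, default=5 * 1024**3)
    else:
        # Assumed buggy form: a value given on the command line stays a str.
        parser.add_argument("--max-file-size", default=5 * 1024**3)
    return parser.parse_args(argv)


def exceeds_limit(total_size, max_size):
    """Hypothetical stand-in for the size check done while saving shards."""
    return total_size > max_size


argv = ["--max-file-size", "10737418240"]

# Without type=int the comparison is int > str and raises TypeError.
untyped = parse(argv, typed=False)
try:
    exceeds_limit(11 * 1024**3, untyped.max_file_size)
    error_message = None
except TypeError as exc:
    error_message = str(exc)
print(error_message)

# With type=int the same comparison works.
typed = parse(argv, typed=True)
print(exceeds_limit(11 * 1024**3, typed.max_file_size))
```

If this is indeed the cause, the failure would appear only when the flag is passed explicitly, since an integer default is never converted to a string.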

Reproduction

Run save_sharded_state.py with the --max-file-size argument, for example:

python3 examples/runtime/engine/save_sharded_state.py --model-path <model-path> -o <output-path> --max-file-size 10737418240

Environment

python3 -m sglang.check_env
INFO 03-20 16:08:26 __init__.py:190] Automatically detected platform cuda.
Python: 3.10.12 (main, Feb 4 2025, 14:57:36) [GCC 11.4.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA A800-SXM4-80GB
GPU 0,1,2,3,4,5,6,7 Compute Capability: 8.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.4, V12.4.131
CUDA Driver Version: 550.90.07
PyTorch: 2.5.1+cu124
sglang: 0.4.4.post1
sgl_kernel: 0.0.5.post3
flashinfer: 0.2.3+cu124torch2.5
triton: 3.1.0
transformers: 4.48.3
torchao: 0.9.0
numpy: 1.26.4
aiohttp: 3.11.13
fastapi: 0.115.11
hf_transfer: 0.1.9
huggingface_hub: 0.29.3
interegular: 0.3.3
modelscope: 1.23.2
orjson: 3.10.15
packaging: 24.2
psutil: 7.0.0
pydantic: 2.10.6
multipart: 0.0.20
zmq: 26.3.0
uvicorn: 0.34.0
uvloop: 0.21.0
vllm: 0.7.2
openai: 1.66.3
tiktoken: 0.9.0
anthropic: 0.49.0
decord: 0.6.0
