[Weight Loading] Expand quantized weight reloading support #28627
Conversation
```diff
-def reload_weights(self) -> None:
-    assert getattr(self, "model", None) is not None, (
-        "Cannot reload weights before model is loaded."
+def reload_weights(
```
so a lot of functionality is here, does this mean all runners have to be modified to include these?
All (well supported) runners already have a `reload_weights` method. Future work would expand them to support a weights iterator argument.
The reload_weights function can be generalized to runners outside of the GPUModelRunner. Future work could move this implementation to a base class or mixin to share functionality with runners such as tpu, xpu, etc.
Which approach should we take, #26327 or this PR? @kylesayrs @jerryzh168
@david6666666 see #26327 (comment); the current plan is that the API and where we do quantization are going to change a few times in this process, so we don't need to spend time improving the current API.
This pull request has merge conflicts that must be resolved before it can be merged.
Purpose
Support reloading weights whose kernel formats (parameters after `process_weights_after_loading` is called) do not match the model loading formats (parameters before `process_weights_after_loading` is called).

Background
When a model is loaded, its parameters are often modified by `process_weights_after_loading` to be better suited to the chosen kernel. This processing can involve online weight quantization or operations like padding and repacking. However, the new parameters after processing cannot be used to load new weights, because they are no longer in the format that they were in when they were loaded.

Proposed Solution
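To make the mismatch concrete, here is a toy illustration in plain Python (no torch or vLLM; the quantization step and `Param` type are stand-ins, not vLLM code) of why a parameter processed into kernel format can no longer accept checkpoint weights:

```python
# Toy illustration: after a hypothetical online-quantization step, the
# parameter's dtype no longer matches the checkpoint's load-time format.
from dataclasses import dataclass

@dataclass
class Param:
    data: list
    dtype: str

def process_weights_after_loading(param, scale=0.1):
    # Stand-in for kernel formatting: repack fp32 values as int8 + scale.
    return Param(data=[round(v / scale) for v in param.data], dtype="int8")

checkpoint = Param(data=[0.12, -0.34, 0.56], dtype="fp32")  # load format
loaded = Param(data=list(checkpoint.data), dtype="fp32")
kernel_param = process_weights_after_loading(loaded)

# The processed parameter is now in kernel format, not load format:
print(kernel_param.dtype, checkpoint.dtype)  # int8 fp32
```

Reloading the fp32 checkpoint into `kernel_param` directly would silently mix formats, which is exactly what the restoration step below avoids.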
In order to support reloading of model weights after kernel formatting occurs, information about the model state prior to `process_weights_after_loading` must be captured. This capture must include metadata like shape and dtype, as well as attributes such as the parameter's weight loader.

After the model format has been captured, the captured metadata can then be used to reconstruct the model load format whenever a weight reload occurs. Newly allocated parameters created by `process_weights_after_loading` are also deleted prior to restoration in order to avoid device memory overflows.

In the case that a user has already formatted their weights into kernel format, this system can be bypassed by calling `reload_weights(process_weights_after_loading=False)`.

Integration Plan
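The record/restore cycle can be sketched in miniature. This is plain Python with dicts standing in for the module's parameters; only the attribute name `weight_loading_metadata` comes from this PR, everything else is illustrative:

```python
# Minimal sketch of the record/restore cycle: capture load-format metadata
# before processing, then on reload delete processing-created parameters
# and rebuild the load-format ones from the captured metadata.
params = {"weight": ("fp32", [0.1, 0.2, 0.3])}

# 1. Record load-format metadata (here just dtype and length; the real
#    capture also keeps attributes like the parameter's weight loader).
weight_loading_metadata = {n: (d, len(v)) for n, (d, v) in params.items()}

# 2. Processing repacks the parameter and allocates a new one.
params["weight"] = ("int8", [1, 2, 3])
params["weight_scale"] = ("fp32", [0.1])      # created by processing

# 3. Restore: drop processing-created params first (avoids holding both
#    kernel-format and load-format copies in memory), then rebuild.
for name in list(params):
    if name not in weight_loading_metadata:
        del params[name]
for name, (dtype, length) in weight_loading_metadata.items():
    params[name] = (dtype, [0.0] * length)    # empty load-format buffer

print(sorted(params))       # ['weight']
print(params["weight"][0])  # fp32
```

Deleting the processing-created parameters before reallocating the load-format buffers is what keeps peak device memory bounded during a reload.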
Many user scripts already exist to reload weights by directly calling `model.load_weights` rather than `runner.reload_weights`. These changes do not affect the existing functionality of those scripts. However, those scripts will only work if the weights already match the kernel format.

If users want to support loading weights which are not in kernel format (for example, to let vLLM handle automatic weight quantization), they are encouraged to either use `runner.reload_weights`, or wrap their `model.load_weights` calls with `restore_weights_for_reloading` and `process_weights_after_loading`.

While this design requires some user code changes, I think that these changes are reasonable to add functionality which did not exist previously.
Changes
- `online_quantization.py`
  - Add `record_weights_for_reloading` and `restore_weights_for_reloading`
  - Remove the `process_weights_after_loading_already_called`, `weight_metadata_and_attr_saved`, `original_weights_rebuild_keys`, and `recorded_weight_attr` flags/attributes; instead use a single attribute, `weight_loading_metadata`
  - Handle cases where `process_weights_after_loading` functions create new parameters (these parameters have to be deleted on reload in order to avoid GPU OOM)
- Update `reload_weights` to support new arguments and quantization configs
  - `weights_iterator` allows a user to pass weights in memory. If none is provided, the weights are reloaded from disk
  - `process_weights_after_loading` allows a user to load weights directly from kernel format

Future Work

- The `reload_weights` function can be generalized to runners outside of the `GPUModelRunner`. Future work could move this implementation to a base class or mixin to share functionality with runners such as tpu, xpu, etc.
- `ONLINE_RELOAD_QUANT_CONFIGS`
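The base-class/mixin idea from Future Work could be sketched as follows. All class and method names here are illustrative stand-ins (the real `GPUModelRunner` and its model live in vLLM), showing only the shape of the shared reload logic:

```python
# Hypothetical sketch: shared reload logic lives in a mixin, and each
# runner (gpu, tpu, xpu, ...) supplies model access and a disk iterator.
class WeightReloadMixin:
    def reload_weights(self, weights_iterator=None,
                       process_weights_after_loading=True):
        assert getattr(self, "model", None) is not None, (
            "Cannot reload weights before model is loaded.")
        weights = (weights_iterator if weights_iterator is not None
                   else self._weights_from_disk())
        self.model.load_weights(weights)
        if process_weights_after_loading:
            self._process_weights_after_loading()

class StubModel:
    def load_weights(self, weights):
        self.weights = list(weights)

class GPUModelRunner(WeightReloadMixin):  # other runners would mix in too
    def __init__(self):
        self.model = StubModel()
        self.processed = False
    def _weights_from_disk(self):
        return [0.0]                       # stand-in for a disk iterator
    def _process_weights_after_loading(self):
        self.processed = True

runner = GPUModelRunner()
runner.reload_weights(weights_iterator=[1.0, 2.0])
print(runner.model.weights, runner.processed)  # [1.0, 2.0] True
```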
reload_weightsfunction can be generalized to runners outside of theGPUModelRunner. Future work could move this implementation to a base class or mixin to share functionality with runners such as tpu, xpu, etc.ONLINE_RELOAD_QUANT_CONFIGS.