🚀 The feature, motivation and pitch
I am interested in adding a cpu_offload_weight capability to vLLM, and I wonder whether the community would like to see it happen.
The following is my design, along with a simple PoC implementation. I would love to hear feedback and suggestions before investing more time in it.
Objective
The rapid growth of AI model sizes drives demand for GPUs with ever-larger memory capacities. This puts many models out of reach for users without access to powerful GPUs, preventing them from running vLLM at all.
To address this, we are developing a feature called "cpu-offload-weight" for vLLM. With CPU offload, users can experiment with large models even without access to high-end GPUs. This democratizes access to vLLM, empowering a broader community of learners and researchers to engage with cutting-edge AI models.
Proposed Features
Upon initial loading of the model, cpu-offload pins the entire set of weights in CPU memory. During inference, weights are streamed into the GPU as required: when a layer is about to be computed, its weights are copied from CPU to GPU. Once the layer's computation completes, its outputs are retained in GPU memory to serve as inputs to subsequent layers, while the layer's GPU weight copy is released, freeing memory for the next layer. This keeps peak GPU weight usage bounded by roughly one layer, so vLLM can run even with limited GPU resources.
No changes to KV caches or other data structures are needed.
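The per-layer streaming scheme above can be sketched in a few lines of PyTorch. This is an illustrative toy, not vLLM's actual code: `OffloadedLinear` is a hypothetical name, and the example falls back to plain CPU tensors when no GPU is available so it can run anywhere.

```python
import torch
import torch.nn.functional as F

# Illustrative sketch of per-layer weight offloading (not vLLM's actual code).
# Weights live in (ideally pinned) CPU memory; each layer copies its weights
# to the GPU just before computing and drops the GPU copy right after.
device = "cuda" if torch.cuda.is_available() else "cpu"
pin = torch.cuda.is_available()  # pinning requires a CUDA build with a GPU

class OffloadedLinear:
    """A linear layer whose weight lives on the CPU between forward calls."""

    def __init__(self, in_features: int, out_features: int):
        w = torch.randn(out_features, in_features) * 0.02
        # Pinned host memory enables fast, asynchronous host-to-device copies.
        self.weight_cpu = w.pin_memory() if pin else w

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Stream the weight in only for the duration of this layer's compute.
        w = self.weight_cpu.to(device, non_blocking=True)
        out = F.linear(x, w)
        del w  # the GPU copy becomes free; the CPU master copy remains
        return out

layers = [OffloadedLinear(16, 16) for _ in range(4)]
x = torch.randn(2, 16, device=device)
for layer in layers:
    x = layer.forward(x)  # activations stay on the device between layers
print(x.shape)  # torch.Size([2, 16])
```

Note how only the activations persist on the GPU across layers, matching the description above.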
PoC Implementation
I tried this concept in my forked vLLM repo, bd-iaas-us@352a767#diff-bec94efb3136aec6b8c231b14f18447f7ffdee404b534cdc0e57ed92e7d96b8c.
With this change, I am able to run facebook/opt-13b on a single T4 GPU. This GPU has only 15 GB of memory, while the weights of facebook/opt-13b alone take about 25 GB, let alone the KV cache, etc.
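A quick back-of-the-envelope check makes the gap concrete (parameter count rounded to ~13 billion, fp16 weights):

```python
# Why opt-13b cannot fit on a T4 without offloading (rough arithmetic).
params = 13e9          # ~13 billion parameters
bytes_per_param = 2    # fp16
weight_gb = params * bytes_per_param / 1e9
t4_gb = 15             # approximate T4 memory capacity
print(f"weights: {weight_gb:.0f} GB, T4: {t4_gb} GB")  # weights: 26 GB, T4: 15 GB
```

The weights alone exceed the card's capacity by a wide margin before counting activations or the KV cache.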
This implementation is preliminary, yet it shows:

- CPU offload of weights does help democratize vLLM, allowing people to try out big models on GPUs with limited memory.
- The changes to existing models can be safe and simple. If you take a look at the PR I shared above, the changes to opt.py are just 1) adding a new parameter, cpu_offload_weight, to the init of the modules, and 2) adding one line of decorator to some forward() functions to indicate which weights need to be loaded from the CPU.
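The one-line-decorator idea can be sketched roughly as follows. The decorator name, attribute names, and `TinyLayer` class are all hypothetical, chosen only to illustrate the shape of the change, not the PR's actual API:

```python
import functools
import torch

# Hypothetical sketch of the decorator idea: it copies the named weight
# attributes to the input's device before forward() runs, and restores the
# CPU master copies afterwards so the device copies can be freed.
def load_weights_from_cpu(*weight_names):
    def wrap(forward):
        @functools.wraps(forward)
        def inner(self, x, *args, **kwargs):
            cpu_copies = {name: getattr(self, name) for name in weight_names}
            for name, w in cpu_copies.items():
                setattr(self, name, w.to(x.device, non_blocking=True))
            try:
                return forward(self, x, *args, **kwargs)
            finally:
                # Put the CPU copies back; the device copies become garbage.
                for name, w in cpu_copies.items():
                    setattr(self, name, w)
        return inner
    return wrap

class TinyLayer:
    def __init__(self):
        self.weight = torch.randn(8, 8) * 0.1  # lives on the CPU

    @load_weights_from_cpu("weight")
    def forward(self, x):
        return x @ self.weight.t()

layer = TinyLayer()
y = layer.forward(torch.randn(2, 8))
print(y.shape)  # torch.Size([2, 8])
```

The model code itself stays unchanged apart from the decorator line, which is what keeps the per-model diff small.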
Proposed Future Improvements
CPU offload will certainly increase latency and hurt throughput. I am going to invest more time in the following improvements:
- Prefetching layers' weights: loading the weights of the next layer while the GPU is computing the current layer, if memory permits.
- Parallel fetching: for inference on multiple GPUs, parallelizing the fetch of each layer across the GPUs, with each GPU fetching only a portion of the layer. Employing the GPUs' aggregate PCIe links in this way scales the effective transfer bandwidth roughly linearly with the number of GPUs, reducing latency.
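The prefetching idea can be sketched as below. This is a simplified, hypothetical version: it issues the next layer's host-to-device copy (non-blocking, so on a GPU it can overlap with compute via the copy engine) before computing the current layer; a real implementation would also use a dedicated CUDA stream plus events to guarantee overlap and ordering. The example degrades gracefully to CPU, where the copies are no-ops:

```python
import torch

# Sketch of weight prefetching: start fetching layer i+1 while layer i is
# computing. Function and variable names are illustrative, not vLLM's API.
device = "cuda" if torch.cuda.is_available() else "cpu"

def forward_with_prefetch(cpu_weights, x):
    """cpu_weights: per-layer weight tensors kept on the CPU."""
    current = cpu_weights[0].to(device, non_blocking=True)
    for i, _ in enumerate(cpu_weights):
        nxt = None
        if i + 1 < len(cpu_weights):
            # Kick off the next layer's copy before computing this one.
            nxt = cpu_weights[i + 1].to(device, non_blocking=True)
        x = torch.relu(x @ current.t())
        current = nxt  # the finished layer's device copy is now free
    return x

weights = [torch.randn(16, 16) * 0.1 for _ in range(4)]
out = forward_with_prefetch(weights, torch.randn(2, 16, device=device))
print(out.shape)  # torch.Size([2, 16])
```

At most two layers' weights are resident on the device at once, which is the memory cost of hiding the transfer latency.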
Alternatives
No response
Additional context
No response