
[Frontend] [Core] feat: Add model loading using tensorizer #3476

Merged: 102 commits from sangstar/integrate-tensorizer into vllm-project:main (Apr 14, 2024)

Conversation

@sangstar (Contributor) commented Mar 18, 2024:

Feature Request Issue

Tensorizer Support

This PR allows models used for the OpenAI-compatible API server to be loaded using
CoreWeave's Tensorizer, enabling extremely fast model loads (faster than cached Safetensors)
from HTTP/HTTPS, Redis, and S3 endpoints.

The key changes are:

  1. Adds TensorizerConfig to the set of configs.
  2. Adds tensorizer_loader.py to vllm/model_executor, providing utility functions
    for tensorizer.
  3. Adds multiple args to vLLM's OpenAI inference service entrypoint that allow
    the user to specify the path to serialized-by-tensorizer model tensors, as well as
    arguments for tensorizer's deserializer.
  4. Allows deserialization of model tensors serialized in HuggingFace's model format,
    as well as of serialized vLLM-formatted models. The latter can be loaded with
    plaid_mode, which lets Llama 2 13B start serving requests from a non-local source
    in as little as 10 seconds. Encrypting and decrypting model tensors is also supported.
  5. Adds a tensorize_vllm_model.py script to examples/ that lets vLLM models be serialized and
    deserialized with tensorizer (a minimal sketch of the underlying serialization step follows this list).
  6. Adds tensorizer as an optional dependency.
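
For illustration, here is a minimal sketch of the serialization step that the example script wraps, using tensorizer's TensorSerializer directly; the model name and output path are only examples, and the actual script adds vLLM-specific handling:

```python
# Minimal serialization sketch using tensorizer's serializer API directly.
# The model name and output path are illustrative; the PR's example script
# (examples/tensorize_vllm_model.py) also handles vLLM-formatted models.
import torch
from transformers import AutoModelForCausalLM
from tensorizer import TensorSerializer

model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/pythia-6.9b", torch_dtype=torch.float16
)

serializer = TensorSerializer("model.tensors")  # local path or a writable stream
serializer.write_module(model)                  # writes every tensor in the module
serializer.close()
```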

Credentialing for S3 is supported by passing a user's access key and secret key via the
S3_ACCESS_KEY_ID and S3_SECRET_ACCESS_KEY environment variables, respectively. They can also be specified as CLI args to the API server entrypoint.
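
As a rough sketch, the credentials can also be read in code and handed to tensorizer's stream opener; the keyword argument names below follow tensorizer's stream_io API and should be checked against the installed version:

```python
# Hedged sketch: pass the S3 credentials from the environment variables above
# to tensorizer's stream opener. Argument names follow tensorizer's stream_io
# API at the time of this PR; verify them against your installed version.
import os
from tensorizer import stream_io

stream = stream_io.open_stream(
    "s3://tensorized/EleutherAI/pythia-6.9b/fp16/model.tensors",
    mode="rb",
    s3_access_key_id=os.environ.get("S3_ACCESS_KEY_ID"),
    s3_secret_access_key=os.environ.get("S3_SECRET_ACCESS_KEY"),
)
```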

Model loading benchmarks

Tensorizer can load models like Llama 2 13B in as little as 10 seconds. To do so, a model must be
serialized using TensorSerializer to a .tensors file located either locally or behind an S3, HTTP/HTTPS, or Redis
endpoint. --tensorizer-uri must be set to the location of the serialized tensors when invoking the API server.

Example usage:

python -m vllm.entrypoints.openai.api_server \
--host 0.0.0.0 \
--model EleutherAI/pythia-6.9b \
--load-format tensorizer \
--tensorizer-uri s3://tensorized/EleutherAI/pythia-6.9b/fp16/model.tensors

If a vLLM model is serialized, plaid_mode can be used, which loads much faster. The following plot shows model loading time benchmarks for vLLM's OpenAI-compatible inference server on an NVIDIA A40 GPU.

Tensorizer is so fast that it loads models faster than Safetensors even locally.

[Benchmark plot: model loading times on an NVIDIA A40]
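
For reference, here is a hedged sketch of what loading with plaid_mode looks like when using the tensorizer library directly, outside of vLLM: allocate an empty module, then stream the serialized tensors onto the GPU. The helper and parameter names follow tensorizer's documented API, and the URI reuses the example above.

```python
# Hedged sketch of deserializing with plaid_mode using tensorizer directly
# (outside of vLLM). Names follow tensorizer's documented API; verify against
# your installed version. The URI reuses the example from this PR description.
from transformers import AutoConfig, AutoModelForCausalLM
from tensorizer import TensorDeserializer, stream_io
from tensorizer.utils import no_init_or_tensor

config = AutoConfig.from_pretrained("EleutherAI/pythia-6.9b")
with no_init_or_tensor():
    # Allocate the module without materializing or initializing any weights.
    model = AutoModelForCausalLM.from_config(config)

stream = stream_io.open_stream(
    "s3://tensorized/EleutherAI/pythia-6.9b/fp16/model.tensors", mode="rb"
)
deserializer = TensorDeserializer(stream, device="cuda", plaid_mode=True)
deserializer.load_into_module(model)  # copies each tensor into place on the GPU
deserializer.close()
```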

@sangstar (Contributor, author) commented:

@cadedaniel @rkooo567 Pinging for an assigned reviewer from someone on the team when possible!

@sangstar force-pushed the sangstar/integrate-tensorizer branch 4 times, most recently from 64e8637 to 23a6c03 (April 3, 2024, 18:39).
@sangstar (Contributor, author) commented Apr 3, 2024:

@cadedaniel @rkooo567 @Yard1 @WoosukKwon @zhuohan123 @ywang96

All tests are passing. Can I get eyes on this please? Cheers!

Commit messages pushed to the branch during review:

  • This feature allows vLLM models to be loaded extremely fast using `tensorizer`. `tensorizer` loads serialized model tensors from HTTP/HTTPS, Redis, or S3 endpoints, or locally, typically at multiple GB/s.
  • Allows the deserializer to access S3 credentials when reading: the access key is read from the `S3_ACCESS_KEY_ID` environment variable and the secret key from `S3_SECRET_ACCESS_KEY`.
  • The previous commit wasn't able to pass S3 credentials through `TensorDeserializer`, so they are instead passed to `stream_io.open_stream`, which is used to instantiate the `TensorDeserializer`.
  • Removed functionality that allowed tensorizing without `plaid_mode`, and updated to `tensorizer==2.8.0`.
  • Replaces `download_dir` with `tensorizer_uri` as a `TensorizerArgs` param, following discussions that `download_dir` was a confusing and ultimately unhelpful parameter for locating model tensors. `download_dir` reverts to the HuggingFace convention of a location to download weights for caching, and the new `tensorizer_uri` parameter specifically locates model tensors for `tensorizer`. Also a slight formatting fix for the warning emitted when loading weights with `download_dir` not set to `None`.
  • Integrated changes from the ssteel/tensorizer-support branch that allowed deserializing vLLM models.
  • Integrated previous support for vLLM-formatted model loading with `tensorizer` that makes full use of loading to the GPU with `plaid_mode`, while falling back to loading HuggingFace models for serving via the CPU so that vLLM can perform its own manual GPU loading.
  • Fixed some unnecessary formatting changes in `arg_utils.py`, `weight_utils.py`, and `model_loader.py`; fixed improperly passing `force_http` to `TensorDeserializer` rather than `open_stream`.
  • Misc. fixes from now-resolved conversations, mostly changes to syntax, style, docstrings, and versioning.
  • `examples/tensorize_vllm_model.py` now correctly instantiates vLLM-formatted models.
@sangstar (Contributor, author) commented:

Added additional tests and some fixes; checks are all passing!

@rkooo567 (Collaborator) left a review:

LGTM if @Yard1 gives a go

@ywang96 (Collaborator) left a review:

Thank you very much @sangstar for your contribution to add this feature, and @Yard1 @rkooo567 for the detailed reviews!

I've tested this PR in different scenarios (from S3, from a local directory with weights serialized using the example script, etc.) and left some feedback on the PR. PTAL, thanks!

Review threads (resolved): vllm/engine/arg_utils.py, examples/tensorize_vllm_model.py, vllm/model_executor/tensorizer_loader.py.
Commit: Replaced the model initialization process with one using `LLMEngine`, allowing
vLLM to handle and therefore optimize the initial model loading process. Added testing for quantization.
@sangstar (Contributor, author) commented Apr 13, 2024:

@ywang96 @Yard1 @rkooo567

Thank you all very much for your reviews! I've implemented the changes from @ywang96's comments. To summarize:

  • An error is now raised when attempting to use Tensorizer with a tensor parallel size greater than 1 (test added).
  • The serialization step in examples/tensorize_vllm_model.py now instantiates the model to serialize using LLMEngine.
  • Meta tensors found when deserializing will raise an error (a sketch of this kind of check follows this list).
  • Removed forcing float16 in the parser for examples/tensorize_vllm_model.py.
  • Additionally added a PerformanceWarning when trying to load a tensorized model with quantization, as that is a bit unstable at the moment (I may look into this in another PR) (test added).
  • Added the Tensorizer testing folder to the CI suite.
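
The sketch below only illustrates the kind of meta-tensor guard described in the third bullet; the actual check in this PR lives in the loader code and may differ:

```python
# Illustrative sketch of a meta-tensor guard (not this PR's exact implementation):
# after deserialization, every parameter should hold real data, so any parameter
# still on the meta device indicates an incomplete or mismatched checkpoint.
import torch.nn as nn

def assert_no_meta_tensors(model: nn.Module) -> None:
    for name, param in model.named_parameters():
        if param.is_meta:
            raise ValueError(
                f"Parameter {name} is still a meta tensor after deserialization; "
                "the serialized tensors are likely incomplete or mismatched."
            )
```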

@sangstar (Contributor, author) commented:

Some minor fixes to ensure the testing suite can run the tensorizer tests. All passing! Thanks very much again for the reviews @rkooo567 @Yard1 @ywang96. Let me know if anything else is needed! :)

@sangstar requested a review from @ywang96 (April 14, 2024, 00:07).
@ywang96 (Collaborator) left a review:

LGTM! Thank you again @sangstar for all the work and test coverage on this PR to add this feature!

@ywang96 merged commit 711a000 into vllm-project:main on Apr 14, 2024; 46 checks passed.
@sangstar deleted the sangstar/integrate-tensorizer branch (April 16, 2024, 12:57).
z103cb pushed a commit to z103cb/opendatahub_vllm that referenced this pull request Apr 22, 2024
@bbrowning commented:

I was able to successfully test this in a vLLM 0.4.1 container running on OpenShift, both with models serialized with the Tensorizer library directly and with vLLM-serialized models. Once I cranked up the Pod's CPU and increased the num_readers parameter, I got about an 8x speedup when loading the same model from vLLM-serialized tensorizer files, compared to not using tensorizer at all and just downloading the safetensors from S3 to local disk and then loading with vLLM. This took the Pod's overall cold start time from a bit over 4 minutes to 30 seconds. There may be even more performance available in my setup with additional tweaking, but this is already a great win.

INFO 05-03 11:11:38 tensorizer.py:337] Deserialized 14.5 GB in 15.21s, 953.1 MB/s

That's an awesome improvement, and thank you!
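
For anyone reproducing this, num_readers is a TensorDeserializer option in the tensorizer 2.9.x releases; below is a hedged sketch of setting it when using the library directly (how to pass loader options through vLLM itself depends on the vLLM version):

```python
# Hedged sketch: raising num_readers when deserializing with tensorizer directly.
# num_readers is a TensorDeserializer option in tensorizer 2.9.x; verify the
# exact name and default against the version you have installed.
from tensorizer import TensorDeserializer, stream_io

stream = stream_io.open_stream(
    "s3://tensorized/EleutherAI/pythia-6.9b/fp16/model.tensors", mode="rb"
)
deserializer = TensorDeserializer(stream, device="cuda", num_readers=8)
```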

@sangstar (Contributor, author) commented May 3, 2024:

I'm thrilled to hear that! I actually have a new PR up, #4208, which uses the full tensorizer 2.9.0 release, has better usage documentation, and automatically infers vLLM-serialized models.
