Implement runai model streamer for MODEL_IMPL_TYPE=flax_nnx #955
Merged
New file: Buildkite pipeline steps for the runai_model_streamer_loader correctness test.

```yaml
# runai_model_streamer_loader
# The RunAI Model Streamer is a high-performance model loader that serves as an
# alternative to the default Hugging Face loader. Instead of downloading a model
# to local disk, it streams the weights from object storage (like GCS) into
# GPU memory. This streaming process is significantly faster than the traditional
# disk-based loading method.
steps:
  - label: "Correctness tests for runai_model_streamer_loader"
    key: "runai_model_streamer_loader_CorrectnessTest"
    soft_fail: true
    agents:
      queue: tpu_v6e_queue
    commands:
      - .buildkite/scripts/run_in_docker.sh python3 -m pytest -s -v /workspace/tpu_inference/tests/e2e/test_runai_model_streamer_loader.py::test_correctness
  - label: "Record correctness test result for runai_model_streamer_loader"
    key: "record_runai_model_streamer_loader_CorrectnessTest"
    depends_on: "runai_model_streamer_loader_CorrectnessTest"
    env:
      CI_TARGET: "runai_model_streamer_loader"
      CI_STAGE: "CorrectnessTest"
      CI_CATEGORY: "feature support matrix"
    agents:
      queue: cpu
    commands:
      - |
        .buildkite/scripts/record_step_result.sh runai_model_streamer_loader_CorrectnessTest
```
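The first step runs the end-to-end test on a v6e TPU agent; the second, gated by `depends_on`, records the outcome for the feature support matrix. For orientation, a minimal sketch of the two load paths the test compares (the model and bucket names below are placeholders, not the CI values):

```python
# Sketch only: contrasts the default disk-based loader with the streamer.
from vllm import LLM

# Default Hugging Face loader: downloads weights to local disk, then loads.
hf_llm = LLM(model="meta-llama/Meta-Llama-3-8B")

# RunAI Model Streamer: streams weights directly from object storage.
streamed_llm = LLM(model="gs://some-bucket/llama3-8b-hf",  # placeholder path
                   load_format="runai_streamer")
```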
Requirements change: the streamer is added as a pinned dependency with its S3 and GCS extras (tail of the requirements file shown for context):

```text
pathwaysutils
parameterized
numba==0.62.1
runai-model-streamer[s3,gcs]==0.15.0
```
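The `[s3,gcs]` extras pull in the object-storage backends the loader needs. As a quick sanity check that the pinned package is present in the CI image (the module name `runai_model_streamer` is assumed from the distribution name):

```python
# Verify the streamer package is importable without actually loading a model.
import importlib.util

assert importlib.util.find_spec("runai_model_streamer") is not None, (
    "runai-model-streamer is not installed")
```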
New file: tests/e2e/test_runai_model_streamer_loader.py.

```python
# This file contains end-to-end tests for the RunAI Model Streamer loader.
#
# The RunAI Model Streamer is a high-performance model loader that serves as an
# alternative to the default Hugging Face loader. Instead of downloading a model
# to local disk, it streams the weights from object storage (like GCS) into
# GPU memory. This streaming process is significantly faster than the
# traditional disk-based loading method.
#
# The tests in this file verify that loading model weights using the
# streamer produces the same results as loading the same model using the
# standard Hugging Face loader. This ensures the correctness of the streamer
# integration.
#
# The tests are performed by:
# 1. Loading a model from Google Cloud Storage using the `runai_streamer` format.
# 2. Generating output with this model.
# 3. Loading the same model from Hugging Face using the default loader.
# 4. Generating output with this second model.
# 5. Asserting that the outputs from both models are identical.

from __future__ import annotations

import time

import pytest
from vllm import LLM, SamplingParams


@pytest.fixture
def sampling_config():
    # Greedy decoding (temperature=0) makes both runs deterministic and
    # directly comparable.
    return SamplingParams(temperature=0, max_tokens=10, ignore_eos=True)


@pytest.fixture
# TODO(amacaskill): Replace with GKE owned GCS bucket.
def gcs_model_name():
    return "gs://vertex-model-garden-public-us/llama3/llama3-8b-hf"


@pytest.fixture
def hf_model_name():
    return "meta-llama/Meta-Llama-3-8B"


@pytest.fixture
def prompt():
    return "Hello, my name is"


def test_correctness(
    sampling_config: SamplingParams,
    gcs_model_name: str,
    hf_model_name: str,
    prompt: str,
    monkeypatch: pytest.MonkeyPatch,
):
    '''
    Compare the outputs of a model loaded from GCS via runai_model_streamer
    and a model loaded from Hugging Face. The outputs should be the same.
    These tests use tensor_parallel_size=1. The model is 16GB, and v6e has
    32GB of HBM, so it will fit.
    '''
    # Set ENV variables so that runai_model_streamer uses anonymous GCS access.
    monkeypatch.setenv("GOOGLE_CLOUD_PROJECT", "fake-project")
    monkeypatch.setenv("RUNAI_STREAMER_GCS_USE_ANONYMOUS_CREDENTIALS", "true")
    monkeypatch.setenv("CLOUD_STORAGE_EMULATOR_ENDPOINT",
                       "https://storage.googleapis.com")
    gcs_llm = LLM(model=gcs_model_name,
                  load_format="runai_streamer",
                  max_model_len=128,
                  max_num_seqs=16,
                  max_num_batched_tokens=256)
    gcs_outputs = gcs_llm.generate([prompt], sampling_config)
    gcs_output_text = gcs_outputs[0].outputs[0].text
    del gcs_llm
    time.sleep(10)  # Wait for TPUs to be released

    # Test with Hugging Face model
    hf_llm = LLM(model=hf_model_name,
                 max_model_len=128,
                 max_num_seqs=16,
                 max_num_batched_tokens=256)
    hf_outputs = hf_llm.generate([prompt], sampling_config)
    hf_output_text = hf_outputs[0].outputs[0].text
    del hf_llm
    time.sleep(10)  # Wait for TPUs to be released

    assert gcs_output_text == hf_output_text, (
        f"Outputs do not match! "
        f"GCS output: {gcs_output_text}, HF output: {hf_output_text}")
```