
[CI] Add Buildkite #2355

Merged · 74 commits merged into vllm-project:main · Jan 14, 2024

Conversation

simon-mo (Collaborator) commented Jan 5, 2024

This PR adds the basic setup for a GPU CI environment. It should enable us to run our tests on L4 GPUs. As a developer, you can add new tests to .buildkite/test-pipeline.yaml.

Currently, I have all tests enabled, in addition to benchmarks. However, I don't want to block the merge of this PR on debugging model output (test_models.py) and tuning memory (test_attention.py), so I marked those tests as "soft fail" for now.

Please see the latest build here as an example: https://buildkite.com/vllm/ci/builds/182. The end-to-end build time in the worst case (fresh Docker build, slow machine start) is about 1 hour; in the best case (Docker cached, machine available) it is about 15 minutes. Of course, if too many PRs are submitted at the same time, builds might need to wait in the queue a bit. We are capped at 10 GPU machines for budget reasons. The full infrastructure setup is described and maintained in a separate repo: https://github.com/vllm-project/buildkite-ci.

The code changes in this PR are mostly in the .buildkite directory and the associated Dockerfile and setup.py. Everything else was done to make the existing tests pass.

Future work includes:

  • Fix test models
  • Fix test kernels
  • Migrate lint, docs, and wheels to CPU-only machines
  • Add tests for chat models (test chat template coverage)
  • Benchmark real models (instead of just OPT) on A100

simon-mo (Collaborator, Author) commented:
Note to self: the last remaining blocker is the memory requirement for the HuggingFace models. We can run the models on 2xL4 in vLLM, but HuggingFace doesn't have a good TP strategy that's simple to use. Currently I'm trying accelerate to do offload; if that doesn't work we might go to A100s.
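
(For illustration only, not this PR's test code: a minimal sketch of the accelerate-based offload being tried, assuming a placeholder model name and prompt.)

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "facebook/opt-6.7b"  # placeholder; the actual test models may differ

    # device_map="auto" lets accelerate shard layers across the available GPUs
    # and offload any remainder to CPU RAM instead of OOM-ing on a 24 GB L4.
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.float16,
        device_map="auto",
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=16)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))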

simon-mo (Collaborator, Author) commented Jan 12, 2024

TODOs:

  • Migrate lint, docs, and wheels to CPU-only machines
  • Add tests for chat models (test chat template coverage)

simon-mo (Collaborator, Author) commented:
I soft-failed the kernels and models tests. The models ran successfully with bfloat16, but some of the output doesn't match :(. The kernel test is too difficult to tune.

zhuohan123 (Collaborator) left a review:
Thanks for the hard work Simon! Left some small comments.

Comment on lines 65 to 66:

    if max_tries == 0:
        raise RuntimeError("Server did not start") from err

zhuohan123 (Collaborator) suggested change:

    if max_tries == 0:
        raise RuntimeError("Server did not start") from err
    max_tries -= 1
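
(For context, a hedged reconstruction rather than the PR's actual helper: the kind of server-startup polling loop this suggestion targets, with the decrement applied so the retry budget is actually consumed. The health-check URL and timings are illustrative.)

    import time
    import requests

    def wait_for_server(url: str = "http://localhost:8000/health",
                        max_tries: int = 60) -> None:
        """Poll the API server until it responds, or give up after max_tries."""
        err = None
        while max_tries > 0:
            try:
                if requests.get(url, timeout=1).status_code == 200:
                    return
            except requests.exceptions.RequestException as e:
                err = e
            max_tries -= 1
            time.sleep(1)
        raise RuntimeError("Server did not start") from err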

    import pytest
    import torch
    import ray

zhuohan123 (Collaborator): Just curious, why change to ray?

simon-mo (Collaborator, Author): In the code below, I added a comment saying that ray gives much better logs for debugging; I couldn't figure out the failures from multiprocessing.

zhuohan123 (Collaborator): Gotcha! Makes a lot of sense.
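
(To illustrate the point above, a small sketch, not the test code itself, of why a ray task is easier to debug than a multiprocessing worker: the remote exception and its traceback come back through ray.get.)

    import ray

    ray.init(ignore_reinit_error=True)

    @ray.remote
    def run_hf_model(prompt: str) -> str:
        # In the real tests this would load the HuggingFace model and generate;
        # here we raise deliberately to show how the error surfaces.
        raise RuntimeError("CUDA out of memory")  # illustrative failure

    try:
        ray.get(run_hf_model.remote("Hello"))
    except ray.exceptions.RayTaskError as err:
        # The original exception plus its remote traceback are preserved,
        # which is what makes CI failures much easier to diagnose.
        print(err)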

@@ -21,7 +21,8 @@

 @pytest.mark.parametrize("model", MODELS)
-@pytest.mark.parametrize("dtype", ["float"])
+# half is required to get this working on CI's L4 GPU
+@pytest.mark.parametrize("dtype", ["half"])
zhuohan123 (Collaborator): Let's keep this as float, since otherwise the test will fail on A100s.

zhuohan123 (Collaborator): We can implement a simpler test for L4 in another PR.
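
(One hypothetical way to reconcile the two environments, not what this PR does: pick the test dtype from the GPU that is actually present. The 30 GB threshold and helper name are made up for illustration.)

    import pytest
    import torch

    def _ci_dtype() -> str:
        # Hypothetical helper: fall back to half precision on small-memory GPUs
        # such as the CI's 24 GB L4s, and keep full float32 on larger cards.
        if not torch.cuda.is_available():
            return "float"
        total_mem_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
        return "half" if total_mem_gb < 30 else "float"

    @pytest.mark.parametrize("dtype", [_ci_dtype()])
    def test_dtype_choice(dtype):
        assert dtype in ("half", "float")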

zhuohan123 (Collaborator):
@simon-mo Let’s expedite this PR so that we can have CI working for all other PRs?

simon-mo merged commit 6e01e8c into vllm-project:main on Jan 14, 2024
14 checks passed
hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request Jan 18, 2024
hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request Feb 13, 2024