
[CI] Add Buildkite #2355

Merged · 74 commits merged into vllm-project:main · Jan 14, 2024

Conversation

simon-mo (Collaborator) commented Jan 5, 2024

This PR adds the basic setup for a GPU CI environment. It should enable us to run our tests on L4 GPUs. As a developer, you can add new tests to .buildkite/test-pipeline.yaml.

Currently, I have all tests enabled, in addition to benchmarks. However, I don't want to block the merge of this PR on debugging model output (test_models.py) and tuning memory (test_attention.py), so I marked those tests as "soft fail" for now.

Please see the latest build here as an example: https://buildkite.com/vllm/ci/builds/182. The end-to-end build time in the worst case (fresh Docker build, slow machine start) is about 1 hour; in the best case (Docker cached, machine available) it is about 15 minutes. Of course, if too many PRs are submitted at the same time, builds might need to wait in the queue a bit. We are capped at 10 GPU machines for budget reasons. The full infrastructure setup is described and maintained in a separate repo: https://github.com/vllm-project/buildkite-ci.

The code changes in this PR are mostly in the .buildkite directory and the associated Dockerfile and setup.py. Everything else was done to make the existing tests pass.

Future work includes:

  • Fix test models
  • Fix test kernels
  • Migrate lint, docs, and wheels to CPU-only machines
  • Add tests for chat models (test chat template coverage)
  • Benchmark real models (instead of just OPT) on A100

simon-mo (Collaborator, Author) commented:
Note to self: the last remaining blocker is the memory requirement for the HuggingFace models. We can run the models on 2xL4 in vLLM, but HuggingFace doesn't have a good TP strategy that's simple to use. Currently I'm trying accelerate to do offload; if that doesn't work we might go to A100s.
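
(For illustration only, not this PR's test code: a minimal sketch of the accelerate-based offload being tried, assuming a placeholder model name and prompt.)

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "facebook/opt-6.7b"  # placeholder; the actual test models may differ

    # device_map="auto" lets accelerate shard layers across the available GPUs
    # and offload any remainder to CPU RAM instead of OOM-ing on a 24 GB L4.
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.float16,
        device_map="auto",
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=16)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))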

simon-mo (Collaborator, Author) commented Jan 12, 2024

TODOs:

  • Migrate lint, docs, and wheels to CPU-only machines
  • Add tests for chat models (test chat template coverage)

simon-mo (Collaborator, Author) commented:
I soft-failed the kernels and models tests. The models ran successfully with bfloat16, but some of the output doesn't match :(. The kernel test is too difficult to tune.

zhuohan123 (Collaborator) left a review:
Thanks for the hard work Simon! Left some small comments.

Comment on lines 65 to 66:

    if max_tries == 0:
        raise RuntimeError("Server did not start") from err

zhuohan123 (Collaborator) suggested change:

    if max_tries == 0:
        raise RuntimeError("Server did not start") from err
    max_tries -= 1
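
(For context, a hedged reconstruction rather than the PR's actual helper: the kind of server-startup polling loop this suggestion targets, with the decrement applied so the retry budget is actually consumed. The health-check URL and timings are illustrative.)

    import time
    import requests

    def wait_for_server(url: str = "http://localhost:8000/health",
                        max_tries: int = 60) -> None:
        """Poll the API server until it responds, or give up after max_tries."""
        err = None
        while max_tries > 0:
            try:
                if requests.get(url, timeout=1).status_code == 200:
                    return
            except requests.exceptions.RequestException as e:
                err = e
            max_tries -= 1
            time.sleep(1)
        raise RuntimeError("Server did not start") from err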

    import pytest
    import torch
    import ray

zhuohan123 (Collaborator): Just curious, why change to ray?

simon-mo (Collaborator, Author): In the code below, I added a comment saying that ray gives much better logs for debugging; I couldn't figure out the failures from multiprocessing.

zhuohan123 (Collaborator): Gotcha! Makes a lot of sense.
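
(To illustrate the point above, a small sketch, not the test code itself, of why a ray task is easier to debug than a multiprocessing worker: the remote exception and its traceback come back through ray.get.)

    import ray

    ray.init(ignore_reinit_error=True)

    @ray.remote
    def run_hf_model(prompt: str) -> str:
        # In the real tests this would load the HuggingFace model and generate;
        # here we raise deliberately to show how the error surfaces.
        raise RuntimeError("CUDA out of memory")  # illustrative failure

    try:
        ray.get(run_hf_model.remote("Hello"))
    except ray.exceptions.RayTaskError as err:
        # The original exception plus its remote traceback are preserved,
        # which is what makes CI failures much easier to diagnose.
        print(err)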

@@ -21,7 +21,8 @@

 @pytest.mark.parametrize("model", MODELS)
-@pytest.mark.parametrize("dtype", ["float"])
+# half is required to get this working on CI's L4 GPU
+@pytest.mark.parametrize("dtype", ["half"])
zhuohan123 (Collaborator): Let's keep this as float, since otherwise the test will fail on A100s.

zhuohan123 (Collaborator): We can implement a simpler test for L4 in another PR.
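
(One hypothetical way to reconcile the two environments, not what this PR does: pick the test dtype from the GPU that is actually present. The 30 GB threshold and helper name are made up for illustration.)

    import pytest
    import torch

    def _ci_dtype() -> str:
        # Hypothetical helper: fall back to half precision on small-memory GPUs
        # such as the CI's 24 GB L4s, and keep full float32 on larger cards.
        if not torch.cuda.is_available():
            return "float"
        total_mem_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
        return "half" if total_mem_gb < 30 else "float"

    @pytest.mark.parametrize("dtype", [_ci_dtype()])
    def test_dtype_choice(dtype):
        assert dtype in ("half", "float")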

zhuohan123 (Collaborator):
@simon-mo Let’s expedite this PR so that we can have CI working for all other PRs?

simon-mo merged commit 6e01e8c into vllm-project:main on Jan 14, 2024
14 checks passed
hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request Jan 18, 2024
hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request Feb 13, 2024