[SkyServe] Doc with vllm example #2922

Merged: 21 commits from serve-doc-vllm-example into master on Jan 6, 2024

Conversation

@cblmemo (Collaborator) commented on Dec 29, 2023

This PR adds a vLLM example to the SkyServe doc.

TODO: Make sure the example works.
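
For context, here is a rough sketch of the kind of service YAML this example describes. The service section fields, readiness path, and setup/run commands below are illustrative assumptions, not the final doc content; only the port, accelerator, and model name come from this PR's discussion.

# Sketch only -- field choices below are assumptions for illustration.
service:
  readiness_probe: /v1/models   # assumed readiness path for vLLM's OpenAI-compatible server
  replicas: 2

resources:
  ports: 8080
  accelerators: A100:1

setup: |
  pip install vllm

run: |
  python -m vllm.entrypoints.openai.api_server \
    --model lmsys/vicuna-13b-v1.5 --host 0.0.0.0 --port 8080

The curl check in the test list below would then hit the endpoint returned by sky serve up, e.g. the /v1/models or /v1/chat/completions routes that vLLM's OpenAI-compatible server exposes.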

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
    • sky serve up the example and curl to make sure it works
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: bash tests/backward_compatibility_tests.sh

@concretevitamin (Collaborator) left a comment

Thanks @cblmemo!

I tried out the vLLM example. It launched 2 replicas on Azure, but they hit the following errors during setup:

(task, pid=7557) INFO 01-01 18:01:44 llm_engine.py:73] Initializing an LLM engine with config: model='mistralai/Mixtral-8x7B-Instruct-v0.1', tokenizer='mistralai/Mixtral-8x7B-Instruct-v0.1', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=2, quantization=None, enforce_eager=False, seed=0)
tokenizer_config.json: 100%|██████████| 1.46k/1.46k [00:00<00:00, 650kB/s]
tokenizer.model: 100%|██████████| 493k/493k [00:00<00:00, 46.5MB/s]
tokenizer.json: 100%|██████████| 1.80M/1.80M [00:00<00:00, 29.2MB/s]?, ?B/s]
special_tokens_map.json: 100%|██████████| 72.0/72.0 [00:00<00:00, 136kB/s]

...
(task, pid=7557)     torch.distributed.all_reduce(torch.zeros(1).cuda())
(task, pid=7557)   File "/home/azureuser/miniconda3/envs/vllm/lib/python3.9/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
(task, pid=7557)     return func(*args, **kwargs)
(task, pid=7557)   File "/home/azureuser/miniconda3/envs/vllm/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 2050, in all_reduce
(task, pid=7557)     work = group.allreduce([tensor], opts)
(task, pid=7557) torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1333, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.18.1
(task, pid=7557) ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
(task, pid=7557) Last error:
(task, pid=7557) XML Import Channel : dev 2 not found.

Relatedly, since the 2 replicas failed, 2 new ones are getting started. IIUC this is due to auto_restart. However, does this restart behavior loop forever? (In managed spot, we do not restart user job failures by default.)

Comment on lines 63 to 64
ports: 8080
accelerators: A100:1

Collaborator:

  • Is it possible to keep ports / accelerators the same across the two tabs? Fewer changes to read.
  • accelerators: is it possible to use {A100, A10G, T4, L4} etc. with appropriate counts? Higher chance of getting provisioned, and it showcases an important capability (see the sketch below).
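
A minimal sketch of the second suggestion (the exact accelerator set and counts here are illustrative; the curly-brace set syntax is what the PR adopts later in this thread):

resources:
  ports: 8080
  # Any one of these can satisfy the request; the optimizer picks the cheapest available option.
  accelerators: {L4:8, A10G:8, A100:4, A100:8, A100-80GB:2}

Listing several accelerator options this way raises the chance that replicas get provisioned and showcases the multi-accelerator failover capability.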

Collaborator Author:

Changed to a single A100 for the 13b Vicuna model.

Collaborator:

Thanks. Wdyt about the second point?

Or, if we were to keep 1 gpu choice, can we change A100 to something else? When trying out this example, provisioning failover takes a long time on GCP.

cblmemo and others added 7 commits January 4, 2024 14:12

@Michaelvll (Collaborator) left a comment

This is fantastic @cblmemo! Just wondering if we can add the example in the llm/vllm/README.md and also add the yaml in that folder?

@cblmemo (Collaborator, Author) commented on Jan 5, 2024

> This is fantastic @cblmemo! Just wondering if we can add the example in the llm/vllm/README.md and also add the yaml in that folder?

Sure! Will do it after the doc is ready 🫡

@@ -62,7 +62,7 @@ Here is a simple example of serving an LLM model (:code:`lmsys/vicuna-13b-v1.5`)
 # Fields below describe each replica.
 resources:
   ports: 8080
-  accelerators: A100:1
+  accelerators: {L4:8, A10g:8, A10:8, A100:4, A100:8, A100-80GB:2, A100-80GB:4, A100-80GB:8}

Collaborator:

This picks Azure's A100-80GB:2 for me and results in the same error


(task, pid=7600)   File "/home/azureuser/miniconda3/envs/vllm/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 2050, in all_reduce
(task, pid=7600)     work = group.allreduce([tensor], opts)
(task, pid=7600) torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1333, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.18.1
(task, pid=7600) ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
(task, pid=7600) Last error:
(task, pid=7600) XML Import Channel : dev 2 not found.

Did you not encounter this error?

Is there a good way to ban Azure? For reference, the optimizer output (a possible workaround is sketched after the table):


I 01-05 14:13:07 optimizer.py:910] ----------------------------------------------------------------------------------------------------------
I 01-05 14:13:07 optimizer.py:910]  CLOUD   INSTANCE                    vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE     COST ($)   CHOSEN
I 01-05 14:13:07 optimizer.py:910] ----------------------------------------------------------------------------------------------------------
I 01-05 14:13:07 optimizer.py:910]  Azure   Standard_NC48ads_A100_v4    48      440       A100-80GB:2    eastus          7.35          ✔
I 01-05 14:13:07 optimizer.py:910]  GCP     g2-standard-96              96      384       L4:8           us-east4-a      7.98
I 01-05 14:13:07 optimizer.py:910]  GCP     a2-ultragpu-2g              24      340       A100-80GB:2    us-central1-a   10.06
I 01-05 14:13:07 optimizer.py:910]  Azure   Standard_NC96ads_A100_v4    96      880       A100-80GB:4    eastus          14.69
I 01-05 14:13:07 optimizer.py:910]  GCP     a2-highgpu-4g               48      340       A100:4         us-central1-a   14.69
I 01-05 14:13:07 optimizer.py:910]  AWS     g5.48xlarge                 192     768       A10G:8         us-east-1       16.29
I 01-05 14:13:07 optimizer.py:910]  GCP     a2-ultragpu-4g              48      680       A100-80GB:4    us-central1-a   20.11
I 01-05 14:13:07 optimizer.py:910]  Azure   Standard_ND96asr_v4         96      900       A100:8         eastus          27.20
I 01-05 14:13:07 optimizer.py:910]  GCP     a2-highgpu-8g               96      680       A100:8         us-central1-a   29.39
I 01-05 14:13:07 optimizer.py:910]  AWS     p4d.24xlarge                96      1152      A100:8         us-east-1       32.77
I 01-05 14:13:07 optimizer.py:910]  Azure   Standard_ND96amsr_A100_v4   96      1924      A100-80GB:8    eastus          32.77
I 01-05 14:13:07 optimizer.py:910]  GCP     a2-ultragpu-8g              96      1360      A100-80GB:8    us-central1-a   40.22
I 01-05 14:13:07 optimizer.py:910]  AWS     p4de.24xlarge               96      1152      A100-80GB:8    us-east-1       40.97
I 01-05 14:13:07 optimizer.py:910] ----------------------------------------------------------------------------------------------------------
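
Not something the thread settles directly, but one workaround sketch is to pin the replica resources to a single cloud rather than excluding Azure specifically (the values below are illustrative):

resources:
  cloud: gcp   # constrain the optimizer to one cloud as a stopgap
  ports: 8080
  accelerators: {L4:8, A100:4, A100:8, A100-80GB:2}

This gives up cross-cloud failover but avoids the Azure NCCL error seen above until it is resolved.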

Collaborator Author:

I didn't encounter this error; my controller chose two L4:8 replicas.

@concretevitamin (Collaborator) left a comment

LGTM, modulo the known issue on Azure #2905, which seems ok for now.

Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
@cblmemo merged commit 4ac3479 into master on Jan 6, 2024
19 checks passed
@cblmemo deleted the serve-doc-vllm-example branch on January 6, 2024

@cblmemo (Collaborator, Author) commented on Jan 6, 2024

> This is fantastic @cblmemo! Just wondering if we can add the example in the llm/vllm/README.md and also add the yaml in that folder?

Added in #2948
