[SkyServe] Doc with vllm example #2922

Merged: 21 commits from serve-doc-vllm-example into master on Jan 6, 2024

Conversation

@cblmemo (Collaborator) commented on Dec 29, 2023

This PR adds a vLLM example to the SkyServe doc.

TODO: Make sure the example works.
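
For context, here is a rough sketch of the kind of service YAML this example describes. The service section fields, readiness path, and setup/run commands below are illustrative assumptions, not the final doc content; only the port, accelerator, and model name come from this PR's discussion.

# Sketch only -- field choices below are assumptions for illustration.
service:
  readiness_probe: /v1/models   # assumed readiness path for vLLM's OpenAI-compatible server
  replicas: 2

resources:
  ports: 8080
  accelerators: A100:1

setup: |
  pip install vllm

run: |
  python -m vllm.entrypoints.openai.api_server \
    --model lmsys/vicuna-13b-v1.5 --host 0.0.0.0 --port 8080

The curl check in the test list below would then hit the endpoint returned by sky serve up, e.g. the /v1/models or /v1/chat/completions routes that vLLM's OpenAI-compatible server exposes.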

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
    • sky serve up the example and curl to make sure it works
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: bash tests/backward_compatibility_tests.sh

@concretevitamin (Collaborator) left a comment

Thanks @cblmemo!

I tried out the vLLM example. It launched 2 replicas on Azure, but they hit the following errors during setup:

(task, pid=7557) INFO 01-01 18:01:44 llm_engine.py:73] Initializing an LLM engine with config: model='mistralai/Mixtral-8x7B-Instruct-v0.1', tokenizer='mistralai/Mixtral-8x7B-Instruct-v0.1', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=2, quantization=None, enforce_eager=False, seed=0)
tokenizer_config.json: 100%|██████████| 1.46k/1.46k [00:00<00:00, 650kB/s]
tokenizer.model: 100%|██████████| 493k/493k [00:00<00:00, 46.5MB/s]
tokenizer.json: 100%|██████████| 1.80M/1.80M [00:00<00:00, 29.2MB/s]?, ?B/s]
special_tokens_map.json: 100%|██████████| 72.0/72.0 [00:00<00:00, 136kB/s]

...
(task, pid=7557)     torch.distributed.all_reduce(torch.zeros(1).cuda())
(task, pid=7557)   File "/home/azureuser/miniconda3/envs/vllm/lib/python3.9/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
(task, pid=7557)     return func(*args, **kwargs)
(task, pid=7557)   File "/home/azureuser/miniconda3/envs/vllm/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 2050, in all_reduce
(task, pid=7557)     work = group.allreduce([tensor], opts)
(task, pid=7557) torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1333, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.18.1
(task, pid=7557) ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
(task, pid=7557) Last error:
(task, pid=7557) XML Import Channel : dev 2 not found.

Relatedly, since the 2 replicas failed, 2 new ones are getting started. IIUC this is due to auto_restart. However, does this restart behavior loop forever? (In managed spot, we do not restart user job failures by default.)

Comment on lines 63 to 64
ports: 8080
accelerators: A100:1

Collaborator:

  • Is it possible to keep ports / accelerators the same across the two tabs? Fewer changes to read.
  • accelerators: is it possible to use {A100, A10G, T4, L4} etc. with appropriate counts? Higher chance of getting provisioned, and it showcases an important capability (see the sketch below).
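
A minimal sketch of the second suggestion (the exact accelerator set and counts here are illustrative; the curly-brace set syntax is what the PR adopts later in this thread):

resources:
  ports: 8080
  # Any one of these can satisfy the request; the optimizer picks the cheapest available option.
  accelerators: {L4:8, A10G:8, A100:4, A100:8, A100-80GB:2}

Listing several accelerator options this way raises the chance that replicas get provisioned and showcases the multi-accelerator failover capability.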

Collaborator Author:

Changed to a single A100 for the 13b Vicuna model.

Collaborator:

Thanks. Wdyt about the second point?

Or, if we were to keep 1 gpu choice, can we change A100 to something else? When trying out this example, provisioning failover takes a long time on GCP.

cblmemo and others added 7 commits January 4, 2024 14:12

@Michaelvll (Collaborator) left a comment

This is fantastic @cblmemo! Just wondering if we can add the example in the llm/vllm/README.md and also add the yaml in that folder?

@cblmemo (Collaborator, Author) commented on Jan 5, 2024

> This is fantastic @cblmemo! Just wondering if we can add the example in the llm/vllm/README.md and also add the yaml in that folder?

Sure! Will do it after the doc is ready 🫡

@@ -62,7 +62,7 @@ Here is a simple example of serving an LLM model (:code:`lmsys/vicuna-13b-v1.5`)
 # Fields below describe each replica.
 resources:
   ports: 8080
-  accelerators: A100:1
+  accelerators: {L4:8, A10g:8, A10:8, A100:4, A100:8, A100-80GB:2, A100-80GB:4, A100-80GB:8}

Collaborator:

This picks Azure's A100-80GB:2 for me and results in the same error


(task, pid=7600)   File "/home/azureuser/miniconda3/envs/vllm/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 2050, in all_reduce
(task, pid=7600)     work = group.allreduce([tensor], opts)
(task, pid=7600) torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1333, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.18.1
(task, pid=7600) ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
(task, pid=7600) Last error:
(task, pid=7600) XML Import Channel : dev 2 not found.

Did you not encounter this error?

Is there a good way to ban Azure? For reference, the optimizer output (a possible workaround is sketched after the table):


I 01-05 14:13:07 optimizer.py:910] ----------------------------------------------------------------------------------------------------------
I 01-05 14:13:07 optimizer.py:910]  CLOUD   INSTANCE                    vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE     COST ($)   CHOSEN
I 01-05 14:13:07 optimizer.py:910] ----------------------------------------------------------------------------------------------------------
I 01-05 14:13:07 optimizer.py:910]  Azure   Standard_NC48ads_A100_v4    48      440       A100-80GB:2    eastus          7.35          ✔
I 01-05 14:13:07 optimizer.py:910]  GCP     g2-standard-96              96      384       L4:8           us-east4-a      7.98
I 01-05 14:13:07 optimizer.py:910]  GCP     a2-ultragpu-2g              24      340       A100-80GB:2    us-central1-a   10.06
I 01-05 14:13:07 optimizer.py:910]  Azure   Standard_NC96ads_A100_v4    96      880       A100-80GB:4    eastus          14.69
I 01-05 14:13:07 optimizer.py:910]  GCP     a2-highgpu-4g               48      340       A100:4         us-central1-a   14.69
I 01-05 14:13:07 optimizer.py:910]  AWS     g5.48xlarge                 192     768       A10G:8         us-east-1       16.29
I 01-05 14:13:07 optimizer.py:910]  GCP     a2-ultragpu-4g              48      680       A100-80GB:4    us-central1-a   20.11
I 01-05 14:13:07 optimizer.py:910]  Azure   Standard_ND96asr_v4         96      900       A100:8         eastus          27.20
I 01-05 14:13:07 optimizer.py:910]  GCP     a2-highgpu-8g               96      680       A100:8         us-central1-a   29.39
I 01-05 14:13:07 optimizer.py:910]  AWS     p4d.24xlarge                96      1152      A100:8         us-east-1       32.77
I 01-05 14:13:07 optimizer.py:910]  Azure   Standard_ND96amsr_A100_v4   96      1924      A100-80GB:8    eastus          32.77
I 01-05 14:13:07 optimizer.py:910]  GCP     a2-ultragpu-8g              96      1360      A100-80GB:8    us-central1-a   40.22
I 01-05 14:13:07 optimizer.py:910]  AWS     p4de.24xlarge               96      1152      A100-80GB:8    us-east-1       40.97
I 01-05 14:13:07 optimizer.py:910] ----------------------------------------------------------------------------------------------------------
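
Not something the thread settles directly, but one workaround sketch is to pin the replica resources to a single cloud rather than excluding Azure specifically (the values below are illustrative):

resources:
  cloud: gcp   # constrain the optimizer to one cloud as a stopgap
  ports: 8080
  accelerators: {L4:8, A100:4, A100:8, A100-80GB:2}

This gives up cross-cloud failover but avoids the Azure NCCL error seen above until it is resolved.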

Collaborator Author:

I didn't encounter this error; my controller chose two L4:8 replicas.

@concretevitamin (Collaborator) left a comment

LGTM, modulo the known issue on Azure #2905, which seems ok for now.

Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
@cblmemo merged commit 4ac3479 into master on Jan 6, 2024
19 checks passed
@cblmemo deleted the serve-doc-vllm-example branch on January 6, 2024

@cblmemo (Collaborator, Author) commented on Jan 6, 2024

> This is fantastic @cblmemo! Just wondering if we can add the example in the llm/vllm/README.md and also add the yaml in that folder?

Added in #2948
