
Conversation


@meher-m meher-m commented Sep 23, 2025

Pull Request Summary

What is this PR changing? Why is this change being made? Any caveats you'd like to highlight? Link any relevant documents, links, or screenshots here if applicable.

Test Plan and Usage Guide

How did you validate that your PR works correctly? How do you run or demo the code? Provide enough detail so a reviewer can reasonably reproduce the testing procedure. Paste example command line invocations if applicable.

export TARGET_TAG=0.10.2-test-rc1
export IMAGE=692474966980.dkr.ecr.us-west-2.amazonaws.com/vllm:${TARGET_TAG}
export MODEL=meta-llama/Meta-Llama-3.1-8B-Instruct && export MODEL_PATH=/data/model_files/$MODEL
docker kill vllm; docker rm vllm;
docker run \
    --runtime nvidia \
    --shm-size=16gb \
    --gpus '"device=1,2,3,4"' \
    -v $MODEL_PATH:/workspace/model_files:ro -v /data/dmchoi:/data:ro \
    -p 5005:5005 \
    --name vllm \
    ${IMAGE} \
    python -m vllm_server --model model_files --served-model-name $MODEL model_files \
        --tensor-parallel-size 4 --port 5005 --disable-log-requests \
        --uvicorn-log-level info --gpu-memory-utilization 0.8 --enforce-eager

and then run tests:

curl -X POST localhost:5005/v1/chat/completions -H "Content-Type: application/json" \
          -d "{\"model\":\"$MODEL\", \"messages\":[{\"role\": \"user\", \"content\": \"Hey, what's the temperature in Paris right now?\"}],\"max_tokens\":100,\"temperature\":0.2,\"guided_regex\":\"Sean.*\"}"

and

curl -X POST localhost:5005/v1/responses -H "Content-Type: application/json" \
          -d "{\"model\":\"$MODEL\", \"input\":[{\"role\": \"user\", \"content\": \"Hey, what's the temperature in Paris right now?\"}],\"max_tokens\":100,\"temperature\":0.2,\"guided_regex\":\"Sean.*\"}"

@meher-m meher-m self-assigned this Sep 23, 2025

def parse_args(parser: FlexibleArgumentParser):
    parser = make_arg_parser(parser)
    parser.add_argument("--attention-backend", type=str, help="The attention backend to use")
@dmchoiboi dmchoiboi Sep 23, 2025

you can remove run_server_worker and run_server, and just use run_server from vllm
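A minimal sketch of what that could look like, assuming vLLM ~0.10 module paths (run_server in vllm.entrypoints.openai.api_server, make_arg_parser in vllm.entrypoints.openai.cli_args); the --attention-backend flag is this repo's addition, and the rest is illustrative rather than the actual diff:

import uvloop
from vllm.entrypoints.openai.api_server import run_server
from vllm.entrypoints.openai.cli_args import make_arg_parser
from vllm.utils import FlexibleArgumentParser


def parse_args(parser: FlexibleArgumentParser):
    # keep the custom flag, but reuse vLLM's own argument wiring
    parser = make_arg_parser(parser)
    parser.add_argument("--attention-backend", type=str,
                        help="The attention backend to use")
    return parser.parse_args()


if __name__ == "__main__":
    args = parse_args(FlexibleArgumentParser(description="vLLM OpenAI-compatible server"))
    # run_server is vLLM's async entrypoint, so drive it with uvloop
    uvloop.run(run_server(args))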

FORWARDER_STORAGE_LIMIT=FORWARDER_STORAGE_USAGE,
USER_CONTAINER_PORT=USER_CONTAINER_PORT,
FORWARDER_EXTRA_ROUTES=flavor.extra_routes,
FORWARDER_SYNC_ROUTES=[flavor.predict_route] + flavor.routes,

I think we need to add flavor.extra_routes here for backwards compatibility since the data models are saved to the database
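A minimal sketch of the suggested change (assuming flavor.extra_routes is a plain list like flavor.routes; older bundles already persisted in the database may only have extra_routes populated):

FORWARDER_SYNC_ROUTES=[flavor.predict_route] + flavor.routes + flavor.extra_routes,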


missed a spot for sync routes

@dmchoiboi dmchoiboi left a comment

Could you make this change for CPU sync and stream endpoints as well?

@dmchoiboi dmchoiboi left a comment

we're missing changes to the domain + database models

https://github.com/scaleapi/llm-engine/blob/55ff1dac87912b68a86c8f8e560a33847a7dcc99/model-engine/model_engine_server/domain/entities/model_bundle_entity.py#L154
https://github.com/scaleapi/llm-engine/blob/55ff1dac87912b68a86c8f8e560a33847a7dcc99/model-engine/model_engine_server/db/models/hosted_model_inference.py#L149

The database model change will require a db migration script to be created. I realize the README doesn't have instructions, but you should be able to follow https://alembic.sqlalchemy.org/en/latest/tutorial.html#running-our-second-migration. Could actually add that to db/migrations/README as well.
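For reference, the usual Alembic flow looks roughly like this once the SQLAlchemy model has been updated (the revision message below is illustrative, and the commands are run from the directory containing alembic.ini):

# autogenerate a revision from the updated db models
alembic revision --autogenerate -m "add routes to model bundle flavors"
# review the generated script under the migrations versions/ directory, then apply it
alembic upgrade head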

COMMAND=flavor.streaming_command,
PREDICT_ROUTE=flavor.predict_route,
STREAMING_PREDICT_ROUTE=flavor.streaming_predict_route,
# PREDICT_ROUTE=flavor.predict_route,

delete
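i.e. the block above with the stray commented-out line removed:

COMMAND=flavor.streaming_command,
PREDICT_ROUTE=flavor.predict_route,
STREAMING_PREDICT_ROUTE=flavor.streaming_predict_route,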

@meher-m meher-m changed the title Update vllm upgrade process [MLI-4665] Update vllm upgrade process Sep 25, 2025
@meher-m meher-m requested a review from dmchoiboi September 29, 2025 20:48
@meher-m meher-m merged commit da85235 into main Sep 30, 2025
7 checks passed
@meher-m meher-m deleted the meher-m/vllm-upgrade branch September 30, 2025 16:11