
Conversation

@meher-m (Contributor) commented on Sep 23, 2025

Pull Request Summary

What is this PR changing? Why is this change being made? Any caveats you'd like to highlight? Link any relevant documents, links, or screenshots here if applicable.

Update the HTTP forwarder for model engine to add a new `routes` field. For now it is used the same way as `extra_routes`, and we no longer hard-code adding `/predict` and `/stream`. The end state is to remove `/predict` and `/stream` and use the forwarder purely as a passthrough for any endpoint specified in `routes` (getting rid of `extra_routes`). This is the first step toward that while maintaining backwards compatibility, so we don't force stakeholders to migrate for now.
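To make the compatibility story concrete, here is a hypothetical config shape during the transition (a sketch only, not the actual schema; keys other than `extra_routes`, `routes`, and `healthcheck_route` are illustrative):

```python
# Sketch of the forwarder config during the compatibility window, written as
# the dict the forwarder reads. Both fields are accepted and, for now,
# treated identically; the route values are illustrative only.
config = {
    "sync": {
        "extra_routes": ["/v1/chat/completions", "/v1/completions"],  # legacy field
        "routes": ["/v1/chat/completions", "/v1/completions"],        # new field
        "healthcheck_route": "/health",
    },
    "stream": {
        "extra_routes": ["/v1/chat/completions", "/v1/completions"],
        "routes": ["/v1/chat/completions", "/v1/completions"],
        "healthcheck_route": "/health",
    },
}
```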

Test Plan and Usage Guide

How did you validate that your PR works correctly? How do you run or demo the code? Provide enough detail so a reviewer can reasonably reproduce the testing procedure. Paste example command line invocations if applicable.

Start the server:

```bash
(base) ➜  vllm git:(meher-m/vllm-upgrade-http-forwarder) ✗ export TARGET_TAG=0.10.2-test-rc1
(base) ➜  vllm git:(meher-m/vllm-upgrade-http-forwarder) ✗ export IMAGE=692474966980.dkr.ecr.us-west-2.amazonaws.com/vllm:${TARGET_TAG}

(base) ➜  vllm git:(meher-m/vllm-upgrade-http-forwarder) ✗ export MODEL=meta-llama/Meta-Llama-3.1-8B-Instruct && export MODEL_PATH=/data/model_files/$MODEL
(base) ➜  vllm git:(meher-m/vllm-upgrade-http-forwarder) ✗ export REPO_PATH=/mnt/efs/mehermankikar
(base) ➜  vllm git:(meher-m/vllm-upgrade-http-forwarder) ✗ docker kill vllm; docker rm vllm;
(base) ➜  vllm git:(meher-m/vllm-upgrade-http-forwarder) ✗ docker run \
    --runtime nvidia \
    --shm-size=16gb \
    --gpus '"device=0,1,2,3"' \
    -v $MODEL_PATH:/workspace/model_files:ro -v /data/dmchoi:/data:ro \
    -p 5005:5005 \
    --name vllm \
    ${IMAGE} \
    python -m vllm_server --model model_files --served-model-name $MODEL model_files --tensor-parallel-size 4 --port 5005 --disable-log-requests --uvicorn-log-level info --gpu-memory-utilization 0.8 --enforce-eager
```

Test 1

Start the forwarder using `extra_routes`:

```bash
GIT_TAG=test python model_engine_server/inference/forwarding/http_forwarder.py \
    --config model_engine_server/inference/configs/service--http_forwarder.yaml \
    --num-workers 1 \
    --set "forwarder.sync.extra_routes=['/v1/chat/completions','/v1/completions']" \
    --set "forwarder.stream.extra_routes=['/v1/chat/completions','/v1/completions']" \
    --set "forwarder.sync.healthcheck_route=/health" \
    --set "forwarder.stream.healthcheck_route=/health"
```

Test a curl command:

```bash
curl -X POST localhost:5000/v1/chat/completions -H "Content-Type: application/json" -d "{\"args\": {\"model\":\"/data/model_files/$MODEL\", \"messages\":[{\"role\": \"systemr\", \"content\": \"Hey, what's the temperature in Paris right now?\"}],\"max_tokens\":100,\"temperature\":0.2,\"guided_regex\":\"Sean.*\"}}"
```

Works.

Test 2

Start the forwarder using `routes`:

```bash
GIT_TAG=test python model_engine_server/inference/forwarding/http_forwarder.py \
    --config model_engine_server/inference/configs/service--http_forwarder.yaml \
    --num-workers 1 \
    --set "forwarder.sync.routes=['/v1/chat/completions','/v1/completions']" \
    --set "forwarder.stream.routes=['/v1/chat/completions','/v1/completions']" \
    --set "forwarder.sync.healthcheck_route=/health" \
    --set "forwarder.stream.healthcheck_route=/health"
```

Tested the same curl command; it still works.

Test 3

Start the forwarder with no routes configured:

```bash
GIT_TAG=test python model_engine_server/inference/forwarding/http_forwarder.py \
    --config model_engine_server/inference/configs/service--http_forwarder.yaml \
    --num-workers 1 \
    --set "forwarder.sync.healthcheck_route=/health" \
    --set "forwarder.stream.healthcheck_route=/health"
```

Confirmed that the same curl request fails as expected:

```bash
(base) ➜  llm-engine git:(meher-m/vllm-upgrade-http-forwarder) ✗ curl -X POST localhost:5000/v1/chat/completions -H "Content-Type: application/json" -d "{\"args\": {\"model\":\"$MODEL\", \"messages\":[{\"role\": \"systemr\", \"content\": \"Hey, what's the temperature in Paris right now?\"}],\"max_tokens\":100,\"temperature\":0.2,\"guided_regex\":\"Sean.*\"}}"
{"detail":"Not Found"}
```

@meher-m self-assigned this on Sep 23, 2025.

Review comment on:

```python
protocol: Literal["http"] # TODO: add support for other protocols (e.g. grpc)
readiness_initial_delay_seconds: int = 120
extra_routes: List[str] = Field(default_factory=list)
routes: Optional[List[str]] = None
```
Collaborator commented:

You can do the same as above with `Field(default_factory=list)` to make it non-optional in code.
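A minimal sketch of that suggestion (the model name is hypothetical; the field names come from the diff above):

```python
from typing import List

from pydantic import BaseModel, Field

class ForwarderConfig(BaseModel):  # hypothetical name, for illustration only
    extra_routes: List[str] = Field(default_factory=list)
    # Non-optional with an empty-list default, mirroring extra_routes:
    routes: List[str] = Field(default_factory=list)
```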

Review comment on:

```yaml
forward_http_status: true
extra_routes: []
extra_routes: [] # Legacy field - still supported for backwards compatibility
# routes: [] # New field - can be used alongside or instead of extra_routes
```
Collaborator commented:

Don't need the comments here.

Review comment on:

```python
stream_forwarders[route] = load_streaming_forwarder(route)

# Add hardcoded routes to forwarders so they get handled consistently
sync_forwarders["/predict"] = load_forwarder(None)
```
Collaborator commented:

Gate this on whether `sync.predict_route` is provided; same with stream.
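A sketch of the gating being suggested; `config`, `load_forwarder`, and `load_streaming_forwarder` come from the surrounding code, while the `stream_route` key name is an assumption (only `predict_route` appears in the diffs):

```python
# Register the hardcoded routes only when the corresponding route is configured.
if config.get("sync", {}).get("predict_route") is not None:
    sync_forwarders["/predict"] = load_forwarder(None)
if config.get("stream", {}).get("stream_route") is not None:  # key name assumed
    stream_forwarders["/stream"] = load_streaming_forwarder(None)
```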

Review comment on:

```python
sync_forwarders: Dict[str, Forwarder] = dict()
stream_forwarders: Dict[str, StreamingForwarder] = dict()

# Handle legacy extra_routes configuration (backwards compatibility)
```
Collaborator commented:

Might be a good idea to deduplicate routes (e.g. place them into a set) before initializing the forwarders
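For example (a sketch using the config keys and helper names from the diffs in this PR):

```python
# Aggregate and deduplicate routes from both fields before creating forwarders.
sync_routes_to_add = set()
sync_routes_to_add.update(config.get("sync", {}).get("extra_routes", []))  # legacy
sync_routes_to_add.update(config.get("sync", {}).get("routes", []) or [])  # new
for route in sorted(sync_routes_to_add):
    sync_forwarders[route] = load_forwarder(route)
```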

Review comment on:

```python
passthrough_forwarders: Dict[str, PassthroughForwarder] = dict()

# Handle legacy extra_routes configuration (backwards compatibility)
for route in config.get("sync", {}).get("extra_routes", []):
```
Collaborator commented:

Same here; let's deduplicate and aggregate routes before creating the forwarders.

Review comment on:

```python
sync_routes_to_add.update(config.get("sync", {}).get("extra_routes", []))
sync_routes_to_add.update(config.get("sync", {}).get("routes", []))

if config.get("sync", {}).get("predict_route", None) is None:
```
Contributor (author) commented:

Maybe we don't need the if statement anymore?

Review comment on:

```python
protocol: Literal["http"] # TODO: add support for other protocols (e.g. grpc)
readiness_initial_delay_seconds: int = 120
extra_routes: List[str] = Field(default_factory=list)
routes: Optional[List[str]] = Field(default_factory=list)
```
Collaborator commented:

remove 'Optional'

Review comment on:

```python
sync_passthrough_routes_to_add = set()
sync_passthrough_routes_to_add.update(config.get("sync", {}).get("extra_routes", []))
sync_passthrough_routes_to_add.update(config.get("sync", {}).get("routes", []))
if config.get("sync", {}).get("predict_route", None) != "/predict":
```
Collaborator commented:

Oh, passthrough is a different case. We don't need `predict_route` for this; same with stream.
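That is, a sketch of the simplification (names from the diff above), with the `predict_route` check dropped entirely:

```python
# Passthrough routes come only from extra_routes/routes; no /predict special case.
sync_passthrough_routes_to_add = set()
sync_passthrough_routes_to_add.update(config.get("sync", {}).get("extra_routes", []))
sync_passthrough_routes_to_add.update(config.get("sync", {}).get("routes", []) or [])
```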

@meher-m requested a review from @dmchoiboi on September 24, 2025.

Review comment on:

```python
sync_routes_to_add.update(config.get("sync", {}).get("extra_routes", []))
sync_routes_to_add.update(config.get("sync", {}).get("routes", []))

if config.get("sync", {}).get("predict_route", None) == "/predict":
```
@dmchoiboi (Collaborator) commented on Sep 24, 2025:

I think we want to add `config.get("sync", {}).get("predict_route", None)` to `sync_routes_to_add`, and not necessarily only if it's equal to `/predict`.
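A sketch of that change (names from the diff above):

```python
# Register whatever predict_route is configured, not only the literal "/predict".
predict_route = config.get("sync", {}).get("predict_route", None)
if predict_route is not None:
    sync_routes_to_add.add(predict_route)
```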

Review comment on:

```python
app.add_api_route(path="/predict", endpoint=predict, methods=["POST"])
app.add_api_route(path="/stream", endpoint=stream, methods=["POST"])
# app.add_api_route(path="/predict", endpoint=predict, methods=["POST"])
# app.add_api_route(path="/stream", endpoint=stream, methods=["POST"])
```
Collaborator commented:

Delete the commented-out lines.

@meher-m changed the title from "Update http forwarder for model engine" to "[MLI-4665] Update http forwarder for model engine" on Sep 24, 2025.
@meher-m enabled auto-merge (squash) on September 24, 2025.
@meher-m merged commit 546eeff into main on Sep 24, 2025.
7 checks passed.
@meher-m deleted the meher-m/vllm-upgrade-http-forwarder branch on September 24, 2025.