Conversation

@qandrew (Contributor) commented Sep 10, 2025

Purpose

  • Add IncompleteDetails to vLLM's implementation of ResponsesResponse.
  • First use case: when the max token limit has been hit, we should add the reason for incompleteness. This is added in the non-streaming version; the streaming version will be a follow-up PR (it needs to be based on [gpt-oss][1][bugfix] fix streaming final output #24466).
  • Fix some formatting issues.

I do realize there are other things we need to fix for IncompleteDetails; for now we guarantee that if incomplete_details is output, it is accurate. We still need to test flows such as the generator being interrupted abruptly, and output the reason why. I also don't think GPT-OSS does content filters for now, but that would be another IncompleteDetails implementation. A minimal sketch of the intended mapping follows below.
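
For reference, a minimal sketch of that mapping (the IncompleteDetails shape mirrors the OpenAI Responses API; the helper name and the "length" finish-reason string are assumptions for illustration, not vLLM's actual code):

from typing import Literal, Optional

from pydantic import BaseModel


class IncompleteDetails(BaseModel):
    # Mirrors the OpenAI Responses API field of the same name.
    reason: Literal["max_output_tokens", "content_filter"]


def derive_incomplete_details(finish_reason: str) -> Optional[IncompleteDetails]:
    # Hypothetical helper: only the max-token case is handled by this PR;
    # content_filter is not yet supported as an abort reason in vLLM.
    if finish_reason == "length":
        # Generation stopped because max_output_tokens was reached.
        return IncompleteDetails(reason="max_output_tokens")
    return None


details = derive_incomplete_details("length")
print("incomplete" if details else "completed", details)
# incomplete reason='max_output_tokens'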

Test Plan

Server

(gpt_oss_edit) [axia@devvm30969.cln0 /data/users/axia/gitrepos/vllm (andrew/incomplete-details)]$ CUDA_VISIBLE_DEVICES=2,3 with-proxy vllm serve "/data/users/axia/checkpoints/gpt-oss-120b" -tp 2 --port 20001

Client

(gpt_oss_edit) [axia@devvm30969.cln0 /data/users/axia/gitrepos/vllm (andrew/incomplete-details)]$ curl http://localhost:20001/v1/responses   -H "Content-Type: application/json"   -N   -d '{
    "model": "/data/users/axia/checkpoints/gpt-oss-120b",
    "input": [
        {
            "role": "user",
            "content": "Write two paragraphs on the weather."
        }
    ],
    "temperature": 0.7,
    "max_output_tokens": 256
}' | jq
# output
{
  "id": "resp_5b6f93e8efa4445497f6b6bc052b6dac",
  "created_at": 1757701130,
  "incomplete_details": {
    "reason": "max_output_tokens" # we get incomplete reasons here
  },
  "instructions": null,
  "metadata": null,
  "model": "/data/users/axia/checkpoints/gpt-oss-120b",
  "object": "response",
  "output": [
    {
      "id": "rs_541bc7e4806f4630b04703e72e14e025",
      "summary": [],
      "type": "reasoning",
      "content": [
        {
          "text": "User asks: \"Write two paragraphs on the weather.\" We need to produce two paragraphs about weather. Should be descriptive, could talk about different aspects. Probably just two paragraphs. Ensure it's coherent. No extra instructions.",
          "type": "reasoning_text"
        }
      ],
      "encrypted_content": null,
      "status": null
    },
    {
      "id": "msg_122778466dba478aa761b0542f2cf81f",
      "content": [
        {
          "annotations": [],
          "text": "The sky stretched a bruised violet this morning, the first hints of sunrise shyly peeking through a thin veil of low‑lying clouds. A gentle breeze whispered through the oak leaves, carrying the faint scent of damp earth and pine sap, while droplets of drizzle clung to the windowpanes like tiny crystal beads. The temperature hovered just above the dew point, making the air feel cool enough to pull a light sweater from the closet, yet not so cold as to bite the skin. As the sun climbed higher, its golden rays began to melt the lingering mist, turning the wet streets into shimmering ribbons that reflected the city’s hurried rhythm.\n\nBy afternoon, the weather shifted dramatically. Dark, billowing cumulonimbus towers gathered on the horizon, their edges tinged with electric blue, signalling an approaching thunderstorm. A sudden gust surged through the streets, rattling shutters and sending loose papers swirling in a chaotic dance. The first crack of thunder rolled like distant drums, followed quickly by a cascade of rain that",
          "type": "output_text",
          "logprobs": null
        }
      ],
      "role": "assistant",
      "status": "completed",
      "type": "message"
    }
  ],
  "parallel_tool_calls": true,
  "temperature": 0.7,
  "tool_choice": "auto",
  "tools": [],
  "top_p": 1.0,
  "background": false,
  "max_output_tokens": 256,
  "max_tool_calls": null,
  "previous_response_id": null,
  "prompt": null,
  "reasoning": null,
  "service_tier": "auto",
  "status": "incomplete",
  "text": null,
  "top_logprobs": null,
  "truncation": "disabled",
  "usage": {
    "input_tokens": 72,
    "input_tokens_details": {
      "cached_tokens": 64
    },
    "output_tokens": 256,
    "output_tokens_details": {
      "reasoning_tokens": 44,
      "tool_output_tokens": 0
    },
    "total_tokens": 328
  },
  "user": null
}

Client, where we don't hit the max token limit

(gpt_oss_edit) [axia@devvm30969.cln0 /data/users/axia/gitrepos/vllm (andrew/incomplete-details)]$ curl http://localhost:20001/v1/responses   -H "Content-Type: application/json"   -N   -d '{
    "model": "/data/users/axia/checkpoints/gpt-oss-120b",
    "input": [
        {
            "role": "user",
            "content": "Write two words on the weather."
        }
    ],
    "temperature": 0.7,
    "max_output_tokens": 256
}' 
# output
{
  "id": "resp_60215e7212c044669bf5b85015fe19e4",
  "created_at": 1757701190,
  "incomplete_details": null,
  "instructions": null,
  "metadata": null,
  "model": "/data/users/axia/checkpoints/gpt-oss-120b",
  "object": "response",
  "output": [
    {
      "id": "rs_02262a1cf9cf4ff3891d07d282f2e176",
      "summary": [],
      "type": "reasoning",
      "content": [
        {
          "text": "The user asks: \"Write two words on the weather.\" Probably they want a short phrase of two words describing the weather. Could be \"sunny skies\", \"stormy night\", etc. Provide two words. Could also be a short description: \"Cloudy day\". Provide two words. Probably just output two words. I'll respond with two words.",
          "type": "reasoning_text"
        }
      ],
      "encrypted_content": null,
      "status": null
    },
    {
      "id": "msg_52743c005ca94adf9687a4ffebb0107e",
      "content": [
        {
          "annotations": [],
          "text": "Sunny skies.",
          "type": "output_text",
          "logprobs": null
        }
      ],
      "role": "assistant",
      "status": "completed",
      "type": "message"
    }
  ],
  "parallel_tool_calls": true,
  "temperature": 0.7,
  "tool_choice": "auto",
  "tools": [],
  "top_p": 1.0,
  "background": false,
  "max_output_tokens": 256,
  "max_tool_calls": null,
  "previous_response_id": null,
  "prompt": null,
  "reasoning": null,
  "service_tier": "auto",
  "status": "completed",
  "text": null,
  "top_logprobs": null,
  "truncation": "disabled",
  "usage": {
    "input_tokens": 72,
    "input_tokens_details": {
      "cached_tokens": 64
    },
    "output_tokens": 84,
    "output_tokens_details": {
      "reasoning_tokens": 72,
      "tool_output_tokens": 0
    },
    "total_tokens": 156
  },
  "user": null
}

Test Result

As shown above: with the two-paragraph prompt and max_output_tokens of 256, the response returns status "incomplete" with incomplete_details.reason == "max_output_tokens" (output_tokens is exactly 256). With the short prompt, incomplete_details stays null and status is "completed".

@mergify bot added the frontend, gpt-oss (Related to GPT-OSS models), and v1 labels Sep 10, 2025

mergify bot commented Sep 12, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @qandrew.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify bot added the needs-rebase label Sep 12, 2025
@mergify bot removed the needs-rebase label Sep 12, 2025
@qandrew force-pushed the andrew/incomplete-details branch from 1fe6099 to 1718d05 on September 12, 2025
@yeqcharlotte (Collaborator) left a comment

thanks for the test plan! cc: @heheda12345 @houseroad

Comment on lines +1900 to +1902
# TODO: implement the other reason for incomplete_details,
# which is content_filter
# incomplete_details = IncompleteDetails(reason='content_filter')
Collaborator:

What's missing from the current logic, btw?

Contributor Author:

I don't think vLLM's baseline implementation currently supports content filter as an abort reason: https://github.com/vllm-project/vllm/blob/main/vllm/v1/request.py#L206

Contributor Author:

If the parser still has messages (i.e. the generator got cut off abruptly), this should be incomplete and not completed.

Collaborator:

Makes sense, please move your comment to the code :)
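
For illustration, a minimal sketch of the status decision discussed in this thread (the function name, parameter names, and the "length" finish-reason string are assumptions, not vLLM's actual code):

def final_status(finish_reason: str, parser_has_pending_messages: bool) -> str:
    # A response should be reported as incomplete if the token budget was
    # exhausted, or if generation stopped while the parser still holds an
    # unterminated message (i.e. the generator was cut off abruptly).
    if finish_reason == "length" or parser_has_pending_messages:
        return "incomplete"
    return "completed"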

@houseroad added the ready label (ONLY add when PR is ready to merge/full CI is needed) Sep 14, 2025
@yeqcharlotte (Collaborator) commented:

@qandrew @houseroad this needs a rebase over the structured output disable commit

@houseroad merged commit 25aba2b into vllm-project:main Sep 15, 2025
45 checks passed
tlrmchlsmth pushed a commit to tlrmchlsmth/vllm that referenced this pull request Sep 15, 2025
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
QierLi pushed a commit to QierLi/vllm that referenced this pull request Oct 5, 2025