feat: backport monitoring metrics and fix stream label / serve_count bugs #4899

Merged
qinxuye merged 5 commits into main from feat/metrics-backport-and-stream-fix
May 12, 2026

@m199369309 m199369309 (Collaborator) commented May 12, 2026

Summary

  • Monitoring metrics backport (2.7.0 → 2.8.0): Restores the customized metric changes that were overwritten during the 2.8.0 upgrade:

    • model_last_load_duration_seconds Gauge with load timing in worker
    • _WORKER_ONLY_METRICS set + Supervisor-side deregister to avoid empty HELP/TYPE headers
    • image/audio/video engine/format label differentiation (empty labels instead of "unknown")
    • grafana_alert_datasource in UI config API + grafanaUtils.js URL params
    • @request_limit decorator on image_to_image, inpainting, ocr methods
    • Exception protection for record_metrics
  • Stream label fix: the request_limit decorator now recognizes xoscar's IteratorWrapper as a stream type. Previously the stream="true" label was never recorded for LLM requests.

  • serve_count double-decrement fix: Added stream type detection guard (isasyncgen/isgenerator/IteratorWrapper) to 3 stream_results() finally blocks, ensuring decrease_serve_count() is only called when the decorator has deferred the decrement to the caller. This prevents model_serve_count from going negative in two scenarios:

    • model.chat() fails before iterator is assigned (iterator is None)
    • model.chat() returns a non-stream value on stream=True path (iterator is not a generator/IteratorWrapper)
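
The guard in the last bullet can be illustrated with a small, synchronous sketch. This is not the code in xinference/api/restful_api.py: the real handlers are async, and `FakeIteratorWrapper`, `FakeModel`, `is_stream`, and this `stream_results` are hypothetical stand-ins that only mimic the decrement hand-off between the decorator and the caller.

```python
import inspect


class FakeIteratorWrapper:
    """Hypothetical stand-in for xoscar's IteratorWrapper."""

    def __init__(self, items):
        self._it = iter(items)

    def __iter__(self):
        return self

    def __next__(self):
        return next(self._it)


def is_stream(obj):
    """True only for the types treated as streams in this sketch."""
    return (
        inspect.isasyncgen(obj)
        or inspect.isgenerator(obj)
        or isinstance(obj, FakeIteratorWrapper)
    )


class FakeModel:
    """Hypothetical model whose chat() behaves like a request_limit-wrapped call."""

    def __init__(self, result=None, error=None):
        self.serve_count = 0
        self._result = result
        self._error = error

    def decrease_serve_count(self):
        self.serve_count -= 1

    def chat(self):
        self.serve_count += 1
        deferred = False
        try:
            if self._error is not None:
                raise self._error
            # For stream results the decrement is left to the consumer.
            deferred = is_stream(self._result)
            return self._result
        finally:
            # Decorator side: decrement here unless it was deferred.
            if not deferred:
                self.serve_count -= 1


def stream_results(model):
    iterator = None
    chunks = []
    try:
        iterator = model.chat()
        if is_stream(iterator):
            chunks.extend(iterator)
    except RuntimeError:
        pass  # a real handler would report the error to the client
    finally:
        # Caller side: decrement only when the decorator deferred it.
        if is_stream(iterator):
            model.decrease_serve_count()
    return chunks
```

With an unconditional `decrease_serve_count()` in the `finally` block, the failure case and the non-stream case would each decrement twice and leave the counter at -1. With the type check, all three cases end at 0.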

Test plan

  • Deploy and verify model_last_load_duration_seconds is reported after model load
  • Verify Supervisor /metrics endpoint no longer shows worker-only metric HELP/TYPE headers
  • Send stream + non-stream LLM requests, confirm model_request_total{stream="true"} and stream="false" are both recorded correctly
  • Verify model_serve_count stays ≥ 0 after failed stream requests
  • Verify grafana_alert_datasource appears in /v1/cluster/ui_config response
  • Verify image/audio model metrics show empty engine/format labels instead of "unknown"
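
Two of the checks above (stream labels and non-negative serve count) could be automated against the text returned by a /metrics endpoint. The following is a rough sketch using only the standard library. The `model` label and the sample values in `EXAMPLE` are made up for illustration, and the parser handles only simple sample lines.

```python
import re

# name{labels} value [timestamp]; the labels and timestamp parts are optional.
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)(?:\s+-?\d+)?$'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Return (name, labels, value) tuples, skipping comments and blank lines."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


def check_metrics(text):
    """Summarize the stream-label and serve-count checks from the test plan."""
    samples = parse_samples(text)
    stream_labels = {
        labels.get("stream")
        for name, labels, _ in samples
        if name == "model_request_total"
    }
    serve_counts = [v for name, _, v in samples if name == "model_serve_count"]
    return {
        "both_stream_labels": {"true", "false"} <= stream_labels,
        "serve_count_non_negative": all(v >= 0 for v in serve_counts),
    }


# Hypothetical scrape output, for illustration only.
EXAMPLE = """
# TYPE model_request_total counter
model_request_total{model="m1",stream="true"} 3.0
model_request_total{model="m1",stream="false"} 5.0
model_serve_count{model="m1"} 0.0
"""
```

Calling `check_metrics(EXAMPLE)` reports both checks as passing. Against a real deployment the text would come from the Supervisor or worker metrics endpoint instead of `EXAMPLE`.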

m199369309 and others added 2 commits May 12, 2026 11:02
- Add model_last_load_duration_seconds Gauge and load timing in worker
- Add _WORKER_ONLY_METRICS set and deregister from Supervisor registry
- Differentiate engine/format labels for image/audio/video model types
- Add @request_limit to image_to_image, inpainting, and ocr methods
- Add grafana_alert_datasource to UI config API
- Add alert_datasource and from/to params to grafanaUtils.js
- Add exception protection to record_metrics

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
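
For readers unfamiliar with the pattern, the load timing and the exception protection in this commit can be sketched as below. This is an illustration under assumptions, not the worker code. `FakeGauge` stands in for a Prometheus Gauge, and the `model_uid` label, `record_metrics_safely`, and `load_model` are hypothetical names.

```python
import logging
import time

logger = logging.getLogger(__name__)


class FakeGauge:
    """Hypothetical stand-in for a Gauge; keeps the last value per label set."""

    def __init__(self):
        self.values = {}

    def set(self, labels, value):
        self.values[tuple(sorted(labels.items()))] = value


# Plays the role of model_last_load_duration_seconds in this sketch.
LOAD_DURATION = FakeGauge()


def record_metrics_safely(record, *args, **kwargs):
    """Run a metrics callback without letting its failure reach the caller."""
    try:
        record(*args, **kwargs)
        return True
    except Exception:
        logger.exception("recording metrics failed; continuing")
        return False


def load_model(load_fn, model_uid):
    """Time load_fn() and publish the duration; metrics errors are swallowed."""
    start = time.monotonic()
    model = load_fn()
    elapsed = time.monotonic() - start
    record_metrics_safely(LOAD_DURATION.set, {"model_uid": model_uid}, elapsed)
    return model
```

The point of the wrapper is that a broken metrics backend degrades observability only. A model load or request should not fail because a gauge could not be updated.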
- Add IteratorWrapper to stream detection in request_limit decorator,
  fixing stream="true" label never being recorded for LLM requests
- Guard decrease_serve_count() with iterator-not-None check in 3
  stream_results() finally blocks, preventing double-decrement when
  model.chat() fails before iterator is assigned

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
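
The label bug fixed by this commit comes down to a type check: a wrapper object that is iterable but is neither a generator nor an async generator fails the `inspect` checks, so the request is counted as non-stream. Below is a simplified, synchronous sketch of that detection. `FakeIteratorWrapper`, `request_limit_sketch`, and the `REQUEST_TOTAL` counter are hypothetical stand-ins, not the xinference decorator.

```python
import functools
import inspect
from collections import Counter


class FakeIteratorWrapper:
    """Hypothetical stand-in for xoscar's IteratorWrapper (not a generator)."""

    def __init__(self, items):
        self._items = list(items)

    def __iter__(self):
        return iter(self._items)


# Stand-in for model_request_total, keyed by the value of the stream label.
REQUEST_TOTAL = Counter()


def request_limit_sketch(fn):
    """Record a stream label based on the type of the wrapped call's result."""

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        ret = fn(*args, **kwargs)
        is_stream = (
            inspect.isasyncgen(ret)
            or inspect.isgenerator(ret)
            # Without this clause the wrapper type is labeled stream="false".
            or isinstance(ret, FakeIteratorWrapper)
        )
        REQUEST_TOTAL["true" if is_stream else "false"] += 1
        return ret

    return wrapper


@request_limit_sketch
def chat(stream):
    chunks = ["Hel", "lo"]
    return FakeIteratorWrapper(chunks) if stream else "".join(chunks)
```

After one `chat(stream=True)` and one `chat(stream=False)` call, `REQUEST_TOTAL` holds one count under each label.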
@XprobeBot XprobeBot added this to the v2.x milestone May 12, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request enhances metrics management by introducing model load duration tracking and separating worker-only metrics. It also improves API robustness by adding request limits to image models and refining serve count handling in streaming endpoints. Additionally, the UI and Grafana utilities were updated to support alert datasources and time ranges. Review feedback identifies opportunities to further stabilize serve count decrements by aligning API stream detection with decorator logic and suggests using relative imports for consistency.

4 comment threads on xinference/api/restful_api.py (outdated)
m199369309 and others added 3 commits May 12, 2026 11:21
- Align API stream_results() finally blocks with request_limit decorator
  logic: check iterator type (isasyncgen/isgenerator/IteratorWrapper)
  before calling decrease_serve_count, preventing double-decrement when
  model returns non-stream value on stream=True path
- Use relative import for _WORKER_ONLY_METRICS for consistency

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@qinxuye qinxuye (Contributor) left a comment

LGTM

@qinxuye qinxuye merged commit 99fab44 into main May 12, 2026
10 of 13 checks passed
@qinxuye qinxuye deleted the feat/metrics-backport-and-stream-fix branch May 12, 2026 07:56