feat: backport monitoring metrics and fix stream label / serve_count bugs#4899
Merged
Conversation
- Add model_last_load_duration_seconds Gauge and load timing in worker - Add _WORKER_ONLY_METRICS set and deregister from Supervisor registry - Differentiate engine/format labels for image/audio/video model types - Add @request_limit to image_to_image, inpainting, and ocr methods - Add grafana_alert_datasource to UI config API - Add alert_datasource and from/to params to grafanaUtils.js - Add exception protection to record_metrics Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Add IteratorWrapper to stream detection in request_limit decorator, fixing stream="true" label never being recorded for LLM requests - Guard decrease_serve_count() with iterator-not-None check in 3 stream_results() finally blocks, preventing double-decrement when model.chat() fails before iterator is assigned Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Code Review
This pull request enhances metrics management by introducing model load duration tracking and separating worker-only metrics. It also improves API robustness by adding request limits to image models and refining serve count handling in streaming endpoints. Additionally, the UI and Grafana utilities were updated to support alert datasources and time ranges. Review feedback identifies opportunities to further stabilize serve count decrements by aligning API stream detection with decorator logic and suggests using relative imports for consistency.
- Align API stream_results() finally blocks with request_limit decorator logic: check iterator type (isasyncgen/isgenerator/IteratorWrapper) before calling decrease_serve_count, preventing double-decrement when model returns non-stream value on stream=True path - Use relative import for _WORKER_ONLY_METRICS for consistency Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Merged
9 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Monitoring metrics backport (2.7.0 → 2.8.0): Restores 5 customized metric changes that were overwritten during the 2.8.0 upgrade:
model_last_load_duration_secondsGauge with load timing in worker_WORKER_ONLY_METRICSset + Supervisor-side deregister to avoid empty HELP/TYPE headersgrafana_alert_datasourcein UI config API + grafanaUtils.js URL params@request_limitdecorator onimage_to_image,inpainting,ocrmethodsrecord_metricsStream label fix:
request_limitdecorator now recognizes xoscarIteratorWrapperas a stream type, fixingstream="true"label never being recorded for LLM requestsserve_count double-decrement fix: Added stream type detection guard (
isasyncgen/isgenerator/IteratorWrapper) to 3stream_results()finally blocks, ensuringdecrease_serve_count()is only called when the decorator has deferred the decrement to the caller. This preventsmodel_serve_countfrom going negative in two scenarios:model.chat()fails before iterator is assigned (iterator is None)model.chat()returns a non-stream value onstream=Truepath (iterator is not a generator/IteratorWrapper)Test plan
model_last_load_duration_secondsis reported after model load/metricsendpoint no longer shows worker-only metric HELP/TYPE headersmodel_request_total{stream="true"}andstream="false"are both recorded correctlymodel_serve_countstays ≥ 0 after failed stream requestsgrafana_alert_datasourceappears in/v1/cluster/ui_configresponse