Description
What happened:
As this guide says, the `normalized_time_per_output_token_seconds` metric is supported by EPP and is described as "Distribution of ntpot (response latency per output token)". However, this metric is not actually recorded anywhere in the latest EPP code: the `RecordNormalizedTimePerOutputToken` function here is only called by unit tests.
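For context, here is a minimal sketch of what recording this metric amounts to: NTPOT is the total response latency divided by the number of generated output tokens. The histogram definition, the `recordNTPOT` helper, its signature, and the label and model names below are my assumptions for illustration; the real `RecordNormalizedTimePerOutputToken` in EPP's metrics package may differ.

```go
package main

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// Hypothetical histogram mirroring the documented metric; the real EPP
// definition (buckets, labels) may differ.
var normalizedTimePerOutputToken = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Subsystem: "inference_model",
		Name:      "normalized_time_per_output_token_seconds",
		Help:      "Distribution of ntpot (response latency per output token).",
	},
	[]string{"model_name", "target_model_name"},
)

// recordNTPOT is a hypothetical helper: it observes total response latency
// divided by the number of generated output tokens. The streaming response
// handler would need to call something like this after the final chunk,
// which is exactly the call that is missing today.
func recordNTPOT(model, targetModel string, received, complete time.Time, outputTokens int) {
	if outputTokens <= 0 {
		return // nothing to normalize by; skip recording
	}
	elapsed := complete.Sub(received).Seconds()
	normalizedTimePerOutputToken.
		WithLabelValues(model, targetModel).
		Observe(elapsed / float64(outputTokens))
}

func main() {
	prometheus.MustRegister(normalizedTimePerOutputToken)

	// Example: a 2s response with 100 output tokens records 0.02s per token.
	received := time.Now().Add(-2 * time.Second)
	recordNTPOT("my-model", "my-model-v1", received, time.Now(), 100)
}
```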
What you expected to happen:
`normalized_time_per_output_token_seconds` should be recorded and exposed by EPP after generating a streaming response.
How to reproduce it (as minimally and precisely as possible):
I actually discovered this issue while writing e2e tests for metrics. You can refer to my branch (#938): remove the line marked with TODO and run the e2e tests to observe the problem. A minimal standalone version of the failing check is sketched below.
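The failing check boils down to scraping the EPP metrics endpoint and looking for the metric name. Here is a standalone sketch of that idea, assuming a placeholder endpoint address (the real e2e test resolves the endpoint from the deployed pod, not from localhost):

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
)

func main() {
	// Placeholder address; the real e2e test resolves the EPP metrics
	// endpoint from the deployed pod/service.
	resp, err := http.Get("http://localhost:9090/metrics")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}

	// The metric never shows up, because RecordNormalizedTimePerOutputToken
	// is never called outside unit tests.
	if strings.Contains(string(body), "inference_model_normalized_time_per_output_token_seconds") {
		fmt.Println("metric is exposed")
	} else {
		fmt.Println("metric is missing") // observed behavior today
	}
}
```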
Anything else we need to know?:
I find the guidance documentation for the metrics also to be problematic:

- I think the `normalized_time_per_output_token_seconds` metric mentioned in the document should actually be the `inference_model_normalized_time_per_output_token_seconds` metric; the subsystem prefix is not included there (see the naming sketch after this list).
- The doc says: "To have response metrics, ensure the body mode is set to Buffered or Streamed (this should be the default behavior for all implementations)." Is this description somewhat outdated? As far as I know, EPP now exclusively uses `FULL_DUPLEX_STREAMED`.
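To illustrate the first point: client_golang joins the subsystem into the exposed metric name, so a reader searching the metrics endpoint for the bare documented name will never find it. A quick sketch using `prometheus.BuildFQName`:

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	// BuildFQName joins namespace, subsystem, and name with underscores;
	// this is how client_golang derives the exposed metric name.
	fqName := prometheus.BuildFQName("", "inference_model", "normalized_time_per_output_token_seconds")
	fmt.Println(fqName) // inference_model_normalized_time_per_output_token_seconds
}
```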
Environment:

- Kubernetes version (use `kubectl version`):
- Inference extension version (use `git describe --tags --dirty --always`):
- Cloud provider or hardware configuration:
- Install tools:
- Others: