Add EPP operational metrics #690
Comments
Can you please add a "Why is this needed:" for each proposed metric?
/assign liu-cong
To address the above comment.
@JeffLuoo Added the "whys", please take a look.
/unassign
Can you clarify the first two metrics?
For the first: #581 already covers scheduling latency, but the first metric here is also about scheduling latency.
For the second: do you mean how long it takes EPP to process the response header and then the response body?
Sorry for the confusion. For 1, #581 captures the overall latency of the scheduler component; it is a subtask of this issue. For 2, I mean the latency from when EPP receives the response from the backend to when EPP returns the response to the LB. The intent is to understand the overall latency that EPP adds in the request and response paths.
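To make the two measurements concrete, here is a minimal Python sketch of timing the request path and the response path separately. It is an illustration only, not EPP's implementation: `LatencyRecorder`, `handle_request`, `handle_response`, and the phase names `"request"` and `"response"` are all hypothetical, and a real implementation would export the samples through a metrics library instead of keeping them in memory.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class LatencyRecorder:
    """Collects per-phase latency samples in seconds (hypothetical helper)."""

    def __init__(self, clock=time.monotonic):
        # The clock is injectable so the recorder can be tested deterministically.
        self._clock = clock
        self.samples = defaultdict(list)

    @contextmanager
    def measure(self, phase):
        start = self._clock()
        try:
            yield
        finally:
            # Record the sample even if the wrapped work raises.
            self.samples[phase].append(self._clock() - start)


recorder = LatencyRecorder()


def handle_request(pick_pod):
    # Request path: from receiving the request to deciding on a target pod.
    with recorder.measure("request"):
        return pick_pod()


def handle_response(forward):
    # Response path: from receiving the backend response to returning it.
    with recorder.measure("response"):
        return forward()
```

Summing the two phases per request gives an estimate of the total latency added in both directions, under the assumption that nothing else runs between them.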
What would you like to be added:
I would like to add the following list of metrics:
[ ] Overall latency for EPP to receive a request and make a decision on the target pod. Why? EPP provides benefits at the cost of added latency. This is a generally useful metric to quantify that cost, as well as to detect any regressions in EPP.
[ ] Overall latency for EPP to process the response. Why? Same as above.
[ ] Latency/error count when EPP scrapes model server metrics. Why? The freshness of metrics is vital to the effectiveness of the EPP algorithm. These metrics help detect issues with metric scraping caused by regressions in EPP or the model server, or by transient failures.
[ ] Age of caches (the model server metrics, etc.). Why? This is the observed cache freshness. It can be a useful metric to alert on if the cache age exceeds a certain limit.
[ ] Overall latency of each component (flow controller, scheduler). Why? These component-level metrics provide a latency breakdown of the overall EPP latency, and help identify bottlenecks, sources of regressions, etc.
[ ] Future: Prefix cache, queuing metrics (latency, error, size, and any other custom metrics) Why? These are important operational metrics to monitor. For instance, if the queue size is hitting the limit, it's a signal to perhaps increase the size limit.
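
As a rough illustration of the scrape latency, scrape error count, and cache age metrics in the list above, here is a minimal Python sketch. It is not EPP code: `MetricsCache`, `refresh`, `age`, and the `scrape` callable are hypothetical names, and the assumption is that cache age is measured from the last successful scrape.

```python
import time


class MetricsCache:
    """Caches the last scraped value and tracks scrape health (hypothetical)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self.value = None
        self.last_success = None    # clock reading of the last successful scrape
        self.scrape_errors = 0      # count of failed scrapes
        self.scrape_latencies = []  # seconds per scrape, success or failure

    def refresh(self, scrape):
        """Runs one scrape; returns True on success, False on failure."""
        start = self._clock()
        try:
            value = scrape()
        except Exception:
            # Keep the stale value; only the error counter changes.
            self.scrape_errors += 1
            return False
        finally:
            end = self._clock()
            self.scrape_latencies.append(end - start)
        self.value = value
        self.last_success = end
        return True

    def age(self):
        """Seconds since the last successful scrape, or None if never scraped."""
        if self.last_success is None:
            return None
        return self._clock() - self.last_success
```

Because a failed scrape leaves `last_success` untouched, the age keeps growing during an outage, which is what makes it usable for an alert threshold.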
Why is this needed: