Update rhods prometheus stack according the kfnbc migration #256
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
Adding /cc @samu so he is aware of the PRs that deal with changes to the monitoring stack.
Some notes and thoughts regarding the third point of this PR. Case 1: Case 2: I believe the reason we don't see any downtime while we scrape these targets is that the kfnbc spawner is embedded in the rhods-dashboard. Any alternative thoughts on how (or whether) to scrape the notebook spawner?
Force-pushed from 04e9927 to e32b3d1
@atheo89 these changes look good to me (other than the small whitespace nit). Do you have a live build I can test with?
@atheo89 the prometheus stack init container will also need to be updated to account for the traefik proxy.
@anishasthana, I should remove the command, right?
It would be better to update it so that prometheus waits for one of the notebook controller pods to come up.
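The suggested wait could be sketched as a generic retry loop, similar in spirit to the `oc::wait::object::availability` helper used elsewhere in the deploy script. This is a hypothetical sketch: the `wait_for` function name, its arguments, and the `app=odh-notebook-controller` label are assumptions, not taken from the actual repo.

```shell
#!/bin/sh
# Hypothetical wait loop: retry a command up to $1 times, sleeping $2
# seconds between attempts; returns non-zero if it never succeeds.
wait_for() {
  attempts=$1; interval=$2; shift 2
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep "$interval"
  done
  return 1
}

# An init container could then poll until a notebook controller pod is
# Running, e.g. (label selector is an assumption):
#   wait_for 30 10 sh -c \
#     'oc get pods -n "$ODH_PROJECT" -l app=odh-notebook-controller | grep -q Running'
```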
Note: I added to this PR the changes for the faulty firing of the kfnbc alerts, and closed the old one, #255.
A new live build with these changes is available at quay.io/accorvin/rhods-operator-live-catalog:1.0.0-rhods-4765
@atheo89 I'm getting an error in the deployer init container:
Edit: the pod is here if you want to look at it: https://console-openshift-console.apps.acorvin.mxg0.s1.devshift.org/k8s/ns/redhat-ods-operator/pods/rhods-operator-5745bdcc4f-zthw4/logs?container=rhods-deployer
I believe I've found the problem that was causing the deploy script to fail. I've added b9eb014 with a fix and am building a new live image now.
```shell
rhods_dashboard_host=$(oc::wait::object::availability "oc get route rhods-dashboard -n $ODH_PROJECT -o jsonpath='{.spec.host}'" 2 30 | tr -d "'")

sed -i "s/<jupyterhub_host>/$jupyterhub_host/g" monitoring/prometheus/prometheus-configs.yaml

NOTEBOOK_SUFFIX="/notebookController/spawner"
# Note: the stray quote after {.spec.host} in the original diff is removed
# here so that the command string is well-formed.
notebook_spawner_host=$(oc::wait::object::availability "oc get route rhods-dashboard -n $ODH_PROJECT -o jsonpath='{.spec.host}$NOTEBOOK_SUFFIX'" 2 30 | tr -d "'")
```
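One of the later commits escapes the slashes in NOTEBOOK_SUFFIX because they collide with sed's `/` delimiter in the substitution above. A minimal sketch of the problem and the fix; the `<notebook_suffix>` placeholder is hypothetical, mirroring the style of the placeholders substituted into prometheus-configs.yaml.

```shell
#!/bin/sh
NOTEBOOK_SUFFIX="/notebookController/spawner"

# Unescaped slashes in the replacement text would break "s/.../.../g",
# so escape each / as \/ before handing the value to sed.
escaped_suffix=$(printf '%s' "$NOTEBOOK_SUFFIX" | sed 's/\//\\\//g')

# Now the substitution parses cleanly.
echo "target: <notebook_suffix>" | sed "s/<notebook_suffix>/$escaped_suffix/g"
```

An alternative design choice is to use a different sed delimiter (e.g. `s|<notebook_suffix>|$NOTEBOOK_SUFFIX|g`), which avoids the escaping entirely.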
This variable contains the route to the new spawner page (e.g. https://rhods-dashboard-redhat-ods-applications.apps.qeaisrhods-mon.XXXX.s1.devshift.org/notebookController/spawner), which is used in the alert RHODS Jupyter Probe Success Burn Rate.
What I observed in the live build is that, if the Kubeflow notebook controller pod and the ODH notebook controller pod are not running, the alert RHODS Jupyter Probe Success Burn Rate does not fire.
I believe the reason is that https://rhods-dashboard-redhat-ods-applications.apps.qeaisrhods-mon.XXXX.s1.devshift.org/notebookController/spawner renders the page correctly (returning HTTP 200) even if those pods are not running.
What happens in that situation is that, when you try to spawn a notebook, you get this error:
So, I'm not sure the alert RHODS Jupyter Probe Success Burn Rate is useful with this implementation, given that we already have RHODS Dashboard Probe Success Burn Rate.
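Since the spawner route returns HTTP 200 even when the controllers are down, a status-code probe alone cannot distinguish the two states; the probe would also need to check the response body. A hedged sketch of that idea (the marker string, the `page_looks_healthy` helper, and its usage are all hypothetical; with the blackbox exporter this would roughly correspond to a `fail_if_body_not_matches_regexp` rule in the probe module):

```shell
#!/bin/sh
# Sketch: a 200 status alone misses the broken-backend case, so also
# require an expected marker string in the page body. The marker
# "Start a notebook server" is an assumption, not the real page text.
page_looks_healthy() {
  body=$1
  printf '%s' "$body" | grep -q "Start a notebook server"
}

# In a real probe the body would come from something like:
#   body=$(curl -fsS "https://$notebook_spawner_host")
#   page_looks_healthy "$body" || echo "spawner page unhealthy"
```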
@lucferbux @andrewballantyne this to me seems like functionality that should perhaps change on the notebook spawner page. If the backend dependencies aren't met, does it make sense for the spawner page to generate a 200 status code when you load it?
I don't think we want the dashboard in the business of checking for status of backend controllers. I think we'd do a poor job of it since we can't really anticipate the ways in which it fails for any given controller. Honestly, I think this UX is pretty good the way it is since it's properly conveying an error from the notebook spawn.
We definitely need to change something around how this works - either what we’re probing or how the page behaves. Note: I don’t think we should make this change now, but note it as a deficiency and fix it in 1.17.
Right now, kfnbc functionality other than the spawner UI can be broken and we wouldn’t know about it.
It should be easy to add liveness probes to all the backend controllers so we aren't ever in this situation, shouldn't it?
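For reference, adding a liveness probe to a controller Deployment is a small manifest change. A sketch, assuming the controller exposes a `/healthz` endpoint on port 8081; both the path and the port are assumptions here and would need to match what the controllers actually serve:

```yaml
# Hypothetical liveness probe for a notebook controller container.
livenessProbe:
  httpGet:
    path: /healthz   # assumed endpoint
    port: 8081       # assumed port
  initialDelaySeconds: 15
  periodSeconds: 20
```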
The issue with the odh deployer init container failing has been resolved by my latest change (this change is available in the live build; I pushed the new version to the same tag). I'm now seeing an issue where the probes are getting 403 errors. I'm investigating why that is happening.
Update: I found that the opendatahub-operator container in my live build did not properly include the changes from red-hat-data-services/odh-manifests#231. I'm generating a new live build that resolves that and will post with the status of my testing.
Update: this looks to be working correctly now with the latest version of my live build.
Rhods live build for these changes with KFNBC and migration script: quay.io/llasmith/rhods-operator-live-catalog:1.16.0-dashboard-kfnbc
- Deleted JupyterHub from the Prometheus configuration
- Changed alert name from JupyterHub to Jupyter in the Builds rules
- Added kube_persistentvolumeclaim_info to Federate Prometheus to retrieve information about the PVCs
- Added Usage Metrics rules: rhods_total_users and rhods_current_users
- Added a monitoring target for the Jupyter notebook spawner page
- Added an alert block that fires for the notebook spawner
- Changed the expression of RHODS Probe Success Burn Rate to look at the name label instead of instance, to keep uniformity with the spawner's alerts
- Updated the rhods_aggregate_availability metric to display the notebook spawner
- Changed the alerting rule name to RHODS Dashboard Probe Success Burn Rate
- Updated the command in the InitContainer to watch odh-notebook-controller-service instead of traefik-proxy
- Corrected the kfnbc alert rule expressions to fire when the pods are unreachable
- Updated the triage URLs for SRE
- Use auth when probing the dashboard with the blackbox exporter
- Escape slashes in the NOTEBOOK_SUFFIX variable (these slashes broke the sed command on line 226; escaping them results in the variable being properly set)
- Corrected the Usage Metrics recording rules
We've pulled Adriana's changes into #257
These changes have been pulled into #257
This PR cleans up the obsolete monitoring targets, alerts, and recording rules from the Prometheus configuration.
Description
Applied the following changes:
How Has This Been Tested?
Merge criteria:
[UPSTREAM] has been prepended to the commit message.