Fix RHODS Jupyter Probe Success Burn Rate #263
34f383e to 44b1a05
deploy.sh
NOTEBOOK_SUFFIX="\/notebookController\/spawner"
notebook_spawner_host=$(oc::wait::object::availability "oc get route rhods-dashboard -n $ODH_PROJECT -o jsonpath='{.spec.host}'$NOTEBOOK_SUFFIX'" 2 30 | tr -d "'")
rhods_dashboard_host=$(oc::wait::object::availability "oc get route rhods-dashboard -n $ODH_PROJECT -o jsonpath='{.spec.host}'" 2 30 | tr -d "'")
notebook_spawner_host=$(oc::wait::object::availability "oc get svc notebook-controller-service -n $ODH_PROJECT -o go-template --template='{{.metadata.name}}.{{.metadata.namespace}}.svc.local:8080/metrics{{println}}'" 10 40 | tr -d "'")
@atheo89 You'll need to escape the slash in /metrics so that the sed statement below does not treat it as a delimiter once the shell has expanded the $notebook_spawner_host variable.
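To illustrate the escaping issue, here is a minimal sketch. The host value, the `<notebook_spawner_host>` placeholder, and the template line are all hypothetical and are not taken from deploy.sh. It shows two common ways to keep a `/` inside a substituted value from ending the `s///` expression early.

```shell
# Hypothetical value; in deploy.sh the host comes from the `oc get svc` call.
notebook_spawner_host='notebook-controller-service.example-ns.svc.local:8080/metrics'

# Hypothetical template line containing a placeholder token.
template='- targets: [<notebook_spawner_host>]'

# Option 1: escape every "/" in the value before using it in s/.../.../
escaped_host=$(printf '%s' "$notebook_spawner_host" | sed 's|/|\\/|g')
result_escaped=$(printf '%s\n' "$template" | sed "s/<notebook_spawner_host>/$escaped_host/g")

# Option 2: pick a sed delimiter that does not occur in the value.
result_delim=$(printf '%s\n' "$template" | sed "s|<notebook_spawner_host>|$notebook_spawner_host|g")

printf '%s\n' "$result_escaped"
printf '%s\n' "$result_delim"
```

Both options should produce the same substituted line. Option 2 avoids the extra escaping step, but only works if the chosen delimiter can never appear in the value.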
Thanks for the review, Landon. I added the escaping backslash before /metrics. I also noticed in the operator's logs that the notebook-controller-service is not available when this line runs: every attempt of the availability function fails, so the notebook-spawner target is never populated. Do you think I should give the function more attempts or more time, or move that line later in the deploy.sh file?
Error from server (NotFound): services "notebook-controller-service" not found
Error from server (NotFound): services "notebook-controller-service" not found
Error from server (NotFound): services "notebook-controller-service" not found
Error from server (NotFound): services "notebook-controller-service" not found
Error from server (NotFound): services "notebook-controller-service" not found
....
....
....
Error from server (NotFound): services "notebook-controller-service" not found
Error from server (NotFound): services "notebook-controller-service" not found
Error from server (NotFound): services "notebook-controller-service" not found
Error from server (NotFound): services "notebook-controller-service" not found
Error from server (NotFound): services "notebook-controller-service" not found
Error from server (NotFound): services "notebook-controller-service" not found
Error from server (NotFound): services "notebook-controller-service" not found
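For context on the "more attempts or more time" question, a retry helper of this kind can be sketched as follows. `wait_for_command` is a hypothetical stand-in, not the actual `oc::wait::object::availability` implementation, and the argument order (interval in seconds, then maximum attempts) is an assumption.

```shell
# Hypothetical helper, NOT the real oc::wait::object::availability.
# Usage: wait_for_command "<command>" <interval_seconds> <max_attempts>
wait_for_command() {
  cmd=$1
  interval=$2
  attempts=$3
  i=1
  while [ "$i" -le "$attempts" ]; do
    # Print the command's output and stop as soon as it succeeds.
    if output=$(eval "$cmd" 2>/dev/null); then
      printf '%s\n' "$output"
      return 0
    fi
    # Do not sleep after the final failed attempt.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$interval"
    fi
    i=$((i + 1))
  done
  echo "gave up after $attempts attempts: $cmd" >&2
  return 1
}
```

With this sketch, a call such as `wait_for_command "oc get svc notebook-controller-service -n $ODH_PROJECT -o name" 10 40` would wait for at most roughly 390 seconds. Raising the limits only helps if the service is created within that window. If it is created by a later step of deploy.sh, moving the lookup after that step is the other option raised above.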
@atheo89 The alert is not fired if the controller pod errors out (e.g. ImagePullBackOff). Is this expected? Note: I do see the alerts when the deployment is scaled down.
@VaishnaviHire, I am not sure, but I think that is fine in this case. If one of the two notebook controllers is working, we don't get the alert, because the target on the blackbox exporter is populated by both controllers' services.
I see the alert "RHODS Jupyter Probe Success Burn Rate" when both controllers are down too. I will test it further on Monday.
/lgtm Alert was triggered when both controllers were scaled down.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: LaVLaS The full list of commands accepted by this bot can be found here. The pull request process is described here
This PR fixes the RHODS Jupyter Probe Success Burn Rate alarm fault.
Description
After investigating with @lucferbux, we found that during the kfnbc migration we deleted the SLOs - JupyterHub recording rules without updating them for the notebook controller.
Furthermore, we updated the blackbox exporter's notebook-spawner target to watch the notebook-controller-service.
How Has This Been Tested?
Merge criteria:
[UPSTREAM] has been prepended to the commit message.