Fix RHODS Jupyter Probe Success Burn Rate #263

Merged
LaVLaS merged 1 commit into red-hat-data-services:main on Oct 10, 2022

Member

@atheo89 atheo89 commented Sep 27, 2022

This PR fixes the faulty RHODS Jupyter Probe Success Burn Rate alert.

Description

After investigating with @lucferbux, we found that during the kfnbc migration the "SLOs - JupyterHub" recording rules were deleted without being updated for the notebook controller.
Furthermore, we updated the notebook-spawner target of the blackbox exporter so that it watches the notebook-controller-service.

How Has This Been Tested?

  • Scale down to 0 rhods-operator
  • Scale down to 0 notebook-controller-deployment
  • Scale down to 0 odh-notebook-controller-manager
  • Wait 5 mins until alerts "Kubeflow notebook controller pod is not running" and "ODH notebook controller pod is not running" are firing
  • Try to spawn a notebook. You'll see the error "Failed to create a notebook, please try again later"
  • Verify that the alert "RHODS Jupyter Probe Success Burn Rate" is firing after 10+ minutes
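
The scale-down steps above can be sketched as a small script. This is only an illustration: the two namespaces are assumptions (they are not stated in this PR), and the `run` wrapper with its `DRY_RUN` switch is a hypothetical helper that prints the commands by default instead of executing them. Only the deployment names come from the list above.

```shell
#!/usr/bin/env bash
set -eu

# Assumed namespaces; override them to match your cluster.
OPERATOR_NS="${OPERATOR_NS:-redhat-ods-operator}"
APPS_NS="${APPS_NS:-redhat-ods-applications}"
# 1 = only print the commands, anything else = execute them.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical wrapper: echo the command in dry-run mode, run it otherwise.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

scale_down_controllers() {
  # The PR lists the operator first, presumably so that it cannot
  # restore the controller replicas while the test is running.
  run oc scale deployment rhods-operator --replicas=0 -n "$OPERATOR_NS"
  run oc scale deployment notebook-controller-deployment --replicas=0 -n "$APPS_NS"
  run oc scale deployment odh-notebook-controller-manager --replicas=0 -n "$APPS_NS"
}

scale_down_controllers
```

Running it with `DRY_RUN=0` would issue the three `oc scale` commands for real; the waiting and alert checks in the remaining steps are still manual.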

Merge criteria:

  • The commits are squashed in a cohesive manner and have meaningful messages.
  • For commits that came from upstream, [UPSTREAM] has been prepended to the commit message.
  • JIRA link(s): https://issues.redhat.com/browse/RHODS-5205
  • The Jira story is acked.
  • Live build image: quay.io/modh/rhods-operator-live-catalog:1.18.0-rhods-5205
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that the changes work.
  • QE contact acknowledges that this has been tested and is approved for merge.

deploy.sh Outdated
NOTEBOOK_SUFFIX="\/notebookController\/spawner"
notebook_spawner_host=$(oc::wait::object::availability "oc get route rhods-dashboard -n $ODH_PROJECT -o jsonpath='{.spec.host}'$NOTEBOOK_SUFFIX'" 2 30 | tr -d "'")
rhods_dashboard_host=$(oc::wait::object::availability "oc get route rhods-dashboard -n $ODH_PROJECT -o jsonpath='{.spec.host}'" 2 30 | tr -d "'")
notebook_spawner_host=$(oc::wait::object::availability "oc get svc notebook-controller-service -n $ODH_PROJECT -o go-template --template='{{.metadata.name}}.{{.metadata.namespace}}.svc.local:8080/metrics{{println}}'" 10 40 | tr -d "'")
Contributor

@atheo89 You'll need to escape the slash in /metrics so that the sed statement below ignores it after the $notebook_spawner_host variable is processed by the shell.
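
To illustrate the point about escaping: when a shell variable is expanded inside a `sed` `s/…/…/` expression, any `/` in its value is read as a delimiter. The sketch below shows two ways around that. The host value and the `<notebook_spawner_host>` placeholder are hypothetical stand-ins, not the actual strings used in deploy.sh; the old `NOTEBOOK_SUFFIX` value in the diff above uses the same pre-escaped `\/` form as option 1.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical value, shaped like the one built in deploy.sh.
host='notebook-controller-service.example-ns.svc:8080/metrics'

# Hypothetical template line containing a placeholder to substitute.
template='targets: [<notebook_spawner_host>]'

# Unescaped, "s/<notebook_spawner_host>/$host/g" would fail: the "/" before
# "metrics" ends the replacement early and sed rejects what follows.

# Option 1: turn every "/" into "\/" so sed treats it as a literal slash.
escaped_host=$(printf '%s' "$host" | sed 's|/|\\/|g')
printf '%s\n' "$template" | sed "s/<notebook_spawner_host>/$escaped_host/g"

# Option 2: use a delimiter that does not occur in the value.
printf '%s\n' "$template" | sed "s|<notebook_spawner_host>|$host|g"
```

Both commands print the template with the full host and path substituted. Option 2 avoids the escaping step, but it only works if the chosen delimiter can never appear in the value.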

Member Author

Thanks for the review, Landon. I added the escaping backslash before /metrics. I also noticed in the operator's logs that the notebook-controller-service is not available: all attempts of the availability function are exhausted, so the notebook-spawner target is never populated. Do you think I should add extra attempts or time to the function, or move that line later in the deploy.sh file?

Error from server (NotFound): services "notebook-controller-service" not found
Error from server (NotFound): services "notebook-controller-service" not found
Error from server (NotFound): services "notebook-controller-service" not found
Error from server (NotFound): services "notebook-controller-service" not found
Error from server (NotFound): services "notebook-controller-service" not found
....
....
....
Error from server (NotFound): services "notebook-controller-service" not found
Error from server (NotFound): services "notebook-controller-service" not found
Error from server (NotFound): services "notebook-controller-service" not found
Error from server (NotFound): services "notebook-controller-service" not found
Error from server (NotFound): services "notebook-controller-service" not found
Error from server (NotFound): services "notebook-controller-service" not found
Error from server (NotFound): services "notebook-controller-service" not found
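
For context on the question above, a bounded retry loop of this kind might look like the sketch below. The implementation of `oc::wait::object::availability` is not shown in this PR, so `retry_until_available` is a hypothetical stand-in, and its argument order (command, seconds between tries, number of attempts) is an assumption. The demo uses a fake lookup that fails twice and then succeeds, so it runs without a cluster.

```shell
#!/usr/bin/env bash
set -u

# Hypothetical helper: run a command until it succeeds with non-empty output,
# giving up after a fixed number of attempts.
retry_until_available() {
  local cmd="$1" interval="$2" attempts="$3"
  local i output
  for ((i = 1; i <= attempts; i++)); do
    if output=$(eval "$cmd" 2>/dev/null) && [ -n "$output" ]; then
      printf '%s\n' "$output"
      return 0
    fi
    # Sleep between tries, but not after the last one.
    [ "$i" -lt "$attempts" ] && sleep "$interval"
  done
  return 1
}

# Stand-in for "oc get svc ...": fails on the first two calls, then succeeds.
counter_file=$(mktemp)
echo 0 > "$counter_file"
flaky_lookup() {
  local n
  n=$(( $(cat "$counter_file") + 1 ))
  echo "$n" > "$counter_file"
  [ "$n" -ge 3 ] && echo "notebook-controller-service"
}

result=$(retry_until_available flaky_lookup 0 5)
echo "found: $result after $(cat "$counter_file") attempts"
```

A loop like this only helps if the object appears within the retry window. If the service is created by a later step of the same script, adding attempts would just delay the failure, and moving the lookup after that step would be the more direct fix.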

@VaishnaviHire
Contributor

VaishnaviHire commented Oct 6, 2022

@atheo89 The alert does not fire if the controller pod errors out (e.g., ImagePullBackOff). Is this expected?

Note: I do see the alerts when the deployment is scaled down.

Screen Shot 2022-10-06 at 8 14 18 PM

Screen Shot 2022-10-06 at 8 14 47 PM

Screen Shot 2022-10-06 at 8 19 03 PM

@atheo89
Member Author

atheo89 commented Oct 6, 2022

@VaishnaviHire, I am not sure, but I suppose that is fine in this case. If one of the two notebook controllers is still working, we don't get the alert, because the blackbox exporter target is populated by both controllers' services.

@jgarciao

jgarciao commented Oct 7, 2022

I also see the alert "RHODS Jupyter Probe Success Burn Rate" when both controllers are down. I will test it further on Monday.

@VaishnaviHire
Contributor

/lgtm

The alert was triggered when both controllers were scaled down.

Contributor

@LaVLaS LaVLaS left a comment

/lgtm

@openshift-ci

openshift-ci bot commented Oct 10, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: LaVLaS

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@LaVLaS LaVLaS merged commit 2e34a74 into red-hat-data-services:main Oct 10, 2022