Update rhods prometheus stack according the kfnbc migration #256

atheo89 · 2022-08-10T15:25:47Z

This PR cleans up the obsolete monitoring targets, alerts, and recording rules from the Prometheus configuration

Description

Applied the following changes:

Removed Traefik
Removed Jupyterhub and JupyterhubDB
Added monitoring target for Jupyter notebook spawner page (depends on the PR Configure the blackbox exporter to authenticate to probed services #258)
Updated the rhods_aggregate_availability metric to also check the spawner
Corrected kfnbc alert rule expressions to fire when the pods are unreachable

How Has This Been Tested?

Ensure that the Traefik proxy no longer appears in the recording rules or in the RHODS Probe Success Burn Rate alert rules
Ensure that Jupyterhub no longer appears in the recording rules
Ensure that Usage Metrics work
- for rhods_total_users, try to create and delete multiple notebooks with different users
- for rhods_active_users, try to create multiple notebooks with different users
Ensure that alerts for Jupyter notebook spawner are working
- to check that the spawner reacts as expected, delete the ServiceAccounts notebook-controller-service-account, odh-notebook-controller-manager, and rhods-dashboard (this disrupts the communication), then monitor the downtime via rhods_aggregate_availability
Ensure that the rhods_aggregate_availability metric correctly displays the downtime of rhods-dashboard, notebook-spawner, and the two combined
To test the KFNBC alerts:
- Ensure that the notebook-controller-deployment-xxxxx-xxx is down for more than 5 mins
- Ensure that the odh-notebook-controller-manager-xxxxx-xxx is down for more than 5 mins

Merge criteria:

The commits are squashed in a cohesive manner and have meaningful messages.
For commits that came from upstream, [UPSTREAM] has been prepended to the commit message.
JIRA link(s): https://issues.redhat.com/browse/RHODS-4765
The Jira story is acked.
Live build image: quay.io/accorvin/rhods-operator-live-catalog:1.0.0-rhods-4765
Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
The developer has manually tested the changes and verified that the changes work.
QE contact acknowledges that this has been tested and is approved for merge.

openshift-ci · 2022-08-10T15:25:53Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign anishasthana for approval by writing /assign @anishasthana in a comment. For more information see: The Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

atheo89 · 2022-08-11T07:49:32Z

Adding /cc @samu so that he is aware of the PRs that deal with changes to the monitoring stack.

atheo89 · 2022-08-11T10:46:54Z

Some notes and thoughts regarding the third point of this PR, "Add monitoring target for Jupyter notebook spawner page":

Case 1:
Adding the URLs
- rhods-dashboard-redhat-ods-applications.apps.atheodor.dev.datahub.redhat.com/notebookController/spawner or
- rhods-dashboard-redhat-ods-applications.apps.atheodor.dev.datahub.redhat.com/notebookController
as targets in the job user_facing_endpoints_status, and then deleting the ServiceAccounts notebook-controller-service-account, odh-notebook-controller-manager, and rhods-dashboard (with the operator scaled down), I don't see any downtime on the targets via the rhods_aggregate_availability metric. (The status of the scrape targets is UP.)

Case 2:
Adding the URLs
- rhods-dashboard-redhat-ods-applications.apps.atheodor.dev.datahub.redhat.com/notebookController/spawner or
- rhods-dashboard-redhat-ods-applications.apps.atheodor.dev.datahub.redhat.com/notebookController
as targets in the job user_facing_endpoints_status, and then stopping the notebook server (via the RHODS dashboard), I again don't see any downtime through the rhods_aggregate_availability metric.

I believe the reason we don't see any downtime while scraping these targets is that the kfnbc spawner is embedded in the RHODS dashboard, so the probe succeeds whenever the dashboard itself is up.

Any alternative thoughts on how to scrape the notebook spawner, or whether to scrape it at all?

accorvin · 2022-08-22T12:20:57Z

@atheo89 these changes look good to me (other than the small whitespace nit). Do you have a live build I can test with?

atheo89 · 2022-08-22T12:31:45Z

@accorvin today we had a discussion with @samuelvl about the live build. He is going to bundle all the monitoring changes together and provide a live build. Once it is ready, we will add the link to it.

anishasthana · 2022-08-23T12:47:31Z

@atheo89 the prometheus stack init container will also need to be updated to account for the removal of the traefik proxy
https://github.com/red-hat-data-services/odh-deployer/blob/main/monitoring/prometheus/prometheus.yaml#L22-L35

atheo89 · 2022-08-23T12:52:38Z

@anishasthana, I should remove the command, right?

anishasthana · 2022-08-23T13:00:03Z

It would be better to update it so that prometheus is waiting for one of the notebook controller pods to come up.
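
The suggestion above amounts to polling until a notebook controller pod is reachable instead of polling for Traefik. A minimal sketch of such a retry loop in POSIX shell follows. The helper name `wait_for`, its arguments, and the check passed to it are hypothetical and only illustrate the idea; the actual init container command is not shown in this thread.

```shell
# wait_for ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS is exhausted.
# Hypothetical helper for illustration; not taken from prometheus.yaml.
wait_for() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    # Stop as soon as the check passes.
    if "$@"; then
      return 0
    fi
    i=$((i + 1))
    # Sleep only if another attempt is still to come.
    if [ "$i" -le "$attempts" ]; then
      sleep "$delay"
    fi
  done
  return 1
}
```

The init container could then call something like `wait_for 60 5 check_notebook_controller`, where `check_notebook_controller` stands in for whatever readiness check the deployer settles on.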

atheo89 · 2022-08-24T10:37:19Z

Note: I added the changes that fix the faulty firing of the kfnbc alerts to this PR and closed the old one (#255).

accorvin · 2022-08-25T15:04:06Z

A new live build with these changes is available at quay.io/accorvin/rhods-operator-live-catalog:1.0.0-rhods-4765

accorvin · 2022-08-25T15:51:16Z

@atheo89 I'm getting an error in the deployer init container:

sed: -e expression #1, char 101: unknown option to `s'

Edit: the pod is here if you want to look at it: https://console-openshift-console.apps.acorvin.mxg0.s1.devshift.org/k8s/ns/redhat-ods-operator/pods/rhods-operator-5745bdcc4f-zthw4/logs?container=rhods-deployer

accorvin · 2022-08-25T16:16:20Z

I believe I've found the problem that was causing the deploy script to fail. I've added b9eb014 with a fix and am building a new live image now.
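
The commit message quoted later in this thread attributes the failure to unescaped slashes in NOTEBOOK_SUFFIX breaking the sed substitution. The sketch below reproduces that failure mode and shows two ways around it. The host value and the `<notebook_spawner_host>` placeholder are assumptions for illustration (the thread only shows `<jupyterhub_host>`), and the second option is an alternative that the PR did not use.

```shell
# Hypothetical values for illustration only.
NOTEBOOK_SUFFIX="/notebookController/spawner"
host="rhods-dashboard.example.com"
# Assumed placeholder name; the thread only shows <jupyterhub_host>.
template="targets: [<notebook_spawner_host>]"

# Unescaped, the slashes inside the value terminate the s/// expression
# early, so sed reads the remaining text as flags and reports an
# "unknown option" error:
#   sed "s/<notebook_spawner_host>/${host}${NOTEBOOK_SUFFIX}/g"

# Option 1: escape each slash in the value before substituting.
escaped=$(printf '%s' "${host}${NOTEBOOK_SUFFIX}" | sed 's|/|\\/|g')
result_escaped=$(printf '%s\n' "$template" | sed "s/<notebook_spawner_host>/${escaped}/g")

# Option 2 (an alternative, not what the PR did): use a delimiter
# that cannot occur in the value.
result_delim=$(printf '%s\n' "$template" | sed "s|<notebook_spawner_host>|${host}${NOTEBOOK_SUFFIX}|g")

printf '%s\n' "$result_escaped" "$result_delim"
```

Both variants produce the same line. Escaping keeps the existing `s/.../.../` commands in deploy.sh unchanged, while the alternative delimiter avoids the extra escaping step.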

jgarciao · 2022-08-25T16:29:54Z

deploy.sh

 rhods_dashboard_host=$(oc::wait::object::availability "oc get route rhods-dashboard -n $ODH_PROJECT -o jsonpath='{.spec.host}'" 2 30 | tr -d "'")

-sed -i "s/<jupyterhub_host>/$jupyterhub_host/g" monitoring/prometheus/prometheus-configs.yaml
+NOTEBOOK_SUFFIX="/notebookController/spawner"
+notebook_spawner_host=$(oc::wait::object::availability "oc get route rhods-dashboard -n $ODH_PROJECT -o jsonpath='{.spec.host}'$NOTEBOOK_SUFFIX'" 2 30 | tr -d "'")


This variable contains the route to the new spawner page (e.g. https://rhods-dashboard-redhat-ods-applications.apps.qeaisrhods-mon.XXXX.s1.devshift.org/notebookController/spawner), which is used in the alert RHODS Jupyter Probe Success Burn Rate.

What I observed in the live build is that, if the Kubeflow notebook controller pod and the ODH notebook controller pod are not running, the alert RHODS Jupyter Probe Success Burn Rate does not fire.

I believe the reason is that https://rhods-dashboard-redhat-ods-applications.apps.qeaisrhods-mon.XXXX.s1.devshift.org/notebookController/spawner renders the page correctly (returning HTTP 200) even if those pods are not running.

What happens in that situation is that, when you try to spawn a notebook, you get this error:

So I don't know whether the alert RHODS Jupyter Probe Success Burn Rate is useful with this implementation, given that we already have RHODS Dashboard Probe Success Burn Rate.

accorvin · 2022-08-25

@lucferbux @andrewballantyne this to me seems like functionality that should perhaps change on the notebook spawner page. If the backend dependencies aren't met, does it make sense for the spawner page to generate a 200 status code when you load it?

cfchase · 2022-08-25

I don't think we want the dashboard in the business of checking for status of backend controllers. I think we'd do a poor job of it since we can't really anticipate the ways in which it fails for any given controller. Honestly, I think this UX is pretty good the way it is since it's properly conveying an error from the notebook spawn.

accorvin · 2022-08-26

We definitely need to change something around how this works - either what we’re probing or how the page behaves. Note: I don’t think we should make this change now, but note it as a deficiency and fix it in 1.17.

Right now, kfnbc functionality other than the spawner UI can be broken and we wouldn’t know about it.

cfchase · 2022-08-26

It should be easy to add liveness probes to all the backend controllers so we aren't ever in this situation, shouldn't it?

accorvin · 2022-08-25T17:20:38Z

The issue with the odh deployer init container failing has been resolved by my latest change (this change is available in the live build, I pushed the new version to the same tag).

I'm seeing an issue now where the probes are getting 403 errors. I'm investigating why that is happening.

accorvin · 2022-08-25T17:34:29Z

Update: I found that the opendatahub-operator container in my live build did not properly include the changes from red-hat-data-services/odh-manifests#231

I'm generating a new live build that resolves that and will post with the status of my testing.

accorvin · 2022-08-25T18:18:55Z

Update: this looks to be working correctly now with the latest version of my live build

LaVLaS · 2022-08-26T02:57:57Z

Rhods livebuild for these changes with KFNBC and migration script: quay.io/llasmith/rhods-operator-live-catalog:1.16.0-dashboard-kfnbc

- Deleted JupyterHub from Prometheus configuration
- Changed alert name from JupyterHub to Jupyter on Builds rules
- Added kube_persistentvolumeclaim_info on Federate Prometheus to retrieve information of the PVCs
- Added Usage Metrics rules: rhods_total_users and rhods_current_users
- Added monitoring target for Jupyter notebook spawner page
- Added alert block that fires for the notebook spawner
- Changed the expression of RHODS Probe Success Burn Rate to look at the name label instead of instance, to keep uniformity with the spawner's alerts
- Updated rhods_aggregate_availability metric to display the notebook spawner
- Changed alerting rule name to RHODS Dashboard Probe Success Burn Rate
- Updated command on InitContainer to watch odh-notebook-controller-service instead of Traefik-proxy
- Corrected kfnbc alert rule expressions to fire when the pods are unreachable
- Updated the triage urls for SRE
- Use auth when probing dashboard with the blackbox exporter
- Escape slashes in the NOTEBOOK_SUFFIX variable. These slashes broke the sed command on line 226. Escaping them results in the variable being properly set.
- Corrected Usage Metrics recording rules

anishasthana · 2022-08-26T15:08:11Z

We've pulled Adriana's changes into #257

atheo89 · 2022-08-29T08:45:38Z

These changes have been pulled into #257

openshift-ci bot added the do-not-merge/work-in-progress label Aug 10, 2022

atheo89 marked this pull request as ready for review August 11, 2022 11:25

openshift-ci bot removed the do-not-merge/work-in-progress label Aug 11, 2022

openshift-ci bot requested review from LaVLaS and pablofelix August 11, 2022 11:25

atheo89 requested a review from anishasthana August 11, 2022 11:26

accorvin reviewed Aug 11, 2022

anishasthana mentioned this pull request Aug 16, 2022

Migrate completely from JupyterHub to KFNBC #257

Merged

atheo89 force-pushed the rhods-4765 branch 2 times, most recently from 04e9927 to e32b3d1 Compare August 17, 2022 12:51

accorvin reviewed Aug 19, 2022

atheo89 force-pushed the rhods-4765 branch from 2ae4c4b to 84e97a2 Compare August 22, 2022 12:22

atheo89 force-pushed the rhods-4765 branch from d7579cb to 0d9622a Compare August 23, 2022 14:36

atheo89 mentioned this pull request Aug 23, 2022

Corrected kfnbc alert rule expressions to firing when the pods are unreachable #255

Closed

jgarciao requested changes Aug 24, 2022

openshift-ci bot assigned jgarciao Aug 24, 2022

accorvin mentioned this pull request Aug 25, 2022

Configure the blackbox exporter to authenticate to probed services #258

Closed

accorvin reviewed Aug 25, 2022

jgarciao requested changes Aug 25, 2022

accorvin force-pushed the rhods-4765 branch from b9eb014 to 8c26ee3 Compare August 25, 2022 20:03

atheo89 force-pushed the rhods-4765 branch from d65ecae to 213057e Compare August 26, 2022 14:37

atheo89 force-pushed the rhods-4765 branch from 213057e to 6e6e42a Compare August 26, 2022 14:59

atheo89 closed this Aug 29, 2022
