Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update rhods prometheus stack according the kfnbc migration #256

Closed
wants to merge 1 commit into from

Conversation

atheo89
Copy link
Member

@atheo89 atheo89 commented Aug 10, 2022

This PR cleans up the obsolete monitoring targets, alerts, and recording rules from the Prometheus configuration

Description

Applied the following changes:

  • Removed Traefik
  • Removed Jupyterhub and JupyterhubDB
  • Added monitoring target for Jupyter notebook spawner page (Dependent by this PR Configure the blackbox exporter to authenticate to probed services #258)
  • Updated the rhods_aggregate_availability metric to check also the spawner
  • Corrected kfnbc alert rule expressions to firing when the pods are unreachable

How Has This Been Tested?

  • Ensure that Traefik proxy doesn't exist on recording rules and on RHODS Probe Success Burn Rate alert rules
  • Ensure that Jupyterhub doesn't exist on recording rules
  • To ensure that Usage Metrics work
    • for rhods_total_users try to create and delete multiple notebooks with different users
    • for rhods_active_users try to create multiple notebooks with different users
  • Ensure that alerts for Jupyter notebook spawner are working
    • to check if the spawner reacts as expected delete the ServiceAccounts notebook-controller-service-account, odh-notebook-controller-manager, and rhods-dashboard and monitor the downtime via rhods_aggregate_availability (That is going to disturb the communication)
  • Ensure that the rhods_aggregate_availability metric displays correctly the downtime of rhods-dashboard, notebook-spawner and combined
  • To test the KFNBC alerts:
    • Ensure that the notebook-controller-deployment-xxxxx-xxx is down for more than 5 mins
    • Ensure that the odh-notebook-controller-manager-xxxxx-xxx is down for more than 5 mins

Merge criteria:

  • The commits are squashed in a cohesive manner and have meaningful messages.
  • For commits that came from upstream, [UPSTREAM] has been prepended to the commit message.
  • JIRA link(s): https://issues.redhat.com/browse/RHODS-4765
  • The Jira story is acked.
  • Live build image: quay.io/accorvin/rhods-operator-live-catalog:1.0.0-rhods-4765
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that the changes work.
  • QE contact acknowledges that this has been tested and is approved for merge.

@openshift-ci
Copy link

openshift-ci bot commented Aug 10, 2022

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign anishasthana for approval by writing /assign @anishasthana in a comment. For more information see:The Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@atheo89
Copy link
Member Author

atheo89 commented Aug 11, 2022

Adding in /cc @samu to be aware for the PRs that deal with the changes on the monitoring stack

@atheo89
Copy link
Member Author

atheo89 commented Aug 11, 2022

Some notes and thoughts regarding the third point of this PR Add monitoring target for Jupyter notebook spawner page

Case 1:
Adding the URLs
- rhods-dashboard-redhat-ods-applications.apps.atheodor.dev.datahub.redhat.com/notebookController/spawner or
- rhods-dashboard-redhat-ods-applications.apps.atheodor.dev.datahub.redhat.com/notebookController
as targets into the job: user_facing_endpoints_status and alongside deleting the ServiceAccounts notebook-controller-service-account, odh-notebook-controller-manager and rhods-dashboard (Having the operator scaled down) I don't see any downtime on the targets via rhods_aggregate_availability metric. (The status of the scraping targets is UP)

Case 2:
Adding the URLs
- rhods-dashboard-redhat-ods-applications.apps.atheodor.dev.datahub.redhat.com/notebookController/spawner or
- rhods-dashboard-redhat-ods-applications.apps.atheodor.dev.datahub.redhat.com/notebookController
as targets into the job: user_facing_endpoints_status and stop notebook server (via rhods dashboard) I don't see any downtime again through rhods_aggregate_availability metric.

I believe the reason that we don't see any downtime while we scrape these targets is that the kfnbc spawner is embedded on rhods dashboard.

Any alternative thoughts regarding how to scrape or not the notebook spawner?

@atheo89 atheo89 marked this pull request as ready for review August 11, 2022 11:25
@accorvin
Copy link
Member

@atheo89 these changes look good to me (other than the small whitespace nit). Do you have a live build I can test with?

@atheo89
Copy link
Member Author

atheo89 commented Aug 22, 2022

@accorvin today, we had a discussion with @samuelvl about the live build. He is going to bunch all the monitoring changes together and he will provide a live build. Once it will be ready we will add the link to it.

@anishasthana
Copy link
Member

@atheo89 the prometheus stack init container will also need to be updated to account for traefik proxy
https://github.com/red-hat-data-services/odh-deployer/blob/main/monitoring/prometheus/prometheus.yaml#L22-L35

@atheo89
Copy link
Member Author

atheo89 commented Aug 23, 2022

@anishasthana, I should remove the command, right?

@anishasthana
Copy link
Member

It would be better to update it so that prometheus is waiting for one of the notebook controller pods to come up.

@atheo89
Copy link
Member Author

atheo89 commented Aug 24, 2022

Note: Added on this PR the changes for the fault firing of the kfnbc alerts, and I closed the old one. #255

monitoring/prometheus/prometheus-configs.yaml Outdated Show resolved Hide resolved
monitoring/prometheus/prometheus-configs.yaml Outdated Show resolved Hide resolved
monitoring/prometheus/prometheus-configs.yaml Outdated Show resolved Hide resolved
monitoring/prometheus/prometheus-configs.yaml Outdated Show resolved Hide resolved
@accorvin
Copy link
Member

accorvin commented Aug 25, 2022

A new live build with these changes is available at quay.io/accorvin/rhods-operator-live-catalog:1.0.0-rhods-4765

@accorvin
Copy link
Member

accorvin commented Aug 25, 2022

@atheo89 I'm getting an error in the deployer init container:

sed: -e expression #1, char 101: unknown option to `s'

Edit: the pod is here if you want to look at it: https://console-openshift-console.apps.acorvin.mxg0.s1.devshift.org/k8s/ns/redhat-ods-operator/pods/rhods-operator-5745bdcc4f-zthw4/logs?container=rhods-deployer

deploy.sh Show resolved Hide resolved
@accorvin
Copy link
Member

I believe I've found the problem that was causing the deploy script to fail. I've added b9eb014 with a fix and am building a new live image now.

monitoring/prometheus/prometheus-configs.yaml Show resolved Hide resolved
monitoring/prometheus/prometheus-configs.yaml Show resolved Hide resolved
rhods_dashboard_host=$(oc::wait::object::availability "oc get route rhods-dashboard -n $ODH_PROJECT -o jsonpath='{.spec.host}'" 2 30 | tr -d "'")

sed -i "s/<jupyterhub_host>/$jupyterhub_host/g" monitoring/prometheus/prometheus-configs.yaml
NOTEBOOK_SUFFIX="/notebookController/spawner"
notebook_spawner_host=$(oc::wait::object::availability "oc get route rhods-dashboard -n $ODH_PROJECT -o jsonpath='{.spec.host}'$NOTEBOOK_SUFFIX'" 2 30 | tr -d "'")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This variable contains the route to the new spawner page (e.g. https://rhods-dashboard-redhat-ods-applications.apps.qeaisrhods-mon.XXXX.s1.devshift.org/notebookController/spawner), that is used in the alert RHODS Jupyter Probe Success Burn Rate

What I have experimented in the livebuild is that, if the Kubeflow notebook controller pod and the ODH notebook controller pod are not running, the alert RHODS Jupyter Probe Success Burn Rate does not fire.

I believe the reason is that https://rhods-dashboard-redhat-ods-applications.apps.qeaisrhods-mon.XXXX.s1.devshift.org/notebookController/spawner renders the page correctly (returning HTTP 200) even if those pods are not running

What happens in that situation is that, when you try to spawn a notebook, you get this error:
error-spawning-notebook-when-pods-are-not-running

So, I don't know if the alert RHODS Jupyter Probe Success Burn Rate is useful with this implementation, having already RHODS Dashboard Probe Success Burn Rate

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lucferbux @andrewballantyne this to me seems like functionality that should perhaps change on the notebook spawner page. If the backend dependencies aren't met, does it make sense for the spawner page to generate a 200 status code when you load it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we want the dashboard in the business of checking for status of backend controllers. I think we'd do a poor job of it since we can't really anticipate the ways in which it fails for any given controller. Honestly, I think this UX is pretty good the way it is since it's properly conveying an error from the notebook spawn.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We definitely need to change something around how this works - either what we’re probing or how the page behaves. Note: I don’t think we should make this change now, but note it as a deficiency and fix it in 1.17.

Right now, kfnbc functionality other than the spawner UI can be broken and we wouldn’t know about it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be easy to add liveness probes to all the backend controllers so we aren't ever in this situatoin, shouldn't it?

@accorvin
Copy link
Member

The issue with the odh deployer init container failing has been resolved by my latest change (this change is available in the live build, I pushed the new version to the same tag).

I'm seeing an issue now where the probes are getting 403 errors. I'm investigating why that is happening.

@accorvin
Copy link
Member

Update: I found that the opendatahub-operator container in my live build did not properly include the changes from red-hat-data-services/odh-manifests#231

I'm generating a new live build that resolves that and will post with the status of my testing.

@accorvin
Copy link
Member

Update: this looks to be working correctly now with the latest version of my live build

@LaVLaS
Copy link
Contributor

LaVLaS commented Aug 26, 2022

Rhods livebuild for these changes with KFNBC and migration script: quay.io/llasmith/rhods-operator-live-catalog:1.16.0-dashboard-kfnbc

Deleted JupyterHub from Prometheus configuration

Changed alert name from JupyterHub to Jupyter on Builds rules

Added kube_persistentvolumeclaim_info on Federate Prometheus to retrive information of the PVCs

Added Usage Metrics rules: rhods_total_users and rhods_current_users

Added monitoring target for Jupyter notebook spawner page

Added alert block that fires the notebook spawner

Changed the expresion of RHODS Probe Success Burn Rate to look on name label instead of instance to keep an uniformity with the spawner's alerts

Updated rhods_aggregate_availability metric to display the notebook spawner

Changed alerting rule name to RHODS Dashboard Probe Success Burn Rate

Updated command on InitContainer to watch odh-notebook-controller-service instead Traefik-proxy

Corrected kfnbc alert rule expressions to firing when the pods are unrcachable

Updated the triage urls for SRE

Use auth when probing dashboard with the blabkbox exporter

Escape slashes in the NOTEBOOK_SUFFIX variable

These slashes broke the sed command on line 226. Escaping them resutls
in the variable being properly set.

Corrected Usage Metrics recording rules
@anishasthana
Copy link
Member

We've pulled Adriana's changes into #257

@atheo89 atheo89 closed this Aug 29, 2022
@atheo89
Copy link
Member Author

atheo89 commented Aug 29, 2022

These changes have been pulled into #257

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants