Update monitoring per component enabled/disabled #67

Merged

@zdtsw zdtsw commented Oct 2, 2023

ref: opendatahub-io#357

What's in the PR

  • add ods-configs manifests into the operator image
  • add monitoring manifests into the operator image
  • update the Dockerfile to get the above manifests from the operator repo directly
  • add a watch on the monitoring secret addon-management-odh-parameters and the alertmanager configmap
  • move the WithEventFilter predicates onto each Owns() to prevent unnecessary, never-ending reconciles
  • add a new watch from DSCI to DSC on deletion
  • when there is no DSC instance in the cluster, Prometheus should not have component rules in its config
  • add methods to the component interface to write component rules into prometheus.yaml
  • revert encoding part of the values in prometheus.yaml, since it needs plain text
  • add function MatchLineInFile to inject values into a file rather than replace placeholders
  • cherry-pick from ODH: PRs 597, 627, 583
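
To illustrate the "inject values into a file" idea from the list above, here is a minimal sketch using only the Go standard library. It is not the operator's actual MatchLineInFile: the helper names replaceMatchingLines and matchLineInFile, the "<key>: <value>" line format, and the sample config keys are all assumptions made for illustration.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// replaceMatchingLines is a hypothetical helper. Every line that contains one
// of the keys is rewritten as "<indent><key>: <value>", keeping the original
// indentation. Lines without a match are kept unchanged.
func replaceMatchingLines(content string, replacements map[string]string) (string, error) {
	var out []string
	scanner := bufio.NewScanner(strings.NewReader(content))
	for scanner.Scan() {
		line := scanner.Text()
		for key, value := range replacements {
			if strings.Contains(line, key) {
				// Leading whitespace of the original line.
				indent := line[:len(line)-len(strings.TrimLeft(line, " \t"))]
				line = indent + key + ": " + value
				break // at most one rewrite per line
			}
		}
		out = append(out, line)
	}
	if err := scanner.Err(); err != nil {
		return "", err
	}
	return strings.Join(out, "\n") + "\n", nil
}

// matchLineInFile applies replaceMatchingLines to a file in place.
// The name and signature are assumptions, not the operator's API.
func matchLineInFile(path string, replacements map[string]string) error {
	data, err := os.ReadFile(path)
	if err != nil {
		return fmt.Errorf("read %s: %w", path, err)
	}
	updated, err := replaceMatchingLines(string(data), replacements)
	if err != nil {
		return fmt.Errorf("scan %s: %w", path, err)
	}
	return os.WriteFile(path, []byte(updated), 0o644)
}

func main() {
	// Illustrative config fragment; keys and values are assumptions.
	config := "global:\n  smtp_from: placeholder\n  resolve_timeout: 5m\n"
	updated, err := replaceMatchingLines(config, map[string]string{
		"smtp_from": "alerts@example.com",
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Print(updated)
}
```

Matching on the key instead of on a placeholder token means the same file can be updated repeatedly (for example when the notification email changes), because the key is still present after the first injection.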

Update from 19th Oct 2023

quay.io/wenzhou/opendatahub-operator-catalog:v2.10.67-7
test case 4: with no DSC, there should be no component rules either
create DSCI first, check Prometheus config: 2 rules
create DSC with default managed components, check Prometheus config: the rules for these components are added
delete DSC, check Prometheus config: back to 2 rules

Update from 16th Oct 2023

quay.io/wenzhou/opendatahub-operator-catalog:v2.10.67-6

test case 3: run on a self-managed cluster
after the DSCI instance is created, Prometheus should not be created.
the log should show: controllers.DSCInitialization Monitoring enabled, won't apply changes {"cluster": "Self-Managed RHODS Mode"}
the odh-segment-key-config CM should be created in the applications namespace.

Update from 13th Oct 2023

quay.io/wenzhou/opendatahub-operator-catalog:v2.10.67-5

test case 1: email update

  • install on a clean cluster (it should be either a managed cluster or mocked as a managed cluster with a CSV)
  • check that the DSCI is created; no DSC should be created
  • wait until the operator pod stops reconciling; the log should show "Success: finish config managed monitoring stack!"
  • update the user email notification field, or update the secret directly
  • the operator pod should start to reconcile
  • after a while, check that the alertmanager CM is updated with the new email
  • check that the Prometheus pod has been restarted

test case 2: component plugin-plugout

  • install DSC
  • check the Prometheus config; only the "operator" and "deadmanssnitch" rules should be there
  • enable a bunch of components (kserve won't be useful in this test case)
  • check the Prometheus config again; new rules for the enabled components should have been added
  • same for the rules: only the rules for enabled components should be there. rhods metrics can differ; on a mocked cluster the status might say "UNKNOWN"
  • disable some components
  • check the Prometheus config again; the rule list should reflect the change
  • same for the rules
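
The expectation behind this test case (and test case 4) can be written down as a small function: two base rules are always present, and each managed component adds its own rule only while a DSC instance exists. This is a rough sketch, not the operator's code. The component type, the expectedRules function, and the component names are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// component is a hypothetical, simplified stand-in for a DSC component entry.
type component struct {
	name    string
	managed bool
}

// baseRules are expected in the Prometheus config even without a DSC instance.
var baseRules = []string{"deadmanssnitch", "operator"}

// expectedRules returns the sorted rule names a tester should find in the
// Prometheus config for the given cluster state.
func expectedRules(dscPresent bool, components []component) []string {
	rules := append([]string{}, baseRules...)
	if dscPresent {
		for _, c := range components {
			if c.managed {
				rules = append(rules, c.name)
			}
		}
	}
	sort.Strings(rules)
	return rules
}

func main() {
	// Illustrative component set; names and states are assumptions.
	comps := []component{
		{name: "workbenches", managed: true},
		{name: "dashboard", managed: true},
		{name: "kserve", managed: false},
	}
	fmt.Println(expectedRules(false, comps)) // no DSC: only the two base rules
	fmt.Println(expectedRules(true, comps))  // DSC present: base rules plus managed components
}
```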

Update:

newest image
quay.io/wenzhou/opendatahub-operator-catalog:v2.10.67-1 => with some fixes; monitoring is enabled on both Self-Managed and Managed clusters

new catsrc image is:
quay.io/wenzhou/opendatahub-operator-catalog:v2.10.67 => monitoring is enabled on the Managed cluster

for testing purposes:
quay.io/wenzhou/opendatahub-operator-catalog:v2.10.67-0 => this enables monitoring on both Self-Managed and Managed clusters.
the log should show "Monitoring enabled, should't apply changes but for test purpose we enable it here"

Old:

old test image I used myself when mocking a Self-Managed cluster
local build: quay.io/wenzhou/opendatahub-operator-catalog:v2.10.2-17
the debug info shown in the screenshots is not available in the new catsrc build.

when workbenches is enabled, from the operator pod log:

Screenshot from 2023-10-04 10-58-25

check configmap:
Screenshot from 2023-10-04 11-00-10

then disable workbenches:
Screenshot from 2023-10-04 11-47-50

check configmap (this might take a minute to get updated)
Screenshot from 2023-10-04 11-47-50

- Move manifests into operator
- Add: DW prometheus CM and merge alerting into one per component
- Change for replacing rules per component in yaml file
- Add: rolebindings for monitoring and missing service-ca cm
- Update: do not encode data before write into CM
- update: job user_facing_endpoints_status_* to component based config
- Update: change to add DSCI and reuse monitoring, remove SREMonitoring
- Add ods-configs manifests
- Fix missing rules after inject/remove rules
- Rename manifests files + skip part of DSCI reconcile if not initial
- chore: cherry-pick updates from upstream and correct DW SOP name
- test(dsci): sync unit test from upstream
- update(mcad): cleanup mcad targets
  see: red-hat-data-services/odh-deployer@9394bf9
- fix: replace namespace with placeholder
- fix(monitoring): missing replacement
- fix(monitoring job): wrong controller name
- add(dsci): watch on DSC when no instance left revert back to original prometheus config
- update(monitoring): add support when no DSC instance
- update(monitoring): manifests

Signed-off-by: Wen Zhou <wenzhou@redhat.com>

@VaishnaviHire VaishnaviHire left a comment

Tested successfully with ocp 4.13

/lgtm

@VaishnaviHire VaishnaviHire merged commit a195086 into red-hat-data-services:main Oct 20, 2023
2 checks passed