ServiceMonitor contains a hard-coded serverName that assumes the operator namespace is cert-utils-operator #138

Open
cigna-asoria opened this issue Jun 6, 2022 · 22 comments

Comments

@cigna-asoria

Hi -
We are on OpenShift 4.8.35 and updated our cert-utils to 1.3.10 in all our environments.
But we are getting an alert that the cert-utils metrics endpoint is down.
cert-utils is installed in namespace openshift-operators and not cert-utils-operator.

The endpoint is the IP and I can get those metrics per the commands you specify in the wiki, even using the service name.
But I'm getting this error:
Get "https://x.x.x.x:8443/metrics": x509: certificate is valid for cert-utils-operator-controller-manager-metrics-service.openshift-operators.svc, cert-utils-operator-controller-manager-metrics-service.openshift-operators.svc.cluster.local, not cert-utils-operator-controller-manager-metrics-service.cert-utils-operator.svc

So I'm wondering if the problem is the server_name in the Prometheus config.

tls_config:
  ca_file: /etc/prometheus/certs/secret_openshift-operators_cert-utils-operator-certs_tls.crt
  server_name: cert-utils-operator-controller-manager-metrics-service.cert-utils-operator.svc
  insecure_skip_verify: false

the server_name in the Prometheus config is not valid per the error message.
Can this be the problem when trying to pull metrics?
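
For readers following along, the mismatch in the error message can be checked offline. The following is a minimal sketch in Python, assuming plain case-insensitive matching of DNS names with no wildcard handling (real TLS verification does more than this). The function name `server_name_matches` is hypothetical; the names in `SANS` and `CONFIGURED` are copied from the error and the Prometheus config above.

```python
from typing import Iterable


def server_name_matches(server_name: str, sans: Iterable[str]) -> bool:
    """Return True if server_name equals one of the certificate's DNS SANs.

    Simplified on purpose: exact, case-insensitive comparison only,
    no wildcard support (an assumption for this sketch).
    """
    wanted = server_name.rstrip(".").lower()
    return any(wanted == san.rstrip(".").lower() for san in sans)


# DNS names the certificate is valid for, per the x509 error message.
SANS = [
    "cert-utils-operator-controller-manager-metrics-service.openshift-operators.svc",
    "cert-utils-operator-controller-manager-metrics-service.openshift-operators.svc.cluster.local",
]

# server_name from the generated Prometheus tls_config.
CONFIGURED = "cert-utils-operator-controller-manager-metrics-service.cert-utils-operator.svc"

print(server_name_matches(CONFIGURED, SANS))  # False: namespace part differs
print(server_name_matches(SANS[0], SANS))     # True
```

The configured name differs from the certificate names only in the namespace segment, which is consistent with the scrape failing TLS verification.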

@cigna-asoria
Author

Only 12 issues listed and yet no updates?
Can someone please assist

@davgordo
Contributor

It seems like the certificate being issued looks properly configured if the operator was installed to the openshift-operators namespace. But given that the service monitor seems to target a service in a namespace called cert-utils-operator, the DNS is not matching.

This shouldn't happen, because the template for the certificate resources takes into account the target namespace:
https://github.com/redhat-cop/cert-utils-operator/blob/v1.3.10/config/helmchart/templates/certificate.yaml

In this case, it seems that {{ .Release.Namespace }} did indeed get populated, but with the wrong namespace, which makes me think Helm somehow determined the wrong value. I'm not exactly sure how that happened.

A few assumptions to validate:

  1. I assume Helm is being used to provision the operator
  2. I assume enableCertManager=true and as a result cert-manager is providing the certificate
  3. I assume the Certificate custom resource contains dnsNames that include .openshift-operators.svc

If any of those assumptions are incorrect, please let me know. That will change my point of view.

And one (speculative) thing to try:
Assuming that Helm is confused about the target namespace, I'm curious what would happen if we were more explicit and used the --namespace flag when deploying. Perhaps that will result in the correct value substitution for {{ .Release.Namespace }}.

Thanks for your patience.

@cigna-asoria
Author

Hi @davgordo -
I did not install the cert-utils operator through Helm. I actually installed it through OperatorHub UI via the OpenShift Console.
Can I provide you with any additional information?

@davgordo
Contributor

Ah okay, thanks for the clarification, @cigna-asoria. I'm going to see if I can recreate the issue; it sounds like it should be pretty easy to recreate.

The only things that might be helpful for me to reference are:

  1. The yaml for cert-utils-operator-controller-manager-metrics-service
  2. The data from the certificate secret, or if not, just a list of the DNS names from the issued certificate

I might discover that the problem is not challenging to recreate in which case I'll be able to reference these things in my own environment. But if you have time, it couldn't hurt to have more info.

@cigna-asoria
Author

@davgordo -
We do have cert-manager installed and I just checked, there is no certificate for cert-utils like the one provided in https://github.com/redhat-cop/cert-utils-operator/blob/v1.3.10/config/helmchart/templates/certificate.yaml

Let me get the data you requested.

@davgordo
Contributor

davgordo commented Jun 14, 2022

Yes, so for context: when installing via Helm, we provide cert-manager support because we're making an assumption (sometimes it's a bad assumption) that users using Helm are probably targeting plain k8s.

When the target platform is OpenShift, on the other hand, there are some built-in certificate capabilities that we can leverage instead. Specifically you'll see this config in the annotations of the cert-utils-operator-controller-manager-metrics-service. Those annotations will essentially ask the platform to provide a certificate secret that matches up with the Service definition.

So with that background, I just used OLM to deploy this operator, and the result looked okay to me so far. If I decode the certificate, I see the following SANs:

  • cert-utils-operator-controller-manager-metrics-service.cert-utils-operator.svc
  • cert-utils-operator-controller-manager-metrics-service.cert-utils-operator.svc.cluster.local

Those look good because they reflect the cert-utils-operator namespace. So now I'm more curious about the certificate data and the service annotations that you are seeing in your environment.
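
The SANs listed in this thread all follow the pattern `<service>.<namespace>.svc` plus the same name with a cluster-domain suffix. A minimal sketch of that pattern, assuming the default cluster domain `cluster.local` (the helper name `service_dns_names` is hypothetical):

```python
from typing import List


def service_dns_names(service: str, namespace: str,
                      cluster_domain: str = "cluster.local") -> List[str]:
    """Return the two DNS names seen on the serving certificates in this thread.

    cluster_domain defaults to "cluster.local", which is an assumption;
    clusters can be configured with a different domain.
    """
    short = f"{service}.{namespace}.svc"
    return [short, f"{short}.{cluster_domain}"]


SERVICE = "cert-utils-operator-controller-manager-metrics-service"

# Namespace used in the maintainer's test install.
for name in service_dns_names(SERVICE, "cert-utils-operator"):
    print(name)

# Namespace used in the reporter's clusters.
for name in service_dns_names(SERVICE, "openshift-operators"):
    print(name)
```

Under this pattern, the names on the certificate depend on the namespace the operator is installed into, so any fixed `serverName` can only match one install location.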

@cigna-asoria
Author

Here is the service yaml. For the DNS names, how do I pull that information? I can't provide the secret since it contains certificates.
I removed the UID and IPs below.

kind: Service
apiVersion: v1
metadata:
  annotations:
    service.alpha.openshift.io/serving-cert-secret-name: cert-utils-operator-certs
  resourceVersion: '974328279'
  name: cert-utils-operator-controller-manager-metrics-service
  managedFields:
    - manager: catalog
      operation: Update
      apiVersion: v1
      fieldsType: FieldsV1
      fieldsV1:
        'f:metadata':
          'f:annotations':
            .: {}
            'f:service.alpha.openshift.io/serving-cert-secret-name': {}
          'f:labels':
            .: {}
            'f:control-plane': {}
          'f:ownerReferences':
            .: {}
            'k:{"uid":"<removed>"}':
              .: {}
              'f:apiVersion': {}
              'f:blockOwnerDeletion': {}
              'f:controller': {}
              'f:kind': {}
              'f:name': {}
              'f:uid': {}
        'f:spec':
          'f:ports':
            .: {}
            'k:{"port":8443,"protocol":"TCP"}':
              .: {}
              'f:name': {}
              'f:port': {}
              'f:protocol': {}
              'f:targetPort': {}
          'f:selector':
            .: {}
            'f:control-plane': {}
          'f:sessionAffinity': {}
          'f:type': {}
    - manager: olm
      operation: Update
      apiVersion: v1
      time: '2022-05-11T16:59:42Z'
      fieldsType: FieldsV1
      fieldsV1:
        'f:metadata':
          'f:labels':
            'f:operators.coreos.com/cert-utils-operator.openshift-operators': {}
  namespace: openshift-operators
  ownerReferences:
    - apiVersion: operators.coreos.com/v1alpha1
      kind: ClusterServiceVersion
      name: cert-utils-operator.v1.3.10
      controller: false
      blockOwnerDeletion: false
  labels:
    control-plane: cert-utils-operator
    operators.coreos.com/cert-utils-operator.openshift-operators: ''
spec:
  ports:
    - name: https
      protocol: TCP
      port: 8443
      targetPort: https
  selector:
    control-plane: cert-utils-operator
  clusterIP: x.x.x.x
  clusterIPs:
    - x.x.x.x
  type: ClusterIP
  sessionAffinity: None
  ipFamilies:
    - IPv4
  ipFamilyPolicy: SingleStack
status:
  loadBalancer: {}

@cigna-asoria
Author

Here is the DNS output.

Downloads % openssl x509 -in cert.crt -text -noout |grep DNS
DNS:cert-utils-operator-controller-manager-metrics-service.openshift-operators.svc, DNS:cert-utils-operator-controller-manager-metrics-service.openshift-operators.svc.cluster.local
Downloads %
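
When comparing several clusters, the `DNS:` entries in that `openssl` output can be pulled into a list for scripting. A minimal sketch, assuming the input is the text printed by `openssl x509 -text -noout` (or the grepped line shown above); the function name `parse_dns_sans` is hypothetical:

```python
from typing import List


def parse_dns_sans(openssl_text: str) -> List[str]:
    """Extract DNS SAN entries from `openssl x509 -text` output.

    Looks for comma-separated items of the form "DNS:<name>" and ignores
    everything else (IP SANs, other certificate fields).
    """
    names = []
    for line in openssl_text.splitlines():
        for entry in line.split(","):
            entry = entry.strip()
            if entry.startswith("DNS:"):
                names.append(entry[len("DNS:"):])
    return names


# The line printed by the grep command above.
OUTPUT = (
    "DNS:cert-utils-operator-controller-manager-metrics-service.openshift-operators.svc, "
    "DNS:cert-utils-operator-controller-manager-metrics-service.openshift-operators.svc.cluster.local"
)

for name in parse_dns_sans(OUTPUT):
    print(name)
```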

@cigna-asoria
Author

@davgordo - I provided the information above. All seems right so why did Prometheus use the wrong server_name?

@davgordo
Contributor

It doesn't look right to me, because I thought this operator was installed in the cert-utils-operator namespace, and the DNS names on the cert lead me to believe that it is installed in the openshift-operators namespace.

The operator is deployed to the cert-utils-operator namespace, right? Or did I misunderstand?

@cigna-asoria
Author

@davgordo - cert-utils is installed under openshift-operators, not cert-utils-operator. That is why I think we are running into this issue.

@davgordo
Contributor

> @davgordo - cert-utils is installed under openshift-operators not cert-utils-operator - that is why I think we are running into this issue.

Ah hah! My apologies for misunderstanding. So Prometheus is going to search for services usually by label. We can tell it what labels to search for with ServiceMonitor configuration. I would like to see that ServiceMonitor yaml if you can provide it.

My cluster spun down, but as soon as I spin back up, I will try to specify the openshift-operators namespace when I install with OLM and try again to recreate.

Wild guess but, you don't happen to have a namespace called cert-utils-operator on the same cluster, do you? Just eliminating some variables. I'm thinking a left-over Service that wasn't cleaned up from a previous installation could cause problems.

@cigna-asoria
Author

@davgordo No, we don't have a namespace called cert-utils-operator -- Let me check where I can pull the ServiceMonitor

@cigna-asoria
Author

@davgordo Found it, and I think the serverName under tlsConfig might be the problem?

Downloads># oc get ServiceMonitor cert-utils-operator-controller-manager-metrics-monitor -n openshift-operators -o yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  creationTimestamp: "2022-05-06T20:35:03Z"
  generation: 1
  labels:
    control-plane: cert-utils-operator
  managedFields:
  - apiVersion: monitoring.coreos.com/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:labels:
          .: {}
          f:control-plane: {}
        f:ownerReferences:
          .: {}
          k:{"uid":"<removed>"}:
            .: {}
            f:apiVersion: {}
            f:blockOwnerDeletion: {}
            f:controller: {}
            f:kind: {}
            f:name: {}
            f:uid: {}
      f:spec:
        .: {}
        f:endpoints: {}
        f:selector:
          .: {}
          f:matchLabels:
            .: {}
            f:control-plane: {}
    manager: catalog
    operation: Update
    time: "2022-05-11T16:59:28Z"
  name: cert-utils-operator-controller-manager-metrics-monitor
  namespace: openshift-operators
  ownerReferences:
  - apiVersion: operators.coreos.com/v1alpha1
    blockOwnerDeletion: false
    controller: false
    kind: ClusterServiceVersion
    name: cert-utils-operator.v1.3.10
  resourceVersion: "974327291"
spec:
  endpoints:
  - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
    interval: 30s
    port: https
    scheme: https
    tlsConfig:
      ca:
        secret:
          key: tls.crt
          name: cert-utils-operator-certs
          optional: false
      serverName: cert-utils-operator-controller-manager-metrics-service.cert-utils-operator.svc
  selector:
    matchLabels:
      control-plane: cert-utils-operator
Downloads>#

@davgordo
Contributor

Now we're cookin'. Server name is wrong there. Thanks for all your help with the extra info. The problem is clear now. We'll have to do some brainstorming for a fix.

@cigna-asoria
Author

@davgordo - Yeah! Please do keep me informed. I have many clusters with this issue that I definitely want to fix.

@davgordo
Contributor

@cigna-asoria actually, I don't know for sure whether OLM creates that service monitor automatically... Did you all configure that, or was that provided by the operator provisioning?

@cigna-asoria
Author

@davgordo - No, we did not configure that. We only upgraded/installed cert-utils instances through OperatorHub UI via the OpenShift Console. My take is that OpenShift deployed it.

@davgordo
Contributor

> @davgordo - No, we did not configure that. We only upgraded/installed cert-utils instances through OperatorHub UI via the OpenShift Console. My take is that OpenShift deployed it.

Ah I see it in my environment too. Thanks again.

@davgordo
Contributor

@cigna-asoria FYI, I know it's not an ideal fix, but I am able to modify the serverName manually and this change does not get overwritten by the operator. This might help you temporarily until we make the next release.
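
For anyone applying that manual workaround across many clusters, one possible way to script it is a JSON Patch passed to `oc patch --type=json`. The sketch below only builds and prints the command; it does not contact a cluster and has not been verified against one. It assumes the ServiceMonitor has a single endpoint at index 0 (as in the YAML posted earlier) and that the correct name is `<service>.<namespace>.svc`; the helper name `server_name_patch` is hypothetical.

```python
import json
import shlex


def server_name_patch(service: str, namespace: str, endpoint_index: int = 0) -> str:
    """Build a JSON Patch (RFC 6902) that replaces tlsConfig.serverName.

    endpoint_index=0 assumes the ServiceMonitor has one endpoint,
    as in the YAML shown in this thread.
    """
    patch = [{
        "op": "replace",
        "path": f"/spec/endpoints/{endpoint_index}/tlsConfig/serverName",
        "value": f"{service}.{namespace}.svc",
    }]
    return json.dumps(patch)


SERVICE = "cert-utils-operator-controller-manager-metrics-service"
MONITOR = "cert-utils-operator-controller-manager-metrics-monitor"
NAMESPACE = "openshift-operators"  # namespace the operator was installed into

patch = server_name_patch(SERVICE, NAMESPACE)
command = ["oc", "patch", "servicemonitor", MONITOR,
           "-n", NAMESPACE, "--type=json", "-p", patch]

# Print a shell-safe command line to review before running it by hand.
print(shlex.join(command))
```

Printing the command rather than executing it keeps the change reviewable, which seems prudent for a temporary fix that a later operator release is expected to replace.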

@davgordo changed the title from "OpenShift 4.8.x - Prometheus cert-utils metrics down" to "ServiceMonitor contains a hard-coded serverName that assumes the operator namespace is cert-utils-operator" on Jun 14, 2022

@cigna-asoria
Author

@davgordo - Thanks, I will go that route until a fix is in place. Thanks again!

@felixkrohn

This issue seems to persist, as the fix linked above apparently hasn't been merged. Could it be re-opened?


3 participants