[BUG] rancher monitoring chart lacks Network Policy permission to collect metrics from GUI's ingress-nginx pods #45603

Open
dmpe opened this issue May 24, 2024 · 0 comments
Labels: kind/bug, priority/2, regression, status/to-reproduce, team/observability&backup

dmpe commented May 24, 2024

Rancher Server Setup

  • Rancher version: 2.8.4
  • Installation option (Docker install/Helm Chart): Helm
    • If Helm Chart, Kubernetes Cluster and version (RKE1, RKE2, k3s, EKS, etc): RKE2
rke2 version v1.28.10+rke2r1 (b0d0d687d98f4fa015e7b30aaf2807b50edcc5d7)
go version go1.21.9 X:boringcrypto

Information about the Cluster

  • Kubernetes version: v1.28.10 +rke2r1
  • Cluster Type (Local/Downstream): LOCAL

User Information

  • What is the role of the user logged in? (Admin/Cluster Owner/Cluster Member/Project Owner/Project Member/Custom)

Admin


Reopening the issue from rancher/rke2#6000 per suggestion from @alexandreLamarre


Node(s) CPU architecture, OS, and Version: Linux my_secret_hostname 5.14.0-427.16.1.el9_4.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Apr 26 18:16:09 EDT 2024 x86_64 x86_64 x86_64 GNU/Linux

RHEL 9:

NAME="Red Hat Enterprise Linux"
VERSION="9.4 (Plow)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="9.4"
PLATFORM_ID="platform:el9"
PRETTY_NAME="Red Hat Enterprise Linux 9.4 (Plow)"
ANSI_COLOR="0;31"
LOGO="fedora-logo-icon"
CPE_NAME="cpe:/o:redhat:enterprise_linux:9::baseos"
HOME_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/9"
BUG_REPORT_URL="https://bugzilla.redhat.com/"

REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 9"
REDHAT_BUGZILLA_PRODUCT_VERSION=9.4
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="9.4"

Cluster Configuration:

  • RKE2 Cluster which is used for installation of management GUI
  • 3 Servers/3 control nodes which also have "worker" role so that some additional apps could run on them, e.g. rancher-monitoring, vault integration etc.

Describe the bug:

Nginx Ingress metrics can no longer be collected. Based on the discussion in rancher/rke2#6000, I originally assumed this was an rke2 problem; it now appears to be related to the rancher-monitoring Helm chart. The chart needs either an adjustment or an additional network policy that allows Prometheus to collect metrics from the nginx pods.

Steps To Reproduce:

The following RKE2 config is used to bootstrap the first control node:

$ sudo cat /etc/rancher/rke2/config.yaml
tls-san:
  - cluster.local
  - my_secret_ip
  - rancher-gui.mydomain.com
  - control-nodes-load-balancing.mydomain.com
cni: calico
node-name: my_secret_hostname 
profile: cis
selinux: True
write-kubeconfig-mode: "0644"
pod-security-admission-config-file: /etc/rancher/rke2/rke2-custom-pss.yaml
system-default-registry: "private docker registry"
audit-policy-file: /etc/rancher/rke2/audit-policy.yaml

  • Install the Rancher GUI 2.8.4 release (via Helm etc.)
  • Install the rancher-monitoring Helm chart (incl. CRDs). We currently use version 103.1.0+up45.31.1, but it should not matter much.
  • Among many others, the following ServiceMonitor will be installed:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  annotations:
    meta.helm.sh/release-name: rancher-monitoring
    meta.helm.sh/release-namespace: cattle-monitoring-system
  labels:
    app: rancher-monitoring-ingress-nginx
    app.kubernetes.io/managed-by: Helm
    component: ingress-nginx
    provider: kubernetes
    release: rancher-monitoring
  name: rancher-monitoring-ingress-nginx
  namespace: kube-system
spec:
  endpoints:
    - metricRelabelings:
        - action: replace
          replacement: my_secret_cluster_name
          sourceLabels:
            - __address__
          targetLabel: my_secret_cluster_name
        - action: replace
          replacement: local
          sourceLabels:
            - __address__
          targetLabel: cluster_id
      port: metrics
      tlsConfig:
        insecureSkipVerify: false
  jobLabel: component
  namespaceSelector:
    matchNames:
      - kube-system
  podTargetLabels:
    - component
    - pushprox-exporter
  selector:
    matchLabels:
      component: ingress-nginx
      k8s-app: pushprox-ingress-nginx-client
      provider: kubernetes
      release: rancher-monitoring

  • Check in the Prometheus GUI that the target serviceMonitor/kube-system/rancher-monitoring-ingress-nginx is DOWN, with the error Get "http://ip:10254/metrics": context deadline exceeded

Once the RKE2 default network policy, default-network-ingress-policy, is adjusted to also allow port 10254, it works again:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  annotations:
    np.rke2.io/ingress: resolved
  name: default-network-ingress-policy
  namespace: kube-system
spec:
  ingress:
    - ports:
        - port: http
          protocol: TCP
        - port: https
          protocol: TCP
        - port: 10254
          protocol: TCP
  podSelector:
    matchLabels:
      app.kubernetes.io/name: rke2-ingress-nginx
  policyTypes:
    - Ingress
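
Instead of editing the RKE2-managed policy, a separate additive policy might achieve the same result. The following is only an untested sketch: the policy name is hypothetical, and it assumes Prometheus runs in the cattle-monitoring-system namespace and that the namespace carries the standard kubernetes.io/metadata.name label. The pod label and port are taken from the adjusted policy above.

```yaml
# Hypothetical additive policy, not part of the rancher-monitoring chart.
# Assumption: Prometheus pods run in cattle-monitoring-system.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-monitoring-ingress-nginx-metrics  # hypothetical name
  namespace: kube-system
spec:
  # Same pod selector as default-network-ingress-policy
  podSelector:
    matchLabels:
      app.kubernetes.io/name: rke2-ingress-nginx
  policyTypes:
    - Ingress
  ingress:
    # Allow only the metrics port, and only from the monitoring namespace
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: cattle-monitoring-system
      ports:
        - port: 10254
          protocol: TCP
```

Because NetworkPolicies are additive, this would leave the default policy untouched and open port 10254 only to the monitoring namespace rather than to every source.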

We are not using anything related to calico's global deny or similar. All network policies are default, not customized by us in any way.

Expected behavior:

Metrics about rke2-ingress-nginx-controllers can be collected/shown in grafana.

Actual behavior:

Metrics cannot be collected because a Network Policy (apparently default-network-ingress-policy) does not allow the metrics port (10254) of the nginx ingress pods. (The scrape target comes from a Service object called pushprox-ingress-nginx-client in the kube-system namespace.)

dmpe added the kind/bug label May 24, 2024
dmpe changed the title from "[BUG] rancher monitoring chart lacks Network Policy permission to collect metrics from GUI ingress-nginx pods" to "[BUG] rancher monitoring chart lacks Network Policy permission to collect metrics from GUI's ingress-nginx pods" May 24, 2024
alexandreLamarre added the team/observability&backup label May 27, 2024