[BUG] rancher monitoring chart lacks Network Policy permission to collect metrics from GUI's ingress-nginx pods #45603

dmpe · 2024-05-24T21:28:37Z

Rancher Server Setup

Rancher version: 2.8.4
Installation option (Docker install/Helm Chart): Helm
- If Helm Chart, Kubernetes Cluster and version (RKE1, RKE2, k3s, EKS, etc): RKE2

rke2 version v1.28.10+rke2r1 (b0d0d687d98f4fa015e7b30aaf2807b50edcc5d7)
go version go1.21.9 X:boringcrypto

Information about the Cluster

Kubernetes version: v1.28.10+rke2r1
Cluster Type (Local/Downstream): LOCAL

User Information

What is the role of the user logged in? (Admin/Cluster Owner/Cluster Member/Project Owner/Project Member/Custom)

Admin

Reopening the issue from rancher/rke2#6000 per suggestion from @alexandreLamarre

Node(s) CPU architecture, OS, and Version: Linux my_secret_hostname 5.14.0-427.16.1.el9_4.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Apr 26 18:16:09 EDT 2024 x86_64 x86_64 x86_64 GNU/Linux

RHEL 9:

NAME="Red Hat Enterprise Linux"
VERSION="9.4 (Plow)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="9.4"
PLATFORM_ID="platform:el9"
PRETTY_NAME="Red Hat Enterprise Linux 9.4 (Plow)"
ANSI_COLOR="0;31"
LOGO="fedora-logo-icon"
CPE_NAME="cpe:/o:redhat:enterprise_linux:9::baseos"
HOME_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/9"
BUG_REPORT_URL="https://bugzilla.redhat.com/"

REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 9"
REDHAT_BUGZILLA_PRODUCT_VERSION=9.4
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="9.4"

Cluster Configuration:

RKE2 Cluster which is used for installation of management GUI
3 servers (control-plane nodes) that also carry the "worker" role so that additional apps can run on them, e.g. rancher-monitoring, Vault integration, etc.

Describe the bug:

Nginx Ingress metrics can no longer be collected. Following the discussion in rancher/rke2#6000, I initially assumed this was an RKE2 issue; it now appears to be related to the rancher-monitoring Helm chart. The chart needs either an adjustment or an additional network policy that allows Prometheus to collect metrics from the ingress-nginx pods.

Steps To Reproduce:

Installed RKE2 using my custom, drastically simplified Ansible playbook, which is based on 2 public GitHub repos: https://github.com/rancherfederal/rke2-ansible and https://github.com/lablabs/ansible-role-rke2.

The following RKE2 config is used to bootstrap the first control node:

$ sudo cat /etc/rancher/rke2/config.yaml
tls-san:
  - cluster.local
  - my_secret_ip
  - rancher-gui.mydomain.com
  - control-nodes-load-balancing.mydomain.com
cni: calico
node-name: my_secret_hostname 
profile: cis
selinux: True
write-kubeconfig-mode: "0644"
pod-security-admission-config-file: /etc/rancher/rke2/rke2-custom-pss.yaml
system-default-registry: "private docker registry"
audit-policy-file: /etc/rancher/rke2/audit-policy.yaml

Install the Rancher GUI 2.8.4 release (via Helm etc.)
Install the rancher-monitoring Helm chart (incl. CRDs). We currently use version 103.1.0+up45.31.1, but it should not matter that much.
Among many others, the following ServiceMonitor will be installed:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  annotations:
    meta.helm.sh/release-name: rancher-monitoring
    meta.helm.sh/release-namespace: cattle-monitoring-system
  labels:
    app: rancher-monitoring-ingress-nginx
    app.kubernetes.io/managed-by: Helm
    component: ingress-nginx
    provider: kubernetes
    release: rancher-monitoring
  name: rancher-monitoring-ingress-nginx
  namespace: kube-system
spec:
  endpoints:
    - metricRelabelings:
        - action: replace
          replacement: my_secret_cluster_name
          sourceLabels:
            - __address__
          targetLabel: my_secret_cluster_name
        - action: replace
          replacement: local
          sourceLabels:
            - __address__
          targetLabel: cluster_id
      port: metrics
      tlsConfig:
        insecureSkipVerify: false
  jobLabel: component
  namespaceSelector:
    matchNames:
      - kube-system
  podTargetLabels:
    - component
    - pushprox-exporter
  selector:
    matchLabels:
      component: ingress-nginx
      k8s-app: pushprox-ingress-nginx-client
      provider: kubernetes
      release: rancher-monitoring

Check in the Prometheus GUI that the target serviceMonitor/kube-system/rancher-monitoring-ingress-nginx is DOWN, with the error Get "http://ip:10254/metrics": context deadline exceeded

Once the RKE2 default network policy, default-network-ingress-policy, is adjusted to also allow port 10254, it works again:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  annotations:
    np.rke2.io/ingress: resolved
  name: default-network-ingress-policy
  namespace: kube-system
spec:
  ingress:
    - ports:
        - port: http
          protocol: TCP
        - port: https
          protocol: TCP
        - port: 10254
          protocol: TCP
  podSelector:
    matchLabels:
      app.kubernetes.io/name: rke2-ingress-nginx
  policyTypes:
    - Ingress
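
A narrower alternative would be a separate policy that leaves the RKE2 default policy untouched and opens only the metrics port to the monitoring namespace. The following is a minimal sketch, not a tested fix. The policy name is hypothetical. It assumes that Prometheus scrapes the pods directly from the cattle-monitoring-system namespace and that the namespace carries the kubernetes.io/metadata.name label, which Kubernetes sets automatically on recent versions. If scrape traffic reaches the pods with a node IP as its source, the namespaceSelector will not match and the rule would need adjusting.

```yaml
# Sketch only: an additional NetworkPolicy selecting the same ingress-nginx pods.
# NetworkPolicies are additive, so this allows port 10254 on top of the
# http/https rules of the default policy.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-monitoring-to-ingress-nginx-metrics  # hypothetical name
  namespace: kube-system
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: rke2-ingress-nginx  # same selector as the default policy
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              # assumed: the namespace where rancher-monitoring runs Prometheus
              kubernetes.io/metadata.name: cattle-monitoring-system
      ports:
        - port: 10254
          protocol: TCP
```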

We are not using anything related to calico's global deny or similar. All network policies are default, not customized by us in any way.

Expected behavior:

Metrics about rke2-ingress-nginx-controllers can be collected/shown in grafana.

Actual behavior:

Metrics cannot be collected because a Network Policy is missing the metrics port for the nginx ingress pods. (This comes from a Service object called pushprox-ingress-nginx-client in the kube-system namespace.)

dmpe added the kind/bug Issues that are defects reported by users or that we know have reached a real release label May 24, 2024

dmpe changed the title ~~[BUG] rancher monitoring chart lacks Network Policy permission to collect metrics from GUI ingress-nginx pods~~ [BUG] rancher monitoring chart lacks Network Policy permission to collect metrics from GUI's ingress-nginx pods May 24, 2024

alexandreLamarre added the team/observability&backup the team that is responsible for monitoring/logging and BRO label May 27, 2024

MKlimuszka added status/to-reproduce regression priority/2 labels Jun 6, 2024

MKlimuszka assigned dharmit Jul 2, 2024
