
All services with a selector for the mgr daemon should be updated if there are multiple mgr daemons and the active mgr changes #7988

Closed
psavva opened this issue May 25, 2021 · 16 comments · Fixed by #9467

@psavva

psavva commented May 25, 2021

Is this a bug report or feature request?

  • Bug Report

Deviation from expected behavior:
I have set up my rook-ceph cluster and have enabled the dashboard in the CephCluster CRD.

[root@DGCVM01 ceph]# kubectl describe cephcluster -n rook-ceph
Name:         rook-ceph
Namespace:    rook-ceph
Labels:       <none>
Annotations:  <none>
API Version:  ceph.rook.io/v1
Kind:         CephCluster
Metadata:
  Creation Timestamp:  2021-05-24T08:59:07Z
  Finalizers:
    cephcluster.ceph.rook.io
  Generation:        3
  Resource Version:  628081
  UID:               e0d6dbe4-9e30-4d4f-8083-030a6b69cc5c
Spec:
  Ceph Version:
    Allow Unsupported:  false
    Image:              ceph/ceph:v15.2.11
  Cleanup Policy:
    Allow Uninstall With Volumes:  false
    Confirmation:
    Sanitize Disks:
      Data Source:                                    zero
      Iteration:                                      1
      Method:                                         quick
  Continue Upgrade After Checks Even If Not Healthy:  false
  Crash Collector:
    Disable:  false
  Dashboard:
    Enabled:           true
    Ssl:               true
  Data Dir Host Path:  /var/lib/rook
  Disruption Management:
    Machine Disruption Budget Namespace:  openshift-machine-api
    Manage Machine Disruption Budgets:    false
    Manage Pod Budgets:                   true
    Osd Maintenance Timeout:              30
    Pg Health Check Timeout:              0

I have also installed the External Dashboard nodeport service

[root@DGCVM01 ceph]# cat dashboard-external-https.yaml
apiVersion: v1
kind: Service
metadata:
  name: rook-ceph-mgr-dashboard-external-https
  namespace: rook-ceph # namespace:cluster
  labels:
    app: rook-ceph-mgr
    rook_cluster: rook-ceph # namespace:cluster
spec:
  ports:
    - name: dashboard
      port: 8443
      protocol: TCP
      targetPort: 8443
  selector:
    app: rook-ceph-mgr
    rook_cluster: rook-ceph
  sessionAffinity: None
  type: NodePort

When visiting the Ceph dashboard, I'm redirected to a wrong URL.
You will notice that I'm accessing my internal IP and node port, but I'm redirected to the URL rook-ceph-mgr-a-84c875bd95-svhnd. This is the bug.
See here:

[Screenshot: the request to <NodeIP>:<NodePort> is redirected to rook-ceph-mgr-a-84c875bd95-svhnd]

Expected behavior:
The Ceph Dashboard should appear OK.

Environment:

  • OS (e.g. from /etc/os-release):
[root@DGCVM01 ceph]# cat /etc/os-release
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"
  • Kernel (e.g. uname -a):
    Linux DGCVM01 3.10.0-1160.el7.x86_64 #1 SMP Mon Oct 19 16:18:59 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
  • Cloud provider or hardware configuration:
  • Rook version (use rook version inside of a Rook Pod):
[root@rook-ceph-operator-65965c66b5-mznfn /]# rook version
rook: v1.6.3
go: go1.16.3

  • Storage backend version (e.g. for ceph do ceph -v):
[root@rook-ceph-operator-65965c66b5-mznfn /]# ceph -v
ceph version 16.2.2 (e8f22dde28889481f4dda2beb8a07788204821d3) pacific (stable)
  • Kubernetes version (use kubectl version):
[root@DGCVM01 ceph]# kubectl version
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.1", GitCommit:"5e58841cce77d4bc13713ad2b91fa0d961e69192", GitTreeState:"clean", BuildDate:"2021-05-12T14:18:45Z", GoVersion:"go1.16.4", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.1", GitCommit:"5e58841cce77d4bc13713ad2b91fa0d961e69192", GitTreeState:"clean", BuildDate:"2021-05-12T14:12:29Z", GoVersion:"go1.16.4", Compiler:"gc", Platform:"linux/amd64"}
  • Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift): BareMetal
  • Storage backend status (e.g. for Ceph use ceph health in the Rook Ceph toolbox):
[root@rook-ceph-tools-fc5f9586c-r68cl /]# ceph status
  cluster:
    id:     9ce72e85-8040-43a1-b19e-529ce34b32fb
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum a,b,c (age 27h)
    mgr: a(active, since 27h), standbys: b
    osd: 3 osds: 3 up (since 27h), 3 in (since 27h)

  data:
    pools:   2 pools, 33 pgs
    objects: 333 objects, 948 MiB
    usage:   5.8 GiB used, 6.0 TiB / 6 TiB avail
    pgs:     33 active+clean

@psavva psavva added the bug label May 25, 2021
@parth-gr
Member

@psavva which node IP address are you using to call the dashboard service? You should use the IP address of the node on which the MGR pod is running.
PS: You'll be able to contact the NodePort service from outside the cluster by requesting <NodeIP>:<NodePort>.
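
For example (sketch only, using the service and label names from the manifest above), you can find the assigned node port and the node each mgr pod runs on with:

kubectl -n rook-ceph get svc rook-ceph-mgr-dashboard-external-https   # shows the assigned NodePort
kubectl -n rook-ceph get pods -l app=rook-ceph-mgr -o wide            # shows the node each mgr pod runs on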

@psavva
Author

psavva commented May 25, 2021

I'm certainly using the correct IP address; you can see I've hit the service and the response is a redirect.

I've made sure by accessing it both via port forwarding and via the node IP; both methods fail.

I'm able to access my other dashboards fine. This is a new bug.

I have reproduced it on 2 different clusters.

@parth-gr
Member

Are the other clusters using the same Rook and Ceph versions? Or is it that you can access the dashboard of the first Ceph object store but not of the second one? If that is the problem, then IMPORTANT: please note the dashboard will only be enabled for the first Ceph object store created by Rook.

And if their versions mismatch, you can try upgrading Ceph to v15.2.12, as there are recent fixes in it; it will also become the default in Rook v1.6.4.

@travisn
Member

travisn commented May 26, 2021

@psavva If you're getting a redirect, the response is coming from the standby mgr. The active mgr would respond properly, but the standby mgr only responds with redirects. When two mgrs are deployed, Rook periodically updates the dashboard service to direct traffic to the active mgr. If you're defining your own dashboard service based on a node port, you would also need to update it to only direct traffic to the active mgr.
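
For illustration only (a sketch, not something Rook does for custom services today): assuming mgr "a" is currently the active daemon, the custom service's selector would have to be narrowed down to it, for example via the ceph_daemon_id pod label, and updated again whenever failover happens:

  selector:
    app: rook-ceph-mgr
    rook_cluster: rook-ceph
    ceph_daemon_id: a   # assumed active daemon; must be changed when the active mgr changes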

@psavva
Author

psavva commented May 26, 2021

@psavva If you're getting a redirect, the response is coming from the standby mgr. The active mgr would respond properly, but the standby mgr only responds with redirects. When two mgrs are deployed, Rook periodically updates the dashboard service to direct traffic to the active mgr. If you're defining your own dashboard service based on a node port, you would also need to update it to only direct traffic to the active mgr.

Thank you for this info. I'll update my configuration in the morning and report back. However, it seems this should be automated somehow... Maybe a new label to indicate the active manager would be a good solution; it would also require an update to the current deployment manifests for Kubernetes.

@travisn
Member

travisn commented May 26, 2021

Rook does automatically update the rook-ceph-mgr-dashboard service when the active mgr changes. Can you pick up on those selectors? Rook doesn't update the labels on the mgr pods to indicate which one is active, for several reasons: the node where the previously active mgr was running may be down, and the implementation is a sidecar on the mgr pods that doesn't have the ability to update its own labels.
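
As a rough interim workaround (a sketch only; it assumes jq is available and uses the external service name from the manifest earlier in this thread), the selector that Rook maintains on rook-ceph-mgr-dashboard could be copied onto the custom service whenever the active mgr changes:

# Copy the selector Rook keeps up to date on the managed dashboard service
# onto the custom external service (name assumed from the manifest above).
SELECTOR=$(kubectl -n rook-ceph get svc rook-ceph-mgr-dashboard -o json | jq -c '.spec.selector')
kubectl -n rook-ceph patch svc rook-ceph-mgr-dashboard-external-https \
  --type merge -p "{\"spec\":{\"selector\":$SELECTOR}}"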

@psavva
Author

psavva commented Jun 10, 2021

@travisn I'm trying to figure out which is the active manager.
I cannot find any labels that indicate this.

[root@DGCVM01 ~]# kubectl -n rook-ceph describe pod rook-ceph-mgr-b-697f7f548b-bbqjx
Name:         rook-ceph-mgr-b-697f7f548b-bbqjx
Namespace:    rook-ceph
Priority:     0
Node:         dgcvm05/172.20.50.21
Start Time:   Fri, 28 May 2021 11:03:40 +0300
Labels:       app=rook-ceph-mgr
              ceph_daemon_id=b
              ceph_daemon_type=mgr
              instance=b
              mgr=b
              pod-template-hash=697f7f548b
              rook_cluster=rook-ceph
Annotations:  prometheus.io/port: 9283
              prometheus.io/scrape: true

and

[root@DGCVM01 ~]# kubectl -n rook-ceph describe pod rook-ceph-mgr-a-84c875bd95-t6tb6
Name:         rook-ceph-mgr-a-84c875bd95-t6tb6
Namespace:    rook-ceph
Priority:     0
Node:         dgcvm02/172.20.50.18
Start Time:   Fri, 28 May 2021 11:10:31 +0300
Labels:       app=rook-ceph-mgr
              ceph_daemon_id=a
              ceph_daemon_type=mgr
              instance=a
              mgr=a
              pod-template-hash=84c875bd95
              rook_cluster=rook-ceph
Annotations:  prometheus.io/port: 9283
              prometheus.io/scrape: true

@travisn
Member

travisn commented Jun 10, 2021

@psavva The labels on the mgr pods are not updated when the active mgr changes, but the labels on the rook-ceph-mgr and rook-ceph-mgr-dashboard services are updated when the active mgr changes. What about this: kubectl -n rook-ceph describe svc rook-ceph-mgr
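
For example, to see just the selectors on both managed services (they should show which daemon the services currently point at):

kubectl -n rook-ceph describe svc rook-ceph-mgr rook-ceph-mgr-dashboard | grep -E '^(Name|Selector):'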

@github-actions

github-actions bot commented Sep 9, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.

@dredwilliams

FYI -- I was having issues with the dashboard:

  • lots of http failures in pieces of the dashboard
  • timeouts -- redirecting to an internal cluster IP
  • generally slow responsiveness when it did show something

I am using a loadbalancer service to access the dashboard from a host separate from the k8s cluster, and the comments above about MGRs switching triggered an 'aha!' moment: I had just increased the MGR count from 1 to 2 when it started breaking. Not sure why, but it seems that I was actually talking to BOTH MGRs -- but only one of them was actually serving the dashboard?

I reduced back to 1 MGR and the dashboard started working again.

Rook v1.7.4
Ceph v16.2.6

@travisn
Member

travisn commented Sep 29, 2021

@dredwilliams Were you referencing the rook-ceph-mgr-dashboard service? Or did you have another service defined? When there are two mgrs, only one of them is active and Rook will update the service to automatically point to the active mgr. But if you have defined another service, Rook wouldn't know to update it.

@dredwilliams

I had created the loadbalancer service using "dashboard-loadbalancer.yaml" ... which (looking now) created a new service 'rook-ceph-mgr-dashboard-loadbalancer' ... so that was probably my problem. I guess I expected that if I used a provided capability, it would respond appropriately ...

Thanks!

@travisn
Member

travisn commented Sep 30, 2021

Agreed, Rook should be able to update any service that has a selector for the app: rook-ceph-mgr daemon. We'll take a look at this.
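
For reference, the affected services could be enumerated by filtering on that selector (a sketch using jq):

kubectl -n rook-ceph get svc -o json | \
  jq -r '.items[] | select(.spec.selector.app == "rook-ceph-mgr") | .metadata.name'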

edit: Issue title is updated to reflect the proposal

@travisn travisn changed the title Ceph-Dashboard does not work All services with a selector for the mgr daemon should be updated if there are multiple mgr daemons and the active mgr changes Sep 30, 2021
@travisn travisn removed the wontfix label Sep 30, 2021
@travisn travisn added this to To do in v1.8 via automation Sep 30, 2021
@github-actions

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.

@prazumovsky

@travisn any updates on this?

@travisn travisn self-assigned this Dec 9, 2021
@travisn
Member

travisn commented Dec 9, 2021

@travisn any updates on this?

I'm hoping to look at it next week now.

@travisn travisn moved this from To do to In progress in v1.8 Jan 11, 2022
v1.8 automation moved this from In progress to Done Jan 13, 2022