
daemon_id used for RGW pods is not unique in case of multiple instances #13674

rkachach opened this issue Feb 2, 2024 · 9 comments

@rkachach (Contributor) commented Feb 2, 2024

Is this a bug report or feature request?

  • Bug Report

Deviation from expected behavior:

Just to give some background on this issue: with the introduction of ceph-exporter, some of the performance metrics are generated by this new component (instead of by the prometheus mgr module). The problem is that this component doesn't have access to some metadata related to the metrics, such as hostname, daemon, etc., which is available only in the mgr prometheus module. To fix this, ceph-exporter uses a unique instance_id label as the key to join the metrics it generates with the metadata coming from the prometheus module's rgw_metadata.

In practice, ceph-exporter only has access to the rgw daemon socket filename, which it uses to extract the instance_id. In cephadm deployments, for example, this value is set to hrgsea, taken from daemon_name=rgw.foo.ceph-node-00.hrgsea.2.94739968030880, which is unique across all the rgw daemons. However, in a Rook deployment this is not the case. The following are the socket names generated for an object store named my-store with two instances:

$ ls -l /var/lib/rook/exporter/
total 0
srwxr-xr-x 1 167 167 0 Feb  6 08:38 ceph-client.ceph-exporter.asok
srwxr-xr-x 1 167 167 0 Feb  6 08:39 ceph-client.rgw.my.store.a.1.94334785460384.asok
srwxr-xr-x 1 167 167 0 Feb  6 08:39 ceph-client.rgw.my.store.a.1.94820349827232.asok

In this case ceph-exporter ends up setting instance_id to a, which is not unique.
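
To illustrate the parsing problem, here is a minimal Go sketch (not the actual ceph-exporter code, which does this internally) that assumes the socket names follow the ceph-client.rgw.<daemon_id>.<pid>.<cctid>.asok pattern shown above:

package main

import (
    "fmt"
    "strings"
)

// extractInstanceID is an illustrative re-implementation of the filename
// parsing: it strips the .asok extension, drops the trailing <pid> and
// <cctid> fields, and returns the last remaining dot-separated token,
// which is what ends up being used as the instance_id label.
func extractInstanceID(sockName string) string {
    name := strings.TrimSuffix(sockName, ".asok")
    parts := strings.Split(name, ".")
    if len(parts) < 3 {
        return ""
    }
    return parts[len(parts)-3]
}

func main() {
    // Both sockets from the listing above yield "a", so the label collides.
    fmt.Println(extractInstanceID("ceph-client.rgw.my.store.a.1.94334785460384.asok")) // a
    fmt.Println(extractInstanceID("ceph-client.rgw.my.store.a.1.94820349827232.asok")) // a
}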

On the Rook side, the daemon_id label generated for each rgw instance is not unique either. For example, the following spec results in two instances with daemon_id set to my-store:

apiVersion: ceph.rook.io/v1
kind: CephObjectStore
metadata:
  name: my-store
  namespace: rook-ceph # namespace:cluster
spec:
  metadataPool:
    replicated:
      size: 1
  dataPool:
    replicated:
      size: 1
  preservePoolsOnDelete: false
  gateway:
    port: 80
    # securePort: 443
    instances: 2

This daemon id is later used to generate the daemon socket name, so we need it to be unique; otherwise ceph-exporter cannot extract this information from the filename.

Expected behavior:

Rook should append the corresponding letter (a, b, c, ...) to the daemon_id to make it unique.
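
For illustration only, a minimal Go sketch of the index-to-suffix mapping this implies (indexToSuffix is a hypothetical helper, not Rook's actual code, and the exact separator/format is up to Rook):

package main

import "fmt"

// indexToSuffix maps replica index 0, 1, 2, ... to the letter suffix
// a, b, c, ..., z, aa, ab, ... so that each instance gets a unique daemon_id.
func indexToSuffix(i int) string {
    const letters = "abcdefghijklmnopqrstuvwxyz"
    suffix := ""
    for {
        suffix = string(letters[i%26]) + suffix
        i = i/26 - 1
        if i < 0 {
            break
        }
    }
    return suffix
}

func main() {
    // With instances: 2 in the spec above, the two daemon ids would become
    // something like my-store-a and my-store-b instead of my-store for both.
    for i := 0; i < 2; i++ {
        fmt.Printf("my-store-%s\n", indexToSuffix(i))
    }
}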

How to reproduce it (minimal and precise):

Use the provided spec to create RGW instances and observe the daemon_id for each one of them.

File(s) to submit:

  • Cluster CR (custom resource), typically called cluster.yaml, if necessary

Logs to submit:

  • Operator's logs, if necessary

  • Crashing pod(s) logs, if necessary

    To get logs, use kubectl -n <namespace> logs <pod name>
    When pasting logs, always surround them with backticks or use the insert code button from the Github UI.
    Read GitHub documentation if you need help.

Cluster Status to submit:

  • Output of kubectl commands, if necessary

    To get the health of the cluster, use kubectl rook-ceph health
    To get the status of the cluster, use kubectl rook-ceph ceph status
    For more details, see the Rook kubectl Plugin

Environment:

  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Cloud provider or hardware configuration:
  • Rook version (use rook version inside of a Rook Pod):
  • Storage backend version (e.g. for ceph do ceph -v):
  • Kubernetes version (use kubectl version):
  • Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift):
  • Storage backend status (e.g. for Ceph use ceph health in the Rook Ceph toolbox):
rkachach added the bug label Feb 2, 2024
rkachach self-assigned this Feb 2, 2024
@BlaineEXE (Member) commented:

We used to manage the RGWs like we do other daemons, where each has a different deployment. I think the code for handling this might still be in Rook, but I'm unsure.

However, this is not the most desirable design. We would like to make the fullest possible use of k8s primitives, and in this case, continuing to be able to set the number of replicas for an RGW Deployment would be ideal. Following on from this idea, it would be best if RGWs continued not to need individual IDs.

If it does become necessary, perhaps we could see whether each RGW can take its ID from the Pod downward API. Or it might be better to use a StatefulSet for RGWs to achieve a similar thing.

@BlaineEXE (Member) commented:

RGWs don't have unique IDs for a few reasons.

In our experience, RGWs generally don't need to have unique IDs.

  • RGWs use a standard HTTP(S) web protocol and don't have the same hard requirements from Ceph for them to have individual IDs.
  • Usually, if a user wants to change an RGW setting, they want to change all RGWs and not just one. Thus, users also don't need them to have unique IDs.
  • As a side note, this also means that RGWs are the only component in Rook that we are able to use a single Service for, versus all other daemons needing individual Services.

Technically speaking, in Kubernetes, the reason comes down to using a Deployment with replicas > 1 to run RGWs. This is a much simpler deployment model than tracking individual Deployments for each RGW. If Ceph weren't so picky about the IDs of things like mons and OSDs, Rook would prefer to use the same model for deploying those as well.

I want to understand more about why you are flagging this as an issue. Is it blocking some feature? Or is it just confusing that things are different for RGWs?

Another deployment model that offers more ID management is StatefulSets. These work like Deployments, but each replica has a unique ID added. This could be used for RGWs if we do find that we need unique IDs for each.

However, I would still suggest that we try to look into making changes to Ceph such that it doesn't care about RGW individual identity as much, if possible. It was a win for the Rook project when we could use a single Deployment for all RGWs to simplify the management.

@BlaineEXE (Member) commented:

We discussed in huddle using the random pod ID as each RGW's individual ID. Redo says cephadm uses a random ID for each RGW (though I'm not clear exactly when the ID changes with cephadm). I think this suggests that Rook can safely use the randomized pod ID to achieve a similar thing without much risk.

I think the solution will be pretty easy and involve 2 small changes:

  • use the downward API to apply the pod ID into an env var: POD_ID
    • can use POD_NAME to see a similar example
  • use that env var in the flag that applies the RGW's daemon name
    • something like --daemon-id=rook-ceph-rgw-my-store.a.$(POD_ID) (I don't remember the flag name or full details, so don't take this as a spec, just a hint; see the sketch after this list)
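
A rough Go sketch of what those two changes could look like in the operator's pod spec builder (the flag name --id, the env var name POD_ID, and the function name are placeholders, not verified against Rook's actual RGW code):

package main

import (
    "fmt"

    v1 "k8s.io/api/core/v1"
)

// rgwPodIDEnvAndArgs is a placeholder showing the two suggested changes:
// expose the pod's name via the downward API as POD_ID, then reference it
// when building the flag that sets the RGW daemon name.
func rgwPodIDEnvAndArgs(storeName string) ([]v1.EnvVar, []string) {
    env := []v1.EnvVar{
        {
            Name: "POD_ID",
            ValueFrom: &v1.EnvVarSource{
                FieldRef: &v1.ObjectFieldSelector{FieldPath: "metadata.name"},
            },
        },
    }
    // Kubernetes expands $(POD_ID) in container args from the env vars above.
    // "--id" stands in for whatever flag actually sets the RGW daemon name.
    args := []string{fmt.Sprintf("--id=rgw.%s.a.$(POD_ID)", storeName)}
    return env, args
}

func main() {
    env, args := rgwPodIDEnvAndArgs("my-store")
    fmt.Println(env[0].Name, args[0])
}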

@rkachach (Contributor, Author) commented Apr 9, 2024

@BlaineEXE thanks for the detailed suggestion. I can give it a try and see if it works or not 👍

@rkachach (Contributor, Author) commented Apr 9, 2024

@BlaineEXE

I gave it a try, but unfortunately using the POD_ID doesn't seem to be easy. The problem is that the keyring name must be the same as the --id flag used to run the daemon. However, the keyring is generated before pod creation in:

https://github.com/rook/rook/blob/master/pkg/operator/ceph/object/rgw.go#L137-L142

At this point we don't have the pod_id yet, so we can't use it to generate the keyring. As an alternative, I tried adding a random suffix to the pod name (and consequently to the keyring). This seems to work, but the operator starts creating more and more rgw instances (because of the random part, I guess). You can find the changes in the PR:

#14048

I think this happens because the random suffix changes the secret (which the controller watches for changes), but I'm not sure, as I don't fully understand how the rgw reconcile should work in this case.

@travisn (Member) commented Apr 9, 2024

The only way we can continue to use a single RGW deployment is if we can use a single keyring even with different pod names at runtime. If the operator creates separate keyrings for each RGW instance, we have to create separate deployments, which means reverting to the old behavior and missing out on benefits like the pod autoscaler. To avoid that regression, we would really need a way for the metrics to have a unique id without a separate keyring.

@BlaineEXE (Member) commented:

We discussed in huddle, but I wanted to add a couple of notes here. It may be possible to get RGWs to share a keyring while still issuing each one a unique ID using the Pod ID. Looking through the cephx doc (link below), I notice that the mons (which share a keyring) seem to use the principal mon., with a trailing dot. Conversely, mgr, OSD, etc. use principals like osd.{$id}. I wonder if it's possible to add a trailing dot to the RGW principal we use for its auth, to allow us to reuse the same key for all RGWs.

https://docs.ceph.com/en/latest/rados/configuration/auth-config-ref/

I've been looking around for ceph examples, and they are hard to find. This doc (below) uses ceph tell osd.*, so maybe we can try using the * in the principal as well.

https://docs.ceph.com/en/pacific/rados/configuration/ceph-conf/#override-values

This doc may also have helpful info, but I can't say for sure: https://docs.ceph.com/en/latest/rados/operations/user-management/

After looking more deeply, I am becoming concerned that the rgw.<store-name>.a. strategy may not work, but it is at least worth trying.


github-actions bot commented Jun 8, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.

@bootc commented Jun 19, 2024

With newer versions of Prometheus and/or kube-prometheus-stack, this causes issues. When two RGW pods are scheduled on the same node, we now see the following in our Prometheus logs:

ts=2024-06-18T21:34:01.914Z caller=scrape.go:1738 level=warn component="scrape manager" scrape_pool=serviceMonitor/rook-ceph/rook-ceph-exporter/0 target=http://10.62.1.5:9926/metrics msg="Error on ingesting samples with different value but same timestamp" num_dropped=45

This also leads to PrometheusDuplicateTimestamps alerts.
