
daemon_id used for RGW pods is not unique in case of multiple instances #13674

rkachach opened this issue Feb 2, 2024 · 9 comments

@rkachach (Contributor) commented Feb 2, 2024

Is this a bug report or feature request?

  • Bug Report

Deviation from expected behavior:

Just to give some background on this issue: with the introduction of ceph-exporter, some of the performance metrics are generated by this new component (instead of by the prometheus mgr module). The problem is that this component doesn't have access to some metadata related to the metrics, such as hostname, daemon, etc., which is available only in the mgr prometheus module. To fix this, ceph-exporter uses a unique instance_id label as the key to join the metrics it generates with the metadata coming from the prometheus module's rgw_metadata.

In practice, ceph-exporter only has access to the rgw daemon socket filename, which it uses to extract the instance_id. In cephadm deployments, for example, this value is set to hrgsea, taken from daemon_name=rgw.foo.ceph-node-00.hrgsea.2.94739968030880, which is unique across all the rgw daemons. However, in a Rook deployment this is not the case. The following are the socket names generated for an object store named my-store with two instances:

$ ls -l /var/lib/rook/exporter/
total 0
srwxr-xr-x 1 167 167 0 Feb  6 08:38 ceph-client.ceph-exporter.asok
srwxr-xr-x 1 167 167 0 Feb  6 08:39 ceph-client.rgw.my.store.a.1.94334785460384.asok
srwxr-xr-x 1 167 167 0 Feb  6 08:39 ceph-client.rgw.my.store.a.1.94820349827232.asok

In this case ceph-exporter ends up setting instance_id to a, which is not unique.
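
To illustrate the parsing problem, here is a minimal Go sketch (not the actual ceph-exporter code, which does this internally) that assumes the socket names follow the ceph-client.rgw.<daemon_id>.<pid>.<cctid>.asok pattern shown above:

package main

import (
    "fmt"
    "strings"
)

// extractInstanceID is an illustrative re-implementation of the filename
// parsing: it strips the .asok extension, drops the trailing <pid> and
// <cctid> fields, and returns the last remaining dot-separated token,
// which is what ends up being used as the instance_id label.
func extractInstanceID(sockName string) string {
    name := strings.TrimSuffix(sockName, ".asok")
    parts := strings.Split(name, ".")
    if len(parts) < 3 {
        return ""
    }
    return parts[len(parts)-3]
}

func main() {
    // Both sockets from the listing above yield "a", so the label collides.
    fmt.Println(extractInstanceID("ceph-client.rgw.my.store.a.1.94334785460384.asok")) // a
    fmt.Println(extractInstanceID("ceph-client.rgw.my.store.a.1.94820349827232.asok")) // a
}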

On the Rook side, the daemon_id label generated for each rgw instance is not unique either. For example, the following spec results in two instances with daemon_id set to my-store:

apiVersion: ceph.rook.io/v1
kind: CephObjectStore
metadata:
  name: my-store
  namespace: rook-ceph # namespace:cluster
spec:
  metadataPool:
    replicated:
      size: 1
  dataPool:
    replicated:
      size: 1
  preservePoolsOnDelete: false
  gateway:
    port: 80
    # securePort: 443
    instances: 2

This daemon id is later used to generate the daemon socket name, so we need it to be unique; otherwise ceph-exporter cannot extract this information from the filename.

Expected behavior:

Rook should append the corresponding letter (a, b, c, ...) to the daemon_id to make it unique.
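
For illustration only, a minimal Go sketch of the index-to-suffix mapping this implies (indexToSuffix is a hypothetical helper, not Rook's actual code, and the exact separator/format is up to Rook):

package main

import "fmt"

// indexToSuffix maps replica index 0, 1, 2, ... to the letter suffix
// a, b, c, ..., z, aa, ab, ... so that each instance gets a unique daemon_id.
func indexToSuffix(i int) string {
    const letters = "abcdefghijklmnopqrstuvwxyz"
    suffix := ""
    for {
        suffix = string(letters[i%26]) + suffix
        i = i/26 - 1
        if i < 0 {
            break
        }
    }
    return suffix
}

func main() {
    // With instances: 2 in the spec above, the two daemon ids would become
    // something like my-store-a and my-store-b instead of my-store for both.
    for i := 0; i < 2; i++ {
        fmt.Printf("my-store-%s\n", indexToSuffix(i))
    }
}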

How to reproduce it (minimal and precise):

Use the provided spec to create RGW instances and observe the daemon_id for each one of them.

File(s) to submit:

  • Cluster CR (custom resource), typically called cluster.yaml, if necessary

Logs to submit:

  • Operator's logs, if necessary

  • Crashing pod(s) logs, if necessary

    To get logs, use kubectl -n <namespace> logs <pod name>
    When pasting logs, always surround them with backticks or use the insert code button from the Github UI.
    Read GitHub documentation if you need help.

Cluster Status to submit:

  • Output of kubectl commands, if necessary

    To get the health of the cluster, use kubectl rook-ceph health
    To get the status of the cluster, use kubectl rook-ceph ceph status
    For more details, see the Rook kubectl Plugin

Environment:

  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Cloud provider or hardware configuration:
  • Rook version (use rook version inside of a Rook Pod):
  • Storage backend version (e.g. for ceph do ceph -v):
  • Kubernetes version (use kubectl version):
  • Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift):
  • Storage backend status (e.g. for Ceph use ceph health in the Rook Ceph toolbox):
rkachach added the bug label Feb 2, 2024
rkachach self-assigned this Feb 2, 2024
@BlaineEXE (Member) commented:

We used to manage the RGWs like we do other daemons, where each has a different deployment. I think the code for handling this might still be in Rook, but I'm unsure.

However, this is not the most desirable design. We would like to make the fullest possible use of k8s primitives, and in this case, continuing to be able to set the number of replicas for an RGW Deployment would be ideal. Following on from this idea, it would be best if RGWs continued not to need individual IDs.

If it does become necessary, perhaps we could see whether each RGW can take its ID from the Pod downward API. Or it might be better to use a StatefulSet for RGWs to achieve a similar thing.

@BlaineEXE (Member) commented:

RGWs don't have unique IDs for a few reasons.

In our experience, RGWs generally don't need to have unique IDs.

  • RGWs use a standard HTTP(S) web protocol and don't have the same hard requirements from Ceph for them to have individual IDs.
  • Usually, if a user wants to change an RGW setting, they want to change all RGWs and not just one. Thus, users also don't need them to have unique IDs.
  • As a side note, this also means that RGWs are the only component in Rook that we are able to use a single Service for, versus all other daemons needing individual Services.

Technically speaking, in Kubernetes, the reason comes down to using a Deployment with replicas > 1 to run RGWs. This is a much simpler deployment model than tracking individual Deployments for each RGW. If Ceph weren't so picky about the IDs of things like mons and OSDs, Rook would prefer to use the same model for deploying those as well.

I want to understand more about why you are flagging this as an issue. Is it blocking some feature? Or is it just confusing that things are different for RGWs?

Another deployment model that offers more ID management is StatefulSets. These work like Deployments, but each replica has a unique ID added. This could be used for RGWs if we do find that we need unique IDs for each.

However, I would still suggest that we try to look into making changes to Ceph such that it doesn't care about RGW individual identity as much, if possible. It was a win for the Rook project when we could use a single Deployment for all RGWs to simplify the management.

@BlaineEXE (Member) commented:

We discussed in huddle using the random pod ID as each RGW's individual ID. Redo says cephadm uses a random ID for each RGW (though I'm not clear exactly when the ID changes with cephadm). I think this suggests that Rook can safely use the randomized pod ID to achieve a similar thing without much risk.

I think the solution will be pretty easy and involve 2 small changes:

  • use the downward API to apply the pod ID into an env var: POD_ID
    • can use POD_NAME to see a similar example
  • use that env var in the flag that applies the RGW's daemon name
    • something like --daemon-id=rook-ceph-rgw-my-store.a.$(POD_ID) (I don't remember the flag name or full details, so don't take this as a spec, just a hint; see the sketch after this list)
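
A rough Go sketch of what those two changes could look like in the operator's pod spec builder (the flag name --id, the env var name POD_ID, and the function name are placeholders, not verified against Rook's actual RGW code):

package main

import (
    "fmt"

    v1 "k8s.io/api/core/v1"
)

// rgwPodIDEnvAndArgs is a placeholder showing the two suggested changes:
// expose the pod's name via the downward API as POD_ID, then reference it
// when building the flag that sets the RGW daemon name.
func rgwPodIDEnvAndArgs(storeName string) ([]v1.EnvVar, []string) {
    env := []v1.EnvVar{
        {
            Name: "POD_ID",
            ValueFrom: &v1.EnvVarSource{
                FieldRef: &v1.ObjectFieldSelector{FieldPath: "metadata.name"},
            },
        },
    }
    // Kubernetes expands $(POD_ID) in container args from the env vars above.
    // "--id" stands in for whatever flag actually sets the RGW daemon name.
    args := []string{fmt.Sprintf("--id=rgw.%s.a.$(POD_ID)", storeName)}
    return env, args
}

func main() {
    env, args := rgwPodIDEnvAndArgs("my-store")
    fmt.Println(env[0].Name, args[0])
}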

@rkachach (Contributor, Author) commented Apr 9, 2024

@BlaineEXE thanks for the detailed suggestion. I can give it a try and see if it works or not 👍

@rkachach (Contributor, Author) commented Apr 9, 2024

@BlaineEXE

I gave it a try, but unfortunately using the POD_ID doesn't seem to be easy. The problem is that the keyring name must be the same as the --id flag used to run the daemon. However, the keyring is generated before pod creation in:

https://github.com/rook/rook/blob/master/pkg/operator/ceph/object/rgw.go#L137-L142

At this point we don't have the pod_id yet, so we can't use it to generate the keyring. As an alternative, I tried adding a random suffix to the pod name (and consequently to the keyring). This seems to work, but the operator starts creating more and more rgw instances (because of the random part, I guess). You can find the changes in the PR:

#14048

I think this happens because the random suffix changes the secret (which the controller watches for changes), but I'm not sure, as I don't fully understand how the rgw reconcile should work in this case.

@travisn (Member) commented Apr 9, 2024

The only way we can continue to use a single RGW deployment is if we can use a single keyring even with different pod names at runtime. If the operator creates separate keyrings for each RGW instance, we have to create separate deployments, which means reverting to the old behavior and missing out on benefits like the pod autoscaler. To avoid that regression, we would really need a way for the metrics to have a unique id without a separate keyring.

@BlaineEXE (Member) commented:

We discussed in huddle, but I wanted to add a couple of notes here. It may be possible to get RGWs to share a keyring while still issuing each one a unique ID using the Pod ID. Looking through the cephx doc (link below), I notice that the mons (which share a keyring) seem to use the principal mon., with a trailing dot. Conversely, mgr, OSD, etc. use principals like osd.{$id}. I wonder if it's possible to add a trailing dot to the RGW principal we use for its auth, to allow us to reuse the same key for all RGWs.

https://docs.ceph.com/en/latest/rados/configuration/auth-config-ref/

I've been looking around for ceph examples, and they are hard to find. This doc (below) uses ceph tell osd.*, so maybe we can try using the * in the principal as well.

https://docs.ceph.com/en/pacific/rados/configuration/ceph-conf/#override-values

This doc may also have helpful info, but I can't say for sure: https://docs.ceph.com/en/latest/rados/operations/user-management/

After looking more deeply, I am becoming concerned that the rgw.<store-name>.a. strategy may not work, but it is at least worth trying.


github-actions bot commented Jun 8, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.

@bootc commented Jun 19, 2024

With newer versions of Prometheus and/or kube-prometheus-stack, this causes issues. When two RGW pods are scheduled on the same node, we now see the following in our Prometheus logs:

ts=2024-06-18T21:34:01.914Z caller=scrape.go:1738 level=warn component="scrape manager" scrape_pool=serviceMonitor/rook-ceph/rook-ceph-exporter/0 target=http://10.62.1.5:9926/metrics msg="Error on ingesting samples with different value but same timestamp" num_dropped=45

This also leads to PrometheusDuplicateTimestamps alerts.
