
Cannot create CephObjectStore with external ceph cluster #13827

Open

achernya opened this issue Feb 27, 2024 · 11 comments

achernya commented Feb 27, 2024

Is this a bug report or feature request?

  • Bug Report

Deviation from expected behavior:

I have an external ceph cluster that I imported using the instructions at https://rook.io/docs/rook/v1.13/Getting-Started/intro/. The cluster has rbd and cephfs services installed and exposed, and those were imported successfully. However, this ceph cluster does not have an existing rgw running.

I then went and followed the instructions on https://rook.io/docs/rook/latest-release/CRDs/Object-Storage/ceph-object-store-crd/ to create a CephObjectStore, placing the resource in the rook-ceph-external namespace.
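
For reference, my CephObjectStore was modeled on object.yaml and looked roughly like the sketch below (pool sizes and gateway settings here are illustrative, not my exact values):

apiVersion: ceph.rook.io/v1
kind: CephObjectStore
metadata:
  name: my-store
  namespace: rook-ceph-external
spec:
  # rgw metadata pools, replicated across hosts
  metadataPool:
    failureDomain: host
    replicated:
      size: 3
  # rgw data pool
  dataPool:
    failureDomain: host
    replicated:
      size: 3
  # rook should provision these gateway pods in-cluster
  gateway:
    port: 80
    instances: 1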

This resulted in the operator having the following logs:

2024-02-27 19:12:08.437400 I | ceph-spec: detecting the ceph image version for image quay.io/ceph/ceph:v18.2.1...
2024-02-27 19:12:12.865292 I | ceph-spec: detected ceph image version: "18.2.1-0 reef"
2024-02-27 19:12:15.686430 I | ceph-object-controller: reconciling object store deployments
2024-02-27 19:12:15.854261 I | ceph-object-controller: ceph object store gateway service running at 10.254.252.62
2024-02-27 19:12:15.854313 I | ceph-object-controller: reconciling object store pools
2024-02-27 19:12:16.762490 E | ceph-object-controller: failed to reconcile CephObjectStore "rook-ceph-external/my-store". failed to create object store deployments: failed to create object pools: failed to create metadata pools: failed to create pool "my-store.rgw.control": failed to create replicated crush rule "my-store.rgw.control": failed to create crush rule my-store.rgw.control: exit status 13

I wasn't sure what exit status 13 meant, so I enabled debug logs. That didn't help, as CephToolCommand.Run doesn't seem to log its output anywhere I can tell inside createReplicationCrushRule, so I ended up strace'ing outside the operator container to figure out what ceph command the operator was running, which turned out to be

/usr/bin/ceph status --connect-timeout=15 --cluster=rook-ceph-external --conf=/var/lib/rook/rook-ceph-external/rook-ceph-external.config --name=client.admin --keyring=/var/lib/rook/rook-ceph-external/client.admin.keyring --format json

If I run that command myself, I get

2024-02-27T18:58:37.787+0000 7ff5feaf2700 -1 auth: unable to find a keyring on /var/lib/rook/rook-ceph-external/client.admin.keyring: (2) No such file or directory
2024-02-27T18:58:37.787+0000 7ff5feaf2700 -1 AuthRegistry(0x7ff5f8064978) no keyring found at /var/lib/rook/rook-ceph-external/client.admin.keyring, disabling cephx
2024-02-27T18:58:37.791+0000 7ff5feaf2700 -1 auth: unable to find a keyring on /var/lib/rook/rook-ceph-external/client.admin.keyring: (2) No such file or directory
2024-02-27T18:58:37.791+0000 7ff5feaf2700 -1 AuthRegistry(0x7ff5f80680f0) no keyring found at /var/lib/rook/rook-ceph-external/client.admin.keyring, disabling cephx
2024-02-27T18:58:37.795+0000 7ff5feaf2700 -1 auth: unable to find a keyring on /var/lib/rook/rook-ceph-external/client.admin.keyring: (2) No such file or directory
2024-02-27T18:58:37.795+0000 7ff5feaf2700 -1 AuthRegistry(0x7ff5feaf0ea0) no keyring found at /var/lib/rook/rook-ceph-external/client.admin.keyring, disabling cephx
2024-02-27T18:58:37.795+0000 7ff5fc88e700 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [1]
2024-02-27T18:58:37.795+0000 7ff5fd08f700 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [1]
2024-02-27T18:58:37.795+0000 7ff5effff700 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [1]
2024-02-27T18:58:37.795+0000 7ff5feaf2700 -1 monclient: authenticate NOTE: no keyring found; disabled cephx authentication
[errno 13] RADOS permission denied (error connecting to the cluster)

Which makes sense: the external cluster import only created client.healthchecker, not client.admin.
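
(For completeness: the limited caps are visible on the external cluster itself, though the exact caps depend on the script version:

ceph auth get client.healthchecker

which shows scoped-down mon/mgr/osd caps rather than the allow * that client.admin has.)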

None of the documentation makes it clear whether this is a supported configuration, and if it is not, the error reporting leaves a bit to be desired. It is also not clear to me whether I should just change the envvars I pass to import-external-cluster.sh to set ROOK_EXTERNAL_ADMIN_SECRET, or what the downsides of doing that would be.

Expected behavior:
CephObjectStore is created successfully.

How to reproduce it (minimal and precise):

  1. Create an external ceph cluster (in my case, it was created by Proxmox automatically, see https://pve.proxmox.com/wiki/Deploy_Hyper-Converged_Ceph_Cluster)
  2. Import the ceph cluster into rook-ceph
  3. Follow the instructions on https://rook.io/docs/rook/latest-release/CRDs/Object-Storage/ceph-object-store-crd/ to create a CephObjectStore

File(s) to submit:

  • Cluster CR (custom resource), typically called cluster.yaml, if necessary

Logs to submit:

  • Operator's logs, if necessary

  • Crashing pod(s) logs, if necessary

    To get logs, use kubectl -n <namespace> logs <pod name>
    When pasting logs, always surround them with backticks or use the insert code button from the GitHub UI.
    Read GitHub documentation if you need help.

Cluster Status to submit:

  • Output of kubectl commands, if necessary

    To get the health of the cluster, use kubectl rook-ceph health
    To get the status of the cluster, use kubectl rook-ceph ceph status
    For more details, see the Rook kubectl Plugin

Environment:

  • OS (e.g. from /etc/os-release): Debian GNU/Linux 12 (bookworm)
  • Kernel (e.g. uname -a): 6.1.0-18-cloud-amd64
  • Cloud provider or hardware configuration: VM hosted on Proxmox
  • Rook version (use rook version inside of a Rook Pod): rook: v1.13.3
  • Storage backend version (e.g. for ceph do ceph -v): ceph version 17.2.7 (2dd3854d5b35a35486e86e2616727168e244f470) quincy (stable)
  • Kubernetes version (use kubectl version): v1.29.2
  • Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift): kubeadm init'd
  • Storage backend status (e.g. for Ceph use ceph health in the Rook Ceph toolbox): HEALTH_OK
achernya added the bug label Feb 27, 2024

travisn commented Feb 27, 2024

@achernya Did you create the object store with the object-external.yaml example? I suspect you created the object store with object.yaml, which is not for external cluster configuration. The failure looks like it comes from attempting to create a CRUSH rule, which suggests Rook is trying to fully create the pools in the external cluster, which it doesn't have access to do.

See also the Connect to an external object store topic.
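
From memory, that example is essentially a CephObjectStore that points at an existing gateway instead of provisioning one, along these lines (the endpoint IP is a placeholder):

apiVersion: ceph.rook.io/v1
kind: CephObjectStore
metadata:
  name: external-store
  namespace: rook-ceph-external
spec:
  gateway:
    port: 80
    # existing rgw daemons outside the cluster
    externalRgwEndpoints:
      - ip: 10.0.0.1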

@achernya (Author)

I used object.yaml, yes. My read of object-external.yaml is that it sets up a CRD pointing to an external rgw -- which I don't have. I actually do want rook-ceph to provision the radosgw frontends.

It sounds like this is potentially an unsupported configuration, and I should instead provision rgw externally and then use object-external.yaml?

@BlaineEXE (Member)

I believe (but am not certain) that the configuration you describe is possible. It looks like the current issue may be that the Rook cluster might not have an admin key, which is necessary to set things up for running against an external cluster.

It's also possible that there are some internal issues with Rook regarding radosgw-admin and admin API usage that makes Rook unable to fully realize the integration as desired.

When you ran this step, did you specify the Ceph admin key and keyring? https://rook.io/docs/rook/latest-release/CRDs/Cluster/external-cluster/?h=key#1-create-all-users-and-keys

@parth-gr might have some additional thoughts about this as well.

@achernya (Author)

@BlaineEXE you are correct, the admin key is not present. Or rather, what's in kubectl get secret -n rook-ceph-external rook-ceph-mon -o json is "admin-secret": "YWRtaW4tc2VjcmV0", which base64-decodes to the placeholder string admin-secret. This matches the behavior I see in the import script: https://github.com/rook/rook/blob/master/deploy/examples/import-external-cluster.sh#L30
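
For anyone checking the same thing, the placeholder is visible directly with:

kubectl -n rook-ceph-external get secret rook-ceph-mon -o jsonpath='{.data.admin-secret}' | base64 -d

which prints the literal string admin-secret.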

I did not specify the admin key and keyring. I ran the export script with --rbd-data-pool, --cephfs-data-pool, and --format=bash. From my read of the documentation and create-external-cluster-resources.py, I thought "optional" meant the keyring would be auto-detected.


achernya commented Mar 1, 2024

@BlaineEXE I tried passing --keyring and --ceph-conf as you suggested,

python3 ./rook/create-external-cluster-resources.py --rbd-data-pool-name=ssdpool --cephfs-data-pool-name=cephfs_data_ec --format=bash --output=no_key.sh
python3 ./rook/create-external-cluster-resources.py --rbd-data-pool-name=ssdpool --cephfs-data-pool-name=cephfs_data_ec --format=bash --output=key.sh --keyring=/etc/pve/priv/ceph.client.admin.keyring --ceph-conf=/etc/pve/ceph.conf
diff -u no_key.sh key.sh

and there is no difference in the output.


parth-gr commented Mar 1, 2024

@achernya you need to pass --rgw-endpoint while running the python script.
See https://rook.io/docs/rook/latest-release/CRDs/Cluster/external-cluster/#1-create-all-users-and-keys for more info.
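
For example (host and port are placeholders; this assumes an rgw is already listening there):

python3 create-external-cluster-resources.py --rbd-data-pool-name=ssdpool --cephfs-data-pool-name=cephfs_data_ec --rgw-endpoint=<host>:<port> --format=bash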


achernya commented Mar 1, 2024

@parth-gr as I mentioned in my initial comment, I do not have an existing radosgw configuration for this external ceph cluster, and my goal is to get rook to provision the radosgw inside the k8s cluster.


parth-gr commented Mar 4, 2024

@achernya first of all, I would like to ask why you want this type of configuration. If it's external ceph, it won't be Rook's responsibility to manage its daemons, and I believe there are checks in the code along the lines of: if it is external, then skip its management.

And still, if you want to test something out of the box, I would say this is something we don't support.

But if you are interested in knowing how the creation could be made possible: you need to update all the caps to * for the health checker here: https://github.com/rook/rook/blob/master/deploy/examples/create-external-cluster-resources.py#L1007

"mon": "allow *", ... like this.

Then the ROOK_EXTERNAL_USER_SECRET created would have admin privileges, and I think the rgw pool creation will succeed.
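
An untested alternative to editing the script would be to widen the caps of the already-imported user directly on the external cluster (the key itself does not change, so no re-import should be needed):

ceph auth caps client.healthchecker mon 'allow *' mgr 'allow *' osd 'allow *'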

parth-gr added a commit to parth-gr/rook that referenced this issue Mar 4, 2024
currently the script requires both the v2 and v1 ports to be
present to enable the v2 port, but that is not a necessary
condition, so remove the check and enable v2 when only the v2
port is present, to successfully configure with v2 only

part-of: rook#13827

Signed-off-by: parth-gr <partharora1010@gmail.com>
parth-gr added a commit to parth-gr/rook that referenced this issue Mar 4, 2024
sometimes users want to use admin power to create some resources
in the external ceph cluster, so add a way to use the admin
privilege

part-of: rook#13827

Signed-off-by: parth-gr <partharora1010@gmail.com>
mergify bot pushed a commit that referenced this issue Mar 4, 2024
currently the script requires both the v2 and v1 ports to be
present to enable the v2 port, but that is not a necessary
condition, so remove the check and enable v2 when only the v2
port is present, to successfully configure with v2 only

part-of: #13827

Signed-off-by: parth-gr <partharora1010@gmail.com>
(cherry picked from commit 117bc76)

achernya commented Mar 4, 2024

@parth-gr

I would like to ask why you want this type of configuration

In my environment, I have a hyper-converged setup where the hypervisors have VMs with ceph-rbd storage, and I want the same ceph cluster to be used by the k8s environment. My underlying hypervisors (Proxmox) don't set up rgw, as it would want to take advantage of loadbalancers. I was hoping to run the rgw portions of the system in k8s, where my loadbalancers already exist or can easily be set up.

then I think the rgw pool creation will succeed

From my strace in my initial report, the creation command was explicitly looking for the client.admin keyring. That leads me to believe that simply granting client.healthchecker these privileges is necessary, but not sufficient, to make this work.

parth-gr added a commit to parth-gr/rook that referenced this issue Mar 5, 2024

parth-gr commented Mar 6, 2024

@achernya can you share the logs again? What is it complaining about now?

Also, restart the rook operator pod after these privilege changes; sometimes it requires a reboot of the node where the operator pod is running.

parth-gr added a commit to parth-gr/rook that referenced this issue Mar 6, 2024
parth-gr added a commit to parth-gr/rook that referenced this issue Mar 7, 2024
parth-gr added a commit to parth-gr/rook that referenced this issue Mar 8, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.
