ObjectBucketClaim with maxObjects incorrectly set to an integer will break the operator & prevent all new object creation #13989

Open
mitchese opened this issue Mar 28, 2024 · 3 comments
Labels: bug, good-first-issue

Comments

@mitchese

Is this a bug report or feature request?

  • Bug Report

Deviation from expected behavior:
If a user creates an invalid ObjectBucketClaim with an integer value for .spec.additionalConfig.maxObjects, it is accepted into the cluster. When the operator starts, it repeatedly logs a warning about the invalid bucket claim:

E0328 13:00:32.118449   14728 reflector.go:147] pkg/mod/k8s.io/client-go@v0.28.4/tools/cache/reflector.go:229: Failed to watch *v1alpha1.ObjectBucketClaim: failed to list *v1alpha1.ObjectBucketClaim: json: cannot unmarshal number into Go struct field ObjectBucketClaimSpec.items.spec.additionalConfig of type string

and eventually times out on startup with a final error:

2024-03-28 13:02:52.055424 C | rookcmd: failed to run operator: gave up to run the operator manager: failed to run the controller-runtime manager: [failed to wait for ceph-block-pool-controller caches to sync: timed out waiting for cache to be synced for Kind *v1.CephBlockPool, failed waiting for all runnables to end within grace period of 30s: context deadline exceeded]

The Rook operator then exits and the pod is restarted (CrashLoopBackOff).

Expected behavior:
The ObjectBucketClaim should be rejected at admission time, and/or the operator should be able to skip over it and start correctly while continuing to log warnings.

How to reproduce it (minimal and precise):
Create an ObjectBucketClaim like this:

apiVersion: objectbucket.io/v1alpha1
kind: ObjectBucketClaim
metadata:
  name: this-breaks-cluster
spec:
  generateBucketName: sample
  storageClassName: general-s3
  additionalConfig:
    maxObjects: 1000   # should be maxObjects: "1000"

Restart the operator & observe logs and crashloopbackoff

This was tested on:

$ rook version
rook: v1.13.5
go: go1.21.7

$ ceph versions
{
    "mon": {
        "ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)": 3
    },
    "mgr": {
        "ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)": 2
    },
    "osd": {
        "ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)": 366
    },
    "mds": {
        "ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)": 2
    },
    "rgw": {
        "ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)": 10
    },
    "overall": {
        "ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)": 383
    }
}

Environment:

  • OS (e.g. from /etc/os-release): Ubuntu 22.04 / Flatcar 3850
  • Kernel (e.g. uname -a): 6.5 / 6.6
  • Cloud provider or hardware configuration:
  • Rook version (use rook version inside of a Rook Pod): 1.13.5
  • Storage backend version (e.g. for ceph do ceph -v): 17.2.7
  • Kubernetes version (use kubectl version): 1.26.12
  • Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift): onprem bare metal
  • Storage backend status (e.g. for Ceph use ceph health in the Rook Ceph toolbox): ok
mitchese added the bug label Mar 28, 2024
parth-gr (Member) commented Mar 28, 2024

@mitchese can you provide a string in the maxObjects and re-start the rook operator pod?

mitchese (Author)
Of course the string works. If you adjust the OBC to maxObjects: "100", then on the next start of the operator the crash loop goes away and the operator runs.

The problem is that any user of our cluster can create broken YAML which is accepted (not validated) by the CRD. This then kills the operator on the next restart and breaks all operator operations for all users until we discover and fix it. Worse, the log doesn't mention which OBC has the problem, so the search for the namespace and OBC containing the integer is manual. Our initial search today was difficult because the offending OBC was created weeks ago and the operator was only restarted recently.
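For anyone else hunting for the offending object, something like the following should narrow it down (a sketch, assuming jq is available; kubectl decodes objects as unstructured JSON, so it is not affected by the typed-client unmarshal error shown above):

# Sketch: list OBCs whose maxObjects was stored as a number instead of a string.
kubectl get objectbucketclaims --all-namespaces -o json \
  | jq -r '.items[]
      | select((.spec.additionalConfig.maxObjects? | type) == "number")
      | "\(.metadata.namespace)/\(.metadata.name)"'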

We've got a gatekeeper rule in place as a bandaid for now, but this should be gracefully handled such that it doesn't fail.
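For reference, a minimal sketch of the kind of Gatekeeper rule meant above (assuming Gatekeeper v3; the template name, kind, and message are illustrative, not the exact rule we run):

apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: obcmaxobjectsstring          # illustrative name
spec:
  crd:
    spec:
      names:
        kind: OBCMaxObjectsString
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package obcmaxobjectsstring

        # Reject OBCs where maxObjects is present but not a string.
        violation[{"msg": msg}] {
          value := input.review.object.spec.additionalConfig.maxObjects
          not is_string(value)
          msg := sprintf("spec.additionalConfig.maxObjects must be a string, got %v", [value])
        }

A Constraint of kind OBCMaxObjectsString matching objectbucket.io ObjectBucketClaims then rejects the broken YAML at admission time.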

parth-gr (Member) commented Mar 28, 2024

Sure, we should use x-validation for the object-bucket CRD.
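For illustration, a sketch of what tightening the schema could look like (not the currently shipped objectbucketclaims.objectbucket.io CRD; field names taken from the OBC example above). Constraining additionalConfig values to strings would make the apiserver reject the integer at admission time, and an x-kubernetes-validations (CEL) rule could be layered on top for a friendlier error message:

# Sketch of a tightened openAPIV3Schema fragment for the ObjectBucketClaim CRD
spec:
  versions:
    - name: v1alpha1
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                additionalConfig:
                  type: object
                  additionalProperties:
                    type: string   # rejects maxObjects: 1000, accepts maxObjects: "1000"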

parth-gr added the good-first-issue label Mar 28, 2024