ObjectBucketClaim with maxObjects incorrectly set to an integer will break the operator & prevent all new object creation #13989

Open
mitchese opened this issue Mar 28, 2024 · 3 comments
Labels: bug, good-first-issue

Comments

@mitchese

Is this a bug report or feature request?

  • Bug Report

Deviation from expected behavior:
If a user creates an invalid ObjectBucketClaim with an integer value for .spec.additionalConfig.maxObjects, it is accepted into the cluster. When the operator starts, it repeatedly logs a warning about the invalid bucket claim:

E0328 13:00:32.118449   14728 reflector.go:147] pkg/mod/k8s.io/client-go@v0.28.4/tools/cache/reflector.go:229: Failed to watch *v1alpha1.ObjectBucketClaim: failed to list *v1alpha1.ObjectBucketClaim: json: cannot unmarshal number into Go struct field ObjectBucketClaimSpec.items.spec.additionalConfig of type string

and eventually times out on startup with a final error:

2024-03-28 13:02:52.055424 C | rookcmd: failed to run operator: gave up to run the operator manager: failed to run the controller-runtime manager: [failed to wait for ceph-block-pool-controller caches to sync: timed out waiting for cache to be synced for Kind *v1.CephBlockPool, failed waiting for all runnables to end within grace period of 30s: context deadline exceeded]

The Rook operator then exits and the pod is restarted (CrashLoopBackOff).

Expected behavior:
The ObjectBucketClaim should be rejected at admission time, and/or the operator should be able to skip over it and start correctly while continuing to log warnings.

How to reproduce it (minimal and precise):
Create an ObjectBucketClaim like this:

apiVersion: objectbucket.io/v1alpha1
kind: ObjectBucketClaim
metadata:
  name: this-breaks-cluster
spec:
  generateBucketName: sample
  storageClassName: general-s3
  additionalConfig:
    maxObjects: 1000   # should be maxObjects: "1000"

Restart the operator & observe logs and crashloopbackoff

This was tested on:

$ rook version
rook: v1.13.5
go: go1.21.7

$ ceph versions
{
    "mon": {
        "ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)": 3
    },
    "mgr": {
        "ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)": 2
    },
    "osd": {
        "ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)": 366
    },
    "mds": {
        "ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)": 2
    },
    "rgw": {
        "ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)": 10
    },
    "overall": {
        "ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)": 383
    }
}

Environment:

  • OS (e.g. from /etc/os-release): Ubuntu 22.04 / Flatcar 3850
  • Kernel (e.g. uname -a): 6.5 / 6.6
  • Cloud provider or hardware configuration:
  • Rook version (use rook version inside of a Rook Pod): 1.13.5
  • Storage backend version (e.g. for ceph do ceph -v): 17.2.7
  • Kubernetes version (use kubectl version): 1.26.12
  • Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift): onprem bare metal
  • Storage backend status (e.g. for Ceph use ceph health in the Rook Ceph toolbox): ok
mitchese added the bug label Mar 28, 2024
parth-gr (Member) commented Mar 28, 2024

@mitchese can you provide a string in the maxObjects and re-start the rook operator pod?

mitchese (Author)
Of course the string works. If you adjust the OBC to maxObjects: "100", then on the next start of the operator the crash loop goes away and the operator runs.

The problem is that any user of our cluster can create broken YAML which is accepted (not validated) by the CRD. This then kills the operator on the next restart and breaks all operator operations for all users until we discover and fix it. Worse, the log doesn't mention which OBC has the problem, so the search for the namespace and OBC containing the integer is manual. Our initial search today was difficult because the offending OBC was created weeks ago and the operator was only restarted recently.
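For anyone else hunting for the offending object, something like the following should narrow it down (a sketch, assuming jq is available; kubectl decodes objects as unstructured JSON, so it is not affected by the typed-client unmarshal error shown above):

# Sketch: list OBCs whose maxObjects was stored as a number instead of a string.
kubectl get objectbucketclaims --all-namespaces -o json \
  | jq -r '.items[]
      | select((.spec.additionalConfig.maxObjects? | type) == "number")
      | "\(.metadata.namespace)/\(.metadata.name)"'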

We've got a gatekeeper rule in place as a bandaid for now, but this should be gracefully handled such that it doesn't fail.
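For reference, a minimal sketch of the kind of Gatekeeper rule meant above (assuming Gatekeeper v3; the template name, kind, and message are illustrative, not the exact rule we run):

apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: obcmaxobjectsstring          # illustrative name
spec:
  crd:
    spec:
      names:
        kind: OBCMaxObjectsString
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package obcmaxobjectsstring

        # Reject OBCs where maxObjects is present but not a string.
        violation[{"msg": msg}] {
          value := input.review.object.spec.additionalConfig.maxObjects
          not is_string(value)
          msg := sprintf("spec.additionalConfig.maxObjects must be a string, got %v", [value])
        }

A Constraint of kind OBCMaxObjectsString matching objectbucket.io ObjectBucketClaims then rejects the broken YAML at admission time.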

parth-gr (Member) commented Mar 28, 2024

Sure, we should use x-validation for the object-bucket CRD.
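For illustration, a sketch of what tightening the schema could look like (not the currently shipped objectbucketclaims.objectbucket.io CRD; field names taken from the OBC example above). Constraining additionalConfig values to strings would make the apiserver reject the integer at admission time, and an x-kubernetes-validations (CEL) rule could be layered on top for a friendlier error message:

# Sketch of a tightened openAPIV3Schema fragment for the ObjectBucketClaim CRD
spec:
  versions:
    - name: v1alpha1
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                additionalConfig:
                  type: object
                  additionalProperties:
                    type: string   # rejects maxObjects: 1000, accepts maxObjects: "1000"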

parth-gr added the good-first-issue label Mar 28, 2024