Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

security: Run the crash collector as ceph user #11219

Merged
merged 1 commit into from Oct 31, 2022

Conversation

travisn
Copy link
Member

@travisn travisn commented Oct 26, 2022

Description of your changes:
The crash collector does not have the command line arguments to run as ceph user id 167, so we set the security context to run as the ceph user in the main crash collector container.

Checklist:

  • Commit Message Formatting: Commit titles and messages follow guidelines in the developer guide).
  • Skip Tests for Docs: If this is only a documentation change, add the label skip-ci on the PR.
  • Reviewed the developer guide on Submitting a Pull Request
  • Pending release notes updated with breaking and/or notable changes for the next minor release.
  • Documentation has been updated, if necessary.
  • Unit tests have been added, if necessary.
  • Integration tests have been added, if necessary.

Copy link
Member

@leseb leseb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally, this is something the Ceph image would fix as well in its DockerFile, but that approach LGTM. Also, all the container specs could potentially use this as a default and we could stop using "run-as" flags altogether. Something to keep in mind for a future refactor maybe.
Last question, is this fixing a bug, or is it "just" to run non-root? Final point, I think the log rotate sidecar as the same problem.

func CephSecurityContext() *v1.SecurityContext {
// The ceph user has user id 167
context := PodSecurityContext()
context.RunAsUser = pointer.Int64(167)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: IIRC there is a constant somewhere like CephUserID.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After a quick search I didn't find one, but I'll look again, thanks

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't find it, so defined one

@lerminou
Copy link

lerminou commented Oct 27, 2022

Ideally, this is something the Ceph image would fix as well in its DockerFile,

Yes there is a security issue in their backlog for this, I sent an email yesterday to Travis to mitigate the CVE directly in Rook
I don't know if there is some logic in Ceph behind this, it seems "they suffers some delays in handling the issue"

Last question, is this fixing a bug, or is it "just" to run non-root?

yes a CVE was discovered recently in Ceph, I don't know if I can link it here ?

Copy link
Member Author

@travisn travisn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally, this is something the Ceph image would fix as well in its DockerFile, but that approach LGTM. Also, all the container specs could potentially use this as a default and we could stop using "run-as" flags altogether. Something to keep in mind for a future refactor maybe.

I was thinking about removing the run-as flags and using the security context like this in a separate PR. It does seem like a better approach if we can just use the K8s context and no need for the ceph flags.

Last question, is this fixing a bug, or is it "just" to run non-root? Final point, I think the log rotate sidecar as the same problem.

Thanks, I'll look at the log rotate sidecar too.

func CephSecurityContext() *v1.SecurityContext {
// The ceph user has user id 167
context := PodSecurityContext()
context.RunAsUser = pointer.Int64(167)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After a quick search I didn't find one, but I'll look again, thanks

@leseb
Copy link
Member

leseb commented Oct 27, 2022

Ideally, this is something the Ceph image would fix as well in its DockerFile, but that approach LGTM. Also, all the container specs could potentially use this as a default and we could stop using "run-as" flags altogether. Something to keep in mind for a future refactor maybe.

I was thinking about removing the run-as flags and using the security context like this in a separate PR. It does seem like a better approach if we can just use the K8s context and no need for the ceph flags.

After thoughts, this won't be super trivial though, I had an attempt here #9175 but never had time to finish it.
I had most of working IIRC, but it's tricky so be careful :).

Last question, is this fixing a bug, or is it "just" to run non-root? Final point, I think the log rotate sidecar as the same problem.

Thanks, I'll look at the log rotate sidecar too.

The crash collector does not have the command line arguments
to run as ceph user id 167, so we set the security context
to run as the ceph user in the main crash collector
container.

Signed-off-by: Travis Nielsen <tnielsen@redhat.com>
@travisn
Copy link
Member Author

travisn commented Oct 27, 2022

Thanks, I'll look at the log rotate sidecar too.

The log rotate sidecar will take a bit more investigation, will work on it in a follow-up issue with #11235.

@leseb
Copy link
Member

leseb commented Oct 28, 2022

Thanks, I'll look at the log rotate sidecar too.

The log rotate sidecar will take a bit more investigation, will work on it in a follow-up issue with #11235.

Yes, I think my old PR should have what's necessary to fix it, feel free to ping me when you tackle it.

Copy link
Member

@leseb leseb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One last thing, did you verify that the crash files can be uploaded by the ceph-crash script when they appear? Normally the crash logs should be owned by ceph so this should not be a problem. Better safe than sorry though.

@travisn
Copy link
Member Author

travisn commented Oct 28, 2022

One last thing, did you verify that the crash files can be uploaded by the ceph-crash script when they appear? Normally the crash logs should be owned by ceph so this should not be a problem. Better safe than sorry though.

Verified both in minikube and openshift. If I look at the crash directory, the crash moved into the posted subdir, and the toolbox shows the crash was reported:

sh-4.4$ ceph crash ls
ID                                                                ENTITY  NEW  
2022-10-28T17:31:32.037827Z_c5308e72-cdee-4c4d-8158-e24476180554  mon.a    *   

However, the logging in the crash collector pod makes it look like the operator failed. But I see this same log whether the crash collector is running as ceph or root (the same behavior before or after this fix). @leseb Have you seen this before?

INFO:ceph-crash:monitoring path /var/lib/ceph/crash, delay 600s
WARNING:ceph-crash:post /var/lib/ceph/crash/2022-10-28T17:31:32.037827Z_c5308e72-cdee-4c4d-8158-e24476180554 as client.crash.rook-ceph-crashcollector-7949e6200e610c9b3d040f70a100f3f4-cq5rx failed: (None, b'2022-10-28T17:36:22.538+0000 7f14db7fe700 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2,1]\n2022-10-28T17:36:22.540+0000 7f14e0edb700 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2,1]\n2022-10-28T17:36:22.543+0000 7f14dbfff700 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2,1]\n[errno 13] RADOS permission denied (error connecting to the cluster)\n')
WARNING:ceph-crash:post /var/lib/ceph/crash/2022-10-28T17:31:32.037827Z_c5308e72-cdee-4c4d-8158-e24476180554 as client.crash failed: (None, b

@travisn travisn requested a review from leseb October 28, 2022 17:58
@leseb
Copy link
Member

leseb commented Oct 31, 2022

One last thing, did you verify that the crash files can be uploaded by the ceph-crash script when they appear? Normally the crash logs should be owned by ceph so this should not be a problem. Better safe than sorry though.

Verified both in minikube and openshift. If I look at the crash directory, the crash moved into the posted subdir, and the toolbox shows the crash was reported:

sh-4.4$ ceph crash ls
ID                                                                ENTITY  NEW  
2022-10-28T17:31:32.037827Z_c5308e72-cdee-4c4d-8158-e24476180554  mon.a    *   

However, the logging in the crash collector pod makes it look like the operator failed. But I see this same log whether the crash collector is running as ceph or root (the same behavior before or after this fix). @leseb Have you seen this before?

INFO:ceph-crash:monitoring path /var/lib/ceph/crash, delay 600s
WARNING:ceph-crash:post /var/lib/ceph/crash/2022-10-28T17:31:32.037827Z_c5308e72-cdee-4c4d-8158-e24476180554 as client.crash.rook-ceph-crashcollector-7949e6200e610c9b3d040f70a100f3f4-cq5rx failed: (None, b'2022-10-28T17:36:22.538+0000 7f14db7fe700 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2,1]\n2022-10-28T17:36:22.540+0000 7f14e0edb700 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2,1]\n2022-10-28T17:36:22.543+0000 7f14dbfff700 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2,1]\n[errno 13] RADOS permission denied (error connecting to the cluster)\n')
WARNING:ceph-crash:post /var/lib/ceph/crash/2022-10-28T17:31:32.037827Z_c5308e72-cdee-4c4d-8158-e24476180554 as client.crash failed: (None, b

Yes, this is very confusing but harmless. Essentially, the ceph-crash tries a bunch of users, and some fail but eventually one succeeds. That's why you see the crash being present in Ceph.

Copy link
Member

@BlaineEXE BlaineEXE left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👏 👏 👏 👏 👏 👏 👏 👏 👏 👏

@travisn travisn merged commit bb91d56 into rook:master Oct 31, 2022
50 checks passed
mergify bot added a commit that referenced this pull request Oct 31, 2022
security: Run the crash collector as ceph user (backport #11219)
mergify bot added a commit that referenced this pull request Oct 31, 2022
security: Run the crash collector as ceph user (backport #11219)
@travisn travisn deleted the crash-collector-user branch November 1, 2022 15:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants