-
Notifications
You must be signed in to change notification settings - Fork 2.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement OSD encryption key rotation #11749
Conversation
6e10122
to
5b5b513
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some overall needs:
- Unit tests
- Integration tests
- Design documentation. Where can we see an overview of the key rotation? At least we need a final design in Implement encryption key rotation for cluster-wide (OSDs) encryption #7925 even if not a full design doc.
Documentation/Storage-Configuration/Advanced/key-management-system.md
Outdated
Show resolved
Hide resolved
cmd/rook/secret.go
Outdated
// Ensure currentKey is in slot 1. | ||
for _, devicePath := range devicePaths { | ||
logger.Debugf("adding the current key to slot %q of the device %q", slotOne, devicePath) | ||
err = osd.AddEncryptionKey(context, devicePath, currentKey, currentKey, slotOne) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the key is already in slot one, will this succeed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the key is already in slot one, will this succeed?
yes,
if key matches with newPassphrase, ensureEncryptionKey will return true indicating match
if there's no match, we kill the slot and add the newpassphrase there.
return errors.Wrapf(err, "failed to get osd info for osd %q", osdDep.Name) | ||
} | ||
pvcName, ok := osdDep.Labels[OSDOverPVCLabelKey] | ||
if !ok { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't we already know it has the label since the label was included above in the query?
pkg/operator/k8sutil/deployment.go
Outdated
@@ -541,6 +541,17 @@ func CreateDeployment(ctx context.Context, clientset kubernetes.Interface, dep * | |||
return clientset.AppsV1().Deployments(dep.Namespace).Create(ctx, dep, metav1.CreateOptions{}) | |||
} | |||
|
|||
// CreateCronJob creates a cron job with a last applied hash annotation added | |||
func CreateCronJob(ctx context.Context, clientset kubernetes.Interface, cj *batchv1.CronJob) (*batchv1.CronJob, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like nothing calls this. How about merging it into the CreateOrUpdateCronJob()? Or just reduce visibility
func CreateCronJob(ctx context.Context, clientset kubernetes.Interface, cj *batchv1.CronJob) (*batchv1.CronJob, error) { | |
func createCronJob(ctx context.Context, clientset kubernetes.Interface, cj *batchv1.CronJob) (*batchv1.CronJob, error) { |
|
||
// Replace default unreachable node toleration if the osd pod is portable and based in PVC | ||
if osdProps.portable { | ||
k8sutil.AddUnreachableNodeToleration(&podTemplateSpec.Spec) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We shouldn't need this toleration for the cron job, just for daemon pods
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We shouldn't need this toleration for the cron job, just for daemon pods
I think it should apply to key rotation pod since it applies to OSD as well.
we want this pod to be eviction duration to be similar to that of OSD pod's.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Every time the cronjob is triggered (e.g. every week), a new pod will be created and it complete quickly within a few seconds, correct? The pod has no need to be evicted since it will be in a completed state. If the OSD migrates to another node, the rotation pod will be scheduled in the new location automatically.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed it 👍
cmd/rook/secret.go
Outdated
func cliRotateSecret() *cobra.Command { | ||
cmd := &cobra.Command{ | ||
Use: "rotate-key [kms-secret-key] [data-device] [metadata-device] [wal-device]", | ||
Short: "Rotate a secret from a given KMS", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought we weren't rotating from a KMS yet?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought we weren't rotating from a KMS yet?
modified wording,
5b5b513
to
699080d
Compare
Thanks for the quick and detailed review ! 😄
added some.
I'l try to add this soon,
I've added the final design in a comment on the issue and in this pr description. I've to still do the following things, I'll get to it on monday,
Please review the changes so far and let me know if anything else needs to modified or explained. |
4baf0fa
to
1b849bf
Compare
21f5721
to
37f0be8
Compare
added key rotation example in cluster-on-pvc.yaml (only this contained
done.
I've added this alongside canary-integration-test : encryption-pvc-db-wal :.
|
|
||
// Replace default unreachable node toleration if the osd pod is portable and based in PVC | ||
if osdProps.portable { | ||
k8sutil.AddUnreachableNodeToleration(&podTemplateSpec.Spec) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Every time the cronjob is triggered (e.g. every week), a new pod will be created and it complete quickly within a few seconds, correct? The pod has no need to be evicted since it will be in a completed state. If the OSD migrates to another node, the rotation pod will be scheduled in the new location automatically.
@@ -792,6 +787,7 @@ jobs: | |||
cat tests/manifests/test-on-pvc-db.yaml >> tests/manifests/test-cluster-on-pvc-encrypted.yaml | |||
cat tests/manifests/test-on-pvc-wal.yaml >> tests/manifests/test-cluster-on-pvc-encrypted.yaml | |||
kubectl create -f tests/manifests/test-cluster-on-pvc-encrypted.yaml | |||
kubectl patch -n rook-ceph cephcluster rook-ceph --type merge -p '{"spec":{"security":{"keyRotation":{"enabled": true, "schedule":"*/5 * * * *"}}}}' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How often will the key rotation be tested here? 5 times per minute?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How often will the key rotation be tested here? 5 times per minute?
once every 5 minutes. https://crontab.guru/#*/5__**
the script function verify_key_rotation
checks secret for change in key value every 20 seconds for 10 minutes.
once it is changed,
test osd deployment removal and re-hydration
test will take care of check of opening encrypted device with new key.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we change this to rotate every minute? Then the CI wait can be much faster. I see the CI is waiting 2.5 minutes for this step, which would be nice to reduce.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we change this to rotate every minute? Then the CI wait can be much faster. I see the CI is waiting 2.5 minutes for this step, which would be nice to reduce.
I've changed this to rotate every 3 minutes now,
( k8s cronjob aren't guaranteed to run if the interval is < ~ 1 minute )
and max wait for 7 minutes.
I think ~2-5 minutes add to the ci is fine since there are other tests which are running in parallel and take more time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If less than one minute is a problem, why not rotate every 2 min? :) I know other jobs are longer, but if there are other changes in the future to this job, then it could become the longest.
@@ -12,6 +13,12 @@ The `security` section contains settings related to encryption of the cluster. | |||
* `kms`: Key Management System settings | |||
* `connectionDetails`: the list of parameters representing kms connection details | |||
* `tokenSecretName`: the name of the Kubernetes Secret containing the kms authentication token | |||
* `keyRotation`: Key Rotation settings | |||
* `enabled`: whether key rotation is enabled or not, default is `false` | |||
* `schedule`: the cron schedule for key rotation, default is `@weekly` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we link to a page with the cron schedule format? Perhaps like the K8s cronjob documentation, we could link to this page: https://en.wikipedia.org/wiki/Cron
* `schedule`: the cron schedule for key rotation, default is `@weekly` | ||
|
||
!!! note | ||
Currently key rotation is only supported for default type, where the Key Encryption Keys are stored in a Kubernetes Secret. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently key rotation is only supported for default type, where the Key Encryption Keys are stored in a Kubernetes Secret. | |
Currently key rotation is only supported for the default type, where the Key Encryption Keys are stored in a Kubernetes Secret. |
pkg/daemon/ceph/osd/key_rotation.go
Outdated
// Fetch key to verify its the new key. | ||
keyInKMS, err := kms.GetSecret(secretName) | ||
if keyInKMS != newKey { | ||
return err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In most places where we return an error, we need to wrap the error with more details so the issue can be more easily troubleshooted. Also check other places where err is returned.
return err | |
return errors.Wrap(err, "failed to verify secret) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
checked all the places and modified where ever necessary 👍
} | ||
|
||
// applyKeyRotationPlacement applies the placement settings for the key rotation job | ||
// so that it is scheduled on the same node as the pod containing the given labels. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: This comment would help be clear about the intent of running with the OSD.
// so that it is scheduled on the same node as the pod containing the given labels. | |
// so that it is scheduled on the same node as the OSD for which the key rotation is scheduled. |
37f0be8
to
aaca8fe
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few final questions
pkg/daemon/ceph/osd/kms/k8s.go
Outdated
secret.StringData = map[string]string{OsdEncryptionSecretNameKeyName: key} | ||
// Update the Kubernetes Secret | ||
_, err = c.context.Clientset.CoreV1().Secrets(c.ClusterInfo.Namespace).Update(c.ClusterInfo.Context, secret, metav1.UpdateOptions{}) | ||
if err != nil && !kerrors.IsAlreadyExists(err) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does Update()
ever return an error that would result in already exists
? I thought that would just be after calling Create()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, its not required. I must have missed removing it from my earlier trials, removed it now.
thanks.
aaca8fe
to
8389ac6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The changes look good, just a final suggestion for the CI efficiency
@@ -792,6 +787,7 @@ jobs: | |||
cat tests/manifests/test-on-pvc-db.yaml >> tests/manifests/test-cluster-on-pvc-encrypted.yaml | |||
cat tests/manifests/test-on-pvc-wal.yaml >> tests/manifests/test-cluster-on-pvc-encrypted.yaml | |||
kubectl create -f tests/manifests/test-cluster-on-pvc-encrypted.yaml | |||
kubectl patch -n rook-ceph cephcluster rook-ceph --type merge -p '{"spec":{"security":{"keyRotation":{"enabled": true, "schedule":"*/5 * * * *"}}}}' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we change this to rotate every minute? Then the CI wait can be much faster. I see the CI is waiting 2.5 minutes for this step, which would be nice to reduce.
exit 0 | ||
fi | ||
echo "encryption passphrase is not rotated, sleeping for 30 seconds" | ||
sleep 30s |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're rotating every minute, we can reduce this sleep.
sleep 30s | |
sleep 10s |
8389ac6
to
c7af478
Compare
c0482f5
to
a158c57
Compare
This commit adds KeyRotationSpec to cephcluster CR to support encryption key rotation feature. If enabled Rook will create a cronjob for each encrypted PVC based OSD with given schedule set to rotate their respective key encryption key. Signed-off-by: Rakshith R <rar@redhat.com>
This commit adds util functions to support encryption key rotation such as RemoveEncryptionKeySlot, AddEncryptionKey and ensureEncryptionKey. Signed-off-by: Rakshith R <rar@redhat.com>
This commit adds support for updating secret for default KMS. This is required for encryption key rotation. Signed-off-by: Rakshith R <rar@redhat.com>
This commit adds functionality to be able to rotate key encryption key of encrypted PVC backed OSDs. Necessary changes such as adding update functionality to kms and rbac changes are made as well. Signed-off-by: Rakshith R <rar@redhat.com>
This commit adds util function to create key rotation cron job such as getKeyRotationContainer, getKeyRotationCronPodTemplateSpec and makeKeyRotationCronJob. Signed-off-by: Rakshith R <rar@redhat.com>
This commits adds code to reconcile key rotation cron jobs. Signed-off-by: Rakshith R <rar@redhat.com>
Signed-off-by: Rakshith R <rar@redhat.com>
This commit adds test for key rotation. Signed-off-by: Rakshith R <rar@redhat.com>
a158c57
to
b9fc9a6
Compare
Implement OSD encryption key rotation (backport #11749)
Description of your changes:
This pr implements the following:
core: add KeyRotationSpec to cephcluster CR
osd: add util functions to support encryption key rotation
ceph: add support for updating secret for default KMS
osd: add rotate-key functionality to rook's key-management cmd
ceph: add util function to create key rotation cron job
ceph: add capability to reconcile key rotation cron jobs
Cephcluster CR changes
parameters
enabled: <true|false>
andschedule: <cron_format, default to @weekly>
are added tosecurity.keyRotation
section of cephcluster spec here https://github.com/rook/rook/blob/master/Documentation/Storage-Configuration/Advanced/key-management-system.md.KMS changes
Support for
KMS.UpdateSecret()
is added.Currently only default type/ k8s secret is supported
It will be extended in future to support other KMS such as vault,ibm hpcs, kmip etc.
KEK Rotation logic would be as follows:
Note: The above steps will ensure the KEK in kms will be able to open the encrypted device even if the operation is disrupted at any step and all the edge cases from disrupted processes are handled.
luksAddKey, luksChangeKey, luksKillSlot
commands are being used to achieve this.Refer: 10 Linux cryptsetup Examples for LUKS Key Management (How to Add, Remove, Change, Reset LUKS encryption Key)
KEK Rotation Cron Job
requiredDuringScheduling..
using OSD's labels as selector to run on the same node as the OSD.Which issue is resolved by this Pull Request:
Resolves #7925
Checklist:
skip-ci
on the PR.