Implement OSD encryption key rotation #11749

Merged
merged 8 commits into from
Mar 8, 2023

@Rakshith-R Rakshith-R commented Feb 23, 2023

Description of your changes:
This PR implements the following:


Cephcluster CR changes

The parameters enabled: <true|false> and schedule: <cron_format, defaults to @weekly> are added to
the security.keyRotation section of the CephCluster spec, documented at https://github.com/rook/rook/blob/master/Documentation/Storage-Configuration/Advanced/key-management-system.md.

security:
  kms:
    connectionDetails:
      KMS_PROVIDER: vault
      ....
    tokenSecretName: rook-vault-token
  keyRotation:
    enabled: true
    schedule: "@weekly"
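
As a rough illustration of how the two fields could be decoded and defaulted, here is a minimal Go sketch. The keyRotationSpec struct, its JSON tags, and parseKeyRotation are assumptions made for this example; the KeyRotationSpec type added by this PR may look different. The input mirrors the JSON shape of the security.keyRotation section.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// keyRotationSpec is a hypothetical stand-in for the KeyRotationSpec type
// added to the CephCluster CR; field names and tags are assumptions.
type keyRotationSpec struct {
	Enabled  bool   `json:"enabled,omitempty"`
	Schedule string `json:"schedule,omitempty"`
}

// defaultSchedule mirrors the documented default of "@weekly".
const defaultSchedule = "@weekly"

// parseKeyRotation decodes a security.keyRotation fragment and fills in
// the default schedule when none is given.
func parseKeyRotation(raw string) (keyRotationSpec, error) {
	var spec keyRotationSpec
	if err := json.Unmarshal([]byte(raw), &spec); err != nil {
		return spec, fmt.Errorf("failed to decode keyRotation: %w", err)
	}
	if spec.Schedule == "" {
		spec.Schedule = defaultSchedule
	}
	return spec, nil
}

func main() {
	spec, err := parseKeyRotation(`{"enabled": true}`)
	if err != nil {
		panic(err)
	}
	fmt.Printf("enabled=%t schedule=%s\n", spec.Enabled, spec.Schedule)
}
```

Note that with a boolean field, a quoted value such as "true" would fail to decode, which is one reason to write enabled as a plain boolean.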

KMS changes

Support for KMS.UpdateSecret() is added.
Currently only the default type (Kubernetes Secret) is supported.
It will be extended in the future to support other KMS providers such as Vault, IBM HPCS, KMIP, etc.


KEK Rotation logic would be as follows:

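| Step | Operation | LUKS Slot 0 | LUKS Slot 1 | Key in KMS |
|------|-----------|-------------|-------------|------------|
| 1 | Obtain K1 | K1 | | K1 |
| 2 | Add K1 to slot 1 | K1 | K1 | K1 |
| 3 | Create K2 & add to slot 0 | K2 | K1 | K1 |
| 4 | Update K2 in KMS | K2 | K1 | K2 |
| 5 | Remove K1 from slot 1 | K2 | | K2 |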

Note: The above steps ensure that the KEK in the KMS can always open the encrypted device, even if the operation is disrupted at any step. All edge cases arising from disrupted processes are handled.
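
The invariant in the note can be checked with a small in-memory simulation. This is only a sketch: the device type and the rotate function are hypothetical, no cryptsetup calls are made, and the key names K1/K2 are taken from the table above.

```go
package main

import "fmt"

// device models a LUKS device as a map from key slot to passphrase.
// It is an in-memory illustration only.
type device map[int]string

// opens reports whether any slot holds the given key.
func (d device) opens(key string) bool {
	for _, k := range d {
		if k == key {
			return true
		}
	}
	return false
}

// rotate walks steps 2-5 of the table and stops after stopAfter of them
// to simulate a disruption. It returns the key currently stored in the KMS.
func rotate(d device, kmsKey, newKey string, stopAfter int) string {
	steps := []func(){
		func() { d[1] = kmsKey },   // step 2: add K1 to slot 1
		func() { d[0] = newKey },   // step 3: create K2 and add it to slot 0
		func() { kmsKey = newKey }, // step 4: update K2 in the KMS
		func() { delete(d, 1) },    // step 5: remove K1 from slot 1
	}
	for i, step := range steps {
		if i >= stopAfter {
			break
		}
		step()
	}
	return kmsKey
}

func main() {
	for stop := 0; stop <= 4; stop++ {
		d := device{0: "K1"} // step 1: K1 is in slot 0 and in the KMS
		kmsKey := rotate(d, "K1", "K2", stop)
		fmt.Printf("stopped after %d steps: KMS key opens device: %t\n", stop, d.opens(kmsKey))
	}
}
```

Whichever step the run stops at, the key held by the KMS still matches at least one slot, which is the property the step ordering is designed to give.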

luksAddKey, luksChangeKey, luksKillSlot commands are being used to achieve this.

Refer: 10 Linux cryptsetup Examples for LUKS Key Management (How to Add, Remove, Change, Reset LUKS encryption Key)


KEK Rotation Cron Job

  • One Cron Job per encrypted PVC-backed OSD is created, with the given schedule, when key rotation is enabled.
  • The Cron Job uses OSD pod affinity (requiredDuringScheduling..), with the OSD's labels as the selector, to run on the same node as the OSD.
  • The Cron Job shares the host bridge directory with the OSD. This directory contains the mapped encrypted devices, which lets the job rotate the KEK.

Which issue is resolved by this Pull Request:
Resolves #7925

Checklist:

  • Commit Message Formatting: Commit titles and messages follow guidelines in the developer guide.
  • Skip Tests for Docs: If this is only a documentation change, add the label skip-ci on the PR.
  • Reviewed the developer guide on Submitting a Pull Request
  • Pending release notes updated with breaking and/or notable changes for the next minor release.
  • Documentation has been updated, if necessary.
  • Unit tests have been added, if necessary.
  • Integration tests have been added, if necessary.

@Rakshith-R Rakshith-R force-pushed the enc-key-rotation branch 3 times, most recently from 6e10122 to 5b5b513 Compare February 23, 2023 15:55
Member

@travisn travisn left a comment

Some overall needs:

// Ensure currentKey is in slot 1.
for _, devicePath := range devicePaths {
	logger.Debugf("adding the current key to slot %q of the device %q", slotOne, devicePath)
	err = osd.AddEncryptionKey(context, devicePath, currentKey, currentKey, slotOne)
Member

If the key is already in slot one, will this succeed?

Member Author

> If the key is already in slot one, will this succeed?

Yes.
If the key in the slot matches newPassphrase, ensureEncryptionKey returns true, indicating a match.
If there's no match, we kill the slot and add the new passphrase there.
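
The idempotent behaviour described in this reply can be sketched as follows. The slot map and the ensureKeyInSlot helper are hypothetical and purely in-memory; they are not the PR's ensureEncryptionKey or AddEncryptionKey, which operate on real devices.

```go
package main

import "fmt"

// ensureKeyInSlot makes sure the given slot holds key. It returns true when
// the slot already matched and nothing had to change. Otherwise the slot is
// cleared (the luksKillSlot analogue) and the key is added (the luksAddKey
// analogue).
func ensureKeyInSlot(slots map[int]string, slot int, key string) bool {
	if current, ok := slots[slot]; ok && current == key {
		return true
	}
	delete(slots, slot)
	slots[slot] = key
	return false
}

func main() {
	slots := map[int]string{0: "current", 1: "stale"}
	fmt.Println("first call matched:", ensureKeyInSlot(slots, 1, "current"))
	fmt.Println("second call matched:", ensureKeyInSlot(slots, 1, "current"))
}
```

Calling the helper twice with the same arguments leaves the slots unchanged on the second call, which is why a retried rotation can safely repeat the step.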

return errors.Wrapf(err, "failed to get osd info for osd %q", osdDep.Name)
}
pvcName, ok := osdDep.Labels[OSDOverPVCLabelKey]
if !ok {
Member

Don't we already know it has the label since the label was included above in the query?

@@ -541,6 +541,17 @@ func CreateDeployment(ctx context.Context, clientset kubernetes.Interface, dep *
return clientset.AppsV1().Deployments(dep.Namespace).Create(ctx, dep, metav1.CreateOptions{})
}

// CreateCronJob creates a cron job with a last applied hash annotation added
func CreateCronJob(ctx context.Context, clientset kubernetes.Interface, cj *batchv1.CronJob) (*batchv1.CronJob, error) {
Member

Looks like nothing calls this. How about merging it into the CreateOrUpdateCronJob()? Or just reduce visibility

Suggested change
- func CreateCronJob(ctx context.Context, clientset kubernetes.Interface, cj *batchv1.CronJob) (*batchv1.CronJob, error) {
+ func createCronJob(ctx context.Context, clientset kubernetes.Interface, cj *batchv1.CronJob) (*batchv1.CronJob, error) {


// Replace default unreachable node toleration if the osd pod is portable and based in PVC
if osdProps.portable {
	k8sutil.AddUnreachableNodeToleration(&podTemplateSpec.Spec)
Member

We shouldn't need this toleration for the cron job, just for daemon pods

Member Author

> We shouldn't need this toleration for the cron job, just for daemon pods

I think it should apply to the key rotation pod since it applies to the OSD as well.
We want this pod's eviction duration to be similar to that of the OSD pod.

Member

Every time the cronjob is triggered (e.g. every week), a new pod will be created and it will complete quickly, within a few seconds, correct? The pod has no need to be evicted since it will be in a completed state. If the OSD migrates to another node, the rotation pod will be scheduled in the new location automatically.

Member Author

removed it 👍

func cliRotateSecret() *cobra.Command {
	cmd := &cobra.Command{
		Use: "rotate-key [kms-secret-key] [data-device] [metadata-device] [wal-device]",
		Short: "Rotate a secret from a given KMS",
Member

I thought we weren't rotating from a KMS yet?

Member Author

> I thought we weren't rotating from a KMS yet?

Modified the wording.

Member Author

Rakshith-R commented Feb 24, 2023

Thanks for the quick and detailed review ! 😄

Some overall needs:

  • Unit tests

added some.

  • Integration tests

I'll try to add this soon;
if it's not possible in this PR, I'll do it in a follow-up PR.

I've added the final design in a comment on the issue and in this PR description.


I still have to do the following things; I'll get to them on Monday:

  • changes to helm values.yaml, helm docs, cephcluster example yamls (any pointers on which of these yamls need key rotation examples?)
  • I'll go over the code again and see if I can add unit tests in any other places.
  • integration test, if possible.
  • Add logs from a key rotation cron job run for reference.

Please review the changes so far and let me know if anything else needs to be modified or explained.

@Rakshith-R Rakshith-R force-pushed the enc-key-rotation branch 5 times, most recently from 4baf0fa to 1b849bf Compare February 27, 2023 07:12
@Rakshith-R Rakshith-R force-pushed the enc-key-rotation branch 4 times, most recently from 21f5721 to 37f0be8 Compare February 27, 2023 11:09
@Rakshith-R
Member Author

Thanks for the quick and detailed review! 😄

Some overall needs:

  • Unit tests

added some.

  • Integration tests

I'll try to add this soon; if it's not possible in this PR, I'll do it in a follow-up PR.

I've added the final design in a comment on the issue and in this PR description.

I still have to do the following things; I'll get to them on Monday:

  • changes to helm values.yaml, helm docs, cephcluster example yamls (any pointers on which of these yamls need key rotation examples ? )

added key rotation example in cluster-on-pvc.yaml (only this contained .spec.security section).

  • I'll go over again and see if I can add unit tests in any other places too.

done.

  • integration test, if possible.

I've added this alongside canary-integration-test: encryption-pvc-db-wal.
This checks for a new key in the secret, and the next step, OSD deployment removal, checks that the new key can open
the encrypted device.

  • Add logs from a key rotation cron job run for reference.

Please review the changes so far and let me know if anything else needs to be modified or explained.

Key rotation pod logs:

runner@fv-az275-887:~/work/rook/rook$ k logs rook-ceph-osd-key-rotation-0-27958250-gmtfq -n rook-ceph
2023-02-27 10:50:12.292968 I | cephosd: fetching the current key
2023-02-27 10:50:12.296243 I | cephosd: adding the current key to slot "1" of the device "/var/lib/ceph/osd/block-tmp"
2023-02-27 10:50:12.296759 D | exec: Running command: cryptsetup --verbose --key-file=/tmp/325846276 --key-slot=1 luksAddKey /var/lib/ceph/osd/block-tmp /tmp/1585851795
2023-02-27 10:50:36.971463 I | cephosd: adding the current key to slot "1" of the device "/var/lib/ceph/osd/block.db-tmp"
2023-02-27 10:50:36.971735 D | exec: Running command: cryptsetup --verbose --key-file=/tmp/739593765 --key-slot=1 luksAddKey /var/lib/ceph/osd/block.db-tmp /tmp/1860780317
2023-02-27 10:50:46.211929 I | cephosd: adding the current key to slot "1" of the device "/var/lib/ceph/osd/block.wal-tmp"
2023-02-27 10:50:46.212323 D | exec: Running command: cryptsetup --verbose --key-file=/tmp/2048944663 --key-slot=1 luksAddKey /var/lib/ceph/osd/block.wal-tmp /tmp/2041810508
2023-02-27 10:51:19.710455 I | cephosd: generating new key
2023-02-27 10:51:19.710497 I | cephosd: removing key slot "0", if found, of the device "/var/lib/ceph/osd/block-tmp"
2023-02-27 10:51:19.710660 D | exec: Running command: cryptsetup --verbose --key-file=/tmp/4214309798 luksKillSlot /var/lib/ceph/osd/block-tmp 0
2023-02-27 10:51:23.094666 I | cephosd: adding new key to slot "0" of the device "/var/lib/ceph/osd/block-tmp"
2023-02-27 10:51:23.095229 D | exec: Running command: cryptsetup --verbose --key-file=/tmp/104985237 --key-slot=0 luksAddKey /var/lib/ceph/osd/block-tmp /tmp/1275894378
2023-02-27 10:51:33.981403 I | cephosd: removing key slot "0", if found, of the device "/var/lib/ceph/osd/block.db-tmp"
2023-02-27 10:51:33.982012 D | exec: Running command: cryptsetup --verbose --key-file=/tmp/2042354239 luksKillSlot /var/lib/ceph/osd/block.db-tmp 0
2023-02-27 10:51:36.526831 I | cephosd: adding new key to slot "0" of the device "/var/lib/ceph/osd/block.db-tmp"
2023-02-27 10:51:36.527194 D | exec: Running command: cryptsetup --verbose --key-file=/tmp/444230199 --key-slot=0 luksAddKey /var/lib/ceph/osd/block.db-tmp /tmp/1825221398
2023-02-27 10:52:01.612414 I | cephosd: removing key slot "0", if found, of the device "/var/lib/ceph/osd/block.wal-tmp"
2023-02-27 10:52:01.617195 D | exec: Running command: cryptsetup --verbose --key-file=/tmp/1125153552 luksKillSlot /var/lib/ceph/osd/block.wal-tmp 0
2023-02-27 10:52:04.531993 I | cephosd: adding new key to slot "0" of the device "/var/lib/ceph/osd/block.wal-tmp"
2023-02-27 10:52:04.532709 D | exec: Running command: cryptsetup --verbose --key-file=/tmp/3174693213 --key-slot=0 luksAddKey /var/lib/ceph/osd/block.wal-tmp /tmp/2128256523
2023-02-27 10:52:22.727476 I | cephosd: updating the new key in the KMS
2023-02-27 10:52:22.756352 I | cephosd: fetching the key from the KMS to verify it.
2023-02-27 10:52:22.759071 I | cephosd: removing the old key from the slot "1" of the device "/var/lib/ceph/osd/block-tmp"
2023-02-27 10:52:22.759371 D | exec: Running command: cryptsetup --verbose --key-file=/tmp/1029750050 luksKillSlot /var/lib/ceph/osd/block-tmp 1
2023-02-27 10:52:24.448413 I | cephosd: removing the old key from the slot "1" of the device "/var/lib/ceph/osd/block.db-tmp"
2023-02-27 10:52:24.448854 D | exec: Running command: cryptsetup --verbose --key-file=/tmp/3894365656 luksKillSlot /var/lib/ceph/osd/block.db-tmp 1
2023-02-27 10:52:26.643961 I | cephosd: removing the old key from the slot "1" of the device "/var/lib/ceph/osd/block.wal-tmp"
2023-02-27 10:52:26.644270 D | exec: Running command: cryptsetup --verbose --key-file=/tmp/2091646819 luksKillSlot /var/lib/ceph/osd/block.wal-tmp 1
2023-02-27 10:52:29.956407 I | cephosd: Successfully rotated the key
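
The cryptsetup invocations in these logs follow two shapes, one for luksAddKey and one for luksKillSlot. The sketch below rebuilds those argument lists in Go. The helper names and the key file paths in main are hypothetical; only the flags and their order are taken from the log lines above, and nothing is executed.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// addKeyArgs builds the arguments for `cryptsetup luksAddKey` in the same
// order as the log lines: an existing key file authorizes the change and
// the new key file is written to the requested slot.
func addKeyArgs(device, existingKeyFile, newKeyFile string, slot int) []string {
	return []string{
		"--verbose",
		"--key-file=" + existingKeyFile,
		"--key-slot=" + strconv.Itoa(slot),
		"luksAddKey",
		device,
		newKeyFile,
	}
}

// killSlotArgs builds the arguments for `cryptsetup luksKillSlot`, where
// keyFile holds a passphrase that remains valid for another slot.
func killSlotArgs(device, keyFile string, slot int) []string {
	return []string{
		"--verbose",
		"--key-file=" + keyFile,
		"luksKillSlot",
		device,
		strconv.Itoa(slot),
	}
}

// commandLine renders the arguments the way the exec log prints them.
func commandLine(args []string) string {
	return "cryptsetup " + strings.Join(args, " ")
}

func main() {
	dev := "/var/lib/ceph/osd/block-tmp"
	// The key file paths below are placeholders, not the temp files Rook creates.
	fmt.Println(commandLine(addKeyArgs(dev, "/tmp/current.key", "/tmp/new.key", 0)))
	fmt.Println(commandLine(killSlotArgs(dev, "/tmp/new.key", 1)))
}
```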


@@ -792,6 +787,7 @@ jobs:
cat tests/manifests/test-on-pvc-db.yaml >> tests/manifests/test-cluster-on-pvc-encrypted.yaml
cat tests/manifests/test-on-pvc-wal.yaml >> tests/manifests/test-cluster-on-pvc-encrypted.yaml
kubectl create -f tests/manifests/test-cluster-on-pvc-encrypted.yaml
kubectl patch -n rook-ceph cephcluster rook-ceph --type merge -p '{"spec":{"security":{"keyRotation":{"enabled": true, "schedule":"*/5 * * * *"}}}}'
Member

How often will the key rotation be tested here? 5 times per minute?

Member Author

> How often will the key rotation be tested here? 5 times per minute?

Once every 5 minutes: https://crontab.guru/#*/5_*_*_*_*

The script function verify_key_rotation
checks the secret for a change in the key value every 20 seconds, for up to 10 minutes.
Once the key has changed,
the OSD deployment removal and re-hydration test verifies that the encrypted device can be opened with the new key.
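
The CI check itself is a shell function; the Go sketch below only illustrates the same poll-until-changed pattern. The names waitForChange, fakeSecret and runDemo are hypothetical, and the intervals are shortened to milliseconds so the example finishes quickly.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// waitForChange polls get every interval until it returns a value different
// from old, or until another poll would not fit before the timeout.
func waitForChange(get func() string, old string, interval, timeout time.Duration) (string, error) {
	deadline := time.Now().Add(timeout)
	for {
		if cur := get(); cur != old {
			return cur, nil
		}
		if time.Now().Add(interval).After(deadline) {
			return "", errors.New("timed out waiting for the key to change")
		}
		time.Sleep(interval)
	}
}

// fakeSecret returns a getter that yields "old-key" for the first
// changeAfter calls and "new-key" afterwards.
func fakeSecret(changeAfter int) func() string {
	calls := 0
	return func() string {
		calls++
		if calls > changeAfter {
			return "new-key"
		}
		return "old-key"
	}
}

// runDemo polls a fake secret every 5ms for at most 50ms.
func runDemo(changeAfter int) (string, error) {
	return waitForChange(fakeSecret(changeAfter), "old-key", 5*time.Millisecond, 50*time.Millisecond)
}

func main() {
	key, err := runDemo(2)
	fmt.Println(key, err)
}
```

A shorter cron interval lets the poll succeed sooner, which is the trade-off discussed in the rest of this thread.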

Member

Can we change this to rotate every minute? Then the CI wait can be much faster. I see the CI is waiting 2.5 minutes for this step, which would be nice to reduce.

Member Author

> Can we change this to rotate every minute? Then the CI wait can be much faster. I see the CI is waiting 2.5 minutes for this step, which would be nice to reduce.

I've changed this to rotate every 3 minutes now
(k8s cronjobs aren't guaranteed to run if the interval is less than about 1 minute),
with a maximum wait of 7 minutes.

I think adding ~2-5 minutes to the CI is fine, since other tests run in parallel and take more time.

Member

If less than one minute is a problem, why not rotate every 2 min? :) I know other jobs are longer, but if there are other changes in the future to this job, then it could become the longest.

@@ -12,6 +13,12 @@ The `security` section contains settings related to encryption of the cluster.
* `kms`: Key Management System settings
    * `connectionDetails`: the list of parameters representing kms connection details
    * `tokenSecretName`: the name of the Kubernetes Secret containing the kms authentication token
* `keyRotation`: Key Rotation settings
    * `enabled`: whether key rotation is enabled or not, default is `false`
    * `schedule`: the cron schedule for key rotation, default is `@weekly`
Member

Could we link to a page with the cron schedule format? Perhaps like the K8s cronjob documentation, we could link to this page: https://en.wikipedia.org/wiki/Cron

* `schedule`: the cron schedule for key rotation, default is `@weekly`

!!! note
Currently key rotation is only supported for default type, where the Key Encryption Keys are stored in a Kubernetes Secret.
Member

Suggested change
- Currently key rotation is only supported for default type, where the Key Encryption Keys are stored in a Kubernetes Secret.
+ Currently key rotation is only supported for the default type, where the Key Encryption Keys are stored in a Kubernetes Secret.

// Fetch key to verify its the new key.
keyInKMS, err := kms.GetSecret(secretName)
if keyInKMS != newKey {
	return err
Member

In most places where we return an error, we need to wrap the error with more details so the issue can be more easily troubleshooted. Also check other places where err is returned.

Suggested change
- return err
+ return errors.Wrap(err, "failed to verify secret")

Member Author

Checked all the places and modified them wherever necessary 👍
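
The review point about descriptive errors can be sketched as below. Rook wraps errors with github.com/pkg/errors (errors.Wrap and errors.Wrapf appear in this thread); to stay self-contained, this example swaps in the standard library's fmt.Errorf with %w. The verifyKey and staticGetter helpers are hypothetical. The sketch also treats a fetch failure and a key mismatch as two separate cases, so a mismatch never returns a nil error.

```go
package main

import (
	"errors"
	"fmt"
)

// errUnavailable is a sample failure used by the demo getter.
var errUnavailable = errors.New("kms unavailable")

// verifyKey fetches the stored key via get and confirms it equals want.
// Both failure modes return a non-nil, descriptive error.
func verifyKey(get func(name string) (string, error), name, want string) error {
	got, err := get(name)
	if err != nil {
		return fmt.Errorf("failed to fetch secret %q for verification: %w", name, err)
	}
	if got != want {
		return fmt.Errorf("secret %q does not hold the new key", name)
	}
	return nil
}

// staticGetter returns a getter that always yields the same value and error.
func staticGetter(value string, err error) func(string) (string, error) {
	return func(string) (string, error) { return value, err }
}

func main() {
	err := verifyKey(staticGetter("", errUnavailable), "osd-secret", "k2")
	fmt.Println(err)
	fmt.Println("wraps errUnavailable:", errors.Is(err, errUnavailable))
	fmt.Println(verifyKey(staticGetter("k1", nil), "osd-secret", "k2"))
}
```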

}

// applyKeyRotationPlacement applies the placement settings for the key rotation job
// so that it is scheduled on the same node as the pod containing the given labels.
Member

nit: This comment would help be clear about the intent of running with the OSD.

Suggested change
- // so that it is scheduled on the same node as the pod containing the given labels.
+ // so that it is scheduled on the same node as the OSD for which the key rotation is scheduled.

Member

@travisn travisn left a comment

A few final questions

secret.StringData = map[string]string{OsdEncryptionSecretNameKeyName: key}
// Update the Kubernetes Secret
_, err = c.context.Clientset.CoreV1().Secrets(c.ClusterInfo.Namespace).Update(c.ClusterInfo.Context, secret, metav1.UpdateOptions{})
if err != nil && !kerrors.IsAlreadyExists(err) {
Member

Does Update() ever return an error that would result in already exists? I thought that would just be after calling Create()

Member Author

Yes, it's not required. I must have missed removing it after my earlier trials; removed it now.
Thanks.

Member

@travisn travisn left a comment

The changes look good, just a final suggestion for the CI efficiency

exit 0
fi
echo "encryption passphrase is not rotated, sleeping for 30 seconds"
sleep 30s
Member

If we're rotating every minute, we can reduce this sleep.

Suggested change
- sleep 30s
+ sleep 10s

@Rakshith-R Rakshith-R force-pushed the enc-key-rotation branch 2 times, most recently from c0482f5 to a158c57 Compare March 8, 2023 06:37
This commit adds KeyRotationSpec to cephcluster CR
to support encryption key rotation feature.
If enabled Rook will create a cronjob for each
encrypted PVC based OSD with given schedule set
to rotate their respective key encryption key.

Signed-off-by: Rakshith R <rar@redhat.com>

This commit adds util functions to support encryption key rotation
such as RemoveEncryptionKeySlot, AddEncryptionKey and
ensureEncryptionKey.

Signed-off-by: Rakshith R <rar@redhat.com>

This commit adds support for updating secret for default KMS.
This is required for encryption key rotation.

Signed-off-by: Rakshith R <rar@redhat.com>

This commit adds functionality to be able to rotate
key encryption key of encrypted PVC backed OSDs.
Necessary changes such as adding update functionality
to kms and rbac changes are made as well.

Signed-off-by: Rakshith R <rar@redhat.com>

This commit adds util function to create key rotation cron job
such as getKeyRotationContainer, getKeyRotationCronPodTemplateSpec
and makeKeyRotationCronJob.

Signed-off-by: Rakshith R <rar@redhat.com>

This commit adds code to reconcile key rotation cron jobs.

Signed-off-by: Rakshith R <rar@redhat.com>

Signed-off-by: Rakshith R <rar@redhat.com>

This commit adds test for key rotation.

Signed-off-by: Rakshith R <rar@redhat.com>

Successfully merging this pull request may close these issues.

Implement encryption key rotation for cluster-wide (OSDs) encryption
4 participants