
improvement(kms): cleanup AWS KMS aliases only if there are no VMs #8030

Merged (4 commits) on Jul 17, 2024

Conversation

@vponomaryov (Contributor) commented on Jul 16, 2024

Testing

  • [ ]

PR pre-checks (self review)

  • I added the relevant backport labels
  • I didn't leave commented-out/debugging code

Reminders

  • Add new configuration options and document them (in sdcm/sct_config.py)
  • Add unit tests to cover my changes (under unit-test/ folder)
  • Update the Readme/doc folder relevant to this change (if needed)

@vponomaryov added the backport/2024.1 (Need backport to 2024.1), backport/2024.2 (Need backport to 2024.2) and Ready for review labels on Jul 16, 2024
@vponomaryov requested review from fruch and soyacz on July 16, 2024 19:21
@vponomaryov (Contributor, Author) commented:

keep DB nodes: https://jenkins.scylladb.com/job/scylla-staging/job/valerii/job/vp-longevity-10gb-3h-test/118

./docker/env/hydra.sh --execute-on-runner 13.49.158.148 clean-resources --post-behavior --test-id 37cab0e7-d80e-41c5-a5bc-28ac62647313
...
20:26:31  Post behavior keep for db_nodes. Keep resources running
20:26:31  Post behavior keep for monitor_nodes. Keep resources running
20:26:31  Post behavior keep for loader_nodes. Keep resources running
20:26:31  Post behavior keep for k8s_cluster. Keep resources running
20:26:32  There are no instances to remove in AWS region eu-north-1
20:26:32  No capacity reservations to cancel.
20:26:32  There are no EIPs to remove in AWS region eu-north-1
20:26:33  There are no SGs to remove in AWS region eu-north-1
20:26:33  Skip AWS KMS alias deletion because DB nodes deletion was not scheduled
20:26:33  Cleanup for the {'TestId': '37cab0e7-d80e-41c5-a5bc-28ac62647313'} resources has been finished
20:26:34  Cleaning SSH agent
20:26:34  Agent pid 15491 killed
20:26:34  + echo 'Finished cleaning resources.'
20:26:34  Finished cleaning resources.

destroy DB nodes: https://jenkins.scylladb.com/job/scylla-staging/job/valerii/job/vp-longevity-10gb-3h-test/119

./docker/env/hydra.sh --execute-on-runner 16.170.243.50 clean-resources --post-behavior --test-id 5665e8b2-4723-496d-91f6-f6435c1f98ff
...
20:27:10  Post behavior destroy for db_nodes. Schedule cleanup
20:27:10  Post behavior destroy for monitor_nodes. Schedule cleanup
20:27:10  Post behavior destroy for loader_nodes. Schedule cleanup
20:27:10  Post behavior destroy for k8s_cluster. Schedule cleanup
20:27:10  Going to delete 'i-0b66c842f96a620d3' [name=longevity-10gb-3h-dev-loader-node-5665e8b2-2] 
20:27:11  Going to delete 'i-096ab304e17b7ba9b' [name=longevity-10gb-3h-dev-loader-node-5665e8b2-1] 
20:27:11  Going to delete 'i-072ad393b29600a65' [name=longevity-10gb-3h-dev-db-node-5665e8b2-7] 
20:27:11  Going to delete 'i-068f6c9efbb74dea6' [name=longevity-10gb-3h-dev-db-node-5665e8b2-2] 
20:27:12  Going to delete 'i-0ebe6424a0a4ab2d9' [name=longevity-10gb-3h-dev-db-node-5665e8b2-4] 
20:27:12  Going to delete 'i-05160e80556f3b2a2' [name=longevity-10gb-3h-dev-db-node-5665e8b2-3] 
20:27:12  Going to delete 'i-061822528bad6d4bf' [name=longevity-10gb-3h-dev-db-node-5665e8b2-6] 
20:27:13  Going to delete 'i-0e6c0527dd2358290' [name=longevity-10gb-3h-dev-db-node-5665e8b2-5] 
20:27:13  Going to delete 'i-0222716310d046267' [name=longevity-10gb-3h-dev-monitor-node-5665e8b2-1] 
20:27:14  No capacity reservations to cancel.
20:27:14  There are no EIPs to remove in AWS region eu-north-1
20:27:15  There are no SGs to remove in AWS region eu-north-1
20:27:16  Deleting the 'alias/testid-5665e8b2-4723-496d-91f6-f6435c1f98ff' alias in the KMS
20:27:16  Cleanup for the {'TestId': '5665e8b2-4723-496d-91f6-f6435c1f98ff'} resources has been finished
20:27:17  Cleaning SSH agent
20:27:17  Agent pid 15723 killed
20:27:17  + echo 'Finished cleaning resources.'
20:27:17  Finished cleaning resources.
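
For illustration, a rough sketch of the decision the two clean-resources runs above make. The function and parameter names here are hypothetical, not SCT's actual code: the KMS alias deletion is only scheduled when the db_nodes post-behavior leads to node removal.

# Illustrative sketch only; names are assumptions, not SCT's real implementation.
def clean_test_resources(test_id: str, post_behaviors: dict) -> None:
    for node_type, behavior in post_behaviors.items():
        if behavior == "keep":
            print(f"Post behavior keep for {node_type}. Keep resources running")
        else:
            print(f"Post behavior {behavior} for {node_type}. Schedule cleanup")

    # ... instances, capacity reservations, EIPs and SGs are handled here ...

    if post_behaviors.get("db_nodes") == "keep":
        print("Skip AWS KMS alias deletion because DB nodes deletion was not scheduled")
    else:
        print(f"Deleting the 'alias/testid-{test_id}' alias in the KMS")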

Attempt to clean up the AWS KMS alias while DB nodes are still kept:

$ ./docker/env/hydra.sh clean-aws-kms-aliases -r eu-north-1 --time-delta-h 1
There is scylladb/hydra:v1.71-scylla-driver-3.26.9 in local cache, using it.
Going to run './sct.py  clean-aws-kms-aliases -r eu-north-1 --time-delta-h 1'...
...
KMS: Search for aliases older than '1' hours
KMS: [region 'eu-north-1'][key '07d2097b-3194-4dd3-a2d1-c9cee571f836'] read aliases
KMS: [region 'eu-north-1'][key '07d2097b-3194-4dd3-a2d1-c9cee571f836'] ignore the 'alias/qa-kms-key-for-rotation-1' alias as not matching
KMS: [region 'eu-north-1'][key '27a13c69-eb54-4db7-bd76-371ce43c4fb7'] read aliases
KMS: [region 'eu-north-1'][key '27a13c69-eb54-4db7-bd76-371ce43c4fb7'] ignore the 'alias/qa-kms-key-for-rotation-2' alias as not matching
KMS: [region 'eu-north-1'][key 'fb8e450f-c075-443b-8d7c-03852249c4d9'] read aliases
KMS: [region 'eu-north-1'][key 'fb8e450f-c075-443b-8d7c-03852249c4d9'] ignore the 'alias/qa-kms-key-for-rotation-3' alias as not matching
KMS: [region 'eu-north-1'][key 'fb8e450f-c075-443b-8d7c-03852249c4d9'] Found old alias -> 'alias/testid-37cab0e7-d80e-41c5-a5bc-28ac62647313' (2024-07-16 15:50:53.632000+00:00). Skip it because related DB nodes still exist.
KMS: finished cleaning up old aliases

Deletion of the alias after a separate deletion of DB nodes:

$ ./docker/env/hydra.sh clean-aws-kms-aliases -r eu-north-1 --time-delta-h 2
There is scylladb/hydra:v1.71-scylla-driver-3.26.9 in local cache, using it.
Going to run './sct.py  clean-aws-kms-aliases -r eu-north-1 --time-delta-h 2'...
...
KMS: Search for aliases older than '2' hours
KMS: [region 'eu-north-1'][key '07d2097b-3194-4dd3-a2d1-c9cee571f836'] read aliases
KMS: [region 'eu-north-1'][key '07d2097b-3194-4dd3-a2d1-c9cee571f836'] ignore the 'alias/qa-kms-key-for-rotation-1' alias as not matching
KMS: [region 'eu-north-1'][key '27a13c69-eb54-4db7-bd76-371ce43c4fb7'] read aliases
KMS: [region 'eu-north-1'][key '27a13c69-eb54-4db7-bd76-371ce43c4fb7'] ignore the 'alias/qa-kms-key-for-rotation-2' alias as not matching
KMS: [region 'eu-north-1'][key 'fb8e450f-c075-443b-8d7c-03852249c4d9'] read aliases
KMS: [region 'eu-north-1'][key 'fb8e450f-c075-443b-8d7c-03852249c4d9'] ignore the 'alias/qa-kms-key-for-rotation-3' alias as not matching
KMS: [region 'eu-north-1'][key 'fb8e450f-c075-443b-8d7c-03852249c4d9'] deleting old alias -> 'alias/testid-37cab0e7-d80e-41c5-a5bc-28ac62647313' (2024-07-16 15:50:53.632000+00:00)
Deleting the 'alias/testid-37cab0e7-d80e-41c5-a5bc-28ac62647313' alias in the KMS
KMS: finished cleaning up old aliases
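
A rough sketch of what the clean-aws-kms-aliases flow above does, assuming boto3. The 'alias/testid-' prefix and the TestId tag follow the logs, but the function name and the exact filtering are assumptions, not the PR's actual code:

# Illustrative sketch, assuming boto3; not the PR's actual implementation.
from datetime import datetime, timedelta, timezone

import boto3

def clean_old_kms_aliases(region: str, time_delta_h: int) -> None:
    kms = boto3.client("kms", region_name=region)
    ec2 = boto3.client("ec2", region_name=region)
    threshold = datetime.now(timezone.utc) - timedelta(hours=time_delta_h)

    for page in kms.get_paginator("list_aliases").paginate():
        for alias in page["Aliases"]:
            name = alias["AliasName"]
            if not name.startswith("alias/testid-"):
                continue  # e.g. the 'alias/qa-kms-key-for-rotation-*' aliases are ignored as not matching
            if alias.get("LastUpdatedDate", alias["CreationDate"]) > threshold:
                continue  # the alias is newer than the requested time delta
            test_id = name[len("alias/testid-"):]
            # Skip the deletion if any VM tagged with the same test-id still exists.
            reservations = ec2.describe_instances(Filters=[
                {"Name": "tag:TestId", "Values": [test_id]},
                {"Name": "instance-state-name", "Values": ["pending", "running", "stopping", "stopped"]},
            ])["Reservations"]
            if any(r["Instances"] for r in reservations):
                print(f"Found old alias -> '{name}'. Skip it because related DB nodes still exist.")
                continue
            print(f"Deleting the '{name}' alias in the KMS")
            kms.delete_alias(AliasName=name)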

@fruch (Contributor) left a review comment:

The refactor is breaking the unit tests.

It will allow more flexibility in importing other common utils and help avoid circular imports.

Make the AWS KMS alias cleanup logic check for VM presence and skip the deletion if any VMs tagged with the same test-id are found.

This allows kept DB clusters to remain usable during their lifetime. Before this change, the AWS KMS alias of a test run was deleted even if the DB cluster was set to be kept (the 'keep' tag).
@soyacz (Contributor) left a review comment:

LGTM

return self.cleaner.gcloud.run(f"container clusters delete {self.name} --zone {self.zone} --quiet")


class GkeCleaner(GcloudContainerMixin):

Contributor comment:

I understand this was only moved here, but I'll leave a comment for the future: the API of this class is very misleading and complicated.
GkeCleaner doesn't have any 'cleaning' methods; it mostly lists resources (and, e.g., list_gke_clusters returns a list of GkeClusterForCleaner, which is again misleading, since it does not return the GKE clusters themselves).

When introducing classes with simple goals like cleaning, let's make sure the names and methods are consistent with that goal.

Author's reply:

I agree, but it is out of scope for this PR.
This PR is about fixing the AWS KMS alias cleanup and about the dependency change that comes with moving the existing functionality.

@vponomaryov (Contributor, Author) commented on Jul 17, 2024

It is needed as a fix only for the enterprise branches, but it definitely makes sense to backport it to 6.0 and 5.4 as well, just to ease other backports by reducing code diffs.
@fruch, @soyacz, do you agree to also mark it for the 6.0 and 5.4 backports?

@fruch (Contributor) commented on Jul 17, 2024


yes

@fruch (Contributor) left a review comment:

LGTM

@scylladbbot added the backport/2024.2-done (Commit backported to 2024.2) label and removed the backport/2024.2 (Need backport to 2024.2) label on Jul 17, 2024
Labels
backport/5.4 (Need backport to 5.4), backport/6.0, backport/2023.1 (Need to backport to 2023.1), backport/2024.1 (Need backport to 2024.1), backport/2024.2-done (Commit backported to 2024.2), promoted-to-master, Ready for review