-
Notifications
You must be signed in to change notification settings - Fork 21
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for CNPG Hibernation functionality #720
Conversation
… from following the documentation for the CNPG declarative hibernation feature: https://cloudnative-pg.io/documentation/current/declarative_hibernation/ This introduces the cnpg.io/hibernation annotation and sets it to "on" when the spec.stop attribute is true. Currently this only circumvents reconcile blocks which depend on the cluster to be running in some way, but I suspect it would be possible to be more aggressive depending on how much extra operator overhead we want to remove. This reactivates the "deprecated" code surrounding the spec.stop attribute; former attempts to use this flag resulted in a cluster permanently stuck in a warning state.
…ster. Also fixed a possible bug where the pods_to_fence variable masks a function of the same name. Unfortunately this will not work as written. Once we unset the "stop" flag, all of our existing circumvention will fail to restart the cluster because of the current fencing code. The previous spec.stop state will be lost, and the operator will immediately try to fence pods that don't exist. This means the fencing code needs to check for the hibernation flag. This code will be added later.
Fixed the issue where the cluster can't thaw after spec.stop is set to false. The issue was that pods_to_fence is designed to be called before cnpg_cluster_from_cdb where we toggle the hibernation annotation. This function implicitly seeks a Primary pod, but since there are no pods, it simply enters a permanent retry loop while expecting the Primary pod to appear. Now we check directly for the hibernation annotation here and in reconcile_cnpg to guarantee a code path to the annotation.
Added a test for the new hibernation logic. This test will create a new cluster, put it into hibernation state, and remove hibernation, while verifying each. I also added a simple class to help us eventually remove a lot of the boilerplate code all the tests currently seem to include. The functional_test_hibernate test is an example implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice, cleanly done. I like how the new code is clearly separated from existing. It seems to have bare-minimum touching of the existing code, which is perfect, i.e. the minimum number of lines modified to get the job done, and new code is in new functions. Beautiful!
Steve fixed a typo! Co-authored-by: Steven Miller <sjmiller609@gmail.com>
Converted the basic-pgsql test to use the TestCore base struct. A focused invocation shows this test is still functional post conversion. Removed state field from TestCore struct, as it's just a means of obtaining an Arc context. Further investigation shows the "pods" attribute almost never comes without a named pod instance, but it is used in isolation in nested calls. Later refactors should account for this and add another field to the struct for this named pod instance (usually named {cluster-name}-1). The TestCore class is serving primarily as an implementation example for now.
Is it possible to merge while ignoring the functional tests? I've run them 4 times now, and different ones fail every time. Additionally, we should track down the race conditions that keep causing this. |
…sts to fail from high waits.
The tests were being run with this command: ``` cargo test --jobs 1 -- --ignored --nocapture ``` However it should be noted that `--jobs 1` only affects compilation, not the amount of concurrency. We were likely overloading the test system and causing timeouts. We could probably use 4-8 threads, but I'm setting this to 1 since that seems to have been the initial intent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs to also stop app services by setting replicas to zero
… setting their replica count to 0. Restarts will set this back to 1.
In order to cover PLAT-467, we needed to provide functionality to toggle the CNPG hibernation notation. This branch should accomplish that. Functional test included.