Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for CNPG Hibernation functionality #720

Merged
merged 18 commits into from
Apr 23, 2024
Merged

Add support for CNPG Hibernation functionality #720

merged 18 commits into from
Apr 23, 2024

Conversation

bonesmoses
Copy link
Contributor

In order to cover PLAT-467, we needed to provide functionality to toggle the CNPG hibernation notation. This branch should accomplish that. Functional test included.

bonesmoses and others added 6 commits April 12, 2024 16:28
… from

following the documentation for the CNPG declarative hibernation feature:

https://cloudnative-pg.io/documentation/current/declarative_hibernation/

This introduces the cnpg.io/hibernation annotation and sets it to "on" when
the spec.stop attribute is true. Currently this only circumvents reconcile
blocks which depend on the cluster to be running in some way, but I suspect
it would be possible to be more aggressive depending on how much extra
operator overhead we want to remove. This reactivates the "deprecated" code
surrounding the spec.stop attribute; former attempts to use this flag resulted
in a cluster permanently stuck in a warning state.
…ster.

Also fixed a possible bug where the pods_to_fence variable masks a
function of the same name.

Unfortunately this will not work as written. Once we unset the "stop" flag,
all of our existing circumvention will fail to restart the cluster because
of the current fencing code. The previous spec.stop state will be lost, and
the operator will immediately try to fence pods that don't exist. This means
the fencing code needs to check for the hibernation flag. This code will be
added later.
Fixed the issue where the cluster can't thaw after spec.stop is
set to false. The issue was that pods_to_fence is designed to be
called before cnpg_cluster_from_cdb where we toggle the hibernation
annotation. This function implicitly seeks a Primary pod, but since
there are no pods, it simply enters a permanent retry loop while
expecting the Primary pod to appear. Now we check directly for the
hibernation annotation here and in reconcile_cnpg to guarantee a
code path to the annotation.
Added a test for the new hibernation logic. This test will create a new
cluster, put it into hibernation state, and remove hibernation, while
verifying each.

I also added a simple class to help us eventually remove a lot of the
boilerplate code all the tests currently seem to include. The
functional_test_hibernate test is an example implementation.
tembo-operator/src/controller.rs Show resolved Hide resolved
tembo-operator/src/controller.rs Outdated Show resolved Hide resolved
tembo-operator/src/controller.rs Outdated Show resolved Hide resolved
Copy link
Contributor

@sjmiller609 sjmiller609 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice, cleanly done. I like how the new code is clearly separated from existing. It seems to have bare-minimum touching of the existing code, which is perfect, i.e. the minimum number of lines modified to get the job done, and new code is in new functions. Beautiful!

tembo-operator/src/cloudnativepg/cnpg.rs Outdated Show resolved Hide resolved
tembo-operator/src/controller.rs Outdated Show resolved Hide resolved
tembo-operator/src/controller.rs Show resolved Hide resolved
tembo-operator/tests/integration_tests.rs Outdated Show resolved Hide resolved
bonesmoses and others added 7 commits April 18, 2024 15:34
Steve fixed a typo!

Co-authored-by: Steven Miller <sjmiller609@gmail.com>
Converted the basic-pgsql test to use the TestCore base struct. A focused invocation shows this test is still functional post conversion.

Removed state field from TestCore struct, as it's just a means of obtaining an Arc context.

Further investigation shows the "pods" attribute almost never comes without a named pod instance, but it is used in isolation in nested calls. Later refactors should account for this and add another field to the struct for this named pod instance (usually named {cluster-name}-1). The TestCore class is serving primarily as an implementation example for now.
@bonesmoses
Copy link
Contributor Author

bonesmoses commented Apr 19, 2024

Is it possible to merge while ignoring the functional tests? I've run them 4 times now, and different ones fail every time. Additionally, we should track down the race conditions that keep causing this.

The tests were being run with this command:

```
cargo test --jobs 1 -- --ignored --nocapture
```

However it should be noted that `--jobs 1` only affects compilation, not the amount of concurrency. We were likely overloading the test system and causing timeouts. We could probably use 4-8 threads, but I'm setting this to 1 since that seems to have been the initial intent.
Copy link
Contributor

@sjmiller609 sjmiller609 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs to also stop app services by setting replicas to zero

sjmiller609 and others added 2 commits April 22, 2024 15:27
… setting their replica count to 0. Restarts will set this back to 1.
@bonesmoses bonesmoses merged commit 67f1ed0 into main Apr 23, 2024
9 checks passed
@bonesmoses bonesmoses deleted the plat-467 branch April 23, 2024 17:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants