[flaky-test] Certificate Rotation #45577

Open
bigkevmcd opened this issue May 22, 2024 · 7 comments
Labels: kind/flaky-test, priority/0, QA/None, status/release-blocker, team/hostbusters

Comments

@bigkevmcd
Contributor

Flaky Test

=== RUN   Test_Operation_SetA_Custom_CertificateRotation
time="2024-05-22T13:56:19Z" level=info msg="Running in single server mode, will not peer connections"
    certificaterotation.go:43: cluster test-custom-certificate-rotation-operations rotate certificates wait failed on: rotate certificates wait did not succeed : timeout waiting condition: context deadline exceeded
        cluster test-custom-certificate-rotation-operations test data bundle:
--- FAIL: Test_Operation_SetA_Custom_CertificateRotation (1201.38s)

func Test_Operation_SetA_Custom_CertificateRotation(t *testing.T) {

Release Branch
source branch of PR: release/v2.8

Drone Build (if applicable)

pipeline stage URL:

https://drone-pr.rancher.io/rancher/rancher/39110/4/2

@andreas-kupries
Contributor

I am not seeing the cert rotation flakiness specifically, but I do see what looks like a more general breakage around cluster provisioning.

This is for a PR sitting on release/v2.9: #45269

In case it matters, a local k3s-based Rancher starts up just fine for that PR.

My latest drone logs are at

The message is generally the same across various failures:

... failed on: prov cluster is not ready: timeout waiting condition: context deadline exceeded

Question: Is each test creating its own cluster, and removing it later?
I ask because I see passing tests too.

I see only 2 tests fail in each of the 5 provisioning stages, and they are mostly different tests from stage to stage.

Failing: Test_Provisioning_Custom_OneNodeWithDelete, Test_Provisioning_MP_SingleNodeAllRolesWithDelete, Test_Provisioning_Custom_ThreeNode, Test_Operation_SetA_Custom_CertificateRotation, Test_Operation_SetA_MP_CertificateRotation, Test_Operation_SetB_Custom_EtcdSnapshotOperationsOnNewCombinedNode, Test_Operation_SetB_MP_EtcdSnapshotOperationsWithThreeEtcdNodesOnNewNode

In the build-pr failures I see the same failed on ... message, after the unit tests were run and passed.

@andreas-kupries
Contributor

Created a PR without material code changes (comment fix).
Seeing failed builds there too, see https://drone-pr.rancher.io/rancher/rancher/39168
However, none of them show the context deadline exceeded error from my branch :(
Now wondering if the addition of the status field and its handling slowed something down enough to trigger these timeouts.

@samjustus added the team/hostbusters label May 28, 2024
@bigkevmcd
Contributor Author

Yeah, I'm still getting Test_Operation_SetB_MP_EtcdSnapshotOperationsWithThreeEtcdNodesOnNewNode failing.

Given the nature of the change in PR #45572, this test should not be affected at all.

@Oats87
Contributor

Oats87 commented May 31, 2024

Investigating this, I see that operations started taking significantly longer after v1.27.11+rke2r1 was released: almost a 200-second difference in my benchmark setup.

As a temporary workaround, we can pin the RKE2 version to v1.27.10+rke2r1 for now, which should hopefully unblock CI.
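
One way such a pin could be expressed so that it is easy to lift later is sketched below. The function name `rke2VersionForTests` and the `TEST_RKE2_VERSION` environment variable are hypothetical names invented for this example; only the version string comes from this thread, and the actual Rancher CI configuration may select versions quite differently.

```go
package main

import (
	"fmt"
	"os"
)

// pinnedRKE2Version is the workaround version named in this thread.
const pinnedRKE2Version = "v1.27.10+rke2r1"

// rke2VersionForTests returns the RKE2 version the tests should provision.
// TEST_RKE2_VERSION is a hypothetical override, not a real Rancher CI variable.
// The lookup function is injected so the choice can be tested without
// touching the process environment.
func rke2VersionForTests(lookup func(string) (string, bool)) string {
	if v, ok := lookup("TEST_RKE2_VERSION"); ok && v != "" {
		return v
	}
	return pinnedRKE2Version
}

func main() {
	// With no override set, this prints the pinned version.
	fmt.Println(rke2VersionForTests(os.LookupEnv))
}
```

Keeping the pin in a single constant with an explicit override means unpinning is a one-line change, and a newer release can be tried in a single CI run without editing the tests.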

@snasovich
Collaborator

Adding a milestone and some additional labels to ensure we circle back on this and address the version pinning.

@slickwarren
Contributor

No QA required; closing this issue.

@snasovich
Collaborator

Reopening to track unpinning per #45577 (comment). Changing the milestone so it doesn't appear to be blocking the 2.9.0 release.

@snasovich reopened this Jul 3, 2024
@snasovich modified the milestones: v2.9.0, v2.9-Next2 Jul 3, 2024

7 participants