Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wait until ingress is ready #747

Closed
vburckhardt opened this issue Mar 19, 2024 · 8 comments
Closed

Wait until ingress is ready #747

vburckhardt opened this issue Mar 19, 2024 · 8 comments
Labels
bug 🐞 Something isn't working internal-synced

Comments

@vburckhardt
Copy link
Member

Description

Some reports that clusters are not fully ready after install terraform-ibm-modules/terraform-ibm-mas#24

There is already a condition at

default = "IngressReady"

Possibly porting https://github.com/terraform-ibm-modules/terraform-ibm-base-ocp-vpc/blob/404cb937a0d9f9e558c77c8852a5f88b3ff21d84/main.tf#L383 may be sufficient, to investigate.

New or affected modules


By submitting this issue, you agree to follow our Code of Conduct

@ocofaigh
Copy link
Member

In order to run the script at https://github.com/terraform-ibm-modules/terraform-ibm-base-ocp-vpc/blob/404cb937a0d9f9e558c77c8852a5f88b3ff21d84/main.tf#L383 in landing zone you would need access to the private network to be able to connect to cluster with kubectl since public endpoint is disabled by default. There may also be other network restrictions in place that might block that connection?

@ocofaigh
Copy link
Member

ocofaigh commented May 8, 2024

We think think this is why many of the tests used in the MAS PR test pipeline fail intermittently too - seems the cluster is not fully ready.

@ocofaigh
Copy link
Member

ocofaigh commented May 9, 2024

I do believe we can easily recreate this. Even after the cluster ingress goes green, the master seems to then get restarted to configure KMS encryption:
image
image

After reaching out to IKS, this seems to be a known issue:

ROKS version 4.14 and later will see various API services fail discovery during a restart of the Kubernetes API server restart.

There is an internal issue tracking this with IKS: https://github.ibm.com/alchemy-containers/armada-network/issues/8843

Meanwhile, perhaps we could add extra checks on the master status AFTER the ingress has gone green?

@ocofaigh
Copy link
Member

ocofaigh commented May 9, 2024

I tried running the confirm_network_healthy.sh script but it didn't help. The Ingress is green and the script passed after 2 seconds however the master status was "Key management service enablement in progress".
I think ibm_container_vpc_cluster will need to be updated so there is a new option for wait_till to return after masters have been updated after Key management service enablement is complete.
I have raised an IKS support case

@ocofaigh
Copy link
Member

Things seem alot better in OCP 4.14. They fixed the issue I quoted above and confirmed the following:

So, if this happened during KMS update, it might have failed attempting to create the redhat-marketplace namespace, because openshift-pipelines has an admission webhook for namespaces. If the cluster is 4.13 or earlier we know there is a large window during a master update (kms, patch update,master refresh, etc) - at one point around 10 minutes - where webhooks, aggregated APIs, exec and logs requests, etc will randomly fail.
That's a fact of life for 4.13 and earlier (which use an OpenVPN tunnel) and why 4.14 and later use Konnectivity.

@ocofaigh
Copy link
Member

IKS team has confirmed there is an issue with the terraform provider marking the cluster as ready too early. This is what they said:

Hello Conall,

Thank you for your patience. Our internal team was able to replicate the issue and they are working towards a resolution. From our side, we would need to make some changes to the terraform provider to fix the issue that you are experiencing. The development of the solution is still in the early phases. We will provide you an update next week.

@ocofaigh
Copy link
Member

This might help:

our internal team has found the following:
the terraform plugin has a new field in kms_config called wait_for_apply, it is a bool.
If set to true, terraform will wait until the KMS is applied to the master, ready and deployed.

Going to test it out

@ocofaigh
Copy link
Member

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug 🐞 Something isn't working internal-synced
Projects
None yet
Development

No branches or pull requests

3 participants