Add support for IKS and ROKS clusters on classic infra #336

bhpratt · 2024-05-03T18:29:44Z

Description

Currently the module agents only run on VPC infra clusters. Upon review, this is due to the name lookup. So, the main change is to add a block to allow the module to look up the name of a classic cluster:

data "ibm_container_cluster" "cluster" {
    count             = var.is_vpc == false ? 1 : 0
    name              = var.cluster_id
    resource_group_id = var.cluster_resource_group_id
  }

This implementation requires a new variable: is_vpc_cluster that will be toggled by the user. Default is true.

Additional changes were to add a new test case and update example to test a classic cluster.

Outstanding questions:

I included some new blocks and updated fields in examples/basic to avoid duplicating lots of terraform blocks in a new example. Is this ok, or should I split this out into examples/classic and duplicate the observability blocks?
Tests dynamically choose region based on the usage in the account for VPC. These regions don't match directly to classic data centers. Right now TestRunBasicAgentsClassic uses the default syd01 data center. Do I need to update this behavior?

Discussed above w/Conall. Will keep as is.

make run-tests-local
...
--- PASS: TestRunBasicAgentsKubernetes (1931.12s)
...
--- PASS: TestRunBasicAgentsClassic (2026.78s)
...
--- PASS: TestRunBasicAgents (2071.10s)
...
--- PASS: TestRunUpgrade (2160.04s)
PASS
ok  	github.com/terraform-ibm-modules/terraform-ibm-observability-agents	2161.337s

Release required?

No release
Patch release (x.x.X)
Minor release (x.X.x)
Major release (X.x.x)

Release notes content

Run the pipeline

If the CI pipeline doesn't run when you create the PR, the PR requires a user with GitHub collaborators access to run the pipeline.

Run the CI pipeline when the PR is ready for review and you expect tests to pass. Add a comment to the PR with the following text:

/run pipeline

Checklist for reviewers

If relevant, a test for the change is included or updated with this PR.
If relevant, documentation for the change is included or updated with this PR.

For mergers

Use a conventional commit message to set the release level. Follow the guidelines.
Include information that users need to know about the PR in the commit message. The commit message becomes part of the GitHub release notes.
Use the Squash and merge option.

variables.tf

ocofaigh · 2024-05-20T18:00:55Z

/run pipeline

ocofaigh · 2024-05-20T19:44:25Z

@bhpratt Have you tried to deploy locally on a classic cluster? The test failed with below error, but no way to know why unless you locally deploy and check cluster logs

TestRunBasicAgentsClassic 2024-05-20T19:05:56Z retry.go:99: Returning due to fatal error: FatalError{Underlying: error while running command: exit status 1; ╷
│ Error: local-exec provisioner error
│ 
│   with module.observability_agents.helm_release.cloud_monitoring_agent[0],
│   on ../../main.tf line 203, in resource "helm_release" "cloud_monitoring_agent":
│  203:   provisioner "local-exec" {
│ 
│ Error running command '../../scripts/confirm-rollout-status.sh sysdig-agent
│ ibm-observe': exit status 1. Output: Waiting for daemon set "sysdig-agent"
│ rollout to finish: 0 of 1 updated pods are available...
│ error: timed out waiting for the condition
│ 
╵}

bhpratt · 2024-05-21T17:55:25Z

Yes - ran local tests. Ran again today and all succeeded, including TestRunBasicAgentsClassic:

ibm_container_cluster.cluster[0]: Destruction complete after 5m11s

Destroy complete! Resources: 10 destroyed.

TestRunBasicAgentsClassic 2024-05-21T11:31:45-04:00 tests.go:434: END: Destroy
--- PASS: TestRunBasicAgentsClassic (2557.21s)

I did notice that the classic cluster defaults to one node, so I added a field to the example to bump that to two nodes in case there was resource contention for the pods.

ocofaigh · 2024-05-21T21:41:50Z

/run pipeline

bhpratt · 2024-05-22T12:51:53Z

This error:

TestRunBasicAgentsClassic 2024-05-21T21:50:47Z logger.go:66: ╷
TestRunBasicAgentsClassic 2024-05-21T21:50:47Z logger.go:66: │ Error: Request failed with status code: 400, ServerErrorResponse: {"incidentID":"3dd0360e-fd0d-4c24-9d68-6319c55e0306,3dd0360e-fd0d-4c24-9d68-6319c55e0306","code":"E6967","description":"You must specify a private VLAN for worker node connectivity.","type":"BadRequest"}

Happens when the private vlan doesn't get cleaned up quickly enough (auto-created vlans are also auto-deleted). IKS detects it and won't create a new VLAN. I can add some stanzas to manually create (and destroy) vlans.

bhpratt · 2024-05-23T14:25:19Z

Added static VLANs so VLAN clean-up will be controlled explicitly by terraform

ocofaigh · 2024-05-23T15:22:47Z

/run pipeline

bhpratt · 2024-05-24T19:01:06Z

I can't re-create this error locally. I've run this 6-7 times w/no errors. Which makes me think there's some pipeline configuration that I'm not accounting for.

One thing I can think of - Sysdig containers sometimes relies on connectivity to sysdig.com (if it needs to pull down a pre-compiled kernel modules). LogDNA doesn't have this requirement. So far the pipeline is only failing on the Sysdig deployment - and only on the classic cluster test. Is there possibly any classic infra firewall or security groups in this pipeline account that could be impacting public connectivity for the classic clusters?

It appears to be timing out on the deploy of the sysdig agent:

TestRunBasicAgentsClassic 2024-05-23T16:12:34Z logger.go:66: module.observability_agents.helm_release.cloud_monitoring_agent[0]: Still creating... [20m10s elapsed]
...
Error: context deadline exceeded

I could try switching data centers from syd01 to something else, as well - but that would just be trying some shots in the dark.

I think I'm going to need some assistance to figure out this test failure.

ocofaigh · 2024-06-26T08:14:45Z

/run pipeline

ocofaigh · 2024-07-01T16:47:15Z

sysdig pods readiness failing with:
2024-07-01 16:19:13.396, 31465.51275, Error, endpoint:cm_collector_endpoint:775: connect():timeout: Timeout

Something seems to be blocking the connection to host:

# curl -vk https://ingest.private.br-sao.monitoring.cloud.ibm.com/
* Rebuilt URL to: https://ingest.private.br-sao.monitoring.cloud.ibm.com/
*   Trying 166.9.84.19...
* TCP_NODELAY set
* connect to 166.9.84.19 port 443 failed: Connection timed out
*   Trying 166.9.82.19...

Debugging..

ocofaigh · 2024-07-01T19:28:27Z

/run pipeline

ocofaigh · 2024-07-01T19:31:00Z

Had to set this to On in the account for classic cluster to be able to use private endpoint

terraform-ibm-modules-ops · 2024-07-01T20:25:45Z

🎉 This issue has been resolved in version 1.27.0 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀

add ablity to lookup name for classic cluster

26706ae

bhpratt requested review from ocofaigh and shemau as code owners May 3, 2024 18:29

bhpratt changed the title ~~add ablity to lookup name for classic cluster~~ add ability to lookup name for classic cluster May 3, 2024

bhpratt marked this pull request as draft May 3, 2024 18:30

bhpratt changed the title ~~add ability to lookup name for classic cluster~~ Add support for IKS and ROKS clusters on classic infra May 8, 2024

clarify is_vpc usage

c6cfbfc

ocofaigh reviewed May 8, 2024

View reviewed changes

variables.tf Outdated Show resolved Hide resolved

bhpratt added 7 commits May 8, 2024 12:04

update variable to is_vpc_cluster

0910c33

add test case and update example for classic clusters

1bd57af

rename rg

20bf958

fix variable declaration in test cases

48ef6f0

add back doc and update readme with new flag

2ca0380

fix typo

bc4431e

format and doc updates from pre-commit

7798f4b

bhpratt marked this pull request as ready for review May 9, 2024 15:09

bhpratt added 3 commits May 9, 2024 12:26

reverting to public endpoint for example cluster and adding skip check

d89b532

Merge branch 'main' into main

9fa2367

Merge branch 'main' into main

509f6d4

ocofaigh previously approved these changes May 20, 2024

View reviewed changes

increase example classic cluster to 2 nodes

8cac9f1

bhpratt dismissed ocofaigh’s stale review via 8cac9f1 May 21, 2024 17:53

add static vlans for classic cluster example

642b0aa

bhpratt and others added 2 commits May 28, 2024 08:45

Merge branch 'main' into main

624c5f8

Merge branch 'main' into main

81acb8b

Merge branch 'main' into main

18aa982

ocofaigh approved these changes Jul 1, 2024

View reviewed changes

ocofaigh merged commit 67c47fb into terraform-ibm-modules:main Jul 1, 2024

terraform-ibm-modules-ops added the released label Jul 1, 2024

Add support for IKS and ROKS clusters on classic infra #336

Add support for IKS and ROKS clusters on classic infra #336

Uh oh!

Conversation

bhpratt commented May 3, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Release required?

Release notes content

Run the pipeline

Checklist for reviewers

For mergers

Uh oh!

Uh oh!

ocofaigh commented May 20, 2024

Uh oh!

ocofaigh commented May 20, 2024

Uh oh!

bhpratt commented May 21, 2024

Uh oh!

ocofaigh commented May 21, 2024

Uh oh!

bhpratt commented May 22, 2024

Uh oh!

bhpratt commented May 23, 2024

Uh oh!

ocofaigh commented May 23, 2024

Uh oh!

bhpratt commented May 24, 2024

Uh oh!

ocofaigh commented Jun 26, 2024

Uh oh!

ocofaigh commented Jul 1, 2024

Uh oh!

ocofaigh commented Jul 1, 2024

Uh oh!

ocofaigh commented Jul 1, 2024

Uh oh!

terraform-ibm-modules-ops commented Jul 1, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

bhpratt commented May 3, 2024 •

edited

Loading