Skip to content

Conversation

@bhpratt
Copy link
Contributor

@bhpratt bhpratt commented May 3, 2024

Description

Currently the module agents only run on VPC infra clusters. Upon review, this is due to the name lookup. So, the main change is to add a block to allow the module to look up the name of a classic cluster:

data "ibm_container_cluster" "cluster" {
    count             = var.is_vpc == false ? 1 : 0
    name              = var.cluster_id
    resource_group_id = var.cluster_resource_group_id
  }

This implementation requires a new variable: is_vpc_cluster that will be toggled by the user. Default is true.

Additional changes were to add a new test case and update example to test a classic cluster.

Outstanding questions:

  • I included some new blocks and updated fields in examples/basic to avoid duplicating lots of terraform blocks in a new example. Is this ok, or should I split this out into examples/classic and duplicate the observability blocks?
  • Tests dynamically choose region based on the usage in the account for VPC. These regions don't match directly to classic data centers. Right now TestRunBasicAgentsClassic uses the default syd01 data center. Do I need to update this behavior?

Discussed above w/Conall. Will keep as is.

make run-tests-local
...
--- PASS: TestRunBasicAgentsKubernetes (1931.12s)
...
--- PASS: TestRunBasicAgentsClassic (2026.78s)
...
--- PASS: TestRunBasicAgents (2071.10s)
...
--- PASS: TestRunUpgrade (2160.04s)
PASS
ok  	github.com/terraform-ibm-modules/terraform-ibm-observability-agents	2161.337s

Release required?

  • No release
  • Patch release (x.x.X)
  • Minor release (x.X.x)
  • Major release (X.x.x)
Release notes content

Run the pipeline

If the CI pipeline doesn't run when you create the PR, the PR requires a user with GitHub collaborators access to run the pipeline.

Run the CI pipeline when the PR is ready for review and you expect tests to pass. Add a comment to the PR with the following text:

/run pipeline

Checklist for reviewers

  • If relevant, a test for the change is included or updated with this PR.
  • If relevant, documentation for the change is included or updated with this PR.

For mergers

  • Use a conventional commit message to set the release level. Follow the guidelines.
  • Include information that users need to know about the PR in the commit message. The commit message becomes part of the GitHub release notes.
  • Use the Squash and merge option.

@bhpratt bhpratt requested review from ocofaigh and shemau as code owners May 3, 2024 18:29
@bhpratt bhpratt changed the title add ablity to lookup name for classic cluster add ability to lookup name for classic cluster May 3, 2024
@bhpratt bhpratt marked this pull request as draft May 3, 2024 18:30
@bhpratt bhpratt changed the title add ability to lookup name for classic cluster Add support for IKS and ROKS clusters on classic infra May 8, 2024
@bhpratt bhpratt marked this pull request as ready for review May 9, 2024 15:09
@ocofaigh
Copy link
Contributor

/run pipeline

ocofaigh
ocofaigh previously approved these changes May 20, 2024
@ocofaigh
Copy link
Contributor

@bhpratt Have you tried to deploy locally on a classic cluster? The test failed with below error, but no way to know why unless you locally deploy and check cluster logs

TestRunBasicAgentsClassic 2024-05-20T19:05:56Z retry.go:99: Returning due to fatal error: FatalError{Underlying: error while running command: exit status 1; ╷
│ Error: local-exec provisioner error
│ 
│   with module.observability_agents.helm_release.cloud_monitoring_agent[0],
│   on ../../main.tf line 203, in resource "helm_release" "cloud_monitoring_agent":
│  203:   provisioner "local-exec" {
│ 
│ Error running command '../../scripts/confirm-rollout-status.sh sysdig-agent
│ ibm-observe': exit status 1. Output: Waiting for daemon set "sysdig-agent"
│ rollout to finish: 0 of 1 updated pods are available...
│ error: timed out waiting for the condition
│ 
╵}

@bhpratt
Copy link
Contributor Author

bhpratt commented May 21, 2024

Yes - ran local tests. Ran again today and all succeeded, including TestRunBasicAgentsClassic:

ibm_container_cluster.cluster[0]: Destruction complete after 5m11s

Destroy complete! Resources: 10 destroyed.

TestRunBasicAgentsClassic 2024-05-21T11:31:45-04:00 tests.go:434: END: Destroy
--- PASS: TestRunBasicAgentsClassic (2557.21s)

I did notice that the classic cluster defaults to one node, so I added a field to the example to bump that to two nodes in case there was resource contention for the pods.

@ocofaigh
Copy link
Contributor

/run pipeline

@bhpratt
Copy link
Contributor Author

bhpratt commented May 22, 2024

This error:

TestRunBasicAgentsClassic 2024-05-21T21:50:47Z logger.go:66: ╷
TestRunBasicAgentsClassic 2024-05-21T21:50:47Z logger.go:66: │ Error: Request failed with status code: 400, ServerErrorResponse: {"incidentID":"3dd0360e-fd0d-4c24-9d68-6319c55e0306,3dd0360e-fd0d-4c24-9d68-6319c55e0306","code":"E6967","description":"You must specify a private VLAN for worker node connectivity.","type":"BadRequest"}

Happens when the private vlan doesn't get cleaned up quickly enough (auto-created vlans are also auto-deleted). IKS detects it and won't create a new VLAN. I can add some stanzas to manually create (and destroy) vlans.

@bhpratt
Copy link
Contributor Author

bhpratt commented May 23, 2024

Added static VLANs so VLAN clean-up will be controlled explicitly by terraform

@ocofaigh
Copy link
Contributor

/run pipeline

@bhpratt
Copy link
Contributor Author

bhpratt commented May 24, 2024

I can't re-create this error locally. I've run this 6-7 times w/no errors. Which makes me think there's some pipeline configuration that I'm not accounting for.

One thing I can think of - Sysdig containers sometimes relies on connectivity to sysdig.com (if it needs to pull down a pre-compiled kernel modules). LogDNA doesn't have this requirement. So far the pipeline is only failing on the Sysdig deployment - and only on the classic cluster test. Is there possibly any classic infra firewall or security groups in this pipeline account that could be impacting public connectivity for the classic clusters?

It appears to be timing out on the deploy of the sysdig agent:

TestRunBasicAgentsClassic 2024-05-23T16:12:34Z logger.go:66: module.observability_agents.helm_release.cloud_monitoring_agent[0]: Still creating... [20m10s elapsed]
...
Error: context deadline exceeded

I could try switching data centers from syd01 to something else, as well - but that would just be trying some shots in the dark.

I think I'm going to need some assistance to figure out this test failure.

@ocofaigh
Copy link
Contributor

/run pipeline

@ocofaigh
Copy link
Contributor

ocofaigh commented Jul 1, 2024

sysdig pods readiness failing with:
2024-07-01 16:19:13.396, 31465.51275, Error, endpoint:cm_collector_endpoint:775: connect():timeout: Timeout

Something seems to be blocking the connection to host:

# curl -vk https://ingest.private.br-sao.monitoring.cloud.ibm.com/
* Rebuilt URL to: https://ingest.private.br-sao.monitoring.cloud.ibm.com/
*   Trying 166.9.84.19...
* TCP_NODELAY set
* connect to 166.9.84.19 port 443 failed: Connection timed out
*   Trying 166.9.82.19...

Debugging..

@ocofaigh
Copy link
Contributor

ocofaigh commented Jul 1, 2024

/run pipeline

@ocofaigh
Copy link
Contributor

ocofaigh commented Jul 1, 2024

Had to set this to On in the account for classic cluster to be able to use private endpoint
image

@ocofaigh ocofaigh merged commit 67c47fb into terraform-ibm-modules:main Jul 1, 2024
@terraform-ibm-modules-ops
Copy link
Contributor

🎉 This issue has been resolved in version 1.27.0 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants