Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: use thanos default port in service and containerPort #414

Merged
merged 4 commits into from Mar 12, 2024

Conversation

jan--f
Copy link
Collaborator

@jan--f jan--f commented Feb 8, 2024

Before thanos was only listening on localhost. Relying on the default setting allows user to modify that via server side apply and add a proxy chain.

Fixes: https://issues.redhat.com/browse/COO-14

@jan--f jan--f requested a review from a team as a code owner February 8, 2024 11:31
@jan--f jan--f requested review from danielmellado and sthaha and removed request for a team February 8, 2024 11:31
@openshift-ci openshift-ci bot added the approved label Feb 8, 2024
@danielmellado
Copy link
Contributor

/lgtm any reason on why it was only listening on localhost before? I assume it was proxied but it'd be good to clarify this on the PR message.

@jan--f
Copy link
Collaborator Author

jan--f commented Feb 8, 2024

No reason, just a mistake :)

@danielmellado
Copy link
Contributor

hmm it does seem that tests still are targeting the 9090 port E0208 11:47:09.149727 35900 portforward.go:409] an error occurred forwarding 9090 -> 9090: error forwarding port 9090 to pod 4d6e4b0f907b07e5f24151733d584f37bb83a6b9bc2483e568a8eaae49b9eb99, uid : failed to execute portforward in network namespace" so this would probably have to be changed as well.

@jan--f
Copy link
Collaborator Author

jan--f commented Feb 8, 2024

Ok at least my local test failure reproduces here too.

E0208 11:47:09.149727   35900 portforward.go:409] an error occurred forwarding 9090 -> 9090: error forwarding port 9090 to pod 4d6e4b0f907b07e5f24151733d584f37bb83a6b9bc2483e568a8eaae49b9eb99, uid : failed to execute portforward in network namespace "/var/run/netns/cni-d3ec7578-2ca0-8f70-59dd-2a9a054707e9": failed to connect to localhost:9090 inside namespace "4d6e4b0f907b07e5f24151733d584f37bb83a6b9bc2483e568a8eaae49b9eb99", IPv4: dial tcp4 127.0.0.1:9090: connect: connection refused IPv6 dial tcp6 [::1]:9090: connect: connection refused 
panic: lost connection to pod

Not sure what's happening here yet.

@jan--f
Copy link
Collaborator Author

jan--f commented Feb 8, 2024

hmm it does seem that tests still are targeting the 9090 port E0208 11:47:09.149727 35900 portforward.go:409] an error occurred forwarding 9090 -> 9090: error forwarding port 9090 to pod 4d6e4b0f907b07e5f24151733d584f37bb83a6b9bc2483e568a8eaae49b9eb99, uid : failed to execute portforward in network namespace" so this would probably have to be changed as well.

I don't think so. This error happens in TestMonitoringStackController/Verify_multi-namespace_support and the port forward is to prometheus. Here we change the thanos service port.

@jan--f
Copy link
Collaborator Author

jan--f commented Feb 8, 2024

Or am I just not smart enough to read the output correctly 🤔

@jan--f
Copy link
Collaborator Author

jan--f commented Feb 8, 2024

Or am I just not smart enough to read the output correctly 🤔

Sure looks that way. Testing locally now since here it was failing in a prometheus test.

@jan--f
Copy link
Collaborator Author

jan--f commented Feb 8, 2024

Ok this is the error I'm seeing locally too:

=== RUN   TestMonitoringStackController/Verify_multi-namespace_support
E0208 14:22:19.405068   28162 portforward.go:409] an error occurred forwarding 9090 -> 9090: error forwarding port 9090 to pod b29bce2e1202569aeef6862fdfe16660317d86f50ea82774ee9b1177a236ec14, uid : failed to execute portforward in network namespace "/var/run/netns/cni-a601776d-003d-7bbe-fde4-a5b9daa71b03": failed to connect to localhost:9090 inside namespace "b29bce2e1202569aeef6862fdfe16660317d86f50ea82774ee9b1177a236ec14", IPv4: dial tcp4 127.0.0.1:9090: connect: connection refused IPv6 dial tcp6 [::1]:9090: connect: connection refused 
panic: lost connection to pod

Port 9090 is correct in this context and this code didn't change.

@danielmellado
Copy link
Contributor

hmm it does seem that tests still are targeting the 9090 port E0208 11:47:09.149727 35900 portforward.go:409] an error occurred forwarding 9090 -> 9090: error forwarding port 9090 to pod 4d6e4b0f907b07e5f24151733d584f37bb83a6b9bc2483e568a8eaae49b9eb99, uid : failed to execute portforward in network namespace" so this would probably have to be changed as well.

I don't think so. This error happens in TestMonitoringStackController/Verify_multi-namespace_support and the port forward is to prometheus. Here we change the thanos service port.

Yep, I was a bit too quick into this as the port was just matching the one you changed over there, but after checking the tests code this should be unrelated.

@danielmellado
Copy link
Contributor

@sthaha mind taking a look at this? thanks!

@jan--f jan--f force-pushed the fix-thanos-query-service branch 5 times, most recently from d5ffd5a to bae78cf Compare February 21, 2024 12:44
@jan--f
Copy link
Collaborator Author

jan--f commented Feb 21, 2024

I have disabled the test for evicting alertmanager pods under the PDB for now. I'll keep investigating this, but I also wonder whether this is a test we need to run. I'd argue we test k8s api here and that is out of scope for this operator.

@jan--f
Copy link
Collaborator Author

jan--f commented Feb 21, 2024

/retest

@lihongyan1
Copy link
Contributor

/retest

1 similar comment
@jan--f
Copy link
Collaborator Author

jan--f commented Feb 26, 2024

/retest

@jan--f
Copy link
Collaborator Author

jan--f commented Feb 27, 2024

Ok this seems to be fairly stable now. A quick outline about the commits included here:

  • e4d9b94 is the actual fix I was after. Use the thanos default port and alter the Service accordingly. This allows users to change the port via SSA too.
  • 2e0cfac contains a few test refactors that shouldn't be too contentious. Mostly replacing wait.Poll calls with wait.PollUntilContextTimeout, waiting a bit longer here and there, as well as some test helpers that are make tests more resilient. E.g. AssertResourceAbsent can be more appropriate then AssertResourceNeverExists. assertPDBExpectedPodsAreHealthy is a small gate that waits for a PDB resrouce to be present and ready to go.
  • 358f089 is a small change to ./test/run-e2e.sh to accept a test regex that gets pass to go test -run so we can limit the tests that a run.
  • a41a42b updates kind to 0.20.0 in the github action

Before thanos was only listening on localhost. Relying on the default
setting allows user to modify that via server side apply and add a proxy
chain.

Signed-off-by: Jan Fajerski <jfajersk@redhat.com>
Mostly this is replacing deprecated functions and attempting to reduce
timing dependent behavior by waiting for expected resources to show up.

Signed-off-by: Jan Fajerski <jfajersk@redhat.com>
Also add the -count 1 flag (so results are never cached) and drop the
--retain flag (not sure what that ever did).

Signed-off-by: Jan Fajerski <jfajersk@redhat.com>
Signed-off-by: Jan Fajerski <jfajersk@redhat.com>
@lihongyan1
Copy link
Contributor

/retest

2 similar comments
@lihongyan1
Copy link
Contributor

/retest

@lihongyan1
Copy link
Contributor

/retest

Copy link

openshift-ci bot commented Mar 8, 2024

@jan--f: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/observability-operator-e2e 988ca25 link true /test observability-operator-e2e

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@@ -154,7 +155,7 @@ run_e2e() {
watch_obo_errors "$obo_error_log" &

local ret=0
go test -v -failfast -timeout $TEST_TIMEOUT ./test/e2e/... --retain=true | tee "$LOGS_DIR/e2e.log" || ret=1
go test -v -failfast -timeout $TEST_TIMEOUT ./test/e2e/... -run "$RUN_REGEX" -count 1 | tee "$LOGS_DIR/e2e.log" || ret=1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

neat on the addition on the regex!

@danielmellado
Copy link
Contributor

/lgtm

@danielmellado
Copy link
Contributor

/approve

Copy link

openshift-ci bot commented Mar 12, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: danielmellado, jan--f

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [danielmellado,jan--f]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jan--f jan--f merged commit 2d6c82b into rhobs:main Mar 12, 2024
9 of 11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants