Skip to content

querier: per-endpoint configuration #4389

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 28 commits into from

Conversation

Namanl2001
Copy link
Contributor

@Namanl2001 Namanl2001 commented Jun 30, 2021

The Thanos Querier component supports basic mTLS configuration for internal gRPC communication. This works great for basic use cases but it still requires extra forward proxies to make it work for bigger deployments.

This PR add support for per-endpoint TLS configuration in Thanos Query Component for internal gRPC communication. Here we are introducing a new CLI option --endpoints.config, which will accept the content/path to a yaml file which contains a list as follows:

- name: ""
  tls_config:
    cert_file: ""
    key_file: ""
    ca_file: ""
    server_name: ""
  endpoints: []
  endpoints_sd_files:
    - files: []
  mode: ""

Changes

Based on proposal #4377

  • Added new --endpoint.config cli option
  • Added new file pkg/store/config.go to load config from cli options (in list form)
  • Few changes in query.go to iterate over the list items

Verification

Added e2e tests for --endpoints.config with mutual TLS.

Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feedback from our 1:1, great job!

Let's make sure to test it too. Is the plan to NOT to dynamic reload in this PR? (that's great idea BTW)

Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work!

As mentioned in our 1:1 wonder if we can help @hitanshu-mehta and help ourselves to do this work on #4421 and review his PR 🤔

@Namanl2001 Namanl2001 requested a review from bwplotka July 13, 2021 19:31
@Namanl2001 Namanl2001 changed the title wip-querier: per-endpoint configuration querier: per-endpoint configuration Jul 14, 2021
@Namanl2001
Copy link
Contributor Author

Namanl2001 commented Jul 25, 2021

I am stuck with testing mTLS in querier.

  • Without mTLS all e2e-tests are passing in this branch (including new tests written for per-endpoint config feature).

  • I've also tried enabling one way TLS (server side only OR client side only) but got the same error.

  • I've also tested mTLS on the main branch with same TLS credentials but tests are failing with the same err log as on this branch.

  • I've created certs using this script.

  • Error log is also not much helpful in this case. 😞

Edit: Also tried creating certificates with rsa.certiicate but no luck.

@bwplotka
Copy link
Member

14:00:33 Killing querier-1
14:00:33 querier-1: level=warn name=querier-1 ts=2021-07-28T14:00:33.424080098Z caller=storeset.go:574 component=storeset msg="update of store node failed" err="getting metadata: fetching store info from e2e_test_query_config-sidecar-alone:9091: rpc error: code = DeadlineExceeded desc = latest balancer error: connection error: desc = \"transport: authentication handshake failed: tls: first record does not look like a TLS handshake\"" address=e2e_test_query_config-sidecar-alone:9091
14:00:33 Killing sidecar-ha2
14:00:34 Killing prometheus-ha2
14:00:34 Killing sidecar-ha1
14:00:34 Killing prometheus-ha1
14:00:34 Killing sidecar-remote-and-sidecar
14:00:35 Killing prometheus-remote-and-sidecar
14:00:35 Killing sidecar-alone
=== CONT  TestQueryWithEndpointConfig
Error:     query_test.go:283: query_test.go:283:
        
         unexpected error: unable to find metrics [thanos_store_nodes_grpc_connections] with expected values. Last error: <nil>. Last values: [4]

@Namanl2001
Copy link
Contributor Author

Namanl2001 commented Jul 30, 2021

Finally I was able to test mutual TLS in querier. Earlier I was configuring tls, only in querier for both client and server. Which was wrong as in our case sidecar is the server and after configuring the same in sidecar it worked 😄.

This PR needs another round of review.

PS: Failing tests seems unrelated as all tests are passing locally.

Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Looks good, just some nits.

@Namanl2001 Namanl2001 requested a review from bwplotka August 11, 2021 11:00
Copy link
Member

@kakkoyun kakkoyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking great! Only a few wrinkles to iron out!

bwplotka
bwplotka previously approved these changes Aug 17, 2021
Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! It looks good to me, except for one part. Added suggestion what we can do to ensure single storeset. Otherwise LGTM! 💪🏽

@@ -422,94 +472,89 @@ func runQuery(
dialOpts,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Am I wrong if I would say that if we somehow pass through spec the dial options per addresses we could have just one storeset right? Do you think it would be worth doing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review.

if we somehow pass through spec the dial options per addresses we could have just one storeset right?

But dialopts are different for each endpoint config (as it depends on config.tlsConfig). So even if we somehow pass it through spec, we would not have one storeset as storeset would be defined inside the loop to have all tlsConfig.

To have only one storeset I thought to take the storeset initialization outside the loop but we can't because it uses dnsProvider which needs to be different for each endpoint config else it gives error.

So we need to have storeset inside the loop (which means more than one storeset) to have different dnsProvider and tlsConfig corresponding to different endpoint sets.

Do you think it would be worth doing?

As discussed with @kakkoyun (in our 1:1) we expect only few configs (like 4 or 5) from the user so having these much of storesets would not affect the performance much. :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But dialopts are different for each endpoint config (as it depends on config.tlsConfig). So even if we somehow pass it through spec, we would not have one storeset as storeset would be defined inside the loop to have all tlsConfig.

Yea, I think we could do that per addresses.

As discussed with @kakkoyun (in our 1:1) we expect only few configs (like 4 or 5) from the user so having these much of storesets would not affect the performance much. :)

I know companies running Thanos against 300 of StoreAPIs, so I would be careful here. But also I am not worried about performance too much here, rather maintainability and debuggability when you have different metrics per 100 of those storesets potentially. And different health loops. Not a biggie, maybe I can try propose a PR to yours if we want to compare different solutions? (:

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can try propose a PR to yours if we want to compare different solutions

That would be really helpful 🤗

grpcProbe,
prober.NewInstrumentation(comp, logger, extprom.WrapRegistererWithPrefix("thanos_", reg)),
var (
// Adding separate for loop for each client func() below because storeSets is being populated in a go-routine and this code executes before it.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, this is not the best. I mentioned Idea how we can ensure single storeset

@Namanl2001
Copy link
Contributor Author

Namanl2001 commented Oct 1, 2021

rebased successfully to main 😵‍💫

Further tasks:

  • move endpointConfig to query/config.go as it's about all APIs not only store
  • let's try to have one function that loads the config (if non empty) and merge flags into one endpoint config - might be cleaner
  • try to have one endpoint set in cmd/thanos/query.go

Signed-off-by: Namanl2001 <namanlakhwani@gmail.com>
Signed-off-by: Namanl2001 <namanlakhwani@gmail.com>
Signed-off-by: Namanl2001 <namanlakhwani@gmail.com>
Signed-off-by: Namanl2001 <namanlakhwani@gmail.com>
Signed-off-by: Namanl2001 <namanlakhwani@gmail.com>
Signed-off-by: Namanl2001 <namanlakhwani@gmail.com>
Signed-off-by: Namanl2001 <namanlakhwani@gmail.com>
Signed-off-by: Namanl2001 <namanlakhwani@gmail.com>
Signed-off-by: Namanl2001 <namanlakhwani@gmail.com>
Signed-off-by: Namanl2001 <namanlakhwani@gmail.com>
Signed-off-by: Namanl2001 <namanlakhwani@gmail.com>
Signed-off-by: Namanl2001 <namanlakhwani@gmail.com>
Signed-off-by: Namanl2001 <namanlakhwani@gmail.com>
Signed-off-by: Namanl2001 <namanlakhwani@gmail.com>
Signed-off-by: Namanl2001 <namanlakhwani@gmail.com>
Signed-off-by: Namanl2001 <namanlakhwani@gmail.com>
Signed-off-by: Namanl2001 <namanlakhwani@gmail.com>
Signed-off-by: Namanl2001 <namanlakhwani@gmail.com>
Signed-off-by: Namanl2001 <namanlakhwani@gmail.com>
Signed-off-by: Namanl2001 <namanlakhwani@gmail.com>
Signed-off-by: Namanl2001 <namanlakhwani@gmail.com>
Signed-off-by: Namanl2001 <namanlakhwani@gmail.com>
Signed-off-by: Namanl2001 <namanlakhwani@gmail.com>
Signed-off-by: Namanl2001 <namanlakhwani@gmail.com>
Signed-off-by: Namanl2001 <namanlakhwani@gmail.com>
Signed-off-by: Namanl2001 <namanlakhwani@gmail.com>
@Namanl2001
Copy link
Contributor Author

This PR works fine (see approval) but there is a suggestion to ensure a single storeset. The work for which would be carried out on another PR.

This was my Google Summer of Code 2021 project. ❤️

Closing

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants