
querier: Add option for different gRPC client Loadbalancing options for StoreAPI #1083

Open
bwplotka opened this issue Apr 25, 2019 · 24 comments

Comments

@bwplotka
Member

Currently, if someone passes a domain name to --store or via file SD without DNS lookups, the target is passed to gRPC DialContext, which load balances across the addresses when the A lookup returns more than one. However, the default policy seems to be pick_first rather than round_robin, which might be confusing in some cases, e.g. when trying to smooth the traffic across multiple replicas of store gateways.

AC:

  • Add an option to --store and file SD to configure the gRPC LB type (see the sketch below)
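
For illustration only, here is a minimal sketch (the target address is a placeholder, and this is not the actual Thanos flag wiring) of how grpc-go can be asked to use its built-in round_robin policy over the addresses returned by the dns resolver, instead of the default pick_first:

	package main

	import (
		"log"

		"google.golang.org/grpc"
		"google.golang.org/grpc/credentials/insecure"
	)

	func main() {
		// "dns:///" makes grpc-go resolve all A records for the target; the
		// default service config selects the built-in round_robin policy
		// instead of the default pick_first.
		conn, err := grpc.Dial(
			"dns:///store-gateway.example.com:10901", // placeholder StoreAPI target
			grpc.WithTransportCredentials(insecure.NewCredentials()),
			grpc.WithDefaultServiceConfig(`{"loadBalancingConfig": [{"round_robin":{}}]}`),
		)
		if err != nil {
			log.Fatal(err)
		}
		defer conn.Close()
	}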
@bwplotka
Member Author

cc @devnev @mjd95

@povilasv
Member

povilasv commented May 3, 2019

IMO we shouldn't really do this. I had a chat with the gRPC folks about the 30 min DNS refresh interval hardcoded into the grpc-go package, and they said PRs changing that aren't welcome. It also doesn't make sense for them to implement all of the LB strategies in all the languages. ATM it's easier to run Envoy, nginx, etc. to do this. They are also working on a gRPC LB server for the future.

@povilasv
Member

povilasv commented May 3, 2019

This is the snippet from our chat:

Povilas Versockas @povilasv Mar 12 12:56
Hey guys, I've been looking at gRPC load balancing on Kubernetes and I would like to go with the grpc-go client looking up DNS records and doing round robin load balancing. The only thing that worries me is that right now the default DNS refresh interval is 30 mins (https://github.com/grpc/grpc-go/blob/master/resolver/dns/dns_resolver.go#L47) and there is no way to change it. Would you guys be open to a PR which makes this configurable? I was thinking of adding a WithRefreshInterval(time.Duration) DialOption. Or do you generally recommend proxy solutions (like Envoy) for doing load balancing in K8s?

Eric Anderson @ejona86 Mar 12 17:20
@povilasv, please no. Even the current interval would like to be removed. Instead, use KeepaliveParams with MaxConnectionAge. When the client receives disconnects, it will refresh DNS.

Povilas Versockas @povilasv Mar 12 19:20
@ejona86 but what about the case where I scale up my replicas? in this case clients will never know about new servers for 30 mins?
IMO in ideal scenario grpc LB pool should be refreshed based on DNS record's TTL, no?

Eric Anderson @ejona86 Mar 12 20:05
@povilasv, you set the MaxConnectionAge as the longest you are willing to wait for the clients to rebalance. It works for scaling up.
TTL causes trouble because 1) we can't get it many times and 2) it causes stampeding herds when caching DNS servers are in play.
Note that most existing HTTP-based systems do not base anything off TTL. Instead, they just create new connections frequently.

Edgar Romero @alfasapy Mar 12 20:10
@ejona86 what are the drawbacks of a very short maxConnectionAge? Assuming that the maxConnectionAgeGrace is high enough to prevent in-flight request from being cancelled.

Eric Anderson @ejona86 Mar 12 20:12
It just creates more connections. At O(seconds) and definitely O(minutes) the connection cost is already amortized. There will be "hiccups" of extra latency when creating new connections.

Povilas Versockas @povilasv Mar 12 21:23
ok makes sense, thank you :)
I feel like this should be documented better, is there a place I could do that?

Eric Anderson @ejona86 Mar 12 22:17
It's not clear to me where that place would be.

Povilas Versockas @povilasv Mar 12 22:22
ok, thanks for the help, I really appreciate it

@povilasv
Member

povilasv commented May 3, 2019

IMO pick_first is a pretty good default LB: Grafana -> Thanos Query -> Thanos Store becomes sticky per user, meaning that if Grafana refreshes and re-sends the query, it will already be in the cache.

@bwplotka
Member Author

bwplotka commented May 9, 2019

I don't think it is good, as it does not evenly balance the load. At least we might want to add an option to change this for a first iteration.

I wonder if some more customized LB makes sense then? One that is aware of client context?

@povilasv
Member

The only proper way to evenly balance the load and avoid upscale issues is to run sidecar proxies.

The only other option is to add maxConnectionAge to the Thanos Store, Sidecar and Rule gRPC servers, which is super hacky and I wouldn't like that.

AFAIK the gRPC lookaside load balancer, where the gRPC client connects to a lookaside server, gets its routes from there and keeps refreshing them, is still in development:

https://grpc.io/blog/loadbalancing/

So we don't really have any good options currently, apart from asking users to use sidecar proxies.

@povilasv
Member

Hmm, maybe it's not that bad; we could do something like this on all of our gRPC servers:

	// Assumes the go-grpc-middleware, go-grpc-prometheus, logrus, and
	// google.golang.org/grpc/keepalive packages are imported.
	grpcServer := grpc.NewServer(
		grpc.UnaryInterceptor(grpc_middleware.ChainUnaryServer(
			grpc_prometheus.UnaryServerInterceptor,
			grpc_recovery.UnaryServerInterceptor(),
			grpc_logrus.UnaryServerInterceptor(log.NewEntry(log.StandardLogger())),
		)),
		// MaxConnectionAge forces clients to reconnect periodically, which makes
		// them re-resolve DNS and pick up new replicas.
		grpc.KeepaliveParams(keepalive.ServerParameters{
			MaxConnectionIdle:     10 * time.Minute,
			MaxConnectionAge:      5 * time.Minute,
			MaxConnectionAgeGrace: 1 * time.Minute,
			Time:                  2 * time.Minute,
		}),
	)

@bwplotka
Member Author

So I don't really understand the problem you are mainly trying to explain here. I don't think any DNS refresh is needed. As the answer you got says, it is all about broken connections: the gRPC client does a DNS lookup and chooses another host. Why is that problematic?

However, the actual issue here is not about that. It's about the load balancing method we use. In fact we already use one (you have to use something), which is pick_first. I vote we should have round_robin available, and maybe some sticky round robin as well that is aware of the client. I am pretty sure there are existing Golang implementations of this in the wild (:
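
As a rough, hypothetical sketch of what exposing such a choice could look like (the helper name and the accepted values are made up for illustration; only the built-in grpc-go policies are used), a flag value or file SD field could be mapped to a dial option like this:

	package store

	import (
		"fmt"

		"google.golang.org/grpc"
	)

	// lbDialOption is a hypothetical helper: it turns a policy name, e.g. taken
	// from a --store flag or a file SD entry, into a gRPC default service config
	// selecting one of the built-in grpc-go load balancing policies.
	func lbDialOption(policy string) (grpc.DialOption, error) {
		switch policy {
		case "pick_first", "round_robin":
			return grpc.WithDefaultServiceConfig(
				fmt.Sprintf(`{"loadBalancingConfig": [{"%s":{}}]}`, policy)), nil
		default:
			return nil, fmt.Errorf("unsupported gRPC load balancing policy %q", policy)
		}
	}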

@bwplotka
Member Author

OK, so I talked with @povilasv offline, and the issue with RR is:

on upscale, if you do round robin, the gRPC client won't see the new node

(https://kubernetes.io/blog/2018/11/07/grpc-load-balancing-on-kubernetes-without-tears/)

Maybe that's right. I would then suggest using something like this:
https://github.com/improbable-eng/kedge/blob/9a6745fb10e47ef4da101bb8e1cafd329ca397a7/pkg/kedge/http/lbtransport/policy.go#L60
It's already battle tested on our prod.

@povilasv
Member

@bwplotka Maybe you guys could separate it out into a different Go package?

Importing kedge doesn't seem reasonable 🗡️
Apart from that 👍

@stale

stale bot commented Jan 11, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Jan 11, 2020
@stale stale bot closed this as completed Jan 18, 2020
@bwplotka
Member Author

We can indeed. Let's get back to this. (:

@bwplotka bwplotka reopened this Jul 22, 2020
@stale stale bot removed the stale label Jul 22, 2020
@stale

stale bot commented Aug 21, 2020

Hello 👋 Looks like there was no activity on this issue for the last 30 days.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there is no activity in the next week, this issue will be closed (we can always reopen an issue if we need to!). Alternatively, use the remind command if you wish to be reminded at some point in the future.

@stale stale bot added the stale label Aug 21, 2020
@stale

stale bot commented Aug 28, 2020

Closing for now as promised, let us know if you need this to be reopened! 🤗

@stale stale bot closed this as completed Aug 28, 2020
@kakkoyun kakkoyun removed the stale label Aug 31, 2020
@kakkoyun kakkoyun reopened this Aug 31, 2020
@stale

stale bot commented Oct 30, 2020

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there is no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need to!). Alternatively, use the remind command if you wish to be reminded at some point in the future.

@stale stale bot added the stale label Oct 30, 2020
@kakkoyun kakkoyun removed the stale label Oct 31, 2020
@bwplotka
Member Author

bwplotka commented Nov 9, 2020

Very needed, potentially blocking #3373

It was also mentioned again here: https://cloud-native.slack.com/archives/CK5RSSC10/p1604923594031700

@stale

stale bot commented Jan 8, 2021

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there is no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need to!). Alternatively, use the remind command if you wish to be reminded at some point in the future.

@stale stale bot added the stale label Jan 8, 2021
@kakkoyun
Member

Still valid.

@stale stale bot removed the stale label Jan 12, 2021
@stale

stale bot commented Mar 16, 2021

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there is no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need to!). Alternatively, use the remind command if you wish to be reminded at some point in the future.

@stale stale bot added the stale label Mar 16, 2021
@onprem onprem removed the stale label Apr 1, 2021
@stale

stale bot commented Jun 2, 2021

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there is no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need to!). Alternatively, use the remind command if you wish to be reminded at some point in the future.

@stale stale bot added the stale label Jun 2, 2021
@stale

stale bot commented Jun 16, 2021

Closing for now as promised, let us know if you need this to be reopened! 🤗

@stale stale bot closed this as completed Jun 16, 2021
@aymericDD
Contributor

I think it is still interesting.

@stale

stale bot commented Nov 13, 2022

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there is no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need to!). Alternatively, use the remind command if you wish to be reminded at some point in the future.

@yeya24
Contributor

yeya24 commented Mar 29, 2024

Is this still relevant now that we have endpoint groups?
