Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

query: add endpointset flow #4421

Merged
merged 11 commits into from Aug 31, 2021
Merged

query: add endpointset flow #4421

merged 11 commits into from Aug 31, 2021

Conversation

hitanshu-mehta
Copy link
Contributor

@hitanshu-mehta hitanshu-mehta commented Jul 6, 2021

  • I added CHANGELOG entry for this change.
  • Change is not relevant to the end user.

Changes

Verification

  • Added unit tests. ( Again similar to unit tests of storeset but made more generic and covered some additional scenarios relevant to endpoint )

pkg/query/endpointset.go Outdated Show resolved Hide resolved
@bill3tt
Copy link
Contributor

bill3tt commented Jul 13, 2021

@hitanshu-mehta just an FYI your first link is broken - it should be https://thanos.io/tip/proposals-accepted/202101-endpoint-discovery.md/

@hitanshu-mehta
Copy link
Contributor Author

@hitanshu-mehta just an FYI your first link is broken - it should be https://thanos.io/tip/proposals-accepted/202101-endpoint-discovery.md/

Changed it. Thank you!

Copy link
Contributor

@bill3tt bill3tt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Today has been a day of really big PRs 😅

I'm afraid that I'm not currently familiar enough with storeset.go to be able to review this PR will any level of detail or confidence.

If there is a lot of code in common, and we plan on deprecating --store and --rule eventually, I'm wondering whether we can make storeset a special case of endpointset or vice versa?

pkg/query/endpointset.go Outdated Show resolved Hide resolved
Copy link
Member

@kakkoyun kakkoyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have skimmed it. It looks good. I need to spare more time to review the details.

pkg/query/endpointset.go Outdated Show resolved Hide resolved
pkg/query/endpointset.go Show resolved Hide resolved
Copy link
Member

@kakkoyun kakkoyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hitanshu-mehta This looks good. But I failed to find where do we actually use the new EndpointSet? Will there be a follow-up PR? If so we should mention this. Even better let's utilize it in this PR.

@hitanshu-mehta
Copy link
Contributor Author

hitanshu-mehta commented Aug 12, 2021

Earlier I opened #4282 PR, where a new endpoint flow was baked into storeset, which made it difficult to understand and remove old code. here is my discussion regarding this with @bwplotka.

Yes, there will be follow-up PR or I will rebase and update my old PR. My plan is to follow this deprecation plan. Hence, I am doing changes according to that plan. Please let me know if I can improve something :)

bwplotka
bwplotka previously approved these changes Aug 17, 2021
Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I love it, LGTM, some suggestions only.

BUT! I would remove storeset and start using new endpointset (with old flags purely, don't change that yet) in this PR.

Reasons:

  • We can verify if endpointset we work on on now works as designed
  • We don't commit code we don't use
  • Reduce migration duration
    WDYT? @hitanshu-mehta ? Up for challenge? I am happy to work with you and do reviews everyday so this can land for Thanos v0.22.0 which we plan to cut on Monday.

pkg/query/endpointset.go Show resolved Hide resolved
}

// StrictStatic returns true if the endpoint has been statically defined and it is under a strict mode.
func (es *grpcEndpointSpec) StrictStatic() bool {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
func (es *grpcEndpointSpec) StrictStatic() bool {
func (es *grpcEndpointSpec) IsStrictStatic() bool {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we sure we are not creating something for tests? I think I did that for storeset. Otherwise we can remove yea

Yes. We only have one implementation which is also used in tests.

pkg/query/endpointset.go Outdated Show resolved Hide resolved

// endpointSetNodeCollector is a metric collector reporting the number of available storeAPIs for Querier.
// A Collector is required as we want atomic updates for all 'thanos_store_nodes_grpc_connections' series.
// TODO(hitanshu-mehta) Currently,only collecting metrics of storeAPI. Make this struct generic.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// TODO(hitanshu-mehta) Currently,only collecting metrics of storeAPI. Make this struct generic.
// TODO(hitanshu-mehta) Currently, only collecting metrics of storeAPI. Make this struct generic.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why we cannot make this struct genereric? What about using storeSet node collector then instead of creating new code that has to be refactored in new PR

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR was already very big, so I thought it would be better if I do this in a separate PR.

Is it fine if I made this struct generic in this PR?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Up to you, as long as there are not duplicated storeset vs endpointset, all good

pkg/query/endpointset.go Show resolved Hide resolved
pkg/query/endpointset.go Outdated Show resolved Hide resolved
pkg/query/endpointset.go Outdated Show resolved Hide resolved
pkg/query/endpointset.go Outdated Show resolved Hide resolved
}

if er.HasMetricMetadataAPI() {
level.Info(e.logger).Log("msg", "adding new MetricMetadataAPI to query endpointset", "address", addr)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmhmh, we are not adding anything here ;p just logging I think it might confuse people if some component has all 5 APIs implemented. Users will think we added 5 components but we added one. Maybe more code, but I combine all into one message: added component with A, B ,C APIs.

WDYT? 🤗

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, sounds good :)

pkg/query/endpointset.go Outdated Show resolved Hide resolved
@hitanshu-mehta
Copy link
Contributor Author

@bwplotka Agree with the reasons. I will do the required changes :)

WDYT? @hitanshu-mehta ? Up for challenge? I am happy to work with you and do reviews everyday so this can land for Thanos v0.22.0 which we plan to cut on Monday.

Up for the challenge!! Thank you the help and lets merge it by Monday 🚀

Signed-off-by: Hitanshu Mehta <hitanshu99amehta@gmail.com>
Signed-off-by: Hitanshu Mehta <hitanshu99amehta@gmail.com>
Signed-off-by: Hitanshu Mehta <hitanshu99amehta@gmail.com>
Signed-off-by: Hitanshu Mehta <hitanshu99amehta@gmail.com>
Signed-off-by: Hitanshu Mehta <hitanshu99amehta@gmail.com>
return nil, err
}

srv := grpc.NewServer()
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

opt.semgrep.go.grpc.security.grpc-server-insecure-connection.grpc-server-insecure-connection: Found an insecure gRPC connection. This allows for a connection without encryption to this server. A malicious attacker could tamper with the gRPC message, which could compromise the machine.
(at-me in a reply with help or ignore)

Signed-off-by: Hitanshu Mehta <hitanshu99amehta@gmail.com>
Signed-off-by: Hitanshu Mehta <hitanshu99amehta@gmail.com>
bwplotka
bwplotka previously approved these changes Aug 25, 2021
Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Super nice work. I have small comments only, otherwise LGTM. Let's go before release 💪🏽

func (es *grpcEndpointSpec) Metadata(ctx context.Context, client *endpointClients) (*endpointMetadata, error) {
resp, err := client.info.Info(ctx, &infopb.InfoRequest{}, grpc.WaitForReady(true))
if err != nil {
// Call Info method of StoreAPI, this way querier will be able to discovery old components not exposing InfoAPI.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍🏽

Comment on lines 82 to 86
metadata, err := es.getMetadataUsingStoreAPI(ctx, client.store)
if err != nil {
return nil, errors.Wrapf(err, "fetching info from %s", es.addr)
}
return metadata, nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
metadata, err := es.getMetadataUsingStoreAPI(ctx, client.store)
if err != nil {
return nil, errors.Wrapf(err, "fetching info from %s", es.addr)
}
return metadata, nil
metadata, merr := es.getMetadataUsingStoreAPI(ctx, client.store)
if merr != nil {
return nil, errors.Wrapf(merr, "fallback fetching info from %s after err: %v", es.addr, err)
}
return metadata, nil


// endpointSetNodeCollector is a metric collector reporting the number of available storeAPIs for Querier.
// A Collector is required as we want atomic updates for all 'thanos_store_nodes_grpc_connections' series.
// TODO(hitanshu-mehta) Currently,only collecting metrics of storeAPI. Make this struct generic.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Up to you, as long as there are not duplicated storeset vs endpointset, all good

type EndpointSet struct {
logger log.Logger

// Endpoint specifications can change dynamically. If some store is missing from the list, we assuming it is no longer
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Endpoint specifications can change dynamically. If some store is missing from the list, we assuming it is no longer
// Endpoint specifications can change dynamically. If some component is missing from the list, we assume it is no longer

logger log.Logger

// Endpoint specifications can change dynamically. If some store is missing from the list, we assuming it is no longer
// accessible and we close gRPC client for it.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// accessible and we close gRPC client for it.
// accessible and we close gRPC client for it, unless it is strict.

gRPCInfoCallTimeout: 5 * time.Second,
endpoints: make(map[string]*endpointRef),
endpointStatuses: make(map[string]*EndpointStatus),
unhealthyEndpointTimeout: unhealthyStoreTimeout,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
unhealthyEndpointTimeout: unhealthyStoreTimeout,
unhealthyEndpointTimeout: unhealthyEndpointTimeout,

reg *prometheus.Registry,
endpointSpecs func() []EndpointSpec,
dialOpts []grpc.DialOption,
unhealthyStoreTimeout time.Duration,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
unhealthyStoreTimeout time.Duration,
unhealthyEndpointTimeout time.Duration,

}
}

// TODO(bwplotka): Consider moving storeRef out of this package and renaming it, as it also supports rules API.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can remove it now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure

clients.store = storepb.NewStoreClient(er.cc)
er.StoreClient = clients.store
} else {
er.clients.store = nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why we need this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because here when we see the endpoint for the first time we assume the StoreAPI is exposed by that endpoint (which may not be true for some ruler) and we create a store API client because as a fallback we might have to call info method of storeAPI.
In this step, I am setting it to null if we find out that the store API is not exposed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right, some comment would be nice here then

}
return endpointSpec
},
expectedStores: 4, // sidecar + querier + receiver + storeGW
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work on those 👍🏽

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you :)

Signed-off-by: Hitanshu Mehta <hitanshu99amehta@gmail.com>
bwplotka
bwplotka previously approved these changes Aug 25, 2021
Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍🏽

clients.store = storepb.NewStoreClient(er.cc)
er.StoreClient = clients.store
} else {
er.clients.store = nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right, some comment would be nice here then

@bwplotka bwplotka enabled auto-merge (squash) August 25, 2021 19:24
@bwplotka
Copy link
Member

e2e fail might be relevant:
image

Rescheduled, let's see if it repeats.

Signed-off-by: Hitanshu Mehta <hitanshu99amehta@gmail.com>
auto-merge was automatically disabled August 26, 2021 12:07

Head branch was pushed to by a user without write access

@hitanshu-mehta
Copy link
Contributor Author

@bwplotka Yes, there was a bug. I have fixed it. Now, e2e tests have passed but units and documentation checks have failed. I tried, but was not able to find the reason :( Can you please help?

GiedriusS
GiedriusS previously approved these changes Aug 31, 2021
Copy link
Member

@GiedriusS GiedriusS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With this PR, I get a bunch of "Duplicate store address is provided" spam:

Rgp 31 13:08:57 XXX thanos_endpoint[166897]: level=info ts=2021-08-31T13:08:57.843619303Z caller=endpointset.go:360 component=endpointset msg="adding new sidecar with [storeAPI rulesAPI exemplarsAPI targetsAPI MetricMetadataAPI]" address=1.2.3.4:10901 extLset="{dc=\"aaaa\", www=\"1\"}"

...
Rgp 31 13:09:02 XXX thanos_endpoint[166897]: level=warn ts=2021-08-31T13:09:02.833746591Z caller=query.go:637 msg="Duplicate store address is provided" addr=1.2.3.4:10901     

I have both --store and --rule pointing to the same IP addresses (Sidecar nodes). Removing --rule fixes this problem. Presumably, this happens because the same endpoint spec is added for all flags. Maybe we can simply improve the logging so that a message would be printed only if duplicate nodes have been specified with the same flags? The deduplication should still happen at the end, though, but without any logging.

Perhaps we could fix this before merging @hitanshu-mehta ?

Signed-off-by: Hitanshu Mehta <hitanshu99amehta@gmail.com>
GiedriusS
GiedriusS previously approved these changes Aug 31, 2021
Copy link
Member

@GiedriusS GiedriusS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, great work 👍

Signed-off-by: Hitanshu Mehta <hitanshu99amehta@gmail.com>
@GiedriusS GiedriusS enabled auto-merge (squash) August 31, 2021 15:25
@bwplotka bwplotka disabled auto-merge August 31, 2021 16:31
@bwplotka bwplotka merged commit 8862ad5 into thanos-io:main Aug 31, 2021
@bwplotka
Copy link
Member

🎉

@hanjm
Copy link
Member

hanjm commented Sep 9, 2021

🎉 hi, could it support file sd?

someshkoli pushed a commit to someshkoli/thanos that referenced this pull request Nov 7, 2021
* Create endpoint flow

Signed-off-by: Hitanshu Mehta <hitanshu99amehta@gmail.com>

* add unit test for endpointSet

Signed-off-by: Hitanshu Mehta <hitanshu99amehta@gmail.com>

* lint fixes

Signed-off-by: Hitanshu Mehta <hitanshu99amehta@gmail.com>

* fix typo

Signed-off-by: Hitanshu Mehta <hitanshu99amehta@gmail.com>

* remove code smells

Signed-off-by: Hitanshu Mehta <hitanshu99amehta@gmail.com>

* start using endpointset instead of storeset

Signed-off-by: Hitanshu Mehta <hitanshu99amehta@gmail.com>

* remove storeset

Signed-off-by: Hitanshu Mehta <hitanshu99amehta@gmail.com>

* minor nits

Signed-off-by: Hitanshu Mehta <hitanshu99amehta@gmail.com>

* Fix failing e2e tests

Signed-off-by: Hitanshu Mehta <hitanshu99amehta@gmail.com>

* improve logging

Signed-off-by: Hitanshu Mehta <hitanshu99amehta@gmail.com>

* fix comment

Signed-off-by: Hitanshu Mehta <hitanshu99amehta@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants