Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mcs: fix watch primary address revision and update cache when meets not leader #6279

Merged
merged 8 commits into from
Apr 17, 2023

Conversation

lhy1024
Copy link
Contributor

@lhy1024 lhy1024 commented Apr 6, 2023

What problem does this PR solve?

Issue Number: Ref #5895.

What is changed and how does it work?

Check List

Tests

  • Unit test

Release note

None.

@ti-chi-bot
Copy link
Member

ti-chi-bot commented Apr 6, 2023

[REVIEW NOTIFICATION]

This pull request has been approved by:

  • binshi-bing
  • rleungx

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

@ti-chi-bot
Copy link
Member

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@ti-chi-bot ti-chi-bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note-none labels Apr 6, 2023
@lhy1024 lhy1024 marked this pull request as ready for review April 6, 2023 15:16
@ti-chi-bot ti-chi-bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 6, 2023
server/server.go Outdated
revision = wresp.CompactRevision
break
}
if wresp.Canceled {
Copy link
Contributor

@binshi-bing binshi-bing Apr 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that, you have forgotten your comment in my pr and didn't check my reply :)

At the same place, you asked "Do we need to solve other errors?", I replied:

I checked the code in etcd watch.go. There is no other errors to solve, but I do think here using wrep.Err() != nil is better than using wresp.Canceled, because the former can accommodate future etcd changes.

Etcd Watch() returns three types of errors -- closed error, compacted error and cancelled error. closed error is captured by range watchChan, compacted error is captured by wresp.CompactRevision != 0, and cancelled error is captured here. Etcd Watch() itself underlying will retry Watch() for all other errors. Using wrep.Err() != nil is better than using wresp.Canceled.

Changed to using wrep.Err() != nil here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, I forgot about it when the copilot finished

server/server.go Outdated
revision = wresp.CompactRevision
break
}
if wresp.Canceled {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change made some improvement, but mightn't be enough. When wresp.Canceled(), it doesn't mean context has been cancelled, so if the primary address loop returns here but the API service is still serving, we'll have improper function. Do we need to try to recreate Watcher and re-watch in a endless loop until context is cancelled, as what I do in KeyspaceGroupManager?

	select {
	case reqc <- wr:
		ok = true
	case <-wr.ctx.Done():
	case <-donec:
		if wgs.closeErr != nil {
			closeCh <- WatchResponse{Canceled: true, closeErr: wgs.closeErr}
			break
		}
		// retry; may have dropped stream from no ctxs
		return w.Watch(ctx, key, opts...)
	}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

server/server.go Outdated
@@ -1726,7 +1726,7 @@ func (s *Server) watchServicePrimaryAddrLoop(serviceName string) {
log.Info("start to watch", zap.String("service-key", serviceKey))

primary := &tsopb.Participant{}
ok, rev, err := etcdutil.GetProtoMsgWithModRev(s.client, serviceKey, primary)
ok, revision, err := etcdutil.GetProtoMsgWithModRev(s.client, serviceKey, primary)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we only need to watch keyspace group 0?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, serviceKey := fmt.Sprintf("/ms/%d/%s/%s/%s", s.clusterID, serviceName, fmt.Sprintf("%05d", 0), "primary")

Copy link
Contributor

@binshi-bing binshi-bing left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM for the change having improved things, though we might still need more improvements.

@ti-chi-bot ti-chi-bot added the status/LGT1 Indicates that a PR has LGTM 1. label Apr 6, 2023
@@ -1737,15 +1737,26 @@ func (s *Server) watchServicePrimaryAddrLoop(serviceName string) {
} else {
log.Warn("service primary addr doesn't exist", zap.String("service-key", serviceKey))
}
watcher := clientv3.NewWatcher(s.client)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are many codes with the same logic, the only difference is the key. How about abstracting a function for them?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, I will try it later.

Signed-off-by: lhy1024 <admin@liudos.us>
@codecov
Copy link

codecov bot commented Apr 7, 2023

Codecov Report

Patch coverage: 56.00% and project coverage change: -0.12 ⚠️

Comparison is base (8c9b4fb) 75.16% compared to head (c5c6b5d) 75.04%.

❗ Current head c5c6b5d differs from pull request most recent head 596d6e4. Consider uploading reports for the commit 596d6e4 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6279      +/-   ##
==========================================
- Coverage   75.16%   75.04%   -0.12%     
==========================================
  Files         404      404              
  Lines       39860    39913      +53     
==========================================
- Hits        29961    29954       -7     
- Misses       7282     7332      +50     
- Partials     2617     2627      +10     
Flag Coverage Δ
unittests 75.04% <56.00%> (-0.12%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
pkg/tso/keyspace_group_manager.go 85.29% <0.00%> (ø)
pkg/utils/tsoutil/tso_dispatcher.go 58.03% <28.57%> (-2.72%) ⬇️
server/server.go 74.91% <60.93%> (-0.96%) ⬇️
server/grpc_service.go 49.51% <100.00%> (-0.09%) ⬇️

... and 18 files with indirect coverage changes

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

Signed-off-by: lhy1024 <admin@liudos.us>
@lhy1024 lhy1024 changed the title mcs: fix watch primary address revision mcs: fix watch primary address revision and update cache when meets not leader Apr 7, 2023
Copy link
Contributor

@binshi-bing binshi-bing left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is one more comment which might need you to check and confirm.

server/server.go Show resolved Hide resolved
case <-ctx.Done():
return revision, nil
case <-s.updateServicePrimaryAddrCh:
revision, err = s.updateServicePrimaryAddr(serviceName)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most likely even after we update the service primary address, we still have one more problem -- s.clientConns stores the forwarded hosts' grpc.ClientConn. We never update the broken connections. if a forwarded host's connection broke,e.g., the forwarded host restarted and broken the existing connection, we'll retrieve the broken connection for this forwarded host continuously.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it will try to create a new connection automatically when the existed connection is closed.

@ti-chi-bot ti-chi-bot removed the status/LGT1 Indicates that a PR has LGTM 1. label Apr 7, 2023
Copy link
Contributor

@binshi-bing binshi-bing left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ti-chi-bot ti-chi-bot added the status/LGT1 Indicates that a PR has LGTM 1. label Apr 8, 2023
pkg/utils/tsoutil/tso_dispatcher.go Outdated Show resolved Hide resolved
pkg/utils/tsoutil/tso_dispatcher.go Outdated Show resolved Hide resolved
Comment on lines 125 to 129
if len(updateServicePrimaryAddrChs) > 0 {
if strings.Contains(err.Error(), errs.NotLeaderErr) || strings.Contains(err.Error(), errs.MismatchLeaderErr) {
updateServicePrimaryAddrChs[0] <- struct{}{}
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest putting this part into a function outside the loop, which may help the compiler to do some inline optimization.

@@ -224,7 +224,7 @@ func (s *GrpcServer) Tso(stream pdpb.PD_TsoServer) error {
}

tsoRequest := tsoutil.NewPDProtoRequest(forwardedHost, clientConn, request, stream)
s.tsoDispatcher.DispatchRequest(ctx, tsoRequest, tsoProtoFactory, doneCh, errCh)
s.tsoDispatcher.DispatchRequest(ctx, tsoRequest, tsoProtoFactory, doneCh, errCh, s.updateServicePrimaryAddrCh)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that the channel is always passed to the function, then why use an optional parameter?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DispatchRequest is also used by tso server, tso server is no needed to watch api key

server/server.go Outdated
zap.Time("retry-at", time.Now().Add(watchKEtcdChangeRetryInterval)),
zap.Error(err))
revision = nextRevision
time.Sleep(watchKEtcdChangeRetryInterval)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest using a ticker to select rather than sleeping.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact, we need to wait for a retry here, instead of periodically going to use tick

server/server.go Outdated Show resolved Hide resolved
server/server.go Outdated
primary := &tsopb.Participant{}
ok, revision, err := etcdutil.GetProtoMsgWithModRev(s.client, serviceKey, primary)
listenUrls := primary.GetListenUrls()
if !ok || err != nil || len(listenUrls) == 0 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two cases we may return 0, nil and it breaks the retry loop, is it expected?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will fix

server/server.go Outdated

for {
WatchChan:
watchChan := watcher.Watch(s.serverLoopCtx, serviceKey, clientv3.WithPrefix(), clientv3.WithRev(revision))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we still need the prefix?

server/server.go Show resolved Hide resolved
Signed-off-by: lhy1024 <admin@liudos.us>
Signed-off-by: lhy1024 <admin@liudos.us>
Signed-off-by: lhy1024 <admin@liudos.us>
Comment on lines +1737 to +1741
select {
case <-ctx.Done():
return
case <-time.After(retryIntervalGetServicePrimary):
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we call s.updateServicePrimaryAddr(serviceName) at the beginning of the loop? Otherwise, we have to wait for a retryIntervalGetServicePrimary before the first time update.

Copy link
Member

@rleungx rleungx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

server/server.go Outdated

// SetServicePrimaryAddr sets the primary address directly.
// Note: This function is only used for test.
func (s *Server) SetServicePrimaryAddr(serviceName, addr string) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where do we use it?

@ti-chi-bot ti-chi-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 17, 2023
Signed-off-by: lhy1024 <admin@liudos.us>
Signed-off-by: lhy1024 <admin@liudos.us>
@ti-chi-bot ti-chi-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 17, 2023
@ti-chi-bot ti-chi-bot added status/LGT2 Indicates that a PR has LGTM 2. and removed status/LGT1 Indicates that a PR has LGTM 1. labels Apr 17, 2023
@lhy1024
Copy link
Contributor Author

lhy1024 commented Apr 17, 2023

/merge

@ti-chi-bot
Copy link
Member

@lhy1024: It seems you want to merge this PR, I will help you trigger all the tests:

/run-all-tests

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

@ti-chi-bot
Copy link
Member

This pull request has been accepted and is ready to merge.

Commit hash: c5c6b5d

@ti-chi-bot ti-chi-bot added the status/can-merge Indicates a PR has been approved by a committer. label Apr 17, 2023
@ti-chi-bot
Copy link
Member

@lhy1024: Your PR was out of date, I have automatically updated it for you.

If the CI test fails, you just re-trigger the test that failed and the bot will merge the PR for you after the CI passes.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

@ti-chi-bot ti-chi-bot merged commit b9a03d2 into tikv:master Apr 17, 2023
rleungx pushed a commit to rleungx/pd that referenced this pull request Apr 19, 2023
…ot leader (tikv#6279) (tikv#64)

* mcs: fix watch primary address revision

Signed-off-by: lhy1024 <admin@liudos.us>

* add update cache when meets not leader

Signed-off-by: lhy1024 <admin@liudos.us>

---------

Signed-off-by: lhy1024 <admin@liudos.us>
ti-chi-bot bot pushed a commit that referenced this pull request May 6, 2023
ref #5895, ref #6279, close #6289

Signed-off-by: lhy1024 <admin@liudos.us>
zeminzhou pushed a commit to zeminzhou/pd that referenced this pull request May 8, 2023
ref tikv#5895, ref tikv#6279, close tikv#6289

Signed-off-by: lhy1024 <admin@liudos.us>
zeminzhou pushed a commit to zeminzhou/pd that referenced this pull request May 10, 2023
ref tikv#5895, ref tikv#6279, close tikv#6289

Signed-off-by: lhy1024 <admin@liudos.us>
Signed-off-by: zeminzhou <zhouzemin@pingcap.com>
rleungx pushed a commit to rleungx/pd that referenced this pull request Jun 5, 2023
* mcs: update client when meet transport is closing (tikv#6341)

* mcs: update client when meet transport is closing

Signed-off-by: lhy1024 <admin@liudos.us>

* address comments

Signed-off-by: lhy1024 <admin@liudos.us>

* add retry

Signed-off-by: lhy1024 <admin@liudos.us>

---------

Signed-off-by: lhy1024 <admin@liudos.us>
Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
Signed-off-by: lhy1024 <admin@liudos.us>

* mcs: fix watch primary address revision and update cache when meets not leader (tikv#6279)

ref tikv#5895

Signed-off-by: lhy1024 <admin@liudos.us>

Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>
Signed-off-by: lhy1024 <admin@liudos.us>

---------

Signed-off-by: lhy1024 <admin@liudos.us>
Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>
rleungx pushed a commit to rleungx/pd that referenced this pull request Aug 2, 2023
ref tikv#5895, ref tikv#6279, close tikv#6289

Signed-off-by: lhy1024 <admin@liudos.us>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release-note-none status/can-merge Indicates a PR has been approved by a committer. status/LGT2 Indicates that a PR has LGTM 2.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants