Skip to content

Return Unavailable to frontend rpcs until healthy#5069

Merged
dnr merged 7 commits intotemporalio:mainfrom
dnr:health1
Dec 12, 2023
Merged

Return Unavailable to frontend rpcs until healthy#5069
dnr merged 7 commits intotemporalio:mainfrom
dnr:health1

Conversation

@dnr
Copy link
Contributor

@dnr dnr commented Nov 3, 2023

What changed?
Add an interceptor to return Unavailable to WorkflowService methods until the frontend considers itself "healthy", which currently means "membership is initialized".

Why?
Fixes #5015

How did you test it?
mostly manually

Potential risks
This adds a window of time where frontend can now return Unavailable where previously it might have succeeded or returned a different error code. Specifically note that client.Dial in go sdk (at least) will fail fast on this error and the caller will need to retry.

Comment on lines 59 to 60
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why bother splitting the path? You can just check for a prefix:

Suggested change
servicePrefix, _ := SplitMethodName(info.FullMethod)
if servicePrefix == api.WorkflowServicePrefix {
if strings.HasPrefix(info.FullMethod, api.WorkflowServicePrefix) {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fair point. all the other interceptors use that function so it seemed fitting

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, comments welcome on whether it should apply to other services too. on second thought it should probably be workflowservice + operatorservice

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there services it shouldnt' apply to?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would argue not AdminService, since theoretically there might be something in there to repair an unhealthy frontend (although there's nothing like that now).
And also not grpc health service, since that should be able to return successful "not serving" responses.

@dnr dnr merged commit c8aba5e into temporalio:main Dec 12, 2023
@dnr dnr deleted the health1 branch December 12, 2023 20:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

gRPC health check may say the server is unhealthy even if it's responding successfully to GetSystemInfo

2 participants