Persist workflow request ids into Cassandra #5826
0afc3c8 to bc1f635
Overall LGTM, but the many conditional changes in workflow_utils.go make it hard to review. I'd avoid changing existing branches there and focus on adding the new functionality in this PR. In a follow-up PR we can improve those if needed.
rowTypeShardTaskID = int64(-11)
emptyInitiatedID = int64(-7)
emptyWorkflowRequestVersion = int64(-1000)
workflowRequestTTL = 10800
let's include unit of time in the name
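One way to follow this suggestion, as a minimal sketch: the name `workflowRequestTTLInSeconds` is hypothetical, and the unit is an assumption based on Cassandra TTLs being expressed in seconds.

```go
package main

import (
	"fmt"
	"time"
)

// workflowRequestTTLInSeconds is a hypothetical rename of workflowRequestTTL.
// The unit is assumed to be seconds, as used by Cassandra's USING TTL clause.
const workflowRequestTTLInSeconds = 10800

func main() {
	// Converting to time.Duration makes the intended length easy to verify.
	ttl := time.Duration(workflowRequestTTLInSeconds) * time.Second
	fmt.Println(ttl) // prints "3h0m0s"
}
```

With the unit in the name, a reader no longer has to guess whether 10800 means seconds, milliseconds, or minutes.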
common/persistence/nosql/nosqlplugin/cassandra/workflow_utils.go
17c1ecb to f3d05f0
What changed?
This PR is built on top of #5826. In this PR, we generate workflow requests from external API requests and replication events and store them in the database to detect duplicate requests. If a duplicate request is detected, a DuplicateRequestError is returned by the persistence layer with the run_id telling the upper layer which run the request has been applied to. When this error is returned, the API performs a no-op and returns the run_id to the caller.
Why?
To improve idempotency of Cadence APIs.
How did you test it?
Unit tests.
Potential risks
We have a feature flag to turn this feature on or off, and we'll roll it out at the domain level.
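The duplicate handling described under "What changed?" can be sketched roughly as follows. This is an illustration only, not the PR's code: `requestStore`, `record`, and `signalWorkflow` are hypothetical names, the store is an in-memory map standing in for the persistence layer, and the field layout of `DuplicateRequestError` is assumed.

```go
package main

import (
	"errors"
	"fmt"
)

// DuplicateRequestError is named in the description; the single RunID field
// used here is an assumption made for illustration.
type DuplicateRequestError struct {
	RunID string
}

func (e *DuplicateRequestError) Error() string {
	return "duplicate request already applied to run " + e.RunID
}

// requestStore is a hypothetical in-memory stand-in for the persistence
// layer. It maps a request ID to the run the request was applied to.
type requestStore struct {
	applied map[string]string
}

func newRequestStore() *requestStore {
	return &requestStore{applied: make(map[string]string)}
}

// record stores the request, or returns a DuplicateRequestError carrying the
// original run ID if the same request ID was already applied.
func (s *requestStore) record(requestID, runID string) error {
	if existing, ok := s.applied[requestID]; ok {
		return &DuplicateRequestError{RunID: existing}
	}
	s.applied[requestID] = runID
	return nil
}

// signalWorkflow is a hypothetical API handler. On a duplicate it does no
// further work and returns the run ID the request was originally applied to.
func signalWorkflow(s *requestStore, requestID, currentRunID string) (string, error) {
	err := s.record(requestID, currentRunID)
	var dup *DuplicateRequestError
	if errors.As(err, &dup) {
		return dup.RunID, nil // no-op path
	}
	if err != nil {
		return "", err
	}
	// The signal itself would be applied to the workflow here.
	return currentRunID, nil
}

func main() {
	store := newRequestStore()
	first, _ := signalWorkflow(store, "req-1", "run-a")
	second, _ := signalWorkflow(store, "req-1", "run-b")
	fmt.Println(first, second) // prints "run-a run-a"
}
```

The retried call returns the run ID of the first application rather than an error, which is what makes the API idempotent from the caller's point of view.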
WorkflowRequest:
WorkflowRequest is a new type of data introduced in this PR which will be stored in database to allow detecting duplicate requests. By detecting duplicate requests, we can guarantee idempotency of the following APIs:
A workflow request is identified by (`shard_id`, `domain_id`, `workflow_id`, `request_id`, `request_type`, `version`), and `run_id` is associated with it, indicating which workflow the request has been applied to. Among them, `shard_id` is derived from `workflow_id`, and `version` is the failover version of the domain when the request is applied. `request_id` is a UUID set in Cadence API requests.
request_type:
Normally, `request_id` is not reused by clients because it's a UUID, but it may be reused incorrectly by different APIs. For example, a SignalWorkflow request with `request_id: 7cfc8e22-b252-447d-ab03-9e5eda68dc35` has been applied to workflow `test-wf`, and later a CancelWorkflow request with the same request_id is sent to the same workflow. In such a case, we need to handle the CancelWorkflow request properly. There are 3 options to handle the second request:
Our implementation chooses the 2nd option, because it's more reasonable than option 1 and, unlike option 3, requires no change on the client side. `request_type` is used to implement the 2nd option; without `request_type` we could only go with option 1, because we could not differentiate requests from different APIs.
We currently support 4 request types:
They are mapped to the APIs mentioned above. However, `SignalWithStartWorkflowExecution` is treated differently. This single API request is mapped to 2 workflow requests stored in the database, 1 with type `WorkflowRequestTypeStart` and 1 with type `WorkflowRequestTypeSignal`. If a workflow is signaled and started by this API with `request_id: 7cfc8e22-b252-447d-ab03-9e5eda68dc35`, then not only will `SignalWithStartWorkflowExecution` with the same `request_id` be a no-op, but `StartWorkflowExecution` and `SignalWorkflowExecution` with the same `request_id` will also be no-ops. This simplifies the implementation and keeps the behavior the same as the current idempotency implementation.
Replication of WorkflowRequest:
WorkflowRequests need to be replicated to guarantee idempotency of APIs for global domains after a domain failover. But replication is done asynchronously, and by the time a workflow request is replicated, there might already be an existing request with the same `request_id` in the database, because the request was retried and applied after failover before being replicated from the old cluster. `version` is used for such conflict resolution. We also introduce `CreateWorkflowRequestMode` when writing workflow_requests to the database.
`CreateWorkflowRequestModeNew` is used for workflow_requests generated from Cadence API requests. We should not apply the request if there is already a workflow_request with the same `request_id` in the database, even if the `version` is different.
`CreateWorkflowRequestModeReplicated` is used for workflow_requests generated from replication. We should always apply the request to achieve eventual consistency.
Cassandra Implementation:
Because Cassandra doesn't support cross-table LWT, the workflow_request has to be stored in the existing `execution` table so that the data can be inserted as part of the transaction that stores the change of mutable state. We use the following columns of the execution table to store workflow_request data:
Risk:
This PR has no risk, because it only defines the struct and interface needed for storing workflow_requests. A following PR will use this change and store data in the database.
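To make the request identity and the two create modes described above more concrete, here is a minimal in-memory sketch. It is not the struct or interface defined by the PR: `requestKey`, `requestTable`, and `create` are hypothetical names, the numeric values of the constants are assumptions, only the two request types named in the description are declared, and returning the run ID of the highest existing version on a duplicate is an assumption, since the description does not say which run is reported.

```go
package main

import "fmt"

// WorkflowRequestType distinguishes API kinds. Only the two types named in
// the description are declared; the values are assumptions.
type WorkflowRequestType int

const (
	WorkflowRequestTypeStart WorkflowRequestType = iota
	WorkflowRequestTypeSignal
)

// CreateWorkflowRequestMode selects how an existing request is treated.
type CreateWorkflowRequestMode int

const (
	CreateWorkflowRequestModeNew CreateWorkflowRequestMode = iota
	CreateWorkflowRequestModeReplicated
)

// requestKey holds the identifying fields other than version. shard_id is
// left out because it is derived from workflow_id.
type requestKey struct {
	DomainID    string
	WorkflowID  string
	RequestID   string
	RequestType WorkflowRequestType
}

// requestTable maps a key to run IDs indexed by failover version.
type requestTable map[requestKey]map[int64]string

// create writes a request and reports (runID, true), or reports the run ID
// of an existing request and false when ModeNew finds one at any version.
func (t requestTable) create(mode CreateWorkflowRequestMode, key requestKey, version int64, runID string) (string, bool) {
	versions := t[key]
	if mode == CreateWorkflowRequestModeNew && len(versions) > 0 {
		// Assumption: report the run stored under the highest version.
		var best int64
		found := false
		for v := range versions {
			if !found || v > best {
				best, found = v, true
			}
		}
		return versions[best], false
	}
	if versions == nil {
		versions = make(map[int64]string)
		t[key] = versions
	}
	// ModeReplicated always writes, so replicas converge on the same data.
	versions[version] = runID
	return runID, true
}

func main() {
	table := requestTable{}
	const requestID = "7cfc8e22-b252-447d-ab03-9e5eda68dc35"
	newKey := func(rt WorkflowRequestType) requestKey {
		return requestKey{DomainID: "domain", WorkflowID: "test-wf", RequestID: requestID, RequestType: rt}
	}
	// SignalWithStartWorkflowExecution stores one Start and one Signal request.
	table.create(CreateWorkflowRequestModeNew, newKey(WorkflowRequestTypeStart), 1, "run-a")
	table.create(CreateWorkflowRequestModeNew, newKey(WorkflowRequestTypeSignal), 1, "run-a")
	// A later start with the same request_id is a duplicate, even at a newer version.
	fmt.Println(table.create(CreateWorkflowRequestModeNew, newKey(WorkflowRequestTypeStart), 2, "run-b"))
	// A replicated request is always written.
	fmt.Println(table.create(CreateWorkflowRequestModeReplicated, newKey(WorkflowRequestTypeStart), 2, "run-c"))
}
```

Keeping `version` out of the lookup key is what lets the new-request mode detect a duplicate across failover versions, while the per-version map lets replicated writes coexist with locally created ones.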