Persist workflow request ids into Cassandra #5826

Shaddoll · 2024-03-29T22:10:28Z

WorkflowRequest:

WorkflowRequest is a new type of data introduced in this PR which will be stored in database to allow detecting duplicate requests. By detecting duplicate requests, we can guarantee idempotency of the following APIs:

StartWorkflowExecution
SignalWorkflowExecution
CancelWorkflowExecution
ResetWorkflowExecution
SignalWithStartWorkflowExecution

A workflow request is identified by (shard_id, domain_id, workflow_id, request_id, request_type, version), and run_id is associated indicating which workflow the request has been applied to. Among them, shard_id is derived from workflow_id, and version is the failover version of the domain when the request is applied. request_id is a UUID set in Cadence API requests.

request_type:

Normally, request_id is not reused by clients because it's a UUID, but it may be reused incorrectly by different APIs. For example, a SignalWorkflow request with request_id: 7cfc8e22-b252-447d-ab03-9e5eda68dc35 has been applied to workflow test-wf and later a CancelWorkflow request with the same request_id is sent to the same workflow. In such case, we need to handle the CancelWorkflow request properly. There are 3 options to handle the second request:

Make it a no-op
Apply the request
Return an error telling the customer that the request_id shouldn't be reused

Our implementation chooses the 2nd option, because it's more reasonable compared to option 1 and requires no change from client side compared to option 3. request_type is used for implementation of 2nd option, and without request_type we can only go with option 1, because we cannot differentiate different API requests.

We currently support 4 request types:

WorkflowRequestTypeStart
WorkflowRequestTypeSignal
WorkflowRequestTypeCancel
WorkflowRequestTypeReset

And they are mapped to the APIs mentioned above. However, SignalWithStartWorkflowExecution is treated differently. This single API request is mapped to 2 workflow requests stored in database, 1 with type WorkflowRequestTypeStart and 1 with type WorkflowRequestTypeSignal. And if a workflow is signaled and started by this API with request_id: 7cfc8e22-b252-447d-ab03-9e5eda68dc35. Not only SignalWithStartWorkflowExecution with the same request_id will be no-op, but also StartWorkflowExecution and SignalWorkflowExecution with the same request_id will also be no-op. This is to simplify the implementation and keeps the behavior the same with the current idempotency implementation.

Replication of WorkflowRequest:

WorkflowRequests need to be replicated to guarantee idempotency of APIs for global domains after a domain failover. But replication is done asynchronously, and by the time a workflow request is replicated, there might be already an existing request with the same request_id in the database, because the request is retried and applied after failover before getting replicated from the old cluster. version is used for such conflict resolution. And we also introduce CreateWorkflowRequestMode when writing workflow_requests to database.

CreateWorkflowRequestModeNew is used for workflow_requests generated from Cadence API requests. And we should not apply the request if there is already a workflow_request with the same request_id in the database, even if the version is different.
CreateWorkflowRequestModeReplicated is used for workflow_requests generated from replication. And we should always apply the request to achieve eventual consistency.

Cassandra Implementation:

Because Cassandra doesn't support cross-table LWT, the workflow_request has to be stored in the existing execution table so that the data can be inserted as a part of the transaction which stores the change of mutable state. We uses the following columns of execution table to store workflow_request data:

shard_id -> shard_id
type -> request_type
domain_id -> domain_id
workflow_id -> workflow_id
run_id -> request_id
task_id -> version * -1 (this is because we want to fetch the data in descending order)
current_run_id -> run_id

Risk:

This PR has no risk, because it only defines the struct and interface needed for storing workflow_requests. A following PR will use the change and stores data in database.

codecov · 2024-03-29T22:27:58Z

Codecov Report

Merging #5826 (8bf7a5d) into master (5e98036) will increase coverage by 0.05%.
The diff coverage is 89.43%.

Additional details and impacted files

Files	Coverage Δ
common/persistence/data_manager_interfaces.go	`95.50% <ø> (ø)`
common/persistence/data_store_interfaces.go	`100.00% <ø> (ø)`
...ersistence/nosql/nosqlplugin/cassandra/workflow.go	`100.00% <100.00%> (ø)`
common/persistence/errors.go	`0.00% <0.00%> (ø)`
common/persistence/execution_manager.go	`58.42% <60.00%> (+0.01%)`	⬆️
...on/persistence/nosql/nosql_execution_store_util.go	`84.29% <69.56%> (-0.08%)`	⬇️
common/persistence/nosql/nosql_execution_store.go	`73.53% <75.00%> (-0.13%)`	⬇️
...ence/nosql/nosqlplugin/cassandra/workflow_utils.go	`93.14% <95.33%> (+0.27%)`	⬆️

... and 8 files with indirect coverage changes

Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5e98036...8bf7a5d. Read the comment docs.

coveralls · 2024-03-29T23:15:43Z

Pull Request Test Coverage Report for Build 018ecf24-a14b-4472-9311-1e5d8cc401a9

Details

285 of 484 (58.88%) changed or added relevant lines in 7 files are covered.
28 unchanged lines in 8 files lost coverage.
Overall coverage decreased (-0.02%) to 67.342%

Changes Missing Coverage	Covered Lines	Changed/Added Lines	%
common/persistence/errors.go	0	3	0.0%
common/persistence/nosql/nosql_execution_store.go	36	43	83.72%
common/persistence/nosql/nosql_execution_store_util.go	17	25	68.0%
common/persistence/nosql/nosqlplugin/cassandra/workflow_utils.go	210	220	95.45%
common/persistence/persistence-tests/executionManagerTest.go	0	171	0.0%

Files with Coverage Reduction	New Missed Lines	%
common/task/parallel_task_processor.go	2	93.06%
common/task/fifo_task_scheduler.go	2	87.63%
service/frontend/api/handler.go	2	62.17%
common/util.go	2	91.78%
service/matching/taskListManager.go	3	79.7%
common/archiver/filestore/historyArchiver.go	4	80.95%
service/history/task/fetcher.go	4	85.05%
service/frontend/wrappers/metered/metered.go	9	63.18%

Totals
Change from base Build 018ecf22-e960-4e04-b6ff-2f5b7cc1fd47:	-0.02%
Covered Lines:	98448
Relevant Lines:	146191

💛 - Coveralls

taylanisikdemir

overall LGTM but due to a lot of conditional changes in workflow_utils.go it's hard to review. I'd avoid changing existing branches there and focus on addition of new functionality in this PR. In a follow up PR we can improve those if needed

taylanisikdemir · 2024-04-10T22:46:57Z

common/persistence/nosql/nosqlplugin/cassandra/constants.go

+	rowTypeShardTaskID          = int64(-11)
+	emptyInitiatedID            = int64(-7)
+	emptyWorkflowRequestVersion = int64(-1000)
+	workflowRequestTTL          = 10800


let's include unit of time in the name

common/persistence/nosql/nosqlplugin/cassandra/workflow_utils.go

common/persistence/nosql/nosqlplugin/cassandra/workflow_utils_test.go

common/persistence/nosql/nosqlplugin/cassandra/workflow_utils.go

common/persistence/persistence-tests/executionManagerTest.go

What changed? This PR is built on top of #5826. In this PR, we generate workflow requests from external API requests and replication events and store them in database to detect duplicated requests. If a duplicated requests is detected, a DuplicateRequestError is returned by persistence layer with the run_id telling upper layer which run the request has been applied to. And when this error is returned, the API does no-op and return the run_id to the caller. Why? To improve idempotency of Cadence APIs How did you test it? unit tests Potential risks We have a feature flag to turn on/off this feature. And we'll rollout this feature at domain level.

Shaddoll force-pushed the db branch from 2726bef to 3bdcd47 Compare March 29, 2024 22:51

Shaddoll force-pushed the db branch 6 times, most recently from 0afc3c8 to bc1f635 Compare April 10, 2024 17:14

Shaddoll marked this pull request as ready for review April 10, 2024 22:28

Shaddoll requested review from neil-xie, davidporter-id-au, Groxx, shijiesheng, agautam478, jakobht, 3vilhamster, sankari165, dkrotx, taylanisikdemir and demirkayaender as code owners April 10, 2024 22:28

Shaddoll marked this pull request as draft April 10, 2024 22:28

taylanisikdemir reviewed Apr 10, 2024

View reviewed changes

common/persistence/nosql/nosqlplugin/cassandra/workflow_utils.go Outdated Show resolved Hide resolved

taylanisikdemir reviewed Apr 10, 2024

View reviewed changes

common/persistence/nosql/nosqlplugin/cassandra/workflow_utils.go Show resolved Hide resolved

Shaddoll marked this pull request as ready for review April 11, 2024 16:53

Shaddoll force-pushed the db branch from bc1f635 to e723606 Compare April 11, 2024 19:58

taylanisikdemir reviewed Apr 11, 2024

View reviewed changes

common/persistence/persistence-tests/executionManagerTest.go Outdated Show resolved Hide resolved

taylanisikdemir reviewed Apr 11, 2024

View reviewed changes

common/persistence/persistence-tests/executionManagerTest.go Show resolved Hide resolved

taylanisikdemir approved these changes Apr 11, 2024

View reviewed changes

Shaddoll force-pushed the db branch 2 times, most recently from 17c1ecb to f3d05f0 Compare April 11, 2024 21:27

Persist workflow request ids into Cassandra

8a67e6c

Shaddoll force-pushed the db branch from f3d05f0 to 8a67e6c Compare April 11, 2024 21:46

Merge branch 'master' into db

8bf7a5d

Shaddoll merged commit 8249589 into uber:master Apr 11, 2024
20 checks passed

Shaddoll deleted the db branch April 11, 2024 22:11

Shaddoll mentioned this pull request Apr 15, 2024

Update mutable state to generate workflow requests #5821

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Persist workflow request ids into Cassandra #5826

Persist workflow request ids into Cassandra #5826

Shaddoll commented Mar 29, 2024 •

edited

codecov bot commented Mar 29, 2024 •

edited

coveralls commented Mar 29, 2024 •

edited

taylanisikdemir left a comment

taylanisikdemir Apr 10, 2024

Persist workflow request ids into Cassandra #5826

Persist workflow request ids into Cassandra #5826

Conversation

Shaddoll commented Mar 29, 2024 • edited

WorkflowRequest:

request_type:

Replication of WorkflowRequest:

Cassandra Implementation:

Risk:

codecov bot commented Mar 29, 2024 • edited

Codecov Report

coveralls commented Mar 29, 2024 • edited

Pull Request Test Coverage Report for Build 018ecf24-a14b-4472-9311-1e5d8cc401a9

Details

💛 - Coveralls

taylanisikdemir left a comment

Choose a reason for hiding this comment

taylanisikdemir Apr 10, 2024

Choose a reason for hiding this comment

Shaddoll commented Mar 29, 2024 •

edited

codecov bot commented Mar 29, 2024 •

edited

coveralls commented Mar 29, 2024 •

edited