Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Persist workflow request ids into Cassandra #5826

Merged
merged 2 commits into from Apr 11, 2024
Merged

Conversation

Shaddoll
Copy link
Contributor

@Shaddoll Shaddoll commented Mar 29, 2024

WorkflowRequest:

WorkflowRequest is a new type of data introduced in this PR which will be stored in database to allow detecting duplicate requests. By detecting duplicate requests, we can guarantee idempotency of the following APIs:

  • StartWorkflowExecution
  • SignalWorkflowExecution
  • CancelWorkflowExecution
  • ResetWorkflowExecution
  • SignalWithStartWorkflowExecution

A workflow request is identified by (shard_id, domain_id, workflow_id, request_id, request_type, version), and run_id is associated indicating which workflow the request has been applied to. Among them, shard_id is derived from workflow_id, and version is the failover version of the domain when the request is applied. request_id is a UUID set in Cadence API requests.

request_type:

Normally, request_id is not reused by clients because it's a UUID, but it may be reused incorrectly by different APIs. For example, a SignalWorkflow request with request_id: 7cfc8e22-b252-447d-ab03-9e5eda68dc35 has been applied to workflow test-wf and later a CancelWorkflow request with the same request_id is sent to the same workflow. In such case, we need to handle the CancelWorkflow request properly. There are 3 options to handle the second request:

  1. Make it a no-op
  2. Apply the request
  3. Return an error telling the customer that the request_id shouldn't be reused

Our implementation chooses the 2nd option, because it's more reasonable compared to option 1 and requires no change from client side compared to option 3. request_type is used for implementation of 2nd option, and without request_type we can only go with option 1, because we cannot differentiate different API requests.

We currently support 4 request types:

  • WorkflowRequestTypeStart
  • WorkflowRequestTypeSignal
  • WorkflowRequestTypeCancel
  • WorkflowRequestTypeReset

And they are mapped to the APIs mentioned above. However, SignalWithStartWorkflowExecution is treated differently. This single API request is mapped to 2 workflow requests stored in database, 1 with type WorkflowRequestTypeStart and 1 with type WorkflowRequestTypeSignal. And if a workflow is signaled and started by this API with request_id: 7cfc8e22-b252-447d-ab03-9e5eda68dc35. Not only SignalWithStartWorkflowExecution with the same request_id will be no-op, but also StartWorkflowExecution and SignalWorkflowExecution with the same request_id will also be no-op. This is to simplify the implementation and keeps the behavior the same with the current idempotency implementation.

Replication of WorkflowRequest:

WorkflowRequests need to be replicated to guarantee idempotency of APIs for global domains after a domain failover. But replication is done asynchronously, and by the time a workflow request is replicated, there might be already an existing request with the same request_id in the database, because the request is retried and applied after failover before getting replicated from the old cluster. version is used for such conflict resolution. And we also introduce CreateWorkflowRequestMode when writing workflow_requests to database.

  • CreateWorkflowRequestModeNew is used for workflow_requests generated from Cadence API requests. And we should not apply the request if there is already a workflow_request with the same request_id in the database, even if the version is different.
  • CreateWorkflowRequestModeReplicated is used for workflow_requests generated from replication. And we should always apply the request to achieve eventual consistency.

Cassandra Implementation:

Because Cassandra doesn't support cross-table LWT, the workflow_request has to be stored in the existing execution table so that the data can be inserted as a part of the transaction which stores the change of mutable state. We uses the following columns of execution table to store workflow_request data:

  • shard_id -> shard_id
  • type -> request_type
  • domain_id -> domain_id
  • workflow_id -> workflow_id
  • run_id -> request_id
  • task_id -> version * -1 (this is because we want to fetch the data in descending order)
  • current_run_id -> run_id

Risk:

This PR has no risk, because it only defines the struct and interface needed for storing workflow_requests. A following PR will use the change and stores data in database.

Copy link

codecov bot commented Mar 29, 2024

Codecov Report

Merging #5826 (8bf7a5d) into master (5e98036) will increase coverage by 0.05%.
The diff coverage is 89.43%.

Additional details and impacted files
Files Coverage Δ
common/persistence/data_manager_interfaces.go 95.50% <ø> (ø)
common/persistence/data_store_interfaces.go 100.00% <ø> (ø)
...ersistence/nosql/nosqlplugin/cassandra/workflow.go 100.00% <100.00%> (ø)
common/persistence/errors.go 0.00% <0.00%> (ø)
common/persistence/execution_manager.go 58.42% <60.00%> (+0.01%) ⬆️
...on/persistence/nosql/nosql_execution_store_util.go 84.29% <69.56%> (-0.08%) ⬇️
common/persistence/nosql/nosql_execution_store.go 73.53% <75.00%> (-0.13%) ⬇️
...ence/nosql/nosqlplugin/cassandra/workflow_utils.go 93.14% <95.33%> (+0.27%) ⬆️

... and 8 files with indirect coverage changes


Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5e98036...8bf7a5d. Read the comment docs.

@coveralls
Copy link

coveralls commented Mar 29, 2024

Pull Request Test Coverage Report for Build 018ecf24-a14b-4472-9311-1e5d8cc401a9

Details

  • 285 of 484 (58.88%) changed or added relevant lines in 7 files are covered.
  • 28 unchanged lines in 8 files lost coverage.
  • Overall coverage decreased (-0.02%) to 67.342%

Changes Missing Coverage Covered Lines Changed/Added Lines %
common/persistence/errors.go 0 3 0.0%
common/persistence/nosql/nosql_execution_store.go 36 43 83.72%
common/persistence/nosql/nosql_execution_store_util.go 17 25 68.0%
common/persistence/nosql/nosqlplugin/cassandra/workflow_utils.go 210 220 95.45%
common/persistence/persistence-tests/executionManagerTest.go 0 171 0.0%
Files with Coverage Reduction New Missed Lines %
common/task/parallel_task_processor.go 2 93.06%
common/task/fifo_task_scheduler.go 2 87.63%
service/frontend/api/handler.go 2 62.17%
common/util.go 2 91.78%
service/matching/taskListManager.go 3 79.7%
common/archiver/filestore/historyArchiver.go 4 80.95%
service/history/task/fetcher.go 4 85.05%
service/frontend/wrappers/metered/metered.go 9 63.18%
Totals Coverage Status
Change from base Build 018ecf22-e960-4e04-b6ff-2f5b7cc1fd47: -0.02%
Covered Lines: 98448
Relevant Lines: 146191

💛 - Coveralls

Copy link
Contributor

@taylanisikdemir taylanisikdemir left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overall LGTM but due to a lot of conditional changes in workflow_utils.go it's hard to review. I'd avoid changing existing branches there and focus on addition of new functionality in this PR. In a follow up PR we can improve those if needed

rowTypeShardTaskID = int64(-11)
emptyInitiatedID = int64(-7)
emptyWorkflowRequestVersion = int64(-1000)
workflowRequestTTL = 10800
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's include unit of time in the name

@Shaddoll Shaddoll marked this pull request as ready for review April 11, 2024 16:53
@Shaddoll Shaddoll merged commit 8249589 into uber:master Apr 11, 2024
20 checks passed
@Shaddoll Shaddoll deleted the db branch April 11, 2024 22:11
Shaddoll added a commit that referenced this pull request Apr 17, 2024
What changed?
This PR is built on top of #5826.
In this PR, we generate workflow requests from external API requests and replication events and store them in database to detect duplicated requests. If a duplicated requests is detected, a DuplicateRequestError is returned by persistence layer with the run_id telling upper layer which run the request has been applied to. And when this error is returned, the API does no-op and return the run_id to the caller.

Why?
To improve idempotency of Cadence APIs

How did you test it?
unit tests

Potential risks
We have a feature flag to turn on/off this feature. And we'll rollout this feature at domain level.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants