Expected Behavior
Visibility data is eventually consistent.
Actual Behavior
In some of our self-hosted clusters we observed occasional inconsistency between the primary and the advanced visibility storages, e.g. completed workflows being listed as running in the ListWorkflowExecutions response until they are deleted by the retention policy.
The same issue seems to have been reported by other Temporal users on the community forum and in Slack.
Steps to Reproduce the Problem
As far as I understand, the immediate queue implementation and its FIFO scheduler do not guarantee any particular order of task execution. A queue reader selects tasks in batches, and the scheduler dispatches them among multiple workers (512 by default for the visibility processor). So if the processing of a single task gets delayed (for any reason: the database or the workflow context could be busy), tasks further ahead in the queue can be executed and completed in the meantime; the sketch below illustrates this.
For a single execution, transfer tasks are serialized by mutable state conditions and database concurrency control mechanisms, but visibility tasks seem to execute in arbitrary order, and the visibility storage doesn't implement any concurrency control.
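A minimal, self-contained sketch (not actual Temporal code) of why FIFO reads plus a worker pool give no completion-order guarantee; the task names echo the visibility tasks involved here, and the one-second sleep stands in for a persistence delay:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	// The "queue reader" feeds this channel strictly in FIFO order.
	tasks := make(chan string)
	var wg sync.WaitGroup

	// A pool of workers, like the visibility processor's task workers.
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for task := range tasks {
				if task == "UpsertExecutionVisibilityTask" {
					time.Sleep(time.Second) // an occasional persistence delay
				}
				// Completion order no longer matches queue order.
				fmt.Println("completed:", task)
			}
		}()
	}

	// Dispatched in the order the workflow produced them...
	tasks <- "UpsertExecutionVisibilityTask"
	tasks <- "CloseExecutionVisibilityTask" // ...but this one completes first
	close(tasks)
	wg.Wait()
}
```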
The branch main...aromanovich:temporal:visibility-race shows how this becomes an issue:
go test . -persistenceType=sql -persistenceDriver=postgres12 -run 'AdvancedVisibilitySuite/TestVisibilityRace' -v
This test run fails when the SLOWDOWN_VISIBILITY_UPSERT=1 env var is set and succeeds otherwise.
The test is very simple: it starts an execution, changes its search attributes, and terminates it without any delays; it then waits for some time and checks that the visibility data is consistent with the workflow state. SLOWDOWN_VISIBILITY_UPSERT=1 slows down UpsertExecutionVisibilityTask by one second, imitating an occasional persistence delay.
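For concreteness, the injected delay in the linked branch presumably amounts to something like the following (an illustrative sketch; the helper name is hypothetical, and the messages mirror the log output below):

```go
package visibility

import (
	"log"
	"os"
	"time"
)

// maybeSlowDownUpsert imitates an occasional persistence delay: when the
// SLOWDOWN_VISIBILITY_UPSERT env var is set, upsert task processing is held
// up by one second. Hypothetical helper, not the branch's actual code.
func maybeSlowDownUpsert() {
	if os.Getenv("SLOWDOWN_VISIBILITY_UPSERT") != "1" {
		return
	}
	log.Println("Sleeping for 1 second")
	time.Sleep(time.Second)
	log.Println("Finished sleeping for 1 second")
}
```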
This delay causes UpsertExecutionVisibilityTask to succeed after CloseExecutionVisibilityTask, which leaves the visibility data in an inconsistent state:
2024-04-02T00:19:01.581+0200 info Sleeping for 1 second {"shard-id": 2, "address": "127.0.0.1:7231", "component": "visibility-queue-processor", "logging-call-at": "visibility_queue_task_executor.go:237"}
2024-04-02T00:19:01.602+0200 info processCloseExecution {"shard-id": 2, "address": "127.0.0.1:7231", "component": "visibility-queue-processor", "startedAt": "2024-04-01T22:19:01.600Z", "finishedAt": "2024-04-01T22:19:01.602Z", "logging-call-at": "visibility_queue_task_executor.go:256"}
2024-04-02T00:19:02.581+0200 info Finished sleeping for 1 second {"shard-id": 2, "address": "127.0.0.1:7231", "component": "visibility-queue-processor", "logging-call-at": "visibility_queue_task_executor.go:239"}
2024-04-02T00:19:02.585+0200 info processUpsertExecution {"shard-id": 2, "address": "127.0.0.1:7231", "component": "visibility-queue-processor", "startedAt": "2024-04-01T22:19:01.580Z", "finishedAt": "2024-04-01T22:19:02.585Z", "logging-call-at": "visibility_queue_task_executor.go:200"}
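For reference, here is a hedged sketch of the kind of eventual-consistency check the test performs, written against the Temporal Go SDK rather than the server's functional-test suite; the namespace, workflow ID, query, and timeout are all illustrative:

```go
// After a workflow has been terminated, poll advanced visibility until it
// reports the terminal status. With the race present, the execution can stay
// listed as Running until retention deletes it, so a loop like this times out.
package main

import (
	"context"
	"log"
	"time"

	"go.temporal.io/api/workflowservice/v1"
	"go.temporal.io/sdk/client"
)

func main() {
	c, err := client.Dial(client.Options{}) // assumes a locally running server
	if err != nil {
		log.Fatal(err)
	}
	defer c.Close()

	// Illustrative workflow ID; the real test uses its own execution.
	query := "WorkflowId = 'visibility-race-demo' AND ExecutionStatus = 'Terminated'"

	deadline := time.Now().Add(30 * time.Second)
	for time.Now().Before(deadline) {
		resp, err := c.ListWorkflow(context.Background(), &workflowservice.ListWorkflowExecutionsRequest{
			Namespace: "default",
			Query:     query,
		})
		if err != nil {
			log.Fatal(err)
		}
		if len(resp.Executions) > 0 {
			log.Println("visibility data is consistent with the workflow state")
			return
		}
		time.Sleep(time.Second)
	}
	log.Fatal("visibility data is still inconsistent")
}
```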
Specifications
Version: 1.22.5
Platform: All