Expected Behavior
Visibility data is eventually consistent.
Actual Behavior
In some of our self-hosted clusters we observed occasional inconsistency between the primary and the advanced visibility storages, e.g. completed workflows being listed as running in the ListWorkflowExecutions response until they are deleted by the retention policy.
The same issue seems to have been reported by other Temporal users on the community forum and in Slack.
Steps to Reproduce the Problem
As far as I understand, the immediate queue implementation and its FIFO scheduler do not guarantee any particular order of task execution. A queue reader selects tasks in batches, and the scheduler dispatches them among multiple workers (512 by default for the visibility processor). So if the processing of a single task gets delayed (for any reason: the database or the workflow context could be busy), tasks further ahead in the queue can be executed and completed in the meantime; the sketch below illustrates this.
For a single execution, transfer tasks are serialized by mutable state conditions and database concurrency control mechanisms, but visibility tasks seem to execute in arbitrary order, and the visibility storage doesn't implement any concurrency control.
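A minimal, self-contained sketch (not actual Temporal code) of why FIFO reads plus a worker pool give no completion-order guarantee; the task names echo the visibility tasks involved here, and the one-second sleep stands in for a persistence delay:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	// The "queue reader" feeds this channel strictly in FIFO order.
	tasks := make(chan string)
	var wg sync.WaitGroup

	// A pool of workers, like the visibility processor's task workers.
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for task := range tasks {
				if task == "UpsertExecutionVisibilityTask" {
					time.Sleep(time.Second) // an occasional persistence delay
				}
				// Completion order no longer matches queue order.
				fmt.Println("completed:", task)
			}
		}()
	}

	// Dispatched in the order the workflow produced them...
	tasks <- "UpsertExecutionVisibilityTask"
	tasks <- "CloseExecutionVisibilityTask" // ...but this one completes first
	close(tasks)
	wg.Wait()
}
```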
The branch main...aromanovich:temporal:visibility-race shows how this becomes an issue:
go test . -persistenceType=sql -persistenceDriver=postgres12 -run 'AdvancedVisibilitySuite/TestVisibilityRace' -v
This test run fails when the SLOWDOWN_VISIBILITY_UPSERT=1 env var is set and succeeds otherwise.
The test is very simple: it starts an execution, changes its search attributes, and terminates it without any delays; it then waits for some time and checks that the visibility data is consistent with the workflow state. SLOWDOWN_VISIBILITY_UPSERT=1 slows down UpsertExecutionVisibilityTask by one second, imitating an occasional persistence delay.
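For concreteness, the injected delay in the linked branch presumably amounts to something like the following (an illustrative sketch; the helper name is hypothetical, and the messages mirror the log output below):

```go
package visibility

import (
	"log"
	"os"
	"time"
)

// maybeSlowDownUpsert imitates an occasional persistence delay: when the
// SLOWDOWN_VISIBILITY_UPSERT env var is set, upsert task processing is held
// up by one second. Hypothetical helper, not the branch's actual code.
func maybeSlowDownUpsert() {
	if os.Getenv("SLOWDOWN_VISIBILITY_UPSERT") != "1" {
		return
	}
	log.Println("Sleeping for 1 second")
	time.Sleep(time.Second)
	log.Println("Finished sleeping for 1 second")
}
```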
This delay causes UpsertExecutionVisibilityTask to succeed after CloseExecutionVisibilityTask, which leaves the visibility data in an inconsistent state:
2024-04-02T00:19:01.581+0200 info Sleeping for 1 second {"shard-id": 2, "address": "127.0.0.1:7231", "component": "visibility-queue-processor", "logging-call-at": "visibility_queue_task_executor.go:237"}
2024-04-02T00:19:01.602+0200 info processCloseExecution {"shard-id": 2, "address": "127.0.0.1:7231", "component": "visibility-queue-processor", "startedAt": "2024-04-01T22:19:01.600Z", "finishedAt": "2024-04-01T22:19:01.602Z", "logging-call-at": "visibility_queue_task_executor.go:256"}
2024-04-02T00:19:02.581+0200 info Finished sleeping for 1 second {"shard-id": 2, "address": "127.0.0.1:7231", "component": "visibility-queue-processor", "logging-call-at": "visibility_queue_task_executor.go:239"}
2024-04-02T00:19:02.585+0200 info processUpsertExecution {"shard-id": 2, "address": "127.0.0.1:7231", "component": "visibility-queue-processor", "startedAt": "2024-04-01T22:19:01.580Z", "finishedAt": "2024-04-01T22:19:02.585Z", "logging-call-at": "visibility_queue_task_executor.go:200"}
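For reference, here is a hedged sketch of the kind of eventual-consistency check the test performs, written against the Temporal Go SDK rather than the server's functional-test suite; the namespace, workflow ID, query, and timeout are all illustrative:

```go
// After a workflow has been terminated, poll advanced visibility until it
// reports the terminal status. With the race present, the execution can stay
// listed as Running until retention deletes it, so a loop like this times out.
package main

import (
	"context"
	"log"
	"time"

	"go.temporal.io/api/workflowservice/v1"
	"go.temporal.io/sdk/client"
)

func main() {
	c, err := client.Dial(client.Options{}) // assumes a locally running server
	if err != nil {
		log.Fatal(err)
	}
	defer c.Close()

	// Illustrative workflow ID; the real test uses its own execution.
	query := "WorkflowId = 'visibility-race-demo' AND ExecutionStatus = 'Terminated'"

	deadline := time.Now().Add(30 * time.Second)
	for time.Now().Before(deadline) {
		resp, err := c.ListWorkflow(context.Background(), &workflowservice.ListWorkflowExecutionsRequest{
			Namespace: "default",
			Query:     query,
		})
		if err != nil {
			log.Fatal(err)
		}
		if len(resp.Executions) > 0 {
			log.Println("visibility data is consistent with the workflow state")
			return
		}
		time.Sleep(time.Second)
	}
	log.Fatal("visibility data is still inconsistent")
}
```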
Specifications
Version: 1.22.5
Platform: All