Improve tasklist and implement history scavenger for SQL #4059

Merged (27 commits) on Apr 12, 2021

@kraney (Contributor) commented Mar 17, 2021

There are deficiencies in the TaskList scanner (which only runs for SQL storage) that allow tasks to build up in the tasks table.

  • The scavenger stops processing when it reaches a task that has not expired, regardless of whether that task has completed. But tasks may have infinite timeouts. Not only is such a task never garbage collected, no task that comes after it in sequence is garbage collected either.
  • Entries may be deleted from the task_lists table while related entries remain in the tasks table (there are no foreign key constraints). In that case, the scavenger never attempts to clean up those tasks.

This changeset attempts to address these deficiencies in the following ways:

  • The "stop" logic in the tasklist scanner has been altered to match the logic used by TaskGC in history: it deletes all tasks with an ID less than the persisted ACK level, regardless of expiration time. Since history never reads tasks with an ID less than the ackLevel, this is deemed safe.
  • An additional task was added to the tasklist scanner that queries for orphaned tasks, and deletes them when found.
  • In support of the above, a new GetOrphanTasks() query was added to the persistence interface. It is left unimplemented for Cassandra, since the tasklist scanner is unused there.
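
The two rules above can be sketched in Go as follows. This is an illustration under stated assumptions, not the code in this PR: `taskRow`, `belowAckLevel`, and `orphans` are hypothetical names, and the in-memory slices stand in for what the real scanner reads through the persistence interface.

```go
package main

import "fmt"

// taskRow is a hypothetical stand-in for a row in the tasks table.
type taskRow struct {
	TaskList string // assumption: the owning task list is identified by name
	TaskID   int64
}

// belowAckLevel returns the IDs of all rows whose TaskID is less than
// ackLevel, without looking at expiration time.
func belowAckLevel(rows []taskRow, ackLevel int64) []int64 {
	var ids []int64
	for _, r := range rows {
		if r.TaskID < ackLevel {
			ids = append(ids, r.TaskID)
		}
	}
	return ids
}

// orphans returns the rows whose task list is missing from taskLists, which
// models tasks left behind after their task_lists entry was deleted.
func orphans(rows []taskRow, taskLists map[string]bool) []taskRow {
	var out []taskRow
	for _, r := range rows {
		if !taskLists[r.TaskList] {
			out = append(out, r)
		}
	}
	return out
}

func main() {
	rows := []taskRow{
		{TaskList: "orders", TaskID: 1},
		{TaskList: "orders", TaskID: 2},
		{TaskList: "orders", TaskID: 5},
		{TaskList: "removed", TaskID: 9},
	}
	// Tasks of the "orders" list below a persisted ACK level of 3.
	fmt.Println(belowAckLevel(rows[:3], 3)) // prints [1 2]
	// Tasks whose task list no longer exists.
	fmt.Println(len(orphans(rows, map[string]bool{"orders": true}))) // prints 1
}
```

Unlike an expiry-based stop condition, the ACK-level filter is not blocked by a single long-lived task: every row below the level qualifies independently.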

In a similar vein, the history scanner does not run for SQL storage. This allows garbage to accumulate indefinitely in the history_tree and history_node tables. Seemingly, the reason for this omission is that the GetAllHistoryBranches() query is unimplemented for SQL. This changeset implements that query and then uses it to enable the history scanner for SQL storage.
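
A scanner that must visit every history branch has to read the table in pages. The description does not say how the SQL query paginates, so the Go sketch below shows one common approach, keyset pagination, purely as an assumption: the `branchKey` fields, the ordering, and `nextPage` are hypothetical, and a sorted in-memory slice stands in for the history_tree table.

```go
package main

import (
	"fmt"
	"sort"
)

// branchKey is a hypothetical composite key for one history branch.
type branchKey struct {
	ShardID  int
	TreeID   string
	BranchID string
}

// less orders keys by ShardID, then TreeID, then BranchID.
func less(a, b branchKey) bool {
	if a.ShardID != b.ShardID {
		return a.ShardID < b.ShardID
	}
	if a.TreeID != b.TreeID {
		return a.TreeID < b.TreeID
	}
	return a.BranchID < b.BranchID
}

// nextPage returns up to pageSize keys strictly greater than after, plus a
// cursor for the following call. A nil after starts from the beginning and
// a nil cursor means there are no more pages. sorted must be ordered by less.
func nextPage(sorted []branchKey, after *branchKey, pageSize int) ([]branchKey, *branchKey) {
	if pageSize < 1 {
		pageSize = 1
	}
	start := 0
	if after != nil {
		// First index whose key is strictly greater than the cursor.
		start = sort.Search(len(sorted), func(i int) bool { return less(*after, sorted[i]) })
	}
	end := start + pageSize
	if end > len(sorted) {
		end = len(sorted)
	}
	page := sorted[start:end]
	if end == len(sorted) {
		return page, nil
	}
	last := page[len(page)-1]
	return page, &last
}

func main() {
	rows := []branchKey{
		{ShardID: 2, TreeID: "t1", BranchID: "b1"},
		{ShardID: 1, TreeID: "t2", BranchID: "b1"},
		{ShardID: 1, TreeID: "t1", BranchID: "b2"},
		{ShardID: 1, TreeID: "t1", BranchID: "b1"},
		{ShardID: 3, TreeID: "t1", BranchID: "b1"},
	}
	sort.Slice(rows, func(i, j int) bool { return less(rows[i], rows[j]) })

	var cursor *branchKey
	for {
		page, next := nextPage(rows, cursor, 2)
		fmt.Println(len(page), "branches in page")
		if next == nil {
			break
		}
		cursor = next
	}
}
```

With five rows and a page size of two, the loop reads pages of two, two, and one. Resuming from the last key seen, rather than from a row offset, keeps each page lookup cheap even when rows are deleted between pages.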

Without this change, cruft will accumulate in the database indefinitely, increasing storage used and also increasing DB load and query latency.

These changes are patched in from an operational fork of this project. The changes are working in operation; however, the patch process may have introduced regressions, and the post-patch code is much less well tested.

I ran the automated tests. Some tests that are likely important fail:

  • I don't know how to make the postgres and mysql tests run correctly; they seem to require a local running instance of the db as a prerequisite.
  • I declined to rewrite the test suite for the tasklist scavenger, which is fully predicated on assumptions around stopping once an unexpired task is reached. Those assumptions no longer hold. Someone upstream needs to reason through whether the change is valid & acceptable. (However, one bit of evidence in its favor is it simply mimics the logic of TaskGC.)

If you're using Cassandra, there are no risks: nothing changes for that backend.

If you're using SQL, the scavenger could consume excessive resources, especially on the first run, when there is an existing buildup of unmanaged garbage. It could also delete tasks from the tasks table that should not be deleted. The history scanner could similarly consume excessive resources, although the history scanner code is not new; it is the same code used for Cassandra, only newly enabled for SQL.

@longquanzheng (Collaborator) commented Mar 18, 2021

Wow, thanks for the PR!!! These are lots of improvements.

Btw, is this to replace #4019 ?

@kraney (Contributor, Author) commented Mar 18, 2021

This change picks up on that branch and includes that change, so this PR could be considered a replacement, yes.

@longquanzheng (Collaborator) commented Mar 19, 2021

I don't know how to make the postgres and mysql tests run correctly; they seem to require a local running instance of the db as a prerequisite.

Yes, we have very rough instructions about testing. They need more improvement, hehe
see https://github.com/uber/cadence/blob/master/docs/setup/CONTRIBUTING.md#testing--debug

I declined to rewrite the test suite for the tasklist scavenger

Totally acceptable. Leaking is very slow; usually people identify and fix it before it becomes an issue. In fact, it's quite hard to write the tests :(

@longquanzheng (Collaborator) left a comment

Thanks for making such a big contribution. Really, really appreciate it.

kraney added a commit to kraney/cadence-docs that referenced this pull request Mar 19, 2021

@longquanzheng (Collaborator) left a comment

Above are mostly all my comments. This PR is very good work. Probably the best one from the community so far.
cc @meiliang86

@yycptt (Contributor) left a comment

Thanks so much for the contribution! Definitely one of the highest quality PRs from the open-source community and it requires a very deep understanding of Cadence internals. Thank you!

service/worker/scanner/tasklist/handler.go (outdated review thread, resolved)
service/worker/scanner/tasklist/scavenger.go (outdated review thread, resolved)
@longquanzheng changed the title from "Address several deficiencies in the tasklist and history scavenger for SQL" to "Improve tasklist and implement history scavenger for SQL" Apr 11, 2021
@coveralls

Pull Request Test Coverage Report for Build 7f584f05-26bb-4d9d-8345-2c500d9de2c4

  • 94 of 350 (26.86%) changed or added relevant lines in 19 files are covered.
  • 123 unchanged lines in 14 files lost coverage.
  • Overall coverage decreased (-0.2%) to 67.868%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| common/persistence/taskManager.go | 0 | 3 | 0.0% |
| common/persistence/sql/sqlplugin/mysql/task.go | 8 | 12 | 66.67% |
| common/persistence/sql/sqlplugin/postgres/task.go | 6 | 10 | 60.0% |
| service/worker/scanner/tasklist/scavenger.go | 18 | 22 | 81.82% |
| common/persistence/persistenceRateLimitedClients.go | 0 | 5 | 0.0% |
| common/persistence/cassandra/cassandraTaskPersistence.go | 0 | 6 | 0.0% |
| service/worker/scanner/scanner.go | 0 | 7 | 0.0% |
| common/persistence/persistenceMetricClients.go | 0 | 9 | 0.0% |
| common/persistence/persistenceErrorInjectionClients.go | 0 | 19 | 0.0% |
| service/worker/scanner/tasklist/db.go | 13 | 33 | 39.39% |

| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| service/history/execution/mutable_state_task_refresher.go | 1 | 75.29% |
| service/history/queue/timer_queue_processor_base.go | 1 | 78.6% |
| service/worker/scanner/scanner.go | 1 | 6.21% |
| service/worker/scanner/tasklist/handler.go | 1 | 47.42% |
| service/worker/scanner/tasklist/scavenger.go | 1 | 83.16% |
| common/persistence/executionStore.go | 2 | 74.15% |
| common/task/weightedRoundRobinTaskScheduler.go | 2 | 89.12% |
| common/task/fifoTaskScheduler.go | 3 | 84.54% |
| common/persistence/sql/sqlExecutionManager.go | 5 | 57.07% |
| common/persistence/cassandra/cassandraPersistence.go | 6 | 52.56% |

Totals
Change from base Build 946be230-e4c5-4e64-848d-3742f7226e8f: -0.2%
Covered Lines: 98730
Relevant Lines: 145474

💛 - Coveralls

@longquanzheng longquanzheng merged commit 801fda7 into uber:master Apr 12, 2021
yux0 pushed a commit to yux0/cadence that referenced this pull request May 4, 2021