perf: Fix full table scans for objects, files, tables #1709

Merged
shawnlewis merged 4 commits into master from perf/object-read-perf
May 31, 2024

Conversation


@shawnlewis shawnlewis commented May 31, 2024

We were scanning entire ClickHouse tables on every object, file, and table read.

This change scans only the subset of those tables that belongs to the specific project referenced in the user query.

This should speed up W&B production reads of object_versions from about 3s to about 20ms in many cases (roughly two orders of magnitude faster). We should get similar improvements for file and table reads as well.

# There are four major kinds of things that we query:
# - calls,
# - object_versions,
# - tables
# - files
#
# calls are identified by ID.
#
# object_versions, tables, and files are identified by digest. For these kinds of
# things, we dedupe at merge time using ClickHouse's ReplacingMergeTree, but we also
# need to dedupe at query time.
#
# Previously, we did query-time deduping in *_deduped VIEWs. But it turns out
# ClickHouse doesn't push the project_id predicate down into those views, so we were
# always scanning whole tables.
#
# Now, we've inlined what used to be those views into this file as subqueries,
# and put the project_id predicate in the innermost subquery, which fixes
# the problem.
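
The shape of that fix can be sketched roughly as follows. This is an illustrative sketch, not the code from this PR: the helper name `build_deduped_query`, the `created_at` ordering column, the `rn` alias, and the use of `row_number()` for deduping are all assumptions made for the example. The `{project_id:String}` placeholder is ClickHouse's query-parameter syntax and would be bound by the client at execution time.

```python
def build_deduped_query(
    table: str,
    select_columns: list[str],
    dedupe_keys: list[str],
    order_column: str = "created_at",  # hypothetical ordering column
) -> str:
    """Build SQL that dedupes rows at query time, filtering by project first.

    Identifiers are interpolated into the SQL text, so they must come from
    trusted code, never from user input. The project ID itself is passed as
    a bound query parameter.
    """
    for name in [table, order_column, *select_columns, *dedupe_keys]:
        if not name.isidentifier():
            raise ValueError(f"unexpected identifier: {name!r}")

    columns = ", ".join(select_columns)
    partition = ", ".join(dedupe_keys)
    # The project_id predicate sits in the innermost subquery, so only that
    # project's rows are read before the dedupe step runs over them.
    return f"""
SELECT {columns}
FROM (
    SELECT
        *,
        row_number() OVER (
            PARTITION BY {partition}
            ORDER BY {order_column} ASC
        ) AS rn
    FROM {table}
    WHERE project_id = {{project_id:String}}
)
WHERE rn = 1
""".strip()


if __name__ == "__main__":
    # Hypothetical table and column names, for illustration only.
    print(build_deduped_query("files", ["digest"], ["project_id", "digest"]))
```

The point of the sketch is the position of the `WHERE project_id = ...` clause: it is applied to the base table directly rather than to the result of a view, so the filter does not depend on the engine pushing a predicate down through the dedupe layer.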

@shawnlewis shawnlewis requested a review from a team as a code owner May 31, 2024 21:03

@@ -806,8 +853,21 @@ def file_create(self, req: tsi.FileCreateReq) -> tsi.FileCreateRes:
        return tsi.FileCreateRes(digest=digest)

    def file_content_read(self, req: tsi.FileContentReadReq) -> tsi.FileContentReadRes:
        # Must dedupe files
Contributor

still a todo?

Contributor Author

No, sorry. Those comments are meant to say that we're deduping in the query. I could clarify them.

Contributor

oh, no it is just a note as to why the query is so wild

Contributor Author

The comments are updated to be more clear now.

Contributor

@tssweeney tssweeney left a comment

Approved: functionally sound when tested locally.

@shawnlewis shawnlewis merged commit 0e0616b into master May 31, 2024
22 of 24 checks passed
@shawnlewis shawnlewis deleted the perf/object-read-perf branch May 31, 2024 21:38
@github-actions github-actions bot locked and limited conversation to collaborators May 31, 2024