Speed up sgr commit / chunking #611

Merged
mildbyte merged 10 commits into master from feature/CU-1z4bhdb-sgr-commit-speedups on Jan 13, 2022
Conversation

@mildbyte (Contributor) commented on Jan 13, 2022

Use PG cursors + FETCH to partition tables on commit.

Instead of doing a pre-flight query to figure out partition boundaries and then
chunking the table by filtering on those boundaries (which performs poorly on
tables without PKs and on foreign tables), use a PG server-side cursor and run
FETCHes to add data to CStore files.
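
For illustration, a minimal sketch of the cursor-based chunking, assuming a psycopg2 connection; `write_fragment` is a hypothetical stand-in for the actual CStore writer, not Splitgraph's real code path:

```python
# Sketch only: chunking a table with a PG server-side cursor.
# Assumes psycopg2; write_fragment() is a hypothetical CStore writer.
import psycopg2

CHUNK_SIZE = 10_000  # rows per FETCH / per fragment

conn = psycopg2.connect("dbname=splitgraph")
with conn:
    # A named cursor is a PG server-side cursor: no pre-flight query for
    # partition boundaries; rows come over in batches via FETCH.
    with conn.cursor(name="commit_chunker") as cur:
        cur.execute('SELECT * FROM "query-results"')
        while True:
            rows = cur.fetchmany(CHUNK_SIZE)  # runs FETCH FORWARD <n> on the server
            if not rows:
                break
            write_fragment(rows)  # hypothetical: append this batch to a temporary object
```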

Two changes to behaviour: objects are no longer made in parallel, and object
hashing now happens after we've grabbed the data from the cursor into a
temporary object (which we then rename). The latter might actually speed
things up, since the hashing is done against the CStore object, which is
faster to read.
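
The reordering could look roughly like this (a sketch with hypothetical helper names, not the actual Splitgraph internals):

```python
# Sketch only: load into a temporary object first, hash it, then rename.
# write_rows_to_cstore() and hash_object_contents() are hypothetical helpers.
import os

def create_object(rows, objects_dir="/var/lib/splitgraph/objects"):
    tmp_path = os.path.join(objects_dir, "tmp_object")
    write_rows_to_cstore(rows, tmp_path)        # 1. dump the fetched rows into a temp CStore file
    object_id = hash_object_contents(tmp_path)  # 2. hash the columnar file (fast to read back)
    os.rename(tmp_path, os.path.join(objects_dir, object_id))  # 3. adopt the content hash as the name
    return object_id
```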

Other simplifications to object creation:

When renaming the object, do a pre-flight check to make sure it doesn't
already exist, instead of setting up savepoints. If we do get a race, it's
fine: we'll just reupsert the existing metadata with the same values (see the
sketch below this list).

Also simplify `create_base_fragment` (delete the now-unneeded chunking logic).
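
A sketch of the pre-flight check plus idempotent reupsert, assuming a psycopg2 connection and an illustrative `objects` metadata table (not Splitgraph's actual schema):

```python
# Sketch only: no savepoints needed around object registration.
def register_object(conn, object_id, size):
    with conn.cursor() as cur:
        # Pre-flight check: if the object already exists, keep it and
        # discard our temporary copy instead of renaming.
        cur.execute("SELECT 1 FROM objects WHERE object_id = %s", (object_id,))
        if cur.fetchone():
            return
        # If another writer races past the check, the upsert just rewrites
        # the same metadata values, so the race is harmless.
        cur.execute(
            "INSERT INTO objects (object_id, size) VALUES (%s, %s) "
            "ON CONFLICT (object_id) DO UPDATE SET size = EXCLUDED.size",
            (object_id, size),
        )
```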

Benchmarking

Setup: I had a CSV file with some data after running sgr cloud download:

$ zcat query-e2abcf218bbc964ff9999b08c06c97447018c395-20211217-134600.csv.gz | head
timestamp,query_id,remote_address,application,user_id,username,matomo_id,access_token,dbname,original_query,success,used_images
2020-08-27 21:30:24,--XUMXQB-GrJfm60NHBY,82.4.198.115,pgcli,7575f36b-5f6e-4f86-b05e-b1cc91c73fa1,mildbyte,,false,,"SELECT * FROM ""splitgraph/election-geodata:latest"".nation g
WHERE g.state = '36'
        AND g.county = '061' AND g.precinct = '3606183003'
        AND ST_IsValid(geom)",t,"[""splitgraph/election-geodata:latest""]"
$ zcat query-e2abcf218bbc964ff9999b08c06c97447018c395-20211217-134600.csv.gz | wc -l
12232602


$ sgr init benchmark/commit-speed
$ zcat query-e2abcf218bbc964ff9999b08c06c97447018c395-20211217-134600.csv.gz | sgr csv import benchmark/commit-speed query-results
Warning: primary key is not specified, using the whole row as primary key. This is probably not something that you want.

Without a primary key

time sgr commit --snap --overwrite benchmark/commit-speed

Old sgr: started a tqdm progress bar with a high-variance ETA (objects are committed in parallel, so we get 8 objects bumping the bar at the same time). I terminated it after 25 minutes, with tqdm estimating a total time of 1h30m.

New sgr: 3 trials, 5m36s, 5m3s, 5m29s (avg 5m23s)

With a primary key

sgr sql -s benchmark/commit-speed 'ALTER TABLE "query-results" ADD PRIMARY KEY (timestamp, query_id)'

Old sgr: 5m32s, 6m14s, 5m49s (avg 5m52s)
New sgr: 3m23s, 3m39s, 3m15s (avg 3m26s)


Notes from individual commits:

 - [still WIP, a couple of broken tests. Also needs some refactoring, since
   `create_base_fragment` is now a spaghettifest and I'm not sure if some of
   its args are no longer used]
 - Detect already-existing objects in the object manager, not the engine.
 - Remove the now-unused `chunk_condition` args.
 - Make `table_schema` mandatory (all code passes it in anyway).
 - If this `finally` is hit after an error, we've already rolled back the
   transaction, so the cursor doesn't exist any more; trying to close it then
   raises an error that masks the real one (see the sketch below).
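
The last note is about exception masking in a `finally` block; a generic sketch of the guard (hypothetical names):

```python
# Sketch only: an error inside the block rolls back the transaction and
# destroys the server-side cursor, so a bare close() in `finally` raises
# a second error that masks the original one.
try:
    consume_cursor(cur)  # hypothetical work that may raise
finally:
    try:
        cur.close()
    except Exception:
        pass  # cursor already gone after rollback; keep the real error visible
```
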
@mildbyte merged commit fc987c1 into master on Jan 13, 2022
@mildbyte deleted the feature/CU-1z4bhdb-sgr-commit-speedups branch on January 13, 2022 11:10