Changed pg_dump/restore shelling out to using postgres' COPY TO/FROM #18
Looks like we can't pipe the object directly to S3 (or an S3-like storage) because the API needs to know the size of the object before the upload starts: minio/mc#2271, minio/minio-py#650
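One way around the size requirement is to spool the stream first and count bytes while doing so. A minimal sketch, assuming the dump arrives as an iterable of bytes chunks; `spool_with_size` and its parameters are hypothetical names, not code from this PR:

```python
import tempfile


def spool_with_size(chunks, max_in_memory=1024 * 1024):
    """Buffer an iterable of bytes chunks and return (file_object, size).

    Small payloads stay in memory; larger ones spill over to a temp file.
    """
    spool = tempfile.SpooledTemporaryFile(max_size=max_in_memory)
    size = 0
    for chunk in chunks:
        spool.write(chunk)
        size += len(chunk)
    spool.seek(0)  # rewind so the uploader reads from the start
    return spool, size
```

The returned file object and size could then be handed to an uploader that wants both up front, for example minio-py's `put_object(bucket_name, object_name, data, length)`. The cost is an extra copy of the data, which is what dumping to /tmp already pays.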
Force-pushed from 54c79d3 to 6f61bc6.
Rebased on experiment/poetry so that I can play around with alternative S3 clients.
Piping directly to smart_open -> boto instead of minio: slower, since initializing smart_open's connection takes about 0.18s.
and us dumping/restoring the DDL manually.
Force-pushed from 5c1db21 to 287a405.
Minio, multithreaded (4 threads), 570 small objects: 14.8s upload (~2x faster than without multithreading). There is still a lot of waiting on the pg connection, since one connection is shared between all threads.
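The shape of that setup, with dumps serialized on the shared connection and uploads running in parallel, can be sketched roughly as follows. `upload_all`, `dump_object` and `upload_object` are hypothetical names for illustration; the PR's real code may be organized differently:

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def upload_all(object_ids, dump_object, upload_object, max_workers=4):
    """Dump and upload objects using a small thread pool.

    dump_object(object_id) -> bytes is assumed to use the one shared pg
    connection, so it runs under a lock; upload_object(object_id, payload)
    is network-bound and runs concurrently.
    """
    db_lock = threading.Lock()

    def worker(object_id):
        with db_lock:  # only one thread talks to postgres at a time
            payload = dump_object(object_id)
        upload_object(object_id, payload)  # uploads overlap across threads
        return object_id

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order and re-raises worker errors
        return list(pool.map(worker, object_ids))
```

With this layout the speedup is bounded by how much of the time is spent in the locked dump step, which fits the observation that threads spend a lot of time waiting on the connection.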
1M SNAP, 1K DIFF (no gzipping): 13.2s upload, 20s download
100K SNAP, 1K DIFF (no gzipping): 1.8s upload, 2.3s download
Removes the binary dependency on pg_dump/restore. Much faster for lots of small objects, possibly slightly slower for large objects. Not sure at which point the overhead from gzipping beats the overhead from larger objects.
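For readers unfamiliar with the approach, dumping and restoring a table through COPY from Python can look roughly like this. It is a sketch, assuming a psycopg2-style cursor (which provides `copy_expert(sql, file)`) and the default text format; `quote_ident`, `dump_table` and `load_table` are hypothetical helpers, and the DDL still has to be dumped and restored separately:

```python
def quote_ident(name):
    """Quote a SQL identifier by doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def dump_table(cursor, schema, table, out_file):
    """Stream the rows of schema.table into out_file via COPY ... TO STDOUT."""
    sql = "COPY {}.{} TO STDOUT".format(quote_ident(schema), quote_ident(table))
    cursor.copy_expert(sql, out_file)


def load_table(cursor, schema, table, in_file):
    """Load rows from in_file into an existing schema.table via COPY ... FROM STDIN."""
    sql = "COPY {}.{} FROM STDIN".format(quote_ident(schema), quote_ident(table))
    cursor.copy_expert(sql, in_file)
```

Because this runs over an existing connection, there is no per-object process startup, which is where the pg_dump approach spends most of its time on small objects.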
Benchmarks
Uploading 570 small-ish objects (<20 rows each) upstream (dumping + minio):
BEFORE: bulk of time spent farming out to pg_dump (startup takes about 0.8s)
AFTER: I think this can be made slightly faster by piping directly to Minio (currently we still dump files to /tmp, which costs 9.5s in this trace) and possibly by running multiple jobs in parallel.
Clone + download all objects: 40-45s
AFTER WITH GZIP: basically the same
Large object benchmarks
1M SNAP, 10x 1000-row DIFFs:
Sizes: SNAP 82MiB, DIFF 116KiB (pg_dump/restore was 39MiB/45KiB), 41MiB/45KiB with gzipping
Push: 15s (pg_dump/restore was 30s), 45s with gzipping (?????)
Clone + download all: 20s (pg_dump/restore was 15s), 20s with gzipping
100K SNAP, 10x 1000-row DIFFs, with gzipping
Sizes: 4.1MiB/45KiB
Push: 6s (was 8s, probably comparable)
Clone + download all: 3s (was 4s, probably comparable)
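To find where gzipping starts to pay off, the size reduction and the CPU time can be measured per object. A small sketch using only the standard library; `gzip_cost` is a hypothetical helper and the compression level is an assumption, since the PR does not say which one is used:

```python
import gzip
import time


def gzip_cost(payload, level=6):
    """Return (raw_size, compressed_size, seconds) for gzipping payload."""
    start = time.perf_counter()
    compressed = gzip.compress(payload, compresslevel=level)
    elapsed = time.perf_counter() - start
    return len(payload), len(compressed), elapsed
```

Running this over a representative SNAP and DIFF dump and comparing the elapsed time with the transfer time saved would show whether compression time alone explains the slower push.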