Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Changed pg_dump/restore shelling out to using postgres' COPY TO/FROM #18

Merged
merged 4 commits into from Nov 26, 2018

Conversation

mildbyte
Copy link
Contributor

Removes the binary dependency on pg_dump/restore. Much faster for lots of small objects, possibly slightly slower for large objects. Not sure at which point the overhead from gzipping beats the overhead from larger objects.

Benchmarks

Uploading 570 small-ish objects (<20 rows each) upstream (dumping + minio):

BEFORE: bulk of time spent farming out to pg_dump (startup takes about 0.8s)

Thu Nov 22 11:36:58 2018    push_s3_current.cprofile
         1627726 function calls (1627725 primitive calls) in 524.916 seconds
   Ordered by: cumulative time
   List reduced from 567 to 20 due to restriction <20>
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000  524.926  524.926 {built-in method builtins.exec}
        1    0.000    0.000  524.926  524.926 <string>:1(<module>)
        1    0.001    0.001  524.925  524.925 /home/mildbyte/splitgraph-opensource/splitgraph/commands/push_pull.py:176(push)
        1    0.002    0.002  522.233  522.233 /home/mildbyte/splitgraph-opensource/splitgraph/objects/loading.py:83(upload_objects)
        1    0.032    0.032  522.221  522.221 /home/mildbyte/splitgraph-opensource/splitgraph/objects/s3.py:20(s3_upload_objects)
      570    0.144    0.000  508.173    0.892 /home/mildbyte/splitgraph-opensource/splitgraph/objects/dumping.py:14(dump_object_to_file)
      570    0.028    0.000  507.636    0.891 /home/mildbyte/miniconda3/envs/splitgraph-prototype/lib/python3.6/subprocess.py:295(check_output)
      570    0.052    0.000  507.604    0.891 /home/mildbyte/miniconda3/envs/splitgraph-prototype/lib/python3.6/subprocess.py:372(run)
      570    0.026    0.000  500.243    0.878 /home/mildbyte/miniconda3/envs/splitgraph-prototype/lib/python3.6/subprocess.py:803(communicate)
     1144  500.060    0.437  500.060    0.437 {method 'read' of '_io.BufferedReader' objects}
      570    0.043    0.000   13.967    0.025 /home/mildbyte/miniconda3/envs/splitgraph-prototype/lib/python3.6/site-packages/minio-4.0.6-py3.6.egg/minio/api.py:522(fput_object)
      570    0.031    0.000   13.871    0.024 /home/mildbyte/miniconda3/envs/splitgraph-prototype/lib/python3.6/site-packages/minio-4.0.6-py3.6.egg/minio/api.py:723(put_object)
      570    0.026    0.000   13.777    0.024 /home/mildbyte/miniconda3/envs/splitgraph-prototype/lib/python3.6/site-packages/minio-4.0.6-py3.6.egg/minio/api.py:1420(_do_put_object)
      570    0.064    0.000   13.668    0.024 /home/mildbyte/miniconda3/envs/splitgraph-prototype/lib/python3.6/site-packages/minio-4.0.6-py3.6.egg/minio/api.py:1762(_url_open)
      572    0.037    0.000   12.778    0.022 /home/mildbyte/miniconda3/envs/splitgraph-prototype/lib/python3.6/site-packages/urllib3/poolmanager.py:301(urlopen)
      572    0.057    0.000   12.513    0.022 /home/mildbyte/miniconda3/envs/splitgraph-prototype/lib/python3.6/site-packages/urllib3/connectionpool.py:446(urlopen)
      572    0.103    0.000   11.804    0.021 /home/mildbyte/miniconda3/envs/splitgraph-prototype/lib/python3.6/site-packages/urllib3/connectionpool.py:319(_make_request)
      572    0.037    0.000   10.471    0.018 /home/mildbyte/miniconda3/envs/splitgraph-prototype/lib/python3.6/http/client.py:1287(getresponse)
      572    0.060    0.000   10.242    0.018 /home/mildbyte/miniconda3/envs/splitgraph-prototype/lib/python3.6/http/client.py:290(begin)
      572    0.038    0.000    9.507    0.017 /home/mildbyte/miniconda3/envs/splitgraph-prototype/lib/python3.6/http/client.py:257(_read_status)

AFTER: I think it's possible to make this ever so slightly faster by piping directly to Minio (currently we still dump files to /tmp for a 9.5s cost in this trace) + possibly running multiple jobs in parallel.

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   28.853   28.853 {built-in method builtins.exec}
        1    0.001    0.001   28.853   28.853 <string>:1(<module>)
        1    0.001    0.001   28.849   28.849 /home/mildbyte/splitgraph-opensource/splitgraph/commands/push_pull.py:176(push)
        1    0.001    0.001   25.690   25.690 /home/mildbyte/splitgraph-opensource/splitgraph/objects/loading.py:83(upload_objects)
        1    0.017    0.017   25.687   25.687 /home/mildbyte/splitgraph-opensource/splitgraph/objects/s3.py:20(s3_upload_objects)
      570    0.029    0.000   16.058    0.028 /home/mildbyte/miniconda3/envs/splitgraph-prototype/lib/python3.6/site-packages/minio-4.0.6-py3.6.egg/minio/api.py:522(fput_object)
      570    0.023    0.000   15.989    0.028 /home/mildbyte/miniconda3/envs/splitgraph-prototype/lib/python3.6/site-packages/minio-4.0.6-py3.6.egg/minio/api.py:723(put_object)
      570    0.018    0.000   15.915    0.028 /home/mildbyte/miniconda3/envs/splitgraph-prototype/lib/python3.6/site-packages/minio-4.0.6-py3.6.egg/minio/api.py:1420(_do_put_object)
      570    0.029    0.000   15.854    0.028 /home/mildbyte/miniconda3/envs/splitgraph-prototype/lib/python3.6/site-packages/minio-4.0.6-py3.6.egg/minio/api.py:1762(_url_open)
      572    0.020    0.000   15.272    0.027 /home/mildbyte/miniconda3/envs/splitgraph-prototype/lib/python3.6/site-packages/urllib3/poolmanager.py:301(urlopen)
      572    0.042    0.000   15.129    0.026 /home/mildbyte/miniconda3/envs/splitgraph-prototype/lib/python3.6/site-packages/urllib3/connectionpool.py:446(urlopen)
      572    0.049    0.000   14.666    0.026 /home/mildbyte/miniconda3/envs/splitgraph-prototype/lib/python3.6/site-packages/urllib3/connectionpool.py:319(_make_request)
      572    0.013    0.000   14.029    0.025 /home/mildbyte/miniconda3/envs/splitgraph-prototype/lib/python3.6/http/client.py:1287(getresponse)
      572    0.040    0.000   13.952    0.024 /home/mildbyte/miniconda3/envs/splitgraph-prototype/lib/python3.6/http/client.py:290(begin)
      572    0.030    0.000   13.381    0.023 /home/mildbyte/miniconda3/envs/splitgraph-prototype/lib/python3.6/http/client.py:257(_read_status)
     6298    0.060    0.000   13.371    0.002 {method 'readline' of '_io.BufferedReader' objects}
      572    0.009    0.000   13.311    0.023 /home/mildbyte/miniconda3/envs/splitgraph-prototype/lib/python3.6/socket.py:572(readinto)
      572   13.297    0.023   13.297    0.023 {method 'recv_into' of '_socket.socket' objects}
     2319   10.242    0.004   10.522    0.005 {method 'execute' of 'psycopg2.extensions.cursor' objects}
      570    0.097    0.000    9.574    0.017 /home/mildbyte/splitgraph-opensource/splitgraph/objects/dumping.py:14(dump_object_to_file)

Clone + download all objects: 40-45s

AFTER WITH GZIP: basically the same

Thu Nov 22 14:07:04 2018    push_s3_copy_to_gzip.cprofile
         1205720 function calls (1205719 primitive calls) in 28.582 seconds
   Ordered by: cumulative time
   List reduced from 579 to 20 due to restriction <20>
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   28.600   28.600 {built-in method builtins.exec}
        1    0.001    0.001   28.600   28.600 <string>:1(<module>)
        1    0.001    0.001   28.599   28.599 /home/mildbyte/splitgraph-opensource/splitgraph/commands/push_pull.py:176(push)
        1    0.001    0.001   25.588   25.588 /home/mildbyte/splitgraph-opensource/splitgraph/objects/loading.py:83(upload_objects)
        1    0.018    0.018   25.583   25.583 /home/mildbyte/splitgraph-opensource/splitgraph/objects/s3.py:20(s3_upload_objects)
      570    0.027    0.000   16.407    0.029 /home/mildbyte/miniconda3/envs/splitgraph-prototype/lib/python3.6/site-packages/minio-4.0.6-py3.6.egg/minio/api.py:522(fput_object)
      570    0.019    0.000   16.343    0.029 /home/mildbyte/miniconda3/envs/splitgraph-prototype/lib/python3.6/site-packages/minio-4.0.6-py3.6.egg/minio/api.py:723(put_object)
      570    0.018    0.000   16.288    0.029 /home/mildbyte/miniconda3/envs/splitgraph-prototype/lib/python3.6/site-packages/minio-4.0.6-py3.6.egg/minio/api.py:1420(_do_put_object)
      570    0.026    0.000   16.223    0.028 /home/mildbyte/miniconda3/envs/splitgraph-prototype/lib/python3.6/site-packages/minio-4.0.6-py3.6.egg/minio/api.py:1762(_url_open)
      572    0.018    0.000   15.687    0.027 /home/mildbyte/miniconda3/envs/splitgraph-prototype/lib/python3.6/site-packages/urllib3/poolmanager.py:301(urlopen)
      572    0.037    0.000   15.552    0.027 /home/mildbyte/miniconda3/envs/splitgraph-prototype/lib/python3.6/site-packages/urllib3/connectionpool.py:446(urlopen)
      572    0.048    0.000   15.055    0.026 /home/mildbyte/miniconda3/envs/splitgraph-prototype/lib/python3.6/site-packages/urllib3/connectionpool.py:319(_make_request)
      572    0.018    0.000   14.370    0.025 /home/mildbyte/miniconda3/envs/splitgraph-prototype/lib/python3.6/http/client.py:1287(getresponse)
      572    0.034    0.000   14.264    0.025 /home/mildbyte/miniconda3/envs/splitgraph-prototype/lib/python3.6/http/client.py:290(begin)
      572    0.033    0.000   13.693    0.024 /home/mildbyte/miniconda3/envs/splitgraph-prototype/lib/python3.6/http/client.py:257(_read_status)
     6298    0.063    0.000   13.686    0.002 {method 'readline' of '_io.BufferedReader' objects}
      572    0.011    0.000   13.623    0.024 /home/mildbyte/miniconda3/envs/splitgraph-prototype/lib/python3.6/socket.py:572(readinto)
      572   13.607    0.024   13.607    0.024 {method 'recv_into' of '_socket.socket' objects}
     2319    9.451    0.004    9.703    0.004 {method 'execute' of 'psycopg2.extensions.cursor' objects}
      570    0.053    0.000    9.118    0.016 /home/mildbyte/splitgraph-opensource/splitgraph/objects/dumping.py:16(dump_object_to_file)

Large object benchmarks

1M SNAP, 10x 1000-row DIFFs:

Sizes: SNAP 82MiB, DIFF 116KiB (pg_dump/restore was 39MiB/45KiB), 41MiB/45KiB with gzipping

Push: 15s (pg_dump/restore was 30s), 45s with gzipping (?????)

Clone + download all: 20s (pg_dump/restore was 15s), 20s with gzipping

100K SNAP, 10x 1000-row DIFFs, with gzipping

Sizes: 4.1MiB/45KiB

Push: 6s (was 8s, probably comparable)

Clone + download all: 3s (was 4s, probably comparable)

@coveralls
Copy link

coveralls commented Nov 22, 2018

Coverage Status

Coverage increased (+0.07%) to 94.859% when pulling b10fb4c on bugfix/faster-object-dumping into 6d42002 on master.

@mildbyte
Copy link
Contributor Author

Looks like we can't pipe the object directly to S3 (or an S3-like storage) because the API needs to know the size of the object in advance before it's uploaded: minio/mc#2271 minio/minio-py#650

@mildbyte
Copy link
Contributor Author

rebased on experiment/poetry so that I can play around with alternative S3 clients

@mildbyte
Copy link
Contributor Author

Piping directly to smart_open -> boto instead of minio: slower, since smart_open's connection initializing takes about 0.18s.

Fri Nov 23 11:13:07 2018    push_s3_boto_streaming.cprofile
         36774783 function calls (36168224 primitive calls) in 150.104 seconds
   Ordered by: cumulative time
   List reduced from 1354 to 20 due to restriction <20>
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     18/1    0.001    0.000  150.359  150.359 {built-in method builtins.exec}
        1    0.001    0.001  150.359  150.359 <string>:1(<module>)
        1    0.001    0.001  150.358  150.358 /home/mildbyte/splitgraph-opensource/splitgraph/commands/push_pull.py:176(push)
        1    0.001    0.001  147.593  147.593 /home/mildbyte/splitgraph-opensource/splitgraph/objects/loading.py:83(upload_objects)
        1    0.474    0.474  147.587  147.587 /home/mildbyte/splitgraph-opensource/splitgraph/objects/s3.py:22(s3_upload_objects)
      570    0.019    0.000  106.318    0.187 /home/mildbyte/miniconda3/envs/splitgraph-prototype/lib/python3.6/site-packages/smart_open/smart_open_lib.py:114(smart_open)
      570    0.011    0.000  106.177    0.186 /home/mildbyte/miniconda3/envs/splitgraph-prototype/lib/python3.6/site-packages/smart_open/smart_open_lib.py:307(_open_binary_stream)
      570    0.011    0.000  106.138    0.186 /home/mildbyte/miniconda3/envs/splitgraph-prototype/lib/python3.6/site-packages/smart_open/smart_open_lib.py:383(_s3_open_uri)
      570    0.018    0.000  106.123    0.186 /home/mildbyte/miniconda3/envs/splitgraph-prototype/lib/python3.6/site-packages/smart_open/s3.py:60(open)
      570    0.039    0.000  106.099    0.186 /home/mildbyte/miniconda3/envs/splitgraph-prototype/lib/python3.6/site-packages/smart_open/s3.py:349(__init__)
      570    0.038    0.000   55.077    0.097 /home/mildbyte/miniconda3/envs/splitgraph-prototype/lib/python3.6/site-packages/boto3/session.py:265(resource)
     2281    0.060    0.000   43.716    0.019 /home/mildbyte/miniconda3/envs/splitgraph-prototype/lib/python3.6/site-packages/botocore/client.py:312(_api_call)
     2281    0.142    0.000   43.636    0.019 /home/mildbyte/miniconda3/envs/splitgraph-prototype/lib/python3.6/site-packages/botocore/client.py:578(_make_api_call)
      571    0.006    0.000   41.584    0.073 /home/mildbyte/miniconda3/envs/splitgraph-prototype/lib/python3.6/site-packages/boto3/session.py:185(client)
      571    0.032    0.000   41.578    0.073 /home/mildbyte/miniconda3/envs/splitgraph-prototype/lib/python3.6/site-packages/botocore/session.py:762(create_client)
     1710    0.017    0.000   41.440    0.024 /home/mildbyte/miniconda3/envs/splitgraph-prototype/lib/python3.6/site-packages/boto3/resources/factory.py:519(do_action)
     1710    0.066    0.000   41.417    0.024 /home/mildbyte/miniconda3/envs/splitgraph-prototype/lib/python3.6/site-packages/boto3/resources/action.py:62(__call__)
7418/2853    0.079    0.000   40.187    0.014 /home/mildbyte/miniconda3/envs/splitgraph-prototype/lib/python3.6/site-packages/botocore/loaders.py:126(_wrapper)
     2281    0.030    0.000   36.463    0.016 /home/mildbyte/miniconda3/envs/splitgraph-prototype/lib/python3.6/site-packages/botocore/endpoint.py:99(make_request)
     2281    0.034    0.000   36.422    0.016 /home/mildbyte/miniconda3/envs/splitgraph-prototype/lib/python3.6/site-packages/botocore/endpoint.py:130(_send_request)

@mildbyte
Copy link
Contributor Author

Minio, multithreaded (4 threads):

570 small objects (a lot of waiting on the pg connection since one is shared between all threads): 14.8s upload (~2x faster than wo multithreading)

Fri Nov 23 12:29:43 2018    push_minio_streaming_4threads.cprofile
         394700 function calls (394474 primitive calls) in 14.880 seconds
   Ordered by: cumulative time
   List reduced from 579 to 20 due to restriction <20>
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   14.883   14.883 {built-in method builtins.exec}
        1    0.001    0.001   14.883   14.883 <string>:1(<module>)
        1    0.001    0.001   14.882   14.882 /home/mildbyte/splitgraph-opensource/splitgraph/commands/push_pull.py:176(push)
        1    0.001    0.001   12.450   12.450 /home/mildbyte/splitgraph-opensource/splitgraph/objects/loading.py:83(upload_objects)
        1    0.000    0.000   12.437   12.437 /home/mildbyte/splitgraph-opensource/splitgraph/objects/s3.py:23(s3_upload_objects)
      605   12.382    0.020   12.382    0.020 {method 'acquire' of '_thread.lock' objects}
        1    0.000    0.000   12.381   12.381 /home/mildbyte/miniconda3/envs/splitgraph-prototype/lib/python3.6/concurrent/futures/_base.py:610(__exit__)
        1    0.000    0.000   12.381   12.381 /home/mildbyte/miniconda3/envs/splitgraph-prototype/lib/python3.6/concurrent/futures/thread.py:146(shutdown)
        4    0.000    0.000   12.381    3.095 /home/mildbyte/miniconda3/envs/splitgraph-prototype/lib/python3.6/threading.py:1024(join)
        4    0.000    0.000   12.381    3.095 /home/mildbyte/miniconda3/envs/splitgraph-prototype/lib/python3.6/threading.py:1062(_wait_for_tstate_lock)
        1    0.024    0.024    1.964    1.964 /home/mildbyte/splitgraph-opensource/splitgraph/commands/push_pull.py:18(_get_required_snaps_objects)
     1179    1.747    0.001    1.870    0.002 {method 'execute' of 'psycopg2.extensions.cursor' objects}
      571    0.017    0.000    0.989    0.002 /home/mildbyte/splitgraph-opensource/splitgraph/meta_handler/images.py:76(add_new_image)
        4    0.001    0.000    0.425    0.106 /home/mildbyte/miniconda3/envs/splitgraph-prototype/lib/python3.6/site-packages/psycopg2/extras.py:1181(execute_batch)
      576    0.023    0.000    0.213    0.000 /home/mildbyte/splitgraph-opensource/splitgraph/meta_handler/common.py:112(insert)
        2    0.000    0.000    0.189    0.094 /home/mildbyte/splitgraph-opensource/splitgraph/meta_handler/objects.py:54(register_object_locations)
     3434    0.064    0.000    0.186    0.000 /home/mildbyte/miniconda3/envs/splitgraph-prototype/lib/python3.6/site-packages/psycopg2/sql.py:119(as_string)
       24    0.003    0.000    0.182    0.008 /home/mildbyte/miniconda3/envs/splitgraph-prototype/lib/python3.6/site-packages/psycopg2/extras.py:1202(<listcomp>)
     2280    0.053    0.000    0.179    0.000 {method 'mogrify' of 'psycopg2.extensions.cursor' objects}
        1    0.000    0.000    0.146    0.146 /home/mildbyte/splitgraph-opensource/splitgraph/meta_handler/objects.py:47(register_tables)

1M SNAP, 1K DIFF (no gzipping): 13.2s upload, 20s download

100K SNAP, 1K DIFF (no gzipping): 1.8s upload, 2.3s download

@mildbyte mildbyte merged commit a6e3b83 into master Nov 26, 2018
@mildbyte mildbyte deleted the bugfix/faster-object-dumping branch November 26, 2018 15:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants