
Conversation

@pared (Contributor) commented Aug 7, 2019

  • Have you followed the guidelines in our
    Contributing document?

  • Does your PR affect documented changes or does it add new functionality
    that should be documented? If yes, have you created a PR for
    dvc.org documenting it or at
    least opened an issue for it? If so, please add a link to it.


Fix #2373

@efiop (Contributor) commented Aug 7, 2019

For the record: as we've discussed in PMs, while we are at it, let's try to get rid of batch_exists, since we now have connection pools and no longer need it.

@pared changed the title from "[WIP] remote: azure: implement batch_exists" to "[WIP] remote: gs/s3: remove batch_exists" Aug 7, 2019
@efiop (Contributor) commented Aug 7, 2019

@pared Could you please check how dvc status -c for ssh remote and large directory compares before and after this patch?

@efiop mentioned this pull request Aug 8, 2019
@pared (Contributor, Author) commented Aug 8, 2019

> @pared Could you please check how dvc status -c for ssh remote and large directory compares before and after this patch?

Sure, I'll prepare some benchmarks.

@pared (Contributor, Author) commented Aug 8, 2019

@efiop
Here are my results for a local SSH server:

Prepare repo script:

#!/bin/bash

rm -rf /tmp/ssh_storage
mkdir /tmp/ssh_storage

rm -rf repo
mkdir repo

cd repo

git init >> /dev/null && dvc init -q

dvc remote add -d ssh_str ssh://user@localhost/tmp/ssh_storage

mkdir data
for i in {1..100000}
do
	echo $i > data/$i
done

dvc add data -q
dvc push -q

rm -rf data
rm -rf .dvc/cache

Timing script:

from dvc.repo import Repo
import time
import logging

logger = logging.getLogger("dvc")
logger.setLevel(logging.CRITICAL)

NUM_RUNS=5

repo = Repo("repo")
times = []

for i in range(NUM_RUNS):
    start = time.time()
    repo.status(cloud=True)
    end = time.time()

    times.append(end-start)

print("Average execution time: '{}' for '{}' runs".format(sum(times)/len(times), NUM_RUNS))

Average execution time for 5 runs:
0.54.1 : 27.916 s
master : 28.368 s
pared:2373 (this): 30.198 s

EDIT
Seems like a ~7% degradation.

I'll run a more extensive test.

@pared changed the title from "[WIP] remote: gs/s3: remove batch_exists" to "remote: gs/s3: remove batch_exists" Aug 8, 2019
@Suor (Contributor) commented Aug 8, 2019

This should be much worse for SSH. You dropped using many SFTP channels per connection, which was a significant optimization.

@Suor (Contributor) commented Aug 8, 2019

P.S. What was the point of batch_exists for gs/s3 in the first place?

We may drop only those while still having batch_exists for ssh.

@Suor (Contributor) commented Aug 8, 2019

@pared did you set the no_traverse flag while benching?

@Suor (Contributor) commented Aug 8, 2019

Also, it looks like .cache_exists() is broken for HTTP; it only works if no_traverse is set.

@efiop (Contributor) commented Aug 8, 2019

@Suor

> This should be much worse for SSH. You dropped using many SFTP channels per connection, which was a significant optimization.

Can't we use a pool there too, the same way we do for pull?

> P.S. What was the point of batch_exists for gs/s3 in the first place?
> We may drop only those while still having batch_exists for ssh.

It was mainly because of batch_exists for ssh.

> @pared did you set the no_traverse flag while benching?

It is enabled by default.

@Suor (Contributor) commented Aug 8, 2019

> Can't we use a pool there too, the same way we do for pull?

Not sure what you mean: add a pool of SFTP connections inside each SSH connection? This will require special handling anyway, like .open_max_sftp_channels() does now.
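For illustration only, here is a minimal sketch of such a per-connection channel pool on top of paramiko; the SFTPChannelPool name and its acquire/release API are hypothetical, not what DVC actually ships:

import queue

import paramiko


class SFTPChannelPool:
    """Hypothetical pool of SFTP channels multiplexed over one SSH transport."""

    def __init__(self, ssh_client: paramiko.SSHClient, size: int = 8):
        self._channels = queue.Queue()
        for _ in range(size):
            # each open_sftp() call opens another channel over the same
            # underlying SSH connection
            self._channels.put(ssh_client.open_sftp())

    def acquire(self) -> paramiko.SFTPClient:
        # blocks until a channel is free
        return self._channels.get()

    def release(self, sftp: paramiko.SFTPClient) -> None:
        self._channels.put(sftp)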

@efiop (Contributor) commented Aug 8, 2019

@Suor Before the connection pool, we had a problem that we were limited to ~4 SSH connections, so we started using batch_exists, which multiplied those 4 by 8 SFTP channels each. With the connection pool in place, we are reusing already opened connections, which is probably why the tests show only a small performance degradation.
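To illustrate the idea, a rough sketch of a per-checksum check over the pooled connections; get_connection, exists and checksum_to_path are placeholder names rather than DVC's actual API:

from concurrent.futures import ThreadPoolExecutor


def cache_exists_pooled(remote, checksums, jobs=4):
    # each worker borrows an already opened connection from the pool
    # instead of opening a new SSH connection per batch and multiplexing
    # SFTP channels over it, as batch_exists used to do
    checksums = list(checksums)

    def _exists(checksum):
        with remote.pool.get_connection() as conn:
            return conn.exists(remote.checksum_to_path(checksum))

    with ThreadPoolExecutor(max_workers=jobs) as executor:
        results = executor.map(_exists, checksums)

    return [c for c, found in zip(checksums, results) if found]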

@Suor (Contributor) commented Aug 8, 2019

@efiop we are still limited to 4 SSH connections with or without the pool; if the SSH server has many CPUs, this should not be enough.

@pared (Contributor, Author) commented Aug 8, 2019

Average execution time for 50 repeats:
master : 29.355 s
pared:2373: 31.234 s

@efiop (Contributor) commented Aug 8, 2019

@Suor but because of the pool, workers can reuse already opened SSH connections instead of opening new ones for each batch and then multiplexing SFTP over them.

@Suor (Contributor) commented Aug 8, 2019

@efiop I can't see how this matters.
@pared how many CPUs do you have there? Can you bench with jobs=2?

@pared (Contributor, Author) commented Aug 8, 2019

@Suor I've got 12. I'll limit jobs and retry the tests.

@Suor (Contributor) commented Aug 8, 2019

@pared then something looks wrong; the CPUs are not being used properly by current master. Maybe it's I/O bound for you.

@pared (Contributor, Author) commented Aug 8, 2019

@Suor maybe I should try with a "real" case, like an SSH cache on a different physical machine?

@Suor (Contributor) commented Aug 8, 2019

@pared you can try; it will add network lag at least, which might also make the number of threads more important.

@Suor (Contributor) commented Aug 8, 2019

BTW, using ssh ls is 30x faster for me than this cache_exists call.

@efiop (Contributor) commented Aug 8, 2019

@Suor

> BTW, using ssh ls is 30x faster for me than this cache_exists call.

It is known: not using no_traverse is faster on remotes that are small (compared to the local cache), which is precisely the case here. But no_traverse gives a better UI, and status time stays almost the same no matter how many files you have on the remote.
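Schematically, the trade-off between the two strategies looks like this (placeholder function names, not the actual DVC implementation):

def cache_exists_traverse(remote, checksums):
    # list everything on the remote once: fast while the remote is small,
    # but the cost grows with the total number of files stored remotely
    on_remote = set(remote.list_all_checksums())
    return [c for c in checksums if c in on_remote]


def cache_exists_no_traverse(remote, checksums):
    # query each checksum individually: the cost depends only on how many
    # checksums we are checking, so it stays roughly flat as the remote
    # grows and maps naturally onto a per-file progress bar
    return [c for c in checksums if remote.exists(c)]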

@Suor (Contributor) commented Aug 8, 2019

Tried the same bench scenario with jobs=1 and jobs=2. Looks like there is almost no difference, at least against a local SSH server.

A reviewer (Contributor) commented on the diff:

Remove list() call here.

@pared (Contributor, Author) replied:

done

@Suor (Contributor) commented Aug 8, 2019

So the benches for me, running dvc status -c, total time:

Current: 50 s
Without batching: 55 s
Without batching and without the progress bar: 48 s :)

@pared requested review from Suor and efiop August 9, 2019 09:56
@efiop (Contributor) commented Aug 12, 2019

Ok, so I tested with big latency by checking the status from SF to India, and got 31m vs 1h+ (couldn't wait longer, and the progress bar is broken on master). So it looks like we do need an SFTP pool too 🙁

@efiop (Contributor) commented Aug 12, 2019

I've also noticed that it spends around 10 minutes before even checking the remote, so there might be something else broken. Need to investigate.

@efiop (Contributor) commented Aug 12, 2019

Ok, guys, how about we re-define RemoteSSH.cache_exists with the old version of RemoteBASE.cache_exists, so SSH works as it did before while not affecting the other remotes? And we should create a ticket for an SFTP pool to get rid of batch_exists in RemoteSSH later, which would probably speed up push/pull as well, btw.
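Roughly, a sketch of that override; it assumes RemoteBASE, ProgressCallback and batch_exists from the existing remote code, and the chunking here is simplified:

import itertools
from concurrent.futures import ThreadPoolExecutor


class RemoteSSH(RemoteBASE):  # RemoteBASE comes from the existing dvc code
    JOBS = 4

    def cache_exists(self, checksums):
        # keep the old batched behaviour for SSH only: split checksums into
        # chunks and let batch_exists multiplex SFTP channels over a few SSH
        # connections, while other remotes use the new per-checksum
        # RemoteBASE.cache_exists
        checksums = list(checksums)
        chunk_size = max(1, len(checksums) // self.JOBS)
        chunked = [
            checksums[i : i + chunk_size]
            for i in range(0, len(checksums), chunk_size)
        ]
        progress_callback = ProgressCallback(len(checksums))

        def exists_with_progress(chunk):
            return self.batch_exists(chunk, callback=progress_callback)

        with ThreadPoolExecutor(max_workers=self.JOBS) as executor:
            results = executor.map(exists_with_progress, chunked)

        return list(itertools.chain.from_iterable(results))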

@pared (Contributor, Author) commented Aug 12, 2019

@efiop I'll restore the previous version of cache_exists for SSH then.

@pared force-pushed the 2373 branch 2 times, most recently from dad1a4e to fadf47c, August 12, 2019 11:58
progress_callback = ProgressCallback(len(checksums))

def exists_with_progress(chunks):
    return self.batch_exists(chunks, callback=progress_callback)
A reviewer (Contributor) commented on the diff:

We've lost the progress bar :)

Ruslan Kuprieiev added 2 commits August 13, 2019 02:08 (both Signed-off-by: Ruslan Kuprieiev <ruslan@iterative.ai>)
@efiop (Contributor) left a comment:

Thanks!

@efiop merged commit 3095b24 into treeverse:master Aug 12, 2019
@efiop mentioned this pull request Aug 13, 2019
@Suor mentioned this pull request Aug 13, 2019
@pared deleted the 2373 branch December 17, 2019 13:14

Linked issue: Collecting information from remote cache very slow
