
Conversation

@pared (Contributor) commented Aug 7, 2019

  • Have you followed the guidelines in our
    Contributing document?

  • Does your PR affect documented changes or does it add new functionality
    that should be documented? If yes, have you created a PR for
    dvc.org documenting it or at
    least opened an issue for it? If so, please add a link to it.


Fix #2373

@efiop (Contributor) commented Aug 7, 2019

For the record: as we've discussed in PMs, while we are at it, let's try to get rid of batch_exists, since we now have connection pools and no longer need it.

@pared changed the title from "[WIP] remote: azure: implement batch_exists" to "[WIP] remote: gs/s3: remove batch_exists" Aug 7, 2019
@efiop (Contributor) commented Aug 7, 2019

@pared Could you please check how dvc status -c for ssh remote and large directory compares before and after this patch?

@efiop mentioned this pull request Aug 8, 2019
@pared (Contributor, Author) commented Aug 8, 2019

> @pared Could you please check how dvc status -c for ssh remote and large directory compares before and after this patch?

Sure, I'll prepare some benchmarks.

@pared (Contributor, Author) commented Aug 8, 2019

@efiop
Here are my results for a local SSH server:

Prepare repo script:

#!/bin/bash

rm -rf /tmp/ssh_storage
mkdir /tmp/ssh_storage

rm -rf repo
mkdir repo

cd repo

git init >> /dev/null && dvc init -q

dvc remote add -d ssh_str ssh://user@localhost/tmp/ssh_storage

mkdir data
for i in {1..100000}
do
	echo $i > data/$i
done

dvc add data -q
dvc push -q

rm -rf data
rm -rf .dvc/cache

Timing script:

from dvc.repo import Repo
import time
import logging

logger = logging.getLogger("dvc")
logger.setLevel(logging.CRITICAL)

NUM_RUNS=5

repo = Repo("repo")
times = []

for i in range(NUM_RUNS):
    start = time.time()
    repo.status(cloud=True)
    end = time.time()

    times.append(end-start)

print("Average execution time: '{}' for '{}' runs".format(sum(times)/len(times), NUM_RUNS))

Average execution time for 5 runs:
0.54.1 : 27.916 s
master : 28.368 s
pared:2373 (this): 30.198 s

EDIT
Seems like a ~7% degradation.

I'll run a more extensive test.

@pared changed the title from "[WIP] remote: gs/s3: remove batch_exists" to "remote: gs/s3: remove batch_exists" Aug 8, 2019
@Suor (Contributor) commented Aug 8, 2019

This should be much worse for SSH. You dropped using many SFTP channels per connection, which was a significant optimization.

@Suor (Contributor) commented Aug 8, 2019

P.S. What was the point of batch_exists for gs/s3 in the first place?

We may drop only those while still having batch_exists for ssh.

@Suor (Contributor) commented Aug 8, 2019

@pared did you set the no_traverse flag while benching?

@Suor (Contributor) commented Aug 8, 2019

Also, it looks like .cache_exists() is broken for HTTP; it only works if no_traverse is set.

@efiop (Contributor) commented Aug 8, 2019

@Suor

> This should be much worse for SSH. You dropped using many SFTP channels per connection, which was a significant optimization.

Can't we use a pool there too, the same way we do for pull?

> P.S. What was the point of batch_exists for gs/s3 in the first place?
> We may drop only those while still having batch_exists for ssh.

It was mainly because of batch_exists for ssh.

> @pared did you set the no_traverse flag while benching?

It is enabled by default.

@Suor (Contributor) commented Aug 8, 2019

> Can't we use a pool there too, the same way we do for pull?

Not sure what you mean: add a pool of SFTP connections inside each SSH connection? This will require special handling anyway, like .open_max_sftp_channels() does now.
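For illustration only, here is a minimal sketch of such a per-connection channel pool on top of paramiko; the SFTPChannelPool name and its acquire/release API are hypothetical, not what DVC actually ships:

import queue

import paramiko


class SFTPChannelPool:
    """Hypothetical pool of SFTP channels multiplexed over one SSH transport."""

    def __init__(self, ssh_client: paramiko.SSHClient, size: int = 8):
        self._channels = queue.Queue()
        for _ in range(size):
            # each open_sftp() call opens another channel over the same
            # underlying SSH connection
            self._channels.put(ssh_client.open_sftp())

    def acquire(self) -> paramiko.SFTPClient:
        # blocks until a channel is free
        return self._channels.get()

    def release(self, sftp: paramiko.SFTPClient) -> None:
        self._channels.put(sftp)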

@efiop (Contributor) commented Aug 8, 2019

@Suor Before the connection pool, we had a problem that we were limited to ~4 SSH connections, so we started using batch_exists, which multiplied those 4 by 8 SFTP channels each. With the connection pool in place, we are reusing already opened connections, which is probably why the tests show only a small performance degradation.
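To illustrate the idea, a rough sketch of a per-checksum check over the pooled connections; get_connection, exists and checksum_to_path are placeholder names rather than DVC's actual API:

from concurrent.futures import ThreadPoolExecutor


def cache_exists_pooled(remote, checksums, jobs=4):
    # each worker borrows an already opened connection from the pool
    # instead of opening a new SSH connection per batch and multiplexing
    # SFTP channels over it, as batch_exists used to do
    checksums = list(checksums)

    def _exists(checksum):
        with remote.pool.get_connection() as conn:
            return conn.exists(remote.checksum_to_path(checksum))

    with ThreadPoolExecutor(max_workers=jobs) as executor:
        results = executor.map(_exists, checksums)

    return [c for c, found in zip(checksums, results) if found]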

@Suor (Contributor) commented Aug 8, 2019

@efiop we are still limited to 4 SSH connections with or without the pool; if the SSH server has many CPUs, this should not be enough.

@pared (Contributor, Author) commented Aug 8, 2019

Average execution time for 50 repeats:
master : 29.355 s
pared:2373: 31.234 s

@efiop (Contributor) commented Aug 8, 2019

@Suor but because of the pool, workers can reuse already opened SSH connections instead of opening new ones for each batch and then multiplexing SFTP over them.

@Suor (Contributor) commented Aug 8, 2019

@efiop I can't see how this matters.
@pared how many CPUs do you have there? Can you bench with jobs=2?

@pared (Contributor, Author) commented Aug 8, 2019

@Suor I've got 12. I'll limit jobs and retry the tests.

@Suor (Contributor) commented Aug 8, 2019

@pared then something looks wrong; the CPUs are not being used properly by current master. Maybe it's I/O bound for you.

@pared (Contributor, Author) commented Aug 8, 2019

@Suor maybe I should try with a "real" case, like an SSH cache on a different physical machine?

@Suor (Contributor) commented Aug 8, 2019

@pared you can try; it will add network lag at least, which might also make the number of threads more important.

@Suor (Contributor) commented Aug 8, 2019

BTW, using ssh ls is 30x faster for me than this cache_exists call.

@efiop (Contributor) commented Aug 8, 2019

@Suor

> BTW, using ssh ls is 30x faster for me than this cache_exists call.

It is known: not using no_traverse is faster on remotes that are small (compared to the local cache), which is precisely the case here. But no_traverse gives a better UI, and status time stays almost the same no matter how many files you have on the remote.
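Schematically, the trade-off between the two strategies looks like this (placeholder function names, not the actual DVC implementation):

def cache_exists_traverse(remote, checksums):
    # list everything on the remote once: fast while the remote is small,
    # but the cost grows with the total number of files stored remotely
    on_remote = set(remote.list_all_checksums())
    return [c for c in checksums if c in on_remote]


def cache_exists_no_traverse(remote, checksums):
    # query each checksum individually: the cost depends only on how many
    # checksums we are checking, so it stays roughly flat as the remote
    # grows and maps naturally onto a per-file progress bar
    return [c for c in checksums if remote.exists(c)]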

@Suor (Contributor) commented Aug 8, 2019

Tried the same bench scenario with jobs=1 and jobs=2. Looks like there is almost no difference, at least against a local SSH server.

A reviewer (Contributor) commented on the diff:

Remove list() call here.

@pared (Contributor, Author) replied:

done

@Suor (Contributor) commented Aug 8, 2019

So the benches for me, running dvc status -c, total time:

Current: 50 s
Without batching: 55 s
Without batching and without the progress bar: 48 s :)

@pared requested review from Suor and efiop August 9, 2019 09:56
@efiop (Contributor) commented Aug 12, 2019

Ok, so I tested with big latency by checking the status from SF to India, and got 31m vs 1h+ (couldn't wait longer, and the progress bar is broken on master). So it looks like we do need an SFTP pool too 🙁

@efiop (Contributor) commented Aug 12, 2019

I've also noticed that it spends around 10 minutes before even checking the remote, so there might be something else broken. Need to investigate.

@efiop (Contributor) commented Aug 12, 2019

Ok, guys, how about we re-define RemoteSSH.cache_exists with the old version of RemoteBASE.cache_exists, so SSH works as it did before while not affecting the other remotes? And we should create a ticket for an SFTP pool to get rid of batch_exists in RemoteSSH later, which would probably speed up push/pull as well, btw.
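Roughly, a sketch of that override; it assumes RemoteBASE, ProgressCallback and batch_exists from the existing remote code, and the chunking here is simplified:

import itertools
from concurrent.futures import ThreadPoolExecutor


class RemoteSSH(RemoteBASE):  # RemoteBASE comes from the existing dvc code
    JOBS = 4

    def cache_exists(self, checksums):
        # keep the old batched behaviour for SSH only: split checksums into
        # chunks and let batch_exists multiplex SFTP channels over a few SSH
        # connections, while other remotes use the new per-checksum
        # RemoteBASE.cache_exists
        checksums = list(checksums)
        chunk_size = max(1, len(checksums) // self.JOBS)
        chunked = [
            checksums[i : i + chunk_size]
            for i in range(0, len(checksums), chunk_size)
        ]
        progress_callback = ProgressCallback(len(checksums))

        def exists_with_progress(chunk):
            return self.batch_exists(chunk, callback=progress_callback)

        with ThreadPoolExecutor(max_workers=self.JOBS) as executor:
            results = executor.map(exists_with_progress, chunked)

        return list(itertools.chain.from_iterable(results))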

@pared (Contributor, Author) commented Aug 12, 2019

@efiop I'll restore the previous version of cache_exists for SSH then.

@pared force-pushed the 2373 branch 2 times, most recently from dad1a4e to fadf47c, August 12, 2019 11:58
progress_callback = ProgressCallback(len(checksums))

def exists_with_progress(chunks):
    return self.batch_exists(chunks, callback=progress_callback)
A reviewer (Contributor) commented on the diff:

We've lost the progress bar :)

Ruslan Kuprieiev added 2 commits August 13, 2019 02:08 (both Signed-off-by: Ruslan Kuprieiev <ruslan@iterative.ai>)
@efiop (Contributor) left a comment:

Thanks!

@efiop merged commit 3095b24 into treeverse:master Aug 12, 2019
@efiop mentioned this pull request Aug 13, 2019
@Suor mentioned this pull request Aug 13, 2019
@pared deleted the 2373 branch December 17, 2019 13:14

Linked issue: Collecting information from remote cache very slow
