Bug Report
push: jobs count -j ignored; jobs count is not 4 for SSH (reason for treeverse/dvc-ssh#16)
I hit the same problem as in treeverse/dvc-ssh#16 ("SSH: ValueError: Can't create any SFTP connections!") and investigated it. The cause is that 4 * cpu_count() parallel threads/connections are used, which is in many cases too much for my SSH/SFTP endpoint. Trying dvc push -j 1 did not help: the -j 1 option is ignored and 4 * cpu_count() parallel threads are still used.
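For scale: on a 4-core machine this default works out to 16 workers, which matches the logs below. A minimal sketch (illustrative only, not DVC's actual code) of how such a default is computed:

import os

# Default worker count as described in dvc push --help:
# 4 workers per CPU core, e.g. 16 on a 4-core machine.
default_jobs = 4 * os.cpu_count()
print(default_jobs)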
Description
Reproduce
Set a breakpoint in list_hashes_exists in dvc/objects/db/base.py, or add the following logging to the function:

max_workers = jobs or self.fs.jobs  # this expression is used later
logger.debug(f"jobs = {jobs}; self.fs.jobs = {self.fs.jobs}; max_workers = {max_workers}")

Then run:

# SSH_REMOTE is the name of an SSH/SFTP remote
dvc push -v -r $SSH_REMOTE
dvc push -v -r $SSH_REMOTE -j 1
# S3_REMOTE is the name of an S3 remote
dvc push -v -r $S3_REMOTE -j 1

# Output for dvc push -v -r $SSH_REMOTE
jobs = None; self.fs.jobs = 16; max_workers = 16
Querying XXX hashes via object_exists.
# Output for dvc push -v -r $SSH_REMOTE -j 1
jobs = None; self.fs.jobs = 16; max_workers = 16
Querying XXX hashes via object_exists.
# Output for dvc push -v -r $S3_REMOTE -j 1
jobs = None; self.fs.jobs = 16; max_workers = 16
Querying XXX hashes via object_exists.

The specified number of jobs is ignored in every case; 16 = 4 * cpu_count() workers are used.
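This matches the fallback expression quoted above: when -j is dropped somewhere upstream, jobs arrives as None and the filesystem default wins. A minimal, self-contained illustration (the function name is made up for the demo):

def pick_max_workers(jobs, fs_jobs):
    # Same fallback as `jobs or self.fs.jobs` in list_hashes_exists.
    return jobs or fs_jobs

print(pick_max_workers(None, 16))  # 16 -> observed: -j never reached this function
print(pick_max_workers(1, 16))     # 1  -> expected, if jobs were passed down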
Expected
One expects the number of workers used (max_workers) to be the number specified with -j, or, when -j is not given, 4 for SSH remotes.
# Output for dvc push -v -r $SSH_REMOTE
# In the --help it says: Number of jobs to run simultaneously. The default
# value is 4 * cpu_count(). For SSH remotes, the default
# is 4.
jobs = XXX; self.fs.jobs = XXX; max_workers = 4
Querying XXX hashes via object_exists.
# Output for dvc push -v -r $SSH_REMOTE -j 1
jobs = XXX; self.fs.jobs = XXX; max_workers = 1
Querying XXX hashes via object_exists.
# Output for dvc push -v -r $S3_REMOTE -j 1
jobs = XXX; self.fs.jobs = XXX; max_workers = 1
Querying XXX hashes via object_exists.

Environment information
Output of dvc doctor:
DVC version: 2.9.3 (pip)
---------------------------------
Platform: Python 3.8.10 on Linux-5.4.0-94-generic-x86_64-with-glibc2.29
Supports:
webhdfs (fsspec = 2022.1.0),
http (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
https (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
s3 (s3fs = 2022.1.0, boto3 = 1.20.46),
ssh (sshfs = 2021.11.2)
Cache types: hardlink, symlink
Cache directory: XXXXXXXXXXXXXXXXXXXX
Caches: local
Remotes: ssh, s3, s3
Workspace directory: XXXXXXXXXXXXXXXXXXXX
Repo: dvc, git

Reason in code and fix
I debugged the functions in the call chain (see the stack trace below). Several of them fail to pass jobs along, and some do not even have a jobs or kwargs parameter. I was therefore unsure which approach you prefer for propagating jobs: passing it explicitly from top to bottom, or setting the jobs attribute on the FileSystem (SSHFileSystem) object. Either way it should be an easy fix.
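A self-contained sketch of the two options (all class and attribute names here are hypothetical stand-ins, not DVC's actual code):

class FakeFS:
    jobs = 16  # stands in for SSHFileSystem's 4 * cpu_count() default

class FakeODB:
    def __init__(self, fs):
        self.fs = fs

    def list_hashes_exists(self, hashes, jobs=None):
        # Same fallback expression as in dvc/objects/db/base.py.
        return jobs or self.fs.jobs

odb = FakeODB(FakeFS())

# Option A: thread `jobs` explicitly through every intermediate call.
assert odb.list_hashes_exists(set(), jobs=1) == 1

# Option B: set the attribute once near the top of the call chain, so the
# existing fallback picks up the user's -j value even where jobs is not
# forwarded.
odb.fs.jobs = 1
assert odb.list_hashes_exists(set()) == 1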
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/dvc/main.py", line 55, in main
    ret = cmd.do_run()
  File "/usr/local/lib/python3.8/dist-packages/dvc/command/base.py", line 45, in do_run
    return self.run()
  File "/usr/local/lib/python3.8/dist-packages/dvc/command/data_sync.py", line 57, in run
    processed_files_count = self.repo.push(
  File "/usr/local/lib/python3.8/dist-packages/dvc/repo/__init__.py", line 49, in wrapper
    return f(repo, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/dvc/repo/push.py", line 57, in push
    pushed += self.cloud.push(
  File "/usr/local/lib/python3.8/dist-packages/dvc/data_cloud.py", line 86, in push
    return transfer(
  File "/usr/local/lib/python3.8/dist-packages/dvc/objects/transfer.py", line 154, in transfer
    status = compare_status(src, dest, obj_ids, check_deleted=False, jobs=jobs, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/dvc/objects/status.py", line 159, in compare_status
    dest_exists, dest_missing = status(
  File "/usr/local/lib/python3.8/dist-packages/dvc/objects/status.py", line 122, in status
    exists = hashes.intersection(
  File "/usr/local/lib/python3.8/dist-packages/dvc/objects/status.py", line 36, in _indexed_dir_hashes
    indexed_dir_exists.update(odb.list_hashes_exists(indexed_dirs))
  File "/usr/local/lib/python3.8/dist-packages/dvc/objects/db/base.py", line 407, in list_hashes_exists