
push: jobs count -j ignored; jobs count is not 4 for SSH (reason for #6856) #7319

@mistermult

Bug Report

push: jobs count -j ignored; jobs count is not 4 for SSH (reason for treeverse/dvc-ssh#16)

I had the same problem as in treeverse/dvc-ssh#16 ("SSH: ValueError: Can't create any SFTP connections!") and investigated it. The cause is that 4 * core_count() parallel threads or connections are used, which is in many cases too many for my SSH/SFTP endpoint. Trying dvc push -j 1 did not help: the -j 1 option is ignored and 4 * core_count() parallel threads are still used.
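
For reference, a minimal sketch of where a value like 16 comes from (assuming a 4-core machine; the exact expression DVC uses may differ):

import multiprocessing

# 4 workers per CPU core; on a 4-core machine this yields the 16 observed below
default_jobs = 4 * multiprocessing.cpu_count()
print(default_jobs)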

Reproduce

Set a breakpoint in list_hashes_exists in dvc/objects/db/base.py, or add the following logging to the function:

max_workers = jobs or self.fs.jobs  # this expression is used later
logger.debug(f"jobs = {jobs}; self.fs.jobs = {self.fs.jobs}; max_workers = {max_workers}")
# SSH_REMOTE is the name of an SSH/SFTP remote
dvc push -v -r $SSH_REMOTE
dvc push -v -r $SSH_REMOTE -j 1

# S3_REMOTE is the name of an S3 remote
dvc push -v -r $S3_REMOTE -j 1

# Output for dvc push -v -r $SSH_REMOTE
jobs = None; self.fs.jobs = 16; max_workers = 16
Querying XXX hashes via object_exists.

# Output for dvc push -v -r $SSH_REMOTE -j 1
jobs = None; self.fs.jobs = 16; max_workers = 16
Querying XXX hashes via object_exists.

# Output for dvc push -v -r $S3_REMOTE -j 1
jobs = None; self.fs.jobs = 16; max_workers = 16
Querying XXX hashes via object_exists.
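
To make the failure mode concrete, here is a minimal, self-contained illustration of the fallback (only the expression jobs or self.fs.jobs is taken from DVC; the function name is hypothetical):

# fs_jobs stands in for self.fs.jobs, the filesystem default (16 = 4 * core_count())
def pick_max_workers(jobs, fs_jobs=16):
    return jobs or fs_jobs  # None falls through to the default

# The -j value is dropped somewhere upstream, so jobs is always None here:
assert pick_max_workers(None) == 16   # observed for every push, even with -j 1
assert pick_max_workers(1) == 1       # what -j 1 should produce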

The specified number of jobs is ignored in every case; 16 = 4 * core_count() workers are used instead.

Expected

One would expect the number of workers used (max_workers) to be the number specified with -j, or 4 by default for SSH remotes.

# Output for dvc push -v -r $SSH_REMOTE
# The --help says: "Number of jobs to run simultaneously. The default value
# is 4 * cpu_count(). For SSH remotes, the default is 4."
jobs = XXX; self.fs.jobs = XXX; max_workers = 4
Querying XXX hashes via object_exists.

# Output for dvc push -v -r $SSH_REMOTE -j 1
jobs = XXX; self.fs.jobs = XXX; max_workers = 1
Querying XXX hashes via object_exists.

# Output for dvc push -v -r $S3_REMOTE -j 1
jobs = XXX; self.fs.jobs = XXX; max_workers = 1
Querying XXX hashes via object_exists.
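
A sketch of the resolution order this implies (names are illustrative, not DVC's actual API):

import multiprocessing

def resolve_max_workers(cli_jobs, is_ssh):
    if cli_jobs is not None:
        return cli_jobs  # -j always wins
    # defaults per the --help text quoted above
    return 4 if is_ssh else 4 * multiprocessing.cpu_count()

assert resolve_max_workers(None, is_ssh=True) == 4
assert resolve_max_workers(1, is_ssh=False) == 1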

Environment information

Output of dvc doctor:

DVC version: 2.9.3 (pip)
---------------------------------
Platform: Python 3.8.10 on Linux-5.4.0-94-generic-x86_64-with-glibc2.29
Supports:
        webhdfs (fsspec = 2022.1.0),
        http (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
        https (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
        s3 (s3fs = 2022.1.0, boto3 = 1.20.46),
        ssh (sshfs = 2021.11.2)
Cache types: hardlink, symlink
Cache directory: XXXXXXXXXXXXXXXXXXXX
Caches: local
Remotes: ssh, s3, s3
Workspace directory: XXXXXXXXXXXXXXXXXXXX
Repo: dvc, git

Cause in the code and fix

I looked at and debugged the called Python functions (see the stack trace below). In several of them, jobs is simply not passed on, and some of these functions have no jobs or kwargs parameter at all. I am therefore not sure which approach you would prefer for passing jobs down: threading it through the calls from top to bottom, or setting the jobs attribute on the FileSystem (SSHFileSystem) object. Either way, it should be an easy fix; a sketch of both options follows the stack trace.

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/dvc/main.py", line 55, in main
    ret = cmd.do_run()
  File "/usr/local/lib/python3.8/dist-packages/dvc/command/base.py", line 45, in do_run
    return self.run()
  File "/usr/local/lib/python3.8/dist-packages/dvc/command/data_sync.py", line 57, in run
    processed_files_count = self.repo.push(
  File "/usr/local/lib/python3.8/dist-packages/dvc/repo/__init__.py", line 49, in wrapper
    return f(repo, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/dvc/repo/push.py", line 57, in push
    pushed += self.cloud.push(
  File "/usr/local/lib/python3.8/dist-packages/dvc/data_cloud.py", line 86, in push
    return transfer(
  File "/usr/local/lib/python3.8/dist-packages/dvc/objects/transfer.py", line 154, in transfer
    status = compare_status(src, dest, obj_ids, check_deleted=False, jobs=jobs, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/dvc/objects/status.py", line 159, in compare_status
    dest_exists, dest_missing = status(
  File "/usr/local/lib/python3.8/dist-packages/dvc/objects/status.py", line 122, in status
    exists = hashes.intersection(
  File "/usr/local/lib/python3.8/dist-packages/dvc/objects/status.py", line 36, in _indexed_dir_hashes
    indexed_dir_exists.update(odb.list_hashes_exists(indexed_dirs))
  File "/usr/local/lib/python3.8/dist-packages/dvc/objects/db/base.py", line 407, in list_hashes_exists


Labels

A: data-sync (Related to dvc get/fetch/import/pull/push), bug (Did we break something?), fs: ssh (Related to the SSH filesystem)
