Error on an attempt to dvc get cats-dogs-v[1/2] with dvc>=3.0.0 #9767

@ankxyz

Bug Report

Validation error when downloading dataset cats-dogs-v1 or cats-dogs-v2

Description

When I try to download the cats-dogs-v1 or cats-dogs-v2 dataset using the dvc get command with the --rev option, I get a validation error.

Environment

  • OS: Ubuntu 22.04 LTS
  • Python: 3.8.10/3.11.2
  • Package manager: pip==23.2.1
  • Virtual environment: standard venv
  • DVC: 3.x

Reproduce

# Create and activate virtual environment
python3 -m venv .venv
source .venv/bin/activate

# Upgrade basic packages
pip install --upgrade pip setuptools wheel

# Install DVC
pip install dvc==3.7.0

# Try to download dataset
dvc get --rev cats-dogs-v1 \
    https://github.com/iterative/dataset-registry \
    use-cases/cats-dogs -o datadir

error traceback

'../../../../../get-started/data.xml.dvc' validation failed in revision '0547f58'.

extra keys not allowed, in outs -> 0 -> metric, line 4, column 3
  3 outs:
  4 - md5: a304afb96060aad90176268345e10355
  5   path: get-started/data.xml

The same download command with option -v:

dvc get --rev cats-dogs-v1 \
    https://github.com/iterative/dataset-registry \
    use-cases/cats-dogs -o datadir -v

verbose error traceback

2023-07-26 13:24:46,495 DEBUG: v3.7.0 (pip), CPython 3.11.2 on Linux-5.15.0-71-generic-x86_64-with-glibc2.31
2023-07-26 13:24:46,495 DEBUG: command: /media/alex/hdd/tmp/tst_dvc_rgstr_dpnd_vrsn/.venv/bin/dvc get --rev cats-dogs-v1 https://github.com/iterative/dataset-registry use-cases/cats-dogs -o datadir -v
2023-07-26 13:24:46,652 DEBUG: Creating external repo https://github.com/iterative/dataset-registry@cats-dogs-v1
2023-07-26 13:24:46,652 DEBUG: erepo: git clone 'https://github.com/iterative/dataset-registry' to a temporary dir
2023-07-26 13:24:48,637 DEBUG: erepo: using shallow clone for branch 'cats-dogs-v1'
'../../../../../get-started/data.xml.dvc' validation failed in revision '0547f58'.

extra keys not allowed, in outs -> 0 -> metric, line 4, column 3
  3 outs:
  4 - md5: a304afb96060aad90176268345e10355
  5   path: get-started/data.xml
2023-07-26 13:24:48,678 ERROR: failed to get 'use-cases/cats-dogs' from 'https://github.com/iterative/dataset-registry' - '../../../../../get-started/data.xml.dvc' validation failed in revision '0547f58': extra keys not allowed @ data['outs'][0]['metric']
Traceback (most recent call last):
  File "/media/alex/hdd/tmp/tst_dvc_rgstr_dpnd_vrsn/.venv/lib/python3.11/site-packages/dvc/utils/strictyaml.py", line 267, in validate
    return schema(data)
           ^^^^^^^^^^^^
  File "/media/alex/hdd/tmp/tst_dvc_rgstr_dpnd_vrsn/.venv/lib/python3.11/site-packages/voluptuous/schema_builder.py", line 272, in __call__
    return self._compiled([], data)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/media/alex/hdd/tmp/tst_dvc_rgstr_dpnd_vrsn/.venv/lib/python3.11/site-packages/voluptuous/schema_builder.py", line 595, in validate_dict
    return base_validate(path, iteritems(data), out)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/media/alex/hdd/tmp/tst_dvc_rgstr_dpnd_vrsn/.venv/lib/python3.11/site-packages/voluptuous/schema_builder.py", line 433, in validate_mapping
    raise er.MultipleInvalid(errors)
voluptuous.error.MultipleInvalid: extra keys not allowed @ data['outs'][0]['metric']

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/media/alex/hdd/tmp/tst_dvc_rgstr_dpnd_vrsn/.venv/lib/python3.11/site-packages/dvc/commands/get.py", line 33, in _get_file_from_repo
    Repo.get(
  File "/media/alex/hdd/tmp/tst_dvc_rgstr_dpnd_vrsn/.venv/lib/python3.11/site-packages/dvc/repo/get.py", line 54, in get
    desc=f"Downloading {fs.path.name(path)}",
                        ^^^^^^^
  File "/media/alex/hdd/tmp/tst_dvc_rgstr_dpnd_vrsn/.venv/lib/python3.11/site-packages/dvc/fs/dvc.py", line 423, in path
    return self.fs.path
           ^^^^^^^
  File "/home/alex/.pyenv/versions/3.11.2/lib/python3.11/functools.py", line 1001, in __get__
    val = self.func(instance)
          ^^^^^^^^^^^^^^^^^^^
  File "/media/alex/hdd/tmp/tst_dvc_rgstr_dpnd_vrsn/.venv/lib/python3.11/site-packages/dvc/fs/dvc.py", line 416, in fs
    return _DVCFileSystem(**self.fs_args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/media/alex/hdd/tmp/tst_dvc_rgstr_dpnd_vrsn/.venv/lib/python3.11/site-packages/fsspec/spec.py", line 79, in __call__
    obj = super().__call__(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/media/alex/hdd/tmp/tst_dvc_rgstr_dpnd_vrsn/.venv/lib/python3.11/site-packages/dvc/fs/dvc.py", line 163, in __init__
    self._datafss[key] = DataFileSystem(index=repo.index.data["repo"])
                                              ^^^^^^^^^^
  File "/media/alex/hdd/tmp/tst_dvc_rgstr_dpnd_vrsn/.venv/lib/python3.11/site-packages/funcy/objects.py", line 25, in __get__
    res = instance.__dict__[self.fget.__name__] = self.fget(instance)
                                                  ^^^^^^^^^^^^^^^^^^^
  File "/media/alex/hdd/tmp/tst_dvc_rgstr_dpnd_vrsn/.venv/lib/python3.11/site-packages/dvc/repo/__init__.py", line 276, in index
    return Index.from_repo(self)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/media/alex/hdd/tmp/tst_dvc_rgstr_dpnd_vrsn/.venv/lib/python3.11/site-packages/dvc/repo/index.py", line 242, in from_repo
    for _, idx in collect_files(repo, onerror=onerror):
  File "/media/alex/hdd/tmp/tst_dvc_rgstr_dpnd_vrsn/.venv/lib/python3.11/site-packages/dvc/repo/index.py", line 100, in collect_files
    index = Index.from_file(repo, file_path)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/media/alex/hdd/tmp/tst_dvc_rgstr_dpnd_vrsn/.venv/lib/python3.11/site-packages/dvc/repo/index.py", line 265, in from_file
    stages=list(dvcfile.stages.values()),
                ^^^^^^^^^^^^^^
  File "/media/alex/hdd/tmp/tst_dvc_rgstr_dpnd_vrsn/.venv/lib/python3.11/site-packages/dvc/dvcfile.py", line 197, in stages
    data, raw = self._load()
                ^^^^^^^^^^^^
  File "/media/alex/hdd/tmp/tst_dvc_rgstr_dpnd_vrsn/.venv/lib/python3.11/site-packages/dvc/dvcfile.py", line 151, in _load
    return self._load_yaml(**kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/media/alex/hdd/tmp/tst_dvc_rgstr_dpnd_vrsn/.venv/lib/python3.11/site-packages/dvc/dvcfile.py", line 162, in _load_yaml
    return strictyaml.load(
           ^^^^^^^^^^^^^^^^
  File "/media/alex/hdd/tmp/tst_dvc_rgstr_dpnd_vrsn/.venv/lib/python3.11/site-packages/dvc/utils/strictyaml.py", line 295, in load
    validate(data, schema, text=text, path=path, rev=rev)
  File "/media/alex/hdd/tmp/tst_dvc_rgstr_dpnd_vrsn/.venv/lib/python3.11/site-packages/dvc/utils/strictyaml.py", line 269, in validate
    raise YAMLValidationError(exc, path, text, rev=rev) from exc
dvc.utils.strictyaml.YAMLValidationError: '../../../../../get-started/data.xml.dvc' validation failed in revision '0547f58'

2023-07-26 13:24:48,715 DEBUG: Analytics is enabled.
2023-07-26 13:24:48,742 DEBUG: Trying to spawn '['daemon', '-q', 'analytics', '/tmp/tmpobvfq_w1']'
2023-07-26 13:24:48,743 DEBUG: Spawned '['daemon', '-q', 'analytics', '/tmp/tmpobvfq_w1']'
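
For readers unfamiliar with this kind of failure: the traceback above shows voluptuous raising MultipleInvalid from dvc/utils/strictyaml.py because the parsed .dvc file contains a key the schema does not list. The snippet below is only a stdlib sketch of that idea, not DVC's or voluptuous's actual code. The allow-list ALLOWED_OUT_KEYS, the helper find_extra_keys, and the value given to metric are assumptions made for illustration; only the md5, path and the key name metric come from the error output.

```python
# Stdlib-only illustration of strict key validation.
# DVC performs the real check with voluptuous; this is not its implementation.

# Hypothetical allow-list, NOT DVC's real output schema.
ALLOWED_OUT_KEYS = {"md5", "path"}


def find_extra_keys(data, allowed=ALLOWED_OUT_KEYS):
    """Return one error string per key in data['outs'][i] that is not allowed."""
    errors = []
    for i, out in enumerate(data.get("outs", [])):
        for key in out:
            if key not in allowed:
                errors.append(
                    f"extra keys not allowed @ data['outs'][{i}]['{key}']"
                )
    return errors


# Parsed form of a legacy .dvc file. The value of `metric` is assumed;
# the error message only tells us that the key exists under outs[0].
legacy = {
    "outs": [
        {
            "md5": "a304afb96060aad90176268345e10355",
            "path": "get-started/data.xml",
            "metric": False,
        }
    ]
}

for message in find_extra_keys(legacy):
    print(message)
```

If this reading is right, any revision of the registry that still contains a .dvc file with such a key would fail at index collection, even when the requested path (use-cases/cats-dogs) is unrelated to the offending file.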

Notes:

  • the error occurs with different Python versions when dvc>=3.0.0 is installed; the latest working version is dvc==2.58.2
  • the same error occurs with dataset get-started:
dvc get --rev get-started https://github.com/iterative/dataset-registry get-started -o datadir
  • the error does not occur with dataset get-started-40K:
dvc get --rev get-started-40K https://github.com/iterative/dataset-registry use-cases/cats-dogs -o datadir
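
To make it easier to re-check the revisions listed above after switching DVC versions, here is a small helper sketch. It only rebuilds the dvc get invocations already shown in this report (--rev and -o, nothing else). The names CASES, build_command and try_get are made up for this example, and the path used for cats-dogs-v2 is assumed to be the same as for cats-dogs-v1, since that exact command is not shown above.

```python
import shlex
import subprocess
import tempfile

REGISTRY = "https://github.com/iterative/dataset-registry"

# (revision, path) pairs taken from the commands in this report.
# The cats-dogs-v2 path is an assumption (same as cats-dogs-v1).
CASES = [
    ("cats-dogs-v1", "use-cases/cats-dogs"),
    ("cats-dogs-v2", "use-cases/cats-dogs"),
    ("get-started", "get-started"),
    ("get-started-40K", "use-cases/cats-dogs"),
]


def build_command(rev, path, out):
    """Build the same `dvc get` invocation used in the report."""
    return ["dvc", "get", "--rev", rev, REGISTRY, path, "-o", out]


def try_get(rev, path):
    """Run `dvc get` into a fresh temporary directory; True if it exits with 0."""
    with tempfile.TemporaryDirectory() as tmp:
        result = subprocess.run(build_command(rev, path, f"{tmp}/datadir"))
        return result.returncode == 0


# Only print the commands here; call try_get(rev, path) to actually run them.
for rev, path in CASES:
    print(shlex.join(build_command(rev, path, "datadir")))
```

Running try_get for each pair inside a virtual environment with dvc==2.58.2 and again with dvc==3.7.0 should show which revisions are affected, assuming network access and that dvc is on PATH.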

Labels: awaiting response