
run -d hdfs://: output does not exist #7561

@ykasimov

Bug Report

Description

I am trying to add an external dependency that is stored on HDFS. I am running everything inside this Docker image: https://hub.docker.com/r/oneoffcoder/spark-jupyter, which has HDFS installed and configured (the command to run the container is on that page too).

When I run dvc run -v --force -n download_file -d hdfs://localhost/data.csv -o data.csv hdfs dfs -copyToLocal hdfs://localhost/data.csv data.csv I get the following error:

2022-04-08 14:37:28,188 ERROR: dependency 'hdfs://localhost/data.csv' does not exist
------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/conda/lib/python3.7/site-packages/dvc/stage/run.py", line 154, in run_stage
    stage.repo.stage_cache.restore(stage, **kwargs)
  File "/usr/local/conda/lib/python3.7/site-packages/dvc/stage/cache.py", line 182, in restore
    raise RunCacheNotFoundError(stage)
dvc.stage.cache.RunCacheNotFoundError: No run-cache for download_file

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/conda/lib/python3.7/site-packages/dvc/commands/run.py", line 47, in run
    self.repo.run(**kwargs)
  File "/usr/local/conda/lib/python3.7/site-packages/dvc/repo/__init__.py", line 48, in wrapper
    return f(repo, *args, **kwargs)
  File "/usr/local/conda/lib/python3.7/site-packages/dvc/repo/scm_context.py", line 152, in run
    return method(repo, *args, **kw)
  File "/usr/local/conda/lib/python3.7/site-packages/dvc/repo/run.py", line 33, in run
    stage.run(no_commit=no_commit, run_cache=run_cache)
  File "/usr/local/conda/lib/python3.7/site-packages/funcy/decorators.py", line 45, in wrapper
    return deco(call, *dargs, **dkwargs)
  File "/usr/local/conda/lib/python3.7/site-packages/dvc/stage/decorators.py", line 36, in rwlocked
    return call()
  File "/usr/local/conda/lib/python3.7/site-packages/funcy/decorators.py", line 66, in __call__
    return self._func(*self._args, **self._kwargs)
  File "/usr/local/conda/lib/python3.7/site-packages/dvc/stage/__init__.py", line 535, in run
    self._run_stage(dry, force, **kwargs)
  File "/usr/local/conda/lib/python3.7/site-packages/funcy/decorators.py", line 45, in wrapper
    return deco(call, *dargs, **dkwargs)
  File "/usr/local/conda/lib/python3.7/site-packages/dvc/stage/decorators.py", line 36, in rwlocked
    return call()
  File "/usr/local/conda/lib/python3.7/site-packages/funcy/decorators.py", line 66, in __call__
    return self._func(*self._args, **self._kwargs)
  File "/usr/local/conda/lib/python3.7/site-packages/dvc/stage/__init__.py", line 553, in _run_stage
    return run_stage(self, dry, force, **kwargs)
  File "/usr/local/conda/lib/python3.7/site-packages/dvc/stage/run.py", line 157, in run_stage
    stage.save_deps()
  File "/usr/local/conda/lib/python3.7/site-packages/dvc/stage/__init__.py", line 468, in save_deps
    dep.save()
  File "/usr/local/conda/lib/python3.7/site-packages/dvc/output.py", line 523, in save
    raise self.DoesNotExistError(self)
dvc.dependency.base.DependencyDoesNotExistError: dependency 'hdfs://localhost/data.csv' does not exist
------------------------------------------------------------
2022-04-08 14:37:28,305 DEBUG: Analytics is enabled.
2022-04-08 14:37:29,056 DEBUG: Trying to spawn '['daemon', '-q', 'analytics', '/tmp/tmpulp0j21m']'
2022-04-08 14:37:29,074 DEBUG: Spawned '['daemon', '-q', 'analytics', '/tmp/tmpulp0j21m']'

The file data.csv does exist in HDFS:

root@558dae789a66:~/ipynb/dvc-repo# hdfs dfs -ls hdfs://localhost/data.csv
-rw-r--r--   1 root supergroup       2950 2022-04-07 15:42 hdfs://localhost/data.csv
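
Since DVC 2.x goes through fsspec/pyarrow for HDFS (see the dvc doctor output below), a quick way to isolate the failure is to ask that layer directly whether it can see the file. A minimal diagnostic sketch (assuming the NameNode is reachable as "localhost" and the CLASSPATH export from the repro steps is in effect):

python - <<'EOF'
# Ask fsspec (the filesystem layer DVC 2.x uses) whether the file exists.
import fsspec

fs = fsspec.filesystem("hdfs", host="localhost")
print(fs.exists("/data.csv"))  # True here would point at DVC's own URL handling
print(fs.info("/data.csv"))    # metadata, mirroring `hdfs dfs -ls`
EOF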

Reproduce

The in-container steps (3-8) are also collected into one copy-pasteable block after the list.

  1. run the Docker container from the image above
  2. docker exec -it <container_id> /bin/bash
  3. apt-get update && apt-get install -y git && pip install "dvc[hdfs]"
  4. git init && dvc init
  5. touch data.csv
  6. hdfs dfs -put data.csv hdfs://localhost/data.csv
  7. export CLASSPATH=$CLASSPATH:`hdfs classpath --glob`
  8. dvc run -v -n download_file -d hdfs://localhost/data.csv -o data.csv hdfs dfs -copyToLocal hdfs://localhost/data.csv data.csv
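
For convenience, steps 3-8 as one block (assumes a Debian-based image with pip available, which matches the dvc doctor output below):

apt-get update && apt-get install -y git
pip install "dvc[hdfs]"
git init && dvc init
touch data.csv
hdfs dfs -put data.csv hdfs://localhost/data.csv
export CLASSPATH=$CLASSPATH:`hdfs classpath --glob`
dvc run -v -n download_file -d hdfs://localhost/data.csv -o data.csv \
    hdfs dfs -copyToLocal hdfs://localhost/data.csv data.csv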

Expected

I expect a dvc.yaml file to be created containing a stage with the external dependency hdfs://localhost/data.csv.
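
Roughly this stage definition (a sketch of the standard DVC 2.x dvc.yaml stage format, not output I actually got):

stages:
  download_file:
    cmd: hdfs dfs -copyToLocal hdfs://localhost/data.csv data.csv
    deps:
    - hdfs://localhost/data.csv
    outs:
    - data.csv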

Environment information

Everything runs inside the Docker container; the only change to the stock environment is:

export CLASSPATH=$CLASSPATH:`hdfs classpath --glob`

Output of dvc doctor:

$ dvc doctor

DVC version: 2.10.1 (pip)
---------------------------------
Platform: Python 3.7.11 on Linux-5.10.104-linuxkit-x86_64-with-debian-bullseye-sid
Supports:
        hdfs (fsspec = 2022.3.0, pyarrow = 7.0.0),
        webhdfs (fsspec = 2022.3.0),
        http (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
        https (aiohttp = 3.8.1, aiohttp-retry = 2.4.6)
Cache types: hardlink, symlink
Cache directory: fuse.grpcfuse on grpcfuse
Caches: local, hdfs
Remotes: hdfs, hdfs
Workspace directory: fuse.grpcfuse on grpcfuse
Repo: dvc, git

Additional Information (if any):

The issue is not present in DVC version 2.8.3.
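
In case it helps narrow down the regression between 2.8.3 and 2.10.1, a sketch of a release-by-release check (the version list is my guess at the intermediate PyPI releases; each pip install may also pull in different fsspec/pyarrow versions):

for v in 2.9.0 2.9.1 2.9.2 2.9.3 2.9.4 2.9.5 2.10.0; do
    pip install -q "dvc[hdfs]==$v"
    # --force replaces the stage left over from the previous iteration
    if dvc run -v --force -n download_file -d hdfs://localhost/data.csv -o data.csv \
        hdfs dfs -copyToLocal hdfs://localhost/data.csv data.csv; then
        echo "dvc $v: OK"
    else
        echo "dvc $v: FAILS"
    fi
done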


Labels

bug (Did we break something?), fs: hdfs (Related to the HDFS filesystem), p1-important (Important, aka current backlog of things to do)
