Closed
Labels: bug (Did we break something?), fs: hdfs (Related to the HDFS filesystem), p1-important (Important, aka current backlog of things to do)
Bug Report
Description
I am trying to add an external dependency that is stored on HDFS. I am running inside this Docker image: https://hub.docker.com/r/oneoffcoder/spark-jupyter, which has HDFS installed and configured; the command to run the container is documented there as well.
When I run `dvc run -v --force -n download_file -d hdfs://localhost/data.csv -o data.csv hdfs dfs -copyToLocal hdfs://localhost/data.csv data.csv` I get the following error:
2022-04-08 14:37:28,188 ERROR: dependency 'hdfs://localhost/data.csv' does not exist
------------------------------------------------------------
Traceback (most recent call last):
File "/usr/local/conda/lib/python3.7/site-packages/dvc/stage/run.py", line 154, in run_stage
stage.repo.stage_cache.restore(stage, **kwargs)
File "/usr/local/conda/lib/python3.7/site-packages/dvc/stage/cache.py", line 182, in restore
raise RunCacheNotFoundError(stage)
dvc.stage.cache.RunCacheNotFoundError: No run-cache for download_file
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/conda/lib/python3.7/site-packages/dvc/commands/run.py", line 47, in run
self.repo.run(**kwargs)
File "/usr/local/conda/lib/python3.7/site-packages/dvc/repo/__init__.py", line 48, in wrapper
return f(repo, *args, **kwargs)
File "/usr/local/conda/lib/python3.7/site-packages/dvc/repo/scm_context.py", line 152, in run
return method(repo, *args, **kw)
File "/usr/local/conda/lib/python3.7/site-packages/dvc/repo/run.py", line 33, in run
stage.run(no_commit=no_commit, run_cache=run_cache)
File "/usr/local/conda/lib/python3.7/site-packages/funcy/decorators.py", line 45, in wrapper
return deco(call, *dargs, **dkwargs)
File "/usr/local/conda/lib/python3.7/site-packages/dvc/stage/decorators.py", line 36, in rwlocked
return call()
File "/usr/local/conda/lib/python3.7/site-packages/funcy/decorators.py", line 66, in __call__
return self._func(*self._args, **self._kwargs)
File "/usr/local/conda/lib/python3.7/site-packages/dvc/stage/__init__.py", line 535, in run
self._run_stage(dry, force, **kwargs)
File "/usr/local/conda/lib/python3.7/site-packages/funcy/decorators.py", line 45, in wrapper
return deco(call, *dargs, **dkwargs)
File "/usr/local/conda/lib/python3.7/site-packages/dvc/stage/decorators.py", line 36, in rwlocked
return call()
File "/usr/local/conda/lib/python3.7/site-packages/funcy/decorators.py", line 66, in __call__
return self._func(*self._args, **self._kwargs)
File "/usr/local/conda/lib/python3.7/site-packages/dvc/stage/__init__.py", line 553, in _run_stage
return run_stage(self, dry, force, **kwargs)
File "/usr/local/conda/lib/python3.7/site-packages/dvc/stage/run.py", line 157, in run_stage
stage.save_deps()
File "/usr/local/conda/lib/python3.7/site-packages/dvc/stage/__init__.py", line 468, in save_deps
dep.save()
File "/usr/local/conda/lib/python3.7/site-packages/dvc/output.py", line 523, in save
raise self.DoesNotExistError(self)
dvc.dependency.base.DependencyDoesNotExistError: dependency 'hdfs://localhost/data.csv' does not exist
------------------------------------------------------------
2022-04-08 14:37:28,305 DEBUG: Analytics is enabled.
2022-04-08 14:37:29,056 DEBUG: Trying to spawn '['daemon', '-q', 'analytics', '/tmp/tmpulp0j21m']'
2022-04-08 14:37:29,074 DEBUG: Spawned '['daemon', '-q', 'analytics', '/tmp/tmpulp0j21m']'
The file data.csv does exist in HDFS:
root@558dae789a66:~/ipynb/dvc-repo# hdfs dfs -ls hdfs://localhost/data.csv
-rw-r--r-- 1 root supergroup 2950 2022-04-07 15:42 hdfs://localhost/data.csv
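Since the same dependency works on 2.8.3 but not 2.10.1, the regression may lie in how the newer fsspec-based HDFS backend splits the `hdfs://` URI into host and path. As a stdlib-only sanity check (this is a hypothetical diagnostic, not DVC's actual code path), the URI itself parses cleanly:

```python
from urllib.parse import urlparse

# Split the dependency URI into scheme, host, and path, the way an
# fsspec-style filesystem generally would before contacting the cluster.
uri = "hdfs://localhost/data.csv"
parts = urlparse(uri)
print(parts.scheme)  # -> hdfs
print(parts.netloc)  # -> localhost
print(parts.path)    # -> /data.csv
```

So the host is `localhost` and the path is `/data.csv`, matching what `hdfs dfs -ls` sees above.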
Reproduce
- run docker
- docker exec -it <container_id> /bin/bash
- apt-get update && apt-get install "dvc[hdfs]" git
- dvc init
- touch data.csv
- hdfs dfs -put data.csv hdfs://localhost/data.csv
- export CLASSPATH=$CLASSPATH:`hdfs classpath --glob`
- dvc run -v -n download_file -d hdfs://localhost/data.csv -o data.csv hdfs dfs -copyToLocal hdfs://localhost/data.csv data.csv
Expected
I expect a dvc.yaml file to be created containing the external dependency hdfs://localhost/data.csv.
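For reference, the stage I would expect in dvc.yaml looks roughly like this (a sketch written from the command above, not generated output):

```yaml
stages:
  download_file:
    cmd: hdfs dfs -copyToLocal hdfs://localhost/data.csv data.csv
    deps:
    - hdfs://localhost/data.csv
    outs:
    - data.csv
```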
Environment information
Everything runs inside Docker; the only change to the environment is:
export CLASSPATH=$CLASSPATH:`hdfs classpath --glob`
Output of dvc doctor:
$ dvc doctor
DVC version: 2.10.1 (pip)
---------------------------------
Platform: Python 3.7.11 on Linux-5.10.104-linuxkit-x86_64-with-debian-bullseye-sid
Supports:
hdfs (fsspec = 2022.3.0, pyarrow = 7.0.0),
webhdfs (fsspec = 2022.3.0),
http (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
https (aiohttp = 3.8.1, aiohttp-retry = 2.4.6)
Cache types: hardlink, symlink
Cache directory: fuse.grpcfuse on grpcfuse
Caches: local, hdfs
Remotes: hdfs, hdfs
Workspace directory: fuse.grpcfuse on grpcfuse
Repo: dvc, git
Additional Information (if any):
The issue is not present in DVC version 2.8.3