Skip to content

exp run: cannot clean up temp directory runs on Linux + NFS #5641

@werthen

Description

@werthen

Bug Report

Description

When using the exp run --run-all feature, the command can never finish due to a error deleting a git folder, even though that folder seems to be empty.

Reproduce

  1. dvc exp run --queue -S training.loss.name=mse
  2. dvc exp run --run-all -j1 --verbose

Expected

I expect the same output as running dvc exp run -S training.loss.name=mse, being a successful experiment with proper metrics shown in dvc exp show. Instead, when using the queueing functionality, the command errors and the metrics are not properly saved. The JSON representation of dvc exp show also states the experiment is still queued.

Environment information

Output of dvc doctor:

$ dvc doctor
DVC version: 2.0.6 (pip)
---------------------------------
Platform: Python 3.8.5 on Linux-5.4.0-60-generic-x86_64-with-glibc2.10
Supports: hdfs, http, https
Cache types: hardlink, symlink
Cache directory: nfs4 on [REDACTED]
Caches: local
Remotes: None
Workspace directory: nfs4 on [REDACTED]
Repo: dvc, git

Additional Information (if any):

Please note that at the end of this, it states the .git/object/pack directory is not empty, even though it can be deleted using rmdir without issue.

2021-03-17 12:02:49,159 DEBUG: state save (152491799, 1615977912613432576, 22244) 0b8958b3e558a41b046785ece234cbee
2021-03-17 12:02:49,164 DEBUG: state save (152491799, 1615977912613432576, 22244) 0b8958b3e558a41b046785ece234cbee
2021-03-17 12:02:49,179 DEBUG: state save (152487234, 1615977912709433600, 6498) 6f3ff4ce2d56a0a19e27ed2ebce075fa
2021-03-17 12:02:49,183 DEBUG: state save (152487234, 1615977912709433600, 6498) 6f3ff4ce2d56a0a19e27ed2ebce075fa
2021-03-17 12:02:49,226 DEBUG: state save (152547862, 1615978823436614400, 293776) 343f60bca0df1ae495b598bbce29b384
2021-03-17 12:02:49,231 DEBUG: state save (152547862, 1615978823436614400, 293776) 343f60bca0df1ae495b598bbce29b384
2021-03-17 12:02:49,236 DEBUG: {'MS_data/model_weights.h5': 'modified'}
2021-03-17 12:02:49,260 DEBUG: state save (152544875, 1615978968985396480, 240801) 474d305f96b1fab25155390c8e7d2e2c
2021-03-17 12:02:49,264 DEBUG: state save (152544875, 1615978968985396480, 240801) 474d305f96b1fab25155390c8e7d2e2c
2021-03-17 12:02:49,270 DEBUG: {'notebook_results/Train.ipynb': 'modified'}
2021-03-17 12:02:49,287 DEBUG: state save (152544214, 1615978823200606208, 244) c0fda638e8a2cc6192b86dab818e8954
2021-03-17 12:02:49,292 DEBUG: state save (152544214, 1615978823200606208, 244) c0fda638e8a2cc6192b86dab818e8954
2021-03-17 12:02:49,297 DEBUG: {'metrics.csv': 'modified'}
2021-03-17 12:02:49,315 DEBUG: state save (152549822, 1615978930064190208, 126) 66eb73c07b00069143c48a321af0abd7
2021-03-17 12:02:49,319 DEBUG: state save (152549822, 1615978930064190208, 126) 66eb73c07b00069143c48a321af0abd7
2021-03-17 12:02:49,325 DEBUG: {'train_metrics.json': 'modified'}
2021-03-17 12:02:49,344 DEBUG: state save (152549303, 1615978965597293312, 126) b8d9d9ec45c24e1f65474a88d22e1b37
2021-03-17 12:02:49,348 DEBUG: state save (152549303, 1615978965597293312, 126) b8d9d9ec45c24e1f65474a88d22e1b37
2021-03-17 12:02:49,353 DEBUG: {'val_metrics.json': 'modified'}
2021-03-17 12:02:49,361 DEBUG: Computed stage: 'train_model' md5: '5ca694409ddab75eb4b0ff0deddbef02'
2021-03-17 12:02:49,453 DEBUG: state save (152547862, 1615978823436614400, 293776) 343f60bca0df1ae495b598bbce29b384
2021-03-17 12:02:49,459 DEBUG: Checking out 'MS_data/model_weights.h5' with cache 'object md5: 343f60bca0df1ae495b598bbce29b384'.
2021-03-17 12:02:49,474 DEBUG: Created 'copy': ../../../cache/34/3f60bca0df1ae495b598bbce29b384 -> MS_data/model_weights.h5                    
2021-03-17 12:02:49,475 DEBUG: state save (152550395, 1615978969461410816, 293776) 343f60bca0df1ae495b598bbce29b384
2021-03-17 12:02:49,544 DEBUG: state save (152544875, 1615978968985396480, 240801) 474d305f96b1fab25155390c8e7d2e2c
2021-03-17 12:02:49,550 DEBUG: Checking out 'notebook_results/Train.ipynb' with cache 'object md5: 474d305f96b1fab25155390c8e7d2e2c'.
2021-03-17 12:02:49,563 DEBUG: Created 'copy': ../../../cache/47/4d305f96b1fab25155390c8e7d2e2c -> notebook_results/Train.ipynb                
2021-03-17 12:02:49,563 DEBUG: state save (152549836, 1615978969549413376, 240801) 474d305f96b1fab25155390c8e7d2e2c
2021-03-17 12:02:49,658 DEBUG: state save (152544214, 1615978823200606208, 244) c0fda638e8a2cc6192b86dab818e8954
2021-03-17 12:02:49,664 DEBUG: Checking out 'metrics.csv' with cache 'object md5: c0fda638e8a2cc6192b86dab818e8954'.
2021-03-17 12:02:49,670 DEBUG: Created 'copy': ../../../cache/c0/fda638e8a2cc6192b86dab818e8954 -> metrics.csv                                 
2021-03-17 12:02:49,671 DEBUG: state save (152550397, 1615978969657416704, 244) c0fda638e8a2cc6192b86dab818e8954
2021-03-17 12:02:49,739 DEBUG: state save (152549822, 1615978930064190208, 126) 66eb73c07b00069143c48a321af0abd7
2021-03-17 12:02:49,745 DEBUG: Checking out 'train_metrics.json' with cache 'object md5: 66eb73c07b00069143c48a321af0abd7'.
2021-03-17 12:02:49,752 DEBUG: Created 'copy': ../../../cache/66/eb73c07b00069143c48a321af0abd7 -> train_metrics.json                          
2021-03-17 12:02:49,753 DEBUG: state save (152548593, 1615978969741419264, 126) 66eb73c07b00069143c48a321af0abd7
2021-03-17 12:02:49,799 DEBUG: state save (152549303, 1615978965597293312, 126) b8d9d9ec45c24e1f65474a88d22e1b37
2021-03-17 12:02:49,805 DEBUG: Checking out 'val_metrics.json' with cache 'object md5: b8d9d9ec45c24e1f65474a88d22e1b37'.
2021-03-17 12:02:49,811 DEBUG: Created 'copy': ../../../cache/b8/d9d9ec45c24e1f65474a88d22e1b37 -> val_metrics.json                            
2021-03-17 12:02:49,812 DEBUG: state save (152549843, 1615978969801421056, 126) b8d9d9ec45c24e1f65474a88d22e1b37
2021-03-17 12:02:49,855 DEBUG: stage: 'train_model' was reproduced
Updating lock file 'dvc.lock'
2021-03-17 12:02:50,296 DEBUG: Commit to new experiment branch 'refs/exps/25/6b1cdc9d81d8f2feecd7b4538d526a8ddd28fc/exp-293c9'
2021-03-17 12:02:51,133 DEBUG: Collected experiment 'df26527'.                                                                                 
2021-03-17 12:02:51,135 DEBUG: Removing tmpdir '<ExpTemporaryDirectory '/home/sumo/lwerthen/ms/.dvc/tmp/exps/tmpqnsuwjdk'>'
2021-03-17 12:02:51,135 DEBUG: Removing '/home/sumo/lwerthen/ms/.dvc/tmp/exps/tmpqnsuwjdk'
2021-03-17 12:02:51,319 ERROR: unexpected error - [Errno 39] Directory not empty: '/home/sumo/lwerthen/ms/.dvc/tmp/exps/tmpqnsuwjdk/.git/objects/pack'
------------------------------------------------------------
Traceback (most recent call last):
  File "/home/sumo/lwerthen/miniconda3/envs/tf2.4/lib/python3.8/shutil.py", line 654, in _rmtree_safe_fd
    os.rmdir(entry.name, dir_fd=topfd)
OSError: [Errno 39] Directory not empty: 'pack'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/sumo/lwerthen/miniconda3/envs/tf2.4/lib/python3.8/site-packages/dvc/main.py", line 50, in main
    ret = cmd.run()
  File "/home/sumo/lwerthen/miniconda3/envs/tf2.4/lib/python3.8/site-packages/dvc/command/experiments.py", line 525, in run
    results = self.repo.experiments.run(
  File "/home/sumo/lwerthen/miniconda3/envs/tf2.4/lib/python3.8/site-packages/dvc/repo/experiments/__init__.py", line 815, in run
    return run(self.repo, *args, **kwargs)
  File "/home/sumo/lwerthen/miniconda3/envs/tf2.4/lib/python3.8/site-packages/dvc/repo/__init__.py", line 49, in wrapper
    return f(repo, *args, **kwargs)
  File "/home/sumo/lwerthen/miniconda3/envs/tf2.4/lib/python3.8/site-packages/dvc/repo/experiments/run.py", line 28, in run
    return repo.experiments.reproduce_queued(jobs=jobs)
  File "/home/sumo/lwerthen/miniconda3/envs/tf2.4/lib/python3.8/site-packages/dvc/repo/experiments/__init__.py", line 376, in reproduce_queued
    results = self._reproduce_revs(**kwargs)
  File "/home/sumo/lwerthen/miniconda3/envs/tf2.4/lib/python3.8/site-packages/dvc/repo/experiments/__init__.py", line 46, in wrapper
    return f(exp, *args, **kwargs)
  File "/home/sumo/lwerthen/miniconda3/envs/tf2.4/lib/python3.8/site-packages/dvc/repo/experiments/__init__.py", line 522, in _reproduce_revs
    exec_results = self._executors_repro(executors, **kwargs)
  File "/home/sumo/lwerthen/miniconda3/envs/tf2.4/lib/python3.8/site-packages/dvc/repo/experiments/__init__.py", line 648, in _executors_repro
    executor.cleanup()
  File "/home/sumo/lwerthen/miniconda3/envs/tf2.4/lib/python3.8/site-packages/dvc/repo/experiments/executor/local.py", line 74, in cleanup
    self._tmp_dir.cleanup()
  File "/home/sumo/lwerthen/miniconda3/envs/tf2.4/lib/python3.8/site-packages/dvc/repo/experiments/executor/local.py", line 26, in cleanup
    self._rmtree(self.name)
  File "/home/sumo/lwerthen/miniconda3/envs/tf2.4/lib/python3.8/site-packages/dvc/repo/experiments/executor/local.py", line 22, in _rmtree
    remove(name)
  File "/home/sumo/lwerthen/miniconda3/envs/tf2.4/lib/python3.8/site-packages/dvc/utils/fs.py", line 135, in remove
    shutil.rmtree(path, onerror=_chmod)
  File "/home/sumo/lwerthen/miniconda3/envs/tf2.4/lib/python3.8/shutil.py", line 715, in rmtree
    _rmtree_safe_fd(fd, path, onerror)
  File "/home/sumo/lwerthen/miniconda3/envs/tf2.4/lib/python3.8/shutil.py", line 652, in _rmtree_safe_fd
    _rmtree_safe_fd(dirfd, fullname, onerror)
  File "/home/sumo/lwerthen/miniconda3/envs/tf2.4/lib/python3.8/shutil.py", line 652, in _rmtree_safe_fd
    _rmtree_safe_fd(dirfd, fullname, onerror)
  File "/home/sumo/lwerthen/miniconda3/envs/tf2.4/lib/python3.8/shutil.py", line 656, in _rmtree_safe_fd
    onerror(os.rmdir, fullname, sys.exc_info())
  File "/home/sumo/lwerthen/miniconda3/envs/tf2.4/lib/python3.8/site-packages/dvc/utils/fs.py", line 120, in _chmod
    func(p)
OSError: [Errno 39] Directory not empty: '/home/sumo/lwerthen/ms/.dvc/tmp/exps/tmpqnsuwjdk/.git/objects/pack'
------------------------------------------------------------
2021-03-17 12:02:52,023 DEBUG: Version info for developers:
DVC version: 2.0.6 (pip)
---------------------------------
Platform: Python 3.8.5 on Linux-5.4.0-60-generic-x86_64-with-glibc2.10
Supports: hdfs, http, https
Cache types: hardlink, symlink
Cache directory: nfs4 on [REDACTED]
Caches: local
Remotes: None
Workspace directory: nfs4 on [REDACTED]
Repo: dvc, git

Having any troubles? Hit us up at https://dvc.org/support, we are always happy to help!
2021-03-17 12:02:52,033 DEBUG: Removing '/home/sumo/lwerthen/ms/.dvc/tmp/exps/tmpssfyvyik'
Exception ignored in: <finalize object at 0x7f8865619de0; dead>
Traceback (most recent call last):
  File "/home/sumo/lwerthen/miniconda3/envs/tf2.4/lib/python3.8/weakref.py", line 566, in __call__
    return info.func(*info.args, **(info.kwargs or {}))
  File "/home/sumo/lwerthen/miniconda3/envs/tf2.4/lib/python3.8/tempfile.py", line 818, in _cleanup
    cls._rmtree(name)
  File "/home/sumo/lwerthen/miniconda3/envs/tf2.4/lib/python3.8/site-packages/dvc/repo/experiments/executor/local.py", line 22, in _rmtree
    remove(name)
  File "/home/sumo/lwerthen/miniconda3/envs/tf2.4/lib/python3.8/site-packages/dvc/utils/fs.py", line 135, in remove
    shutil.rmtree(path, onerror=_chmod)
  File "/home/sumo/lwerthen/miniconda3/envs/tf2.4/lib/python3.8/shutil.py", line 715, in rmtree
    _rmtree_safe_fd(fd, path, onerror)
  File "/home/sumo/lwerthen/miniconda3/envs/tf2.4/lib/python3.8/shutil.py", line 652, in _rmtree_safe_fd
    _rmtree_safe_fd(dirfd, fullname, onerror)
  File "/home/sumo/lwerthen/miniconda3/envs/tf2.4/lib/python3.8/shutil.py", line 652, in _rmtree_safe_fd
    _rmtree_safe_fd(dirfd, fullname, onerror)
  File "/home/sumo/lwerthen/miniconda3/envs/tf2.4/lib/python3.8/shutil.py", line 656, in _rmtree_safe_fd
    onerror(os.rmdir, fullname, sys.exc_info())
  File "/home/sumo/lwerthen/miniconda3/envs/tf2.4/lib/python3.8/site-packages/dvc/utils/fs.py", line 120, in _chmod
    func(p)
OSError: [Errno 39] Directory not empty: '/home/sumo/lwerthen/ms/.dvc/tmp/exps/tmpssfyvyik/.git/objects/pack'
2021-03-17 12:02:54,322 DEBUG: Analytics is enabled.
2021-03-17 12:02:54,501 DEBUG: Trying to spawn '['daemon', '-q', 'analytics', '/tmp/tmpyel92az0']'
2021-03-17 12:02:54,505 DEBUG: Spawned '['daemon', '-q', 'analytics', '/tmp/tmpyel92az0']'
(tf2.4) lwerthen@sumo-ai:~/ms$ ls -al /home/sumo/lwerthen/ms/.dvc/tmp/exps/tmpssfyvyik/.git/objects/pack
total 33
drwxr-xr-x  2 lwerthen sumo  2 Mär 17 12:02 .
drwxr-xr-x 31 lwerthen sumo 31 Mär 17 12:02 ..
(tf2.4) lwerthen@sumo-ai:~/ms$ rmdir /home/sumo/lwerthen/ms/.dvc/tmp/exps/tmpssfyvyik/.git/objects/pack
(tf2.4) lwerthen@sumo-ai:~/ms$ echo $?
0

Metadata

Metadata

Assignees

Labels

A: experimentsRelated to dvc expbugDid we break something?p1-importantImportant, aka current backlog of things to doresearch

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions