Original discussion with user: https://discord.com/channels/485586884165107732/485596304961962003/918535356377346089
Problem:
When adding a dataset of ~200k files, the user gets a timeout error coming from our state:
2021-12-09 15:35:26,127 DEBUG: state save (42821908, 1619677552000000000, 509) cf5a27606dc100584853c7ed34d1e8a0
2021-12-09 15:35:26,306 DEBUG: state save (42821294, 1619676946000000000, 526) bc2f6a5e38a7afc59ad75b043ff24c8f
2021-12-09 15:35:26,342 DEBUG: state save (42814780, 1619677510000000000, 705) 218705ae16f023e010f3b1d7358f75d2
Adding...
2021-12-09 15:35:29,315 ERROR: unexpected error
------------------------------------------------------------
Traceback (most recent call last):
File "dvc/main.py", line 55, in main
File "dvc/command/base.py", line 45, in do_run
File "dvc/command/add.py", line 21, in run
File "dvc/utils/collections.py", line 163, in inner
File "dvc/repo/__init__.py", line 50, in wrapper
File "dvc/repo/scm_context.py", line 14, in run
File "dvc/repo/add.py", line 190, in add
File "dvc/stage/__init__.py", line 457, in save
File "dvc/stage/__init__.py", line 477, in save_outs
File "dvc/output.py", line 558, in save
File "dvc/objects/stage.py", line 296, in stage
File "dvc/objects/stage.py", line 170, in _stage_tree
File "dvc/objects/stage.py", line 138, in _build_tree
File "dvc/objects/stage.py", line 130, in _iter_objects
File "dvc/objects/stage.py", line 126, in _build_objects
File "concurrent/futures/_base.py", line 611, in result_iterator
File "concurrent/futures/_base.py", line 439, in result
File "concurrent/futures/_base.py", line 388, in __get_result
File "concurrent/futures/thread.py", line 57, in run
File "dvc/progress.py", line 133, in wrapped
File "dvc/objects/stage.py", line 83, in _stage_file
File "dvc/objects/stage.py", line 66, in get_file_hash
File "dvc/state.py", line 118, in get
File "diskcache/core.py", line 1189, in get
File "contextlib.py", line 113, in __enter__
File "diskcache/core.py", line 733, in _transact
diskcache.core.Timeout
------------------------------------------------------------
2021-12-09 15:35:30,115 DEBUG: Version info for developers:
DVC version: 2.8.3 (deb)
---------------------------------
Platform: Python 3.8.3 on Linux-4.15.0-163-generic-x86_64-with-glibc2.14
Supports:
azure (adlfs = 2021.9.1, knack = 0.9.0, azure-identity = 1.7.0),
gdrive (pydrive2 = 1.10.0),
gs (gcsfs = 2021.11.0),
hdfs (fsspec = 2021.11.0, pyarrow = 6.0.0),
webhdfs (fsspec = 2021.11.0),
http (aiohttp = 3.8.0, aiohttp-retry = 2.4.6),
https (aiohttp = 3.8.0, aiohttp-retry = 2.4.6),
s3 (s3fs = 2021.11.0, boto3 = 1.17.106),
ssh (sshfs = 2021.11.0),
oss (ossfs = 2021.8.0),
webdav (webdav4 = 0.9.3),
webdavs (webdav4 = 0.9.3)
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/nvme0n1p2
Caches: local
Remotes: ssh
Workspace directory: ext4 on /dev/nvme0n1p2
Repo: dvc, git
The likely cause is that one of the threads could not access the state cache in time. Reducing checksum_jobs to 1 helped, though the run then took a long time.
My first idea for a fix is to increase the timeout: in our use case we cannot easily define a reasonable amount of time after which we should have the result, and for big datasets it can be hours.
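To illustrate why a longer timeout should help, here is a minimal sketch using only the standard library. It assumes the state cache behaves like a plain SQLite database under write contention (diskcache stores its data in SQLite, as far as I understand); the table name, helper names, and timings are made up for the demo and are not DVC or diskcache code.

```python
import os
import sqlite3
import tempfile
import threading
import time


def try_write(db_path, timeout):
    """Attempt one write transaction; return False if waiting for the lock timed out."""
    # isolation_level=None lets us issue BEGIN/COMMIT explicitly.
    conn = sqlite3.connect(db_path, timeout=timeout, isolation_level=None)
    try:
        conn.execute("BEGIN IMMEDIATE")  # asks for the write lock up front
        conn.execute("INSERT OR REPLACE INTO state VALUES ('key', 'value')")
        conn.execute("COMMIT")
        return True
    except sqlite3.OperationalError:  # typically "database is locked"
        return False
    finally:
        conn.close()


def hold_write_lock(db_path, seconds, ready):
    """Simulate a busy worker thread that keeps the write lock for a while."""
    conn = sqlite3.connect(db_path, isolation_level=None)
    conn.execute("BEGIN IMMEDIATE")
    ready.set()  # signal that the lock is now held
    time.sleep(seconds)
    conn.execute("COMMIT")
    conn.close()


def demo():
    db_path = os.path.join(tempfile.mkdtemp(), "state.db")
    setup = sqlite3.connect(db_path)
    setup.execute("CREATE TABLE state (k TEXT PRIMARY KEY, v TEXT)")
    setup.commit()
    setup.close()

    results = {}
    # Same contention both times; only the waiting budget differs.
    for label, timeout in (("short", 0.05), ("long", 5.0)):
        ready = threading.Event()
        holder = threading.Thread(
            target=hold_write_lock, args=(db_path, 0.5, ready)
        )
        holder.start()
        ready.wait()
        results[label] = try_write(db_path, timeout)
        holder.join()
    return results


if __name__ == "__main__":
    print(demo())
```

With the short timeout the second writer gives up while the lock is still held; with the long one it simply waits until the lock is released. A larger timeout only trades the error for waiting, so it does not reduce the contention itself.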
Another thing is that diskcache.Timeout tells us quite a lot in this particular use case, yet the user only sees "unexpected error". We could make this message more informative.
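A rough sketch of what a friendlier message could look like. Everything here is hypothetical: `Timeout` is a local stand-in for `diskcache.Timeout` so the snippet runs without diskcache installed, and `StateTimeoutError` and `get_state` are illustrative names, not existing DVC code. The `core.checksum_jobs` hint assumes that is the config option behind checksum_jobs.

```python
class Timeout(Exception):
    """Local stand-in for diskcache.Timeout (assumption, for illustration only)."""


class StateTimeoutError(Exception):
    """Hypothetical user-facing error raised instead of 'unexpected error'."""


def get_state(cache_get, key, jobs):
    """Look up `key` via `cache_get`, translating a low-level timeout.

    `cache_get` is any callable that may raise Timeout, e.g. a cache's get method.
    """
    try:
        return cache_get(key)
    except Timeout as exc:
        # Keep the original exception as the cause so debug output still shows it.
        raise StateTimeoutError(
            "Timed out waiting for the state database while hashing "
            f"with {jobs} jobs. Try lowering 'core.checksum_jobs' "
            "or rerun the command."
        ) from exc


def always_times_out(key):
    """Simulated cache lookup that never gets the lock in time."""
    raise Timeout()


if __name__ == "__main__":
    try:
        get_state(always_times_out, "some/file", jobs=8)
    except StateTimeoutError as err:
        print(f"ERROR: {err}")
```

Chaining with `from exc` keeps the original traceback available in debug logs while the default output shows only the actionable message.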