Bug Report
LockFile.dump() calls dvc.utils.yaml.dump_yaml(), which uses ruamel (ruamel.yaml.YAML.dump), whereas LockFile.load() calls dvc.utils.yaml.parse_yaml() which uses pyyaml (yaml.load with SafeLoader).
This results in a very subtle bug: ruamel uses the YAML 1.2 specification by default, whereas pyyaml uses the YAML 1.1 specification (https://pypi.org/project/PyYAML/).
This means that when dumping parameters to the lockfile, the 1.2 specification is used, which writes numbers in exponential notation like this: 1e-6. The YAML 1.1 specification, however, expects exponential notation numbers to include a dot: 1.0e-6. If the dot is missing, 1e-6 is read as a string instead of a float.
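To make the difference concrete, here is a minimal sketch that uses only the standard library. `YAML11_FLOAT` is a simplified version of the float pattern PyYAML's implicit resolver uses (sexagesimal, inf and nan forms are left out), and `YAML12_FLOAT` follows the float pattern of the YAML 1.2 core schema. The names and the helper `resolves_as_float` are made up for illustration and are not part of either library.

```python
import re

# Simplified YAML 1.1 float pattern, modeled on PyYAML's implicit resolver:
# a dot is mandatory, and an exponent (if present) needs an explicit sign.
# Sexagesimal, .inf and .nan forms are omitted here for brevity.
YAML11_FLOAT = re.compile(
    r"^(?:[-+]?[0-9][0-9_]*\.[0-9_]*(?:[eE][-+][0-9]+)?"
    r"|\.[0-9][0-9_]*(?:[eE][-+][0-9]+)?)$"
)

# YAML 1.2 core-schema float pattern: the dot is optional.
# (Plain integers also match it; the spec resolves those as int first.)
YAML12_FLOAT = re.compile(
    r"^[-+]?(?:\.[0-9]+|[0-9]+(?:\.[0-9]*)?)(?:[eE][-+]?[0-9]+)?$"
)


def resolves_as_float(scalar: str, pattern: re.Pattern) -> bool:
    """Return True if the plain scalar would be tagged as a float."""
    return pattern.match(scalar) is not None


for scalar in ("1e-06", "1.0e-06", "0.0001"):
    print(
        scalar,
        "1.1:", resolves_as_float(scalar, YAML11_FLOAT),
        "1.2:", resolves_as_float(scalar, YAML12_FLOAT),
    )
```

Under these patterns, 1e-06 only qualifies as a float in 1.2, which matches the string values seen when the lockfile is read back with pyyaml.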
As a result, whenever the lockfile is read again, a float parameter that was written using the 1.2 specification is read back as a string instead of a float, so dvc status always marks the parameters file as modified. dvc params diff, however, reads both params.yaml and dvc.lock using yaml.safe_load, which uses the 1.1 specification, and therefore shows an empty diff. This is confusing: dvc is in a dirty state, but dvc params diff shows nothing.
A (partial?) list of differences between the two YAML specifications can be found here:
https://yaml.readthedocs.io/en/latest/pyyaml.html?highlight=specification
A discussion about pyyaml's parsing of floats can be found here:
yaml/pyyaml#173
See also this comment about pyyaml being focused on YAML spec 1.1: yaml/pyyaml#174 (comment)
Dirty solution
Thanks to @ariciputi. This hack forces the ruamel yaml parser to use the 1.1 specification, thus solving the issue.
Out[65]: {'hello': [1e-06, 1e-05, 0.0001, 0.001, 0.01]}
In [66]: cc = StringIO()
In [67]: yaml_12 = ruamel.yaml.YAML()
In [68]: yaml_12.dump(adict, cc)
In [69]: cc.seek(0)
Out[69]: 0
In [70]: cc.getvalue()
Out[70]: 'hello:\n- 1e-06\n- 1e-05\n- 0.0001\n- 0.001\n- 0.01\n'
In [71]: yaml_11 = ruamel.yaml.YAML()
In [72]: yaml_11.version = (1,1)
In [73]: dd = StringIO()
In [74]: yaml_11.dump(adict, dd)
In [75]: dd.seek(0)
Out[75]: 0
In [76]: dd.getvalue()
Out[76]: '%YAML 1.1\n---\nhello:\n- 1.0e-06\n- 1.0e-05\n- 0.0001\n- 0.001\n- 0.01\n'
In [77]: import yaml
In [78]: yaml.safe_load(_76)
Out[78]: {'hello': [1e-06, 1e-05, 0.0001, 0.001, 0.01]}
In [79]: yaml.safe_load(_70)
Out[79]: {'hello': ['1e-06', '1e-05', 0.0001, 0.001, 0.01]}
Our suggestion (mine and @ariciputi's) is to choose only one YAML library and stick with it.
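As a rough illustration of why a single library avoids the problem, the sketch below dumps and reloads the same data with pyyaml only. It assumes pyyaml is installed; `roundtrip` is a hypothetical helper, not a dvc function, and this is not meant to say which library dvc should pick.

```python
import yaml  # pyyaml, which follows YAML 1.1 for both dumping and loading


def roundtrip(data):
    """Dump and reload with the same library; return (text, reloaded data)."""
    text = yaml.safe_dump(data)
    return text, yaml.safe_load(text)


params = {"hello": [1e-06, 1e-05, 0.0001, 0.001, 0.01]}
text, loaded = roundtrip(params)

print(text)
# The values come back as floats, so the comparison holds.
print(loaded == params)
print(all(isinstance(v, float) for v in loaded["hello"]))
```

Because the writer and the reader agree on the float syntax, the small floats here survive the round trip with their type intact.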
Please provide information about your setup
Python version: 3.8.3
Platform: macOS-10.14.6-x86_64-i386-64bit
Binary: False
Package: None
Supported remotes: azure, gdrive, gs, hdfs, http, https, s3, ssh, oss
Cache: reflink - supported, hardlink - supported, symlink - supported
Filesystem type (cache directory): ('apfs', '/dev/disk1s1')
Repo: dvc, git
Filesystem type (workspace): ('apfs', '/dev/disk1s1')