'track_times' argument for 'dump' has no use #130

Closed
1313e opened this issue Jun 12, 2020 · 4 comments · Fixed by #134

1313e (Collaborator) commented Jun 12, 2020

So, while I was going through the tests of the package a bit for the v4 release, I noticed that the tests for the `track_times` optional argument do not actually do anything, which you can see here: `caching_dump` is never called before `test_track_times` runs (which is the case in the current test order), so `DUMP_CACHE` is empty and the test body never executes:

```python
DUMP_CACHE = []  # filled by caching_dump; empty by the time test_track_times runs

def md5sum(filename, blocksize=65536):
    """ Compute MD5 sum for a given file """
    hash = hashlib.md5()
    with open(filename, "r+b") as f:
        # NOTE: the file is opened in binary mode, so f.read() returns b"" at
        # EOF; the str sentinel "" is never matched, so on Python 3 this loop
        # would never terminate if it were ever reached.
        for block in iter(lambda: f.read(blocksize), ""):
            hash.update(block)
    return hash.hexdigest()

def caching_dump(obj, filename, *args, **kwargs):
    """ Save arguments of all dump calls """
    DUMP_CACHE.append((obj, filename, args, kwargs))
    return hickle_dump(obj, filename, *args, **kwargs)

def test_track_times():
    """ Verify that track_times = False produces identical files """
    hashes = []
    for obj, filename, mode, kwargs in DUMP_CACHE:
        if isinstance(filename, hickle.H5FileWrapper):
            filename = str(filename.file_name)
        kwargs['track_times'] = False
        caching_dump(obj, filename, mode, **kwargs)
        hashes.append(md5sum(filename))
    time.sleep(1)
    for hash1, (obj, filename, mode, kwargs) in zip(hashes, DUMP_CACHE):
        if isinstance(filename, hickle.H5FileWrapper):
            filename = str(filename.file_name)
        caching_dump(obj, filename, mode, **kwargs)
        hash2 = md5sum(filename)
        print(hash1, hash2)
        assert hash1 == hash2
```

That is actually a good thing, as `md5sum` would otherwise loop infinitely on Python 3: the file is opened in binary mode, so `f.read()` returns `b""` at EOF, which never compares equal to the str sentinel `""`.
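For reference, here is a minimal stdlib-only sketch of what a corrected helper would look like, using the bytes sentinel `b""` that binary mode requires (`md5sum_fixed` is just an illustrative name, not something from the test suite):

```python
import hashlib
import os
import tempfile

def md5sum_fixed(filename, blocksize=65536):
    """MD5 of a file; b"" is what f.read() returns at EOF in binary mode."""
    h = hashlib.md5()
    with open(filename, "rb") as f:
        for block in iter(lambda: f.read(blocksize), b""):  # b"", not ""
            h.update(block)
    return h.hexdigest()

# Quick sanity check: the file hash matches hashing the raw bytes directly.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello hickle")
    path = tmp.name
assert md5sum_fixed(path) == hashlib.md5(b"hello hickle").hexdigest()
os.remove(path)
```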

However, I decided to check what exactly `track_times` does.
According to the h5py documentation, this argument is used solely when creating datasets; it is set to `True` by default; and it enables the saving of creation timestamps.
What that actually means or what impact it has is rather unclear, and it is not mentioned anywhere else in the documentation.

But, seeing as it is set to `True` by default, which is the same default that `dump` has in hickle, there is no point in specifically asking for its value.
Furthermore, `dump` already allows additional kwargs to be provided that are passed on to the `create_dataset` method, so if anyone ever wants to set `track_times` to a different value, they still can.
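The forwarding argument above can be illustrated with a self-contained sketch (the `fake_*` stand-ins are hypothetical, not hickle's actual internals): any keyword that `dump` does not consume itself would simply flow through `**kwargs` to the dataset-creation call.

```python
def fake_create_dataset(name, data, track_times=True):
    """Stand-in for h5py's create_dataset; just records the options it got."""
    return {"name": name, "data": data, "track_times": track_times}

def fake_dump(obj, filename, mode="w", **kwargs):
    """Stand-in for hickle.dump: mode is consumed here, everything else in
    kwargs is forwarded unchanged to the dataset-creation call."""
    return fake_create_dataset(filename, obj, **kwargs)

# A user who wants timestamp-free, reproducible files can still opt out,
# even without a dedicated track_times parameter on dump itself:
result = fake_dump([1, 2, 3], "data.hkl", track_times=False)
print(result["track_times"])  # False
```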

Also, in hickle, the argument is described as making sure that repeated hickling results in different files.
This description is itself very vague about what that exactly means: repeated hickling with the exact same script either always results in a different file (if the file was opened for writing and truncated, thus becoming a new file) or is impossible (if it was not truncated, in which case data with the same name already exists).

Given the above, I feel that the argument has no use besides confusing the user, and should therefore be removed in both v3 and v4.
@telegraphic What is your opinion?

telegraphic (Owner) commented
This was a specific feature request from a user -- the user wanted md5sums for identical data to be identical. Agreed that the documentation is confusing, as the use case was the opposite: allowing md5sum tests on files with identical contents.

1313e (Collaborator, Author) commented Jun 15, 2020

> This was a specific feature request from a user -- the user wanted md5sums for identical data to be identical. Agreed that the documentation is confusing, as the use case was the opposite: allowing md5sum tests on files with identical contents.

But it won't.
I tried it out myself: if I save the same data to the same file twice, the md5sum will be different, as the data cannot end up in the same location within the file.
But even then, we can simply remove the `track_times` argument, as one can provide `track_times=False` to `dump` directly.

1313e (Collaborator, Author) commented Jun 15, 2020

Alright, I finally found where that feature was requested (#13).
I think we can still remove the code for it, as the kwargs from dump and load are passed to basically all functions, and are always used when creating groups or datasets.
This would also make the code a bit cleaner.

telegraphic (Owner) commented
Agreed that `**kwargs` will catch this and pass it on, so OK to remove.
