Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Path sis_hash is now cached and used for __lt__, and __eq__. #95

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

critias
Copy link
Contributor

@critias critias commented Jul 12, 2022

PR to address __lt__ and __eq__ issue for path mentioned here: #94
__hash__ is a bigger problem since it's used by pickle even before all values are set.

@JackTemaki
Copy link
Contributor

As a note, in the current form this PR breaks my setups.

@critias
Copy link
Contributor Author

critias commented Jul 13, 2022

I assume by breaking you setup you mean some job hashes changed?
By changing the way __lt__ the order of some sorting operations can change. It's the main reason I didn't merge it. It didn't happen in the setups I tested, but still wanted to see if I can create a case were this happens.

I'm not sure right now how to deal with it. I don't the a way how to integrate hash_overwrite into the computation without breaking backwards compatibility in some cases. The current solution should be faster and uses less code, but breaking old setups isn't an option. The only thing I can come up with is to make this new behavior optional, which wouldn't be very nice.

We could also keep the current behavior and just add a comment that this is the reason for it. @albertz did this behavior cause any problem for you or did you just stumble upon it and wanted to know why it's implemented like this?

@albertz
Copy link
Member

albertz commented Jul 13, 2022

You change two things here:

  • introduce caching logic
  • change __lt__, __eq__

I assume the caching logic could break some code. Maybe some code sets the creator or path or hash_overwrite attribute directly and then the hash is wrong if it was cached before.

I think that you can also directly fix __lt__ and __eq__ and __hash__ to address #94, i.e. making it consistent with _sis_hash, i.e. considering hash_overwrite. I just stumbled upon this code and this looked wrong to me, that _sis_hash is inconsistent to __hash__ or __eq__ but I'm not sure how much this is a problem in practice. In case you do not use hash_overwrite, it looks mostly correct.

@critias
Copy link
Contributor Author

critias commented Jul 14, 2022

You are right, the cacheing should be it's own PR and would need to be invalidated if creator, path, or hash_overwrite changes. I'll remove it from this PR.

Anyway, the sorting might change in some cases if hash_overwrite is considered which breaks backwards compatibility. Leaving it like it is could lead to unexpected behavior, I would expect if to Path return the same sis_id to also return the same hash.

I think I make it an option for now, disabled it by default and test it in practices first.

@albertz
Copy link
Member

albertz commented Jul 14, 2022

You are right, the cacheing should be it's own PR and would need to be invalidated if creator, path, or hash_overwrite changes. I'll remove it from this PR.

It's even more complicated. If creator stays the same, but its internal state changes, so the hash of creator changes, the cache also needs to be invalidated. But you can never detect that automatically. So basically doing such cache in a correct way is not possible. We might hope that the hash of creator maybe never changes but this is maybe not always the case.

@albertz
Copy link
Member

albertz commented Jul 14, 2022

Anyway, the sorting might change in some cases if hash_overwrite is considered which breaks backwards compatibility.

Yes it changes the behavior, but I'm not sure if this really would break anything. Also, it is mostly (only?) really for the case only with hash_overwrite, which is maybe not used so often.

@albertz
Copy link
Member

albertz commented Jul 14, 2022

My suggestion in #94 was actually to keep the behavior of __lt__ and __eq__ and __hash__ just as before for the case without hash_overwrite. I just meant that with a hash_overwrite, it should follow similar logic as in _sis_hash.

@critias
Copy link
Contributor Author

critias commented Jul 14, 2022

It's even more complicated. If creator stays the same, but its internal state changes, so the hash of creator changes, the cache also needs to be invalidated. But you can never detect that automatically. So basically doing such cache in a correct way is not possible. We might hope that the hash of creator maybe never changes but this is maybe not always the case.

The creator is always assumed to be a Job and sis_id of a Job is set and computed internally by Sisyphus at construction time and assumed to be fixed. If you overwrite these internal variables you are running into undefined behavior anyway. So caching would be doable. On the other hand it probably doesn't saves enough time to be worth the risk of breaking something.

Yes it changes the behavior, but I'm not sure if this really would break anything. Also, it is mostly (only?) really for the case only with hash_overwrite, which is maybe not used so often.

I think the ordering stays the same as long as now hash_overwrite is used. Before it compares something like:
'Job<sis_id_of_job_1> outfile_name_1' vs 'Job<sis_id_of_job_2> outfile_name_2'
after the change it's
b"(Path, (tuple, (str, 'sis_id_of_job_1/output'), (str, 'outfile_name_1')))" vs b"(Path, (tuple, (str, 'sis_id_of_job_2/output'), (str, 'outfile_name_2')))"

So the only thing that changes are fixed items added on both sides which shouldn't change the ordering. Well, at least as long as the total string isn't longer than 4096 (at that point the intermediate byte sting will be hash to avoid passing around arbitrary long strings). I think this can be negated since the maximum filename is for most linux filesystems 255 characters. The maximum path length in Linux is 4096, so it could be reached with many subfolders, but I don't think this is a realistic case here.

@JackTemaki Does your setup still break with the current version if you set USE_SIS_HASH_FOR_PATH_COMPARISON = True?

@albertz
Copy link
Member

albertz commented Jul 14, 2022

The creator is always assumed to be a Job and sis_id of a Job is set and computed internally by Sisyphus at construction time and assumed to be fixed.

Couldn't you update inputs via Job.update, and then the job hash would change?

@critias
Copy link
Contributor Author

critias commented Jul 14, 2022

Couldn't you update inputs via Job.update, and then the job hash would change?

No, update add's inputs, but the Job id stays the same. These additional inputs depend only on the original Job inputs and parameters, the Job outputs are therefore also only depending on the original given inputs and parameters.

@JackTemaki
Copy link
Contributor

@JackTemaki Does your setup still break with the current version if you set USE_SIS_HASH_FOR_PATH_COMPARISON = True?

Does not seem to be the case

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants