Separate Hashes for Install Directories vs. Modules #3513

Closed
citibeth opened this issue Mar 22, 2017 · 5 comments

@citibeth
Member

@tgamblin @healther @adamjstewart

Spack currently creates install directories and modules, both based on a fully concretized spec (FCS). Install directories and modules are each identified by a hash, and the same hash algorithm is currently used for both types of objects. Using the same hash for both seems intuitively correct. However, recent discussion on #3501 suggests that it might actually be incorrect, and that Spack might benefit from decoupling the hash used to identify install directories from the one used to identify modules.

Hashes are useful for efficiently labelling two things as "same" or "possibly different." If two install directories have the same hash, then we can surmise that their contents are the same. Similarly, if FCS-A hashes to hash-A, and Spack finds an install directory labelled with hash-A, then Spack can (and does) surmise that building FCS-A would result in a directory that is "the same" as the install directory it just found, and so it does not need to rebuild. Note that our notion of "the same" has been left fuzzy: we care about functional equivalence, not byte-for-byte equivalence of every file in the install directory.
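The reuse check described above can be sketched roughly as follows. This is a minimal illustration, not Spack's implementation: `spec_hash`, `needs_build`, the dict-based spec, and the install path are all hypothetical stand-ins.

```python
import hashlib
import json


def spec_hash(spec: dict) -> str:
    """Hash a plain dict standing in for a fully concretized spec (hypothetical)."""
    canonical = json.dumps(spec, sort_keys=True)  # stable key order
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:8]


def needs_build(spec: dict, installed: dict) -> bool:
    """True if no install directory is labelled with this spec's hash."""
    return spec_hash(spec) not in installed


# Illustrative data only: one already-installed spec, keyed by its hash.
fcs_a = {"name": "zlib", "version": "1.2.11", "compiler": "gcc@6.3.0"}
installed = {spec_hash(fcs_a): "/hypothetical/prefix/zlib-" + spec_hash(fcs_a)}

same_spec = dict(fcs_a)                     # hashes to the same label: reuse
other_spec = dict(fcs_a, version="1.2.8")   # different label: build
```

The lookup never inspects the directory contents; it trusts that an equal hash implies a functionally equivalent install.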

Hashes are not always perfect. It is OK if FCS-A and FCS-B hash to something different, even if their install directories are the same. That will result in unnecessary builds and annoyed users, but won't break anything. However, the converse is not OK: if FCS-A and FCS-B hash to the same thing, then their install directories must be the same. (This is why packaging systems like pip cause problems for Spack: they modify their install directory after Spack is done installing them.)
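To make the asymmetry concrete, here is a small sketch that sorts a set of installs into the harmless case (same contents, different hashes) and the unsafe case (same hash, different contents). The function name `classify` and the idea of a "contents fingerprint" are assumptions made for illustration only.

```python
from collections import defaultdict


def classify(installs):
    """installs: iterable of (spec_hash, contents_fingerprint) pairs (hypothetical).

    Returns (unsafe, redundant):
      unsafe    - hashes that label more than one distinct contents (breaks reuse)
      redundant - contents that were built under more than one hash (wasted builds)
    """
    by_hash = defaultdict(set)
    by_contents = defaultdict(set)
    for h, c in installs:
        by_hash[h].add(c)
        by_contents[c].add(h)
    unsafe = sorted(h for h, cs in by_hash.items() if len(cs) > 1)
    redundant = sorted(c for c, hs in by_contents.items() if len(hs) > 1)
    return unsafe, redundant
```

Only a non-empty `unsafe` list indicates a correctness problem; a non-empty `redundant` list merely means some builds could have been skipped.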

Now suppose a fully concretized spec X involves a run dependency Y. Should Y be included in the hash for X? Looking at the install directory: if we accept for now that run dependencies do not affect the contents of the install directory, then clearly Y should not be included in the hash. BUT looking at modules: run dependencies do materially affect the contents of the generated module. Therefore, run dependencies do need to be included in the hash used to label the module.
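The X/Y example can be sketched with one hashing routine parameterized by which dependency types it follows. Everything here (`hash_spec`, the two deptype sets, the dict layout) is a hypothetical illustration of the idea, not how Spack computes its hashes.

```python
import hashlib
import json


def hash_spec(spec, deptypes):
    """Hash a spec, following only dependencies whose type is in `deptypes`."""
    deps = sorted(
        hash_spec(dep, deptypes)
        for dep, dtype in spec["deps"]
        if dtype in deptypes
    )
    payload = json.dumps([spec["name"], spec["version"], deps])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:8]


# Assumed split: run dependencies count only for the module hash.
INSTALL_DEPTYPES = {"build", "link"}
MODULE_DEPTYPES = {"build", "link", "run"}


def make_x(y_version):
    """Build spec X with a single run dependency Y at the given version."""
    y = {"name": "Y", "version": y_version, "deps": []}
    return {"name": "X", "version": "1.0", "deps": [(y, "run")]}


x_old, x_new = make_x("1.0"), make_x("2.0")
```

With these definitions, changing only Y's version leaves `hash_spec(x, INSTALL_DEPTYPES)` unchanged while `hash_spec(x, MODULE_DEPTYPES)` changes, which is the decoupling argued for above.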

Conclusion: To be fully correct, the hash algorithms used for modules and install directories need to be different. This of course might be user-unfriendly: it is convenient for them to be the same. But it seems that the simplest, "purest" system would have the two use different hashes. Maybe there's some clever way to hide this from the user. Or maybe the algorithm laid out in #3501 can be seen more simply with this understanding.

Unfortunately, it doesn't stop there. Once we start removing some dependencies from the install-dir or module hash, we will want to keep others. For example, build dependencies should be hashed (e.g. if they're a compiler), except when they shouldn't be (for example, if Bison is used and the parser generated by any Bison version is functionally equivalent, or if the dependency is doxygen and we just don't want docs to affect the hash). Similarly, run dependencies shouldn't be hashed for install directories, except when they should be: maybe a full path to a run dependency snuck in there somehow.

To really get it right, it seems we will need to give users control over whether individual dependencies are/are not hashed, for each hash algorithm. Sure, we can have defaults based on the deptype. But those defaults will need to be overridden 5% of the time.
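One way to picture "defaults by deptype, with per-dependency overrides" is a lookup table plus an override map. The table values, the `is_hashed` helper, and the package names in `overrides` are assumptions chosen to mirror the examples in this comment; none of this is an existing Spack setting.

```python
# Assumed defaults per hash algorithm and deptype (illustrative only).
DEFAULT_HASHED = {
    "install": {"build": True, "link": True, "run": False},
    "module": {"build": True, "link": True, "run": True},
}


def is_hashed(algorithm, dep_name, deptype, overrides=None):
    """Decide whether a dependency contributes to the given hash algorithm.

    `overrides` maps (algorithm, dep_name) -> bool and wins over the default.
    """
    overrides = overrides or {}
    return overrides.get((algorithm, dep_name), DEFAULT_HASHED[algorithm][deptype])


# Hypothetical user overrides for the exceptional cases.
overrides = {
    ("install", "bison"): False,    # any version yields an equivalent parser
    ("install", "doxygen"): False,  # docs should not trigger rebuilds
    ("install", "python"): True,    # a run dep whose full path leaked into the install
}
```

Keeping the overrides keyed by algorithm as well as by dependency name lets the install-dir hash and the module hash diverge only where the user asks for it.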

In the meantime, without distinct hashes for modules vs. install directories, and without fine-grained control over what goes in the hash, Spack errs on the side of caution. It puts everything in the hash, and now and then annoys users with unnecessary rebuilds.

@healther
Contributor

I'm not sure if I'm correct in this, but I thought of the hashes as a unique identifier of the contents of an installation directory. In that case it would only need to hash all build-dependencies.

The generated modules then would have to incorporate the link- and run-dependencies as well as the hashes of those dependencies. The output of spack find should then only contain the latter sort of hashes (effectively moving the install-hashes to be "internal use only").
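A rough sketch of this layering, in which the module identifier is derived from the install hash plus the identifiers of link and run dependencies, might look like the following. The helper names and the dict layout (`build`, `link`, `run` lists) are hypothetical.

```python
import hashlib


def digest(*parts):
    """Short hex digest of the given strings, joined with a NUL separator."""
    return hashlib.sha256("\0".join(parts).encode("utf-8")).hexdigest()[:8]


def install_hash(pkg):
    """Internal hash: the package itself plus its build dependencies only."""
    build_deps = sorted(install_hash(d) for d in pkg.get("build", []))
    return digest(pkg["name"], pkg["version"], *build_deps)


def module_hash(pkg):
    """User-facing hash: install hash plus the module hashes of link/run deps."""
    runtime = sorted(
        module_hash(d) for d in pkg.get("link", []) + pkg.get("run", [])
    )
    return digest(install_hash(pkg), *runtime)


# Illustrative packages: same X, different versions of a run dependency Y.
x_with_y1 = {"name": "x", "version": "1.0", "run": [{"name": "y", "version": "1.0"}]}
x_with_y2 = {"name": "x", "version": "1.0", "run": [{"name": "y", "version": "2.0"}]}
```

Under this scheme the two X packages share an install hash but get different module hashes, so only the module would need regenerating.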

Obviously I'm missing something important here...

> To really get it right, it seems we will need to give users control over whether individual dependencies are/are not hashed, for each hash algorithm. Sure, we can have defaults based on the deptype. But those defaults will need to be overridden 5% of the time.

I'm not sure you really want to do this. Spack was intended to provide reproducible results; this would make reproducibility largely impossible, wouldn't it?
Multiple builds aren't really the problem; usability is. My problem is that I cannot (easily) load my installed module in a working fashion once I have multiple versions (including broken ones). If one wanted to reduce the number of builds, I guess this would be a way.

@citibeth
Member Author

citibeth commented Mar 22, 2017 via email

@citibeth
Member Author

@scheibelp This relates to your work on deptypes.

@scheibelp
Member

> @scheibelp This relates to your work on deptypes.

Specifically that is referring to #2548. To update this thread with discussion from the telecon: #2548 was more focused on how deptypes affect environment setup for a build; IMO this is a different issue. My initial read-through of #3501 is that it is also a different issue: it started out potentially related to #2548 in that deptypes determine which modules are loaded; it then turned into a discussion of whether there should be support for changing deptypes in package.py without a reinstall.

Regarding the concept of removing run dependencies (or rather run-only dependencies) from the hash, I think there may be special cases which make this difficult: what happens if a top-level dependency requires an output format from a run dependency that changes in some new version? Or perhaps if newer versions of a run dependency add support for new commands? There do seem to be certain constraints on run dependencies, although they are likely not as strict as for link dependencies (where, for example, variants are much more likely to alter compatibility).

@alalazo alalazo self-assigned this Nov 23, 2017
@alalazo
Member

alalazo commented Dec 9, 2019

Closing the issue as stale. Feel free to reopen if you think something still needs to be discussed.

@alalazo alalazo closed this as completed Dec 9, 2019
4 participants