Separate Hashes for Install Directories vs. Modules #3513

citibeth · 2017-03-22T03:55:32Z

Spack currently creates install directories and modules, both based on a fully concretized spec (FCS). Install directories and modules are identified by hash; and the same hash algorithm is currently used to generate the hash used to identify both types of objects. Using the same hash for both seems intuitively correct. However, recent discussion on #3501 suggests that it might actually be incorrect; and that Spack might benefit from decoupling of the hash used to identify install directories vs. modules.

Hashes are useful for efficiently labelling two things as "same" or "possibly different." If two install directories have the same hash, then we can surmise that their contents are the same. Similarly, if FCS-A hashes to hash-A, and Spack finds an install directory labelled with hash-A, then Spack can (and does) surmise that building FCS-A would result in a directory that is "the same" as the install directory it just found, and so it does not need to rebuild. Note that our notion of "the same" has been left fuzzy; we care about functional equivalence, not byte-for-byte equivalence of every file in the install directory.

Hashes are not always perfect. It is OK if FCS-A and FCS-B hash to something different, even if their install directories are the same. That will result in unnecessary builds and annoyed users, but won't break anything. However, the converse is not OK. If FCS-A and FCS-B hash to the same thing, then their install directories must be the same. (This is why packaging systems like pip cause problems for Spack; they modify their install directory after Spack is done installing them).
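As a rough illustration of the reuse logic described above (a minimal sketch, not Spack's actual implementation): the spec is modelled here as a plain dict, and `spec_hash` and `needs_build` are hypothetical helper names invented for this example.

```python
import hashlib
import json


def spec_hash(spec):
    """Hash a toy 'fully concretized spec', modelled as a plain dict.

    Assumption: serializing with sorted keys gives a stable, canonical form.
    """
    canonical = json.dumps(spec, sort_keys=True)
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


def needs_build(spec, installed_hashes):
    """Return True unless an install directory with this hash already exists."""
    return spec_hash(spec) not in installed_hashes


fcs_a = {"name": "foo", "version": "1.0", "compiler": "gcc@6.3.0"}
installed = {spec_hash(fcs_a)}

# Same spec, same hash: the existing install directory is reused.
print(needs_build(dict(fcs_a), installed))

# A different spec hashes differently: at worst this is an unnecessary build.
fcs_b = dict(fcs_a, version="1.1")
print(needs_build(fcs_b, installed))
```

The unsafe direction is the one this sketch cannot detect: two specs that share a hash while their install directories actually differ.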

Now suppose fully concretized spec X involves a run dependency Y. Should Y be included in the hash for X? Looking at the install directory... If we accept for now that run dependencies do not affect the contents of the install directory, then clearly Y should not be included in the hash. BUT looking at modules... run dependencies do materially affect the contents of the generated module. Therefore, run dependencies do need to be included in the hash used to label the module.
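The distinction can be sketched as two hash functions that differ only in which dependency types they include. This is an illustration under stated assumptions, not how Spack computes hashes: the deptype sets, the tuple layout, and the names `toy_hash`, `install_hash`, and `module_hash` are all hypothetical.

```python
import hashlib
import json

# Assumed policy for this sketch: run dependencies feed only the module hash.
INSTALL_DEPTYPES = frozenset({"build", "link"})
MODULE_DEPTYPES = frozenset({"build", "link", "run"})


def toy_hash(name, version, deps, deptypes):
    """Hash a package plus only those dependencies whose type is in deptypes.

    deps is a list of (dep_name, dep_version, dep_type) tuples.
    """
    included = sorted(
        [dep_name, dep_version]
        for dep_name, dep_version, dep_type in deps
        if dep_type in deptypes
    )
    payload = json.dumps({"name": name, "version": version, "deps": included})
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


def install_hash(name, version, deps):
    return toy_hash(name, version, deps, INSTALL_DEPTYPES)


def module_hash(name, version, deps):
    return toy_hash(name, version, deps, MODULE_DEPTYPES)


# X depends on Y at run time only; the two specs differ in Y's version.
x_with_y1 = [("zlib", "1.2.11", "link"), ("y", "1.0", "run")]
x_with_y2 = [("zlib", "1.2.11", "link"), ("y", "2.0", "run")]

print(install_hash("x", "1.0", x_with_y1) == install_hash("x", "1.0", x_with_y2))
print(module_hash("x", "1.0", x_with_y1) == module_hash("x", "1.0", x_with_y2))
```

With this policy, changing Y leaves the install-directory hash alone (no rebuild) but yields a new module hash (the module file is regenerated).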

Conclusion: To be fully correct, the hash algorithms used for modules and install directories need to be different. This of course might be user-unfriendly: it is convenient for them to be the same. But it seems that the simplest, "purest" system would have the two use different hashes. Maybe there's some clever way to hide this from the user. Or maybe the algorithm laid out in #3501 can be seen more simply with this understanding.

Unfortunately, it doesn't stop there. Once we start removing some dependencies from the install-dir or module hash, we will want to keep others. For example, build dependencies should be hashed (e.g., if they're a compiler), except when they shouldn't be (for example, if Bison is used, and the parser generated by any Bison version is functionally equivalent; or if the dependency is doxygen and we just don't want docs to affect the hash). Similarly, run dependencies shouldn't be hashed for install directories, except when they should be; maybe a full path to a run dependency snuck in there somehow.

To really get it right, it seems we will need to give users control over whether individual dependencies are/are not hashed, for each hash algorithm. Sure, we can have defaults based on the deptype. But those defaults will need to be overridden 5% of the time.
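One way to picture this is deptype-based defaults plus per-dependency overrides. This is a sketch only: the `Dep` class, the `overrides` field, the `DEFAULTS` table, and `is_hashed` are hypothetical, and nothing like them exists in Spack's `depends_on()` today.

```python
from dataclasses import dataclass, field

# Assumed defaults per hash kind and deptype; real defaults would need discussion.
DEFAULTS = {
    "install": {"build": True, "link": True, "run": False},
    "module": {"build": True, "link": True, "run": True},
}


@dataclass
class Dep:
    name: str
    deptype: str
    # Maps a hash kind ("install" or "module") to an explicit True/False.
    overrides: dict = field(default_factory=dict)


def is_hashed(dep, hash_kind):
    """Use the per-dependency override if present, else the deptype default."""
    if hash_kind in dep.overrides:
        return dep.overrides[hash_kind]
    return DEFAULTS[hash_kind][dep.deptype]


# Docs-only build dependency: excluded from both hashes by override.
doxygen = Dep("doxygen", "build", {"install": False, "module": False})
# Run dependency whose full path leaked into the install directory.
leaky = Dep("perl", "run", {"install": True})
# Ordinary run dependency: falls back to the defaults.
plain = Dep("python", "run")

for dep in (doxygen, leaky, plain):
    print(dep.name, is_hashed(dep, "install"), is_hashed(dep, "module"))
```

Most dependencies would never set `overrides`; the field exists for the minority of cases where the deptype default is wrong.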

In the meantime.... without distinct hashes for modules vs. install directories, and without fine-grained control over what goes in the hash... Spack errs on the side of caution. It puts everything in the hash, and now and then annoys users with unnecessary rebuilds.


healther · 2017-03-22T08:31:19Z

I'm not sure if I'm correct in this, but I thought of the hashes as a unique identifier of the contents of an installation directory. In that case it would only need to hash all build-dependencies.

The generated modules then would have to incorporate the link- and run-dependencies as well as the hashes of those dependencies. The output of spack find should then only contain the latter sort of hashes (effectively moving the install-hashes to be "internal use only").

Obviously I'm missing something important here...

> To really get it right, it seems we will need to give users control over whether individual dependencies are/are not hashed, for each hash algorithm. Sure, we can have defaults based on the deptype. But those defaults will need to be overridden 5% of the time.

I'm not sure you really want to do this. Spack was intended to provide reproducible results, and this would make reproducibility largely impossible, wouldn't it?
Multiple builds aren't really the problem, usability is. My problem is that I cannot (easily) load my installed module in a working fashion, once I have multiple versions (including broken ones). If one would want to reduce the number of builds I guess this would be a way.

citibeth · 2017-03-22T12:27:29Z

> I'm not sure if I'm correct in this, but I thought of the hashes as a unique identifier of the contents of an installation directory. In that case it would only need to hash all build-dependencies.

yes

> The generated modules then would have to incorporate the link- and run-dependencies as well as the hashes of those dependencies.

yes

> The output of spack find should then only contain the latter sort of hashes (effectively moving the install-hashes to be "internal use only").

yes. Although `spack find`, `spack load`, etc. are weak and should be replaced with more useful ways of finding stuff that's been built (see Spack Environments).

> Obviously I'm missing something important here... To really get it right, it seems we will need to give users control over whether individual dependencies are/are not hashed, for each hash algorithm. Sure, we can have defaults based on the deptype. But those defaults will need to be overridden 5% of the time.

I think this is orthogonal to the issues you're dealing with, and it's a somewhat minor detail. The idea is that in general, Spack would now need to know whether each declared dependency (`depends_on()`) does or does not add to (a) the install dir hash, (b) the module hash. I'm concluding that users would probably have to have the option to declare this information directly in `depends_on()`, rather than relying on the deptype to divine it (although relying on deptype would probably work in most cases).

> My problem is that I cannot (easily) load my installed module in a working fashion, once I have multiple versions (including broken ones). If one would want to reduce the number of builds I guess this would be a way.

Yes, that is exactly why `spack load` is irretrievably broken; and it is orthogonal to these other issues. Spack Environments provide a way out of that mess. Basically, you tell Spack what you're interested in as part of a complete environment you want to use. Spack will then build it and assemble the resulting packages either into a Spack View, or a bunch of `module load` commands. Then, you never have to use `spack load` or go through the ambiguity of more than one package installed.

https://github.com/LLNL/spack/wiki/Elizabeth%27s-Conceptual-Framework-for-Environments

For the time being, you might want to check out #2698. It is a "poor man's" Spack Environment, and is a bit arcane to use. Therefore, it will change when we do the "real" Spack Environments. But it does get the job done, and I rely on it.

citibeth · 2017-03-23T20:37:03Z

@scheibelp This relates to your work on deptypes.

scheibelp · 2017-03-24T18:37:30Z

> @scheibelp This relates to your work on deptypes.

Specifically that is referring to #2548. To update this thread with discussion from the telecon: #2548 was more focused on how deptypes affect environment setup for a build; IMO this is a different issue. My initial read-through of #3501 is that it is also a different issue: it started out potentially related to #2548 in that deptypes determine which modules are loaded; it then turned into a discussion of whether there should be support for changing deptypes in package.py without a reinstall.

Regarding the concept of removing run dependencies (or rather run-only dependencies) from the hash, I think there may be special cases which make this difficult: what happens if a top-level package requires an output format from a run dependency that changes in some new version? Or perhaps if newer versions of a run dependency add support for new commands? There do seem to be certain constraints on run dependencies, although they are likely not as strict as for link dependencies (where, for example, variants are much more likely to alter compatibility).

alalazo · 2019-12-09T13:36:36Z

Closing the issue as stale. Feel free to reopen if you think something still needs to be discussed.

citibeth mentioned this issue Mar 22, 2017

Incorrect dependency handling on spack load --dependencies #3501

Closed

citibeth added hashes labels Mar 22, 2017

alalazo self-assigned this Nov 23, 2017

alalazo added the modules label Nov 23, 2017

alalazo closed this as completed Dec 9, 2019
