
Where do containers fit in? #1

Open
vsoch opened this issue Dec 31, 2020 · 14 comments

@vsoch
Member

vsoch commented Dec 31, 2020

I'm reading over the overview draft (really exciting!) and I'd like to brainstorm how containers fit into this model. Right now we have them as a part of composition:

Bindmount libraries into containers. Verify ABI compatibility of sub-DAG from host with binaries in container ecosystem

Which I think means that we would be able to build packages into containers with spack containerize and then check compatibility with libraries on the host (MPI for Singularity comes to mind as a good example).

So to step back - there are a few scenarios where I can see containers fitting in:

  1. Build an entire set of packages into a container (e.g., spack containerize). We are following the same build routines but inside of a container with spack. For running this container, we'd need to be checking against the host for ABI compatibility, per what is written into the current spec.
  2. Obtain a package as a container (not sure this exists).
  3. Build a package as a container (a derivative of 1, but with just one package, and hopefully with multi-stage builds to minimize the redundancy) and then add the executable container to the path akin to having pulled it (2.)

For the second point, I'm wondering if this could be a use case for spack (or this general package manager model), period. If we imagine that a user wants to use spack as a container registry, instead of compiling / building on their host, would this be hard or unreasonable to do? A "build" really comes down to ensuring the container technology is installed, and there is a means to pull based on a specific hash or tag, and then have containers as executables on the path (Singularity) or run them (Docker, less likely for HPC, but podman and friends are just around the corner). We can focus on the Singularity use case to start, since the container is akin to an executable binary. This would mean that the user is allowed to install any container URI available, and the containers in storage would need to be namespaced depending on their URI. Reproducibility then would not depend on what spack does, but on whether the version of the container changes. We would then need some way to still check the container for ABI compatibility with the host (again focusing on Singularity). In the same way we could export a tree of packages and dependencies, we could also export a list of containers.

For point 3, this is similar to the idea of having isolated environments for other needs too (I remember the discussion about pex, for example). It would allow the user to have a combination of natively built packages and containers "for all those other use cases where I want to keep things separate."

And another idea, maybe this is a point 4. If we are bind mounting containers to potentially eventually link to a library inside, you could imagine having containers that exist only to serve as bind resources for some set of libraries that might require very hard to satisfy host dependencies.

@cosmicexplorer

cosmicexplorer commented Jan 11, 2021

Response 1: @cosmicexplorer

Thank you so much for this enduring and thoughtful brainstorm!! First thoughts:

Clarifications

check compatibility with libraries on the host (MPI for Singularity comes to mind as a good example).

To be perfectly clear, is this referring to compat checking by running spack external find and then spack concretize? Or referring to the output of something like libabigail?

  1. Build an entire set of packages into a container (e.g., spack containerize). We are following the same build routines...

Thank you for clearly separating this!

Package <=> Container

  2. Obtain a package as a container (not sure this exists).

It does not exist but I REALLY hope we can establish a fully bijective relationship between spack environments and spack-built containers!!! Why?

  • Because I really want to be able to do the analogous "expose a concretized (?) spack environment, fully built (?), as a dependency"
    • Which other spack environments can import during a build stage without having to concretize and download all of its recursive dependencies every time!!
      • This is clearly useful already for something like gcc or rust, which just installs itself from scratch each time.
      • This technique may be necessary or assumed to be needed to implement binary packages, which you go into depth on below.
        • @becker33 could you speak on whether "not downloading recursive dependencies of a binary package" is or could be a goal of the binary packages development?

Hermetic Process Executions via Containers

For the second point, I'm wondering if this could be a use case for spack (or this general package manager model), period. If we imagine that a user wants to use spack as a container registry, instead of compiling / building on their host, would this be hard or unreasonable to do? A "build" really comes down to ensuring the container technology is installed, and there is a means to pull based on a specific hash or tag, and then have containers as executables on the path (Singularity) or run them (Docker, less likely for HPC, but podman and friends are just around the corner). We can focus on the Singularity use case to start, since the container is akin to an executable binary. This would mean that the user is allowed to install any container URI available, and the containers in storage would need to be namespaced depending on their URI. Reproducibility then would not depend on what spack does, but on whether the version of the container changes. We would then need some way to still check the container for ABI compatibility with the host (again focusing on Singularity). In the same way we could export a tree of packages and dependencies, we could also export a list of containers.

Wow, I completely missed this paragraph at first!!! This is EXACTLY what I want to investigate! This comes from a few places:

A "build" really comes down to ensuring the container technology is installed, and there is a means to pull based on a specific hash or tag, and then have containers as executables on the path.

I think it sounds like we will be able to express the container environment in terms of concepts that directly map back to trees of operations over specific packages, which leads me to propose:

Express Specs in Terms of Environments Composed of Merging Package Operations

  • I propose we should specifically investigate a process of merging environments with a few goals:
    • Hammer out a concept of "possibly-abstract spec" composed as a tree (from a parsed AST which may include e.g. edge attributes as per spack/spack#20523, "break build systems and dependency `type`s down into more granular resource dependencies") which would be able to represent:
      • the conditional dependency of a depends_on('a@2', when='@3') directive in a package.py (see the sketch after this list).
      • the totally ordered dependencies in a spack.yaml environment, which may have been added at different times.
        • Define statefulness in terms of tree operations.
    • Develop a sub-theory for pip, cargo, coursier resolves.
    • And in particular ipex (pantsbuild/pants#8793, "introduce --generate-ipex to (v1) python binary creation to lazy-load requirements") for lazy-downloading wheels only when the application is first initialized.
      • This concept can be extended to spack to make lazy-loading python applications nonetheless remain hermetic after initialization!
        • I believe this is where you saw the container <=> pex analogy in the first place?
    • Define a set of operations which cover every action we perform at the package level of abstraction, in order, when running spack install.
      • Define which operations are possible from each state, and what a "state" is.
      • All operations are stateless, with actors representing state according to their message passing.
        • That could let us use the concretizer to schedule spack caching, subprocess scheduling, as well!
          • This need not be the same type of ASP or SMT solver; a Reinforcement Learning approach (broad, I know) has been used for this in the literature.
      • In particular, describe concretization as an operation which can be statefully performed asynchronously over a subgraph of all packages.
        • This isolates "the part of spack that requires clingo" (aka spack/spack#20159, "clean up how spack builds clingo") to a single operation now.
        • Being able to specify subgraphs to perform full concretization on (or error) opens up the possibility of swapping out the concretizer strategy for a particular phase of a build e.g. with z3 (and to represent that unambiguously).
        • In particular, we might like to express every operation as a (cacheable, remote-enabled: see spack/spack#20407, "hermetic/remoteable/cacheable process executions") process execution (not necessarily even with the same protobuf).
          • Why? Because that makes it into an easily machine-parseable file format which can unfold itself in a recursive, mergeable, asynchronous, parallelizable bootstrapping process, which doesn't need all of spack to execute!
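
For reference, a minimal, hypothetical package.py sketch of the conditional-dependency case from the first bullet above (package names, versions, and checksums are all made up; only the directive shapes are real Spack package API):

```python
# Hypothetical package: everything here is invented except the directive
# shapes (version, depends_on, when), which are real Spack package API.
from spack import *


class Mypkg(Package):
    """Toy package illustrating a conditional dependency edge."""

    homepage = "https://example.com/mypkg"
    url = "https://example.com/mypkg-3.0.tar.gz"

    version("3.0", sha256="0" * 64)  # placeholder checksum
    version("2.5", sha256="1" * 64)  # placeholder checksum

    # The edge to a@2 exists only when mypkg itself is at version 3 --
    # exactly the conditional case the tree representation must express.
    depends_on("a@2", when="@3")

    def install(self, spec, prefix):
        make()
        make("install")
```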

In the same way we could export a tree of packages and dependencies, we could also export a list of containers.

  • I think that my "possibly-abstract spec" corresponds precisely to your "tree of packages and dependencies". I think a container contains resources (as per spack/spack#20253, "minimal zsh completion") that we can spack external find for on their fs root.
  • I think that a container with a command line attached is the exact same thing as a cacheable process execution as per spack/spack#20407 ("hermetic/remoteable/cacheable process executions"):
    • There's a start filesystem state, and process command line args and env, and an end filesystem state.
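
A minimal sketch of that shape as a data structure (field names here are invented for illustration, loosely modeled on the Bazel remote execution API's Action/ActionResult pair):

```python
from dataclasses import dataclass
from typing import Mapping, Tuple


@dataclass(frozen=True)
class ProcessExecution:
    """One cacheable process execution; field names are illustrative only."""

    input_root_digest: str   # merkle digest of the start filesystem state
    argv: Tuple[str, ...]    # process command line args
    env: Mapping[str, str]   # process environment
    output_root_digest: str  # merkle digest of the end filesystem state
```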

If we imagine that a user wants to use spack as a container registry, instead of compiling / building on their host, would this be hard or unreasonable to do?

AFAICT: no!

  1. Implementors of the bazel remote execution API (spack/spack#20407) actually all have much, much less understanding of native ABIs than spack, and there are many companies running their entire CI pipeline on calls through this API without any concept of platform knowledge.
    • This is a known problem but google doesn't have this problem so they're not likely to fix it on their own.
    • Lots of other bazel users may become very interested in contributing to spack.
  2. It sounds like you have a pretty good context of the "container registry" interface @vsoch:
    • Could I ask you to add that to this outline so we can start arguing over equivalences (below)?
    • Is the user still going to be executing spack here? (Not being sure is ok!)

Equivalence of Containers to Other Virtualization Methods (?)

This would mean that the user is allowed to install any container URI available, and the containers in storage would need to be namespaced depending on their URI.
Reproducibility then would not depend on what spack does, but on whether the version of the container changes. We would then need some way to still check the container for ABI compatibility with the host (again focusing on Singularity).

I think that this is very likely to be fungible with compatibility checking we would do elsewhere.

spack/spack#20359 and spack/spack#20407 have some generic methods described for storing filesystem trees with deduplication. One interesting thought I just had was whether we could ever (or always???) offer a completely fungible "on my own filesystem please" vs an "install into/from a container's filesystem" option. I would love to be able to dump the multiple use cases and failure modes of containers you've described above into this document and explicitly aim to meet 100% of container use cases you've mentioned with some virtualization mechanism.

Compilers as Deps and Edge Attributes

I also got excited just now because I believe that your (2) would be one of the most natural ways to implement compilers that require themselves to compile.

Essentially I was seeing a dependency expression started by the ^ sigil as specifying the existence of an edge in the DAG connected to a specific node but with the originating node still to be determined by the concretization process. The square brackets delimit the context in which we can specify edge attributes.

If we can finally take the time to break out all of the implicit assumptions for such edge connections that are currently flattened by dependency type in impl code first (see spack/spack#20523), we may subsequently be able to take the leap of extending the spec syntax with less fear.
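
As a purely hypothetical illustration of what "edge attributes" could look like once unflattened (none of these attribute names are real Spack API; the point is only that deptype becomes one attribute among many, per spack/spack#20523):

```python
# Hypothetical sketch: dependency edges as (parent, child) keys carrying
# an attribute dict, rather than deptype being the only edge property.
edges = {
    ("mpileaks", "callpath"): {"deptypes": ("build", "link")},
    ("callpath", "mpi"): {"deptypes": ("build", "link"), "virtual": "mpi"},
    # A compiler-as-dependency edge might record that its originating
    # node is left to the concretizer, per the ^ discussion above:
    ("mpileaks", "gcc"): {"deptypes": ("build",), "resolved_by": "concretizer"},
}
```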

SDK Concept

  • @alalazo also describes where it breaks down:

Concerning the separation in the syntax of virtual dependencies from normal dependencies: we need to take into account packages that are both a provider of something and a normal dependency at the same time. This might happen with packages that are bundle of many different things, like intel-parallel-studio. The use case is probably related to the SDK concept @scheibelp is working on.

@scheibelp where could I catch up on this work?

Container/Package Composition

  3. Build a package as a container (a derivative of 1, but with just one package, and hopefully with multi-stage builds to minimize the redundancy) and then add the executable container to the path akin to having pulled it (2.)

My response to your (2) above I believe tries to consider what these impl concerns map back to in userspace. Additionally:

Very Specific Bind Mounts, and Proposal to Lean on Any Nearby I/O or Other Distributed Runtime Systems

  4. If we are bind mounting containers to potentially eventually link to a library inside, you could imagine having containers that exist only to serve as bind resources for some set of libraries that might require very hard to satisfy host dependencies.

Are you familiar with the spindle project? I would absolutely love to see whether we can productionize (4) by hooking up containers for bind mounts in spindle and whether they can meet/exceed the existing spindle performance requirements.

  • In general, I would love to actively try to find and plug into process/fs runtime systems (like spindle, Docker, MPI, flux, spack/spack#20359 "FUSE virtual fs for spack builds", spack/spack#20407 "hermetic/remoteable/cacheable process executions") that we already use at Livermore or elsewhere, for several specific reasons:
    • People like @mplegendre are likely to have encountered a lot of overlapping performance concerns, and could tell us whether the thing we want to do can be viewed as an instance of some more general problem we can read more papers on.
      • If they haven't prioritized laptop performance, we can do that work.
    • Because I want to understand if existing state of the art from the corporate world (like #20407) is way behind HPC systems.
    • Because I want to have the possibility of using HPC systems to execute spack builds.

Followup Work

I think at the conclusion of implementing (4) I would hope to have some form of all of:

@cosmicexplorer

cosmicexplorer commented Jan 11, 2021

Re:

For point 3, this is similar to the idea of having isolated environments for other needs too (I remember the discussion about pex, for example). It would allow the user to have a combination of natively built packages and containers "for all those other use cases where I want to keep things separate."

I think this is a REALLY REALLY INTERESTING THOUGHT!!! One amendment I might make to this is just to note that the concepts of "keeping all of your dependencies" vs "keeping some/none of your dependencies" apply equally to spack binary packages as well as containers.

Some things that would actually justify further investment earlier in spack/spack#20359 or spack/spack#20407 would be:

  • finding some clear, repeatable indication that performance and/or throughput could be drastically improved.
  • identifying saved disk space by virtualization.

I think one really great use of containers here would be to let me basically forget 100% about those tickets: either by demonstrating that one of these improvements can be shown more quickly with a container approach, or by demonstrating that the container approach takes far less work to maintain and develop. Especially if we can dig in (sometime this week? :D) to differences between containers vs environments -- it wouldn't be a failure to find several.

@cosmicexplorer

Just coming back to this specific framing again:

If we imagine that a user wants to use spack as a container registry, instead of compiling / building on their host, would this be hard or unreasonable to do?

I am pretty sure it's correct that this seems significantly more maintainable and immediately useful to users than the alternatives, while requiring fewer prerequisites on the host to actually execute successfully.

And separately, I think those two issues spack/spack#20359 and spack/spack#20407 can now be considered as validation of your idea of a "container registry", which I see as containing basically a merkle tree CAS (content-addressable store) for composable filesystem objects (as opposed to FUSE, which just gives you a directory).

@tgamblin
Contributor

First replying directly to @vsoch's original post:

  1. For running this container, we'd need to be checking against the host for ABI compatibility, per what is written into the current spec.

Yes! I think there are a few pieces here:

  1. I am hoping that if the container is built with Spack, Spack's DAG representation will give us what we need to know about the packages.
  2. If we don't have that, we need to be able to look at executables and libraries (on the host or in the container) and get ABI and dependency information for them. I think this gets us what I'll call an LDG (Library Dependency Graph), and each library in the graph can have ABI information attached to it (see the sketch after this list). This is what some of the folks on the Wisconsin collaboration (like @bigtrak, @woodard) have started working on.
  3. I think given two LDGs we have what we need to do a compatibility check, and to determine whether the host libraries can function properly in the container.
  4. If we're going to do more than this, we need to actually change the stack in the container. That requires installation, so what I want to do is map the LDG to something like what Spack builds internally, which I guess I'll call a Package Dependency Graph (PDG).
  5. If we can do that, we can start talking about using a solver to pick versions of packages that will be compatible with the system. What I'd like to come away with is a solver that can find a set of packages that a) work in the container and b) are compatible with the bindmounted host libraries.
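
For concreteness, a minimal sketch of recovering one library's dependency edges for such an LDG. This assumes the third-party pyelftools package; it is not the BUILD tooling itself, and a real LDG would also attach ABI information (e.g., from libabigail) to each node:

```python
# pip install pyelftools; this only recovers DT_NEEDED edges, not ABI info.
from elftools.elf.dynamic import DynamicSection
from elftools.elf.elffile import ELFFile


def needed_libraries(path):
    """Yield the soname of each shared library `path` links against."""
    with open(path, "rb") as f:
        elf = ELFFile(f)
        for section in elf.iter_sections():
            if isinstance(section, DynamicSection):
                for tag in section.iter_tags():
                    if tag.entry.d_tag == "DT_NEEDED":
                        yield tag.needed
```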

The hope is that the model document will give us a model and associated semantics we need to understand this. So to some extent, I think container analysis should just fall out of the model -- containers are just collections of binaries in a system image. The hope is to define things more generally than that.

  2. Obtain a package as a container (not sure this exists).

This one I'm not as sure about. It's an interesting idea, but my question is why? We have binary packages in Spack, which contain the installation of a single package, without other dependencies included. So:

  1. I think a container would contain more than we need to reason about a single package -- a container is really more like an instantiation of a whole environment.
  2. The binaries in a container are generally built to work in that container. The worry I'd have about bindmounting libs from arbitrary containers is that we'd have to do this whole compatibility analysis again, and we'd need a uniform way to deal with things like relocation.
  3. Spack binary packages (build caches) already deal with many of these issues, and they contain only what you need to use a single package. They're also relocatable and we have code to do that. We currently rewire RPATHs, shebangs, raw prefixes in text files, and (if the build prefix is long enough) paths in the strings section of binaries. (A minimal RPATH-only sketch follows.)
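
For illustration only, a hypothetical RPATH-only relocation sketch, using the patchelf CLI as a stand-in; Spack's real relocation code (spack.relocate) covers the other cases listed in item 3:

```python
import subprocess


def reroot_rpath(binary, old_prefix, new_prefix):
    """Rewrite one binary's RPATH entries via patchelf (assumed on PATH).

    This sketch does only the RPATH piece; Spack also rewires shebangs,
    raw prefixes in text files, and strings inside binaries.
    """
    rpath = subprocess.check_output(
        ["patchelf", "--print-rpath", binary], text=True
    ).strip()
    subprocess.check_call(
        ["patchelf", "--set-rpath", rpath.replace(old_prefix, new_prefix), binary]
    )
```
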
  3. Build a package as a container (a derivative of 1, but with just one package, and hopefully with multi-stage builds to minimize the redundancy) and then add the executable container to the path akin to having pulled it (2.)

This one's interesting as a potential way to isolate dependencies of programs on the command line, and it is more isolation than you get from RPATH (in that each executable can effectively run in its own environment). So it would be interesting to look at the cases that need this.

That said, the model is supposed to be lower level than this. We're really getting into the mechanisms used to resolve and find dependencies here (dependency resolution/concretization, compile time search paths, runtime search paths, hard-coded paths, etc.) which are all used in one way or another in a container, but the container's a higher-level concept.

If we imagine that a user wants to use spack as a container registry, instead of compiling / building on their host, would this be hard or unreasonable to do? A "build" really comes down to ensuring the container technology is installed, and there is a means to pull based on a specific hash or tag, and then have containers as executables on the path (Singularity) or run them (Docker, less likely for HPC, but podman and friends are just around the corner).

We've had people ask for this -- they want bundled HPC apps in containers. I think it's a neat idea but higher level than what we'll initially be looking at here.

We do currently have a public binary cache here: https://oaciss.uoregon.edu/e4s/inventory.html. I think the initial problem to solve will be how to put binary packages together (inside or outside of a container) to match the packages on the host.

  4. If we are bind mounting containers to potentially eventually link to a library inside, you could imagine having containers that exist only to serve as bind resources for some set of libraries that might require very hard to satisfy host dependencies.

I can't think of a place where we need this yet but this is pretty similar to just installing a binary package in the container. I think we should look at which makes more sense.

@tgamblin
Contributor

tgamblin commented Jan 12, 2021

Ok responding to @cosmicexplorer:

High-level: one of the goals of the model is going to be to clarify discussions like this, as there is clearly some variation in terminology. Hopefully writing this up will solidify some of the parts that are maybe confusing/not fleshed out so far.

Packages/containers

Because I really want to be able to do the analogous "expose a concretized (?) spack environment, fully built (?), as a dependency"

  • Which other spack environments can import during a build stage without having to concretize and download all of its recursive dependencies every time!!

I mentioned above that we want to be able to resolve using existing, built dependencies. We are already working on this; there is:

  • a binary cache at https://oaciss.uoregon.edu/e4s/inventory.html, and
  • an index of what's available there here: https://cache.e4s.io/build_cache/index.json

That index is a bunch of already-concretized specs, which we can download and use to concretize more specs. Effectively those would be inputs to the solver. The format there is pretty much the same as the Spack database file (opt/spack/.spack-db/index.json) because both a Spack installation and a build cache are effectively collections of concrete, already-built packages. One's a bunch of installation prefixes in $spack_prefix/opt, and the other is a bunch of binary packages.
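
For concreteness, a hedged sketch of reading such an index; the key names below ("database", "installs", etc.) reflect the on-disk format of recent Spack versions but are not a stable public API:

```python
import json

# Load a Spack database index and list its concrete, installed specs.
with open("opt/spack/.spack-db/index.json") as f:
    index = json.load(f)

for dag_hash, record in index["database"]["installs"].items():
    # record["spec"] is the concrete spec as nested JSON;
    # record["path"] is the installation prefix.
    print(dag_hash, record["path"], record.get("installed"))
```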

We will be using the indices for Spack installations and binary mirrors to tell the solver what is available, and to maximize reuse of existing binaries from those places.

The goal of BUILD is to make it possible to use system packages in a similar way, by getting enough provenance for them that we can use them as inputs to a solver. So think of the binary analysis part of the project as an attempt to translate a typical Linux image into some kind of description like these indices.

@becker33 could you speak on whether "not downloading recursive dependencies of a binary package" is or could be a goal of the binary packages development?

I'm confused by this because we already install dependencies of binary packages recursively. What we cannot do is easily swap them -- right now you have to install the exact, concrete dependencies of a binary package when you install the package. What we want is to pick and choose -- e.g., if we want to use system zlib with a Spack package, we should be able to verify that it is ABI-compatible and re-wire things so that the installation uses system zlib. We should also maintain all provenance so that we know how we built that and what is deployed. @nhanford is working on the new provenance model.

On depending on environments: I don't know what this means, and I think it makes the dependency model less precise. Environments in Spack are really just a group of packages. They're "realized" in the filesystem by a "view", which in the simple case is just a unified prefix (though it can have projections that enable you to map installed packages into different layouts).

"Depending on an environment" is likely to be vague -- because you're not saying which packages in the environment you want. I think we should keep dependencies at the package level, and (potentially) you could satisfy them with environments whose component packages meet the requirements.

@tgamblin
Contributor

tgamblin commented Jan 12, 2021

@cosmicexplorer

Hermetic Process Executions via Containers

A "build" really comes down to ensuring the container technology is installed, and there is a means to pull based on a specific hash or tag, and then have containers as executables on the path.

So this is tempting and this is what a lot of systems do, but keep in mind that the goal with BUILD is to build a provenance model that can represent any build. A container build is one kind of build, specifically for the OS in the container.

In HPC, we can't assume that containers are a requirement. We need something more fundamental so that we reason about both containerized builds and bare metal builds in the same way.

@tgamblin
Contributor

tgamblin commented Jan 12, 2021

Express Specs in Terms of Environments Composed of Merging Package Operations

I don't think I fully understand the title here, but this is not what Specs are. They're not expressible in terms of environments; environments are expressible in terms of specs.

Hammer out a concept of "possibly-abstract spec" composed as a tree (from a parsed AST which may include e.g. edge attributes as per spack/spack#20523) which would be able to represent:

  • the conditional dependency of a depends_on('a@2', when='@3') directive in a package.py.

I don't understand how this follows from or is motivated by the other concerns on this thread, so I think you've lost me here.

I personally can quickly implement anything we might want to change about pip (or pex).

We will not be working on pip, pex, etc., at least directly, as part of BUILD. We're working on Spack.

Define a set of operations which cover every action we perform at the package level of abstraction, in order, when running spack install.

  • Define which operations are possible from each state, and what a "state" is.

Ok, we're kind of getting somewhere w.r.t. the model here. But we are not building a model of installation. We are trying to describe packages' dependency relationships without getting into the mechanism of installation or, really, any kind of ordered, imperative operations. The installation process, or at least its implementation, is out of scope for the model document. The constraints imposed on builds by types of dependencies are not.

I realize that is probably a bit confusing at first glance, but think of this as specifying the "what" and not the "how". Any kind of installation process is going to be the "how". Containers are also probably in the category of "how" w.r.t. isolation, installation structure, deployment model, etc.

More on @vsoch's original thread:

Reproducibility then would not depend on what spack does, but on whether the version of the container changes

Reproducibility in Spack comes from the metadata hashes -- essentially, the build configuration, which is then run through our (templated) package.py files. The package.py files are responsible for two things:

  1. describing the space of possible builds (the metadata parts)
  2. building a concrete spec (the methods, like install(self, spec, prefix), that build the spec).

We really do not want to tie any Spack description to a specific container artifact that can change over time. Ideally, we'd have a representation (concrete specs) that can tell us how the old version of the container and the new version of the container are different. We want a package-level description, NOT an image.
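
As a small illustration of "reproducibility from metadata hashes" (assuming a Spack checkout on PYTHONPATH or running under spack python; Spec.concretized() and dag_hash() are real Spack APIs, though exactly what feeds the hash has evolved across Spack versions):

```python
from spack.spec import Spec

# The hash identifies the build *configuration* -- a package-level
# description -- rather than any particular image or artifact.
spec = Spec("zlib@1.2.11 %gcc@10.2.0").concretized()
print(spec.dag_hash())
```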

Rest of the post above

I really don't think much of the rest of @cosmicexplorer's post above is in scope for the model document. In particular, we're not talking about virtualization, deduplication, filesystems, or sandboxing as part of this work.

It may be that certain types of builds require these things, but these are the how, not the what. We want the "what".

@tgamblin
Contributor

tgamblin commented Jan 12, 2021

RE:

If we imagine that a user wants to use spack as a container registry, instead of compiling / building on their host, would this be hard or unreasonable to do?

I am pretty sure it's correct that this seems significantly more maintainable and immediately useful to users than the alternatives, while requiring fewer prerequisites on the host to actually execute successfully.

We need to keep in mind here that HPC users still by and large build on the host. Building externally is in many cases not an option -- e.g., if the host hardware is not available in the build farm. This happens a lot in HPC. Again, though, this stuff is "how" not "what".

@vsoch
Member Author

vsoch commented Jan 12, 2021

a binary cache at https://oaciss.uoregon.edu/e4s/inventory.html, and
an index of what's available there here: https://cache.e4s.io/build_cache/index.json

This is neat! A random idea - is this API custom for spack? I'm wondering if there is some similarity between this cache idea and other standards for registries (e.g., the distribution spec). We would want it to be easy for other package managers to follow suit, meaning creating a standard for the structures and different API calls possible to this cache (and then the registry can customize the user facing interfaces however they need).

For the container points, I think it's unlikely that you'd find a container in the wild built with spack. If it's a read-only container, you would be unlikely to be able to re-build without the original recipe, so at best we would just assess compatibility with the host. If it really does need something installed inside, I think we'd have to extract as much as we can about what's in the container and build it again (given read only).

If we can do that, we can start talking about using a solver to pick versions of packages that will be compatible with the system. What I'd like to come away with is a solver that can find a set of packages that a) work in the container and b) are compatible with the bindmounted host libraries.

I've been thinking about this recently - at least for libraries that have dependencies specified by the creator (e.g., Python) we are placing a lot of emphasis on those requirements. But I'm not convinced they are always 1) complete, 2) correct, or 3) have the proper flexibility. There is a huge human element in this process that can lead to the solver failing (e.g., pip) when the constraints are too strict.

The hope is that the model document will give us a model and associated semantics we need to understand this. So to some extent, I think container analysis should just fall out of the model -- containers are just collections of binaries in a system image. The hope is to define things more generally than that.

This makes sense, and I agree. Containers are not special, just one level of abstraction using the same ideas.

This one I'm not as sure about. It's an interesting idea, but my question is why? We have binary packages in Spack, which contain the installation of a single package, without other dependencies included. So:

I was thinking for use cases that are hard if not impossible to install, such as Windows apps with wine, or some use case where the host OS absolutely won't work. Does spack even work with Windows? For relocation, it would have to be a rebuild of the container (which we could call an "install").

This one's interesting as a potential way to isolate dependencies of programs on the command line, and it is more isolation than you get from RPATH (in that each executable can effectively run in its own environment). So it would be interesting to look at the cases that need this.

That said, the model is supposed to be lower level than this. We're really getting into the mechanisms used to resolve and find dependencies here (dependency resolution/concretization, compile time search paths, runtime search paths, hard-coded paths, etc.) which are all used in one way or another in a container, but the container's a higher-level concept.

A container really could just be considered another type of install, so nothing special would need to be said in this model.

We've had people ask for this -- they want bundled HPC apps in containers. I think it's a neat idea but higher level than what we'll initially be looking at here.

Gotcha.

We do currently have a public binary cache here: https://oaciss.uoregon.edu/e4s/inventory.html. I think the initial problem to solve will be how to put binary packages together (inside or outside of a container) to match the packages on the host.

I think this could be helped by figuring out how this would work in practice - e.g., if spack were to hand off some set of information to this API, what would that endpoint look like in terms of metadata it needs, and what would the response look like? It's sort of similar to asking for a container URI, and ultimately getting a manifest and then layers, except here we would be ultimately getting some list of binaries. I'm not familiar with this cache at all, so this is just introspection.

@tgamblin
Contributor

This is neat! A random idea - is this API custom for spack? I'm wondering if there is some similarity between this cache idea and other standards for registries (e.g., the distribution spec). We would want it to be easy for other package managers to follow suit, meaning creating a standard for the structures and different API calls possible to this cache (and then the registry can customize the user facing interfaces however they need).

YES. One of the major goals of this document is to get a written description of what needs to go into such a spec. Spack is, at least AFAIK, kind of a superset of what other package managers provide. If we can spec out the format (something you're kind of already working on) I think we could make a really general metadata model. The difference here, with this project, is that we're also trying to spec out ways to reason about such a model.

@tgamblin
Contributor

For the container points, I think it's unlikely that you'd find a container in the wild built with spack.

Agree. Though I think I was vague. Spack can't actually build a whole container; it generates a recipe, so yeah you'd build with something else, and you'd have to start with a base image as we do not have packages for everything. There are people building apps in containers with Spack (which is really what I mean here).

I do think that the major use case here is adapting the container to the host and not using containers as a way to reuse packages in other ecosystems. I think the container's the final artifact - the thing you make with a package manager or maybe a build. Binaries from container images are, I think, not very well suited to plugging into other ecosystems (which is kind of the point of this exercise -- bridging the container/host interface).

@tgamblin
Contributor

tgamblin commented Jan 12, 2021

I've been thinking about this recently - at least for libraries that have dependencies specified by the creator (e.g., Python) we are placing a lot of emphasis on those requirements. But I'm not convinced they are always 1) complete, 2) correct, or 3) have the proper flexibility. There is a huge human element in this process that can lead to the solver failing (e.g., pip) when the constraints are too strict.

Yes! Though that doesn't mean we shouldn't have a way to express the correct requirements. The goal here is to come up with a model that can be specified by humans or by machines -- and to get us to a place where the latter happens more often than the former.

@tgamblin
Contributor

I was thinking for use cases that are hard if not impossible to install, such as Windows apps with wine, or some use case where the host OS absolutely won't work. Does spack even work with Windows?

Spack doesn't work on windows yet but we have a contract with Kitware and TechX to work on that -- it's happening now and the model is different (PE is, sadly, not ELF).

For relocation, it would have to be a rebuild of the container (which we could call an "install").

Yep. I think rebuilding is one way to do it. Though TBH you could probably relocate .so's from the container if not for the need to rewrite strings in the binary. The bigger issue IMO is that the installations in a container aren't really separated into packages you can understand -- you need to go down to the package manager level to see that (there was a tool once that I can't find anymore that queried intra-container package managers in a sort-of general way, but we want to do this w/o having to write adaptors for every package manager).

@tgamblin
Contributor

A container really could just be considered another type of install, so nothing special would need to be said in this model.

Yes!

I think this could be helped by figuring out how this would work in practice - e.g., if spack were to hand off some set of information to this API, what would that endpoint look like in terms of metadata it needs, and what would the response look like? It's sort of similar to asking for a container URI, and ultimately getting a manifest and then layers, except here we would be ultimately getting some list of binaries. I'm not familiar with this cache at all, so this is just introspection.

Yes! We should talk about this pretty soon after you start. There is also the question of how to query an ecosystem of packages like this, how to put stuff like that json file in a database and expose the right query semantics, etc. This is definitely the direction we want to go in to interface between the analysis (taking raw binaries and coming up with ABI specs) and the solvers (which would query these things). e.g., suppose you want to solve for a container that works with the host MPI and CUDA installation on some HPC system. The solver could query a binary cache and ask for packages that a) are possible dependencies of the application to be built, and b) satisfy some bounds that we know a priori, like "only for these OS's and targets".
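
Purely as a sketch of the shape such a solver-to-cache query might take (no such endpoint exists today; every field name here is invented):

```python
# Hypothetical query payload from a solver to a binary cache.
query = {
    "package": "mpileaks",            # a possible dependency of the app
    "constraints": ["^mpi", "%gcc"],  # a-priori bounds from the solve
    "os": "rhel8",                    # host OS, known up front
    "target": "x86_64",               # host architecture, known up front
    "abi_compatible_with": [          # host libraries to be bindmounted
        "/usr/lib64/libmpi.so.40",
        "/usr/local/cuda/lib64/libcudart.so.11.0",
    ],
}
```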
