Cog base images #1605

Merged
merged 27 commits into from
Apr 12, 2024

andreasjansson
Member

@andreasjansson andreasjansson commented Apr 2, 2024

This PR adds a --use-cog-base-image flag to cog {build,push,run,debug} that builds the Cog image from one of our supported base images.

Base images can be built using the new baseimage command, which is being deployed by @tempusfrangit on Cloud Build. Inside each base image is:

  • A particular version of Python
  • Optionally, a particular version of CUDA/CuDNN
  • Optionally, a particular version of Torch (TensorFlow is not supported because it is increasingly uncommon, at least on Replicate.com)
  • A list of popular system packages
  • Cog
  • Tini

The set of currently supported base images has been assembled by looking at the metadata of popular public models on Replicate.com.
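
As a rough illustration of how a base image could be identified by the components listed above, here is a minimal Python sketch. The `BaseImage` class, the `find_base_image` helper, and every version number are hypothetical; none of them are taken from Cog's implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BaseImage:
    """Hypothetical description of one base image (not Cog's actual type)."""
    python_version: str
    cuda_version: Optional[str] = None   # None means a CPU-only image
    torch_version: Optional[str] = None  # None means Torch is not preinstalled


# Illustrative entries only; the real supported set comes from model metadata.
SUPPORTED = {
    BaseImage("3.11"),
    BaseImage("3.11", cuda_version="12.1"),
    BaseImage("3.11", cuda_version="12.1", torch_version="2.1.0"),
}


def find_base_image(python, cuda=None, torch=None):
    """Return the matching supported base image, or None if there is none."""
    candidate = BaseImage(python, cuda, torch)
    return candidate if candidate in SUPPORTED else None
```

A frozen dataclass is hashable, so the supported combinations can live in a set and be looked up by value.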

When building from a base image, we only install the user's system packages and Python packages, and execute the user's run commands (excluding any packages that are already present in the base image). Python packages are not installed with a multi-stage build because of various pyenv complexities; we can fix that later if we need to.
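
The exclusion step described above can be sketched as follows. This is a simplified illustration, not Cog's actual logic: the function name `packages_to_install` is hypothetical, and it assumes packages are compared as exact requirement strings, whereas a real implementation would likely need to compare names and versions more carefully.

```python
def packages_to_install(user_packages, base_image_packages):
    """Return the user's packages that the base image does not already provide.

    Both arguments are lists of exact requirement strings such as
    "torch==2.1.0". The order of the user's list is preserved.
    """
    provided = set(base_image_packages)
    return [pkg for pkg in user_packages if pkg not in provided]


# Example data is made up for illustration.
user = ["torch==2.1.0", "pillow==10.2.0", "numpy==1.26.4"]
base = ["torch==2.1.0", "numpy==1.26.4"]
print(packages_to_install(user, base))
```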

@andreasjansson andreasjansson marked this pull request as draft April 2, 2024 11:54
@andreasjansson andreasjansson force-pushed the cog-base-images branch 2 times, most recently from 49a750f to 9308122 Compare April 2, 2024 12:25
@andreasjansson andreasjansson marked this pull request as ready for review April 5, 2024 17:23
@tempusfrangit tempusfrangit force-pushed the cog-base-images branch 2 times, most recently from c9b5661 to 97e62ff Compare April 7, 2024 07:25
@bfirsh
Member

bfirsh commented Apr 7, 2024

Wahey. A couple of things:

  • I presume that this will be the default at some point? Is there any reason why it shouldn't be the default now so other people building models for Replicate will get speedups automatically?
  • The baseimage command feels like an internal thing. Could we make it hidden so it doesn't confuse users and so we don't have to worry about the design?

andreasjansson and others added 20 commits April 8, 2024 17:22
Signed-off-by: andreasjansson <andreas@replicate.ai>
Signed-off-by: andreasjansson <andreas@replicate.ai>
Signed-off-by: andreasjansson <andreas@replicate.ai>
And make config_test.go use smaller YAML indentation. My editor replaces spaces with tabs in Go files...

Signed-off-by: andreasjansson <andreas@replicate.ai>
Update and add the Build Action to build bases.

This implements the following:
* Build Action
* Docker Layer Cache
  * Argument to specify the Docker layer cache within the GH Actions cache
* Matrix output to support a matrix strategy for building the image
* Dockerfile cache key generator (creates a cache key based on the
  Dockerfile's SHA256)
* Remove BuildKit version validation, as the output is highly variable;
  the push should only be done from a highly controlled environment (CI)

This is all in support of consistent and reproducible builds. We may
need to think about the docker file content cache as it may make
security updates more challenging.

Image push is currently a stub and does not actually push images.

Signed-off-by: Morgan Fainberg <code@tempusfrang.it>
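
The Dockerfile cache key idea from the commit message above can be sketched in a few lines of Python. This is an assumption-laden illustration rather than the code in this PR: the function name `dockerfile_cache_key` and the `dockerfile` prefix are hypothetical.

```python
import hashlib


def dockerfile_cache_key(dockerfile_contents: str, prefix: str = "dockerfile") -> str:
    """Derive a cache key from the SHA256 digest of the Dockerfile text.

    The prefix is a hypothetical naming choice, not taken from the PR.
    """
    digest = hashlib.sha256(dockerfile_contents.encode("utf-8")).hexdigest()
    return f"{prefix}-{digest}"
```

Because the key changes only when the Dockerfile text changes, an unchanged Dockerfile keeps hitting the same cache entry even when upstream packages have been updated, which is the security-update concern the commit message raises.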
Signed-off-by: andreasjansson <andreas@replicate.ai>
Add improvements to the GH action and use the actual build actions. This
alleviates the potential issues with pushing later on.

Additionally, set up the scaffolding for reproducible builds (commented
out for now).

Signed-off-by: andreasjansson <andreas@replicate.ai>
* Separate Tags and Image name in the Matrix
* Use APT Cache for reproducible build
  * Consider adding further caches
* Restrict usage of the workflow
* Upload to GHCR (an additional workflow will migrate the content to
  r8.im)
* Add scoped cache that includes image-name and tag.
* Use proper metadata extraction

Signed-off-by: Morgan Fainberg <code@tempusfrang.it>
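
To illustrate what keeping the image name and tags separate in the matrix could look like, here is a minimal Python sketch that emits a JSON matrix using the `include` form that GitHub Actions accepts. The field names `image-name` and `tags`, the function `build_matrix`, and the example values are hypothetical and are not taken from the workflow in this PR.

```python
import json


def build_matrix(images):
    """Build a JSON matrix string from (image_name, [tags]) pairs.

    Each entry keeps the image name and its tags in separate fields so a
    workflow step can use them independently (for example, in a cache scope).
    """
    include = [
        {"image-name": name, "tags": ",".join(tags)}
        for name, tags in images
    ]
    return json.dumps({"include": include})


# Made-up example values.
print(build_matrix([("example-base", ["python3.11", "latest"])]))
```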
Signed-off-by: andreasjansson <andreas@replicate.ai>
Signed-off-by: andreasjansson <andreas@replicate.ai>
Signed-off-by: andreasjansson <andreas@replicate.ai>
Signed-off-by: andreasjansson <andreas@replicate.ai>
Signed-off-by: andreasjansson <andreas@replicate.ai>
Signed-off-by: andreasjansson <andreas@replicate.ai>
Remove the unsupported base combination

Signed-off-by: Morgan Fainberg <code@tempusfrang.it>
Signed-off-by: andreasjansson <andreas@replicate.ai>
Signed-off-by: andreasjansson <andreas@replicate.ai>
Signed-off-by: andreasjansson <andreas@replicate.ai>
Signed-off-by: andreasjansson <andreas@replicate.ai>
Signed-off-by: andreasjansson <andreas@replicate.ai>
Signed-off-by: andreasjansson <andreas@replicate.ai>
Signed-off-by: andreasjansson <andreas@replicate.ai>
Signed-off-by: andreasjansson <andreas@replicate.ai>
Signed-off-by: andreasjansson <andreas@replicate.ai>
Signed-off-by: andreasjansson <andreas@replicate.ai>
@andreasjansson
Member Author

> I presume that this will be the default at some point? Is there any reason why it shouldn't be the default now so other people building models for Replicate will get speedups automatically?

I want to give this to the models team for a week to test, since it's quite a large change and we might need to support more configurations of base images. But yes, once we've validated it internally we should make it default.

> The baseimage command feels like an internal thing. Could we make it hidden so it doesn't confuse users and so we don't have to worry about the design?

Good shout. @tempusfrangit mentioned this as well; we should hide that.

tempusfrangit and others added 2 commits April 9, 2024 11:08
* Torch 2.0.2 is no longer available
* Torch 0.1.5 is not available for modern CUDA (e.g. > 10)
* Torch 2.20.0 was clearly a typo, 2.20 does not exist

Signed-off-by: Morgan Fainberg <code@tempusfrang.it>
Signed-off-by: andreasjansson <andreas@replicate.ai>
Signed-off-by: andreasjansson <andreas@replicate.ai>
Signed-off-by: andreasjansson <andreas@replicate.ai>
Signed-off-by: andreasjansson <andreas@replicate.ai>
@andreasjansson andreasjansson merged commit 72398c7 into main Apr 12, 2024
15 checks passed
@andreasjansson andreasjansson deleted the cog-base-images branch April 12, 2024 09:26
4 participants