
Enable always writing cache to support hermetic build systems #109

Open
wchargin opened this issue Nov 21, 2019 · 36 comments

Labels: enhancement (New feature or request), stale

@wchargin

I’d like to use actions/cache to cache my Bazel build state, which
includes dependencies that have been fetched, binaries and generated
code that have been built, and results for tests that have run. Bazel is
a hermetic build system, so the standard Bazel pattern is to always use
a single cache. Bazel will take care of invalidation at a fine-grained
level: if you only change one source file, it will only re-build and
re-test targets that depend on that source file.

Thus, the pattern that makes sense to me for Bazel projects is to always
fetch the cache and always store the cache. We can always fetch the
cache by using a constant cache key, but then the cache will never be
stored. Bazel doesn’t have a single package-lock.json-style file that
can be used as a cache key; it’s the combination of all build and source
files in the whole repository. We could use the Git tree (or commit)
hash as a cache key, but this would lead to storing a mountain of
caches, too, which seems wasteful.

Ideally, the fetched cache would be taken from origin/master, but
really taking it from any recent commit should be fine, even if that
commit was in a broken or failing state.

On my repository, it takes 33 seconds to save the Bazel cache after a
successful job, but on a clean cache it takes 2 minutes to fetch remote
dependencies and 26 minutes to build all targets. I would be more than
happy to pay those 33 seconds every time if it would save half an hour
in the rest of the build!

For comparison, on Travis we achieve this by simply pointing to the
Bazel cache directory:
https://github.com/tensorflow/tensorboard/blob/1d1bd9a237fe23a3f2c31282ab44e7dfbcac717c/.travis.yml#L30-L32
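
For context, the referenced Travis configuration amounts to something like the following sketch (the cache path is an assumption about Bazel's default output root; Travis simply re-uploads any listed directory after each build):

```yaml
# Sketch of the referenced .travis.yml lines (path is an assumption):
cache:
  directories:
    - $HOME/.cache/bazel
```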

@chrispat
Member

@wchargin this is an interesting topic; thanks for bringing it up.

In this example, if we had a way to skip storing the cache unless the run was on master, you could use the git commit as part of your key and get the desired behavior without writing a new cache for each run of a pull request.

Do you think that would work for you?
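
The suggested policy can be sketched with today's split sub-actions (actions/cache/restore and actions/cache/save, which did not exist when this comment was written); the path and key prefix below are placeholders:

```yaml
# Restore from any recent commit's cache, but only write a new
# cache on pushes to master (path and key prefix are placeholders).
- uses: actions/cache/restore@v4
  with:
    path: ~/.cache/bazel
    key: bazel-${{ github.sha }}
    restore-keys: bazel-
- run: bazel test //...
- if: github.ref == 'refs/heads/master'
  uses: actions/cache/save@v4
  with:
    path: ~/.cache/bazel
    key: bazel-${{ github.sha }}
```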

wchargin added a commit to wchargin/tensorboard that referenced this issue Nov 21, 2019
Summary:
GitHub Actions is a new first-party CI service offered by GitHub. It
requires no extra permissions. Its concurrency limits are appealing,
at 20 workflows per repo (1 workflow ≈ 1 commit) and concurrent jobs
ranging from 20 (free tier) to 180 (enterprise tier), with the option to
run on your own servers if this isn’t enough.

This commit adds a workflow definition for our CI. It’s similar to our
existing Travis workflow, except that it only runs on Python 3.6 for now
due to a bug in the Python 2.7 runtime that has been fixed on GitHub’s
end but not yet deployed (see note inline). I also added a run of our
self-diagnosis script for good measure. (The diagnosis script always
exits successfully, and runs in about 4 seconds.)

The high job concurrency limits let us save some time by running the
lint steps in parallel and just once rather than sequentially and in
every cell of the build matrix. The GitHub Actions VMs appear to have
very little overhead: the entire elapsed time for the `lint-yaml` job is
12 seconds, of which 6 seconds is checking out the repo. Empirically,
there is very little latency (order of seconds) between pushing a commit
and seeing real work being done on the VMs.

GitHub Actions offers caching. From what I can glean, each cache
directory (e.g., “the Bazel cache” or “the Node cache”) is tarred and
gzipped; each such archive must not exceed 400 MB. This is enough space
to cache our `node_modules` and our Bazel state.\* But I haven’t done
so, pending (a) clarity on the recommended way to cache Node modules and
(b) better support for Bazel-style unicaches (see notes inline). Even
without any caching, the total workflow time is still about the same as
the best-case Travis build time because of the improved concurrency.

\* Sometimes: in my tests, sometimes Bazel could be cached successfully,
and other times it was well over the limit (595 MB out of 400 MB). I’m
not quite sure what that’s about.

[nm]: actions/cache#67
[bzl]: actions/cache#109

Test Plan:
Note that this commit triggers a GitHub Actions workflow that succeeds.

wchargin-branch: gh-actions
@wchargin
Author

wchargin commented Nov 22, 2019

@chrispat: Yeah, that sounds reasonable! At a glance, I don’t see a way
to save the cache only if it’s running on master… but perhaps I could
hack something together that restores the cache to its initial state at
the end of the job for builds that aren’t running on master—just as
a proof of concept to see how this strategy works.

If I understand correctly, we’d still be proliferating caches with each
commit to master, right? I understand that cache eviction kicks in, but
it still seems unfortunate, especially if I have to worry about other
caches (e.g., node_modules) being evicted prematurely.

@hvr

hvr commented Nov 22, 2019

For the record, cabal's Nix-style store/cache also falls into this category; see my comment at #38 (comment)

@chrispat
Member

@wchargin given that the version of the sources is part of the Bazel caching algorithm, what key do you think should be used to prevent a huge number of updates? My assumption is that Travis is uploading a new cache essentially every build if they are just looking at changes to the cache directory.

@wchargin
Author

Yes, Travis uploads new caches every build. And you’re right that this
is a performance problem: Travis re-uploads the entire cache directory
from scratch every build, which can take minutes. (Also, the build
doesn’t report success until this upload has completed, and this
upload can cause an otherwise successful build to time out and fail,
which is super frustrating…)

We do want to update the cache on every build, but it should be cheap to
perform a partial update of only the files that changed, rsync-style.
The action cache will be updated on basically every commit, but is small
(~500K). The fetch caches will be updated very rarely, and can be large
(hundreds of MB). And the build cache for any given target will be
updated whenever that target changes, but not if only unrelated targets
change, and can be of varying sizes (typically fairly small, but there
are lots of them).

I see that actions/cache currently tars and gzips everything into a
single bundle, but it would be much more effective for caches in the
style of Bazel/Nix/Cabal to support incremental updates, perhaps by
using a content-addressable store like that of Git itself. What do
you think?

@chrispat
Member

For something like Bazel, I wonder if having a truly remote cache is actually a better option: https://github.com/buchgr/bazel-remote. This is not something we are going to get around to implementing anytime soon, but it is something we can consider for the future.

The model we have for caching enables the user to control the key and also requires that all caches are immutable by key. While that is not ideal for all scenarios, it generally works well for a large number of different technology stacks. This immutable nature makes incremental updates untenable and likely not possible. Even if we could incrementally update the cache, the download on the next run would still have to be the entire cache, since we have to provision a fresh VM for each job.

@mborgerson

mborgerson commented Nov 26, 2019

I believe I have a similar use case to the issue described here, and ideally would like to see an update-cache option added to the action, but I've worked around the issue by leveraging the restore-keys option.

A project of mine consists largely of C files, and naturally a significant portion of my CI cycle time is spent in compilation. To speed things up, I've employed ccache, which will opportunistically recycle previously built object files when it detects that the compilation would be the same for the current build. This has a dramatic performance improvement on CI times. In order to do this though, I need some persistence of storage between workflow runs in order to save and restore ccache's cache directory. Of course, as the code base evolves, the cache of object files will change too.

I was pleased to discover actions/cache, as it fits my use case very nicely; but I was surprised to find that when a cache hit occurs, actions/cache will not attempt to update the cache at all, and there's no option to request such an update.

To work around this, I do the following:

    - name: Initialize Compiler Cache
      id: cache
      uses: actions/cache@v1
      with:
        path: /tmp/xqemu-ccache
        key: cache-${{ runner.os }}-${{ matrix.configuration }}-${{ github.sha }}
        restore-keys: cache-${{ runner.os }}-${{ matrix.configuration }}-

It works like this: when the cache is loaded for a workflow, there will be an initial cache miss, because the cache key contains the current commit SHA. actions/cache will fall back to the most recently added cache via the restore-keys prefix-matching policy, and then, after the build has completed, create a new cache entry to satisfy the initial cache miss.

This solution seems to work very well for me, and hopefully it will be useful to others with a similar use case. Ideally, though, I think actions/cache should just support updating the cache, perhaps to a new immutable revision, as I have done above.

@wchargin
Author

Having the caches be immutable makes a lot of sense. Immutable caches
seem perfectly compatible with incremental updates—in fact, this is a
strong point of Git. If your repository has 100 top-level directories
each with 100 files, then you have 101 trees and 10000 blobs; if you
change just one of those files, then you have 103 trees and 10001 blobs,
not 202 trees and 20000 blobs. Does this make sense, or am I missing
something?
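
The object arithmetic above can be sketched with a toy Git-style content-addressed store (illustrative only; this is not how actions/cache stores anything today):

```python
import hashlib

def blob_id(data: bytes) -> str:
    """Content-address a file's bytes, git-blob style."""
    return hashlib.sha1(b"blob:" + data).hexdigest()

def tree_id(entries: dict) -> str:
    """Content-address a directory listing of {name: object id}."""
    listing = "".join(f"{name} {oid}\n" for name, oid in sorted(entries.items()))
    return hashlib.sha1(b"tree:" + listing.encode()).hexdigest()

def snapshot(repo: dict, store: set) -> str:
    """Add a {dir: {file: bytes}} snapshot to the store. Deduplication is
    free: identical content hashes to the same object id."""
    top = {}
    for d, files in repo.items():
        entries = {name: blob_id(data) for name, data in files.items()}
        store.update(entries.values())   # one blob per distinct file
        top[d] = tree_id(entries)
        store.add(top[d])                # one subtree per directory
    root = tree_id(top)
    store.add(root)                      # the root tree
    return root

# 100 top-level directories, each with 100 distinct files:
repo = {f"d{i}": {f"f{j}": f"{i}/{j}".encode() for j in range(100)}
        for i in range(100)}
store = set()
snapshot(repo, store)
before = len(store)                      # 10000 blobs + 100 subtrees + 1 root
repo["d0"]["f0"] = b"edited"             # change a single file
snapshot(repo, store)
print(before, len(store) - before)       # 10101 3
```

Changing one file adds exactly three objects (the new blob, its directory's subtree, and the root), matching the 10101 → 10104 counting in the comment above.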

A truly remote cache is an appealing option, but comes with a lot more
operational overhead for the user. Storing files is much easier than
running a server.

Downloading the full latest cache on each run may not be perfect, but
it’s still an improvement over rebuilding all the artifacts, faster by
about 20 minutes in my case.

@mborgerson

mborgerson commented Nov 26, 2019

@wchargin I agree that immutability is acceptable on the condition that we can restore and create a new cache as I have described above (though it can be quite wasteful, as you mentioned). My guess is that this particular use case will be desirable for many projects. Perhaps the documentation could simply be updated to demonstrate this type of use case? To me, it wasn't immediately obvious. My suggestion would be to mention using ${{ github.sha }} in the key.

@wchargin
Author

Right; immutability is space-wasteful if the caches are stored
independently (which will happen if you use ${{ github.sha }}) but not
if they’re stored as part of one content-addressed store (which would
require changes to the actions/cache implementation).

@chrispat
Member

A truly remote cache is an appealing option, but comes with a lot more
operational overhead for the user. Storing files is much easier than
running a server.

I was thinking we would run that server on behalf of the user, so the operational overhead should be essentially the same as it would be for the existing cache action. I am not 100% sure that is the best option, but it seems like it might be a really good one for build systems that support it.

@wchargin
Author

A truly remote cache is an appealing option, but comes with a lot more
operational overhead for the user. Storing files is much easier than
running a server.

I was thinking we would run that server on behalf of the user so the
operational overhead should be essentially the same as it would be for
the existing cache action.

Oh, that would be fantastic! Being able to just point Bazel to a remote
cache provided by a GitHub-managed action would be a huge value-add
for us compared to other CI services.

@dsilva

dsilva commented Dec 28, 2019

Similar use case: ~/.cache/sccache for sccache works like the bazel cache. For now it's probably easier to point sccache at S3 or GCS to avoid the issues described above, and it would be nice if GitHub ran an sccache store as well.
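
For anyone taking the S3 route, pointing sccache at a bucket from a workflow looks roughly like this (SCCACHE_BUCKET and SCCACHE_REGION are documented sccache environment variables; the bucket name and secrets are placeholders):

```yaml
env:
  RUSTC_WRAPPER: sccache               # route rustc through sccache
  SCCACHE_BUCKET: my-ci-cache          # placeholder bucket name
  SCCACHE_REGION: us-east-1
  AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
  AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
steps:
  - run: cargo build --release
```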

@jacquesbh

Hi!

I have the same issue I think with composer cache, in PHP.

Composer saves every download it makes in a cache, and most of the time this cache is shared globally.
For example, on a developer machine the cache grows all the time as new releases come out.

I've been using github.sha in my cache keys, since it allows me to re-save the cache and avoid the case where the key hits but new versions of dependencies exist, so they're downloaded on every run because they're not in the cache.

jacquesbh added a commit to monsieurbiz/SyliusAlertMessagePlugin that referenced this issue Mar 24, 2020
@jvolkman

Specifically for Bazel: the cache protocol is pretty simple. I wonder if it would be feasible to write a service that simply proxies to GitHub's own artifact-cache service and stand it up locally? Not sure how many cache keys GitHub allows.

nanddalal added a commit to nanddalal/webapp that referenced this issue May 23, 2022
GitHub Actions caching will never update the cache if there was a cache
hit, but for Bazel we want to do this, since Bazel guarantees hermetic
builds and will update the cache if needed. See
actions/cache#109 for more context. As such,
we adjust the cache-key logic to work better with GitHub Actions, as per
the documentation:
https://docs.github.com/en/actions/using-workflows/caching-dependencies-to-speed-up-workflows#matching-a-cache-key
Now we will always have a cache miss, load the latest restore key, and
then upload to a new cache key. This adds a couple of minutes for saving
the cache, but building from scratch is 12+ min, so it is worth it.
@pramodka-revefi

pramodka-revefi commented Jul 25, 2022

Since this issue is still open, what's the current best practice for Bazel caching + GitHub Actions? Does someone have a snippet of their GitHub workflow they can share?

Update: Just sharing the CI pipeline YAML with caching that we went with; hopefully it'll help the next person who lands on this issue. (A slightly more permissive approach than @nanddalal's above.)

@github-actions

This issue is stale because it has been open for 200 days with no activity. Leave a comment to avoid closing this issue in 5 days.

@github-actions github-actions bot added the stale label Feb 13, 2023
@mihaimaruseac

I still think this is useful to have, and the issue should not be closed.

@yongtang

yongtang commented Feb 13, 2023

I think this will also help reduce the overall cost of compute resources on GitHub actions, as many open source projects can minimize the GitHub actions minutes they use for every run.

@github-actions github-actions bot removed the stale label Feb 14, 2023
@jsoref
Contributor

jsoref commented Mar 27, 2023

So... if you're willing to use an A/B system, you could probably do something like:

- uses: actions/cache/restore@v3
  with:
    key: preferred
    restore-keys: fallback
- run: do-work
- if: no-cache
  uses: actions/cache/save@v3
  with:
    key: fallback
- if: no-cache
  uses: actions/cache/save@v3
  with:
    key: preferred
- if: used-preferred-cache
  uses: ./delete-cache
  with:
    key: fallback
- if: used-preferred-cache
  uses: actions/cache/save@v3
  with:
    key: fallback
- if: used-fallback-cache
  uses: actions/cache/save@v3
  with:
    key: preferred
- if: used-preferred-cache
  uses: ./delete-cache
  with:
    key: preferred
- if: used-preferred-cache
  uses: actions/cache/save@v3
  with:
    key: preferred

Notes:

  • used-fallback-cache, used-preferred-cache, and no-cache aren't technical things, but actions/cache has outputs that you can use to construct the concepts
  • You might be able to condense the various stages, but you definitely want to ensure that at least one of the caches is available, and given how simple the with:'s will be, it might be simpler just to have lots of steps than to try to be incredibly fancy about it.
  • in addition to wrapping delete-cache into an action, you could wrap the entire delete+save pattern into an action

./delete-cache can be implemented using the APIs that were made available around June 27, 2022:
https://github.blog/changelog/2022-06-27-list-and-delete-caches-in-your-actions-workflows/
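
A minimal sketch of such a delete-cache step, using the caches REST endpoint from the linked changelog (the key name is a placeholder; `gh` is preinstalled on the hosted runners):

```yaml
- name: Delete the fallback cache
  env:
    GH_TOKEN: ${{ github.token }}
  run: |
    gh api --method DELETE \
      "repos/${GITHUB_REPOSITORY}/actions/caches?key=fallback"
```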

@github-actions

This issue is stale because it has been open for 200 days with no activity. Leave a comment to avoid closing this issue in 5 days.

@github-actions github-actions bot added the stale label Oct 13, 2023
@Frenzie

Frenzie commented Oct 13, 2023

Bots suck.

@github-actions github-actions bot removed the stale label Oct 14, 2023
gallais added a commit to msp-strath/MSPweb that referenced this issue Oct 17, 2023
gallais added a commit to msp-strath/MSPweb that referenced this issue Oct 17, 2023
@IanButterworth

I think this can be closed as it's now released in v4

@jsoref
Contributor

jsoref commented Jan 18, 2024

I don't see how v4 changes anything. Either it was already possible (and I think my suggestions and others show that there are ways to do something) or it might still not be possible.

If it's now possible as of v4, it'd be nice if someone put together an actual example of how to do it.

@IanButterworth

My bad. v4 has a save-always option. But this would be more like a save-overwrite option?

@jsoref
Contributor

jsoref commented Jan 18, 2024

I mean, I'd probably just use an epoch time value with a fallback of none:

        key: cache-${{ steps.time.outputs.epoch }}
        restore-keys: cache-

That'd result in it always writing one. Older caches will get wiped out as they become least recently used. Sure, you pay a bit to store a duplicate of the cache (or you could use actions/cache/restore and actions/cache/save and only conditionally call actions/cache/save if you made any changes...), but, so what?
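
For completeness, the `time` step referenced in the snippet could be produced like this (a sketch; the step id, cache path, and key prefix are arbitrary):

```yaml
- id: time
  run: echo "epoch=$(date +%s)" >> "$GITHUB_OUTPUT"
- uses: actions/cache@v4
  with:
    path: ~/.cache/bazel        # assumed cache path
    key: cache-${{ steps.time.outputs.epoch }}
    restore-keys: cache-
```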

@ephemient

but, so what?

That excess space usage causes other caches to get dropped too.

@jsoref
Contributor

jsoref commented Jan 19, 2024

Then use restore & save separately and use an if: to only run save when you have changes.

If you're being really aggressive, you might be able to portion the cache into lots of pieces and have steps to calculate and retrieve/save them.

There will be a trade-off between how many steps you need to run and how big your cache pieces are.

gallais added a commit to msp-strath/MSPweb that referenced this issue Feb 5, 2024
@github-actions

github-actions bot commented Aug 6, 2024

This issue is stale because it has been open for 200 days with no activity. Leave a comment to avoid closing this issue in 5 days.

@github-actions github-actions bot added the stale label Aug 6, 2024
@Frenzie

Frenzie commented Aug 6, 2024

@mihaimaruseac

This is still an issue worth resolving

@TheGiraffe3

Why wasn't the stale label removed?
