Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ci: Use sccache-action instead of rust-cache #4468

Merged
merged 3 commits into from
Feb 1, 2024

Conversation

jalil-salame
Copy link
Contributor

It reduced ci times for fluke from 9min to 1min in MacOS (see this issue) so I thought it would be worth a try here too.

sccache does not have a configurable key for the cache so it might not work at all for us.

@jalil-salame
Copy link
Contributor Author

You have to allow the new action in the repo configuration. It's mozilla-actions/sccache-action@v0.0.3

Copy link
Contributor Author

@jalil-salame jalil-salame left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding aditional explanations to the changes

@@ -130,7 +130,7 @@ def __init__(self, name, path, key):

class CacheRustStep(ActionStep):
def __init__(self, name, key):
super().__init__(name, action="Swatinem/rust-cache@v2", params={"key": key})
super().__init__(name, action="mozilla-actions/sccache-action@v0.0.3")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sccache has no way of adding a custom key yet(?). Might or might not trip us up

ci/generate-workflows.py Show resolved Hide resolved
ci/generate-workflows.py Show resolved Hide resolved
@jalil-salame
Copy link
Contributor Author

I might restart the generate ci script rewrite effort...

Thanks for all the work! I've really enjoyed using wezterm and I hope I can do some more work on improving this c:

@wez
Copy link
Owner

wez commented Oct 30, 2023

You have to allow the new action in the repo configuration. It's mozilla-actions/sccache-action@v0.0.3

I've added this action to the allow list; I think you might need to push/bump again this PR for the actions to run.

@jalil-salame
Copy link
Contributor Author

OpenSuse Leap is failing because it cannot find tar. I am assuming it is not installed.

Should I add it as a dependency to get-deps, or as a CI step like git and curl?

@jalil-salame
Copy link
Contributor Author

opensuse_tumbleweed tests are failing. I don't think sccache could cause those failures, but I'm not optimistic.

@jalil-salame
Copy link
Contributor Author

Besides the opensuse errors, the results seem mixed:

Results: (click me)

Clean Cache

Type Build Time Test Time Cache Hits Cache Misses
macos (Intel) 25:58 12:47 17 2780
macos (ARM) 18:53 N/A N/A N/A
debian11 19:13 11:12 432 1095
debian12 18:19 11:56 36 1491
fedora38 17:06 11:39 380 1147
ubuntu22.04 15:28 12:09 417 1110

Warm? Cache

Type Build Time Test Time Cache Hits Cache Misses
centos7 21:33 13:52 0 1528
centos8 14:13 10:10 530 1268
centos9 18:09 11:26 630 1168
windows 37:06 19:42 0 702
fedora36 18:12 10:46 637 1158
fedora37 14:34 07:58 376 1151
debian10.3 16:31 09:22 543 984
ubuntu20.04 13:11 10:06 524 1274

Conclusion

The environment is not optimal to test the impact:

  • sccache is competing for cache storage with other actions
  • cargo-nextest was not cached for a few builds
    • sccache also speeds up the build of cargo-nextest which is both good (faster compiles) and bad (pollutes the cache)
  • Some build have suspiciously few cache hits (centos7, macos, and windows stand out)

My guess is sccache is better, but it is hard to tell.

Potential Improvements

sccache is not caching crate downloads (it is not supposed to). Adding a cargo vendor step might improve compile times further.

@wez
Copy link
Owner

wez commented Oct 30, 2023

OpenSuse Leap is failing because it cannot find tar. I am assuming it is not installed.

Thanks for looking at this!

I'm considering removing the opensuse builds from the GH CI because they are the most unreliable platforms: they often break in weird ways for days at a time.

I recently added COPR builds for several platforms (https://wezfurlong.org/wezterm/install/linux.html#installing-on-fedora-and-rpm-based-systems-via-copr) which makes it less critical to have GH run so many itself. Provided that failures of the COPR build bubble up and are noticed by me; that's currently a missing piece.

Should I add it as a dependency to get-deps, or as a CI step like git and curl?

My mental model is: it goes in get-deps unless we need it before we're in a position to run get-deps.

@wez
Copy link
Owner

wez commented Oct 30, 2023

re: mixed results; I'm increasingly of the opinion that GH actions runners are under powered for the projects that I work on; the disk space in the runners isn't especially generous and the total amount of storage for caches is limiting when there are multiple platforms to support.

wezterm's CI hasn't quite reached the limits of cache space, but I did hit them over in kumomta and decided to move its CI to a locally hosted woodpecker ci, which makes a significant difference (40 minute unreliable builds on GH -> 3 mins cached or ~5 mins cold on my woodpecker box). That works for kumomta because the number of platforms are small and I have just enough hardware for that.

Based on that experience, I suspect that adding more platforms to wezterm will introduce similar reliability issues with caching and my lead to disabling caching for some of them, or just punting them to other build systems (like COPR).

@jalil-salame
Copy link
Contributor Author

re: mixed results; I'm increasingly of the opinion that GH actions runners are under powered for the projects that I work on; the disk space in the runners isn't especially generous and the total amount of storage for caches is limiting when there are multiple platforms to support.

I've also hit the limits of the CI machines in a very small project (which was installing ROCm and thus using 10+GiB of space).

sccache should help with cache utilization; the cache is now (mostly) shared for all Linux builds.

cargo vendor should be shared for all builds if I end up adding that.

@jalil-salame jalil-salame changed the title ci: Use scccache-action instead of rust-cache ci: Use sccache-action instead of rust-cache Oct 30, 2023
@jalil-salame
Copy link
Contributor Author

macos caching is working! 🎉

45min vs 56min!

Cache hits are better on the weird workflows (centos7 and macos).

Windows seems to still be having trouble.

I'll have more time to look at this on Wednesday hopefully. I need to fix the tar dep on SUSE and add caching for the crate downloads.

@jalil-salame
Copy link
Contributor Author

Caching the dependencies with cargo vendor seems like a failed experiment; GH CI, with all its warts, does have a good network connection so we only save at most ~30s. It is better for crates.io though so we might want to keep it if it doesn't hurt anything.

@jalil-salame
Copy link
Contributor Author

Now that tar is installed in OpenSUSE, both are (expectedly) failing the tests!

Conclusion of the experiment:

tl;dr sccache is not a big improvement but it helps with cold caches and cache utilization. Vendoring is not useful for wezterm but maybe crates.io will like it.

  1. sccache did not bring the massive improvement I had hoped, but it does do some nice things:
  • It uses the cache much better (I estimated ~5.5GiB of cached artifacts (GH rate limited me so I only had ~1/3 of the data))
  • It caches intermediate results (as seen in the cache hits, the cache is populated as the action runs, so a cold start is still very good (i.e. centos7 and centos8 both start at the same time, thus some of the artifacts are compiled by centos7 some by centos8 and are being shared by sccache reducing the load).
  • It works on macos 🎉
  1. Vendoring dependencies is negligible, the downloads take ~30s only
  2. OpenSUSE is flaky
  3. I dislike the Python script, next PR I do will replace it with xtasks or smth (not soon sadly)

@wez
Copy link
Owner

wez commented Jan 21, 2024

Sorry it took a while to get back to this. I like the idea of using less cache space. I think the changes are reasonable also. Would you mind rebasing this PR? I think it is mergeable!

@jalil-salame
Copy link
Contributor Author

I'm gonna have to look at this again, after the rebase I apparently messed something up as the actions use both cache methods now 🥲

Hopefully I can do this tomorrow, otherwise it might take a week...

@jalil-salame
Copy link
Contributor Author

I'm gonna have to look at this again, after the rebase I apparently messed something up as the actions use both cache methods now 🥲

Hopefully I can do this tomorrow, otherwise it might take a week...

I feel so dumb... That was all me, I did it... everything is working as intended...

For future confused me:

We use sccache to cache the compilation artifacts, and we use cache to cache the dependencies so that we are nice to crates.io.

I was planning on squashing everything into two commits; the python changes and the workflow changes. Are you okay with me doing that?

@wez
Copy link
Owner

wez commented Jan 29, 2024

I will often just click the "squash and merge" button if I feel like the number of commits in a PR is a bit high/verbose.
I'm happy to do that here to save you some effort, or I can wait for you to amend your PR history!

jalil-salame added a commit to jalil-salame/wezterm that referenced this pull request Jan 29, 2024
It reduced ci times for fluke from 9min to 1min in MacOS (see [this
issue](bearcove/loona#118)).

This [comment](wez#4468 (comment))
has the TL;DR of the results from adding it. The summary of that is
caching works on MacOS now and it uses the github cache much better.
@jalil-salame
Copy link
Contributor Author

I will often just click the "squash and merge" button if I feel like the number of commits in a PR is a bit high/verbose. I'm happy to do that here to save you some effort, or I can wait for you to amend your PR history!

It was a nice moment for me to learn about interactive rebases so I took the opportunity c: The history should be much tidier now.

It reduced ci times for fluke from 9min to 1min in MacOS (see [this
issue](bearcove/loona#118)).

This [comment](wez#4468 (comment))
has the TL;DR of the results from adding it. The summary of that is
caching works on MacOS now and it uses the github cache much better.
This doesn't improve the build times, but it does reduce the requests we
make to `crates.io`.
@wez wez merged commit 309b7d1 into wez:main Feb 1, 2024
12 of 13 checks passed
wez pushed a commit that referenced this pull request Feb 1, 2024
It reduced ci times for fluke from 9min to 1min in MacOS (see [this
issue](bearcove/loona#118)).

This [comment](#4468 (comment))
has the TL;DR of the results from adding it. The summary of that is
caching works on MacOS now and it uses the github cache much better.
@wez
Copy link
Owner

wez commented Feb 1, 2024

Thank you again!

@jalil-salame jalil-salame deleted the sccache-action branch February 1, 2024 20:31
@jalil-salame
Copy link
Contributor Author

Looking at this a month after, the cache usage looks much better 🎉

$ gh cache list --json sizeInBytes -q '([.[].sizeInBytes] | add) / (1024 * 1024)' --limit 16716
7539.66
$ gh cache list --json sizeInBytes -q '([.[].sizeInBytes] | add) / (1024 * 1024)' --limit 16716 --key sccache
6535.33

Numbers in MiB.

On other news; I am preparing myself mentally to rewrite the ci script so I can tackle #3360 . The basic idea I have is to use a needs dependency to chain the CI executions. I might also take the plunge and look at the diff of the schedule vs push actions, and see if I can merge them into a monstrosity that doesn't need two files (and is hopefully more maintainable). Does that sound any good?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants