Cache git repo, draft references in Circle #3009

Merged: 4 commits, Sep 10, 2019


MikeBishop
Contributor

@MikeBishop MikeBishop commented Sep 6, 2019

This appears to shave almost a minute off of build times. See https://circleci.com/gh/quicwg/base-drafts/tree/circle_caching for the progression.

Member

@martinthomson martinthomson left a comment

I was concerned that this wouldn't do much good.

The {{ .Revision }} key means that the cache entry is only good for rebuilding that exact revision, which we don't do except when we are tagging a release. The {{ .Branch }} key is probably good, but not for master, whose cache will be baked in as soon as this lands (if it hasn't been already).

The {{ epoch }} tag is only good if we run builds within a second of each other.

How about an idea that I just had? We can run arbitrary code to generate a file. If we wrote a value to a file and then used {{ checksum ".reference-cache-key" }}, we could make that value roll over daily so that we always have a hot reference cache. We could do the same for the git cache so that it is at most a set number of days old. You can even cascade caches by reading from {{ checksum ".cache-today" }} and then {{ checksum ".cache-yesterday" }}.
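
A rough sketch of that daily-rollover idea, which is not something this PR contains: a run step writes today's and yesterday's dates to two files, and the cache keys are built from checksums of those files. The step names, the `.refcache` path, and the reliance on GNU `date -d` are assumptions for illustration only.

```yaml
# Sketch only: file names, step names, and paths are illustrative.
- run:
    name: "Generate daily cache keys"
    command: |
      # GNU date assumed; the -d option is not portable to BSD/macOS date.
      date -u +%F > .cache-today
      date -u -d yesterday +%F > .cache-yesterday

- restore_cache:
    name: "Restoring cache - References"
    keys:
      # Try today's cache first, then fall back to yesterday's.
      - v1-cache-references-{{ checksum ".cache-today" }}
      - v1-cache-references-{{ checksum ".cache-yesterday" }}

# ... build steps go here ...

- save_cache:
    name: "Saving cache - References"
    # An existing key is never overwritten, so this is written at most once per day.
    key: v1-cache-references-{{ checksum ".cache-today" }}
    paths:
      # Hypothetical location of the downloaded references.
      - .refcache
```

With keys like these, the first build of each day would start from yesterday's references and save a fresh entry; later builds that day would reuse it.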

@MikeBishop
Contributor Author

MikeBishop commented Sep 9, 2019

The important piece there is the multi-tiered keys in restore:

      - restore_cache:
          name: "Restoring cache - Git"
          keys:
            - v1-cache-git-{{ .Branch }}-{{ .Revision }}
            - v1-cache-git-{{ .Branch }}
            - v1-cache-git-

      - restore_cache:
          name: "Restoring cache - References"
          keys:
            - v1-cache-references-{{ epoch }}
            - v1-cache-references-

The revision will never match (except for tag builds, as you note), so it picks up the most recent git repo cached for that branch, or the most recent overall if it's a new branch. The epoch will almost never match, so it picks up the most recent reference cache generated without regard for branches. We could drop the first line of each restore_cache keys directive and get the same behavior.
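
For illustration, the simplified form mentioned above, with the first key of each list dropped, might look like the following. This is a sketch of the equivalent restore steps as described in this comment, not the configuration that was committed.

```yaml
- restore_cache:
    name: "Restoring cache - Git"
    keys:
      # Newest cache saved for this branch, if any.
      - v1-cache-git-{{ .Branch }}
      # Otherwise the newest git cache saved by any branch.
      - v1-cache-git-

- restore_cache:
    name: "Restoring cache - References"
    keys:
      # Newest reference cache, regardless of branch.
      - v1-cache-references-
```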

Look at the run logs on this branch -- it is picking up the cache from previous runs, because the first restore key strikes out (as expected) and it rolls over to the more general key and finds the previous run.

@MikeBishop
Contributor Author

MikeBishop commented Sep 9, 2019

Also, the proof is in the pudding -- the first build on this branch took 1:48; git caching brought it to 0:56, and reference caching brought it to 0:37.

  • Code checkout goes from 36 seconds to sub-second (shows as zero)
  • Draft build time goes from 32 seconds to 6 seconds
  • Saving/restoring the git cache costs 5-6 seconds
  • Saving/restoring the reference cache is sub-second

There are a few seconds of variation around each of those -- the last build on the branch went back up to 0:43. But still, I stand by my claim that we improve by nearly a minute per build.

(That one with 2:55 runtime spent 2:04 of that downloading issues.)

Newly-created branches are most likely to have been based on master; if no cache exists for a repo, take master as the starting point
@martinthomson
Member

martinthomson commented Sep 10, 2019

That's great for now, while there are relatively few new commits to add and the references are fresh. However, we are creating caches for things that won't ever be used again ({{ .Revision }} and {{ epoch }}), which takes time that doesn't pay back.

More seriously, the {{ .Branch }} thing is a one-time event. We will always be picking up the cache that you create the first time master uses this build. And the references cache will be old in less than a week now, but that is the cache that will be used in perpetuity. Sure, this is good in the sense that the build will survive networking blips (assuming that we don't add more citations), but those references will be ancient.

@MikeBishop
Contributor Author

MikeBishop commented Sep 10, 2019

> We will always be picking up the cache that you create the first time master uses this build. And the references cache will be old in less than a week now, but that is the cache that will be used in perpetuity.

That would be true if we created a cache once and never wrote an updated one. But that's not what happens -- we create a fresher one each time we run. We'll be picking up the cache from the last time master built, not the first.

Because a cache is immutable once written under a key, each save needs a unique key (hence epoch and revision). Restores search by prefix, though, and a prefix search doesn't mean it's going to use the oldest possible match.

From https://circleci.com/docs/2.0/caching/#restoring-cache:

> Each cache key is namespaced to the project, and retrieval is prefix-matched. The cache will be restored from the first matching key. If there are multiple matches, the most recently generated cache will be used.
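
To make the save side concrete, here is a minimal sketch of save steps that would pair with those restore keys. The key templates come from the restore keys quoted earlier in this thread; the step names and both paths are assumptions, not copied from the PR.

```yaml
- save_cache:
    name: "Saving cache - Git"
    # Unique per commit, so every new revision writes a fresh entry.
    key: v1-cache-git-{{ .Branch }}-{{ .Revision }}
    paths:
      # Hypothetical path: the checked-out repository's git data.
      - .git

- save_cache:
    name: "Saving cache - References"
    # Unique per build, since epoch is the current time in seconds.
    key: v1-cache-references-{{ epoch }}
    paths:
      # Hypothetical path: wherever the build stores fetched references.
      - .refcache
```

Each key is written once and never updated, but a later restore on the prefix `v1-cache-references-` matches all of them and takes the most recently generated one.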

Again, look at the actual behavior for builds on this branch:

  • Build 12474 uses both caches from build 12461
  • Build 12461 uses both caches from build 12459
  • Build 12459 uses both caches from build 12457
  • Build 12457 uses both caches from build 12455
  • Build 12455 uses both caches from build 12453
  • Build 12453 uses the git cache from build 12451 and creates the reference cache
  • Build 12451 uses the git cache from build 12449
  • Build 12449 creates the git cache

Each build creates a new cache instance, and each run uses:

  • The most recent git cache for the current revision, or branch, or for master, or overall
  • The most recent reference cache from any previous run
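
As a sketch, the four-step lookup order for the git cache listed above could be written as the restore keys below. The exact spelling of the master fallback key is an assumption inferred from the commit note about new branches; the other three keys are the ones quoted earlier in this thread.

```yaml
- restore_cache:
    name: "Restoring cache - Git"
    keys:
      # 1. Exact revision: only matches on tag builds or rebuilds.
      - v1-cache-git-{{ .Branch }}-{{ .Revision }}
      # 2. Newest cache saved for the current branch.
      - v1-cache-git-{{ .Branch }}
      # 3. Newest cache saved for master (assumed key spelling).
      - v1-cache-git-master
      # 4. Newest git cache saved by any branch.
      - v1-cache-git-
```

Keys are tried in order and the first one with any prefix match wins, so a brand-new branch would fall through to step 3 and start from the latest master checkout.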

Sure, odds are that each stored cache only gets used once and then sits there collecting dust until Circle expires it 30 days later, which means we're not amortizing the time to create the cache over multiple future runs. But the combined time to store a new cache instance and retrieve the latest cache instance is massively outweighed by the time saved by having that cache.

@martinthomson
Member

How did I completely miss the prefix-matching thing? This is good.

@martinthomson martinthomson merged commit 3588a57 into master Sep 10, 2019
@martinthomson martinthomson deleted the circle_caching branch September 10, 2019 22:44