
Extremely slow cache on self-hosted from time to time #720

Open
levonet opened this issue Jan 27, 2022 · 43 comments
Labels
area:blob-storage bug Something isn't working

Comments

@levonet

levonet commented Jan 27, 2022

Loading the cache usually takes a few seconds, but from time to time it takes a very long time. I suspect this has something to do with slowness in the GitHub API, but I can't prove it yet.

I would like some transparency about what happens when the action returns a response like Received 0 of XXX (0.0%), 0.0 MBs/sec. That would allow us to pinpoint and fully resolve the problem, or at least find a workaround.

[screenshot]

@akushneruk

I have been facing a similar situation:
[screenshot]

@es

es commented Jan 30, 2022

Seeing the same behaviour on self-hosted (GCP, europe-west3) and hosted runners. Unfortunately, it's not consistent and debugging the hosts when it's occurring doesn't point to anything on the boxes themselves.

@levonet
Author

levonet commented Jan 31, 2022

In our case the severity rose to critical because it affects the operation of our CD pipeline. Our self-hosted runners run in GCP southamerica-east1.
[screenshot]

@levonet levonet changed the title Extremely slow cache from time to time Extremely slow cache on self-hosted from time to time Jan 31, 2022
@SophisticaSean

Experiencing the same issues on GCP. I had to implement a timeout on the whole job because the cache action routinely takes 10+ minutes for 30 MB.

@levonet
Author

levonet commented Feb 4, 2022

Our workaround:

    - name: Cache mvn repository
      uses: actions/cache@v2
      timeout-minutes: 5        # give up on a stuck cache restore after 5 minutes
      continue-on-error: true   # treat a timed-out restore as a cache miss instead of failing the job
      with:
        path: ~/.m2/repository
        key: cache-m2-${{ hashFiles('pom.xml') }}
        restore-keys: |
          cache-m2-

@vsvipul
Contributor

vsvipul commented May 11, 2022

@levonet @akushneruk @SophisticaSean @es Are you still seeing the same behaviour on self-hosted runners? If yes, can one of you please enable debugging and share the log file here? Thank you.

@levonet
Author

levonet commented May 12, 2022

There have been no problems in the last month.
[screenshot]

@adamshiervani

Hey,
We are seeing this happening fairly regularly with our customers at BuildJet for GitHub Actions.

Most of the reports in this thread are from people self-hosting, but the issue doesn't seem to be unique to self-hosting; the hosted runners have the same problem. I don't know whether you have more insight into workflow runs that use native runners, but here is one example from Google Chrome Lighthouse where the actions/cache@v3 step fails exactly as described in this thread.

As we have a lot of traffic running on our infrastructure, we really notice when bugs happen in commonly used actions, and we'd be more than happy to provide you with insights into how users are affected. You can reach me at adam@buildjet.com.

@vsvipul

@vsvipul
Contributor

vsvipul commented May 16, 2022

@adamshiervani The workflow you pointed to ran 18 days ago. Can you share anything that happened in the last week? A single instance of slowness might be an outlier, in which case an upgrade or downtime might explain it. If this happens every few days, I'd like to know more about the frequency.

@txtsd

txtsd commented May 17, 2022

We've been experiencing this intermittently on GitHub-hosted builds (not self-hosted). When it happens, the cache takes 12+ hours to restore. It would be nice if it were possible to set a timeout to deal with this.

@MrAlex94

MrAlex94 commented Jul 27, 2022

@vsvipul - we are heavy users of BuildJet and this action, and this is probably the most common issue we face.

Here are some runs for you:
https://github.com/WaterfoxCo/Waterfox/runs/7544425566?check_suite_focus=true#step:5:83
https://github.com/WaterfoxCo/Waterfox/runs/7543708504?check_suite_focus=true#step:5:436

We get this issue every week, as we push a lot of builds out. There's no rhyme or reason to when it occurs.

A lot are missing because I ended up deleting the workflow logs - there were just too many.

@mfn

mfn commented Sep 27, 2022

we are heavy users of BuildJet and this action, and this is probably the most common issue we face.

We're also users of BuildJet, bless their offerings, and had to disable caching in some of our jobs because of this.

@marcindulak

If not limited by a timeout, this also incurs additional runner costs.

There is a newer issue, #810, and an environment variable SEGMENT_DOWNLOAD_TIMEOUT_MIN was introduced in #810 (comment), which was later reported as not working in #810 (comment).

This just happened on a GitHub-hosted runner, which failed to download 3 GB within 1 hour (I was not aware of the issue, so had no remediation settings in place); the download slowed down more and more: https://github.com/marcindulak/jhalfs-ci/actions/runs/3210575787/jobs/5249037125

...
Received 2143289343 of 3843618074 (55.8%), 0.6 MBs/sec
Warning: Failed to restore: Aborting cache download as the download time exceeded the timeout.
Cache not found for input keys: something

"Re-run failed" executed several minutes after the original failure, proceeded without the cache download slowdown, and finished the cache download in 2 minutes.

...
Received 3843618074 of 3843618074 (100.0%), 103.4 MBs/sec
Cache Size: ~3666 MB (3843618074 B)

@r-medina

I've also had this happen sporadically on GitHub-hosted runners and pretty consistently on BuildJet runners.

@CallThemAndSayHello

CallThemAndSayHello commented Nov 11, 2022

We now use actions/cache@v3 with the timeout suggested in #720 (comment).

I cannot recall having seen the issue (builds failing due to this timeout) in the last while, perhaps the last few months.

Never mind - the issue appeared today, with cache@v3. Luckily our macOS build timed out at this step after 10 min rather than the full 60 min we allow for the entire workflow.

@sammcj

sammcj commented Dec 7, 2022

We see this with actions/cache@v3 all the time; unless you add a short timeout to every cache definition, jobs back up waiting forever.

[screenshot]

@definelicht

This has become a significant issue for us. Currently around 50% of our actions time out at ~98% of the download, which is particularly painful because our download is relatively large (850 MB), so we have to wait a while before finding out whether a build will time out. We will have to migrate to a different solution if we don't find a resolution, as we cannot rely on manually monitoring CI runs.

@CallThemAndSayHello

CallThemAndSayHello commented Dec 14, 2022

In the cases where this problem occurred for us, the logs showed the restore going from 0% to, say, 60% very quickly and then stalling or trickling the rest of the way.
I never saw a run like the one in the screenshot https://user-images.githubusercontent.com/862951/206079348-b522b45f-9fde-404c-aa64-0c64cd335dd0.png actually complete. My impression is that if a restore is going to complete, it does so quickly and right away; once it starts reporting much slower progress, no amount of waiting will help. So we looked at how long our restores normally take and set timeouts accordingly: either the restore finishes within, say, 2 minutes, or it will not finish at all (it will not take 30 minutes - if it starts taking that long, it is one that hangs).

(basically, once it stops making progress, it will not complete)
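For illustration, that policy applied to a cache step is just the earlier workaround with the limit sized to the observed restore time (the path and key below are placeholders):

    - name: Cache npm downloads
      uses: actions/cache@v3
      timeout-minutes: 2        # normal restores finish well inside this; a stalled one never will
      continue-on-error: true   # treat a timed-out restore as a cache miss and build cold
      with:
        path: ~/.npm
        key: ${{ runner.os }}-npm-${{ hashFiles('**/package-lock.json') }}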

@definelicht

We now use actions/cache@v3 with the timeout suggested in #720 (comment).

This workaround does not work for us; timeout-minutes doesn't seem to be a supported key in our case (perhaps because our step is inside an action.yml rather than a workflow):

Error: /.github/actions/foo/action.yml (Line: 38, Col: 5): Unexpected value 'timeout-minutes'
Error: GitHub.DistributedTask.ObjectTemplating.TemplateValidationException: The template is not valid. /runner/_work/foo/foo/./.github/actions/foo/action.yml (Line: 38, Col: 5): Unexpected value 'timeout-minutes'

@CallThemAndSayHello

expanding the example:

name: Caching Primes

on: push

jobs:
  build:
    runs-on: ubuntu-latest

    steps:
    - uses: actions/checkout@v3

    - name: Cache Primes
      id: cache-primes
      uses: actions/cache@v3
      timeout-minutes: 10                   # <--- per-step timeout; valid on workflow steps, not inside an action.yml
      with:
        path: prime-numbers
        key: ${{ runner.os }}-primes

    - name: Generate Prime Numbers
      if: steps.cache-primes.outputs.cache-hit != 'true'
      run: /generate-primes.sh -d prime-numbers

    - name: Use Prime Numbers
      run: /primes.sh -d prime-numbers

@kotewar
Contributor

kotewar commented Jan 5, 2023

@definelicht

We have added an environment variable SEGMENT_DOWNLOAD_TIMEOUT_MINS to time out the download if a segment download exceeds a certain amount of time. You can refer to the README for more info.
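For reference, a minimal sketch of setting it on a cache step (the 5-minute value and the path/key here are placeholders):

    - name: Cache pip
      uses: actions/cache@v3
      env:
        SEGMENT_DOWNLOAD_TIMEOUT_MINS: 5   # give up on any cache segment that takes longer than 5 minutes to download
      with:
        path: ~/.cache/pip
        key: ${{ runner.os }}-pip-${{ hashFiles('**/requirements.txt') }}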

@rattrayalex

rattrayalex commented Feb 6, 2023

@adamshiervani The workflow you pointed to ran 18 days ago. Can you share anything that happened in the last week? A single instance of slowness might be an outlier, in which case an upgrade or downtime might explain it. If this happens every few days, I'd like to know more about the frequency.

@vsvipul this happens to us pretty often, at least a few times a week. We also use BuildJet. One case where we recently ran into this was the setup-node step in this build, if it helps you debug: https://github.com/stainless-api/stainless/actions/runs/4103099342/jobs/7076935062 . In that workflow run, it happened in at least two builds. This is another one: https://github.com/stainless-api/stainless/actions/runs/4103099342/jobs/7076927762

@kotewar
Contributor

kotewar commented Feb 7, 2023

👋🏽 @rattrayalex, are these private repos? Unfortunately we can't see the logs. Can you please elaborate on the issue you are facing?

@rattrayalex

Ah, OK. Basically the network hangs for many minutes (often long enough to hit our 20-minute timeout) - same as the other screenshots posted here.

Yesterday this was happening on almost every PR for us.

I'd recommend you reach out to @adamshiervani for more details, as they've done some investigation of this at BuildJet.

xla added a commit to filament-zone/filament that referenced this issue Feb 11, 2023
Prevent stalling builds due to a bug[^0] in Github actions.

[^0]: actions/cache#720

Signed-off-by: xla <self@xla.is>
@Voxelot

Voxelot commented Feb 15, 2023

Also noticing this behavior while using buildjet:
https://github.com/FuelLabs/fuel-core/actions/runs/4179573798/jobs/7239680006

Using the rust-cache action sporadically takes several times longer than a clean build. On average this makes self-hosted runners perform worse overall than basic GitHub runners (even if the actual execution stage is much faster).

@harupy

harupy commented Feb 15, 2023

Same here: https://github.com/mlflow/mlflow/actions/runs/4179427251/jobs/7239362333

2023-02-15T00:40:21.2445589Z ##[group]Run actions/cache@v3
2023-02-15T00:40:21.2445836Z with:
2023-02-15T00:40:21.2446061Z   path: ~/.cache/pip
2023-02-15T00:40:21.2446490Z   key: ubuntu22-20230206.1-pip-13695570156a8b8ebf8bca8ca347ebfc4fcff41ffe45d4d97e95f8e36eaf2f63
2023-02-15T00:40:21.2447046Z   restore-keys: ubuntu22-20230206.1-pip-

2023-02-15T00:40:21.2447338Z   enableCrossOsArchive: false
2023-02-15T00:40:21.2447609Z   fail-on-cache-miss: false
...
2023-02-15T00:40:21.2451931Z ##[endgroup]
2023-02-15T00:40:27.6090886Z Received 167772160 of 4156517911 (4.0%), 159.8 MBs/sec
2023-02-15T00:40:28.6050840Z Received 444596224 of 4156517911 (10.7%), 211.7 MBs/sec
2023-02-15T00:40:29.6171696Z Received 713031680 of 4156517911 (17.2%), 225.5 MBs/sec
2023-02-15T00:40:30.6215322Z Received 1015021568 of 4156517911 (24.4%), 240.8 MBs/sec
2023-02-15T00:40:31.6228063Z Received 1279262720 of 4156517911 (30.8%), 243.0 MBs/sec
2023-02-15T00:40:32.6320950Z Received 1593835520 of 4156517911 (38.3%), 252.1 MBs/sec
2023-02-15T00:40:33.6329582Z Received 1849688064 of 4156517911 (44.5%), 250.9 MBs/sec
2023-02-15T00:40:34.6337576Z Received 2059403264 of 4156517911 (49.5%), 244.5 MBs/sec
2023-02-15T00:40:42.4483316Z Received 2147483647 of 4156517911 (51.7%), 129.2 MBs/sec
2023-02-15T00:40:43.4502056Z Received 2336227327 of 4156517911 (56.2%), 132.2 MBs/sec
2023-02-15T00:40:44.4515849Z Received 2533359615 of 4156517911 (60.9%), 135.4 MBs/sec
2023-02-15T00:40:45.4518907Z Received 2726297599 of 4156517911 (65.6%), 137.9 MBs/sec
2023-02-15T00:40:46.4524639Z Received 2940207103 of 4156517911 (70.7%), 141.3 MBs/sec
2023-02-15T00:40:47.4536620Z Received 3112173567 of 4156517911 (74.9%), 142.3 MBs/sec
2023-02-15T00:40:48.4555027Z Received 3288334335 of 4156517911 (79.1%), 143.5 MBs/sec
2023-02-15T00:40:49.4555218Z Received 3481272319 of 4156517911 (83.8%), 145.3 MBs/sec
2023-02-15T00:40:50.4570309Z Received 3653238783 of 4156517911 (87.9%), 146.0 MBs/sec
2023-02-15T00:40:51.4567892Z Received 3837788159 of 4156517911 (92.3%), 147.3 MBs/sec
2023-02-15T00:40:52.4566377Z Received 4001366015 of 4156517911 (96.3%), 147.6 MBs/sec
2023-02-15T00:40:53.4585354Z Received 4152323607 of 4156517911 (99.9%), 147.4 MBs/sec
2023-02-15T00:40:54.4606220Z Received 4152323607 of 4156517911 (99.9%), 142.1 MBs/sec
2023-02-15T00:40:55.4619377Z Received 4152323607 of 4156517911 (99.9%), 137.2 MBs/sec
2023-02-15T00:40:56.4631692Z Received 4152323607 of 4156517911 (99.9%), 132.6 MBs/sec
2023-02-15T00:40:57.4645638Z Received 4152323607 of 4156517911 (99.9%), 128.3 MBs/sec
2023-02-15T00:40:58.4666590Z Received 4152323607 of 4156517911 (99.9%), 124.3 MBs/sec
2023-02-15T00:40:59.4672193Z Received 4152323607 of 4156517911 (99.9%), 120.5 MBs/sec
2023-02-15T00:41:00.4690192Z Received 4152323607 of 4156517911 (99.9%), 116.9 MBs/sec
2023-02-15T00:41:01.4702672Z Received 4152323607 of 4156517911 (99.9%), 113.6 MBs/sec
2023-02-15T00:41:02.4720360Z Received 4152323607 of 4156517911 (99.9%), 110.4 MBs/sec
2023-02-15T00:41:03.4732382Z Received 4152323607 of 4156517911 (99.9%), 107.4 MBs/sec
2023-02-15T00:41:04.4747315Z Received 4152323607 of 4156517911 (99.9%), 104.6 MBs/sec

@rattrayalex

I'm now seeing this on github's larger hosted runners as well:

Run actions/setup-node@v3
Found in cache @ /opt/hostedtoolcache/node/16.19.0/x64
Environment details
/usr/local/bin/yarn --version
1.22.19
/usr/local/bin/yarn cache dir
/home/runner/.cache/yarn/v6
Received 243269632 of 424308030 (57.3%), 231.3 MBs/sec
Received 420113726 of 424308030 (99.0%), 199.8 MBs/sec
…
Received 420113726 of 424308030 (99.0%), 0.3 MBs/sec
Received 420113726 of 424308030 (99.0%), 0.3 MBs/sec
Received 420113726 of 424308030 (99.0%), 0.3 MBs/sec
Error: The operation was canceled.

We've had this happen a handful of times recently. This job was https://github.com/stainless-api/stainless/actions/runs/4232471756/jobs/7352268632 if that's useful to the GHA team.
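As a stopgap for jobs where a hung restore costs more than a cold install, disabling the built-in cache (as others above have done) is an option; with actions/setup-node that just means leaving out the cache input - a sketch, with the Node version as a placeholder:

    - uses: actions/setup-node@v3
      with:
        node-version: 16
        # cache: yarn   # built-in caching left off until the hang is resolved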

@sammcj

sammcj commented Feb 21, 2023

We're seeing it on normal GitHub hosted runners now.

@kotewar
Contributor

kotewar commented Feb 22, 2023

@sammcj, we are aware of this problem on the GitHub-hosted runners and are investigating it thoroughly.

@michael-projectx

michael-projectx commented Mar 13, 2023

@adamshiervani The workflow you pointed to ran 18 days ago. Can you share anything that happened in the last week? A single instance of slowness might be an outlier, in which case an upgrade or downtime might explain it. If this happens every few days, I'd like to know more about the frequency.

We currently use BuildJet and see this fairly regularly.
Recent example here: https://github.com/figurerobotics/project-x/actions/runs/4389045495/jobs/7687049171

@Sytten

Sytten commented Mar 13, 2023

We can't see this run, @michael-projectx - you should paste the terminal output here...

@michael-projectx

@Sytten
The logs are quite long, but I'll try to post the relevant bits here:
[screenshot]

To start, spinning up the container usually takes a short amount of time, but in this case it took over an hour.

Hopefully this is helpful:
github_action.log

@vsvipul vsvipul removed their assignment Mar 31, 2023
@thinkafterbefore

thinkafterbefore commented May 25, 2023

Hey everyone,

I'm Adam, co-founder of BuildJet. Our customers often complain about reliability and speed issues like the one reported here. Today we launched BuildJet Cache, a GitHub Cache alternative that's reliable, fast and free.

It’s fully compatible with GitHub’s cache; all you need to do is replace actions/cache with buildjet/cache. It works on any runner - be it official, self-hosted, or BuildJet. Whichever runner you choose, BuildJet Cache doubles your free storage, offering a generous 20 GB per repo per week at no cost.

Head over to our launch post for more details: https://buildjet.com/for-github-actions/blog/launch-buildjet-cache
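For illustration, the swap on an existing cache step looks roughly like this (the version tag, path and key are placeholders - check the docs for the current tag):

    - name: Cache cargo registry
      uses: buildjet/cache@v3          # drop-in replacement for actions/cache@v3
      with:
        path: ~/.cargo/registry
        key: ${{ runner.os }}-cargo-${{ hashFiles('**/Cargo.lock') }}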

@Sytten

Sytten commented May 25, 2023

@thinkafterbefore Since you hijacked the post for promotion, let me push back a little.

Your cache fork is really not well done; it's just built on top of the existing cache action with a random closed-source package, https://www.npmjs.com/package/github-actions.cache-buildjet. Fix your shit please - we have enough supply-chain attacks without legit stuff looking sus like that. Ideally do proper tagging too.

I would be much more supportive of an action that actually allowed people not to be dependent on any one provider, similar to actions-cache-s3. The devs of the GitHub cache don't seem to care about that.

@thinkafterbefore

@Sytten, I hear you, and I apologize if my prior message was too promotional; my intention was simply to help people solve a long-standing issue with actions/cache.

Regarding the github-actions.cache-buildjet package, I totally get your concerns about it being closed source. To ensure full transparency, we’ve open sourced it. You can review the repository here.

Lastly, as we wrote in the launch post, if people are interested in using their own object storage, whywaita/actions-cache-s3 is perfect for that. We simply wanted to take that burden off users.

Thanks for pushing back and calling out where we need to improve.

@mfn

mfn commented May 25, 2023

I can only say: I'm also a happy BuildJet customer and a long-time sufferer of the cache issue, and I'm glad this was posted here. Probably because I'm affected, I did not find it inappropriate - quite the opposite; I'm glad I learned about this and can't wait to wake up tomorrow and try it out 👌🏼


@maxnowack

I also started my own solution: I forked the actions/cache code and modified it to place a tarball on a local volume instead of uploading/downloading the whole cache every time.
You can use it as a drop-in replacement for actions/cache@v3: https://github.com/maxnowack/local-cache
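For illustration, switching an existing step would look roughly like this (the ref, path and key are placeholders - check the repository for the supported inputs and current tag):

    - name: Cache node modules on the runner
      uses: maxnowack/local-cache@main   # assumed ref; stores the tarball on a local volume of the self-hosted runner
      with:
        path: node_modules
        key: ${{ runner.os }}-node-${{ hashFiles('**/package-lock.json') }}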


This issue is stale because it has been open for 200 days with no activity. Leave a comment to avoid closing this issue in 5 days.

@github-actions github-actions bot added the stale label Jan 15, 2024
@skjnldsv

Still an issue

@github-actions github-actions bot removed the stale label Jan 16, 2024
@dangeReis

Can we have an ENV variable that disables saving/restoring the cache (maybe 2 separate variables), so that we can turn it off when other actions use this action internally?
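In the meantime, for cache steps we control directly (not ones buried inside other actions), a rough stopgap is gating the step on a repository variable - a sketch, with DISABLE_CACHE as an assumed variable name:

    - name: Cache pip
      if: vars.DISABLE_CACHE != 'true'   # skip save/restore entirely when the repository variable is set
      uses: actions/cache@v3
      with:
        path: ~/.cache/pip
        key: ${{ runner.os }}-pip-${{ hashFiles('**/requirements.txt') }}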

@lsdm83114

lsdm83114 commented Mar 29, 2024 via email

@dangeReis

Will we?
