
Extremely slow cache on self-hosted from time to time #720

Open
levonet opened this issue Jan 27, 2022 · 43 comments
Labels
area:blob-storage bug Something isn't working

Comments

@levonet

levonet commented Jan 27, 2022

Loading the cache usually takes a few seconds, but from time to time it takes a very long time. I suspect this has something to do with slowness in the GitHub API, but I can't prove it yet.

I would like some transparency about what happens when the action returns a response like Received 0 of XXX (0.0%), 0.0 MBs/sec. That would allow us to pinpoint and fully resolve the problem, or at least find a workaround.

[screenshot]

@akushneruk

I have been facing a similar situation:
[screenshot]

@es

es commented Jan 30, 2022

Seeing the same behaviour on self-hosted (GCP, europe-west3) and hosted runners. Unfortunately, it's not consistent and debugging the hosts when it's occurring doesn't point to anything on the boxes themselves.

@levonet
Author

levonet commented Jan 31, 2022

In our case the severity rose to critical because it affects the operation of our CD pipeline. Our self-hosted runners run in GCP southamerica-east1.
[screenshot]

@levonet levonet changed the title Extremely slow cache from time to time Extremely slow cache on self-hosted from time to time Jan 31, 2022
@SophisticaSean

Experiencing the same issues on GCP. I had to implement a timeout on the whole job because the cache action routinely takes 10+ minutes for 30 MB.

@levonet
Author

levonet commented Feb 4, 2022

Our workaround:

    - name: Cache mvn repository
      uses: actions/cache@v2
      timeout-minutes: 5        # give up on a stuck cache restore after 5 minutes
      continue-on-error: true   # treat a timed-out restore as a cache miss instead of failing the job
      with:
        path: ~/.m2/repository
        key: cache-m2-${{ hashFiles('pom.xml') }}
        restore-keys: |
          cache-m2-

@vsvipul
Contributor

vsvipul commented May 11, 2022

@levonet @akushneruk @SophisticaSean @es Are you still seeing the same behaviour on self-hosted runners? If yes, can one of you please enable debugging and share the log file here? Thank you.

@levonet
Author

levonet commented May 12, 2022

There have been no problems in the last month.
[screenshot]

@adamshiervani

Hey,
We are seeing this happening fairly regularly with our customers at BuildJet for GitHub Actions.

Most of the reports in this thread are from people self-hosting, but the issue doesn't seem to be unique to self-hosting; the hosted runners have the same problem. I don't know whether you have more insight into workflow runs that use native runners, but here is one example from Google Chrome Lighthouse where the actions/cache@v3 step fails exactly as described in this thread.

As we have a lot of traffic running on our infrastructure, we really notice when bugs happen in commonly used actions, and we'd be more than happy to provide you with insights into how users are affected. You can reach me at adam@buildjet.com.

@vsvipul

@vsvipul
Contributor

vsvipul commented May 16, 2022

@adamshiervani The workflow you pointed to ran 18 days ago. Can you share anything that happened in the last week? A single instance of slowness might be an outlier, in which case an upgrade or downtime might explain it. If this happens every few days, I'd like to know more about the frequency.

@txtsd

txtsd commented May 17, 2022

We've been experiencing this intermittently on GitHub-hosted builds (not self-hosted). When it happens, the cache takes 12+ hours to restore. It would be nice if it were possible to set a timeout to deal with this.

@MrAlex94

MrAlex94 commented Jul 27, 2022

@vsvipul - we are heavy users of BuildJet and this action, and this is probably the most common issue we face.

Here are some runs for you:
https://github.com/WaterfoxCo/Waterfox/runs/7544425566?check_suite_focus=true#step:5:83
https://github.com/WaterfoxCo/Waterfox/runs/7543708504?check_suite_focus=true#step:5:436

We get this issue every week, as we push a lot of builds out. There's no rhyme or reason to when it occurs.

A lot are missing because I ended up deleting the workflow logs - there were just too many.

@mfn

mfn commented Sep 27, 2022

we are heavy users of BuildJet and this action, and this is probably the most common issue we face.

We're also users of BuildJet, bless their offerings, and had to disable caching in some of our jobs because of this.

@marcindulak

If not limited by a timeout, this also incurs additional runner costs.

There is a newer issue, #810, and an environment variable SEGMENT_DOWNLOAD_TIMEOUT_MIN was introduced in #810 (comment), which was later reported as not working in #810 (comment).

This just happened on a GitHub-hosted runner, which failed to download 3 GB within 1 hour (I was not aware of the issue, so had no remediation settings in place); the download slowed down more and more: https://github.com/marcindulak/jhalfs-ci/actions/runs/3210575787/jobs/5249037125

...
Received 2143289343 of 3843618074 (55.8%), 0.6 MBs/sec
Warning: Failed to restore: Aborting cache download as the download time exceeded the timeout.
Cache not found for input keys: something

"Re-run failed" executed several minutes after the original failure, proceeded without the cache download slowdown, and finished the cache download in 2 minutes.

...
Received 3843618074 of 3843618074 (100.0%), 103.4 MBs/sec
Cache Size: ~3666 MB (3843618074 B)

@r-medina

I've also had this happen sporadically on GitHub-hosted runners and pretty consistently on BuildJet runners.

@CallThemAndSayHello

CallThemAndSayHello commented Nov 11, 2022

We now use actions/cache@v3 with the timeout suggested in #720 (comment).

I cannot recall having seen the issue (builds failing due to this timeout) in the last while, perhaps the last few months.

Never mind - the issue appeared today, with cache@v3. Luckily our macOS build timed out at this step after 10 min rather than the full 60 min we allow for the entire workflow.

@sammcj

sammcj commented Dec 7, 2022

We see this with actions/cache@v3 all the time; unless you add a short timeout to every cache definition, jobs back up waiting forever.

[screenshot]

@definelicht

This has become a significant issue for us. Currently around 50% of our actions time out at ~98% of the download, which is particularly painful because our download is relatively large (850 MB), so we have to wait a while before finding out whether a build will time out. We will have to migrate to a different solution if we don't find a resolution, as we cannot rely on manually monitoring CI runs.

@CallThemAndSayHello

CallThemAndSayHello commented Dec 14, 2022

In the cases where this problem occurred for us, the logs showed the restore going from 0% to, say, 60% very quickly and then stalling or trickling the rest of the way.
I never saw a run like the one in the screenshot https://user-images.githubusercontent.com/862951/206079348-b522b45f-9fde-404c-aa64-0c64cd335dd0.png actually complete. My impression is that if a restore is going to complete, it does so quickly and right away; once it starts reporting much slower progress, no amount of waiting will help. So we looked at how long our restores normally take and set timeouts accordingly: either the restore finishes within, say, 2 minutes, or it will not finish at all (it will not take 30 minutes - if it starts taking that long, it is one that hangs).

(basically, once it stops making progress, it will not complete)
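For illustration, that policy applied to a cache step is just the earlier workaround with the limit sized to the observed restore time (the path and key below are placeholders):

    - name: Cache npm downloads
      uses: actions/cache@v3
      timeout-minutes: 2        # normal restores finish well inside this; a stalled one never will
      continue-on-error: true   # treat a timed-out restore as a cache miss and build cold
      with:
        path: ~/.npm
        key: ${{ runner.os }}-npm-${{ hashFiles('**/package-lock.json') }}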

@definelicht

We now use actions/cache@v3 with the timeout suggested in #720 (comment).

This workaround does not work for us; timeout-minutes doesn't seem to be a supported key in our case (perhaps because our step is inside an action.yml rather than a workflow):

Error: /.github/actions/foo/action.yml (Line: 38, Col: 5): Unexpected value 'timeout-minutes'
Error: GitHub.DistributedTask.ObjectTemplating.TemplateValidationException: The template is not valid. /runner/_work/foo/foo/./.github/actions/foo/action.yml (Line: 38, Col: 5): Unexpected value 'timeout-minutes'

@CallThemAndSayHello

expanding the example:

name: Caching Primes

on: push

jobs:
  build:
    runs-on: ubuntu-latest

    steps:
    - uses: actions/checkout@v3

    - name: Cache Primes
      id: cache-primes
      uses: actions/cache@v3
      timeout-minutes: 10                   # <--- per-step timeout; valid on workflow steps, not inside an action.yml
      with:
        path: prime-numbers
        key: ${{ runner.os }}-primes

    - name: Generate Prime Numbers
      if: steps.cache-primes.outputs.cache-hit != 'true'
      run: /generate-primes.sh -d prime-numbers

    - name: Use Prime Numbers
      run: /primes.sh -d prime-numbers

@kotewar
Contributor

kotewar commented Jan 5, 2023

@definelicht

We have added an environment variable SEGMENT_DOWNLOAD_TIMEOUT_MINS to time out the download if a segment download exceeds a certain amount of time. You can refer to the README for more info.
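For reference, a minimal sketch of setting it on a cache step (the 5-minute value and the path/key here are placeholders):

    - name: Cache pip
      uses: actions/cache@v3
      env:
        SEGMENT_DOWNLOAD_TIMEOUT_MINS: 5   # give up on any cache segment that takes longer than 5 minutes to download
      with:
        path: ~/.cache/pip
        key: ${{ runner.os }}-pip-${{ hashFiles('**/requirements.txt') }}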

@rattrayalex

rattrayalex commented Feb 6, 2023

@adamshiervani The workflow you pointed to ran 18 days ago. Can you share anything that happened in the last week? A single instance of slowness might be an outlier, in which case an upgrade or downtime might explain it. If this happens every few days, I'd like to know more about the frequency.

@vsvipul this happens to us pretty often, at least a few times a week. We also use BuildJet. One case where we recently ran into this was the setup-node step in this build, if it helps you debug: https://github.com/stainless-api/stainless/actions/runs/4103099342/jobs/7076935062 . In that workflow run, it happened in at least two builds. This is another one: https://github.com/stainless-api/stainless/actions/runs/4103099342/jobs/7076927762

@kotewar
Contributor

kotewar commented Feb 7, 2023

👋🏽 @rattrayalex, are these private repos? Unfortunately we can't see the logs. Can you please elaborate on the issue you are facing?

@rattrayalex

Ah, OK. Basically the network hangs for many minutes (often long enough to hit our 20-minute timeout) - same as the other screenshots posted here.

Yesterday this was happening on almost every PR for us.

I'd recommend you reach out to @adamshiervani for more details, as they've done some investigation of this at BuildJet.

xla added a commit to filament-zone/filament that referenced this issue Feb 11, 2023
Prevent stalling builds due to a bug[^0] in Github actions.

[^0]: actions/cache#720

Signed-off-by: xla <self@xla.is>
@Voxelot

Voxelot commented Feb 15, 2023

Also noticing this behavior while using buildjet:
https://github.com/FuelLabs/fuel-core/actions/runs/4179573798/jobs/7239680006

Using the rust-cache action sporadically takes several times longer than a clean build. On average this makes self-hosted runners perform worse overall than basic GitHub runners (even if the actual execution stage is much faster).

@harupy

harupy commented Feb 15, 2023

Same here: https://github.com/mlflow/mlflow/actions/runs/4179427251/jobs/7239362333

2023-02-15T00:40:21.2445589Z ##[group]Run actions/cache@v3
2023-02-15T00:40:21.2445836Z with:
2023-02-15T00:40:21.2446061Z   path: ~/.cache/pip
2023-02-15T00:40:21.2446490Z   key: ubuntu22-20230206.1-pip-13695570156a8b8ebf8bca8ca347ebfc4fcff41ffe45d4d97e95f8e36eaf2f63
2023-02-15T00:40:21.2447046Z   restore-keys: ubuntu22-20230206.1-pip-

2023-02-15T00:40:21.2447338Z   enableCrossOsArchive: false
2023-02-15T00:40:21.2447609Z   fail-on-cache-miss: false
...
2023-02-15T00:40:21.2451931Z ##[endgroup]
2023-02-15T00:40:27.6090886Z Received 167772160 of 4156517911 (4.0%), 159.8 MBs/sec
2023-02-15T00:40:28.6050840Z Received 444596224 of 4156517911 (10.7%), 211.7 MBs/sec
2023-02-15T00:40:29.6171696Z Received 713031680 of 4156517911 (17.2%), 225.5 MBs/sec
2023-02-15T00:40:30.6215322Z Received 1015021568 of 4156517911 (24.4%), 240.8 MBs/sec
2023-02-15T00:40:31.6228063Z Received 1279262720 of 4156517911 (30.8%), 243.0 MBs/sec
2023-02-15T00:40:32.6320950Z Received 1593835520 of 4156517911 (38.3%), 252.1 MBs/sec
2023-02-15T00:40:33.6329582Z Received 1849688064 of 4156517911 (44.5%), 250.9 MBs/sec
2023-02-15T00:40:34.6337576Z Received 2059403264 of 4156517911 (49.5%), 244.5 MBs/sec
2023-02-15T00:40:42.4483316Z Received 2147483647 of 4156517911 (51.7%), 129.2 MBs/sec
2023-02-15T00:40:43.4502056Z Received 2336227327 of 4156517911 (56.2%), 132.2 MBs/sec
2023-02-15T00:40:44.4515849Z Received 2533359615 of 4156517911 (60.9%), 135.4 MBs/sec
2023-02-15T00:40:45.4518907Z Received 2726297599 of 4156517911 (65.6%), 137.9 MBs/sec
2023-02-15T00:40:46.4524639Z Received 2940207103 of 4156517911 (70.7%), 141.3 MBs/sec
2023-02-15T00:40:47.4536620Z Received 3112173567 of 4156517911 (74.9%), 142.3 MBs/sec
2023-02-15T00:40:48.4555027Z Received 3288334335 of 4156517911 (79.1%), 143.5 MBs/sec
2023-02-15T00:40:49.4555218Z Received 3481272319 of 4156517911 (83.8%), 145.3 MBs/sec
2023-02-15T00:40:50.4570309Z Received 3653238783 of 4156517911 (87.9%), 146.0 MBs/sec
2023-02-15T00:40:51.4567892Z Received 3837788159 of 4156517911 (92.3%), 147.3 MBs/sec
2023-02-15T00:40:52.4566377Z Received 4001366015 of 4156517911 (96.3%), 147.6 MBs/sec
2023-02-15T00:40:53.4585354Z Received 4152323607 of 4156517911 (99.9%), 147.4 MBs/sec
2023-02-15T00:40:54.4606220Z Received 4152323607 of 4156517911 (99.9%), 142.1 MBs/sec
2023-02-15T00:40:55.4619377Z Received 4152323607 of 4156517911 (99.9%), 137.2 MBs/sec
2023-02-15T00:40:56.4631692Z Received 4152323607 of 4156517911 (99.9%), 132.6 MBs/sec
2023-02-15T00:40:57.4645638Z Received 4152323607 of 4156517911 (99.9%), 128.3 MBs/sec
2023-02-15T00:40:58.4666590Z Received 4152323607 of 4156517911 (99.9%), 124.3 MBs/sec
2023-02-15T00:40:59.4672193Z Received 4152323607 of 4156517911 (99.9%), 120.5 MBs/sec
2023-02-15T00:41:00.4690192Z Received 4152323607 of 4156517911 (99.9%), 116.9 MBs/sec
2023-02-15T00:41:01.4702672Z Received 4152323607 of 4156517911 (99.9%), 113.6 MBs/sec
2023-02-15T00:41:02.4720360Z Received 4152323607 of 4156517911 (99.9%), 110.4 MBs/sec
2023-02-15T00:41:03.4732382Z Received 4152323607 of 4156517911 (99.9%), 107.4 MBs/sec
2023-02-15T00:41:04.4747315Z Received 4152323607 of 4156517911 (99.9%), 104.6 MBs/sec

@rattrayalex

I'm now seeing this on github's larger hosted runners as well:

Run actions/setup-node@v3
Found in cache @ /opt/hostedtoolcache/node/16.19.0/x64
Environment details
/usr/local/bin/yarn --version
1.22.19
/usr/local/bin/yarn cache dir
/home/runner/.cache/yarn/v6
Received 243269632 of 424308030 (57.3%), 231.3 MBs/sec
Received 420113726 of 424308030 (99.0%), 199.8 MBs/sec
…
Received 420113726 of 424308030 (99.0%), 0.3 MBs/sec
Received 420113726 of 424308030 (99.0%), 0.3 MBs/sec
Received 420113726 of 424308030 (99.0%), 0.3 MBs/sec
Error: The operation was canceled.

We've had this happen a handful of times recently. This job was https://github.com/stainless-api/stainless/actions/runs/4232471756/jobs/7352268632 if that's useful to the GHA team.
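As a stopgap for jobs where a hung restore costs more than a cold install, disabling the built-in cache (as others above have done) is an option; with actions/setup-node that just means leaving out the cache input - a sketch, with the Node version as a placeholder:

    - uses: actions/setup-node@v3
      with:
        node-version: 16
        # cache: yarn   # built-in caching left off until the hang is resolved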

@sammcj

sammcj commented Feb 21, 2023

We're seeing it on normal GitHub hosted runners now.

@kotewar
Contributor

kotewar commented Feb 22, 2023

@sammcj, we are aware of this problem on the GitHub-hosted runners and are investigating it thoroughly.

@michael-projectx

michael-projectx commented Mar 13, 2023

@adamshiervani The workflow you pointed to ran 18 days ago. Can you share anything that happened in the last week? A single instance of slowness might be an outlier, in which case an upgrade or downtime might explain it. If this happens every few days, I'd like to know more about the frequency.

We currently use BuildJet and see this fairly regularly.
Recent example here: https://github.com/figurerobotics/project-x/actions/runs/4389045495/jobs/7687049171

@Sytten

Sytten commented Mar 13, 2023

We can't see this run, @michael-projectx - you should paste the terminal output here...

@michael-projectx

@Sytten
The logs are quite long, but I'll try to post the relevant bits here:
[screenshot]

To start, spinning up the container usually takes a short amount of time, but in this case it took over an hour.

Hopefully this is helpful:
github_action.log

@vsvipul vsvipul removed their assignment Mar 31, 2023
@thinkafterbefore

thinkafterbefore commented May 25, 2023

Hey everyone,

I'm Adam, co-founder of BuildJet. Our customers often complain about reliability and speed issues like the one reported here. Today we launched BuildJet Cache, a GitHub Cache alternative that's reliable, fast and free.

It’s fully compatible with GitHub’s cache; all you need to do is replace actions/cache with buildjet/cache. It works on any runner - be it official, self-hosted, or BuildJet. Whichever runner you choose, BuildJet Cache doubles your free storage, offering a generous 20 GB per repo per week at no cost.

Head over to our launch post for more details: https://buildjet.com/for-github-actions/blog/launch-buildjet-cache
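For illustration, the swap on an existing cache step looks roughly like this (the version tag, path and key are placeholders - check the docs for the current tag):

    - name: Cache cargo registry
      uses: buildjet/cache@v3          # drop-in replacement for actions/cache@v3
      with:
        path: ~/.cargo/registry
        key: ${{ runner.os }}-cargo-${{ hashFiles('**/Cargo.lock') }}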

@Sytten

Sytten commented May 25, 2023

@thinkafterbefore Since you hijacked the post for promotion, let me push back a little.

Your cache fork is really not well done; it's just built on top of the existing cache action with a random closed-source package, https://www.npmjs.com/package/github-actions.cache-buildjet. Fix your shit please - we have enough supply-chain attacks without legit stuff looking sus like that. Ideally do proper tagging too.

I would be much more supportive of an action that actually allowed people not to be dependent on any one provider, similar to actions-cache-s3. The devs of the GitHub cache don't seem to care about that.

@thinkafterbefore

@Sytten, I hear you, and I apologize if my prior message was too promotional; my intention was simply to help people solve a long-standing issue with actions/cache.

Regarding the github-actions.cache-buildjet package, I totally get your concerns about it being closed source. To ensure full transparency, we’ve open sourced it. You can review the repository here.

Lastly, as we wrote in the launch post, if people are interested in using their own object storage, whywaita/actions-cache-s3 is perfect for that. We simply wanted to take that burden off users.

Thanks for pushing back and calling out where we need to improve.

@mfn

mfn commented May 25, 2023

I can only say: I'm also a happy BuildJet customer and a long-time sufferer of the cache issue, and I'm glad this was posted here. Probably because I'm affected, I did not find it inappropriate - quite the opposite; I'm glad I learned about this and can't wait to wake up tomorrow and try it out 👌🏼


@maxnowack

I also started my own solution: I forked the actions/cache code and modified it to place a tarball on a local volume instead of uploading/downloading the whole cache every time.
You can use it as a drop-in replacement for actions/cache@v3: https://github.com/maxnowack/local-cache
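For illustration, switching an existing step would look roughly like this (the ref, path and key are placeholders - check the repository for the supported inputs and current tag):

    - name: Cache node modules on the runner
      uses: maxnowack/local-cache@main   # assumed ref; stores the tarball on a local volume of the self-hosted runner
      with:
        path: node_modules
        key: ${{ runner.os }}-node-${{ hashFiles('**/package-lock.json') }}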


This issue is stale because it has been open for 200 days with no activity. Leave a comment to avoid closing this issue in 5 days.

@github-actions github-actions bot added the stale label Jan 15, 2024
@skjnldsv

Still an issue

@github-actions github-actions bot removed the stale label Jan 16, 2024
@dangeReis

Can we have an ENV variable that disables saving/restoring the cache (maybe 2 separate variables), so that we can turn it off when other actions use this action internally?
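In the meantime, for cache steps we control directly (not ones buried inside other actions), a rough stopgap is gating the step on a repository variable - a sketch, with DISABLE_CACHE as an assumed variable name:

    - name: Cache pip
      if: vars.DISABLE_CACHE != 'true'   # skip save/restore entirely when the repository variable is set
      uses: actions/cache@v3
      with:
        path: ~/.cache/pip
        key: ${{ runner.os }}-pip-${{ hashFiles('**/requirements.txt') }}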

@lsdm83114

lsdm83114 commented Mar 29, 2024 via email

@dangeReis

Will we?
