feat: runs-on rollout part 1 (DX-365) #17026

Merged
erikburt merged 2 commits into develop from feat/runs-on-rollout-1
Apr 3, 2025

Conversation

erikburt (Collaborator) commented Mar 29, 2025

Initial rollout of runs-on self-hosted runners.

After this PR is merged, only PRs with the runs-on label will use self-hosted runners. The incremental rollout will begin after that.

This changes the build cache key, but it should have little to no effect on existing PRs once the proper caches on develop are populated.

Changes

  1. Update runner labels to use inline notation for runners (for now)
  2. Update the build cache to use a short SHA in the key
    • Our build caches can grow stale, but with the current key format we never upsert/overwrite them. With unlimited cache on the S3-backed self-hosted runners, we can effectively upsert/overwrite our caches every time a new ref is pushed.
    • Currently, for PRs that accumulate changes over time, the build cache stays synced to the first time the workflow was run.
  3. Add a 20-minute timeout to our unit tests. Anything running longer than 20 minutes likely means something is hanging, and the job should fail.
    • The timeout is higher for scheduled runs because race tests are extended during these.
  4. Add opt-in / opt-out labels for self-hosted runners
    • If a PR adds the runs-on label, it will use self-hosted runners
    • Once we start rolling this out across PRs, a PR with the runs-on-opt-out label will use GitHub runners
  5. Do not run the SonarQube scan if the workflow was cancelled.
    • This job was configured with always(), which means it would still run even if the workflow was cancelled, and that was annoying.
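
As a rough illustration of changes 2 and 4, the sketch below models the label-based runner choice and a cache key that includes a short SHA. It is not the workflow code from this PR. The function names, the `rollout_enabled` flag, the key layout, the 7-character SHA length, and the rule that the opt-out label wins when both labels are present are all assumptions made for the example.

```python
from typing import Iterable

OPT_IN_LABEL = "runs-on"           # label named in this PR
OPT_OUT_LABEL = "runs-on-opt-out"  # label named in this PR


def use_self_hosted(pr_labels: Iterable[str], rollout_enabled: bool = False) -> bool:
    """Decide whether a PR's jobs should use self-hosted runners.

    Assumption: the opt-out label takes precedence over everything else.
    `rollout_enabled` is a hypothetical flag standing in for the later,
    incremental rollout phase.
    """
    labels = set(pr_labels)
    if OPT_OUT_LABEL in labels:
        return False
    if OPT_IN_LABEL in labels:
        return True
    return rollout_enabled


def build_cache_key(prefix: str, ref_name: str, sha: str, short_len: int = 7) -> str:
    """Build a cache key that changes whenever a new commit is pushed.

    The key layout (prefix-ref-shortsha) is hypothetical; the point is only
    that including the short SHA yields a fresh key per pushed ref.
    """
    return f"{prefix}-{ref_name}-{sha[:short_len]}"
```

For example, `use_self_hosted(["runs-on"])` selects self-hosted runners during the initial phase, while an unlabeled PR stays on GitHub runners until `rollout_enabled` is switched on.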

Notes

There are some problems with the self-hosted runners. Although they are rare, they are possible:

  1. We use spot instances when possible, and spot instances can be interrupted, so sometimes a job will fail.
    • We have automatic retries enabled, so a job interrupted by a spot instance termination will be retried automatically (this creates a problem; see point 2)
    • I have seen this a few times in testing. If it happens too often, there are ways to reduce the occurrence.
    • e.g. only use on-demand instances for the merge queue so that crucial workflows aren't interrupted.
  2. Automatic retries will interrupt new workflow runs:
    • If a job is cancelled due to a spot instance termination and another commit is pushed, runs-on will retry the already-failed job, which will cause the newer workflow to fail.
    • This is a very rare edge case that shouldn't come up often. If it does, there are ways to disable automatic retries.
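
The mitigation mentioned under point 1 (on-demand instances for the merge queue) could be modelled roughly as below. This is a sketch under assumptions, not something this PR implements: the function name and the returned strings are hypothetical. `merge_group` is the GitHub Actions event name for merge queue runs.

```python
MERGE_QUEUE_EVENT = "merge_group"  # GitHub Actions event name for merge queue runs


def capacity_type(event_name: str) -> str:
    """Return 'on-demand' for merge queue runs and 'spot' for everything else.

    The return values are illustrative names, not runs-on configuration syntax.
    """
    return "on-demand" if event_name == MERGE_QUEUE_EVENT else "spot"
```

With this rule, ordinary PR runs keep the cheaper spot capacity, and only merge queue runs pay for uninterrupted instances.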

Testing

I have run a ton of tests using this PR / branch:


DX-365

@erikburt erikburt self-assigned this Mar 29, 2025
@erikburt erikburt force-pushed the feat/runs-on-rollout-1 branch 4 times, most recently from 7673ae3 to 3e4b041 Compare April 1, 2025 18:30
@erikburt erikburt changed the title feat: runs-on rollout part 1 feat: runs-on rollout part 1 (DX-365) Apr 2, 2025
@erikburt erikburt force-pushed the feat/runs-on-rollout-1 branch from 6c4e984 to e3358e1 Compare April 2, 2025 18:51
@erikburt erikburt added the runs-on opt-in to self-hosted runners for certain jobs label Apr 2, 2025
@erikburt erikburt force-pushed the feat/runs-on-rollout-1 branch from aa8ea77 to 2d0f9e7 Compare April 2, 2025 22:04
@erikburt erikburt removed the runs-on opt-in to self-hosted runners for certain jobs label Apr 2, 2025
@erikburt erikburt force-pushed the feat/runs-on-rollout-1 branch from 2d0f9e7 to 781ea32 Compare April 2, 2025 22:19
@erikburt erikburt marked this pull request as ready for review April 2, 2025 22:30
@erikburt erikburt requested review from a team as code owners April 2, 2025 22:30
@erikburt erikburt added the runs-on opt-in to self-hosted runners for certain jobs label Apr 3, 2025
@erikburt erikburt force-pushed the feat/runs-on-rollout-1 branch from 8f26922 to dc7429e Compare April 3, 2025 00:19
@cl-sonarqube-production

Quality Gate passed

Issues
0 New issues
0 Fixed issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
No data about Duplication

See analysis details on SonarQube

sebawo (Contributor) commented Apr 3, 2025

@kalverra can you use your tool to compare execution times between hosted runners and self-hosted runners + unlimited cache?

sebawo (Contributor) commented Apr 3, 2025

@erikburt Do I understand correctly that in this last run (https://github.com/smartcontractkit/chainlink/actions/runs/14230478702/job/39879953171?pr=17026) we leveraged the cache the most, and that this is why Core Tests (go_core_tests) ran in 5 min 17 sec, while in the first run (https://github.com/smartcontractkit/chainlink/actions/runs/14227410891/job/39870270989?pr=17026) it took 16 min 26 sec?

erikburt (Collaborator, Author) commented Apr 3, 2025

> @erikburt do i understand correctly that in this last run (smartcontractkit/chainlink/actions/runs/14230478702/job/39879953171?pr=17026) we leveraged the cache the most and this is why Core Tests (go_core_tests) run in 5min 17 sec and in the first run (smartcontractkit/chainlink/actions/runs/14227410891/job/39870270989?pr=17026) it was 16min 26 sec?

Yes. The first run had no cache, but it was also abnormally slow. There was probably a hanging test or something that had to be retried.

ccip deployment tests have seen almost no runtime improvement, and I don't understand why. I plan on digging into this more.

@erikburt erikburt added runs-on-opt-out opt-out to self-hosted runners for certain jobs and removed runs-on-opt-out opt-out to self-hosted runners for certain jobs labels Apr 3, 2025
@erikburt erikburt added this pull request to the merge queue Apr 3, 2025
Merged via the queue into develop with commit 90324d3 Apr 3, 2025
122 of 127 checks passed
@erikburt erikburt deleted the feat/runs-on-rollout-1 branch April 3, 2025 20:20
@erikburt erikburt mentioned this pull request Apr 3, 2025

Labels

runs-on opt-in to self-hosted runners for certain jobs


4 participants
