
Fix CI #389

Merged
merged 88 commits into master from fix-ci
Jan 25, 2024

Conversation

timothystewart6
Contributor

@timothystewart6 timothystewart6 commented Oct 29, 2023

Proposed Changes

I don't even know where to begin 😅

  • GitHub no longer seems to allow us to use nested virtualization on the macos-12 image
  • We need nested virtualization in order to create VMs during CI to run our playbook
  • Nested virtualization isn't enabled on macos-13
  • They claim it's enabled for large runner instances
    • These are paid
    • They are Ubuntu, not macOS
    • I signed up, tested, and it didn't seem to work
  • Rewrote the playbook to work with Linux (Ubuntu)
  • I ended up self-hosting a VM myself
    • Nested virtualization (see the verification sketch after this list)
    • 16 cores
    • 64 GB RAM
    • 100 GB SSD
    • Due to nested virtualization it's still slow, but stable
    • Another downside is that you can only run one job at a time on a self-hosted runner. GHA really needs to start using containers like every other CI out there instead of stateful VMs.
      • If this becomes a problem I will break the one I have into 3. It doesn't use the resources I gave it very efficiently right now, probably due to nested virtualization.
  • I ended up upgrading all dependencies while trying to make this work with GitHub's macOS images
  • Our days were numbered with macOS anyway; VirtualBox isn't supporting Apple Silicon
  • I should switch to Vagrant + VirtIO at some point.
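Since nested virtualization is the whole reason for the self-hosted box, a minimal preflight sketch like the one below could confirm the runner actually exposes it before Vagrant tries to boot anything. The job and step names here are illustrative, not the real workflow:

```yaml
jobs:
  preflight:
    runs-on: self-hosted
    steps:
      - name: Verify nested virtualization is available
        run: |
          # The CPU inside the runner VM must advertise VT-x/AMD-V;
          # grep exits non-zero (failing the step) if neither flag is present.
          grep -E -c '(vmx|svm)' /proc/cpuinfo
          # /dev/kvm must exist and be usable by the runner user.
          ls -l /dev/kvm
```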

I also used this PR to rework the CI steps and caching. While troubleshooting this CI job I found that it was better to cache things up front in a pre step instead of letting each step decide; downstream steps only restore the cache. It also seemed like a race condition that all 3 molecule tests could each download all of the VMs and then all try to save them. Caching them up front and forcing downstream jobs to only download ensures there's no unnecessary save later.
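Roughly, the shape is the sketch below: one job downloads the boxes and lets the cache action save them, and the Molecule jobs only restore. The job names, cache key, and box name are placeholders rather than the exact workflow:

```yaml
jobs:
  prepare-cache:
    runs-on: self-hosted
    steps:
      - uses: actions/checkout@v4
      - name: Cache Vagrant boxes (the save happens in this job's post step)
        uses: actions/cache@v3
        with:
          path: ~/.vagrant.d/boxes
          key: vagrant-boxes-${{ hashFiles('**/molecule.yml') }}
      - name: Pre-download the box (placeholder box name)
        run: vagrant box add generic/ubuntu2204 --provider virtualbox || true

  molecule:
    needs: prepare-cache
    runs-on: self-hosted
    steps:
      - uses: actions/checkout@v4
      - name: Restore Vagrant boxes (restore only, never saves)
        uses: actions/cache/restore@v3
        with:
          path: ~/.vagrant.d/boxes
          key: vagrant-boxes-${{ hashFiles('**/molecule.yml') }}
```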

All of this should work. I am going to merge it shortly after the tests pass (again) so that we can get caught up on PRs.

Checklist

  • Tested locally
  • Ran site.yml playbook
  • Ran reset.yml playbook
  • Did not add any unnecessary changes
  • Ran pre-commit install at least once before committing
  • 🚀

@timothystewart6 timothystewart6 changed the title Fix ci Fix CI Oct 29, 2023
@timothystewart6 timothystewart6 enabled auto-merge (squash) October 29, 2023 20:47
@sleiner
Contributor

sleiner commented Nov 1, 2023

@timothystewart6 I have just seen this - and I am not convinced that this would actually improve CI in its current form:

  • Why would running the Molecule jobs serially improve reliability? We are using GitHub-hosted runners, so we have no control over the actual machine that the jobs run on. After checking the docs, I am very confident that "scheduling jobs from the same workflow" is not a feature that GitHub Actions currently has. You can pin self-hosted runners to an organization or repo, or require certain labels for specific jobs, so if we had a self-hosted runner, we could set GitHub Actions up to use only that one and set the runner up to only accept one job at a time. But for GitHub-hosted runners, everything goes into the same pool of shared runners, so limiting our concurrency does not help us. Instead, as far as I can see, it only increases the time until all CI results are actually available.
  • Concerning the Vagrant cache: I added this because (at the time at least), the GitHub runners were frequently (~30-50% of the time) getting HTTP 429: Too many requests responses from Vagrant cloud. So the cache improved reliability (and we were better citizens towards Vagrant Cloud) at the time. If that problem is not acute anymore, it is safe to remove the cache IMO.

@timothystewart6 timothystewart6 added the do not merge Do not merge this PR label Nov 1, 2023
@timothystewart6
Contributor Author

@timothystewart6 I have just seen this - and I am not convinced that this would actually improve CI in its current form:

  • Why would running the Molecule jobs serially improve reliability? We are using GitHub-hosted runners, so we have no control over the actual machine that the jobs run on. After checking the docs, I am very confident that "scheduling jobs from the same workflow" is not a feature that GitHub Actions currently has. You can pin self-hosted runners to an organization or repo, or require certain labels for specific jobs, so if we had a self-hosted runner, we could set GitHub Actions up to use only that one and set the runner up to only accept one job at a time. But for GitHub-hosted runners, everything goes into the same pool of shared runners, so limiting our concurrency does not help us. Instead, as far as I can see, it only increases the time until all CI results are actually available.
  • Concerning the Vagrant cache: I added this because (at the time at least), the GitHub runners were frequently (~30-50% of the time) getting HTTP 429: Too many requests responses from Vagrant cloud. So the cache improved reliability (and we were better citizens towards Vagrant Cloud) at the time. If that problem is not acute anymore, it is safe to remove the cache IMO.

Thanks @sleiner, I am coming to the same conclusion too. This was more of a debugging experiment to track down build errors. Our error seems to be a VirtualBox error.

  • First was to run jobs serially to see if concurrency was an issue, since we are using hosted runners. My suspicion was that we might be scheduled on the same host and somehow there was a conflict with two VirtualBox instances running.
  • Next was to remove build caches (I manually cleared them from the runners)
  • Then removed the cache job from CI to see if we might have cached a bad state for the Vagrant VMs.

This PR will not be merged yet; I am just working through some of my suspicions, and I have a few more to chase down.

VirtualBox Error

 fatal: [localhost]: FAILED! => {"changed": false, "cmd": ["/usr/local/bin/vagrant", "up", "--no-provision"], "msg": "Failed to start the VM(s): See log file '/Users/runner/.cache/molecule/k3s-ansible/default/vagrant.err'", "rc": 1, "stderr": "### 2023-10-30 21:06:38 ###\n### 2023-10-30 21:06:38 ###\n### 2023-10-30 21:06:38 ###\n### 2023-10-30 21:06:38 ###\n### 2023-10-30 21:06:38 ###\n### 2023-10-30 21:06:55 ###\n### 2023-10-30 21:06:55 ###\n### 2023-10-30 21:06:55 ###\n### 2023-10-30 21:06:55 ###\n### 2023-10-30 21:06:55 ###\n### 2023-10-30 21:06:55 ###\nThe guest machine entered an invalid state while waiting for it\nto boot. Valid states are 'starting, running'. The machine is in the\n'gurumeditation' state. Please verify everything is configured\nproperly and try again.\n\nIf the provider you're using has a GUI that comes with it,\nit is often helpful to open that and watch the machine, since the\nGUI often has more helpful error messages than Vagrant can retrieve.\nFor example, if you're using VirtualBox, run `vagrant up` while the\nVirtualBox GUI is open.\n\nThe primary issue for this error is that the provider you're using\nis not properly configured. This is very rarely a Vagrant issue.\n", "stderr_lines": ["### 2023-10-30 21:06:38 ###", "### 2023-10-30 21:06:38 ###", "### 2023-10-30 21:06:38 ###", "### 2023-10-30 21:06:38 ###", "### 2023-10-30 21:06:38 ###", "### 2023-10-30 21:06:55 ###", "### 2023-10-30 21:06:55 ###", "### 2023-10-30 21:06:55 ###", "### 2023-10-30 21:06:55 ###", "### 2023-10-30 21:06:55 ###", "### 2023-10-30 21:06:55 ###", "The guest machine entered an invalid state while waiting for it", "to boot. Valid states are 'starting, running'. The machine is in the", "'gurumeditation' state. Please verify everything is configured", "properly and try again.", "", "If the provider you're using has a GUI that comes with it,", "it is often helpful to open that and watch the machine, since the", "GUI often has more helpful error messages than Vagrant can retrieve.", "For example, if you're using VirtualBox, run `vagrant up` while the", "VirtualBox GUI is open.", "", "The primary issue for this error is that the provider you're using", "is not properly configured. This is very rarely a Vagrant issue."]}

@sleiner
Contributor

sleiner commented Nov 1, 2023

From the error message that you posted:

The guest machine entered an invalid state while waiting for it
to boot. Valid states are 'starting, running'. The machine is in the
'gurumeditation' state. Please verify everything is configured
properly and try again.

That's a trippy error message if I've ever seen one 😅

If the provider you're using has a GUI that comes with it,
it is often helpful to open that and watch the machine, since the
GUI often has more helpful error messages than Vagrant can retrieve.
For example, if you're using VirtualBox, run vagrant up while the
VirtualBox GUI is open.

Hmm, accessing the GUI will not be possible, but maybe we can get the actual VirtualBox logs out?
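One way that might work: upload the relevant log files as an artifact whenever a job fails. A rough sketch (the paths are my best guess at where VirtualBox and Molecule write their logs on the macOS runners):

```yaml
jobs:
  molecule:
    runs-on: macos-12
    steps:
      # ... existing molecule steps ...
      - name: Upload VirtualBox / Vagrant logs on failure
        if: failure()
        uses: actions/upload-artifact@v3
        with:
          name: virtualbox-logs
          path: |
            ~/VirtualBox VMs/*/Logs/VBox.log
            ~/.cache/molecule/**/vagrant.err
```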

@sleiner
Contributor

sleiner commented Nov 1, 2023

@timothystewart6 btw, if you are testing runner types, it might be worthwhile to check out macos-13. You even get 33% more cores (4 instead of 3) 😄

@timothystewart6
Contributor Author

timothystewart6 commented Nov 1, 2023

@sleiner good eye! I only saw latest/12 and tested that. I will try to get more logs out of VirtualBox later too!

@timothystewart6
Contributor Author

timothystewart6 commented Nov 1, 2023

Ah, vagrant doesn't seem to be installed on macos-13; we might need to brew install it. I feel like at that point the runner should just be on Ubuntu, since that makes a self-hosted version possible at some point (unless I buy a Mac mini 😅).
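If we do stay on macOS, installing the missing tooling would presumably be a setup step along these lines (the cask names are from memory, so treat them as an assumption):

```yaml
      - name: Install Vagrant and VirtualBox
        run: |
          # vagrant is not preinstalled on the macos-13 image, so install it
          # (and VirtualBox, in case that is missing too) via Homebrew casks.
          brew install --cask vagrant virtualbox
```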

@sleiner
Contributor

sleiner commented Nov 1, 2023

at that point the runner should just be on Ubuntu, since that makes a self-hosted version possible at some point (unless I buy a Mac mini 😅)

In principle, porting this to Ubuntu is trivial (since Molecule, Vagrant and VirtualBox all work on Ubuntu too). In GitHub Actions, however, we are using macOS because, at the time, these were the only GitHub-hosted runners with support for nested virtualization. In the meantime, GitHub seems to have added KVM support to large runners (i.e., the ones that are not free to use). So as far as I can see, with the free GitHub-hosted runners, macOS is the only viable option.

@timothystewart6
Contributor Author

Yeah, I just read through all of those threads. It seems like enabling it on macOS was an oversight. I wonder if that's why vagrant isn't installed on 13; maybe they "fixed the bug" 😆. I'll keep plowing through this and figure it out!

@joaopedrocg27
Contributor

joaopedrocg27 commented Dec 7, 2023

Hey

First of all, let me say that I am not familiar with Molecule, but I tried to run the test workflow on my fork and it looks like this is facing some resource starvation.

One of the steps ran without an issue: Run Link

Are we provisioning several VMs at the same time? If so, could we split those workloads? (At least just to test.)
I am getting a Disconnect error, which means that the runner service died. I've had this happen when the host ran out of memory or had high CPU load (we fixed those with a higher nice value, but I doubt that GitHub-managed runners allow that easily).

It could also be the cache, since it only worked the first time.

Hope that helps

@joaopedrocg27
Contributor

joaopedrocg27 commented Dec 7, 2023

According to this:

"It's getting a triple fault error, which possibly means disk corruption"

So another vote for the cache being the problem.

@timothystewart6
Contributor Author

I think I found the answer: we are running out of resources during CI. Reducing the allocation seemed to fix it on the first attempt. I have a feeling that sometimes we would get lucky and land on a macOS machine that had enough resources to process our CI; most other times it would not.

Similar Issue (it's not just us)
actions/runner-images#8730

After that change, it worked on the first try.
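For context, "reducing allocation" means shrinking the per-VM resources in molecule.yml. With the Vagrant driver it is something along these lines (the box name and exact numbers are illustrative):

```yaml
platforms:
  - name: control1
    box: generic/ubuntu2204   # placeholder box name
    memory: 2048              # keep the total across all VMs well under the runner's RAM
    cpus: 1
```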

@timothystewart6
Contributor Author

I am most likely going to add some additional changes to this job.

  1. Create a pre-step that caches dependencies. Caching them while the jobs run in parallel is a race condition.

  2. Update to the latest macOS runner version. This might mean installing Vagrant and VirtualBox ourselves. Details of the image are here.

  3. I might consider running these serially to limit runner usage (see the sketch after this list). This does not change anything functionally; it is really about being a good neighbor. Also, if one fails, there's no need to run the others.
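For item 3, a minimal sketch of running the matrix one scenario at a time (the scenario names are placeholders):

```yaml
jobs:
  molecule:
    strategy:
      fail-fast: true    # if one scenario fails, cancel the remaining ones
      max-parallel: 1    # run the Molecule scenarios one at a time
      matrix:
        scenario: [default, ipv6, single_node]
    runs-on: macos-12
    steps:
      - uses: actions/checkout@v4
      - run: molecule test --scenario-name ${{ matrix.scenario }}
```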

@timothystewart6 timothystewart6 removed the do not merge Do not merge this PR label Jan 21, 2024
@timothystewart6
Contributor Author

timothystewart6 commented Jan 23, 2024

After working on this for about 2 days straight, I think I finally understand what's going on.

actions/runner-images#8642

Turns out nested virtualization was working on some macos-12 images, but they've slowly been replacing them with new images. This explains why sometimes our builds would pass: we got lucky and landed on one of the old images that supported it.

Turns out nested virtualization is only supported on their large runners, which are a paid Teams feature. I will look into the cost, and I will also consider running self-hosted runners. CI for this is pretty important, but I am also not sure I can afford GitHub Teams when I am a solo developer.

@timothystewart6
Contributor Author

I signed up for Teams, converted to Ubuntu, and provisioned a large runner, and I am still seeing the same thing. That thread has so many conflicting comments that it's hard to know what's right. Someone also mentioned that, after the update to the Ubuntu runners the other day, it now works on ubuntu-latest too; I am testing that as well. If this all fails I will self-host a runner with nested virtualization turned on.

@timothystewart6
Contributor Author

Good news: my self-hosted runners are working! I am going to clean up this PR before merging, and I will also update the summary.
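For anyone following along, the workflow-side change boils down to the runs-on target; the labels below are placeholders for whatever the runners were registered with:

```yaml
jobs:
  molecule:
    # Only a runner registered with all of these labels will pick up the job.
    runs-on: [self-hosted, linux, x64, nested-virt]
```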

@timothystewart6
Contributor Author

timothystewart6 commented Jan 25, 2024

Updated PR description to reflect everything that I did. Going to merge after tests pass.

Here it is in action!

[screenshot]

@timothystewart6 timothystewart6 merged commit e2e9881 into master Jan 25, 2024
6 checks passed
@timothystewart6 timothystewart6 deleted the fix-ci branch January 25, 2024 04:26
@timothystewart6
Contributor Author

Just reporting back: we now have 3 self-hosted runners, so jobs can run concurrently.

@timothystewart6
Contributor Author

[screenshot]
