Increase rescue wait timeout and warn at 80 percent #3

Merged
guettli merged 1 commit into main from tg/increase-timeout on Apr 16, 2026

@guettli commented Apr 10, 2026

In observed runs, the wait-rescue step completed after about 5m30s. That is more than 90% of the existing 6m timeout, so a run looked close to timing out even though it succeeded.

Changes:

  • increased the default --timeout-wait-rescue from 6m to 8m for both check-bm-servers and create-host-yaml
  • added a ⚠️ marker to running step logs once timeout usage reaches 80%
Log of a run with the 6m timeout:

❯ go run github.com/syself/caphcli@latest check-bm-servers test/e2e/data/infrastructure-hetzner/v1beta1/bases/hetznerbaremetalhosts.yaml --name bm-e2e-1763731
WARNING: this will delete all data on disks with WWN(s): 0x500a07511756c36d
host "bm-e2e-1763731" (serverID=1763731)
Type "yes" to continue: yes
overall=0m00s destructive action confirmed for WWN(s): 0x500a07511756c36d
overall=0m00s step=load-input state=start timeout=0m30s
overall=0m00s step=load-input state=running elapsed=0m00s used=0.0% remaining=0m30s selected host "bm-e2e-1763731" (serverID=1763731)
overall=0m00s step=load-input state=running elapsed=0m00s used=0.0% remaining=0m30s loaded Robot + SSH credentials from environment
overall=0m00s step=load-input state=success duration=0m00s used=0.0% remaining=0m30s

overall=0m00s step=ensure-robot-ssh-key state=start timeout=1m00s
overall=0m00s step=ensure-robot-ssh-key state=running elapsed=0m00s used=0.5% remaining=1m00s using robot key="shared-2024-07-08" fingerprint="e5:18:e3:69:70:c9:fc:42:ba:1f:0f:eb:8a:16:7a:47"
overall=0m00s step=ensure-robot-ssh-key state=success duration=0m00s used=0.5% remaining=1m00s

overall=0m00s step=fetch-server-details state=start timeout=0m30s
overall=0m01s step=fetch-server-details state=running elapsed=0m00s used=1.4% remaining=0m30s server ip=144.76.101.50
overall=0m01s step=fetch-server-details state=success duration=0m00s used=1.4% remaining=0m30s

overall=0m01s step=pass-1-activate-rescue state=start timeout=0m45s
overall=0m02s step=pass-1-activate-rescue state=running elapsed=0m01s used=3.3% remaining=0m44s rescue mode activated
overall=0m02s step=pass-1-activate-rescue state=success duration=0m01s used=3.3% remaining=0m44s

overall=0m02s step=pass-1-reboot-to-rescue state=start timeout=0m45s
overall=0m03s step=pass-1-reboot-to-rescue state=running elapsed=0m01s used=2.2% remaining=0m44s hardware reboot requested
overall=0m03s step=pass-1-reboot-to-rescue state=success duration=0m01s used=2.2% remaining=0m44s

overall=0m03s step=pass-1-wait-rescue state=start timeout=6m00s
overall=0m08s step=pass-1-wait-rescue state=running elapsed=0m05s used=1.4% remaining=5m55s waiting for rescue ssh: failed to dial ssh (user=root host=144.76.101.50 port=22 timeout=5s): dial tcp 144.76.101.50:22: i/o timeout
overall=0m18s step=pass-1-wait-rescue state=running elapsed=0m15s used=4.2% remaining=5m45s waiting for rescue ssh: failed to dial ssh (user=root host=144.76.101.50 port=22 timeout=5s): dial tcp 144.76.101.50:22: i/o timeout
overall=0m28s step=pass-1-wait-rescue state=running elapsed=0m25s used=6.9% remaining=5m35s waiting for rescue ssh: failed to dial ssh (user=root host=144.76.101.50 port=22 timeout=5s): dial tcp 144.76.101.50:22: i/o timeout
overall=0m38s step=pass-1-wait-rescue state=running elapsed=0m35s used=9.7% remaining=5m25s waiting for rescue ssh: failed to dial ssh (user=root host=144.76.101.50 port=22 timeout=5s): dial tcp 144.76.101.50:22: i/o timeout
overall=0m48s step=pass-1-wait-rescue state=running elapsed=0m45s used=12.5% remaining=5m15s waiting for rescue ssh: failed to dial ssh (user=root host=144.76.101.50 port=22 timeout=5s): dial tcp 144.76.101.50:22: i/o timeout
overall=0m58s step=pass-1-wait-rescue state=running elapsed=0m55s used=15.3% remaining=5m05s waiting for rescue ssh: failed to dial ssh (user=root host=144.76.101.50 port=22 timeout=5s): dial tcp 144.76.101.50:22: i/o timeout
overall=1m08s step=pass-1-wait-rescue state=running elapsed=1m05s used=18.1% remaining=4m55s waiting for rescue ssh: failed to dial ssh (user=root host=144.76.101.50 port=22 timeout=5s): dial tcp 144.76.101.50:22: i/o timeout
overall=1m18s step=pass-1-wait-rescue state=running elapsed=1m15s used=20.8% remaining=4m45s waiting for rescue ssh: failed to dial ssh (user=root host=144.76.101.50 port=22 timeout=5s): dial tcp 144.76.101.50:22: i/o timeout
overall=1m28s step=pass-1-wait-rescue state=running elapsed=1m25s used=23.6% remaining=4m35s waiting for rescue ssh: failed to dial ssh (user=root host=144.76.101.50 port=22 timeout=5s): dial tcp 144.76.101.50:22: i/o timeout
...
overall=5m18s step=pass-1-wait-rescue state=running elapsed=5m15s used=87.5% remaining=0m45s waiting for rescue ssh: failed to dial ssh (user=root host=144.76.101.50 port=22 timeout=5s): dial tcp 144.76.101.50:22: i/o timeout
overall=5m23s step=pass-1-wait-rescue state=running elapsed=5m20s used=88.9% remaining=0m40s waiting for rescue ssh: failed to dial ssh (user=root host=144.76.101.50 port=22 timeout=5s): dial tcp 144.76.101.50:22: connect: connection refused
overall=5m34s step=pass-1-wait-rescue state=running elapsed=5m30s used=91.8% remaining=0m30s rescue reachable (hostname="rescue")
overall=5m34s step=pass-1-wait-rescue state=success duration=5m30s used=91.8% remaining=0m30s

overall=5m34s step=pass-1-check-disk-in-rescue state=start timeout=1m00s
overall=5m34s step=pass-1-check-disk-in-rescue state=running elapsed=0m00s used=0.6% remaining=1m00s check-disk ok: Checking WWN=0x500a07511756c36d device=sdb
check-disk passed. Provided WWNs look healthy.

0x500a07511756c36d (/dev/sdb): SMART overall-health self-assessment test result: PASSED
overall=5m34s step=pass-1-check-disk-in-rescue state=success duration=0m00s used=0.6% remaining=1m00s

overall=5m34s step=pass-1-install-ubuntu-24.04 state=start timeout=9m00s
overall=5m35s step=pass-1-install-ubuntu-24.04 state=running elapsed=0m01s used=0.1% remaining=8m59s install target devices: sdb
overall=5m35s step=pass-1-install-ubuntu-24.04 state=running elapsed=0m01s used=0.2% remaining=8m59s autosetup uploaded
overall=5m35s step=pass-1-install-ubuntu-24.04 state=running elapsed=0m01s used=0.3% remaining=8m59s post-install script uploaded
overall=5m36s step=pass-1-install-ubuntu-24.04 state=running elapsed=0m02s used=0.3% remaining=8m58s installimage files uploaded
overall=5m37s step=pass-1-install-ubuntu-24.04 state=running elapsed=0m03s used=0.6% remaining=8m57s installimage started
overall=5m46s step=pass-1-install-ubuntu-24.04 state=running elapsed=0m12s used=2.2% remaining=8m48s installimage is still running
overall=5m56s step=pass-1-install-ubuntu-24.04 state=running elapsed=0m22s used=4.1% remaining=8m38s installimage is still running
overall=6m06s step=pass-1-install-ubuntu-24.04 state=running elapsed=0m32s used=5.9% remaining=8m28s installimage is still running
overall=6m16s step=pass-1-install-ubuntu-24.04 state=running elapsed=0m42s used=7.8% remaining=8m18s installimage is still running
overall=6m26s step=pass-1-install-ubuntu-24.04 state=running elapsed=0m52s used=9.6% remaining=8m08s installimage is still running
overall=6m36s step=pass-1-install-ubuntu-24.04 state=running elapsed=1m02s used=11.5% remaining=7m58s installimage is still running
overall=6m46s step=pass-1-install-ubuntu-24.04 state=running elapsed=1m12s used=13.3% remaining=7m48s installimage is still running
overall=6m56s step=pass-1-install-ubuntu-24.04 state=running elapsed=1m22s used=15.2% remaining=7m38s installimage is still running
overall=7m06s step=pass-1-install-ubuntu-24.04 state=running elapsed=1m32s used=17.0% remaining=7m28s installimage is still running
overall=7m17s step=pass-1-install-ubuntu-24.04 state=running elapsed=1m43s used=19.1% remaining=7m17s installimage finished and marker found
overall=7m17s step=pass-1-install-ubuntu-24.04 state=success duration=1m43s used=19.1% remaining=7m17s

overall=7m17s step=pass-1-reboot-to-os state=start timeout=0m45s
overall=7m17s step=pass-1-reboot-to-os state=running elapsed=0m00s used=0.7% remaining=0m45s reboot command sent from rescue
overall=7m17s step=pass-1-reboot-to-os state=success duration=0m00s used=0.7% remaining=0m45s

overall=7m17s step=pass-1-wait-os state=start timeout=6m00s
overall=7m17s step=pass-1-wait-os state=running elapsed=0m00s used=0.1% remaining=6m00s os reachable (hostname="bm-e2e-1763731")
overall=7m17s step=pass-1-wait-os state=success duration=0m00s used=0.1% remaining=6m00s

overall=7m17s step=pass-2-activate-rescue state=start timeout=0m45s
overall=7m19s step=pass-2-activate-rescue state=running elapsed=0m02s used=3.9% remaining=0m43s rescue mode activated
overall=7m19s step=pass-2-activate-rescue state=success duration=0m02s used=3.9% remaining=0m43s

overall=7m19s step=pass-2-reboot-to-rescue state=start timeout=0m45s
overall=7m20s step=pass-2-reboot-to-rescue state=running elapsed=0m01s used=2.4% remaining=0m44s hardware reboot requested
overall=7m20s step=pass-2-reboot-to-rescue state=success duration=0m01s used=2.4% remaining=0m44s

overall=7m20s step=pass-2-wait-rescue state=start timeout=6m00s
overall=7m25s step=pass-2-wait-rescue state=running elapsed=0m05s used=1.4% remaining=5m55s waiting for rescue ssh: failed to dial ssh (user=root host=144.76.101.50 port=22 timeout=5s): dial tcp 144.76.101.50:22: i/o timeout
overall=7m35s step=pass-2-wait-rescue state=running elapsed=0m15s used=4.2% remaining=5m45s waiting for rescue ssh: failed to dial ssh (user=root host=144.76.101.50 port=22 timeout=5s): dial tcp 144.76.101.50:22: i/o timeout
overall=7m45s step=pass-2-wait-rescue state=running elapsed=0m25s used=6.9% remaining=5m35s waiting for rescue ssh: failed to dial ssh (user=root host=144.76.101.50 port=22 timeout=5s): dial tcp 144.76.101.50:22: i/o timeout
overall=7m55s step=pass-2-wait-rescue state=running elapsed=0m35s used=9.7% remaining=5m25s waiting for rescue ssh: failed to dial ssh (user=root host=144.76.101.50 port=22 timeout=5s): dial tcp 144.76.101.50:22: i/o timeout
overall=8m05s step=pass-2-wait-rescue state=running elapsed=0m45s used=12.5% remaining=5m15s waiting for rescue ssh: failed to dial ssh (user=root host=144.76.101.50 port=22 timeout=5s): dial tcp 144.76.101.50:22: i/o timeout
overall=8m15s step=pass-2-wait-rescue state=running elapsed=0m55s used=15.3% remaining=5m05s waiting for rescue ssh: failed to dial ssh (user=root host=144.76.101.50 port=22 timeout=5s): dial tcp 144.76.101.50:22: i/o timeout
overall=8m25s step=pass-2-wait-rescue state=running elapsed=1m05s used=18.1% remaining=4m55s waiting for rescue ssh: failed to dial ssh (user=root host=144.76.101.50 port=22 timeout=5s): dial tcp 144.76.101.50:22: i/o timeout
...
overall=12m30s step=pass-2-wait-rescue state=running elapsed=5m10s used=86.1% remaining=0m50s waiting for rescue ssh: failed to dial ssh (user=root host=144.76.101.50 port=22 timeout=5s): dial tcp 144.76.101.50:22: connect: connection refused
overall=12m40s step=pass-2-wait-rescue state=running elapsed=5m20s used=88.9% remaining=0m40s waiting for rescue ssh: failed to dial ssh (user=root host=144.76.101.50 port=22 timeout=5s): dial tcp 144.76.101.50:22: connect: connection refused
overall=12m51s step=pass-2-wait-rescue state=running elapsed=5m30s used=91.8% remaining=0m30s rescue reachable (hostname="rescue")
overall=12m51s step=pass-2-wait-rescue state=success duration=5m30s used=91.8% remaining=0m30s

overall=12m51s step=pass-2-check-disk-in-rescue state=start timeout=1m00s
overall=12m51s step=pass-2-check-disk-in-rescue state=running elapsed=0m00s used=0.6% remaining=1m00s check-disk ok: Checking WWN=0x500a07511756c36d device=sdc
check-disk passed. Provided WWNs look healthy.

0x500a07511756c36d (/dev/sdc): SMART overall-health self-assessment test result: PASSED
overall=12m51s step=pass-2-check-disk-in-rescue state=success duration=0m00s used=0.6% remaining=1m00s

overall=12m51s step=pass-2-install-ubuntu-24.04 state=start timeout=9m00s
overall=12m52s step=pass-2-install-ubuntu-24.04 state=running elapsed=0m01s used=0.1% remaining=8m59s install target devices: sdc
overall=12m52s step=pass-2-install-ubuntu-24.04 state=running elapsed=0m01s used=0.1% remaining=8m59s autosetup uploaded
overall=12m52s step=pass-2-install-ubuntu-24.04 state=running elapsed=0m01s used=0.2% remaining=8m59s post-install script uploaded
overall=12m53s step=pass-2-install-ubuntu-24.04 state=running elapsed=0m02s used=0.3% remaining=8m58s installimage files uploaded
overall=12m54s step=pass-2-install-ubuntu-24.04 state=running elapsed=0m03s used=0.6% remaining=8m57s installimage started
overall=13m03s step=pass-2-install-ubuntu-24.04 state=running elapsed=0m12s used=2.2% remaining=8m48s installimage is still running
overall=13m13s step=pass-2-install-ubuntu-24.04 state=running elapsed=0m22s used=4.1% remaining=8m38s installimage is still running
overall=13m23s step=pass-2-install-ubuntu-24.04 state=running elapsed=0m32s used=5.9% remaining=8m28s installimage is still running
overall=13m33s step=pass-2-install-ubuntu-24.04 state=running elapsed=0m42s used=7.8% remaining=8m18s installimage is still running
overall=13m43s step=pass-2-install-ubuntu-24.04 state=running elapsed=0m52s used=9.6% remaining=8m08s installimage is still running
overall=13m53s step=pass-2-install-ubuntu-24.04 state=running elapsed=1m02s used=11.5% remaining=7m58s installimage is still running
overall=14m03s step=pass-2-install-ubuntu-24.04 state=running elapsed=1m12s used=13.3% remaining=7m48s installimage is still running
overall=14m13s step=pass-2-install-ubuntu-24.04 state=running elapsed=1m22s used=15.2% remaining=7m38s installimage is still running
overall=14m23s step=pass-2-install-ubuntu-24.04 state=running elapsed=1m32s used=17.0% remaining=7m28s installimage is still running
overall=14m34s step=pass-2-install-ubuntu-24.04 state=running elapsed=1m43s used=19.0% remaining=7m17s installimage finished and marker found
overall=14m34s step=pass-2-install-ubuntu-24.04 state=success duration=1m43s used=19.0% remaining=7m17s

overall=14m34s step=pass-2-reboot-to-os state=start timeout=0m45s
overall=14m34s step=pass-2-reboot-to-os state=running elapsed=0m00s used=0.7% remaining=0m45s reboot command sent from rescue
overall=14m34s step=pass-2-reboot-to-os state=success duration=0m00s used=0.7% remaining=0m45s

overall=14m34s step=pass-2-wait-os state=start timeout=6m00s
overall=14m34s step=pass-2-wait-os state=running elapsed=0m00s used=0.1% remaining=6m00s os reachable (hostname="bm-e2e-1763731")
overall=14m34s step=pass-2-wait-os state=success duration=0m00s used=0.1% remaining=6m00s

overall=14m34s all checks passed: machine "bm-e2e-1763731" (serverID=1763731) completed two rescue+install+boot cycles
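
The step lines in the log above use a key=value layout, so the steps that came close to their timeout can be found mechanically. The following is a small sketch, not part of caphcli: `parseUsed` is a hypothetical helper, and it assumes that the first `step=` and `used=` fields on a line are the ones of interest. It reads a saved log from stdin and prints each step whose peak usage reached 80%.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseUsed extracts the step name and the used percentage from one log line.
// ok is false for lines that lack either field (prompts, blank lines, summaries).
func parseUsed(line string) (step string, used float64, ok bool) {
	haveUsed := false
	for _, field := range strings.Fields(line) {
		key, value, found := strings.Cut(field, "=")
		if !found {
			continue
		}
		switch key {
		case "step":
			if step == "" {
				step = value
			}
		case "used":
			if !haveUsed {
				v, err := strconv.ParseFloat(strings.TrimSuffix(value, "%"), 64)
				if err == nil {
					used, haveUsed = v, true
				}
			}
		}
	}
	return step, used, step != "" && haveUsed
}

func main() {
	peak := map[string]float64{} // highest usage seen per step
	var order []string           // steps in order of first appearance

	sc := bufio.NewScanner(os.Stdin)
	for sc.Scan() {
		step, used, ok := parseUsed(sc.Text())
		if !ok {
			continue
		}
		if prev, seen := peak[step]; !seen {
			order = append(order, step)
			peak[step] = used
		} else if used > prev {
			peak[step] = used
		}
	}
	if err := sc.Err(); err != nil {
		fmt.Fprintln(os.Stderr, "read error:", err)
		os.Exit(1)
	}

	for _, step := range order {
		if peak[step] >= 80 {
			fmt.Printf("step=%s peak-used=%.1f%%\n", step, peak[step])
		}
	}
}
```

Assuming the log was saved to a file named run.log (a hypothetical name), it could be run as `go run . < run.log`. For the log above, only the two wait-rescue steps exceed 80%.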

@guettli guettli marked this pull request as ready for review April 10, 2026 08:17
@guettli guettli merged commit eac8380 into main Apr 16, 2026
2 of 4 checks passed