
fix: recover VM availability after reconnect #1947

Merged
elibosley merged 3 commits into main from codex/fix-vm-reconnect-availability
Mar 22, 2026

Conversation

@elibosley elibosley commented Mar 22, 2026

Summary

  • make VM operations re-establish libvirt on demand instead of trusting a stale availability flag
  • retry getDomains() once when the libvirt connection drops mid-request
  • add unit coverage for stale availability state and dropped libvirt connections

Root cause

The VM service cached availability in `isVmsAvailable` and only updated that state from bootstrap and PID-file watcher events. During reconnect/bootstrap races, that flag could stay `false` even after libvirt was healthy again, so the dashboard surfaced `Failed to retrieve VM domains: VMs are not available` instead of self-healing.
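The core idea of the fix, reconnecting on demand instead of trusting the cached flag, can be sketched roughly as follows. This is a hedged illustration, not the service's actual code: `FakeHypervisor` is a stand-in for the real `@unraid/libvirt` binding, and the method shape mirrors the `ensureHypervisorAvailable()` named in this PR.

```typescript
// Hedged sketch, not the actual service code: instead of rejecting when a
// cached availability flag is false, try to (re)open the connection on demand.
// `FakeHypervisor` is a stand-in for the real @unraid/libvirt binding.
class FakeHypervisor {
    connected = false;
    async connectOpen(): Promise<void> {
        this.connected = true;
    }
}

class VmServiceSketch {
    private hypervisor: FakeHypervisor | null = null;
    private isVmsAvailable = false;

    async ensureHypervisorAvailable(): Promise<FakeHypervisor> {
        // Fast path: the cached connection is still trusted.
        if (this.isVmsAvailable && this.hypervisor) {
            return this.hypervisor;
        }
        // Slow path: the flag went stale, so reconnect instead of failing.
        const hv = new FakeHypervisor();
        await hv.connectOpen();
        this.hypervisor = hv;
        this.isVmsAvailable = true;
        return hv;
    }
}
```

With this shape, a request that arrives during a reconnect/bootstrap race repairs the connection itself rather than surfacing "VMs are not available".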

Testing

  • pnpm exec vitest run src/unraid-api/graph/resolvers/vms/vms.service.unit.spec.ts src/unraid-api/graph/resolvers/vms/vms.resolver.spec.ts
  • pnpm exec tsc --noEmit -p tsconfig.json

Summary by CodeRabbit

  • Bug Fixes

    • VM operations now recover more reliably from hypervisor connection errors (automatic single retry and consistent cleanup), improving stability during unexpected disconnects and shutdowns.
  • Refactor

    • Serialized hypervisor initialization and centralized access to prevent duplicate startups and ensure consistent cleanup across VM workflows.
  • Tests

    • Added comprehensive unit tests covering connection recovery, concurrent initialization, shutdown retrying, and cleanup scenarios.

coderabbitai bot commented Mar 22, 2026

Walkthrough

Refactors VmsService to centralize hypervisor initialization and cleanup (serialize init via initializeHypervisorOnce()), add ensureHypervisorAvailable() and resetHypervisorConnection(), make VM operations use the ensured hypervisor instance, add domain-list retry on connection errors, and add a comprehensive unit test suite validating these behaviors.

Changes

  • VMs Service Refactor — api/src/unraid-api/graph/resolvers/vms/vms.service.ts
    Serialize hypervisor init (via hypervisorInitialization/initializeHypervisorOnce()), add ensureHypervisorAvailable() and resetHypervisorConnection(), centralize domain listing with listDomains(hypervisor), update VM operations to use the returned hypervisor instance, and retry once on connection-related getDomains() failures. Cleanup paths now consistently null this.hypervisor and set this.isVmsAvailable = false.
  • Unit Tests (new) — api/src/unraid-api/graph/resolvers/vms/vms.service.unit.spec.ts
    New Vitest/Nest unit test suite with hoisted mocks for fs/promises, @unraid/libvirt (Hypervisor), chokidar.watch, and the app store path. Tests cover initialization serialization, retry on socket close, proper cleanup while a close is in progress, and shutdown polling resilience.
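The "serialize init" part of the refactor can be sketched as an in-flight promise guard: concurrent callers await one shared promise rather than racing into connectOpen(). Names here are illustrative; the PR's initializeHypervisorOnce() follows the same shape.

```typescript
// Hedged sketch of serialized initialization: concurrent callers share one
// in-flight promise instead of each running the init path in parallel.
let initCount = 0;

class InitGuard {
    private inFlight: Promise<string> | null = null;

    private async doInit(): Promise<string> {
        initCount += 1;
        // Simulate an async connectOpen() that takes a moment.
        await new Promise((resolve) => setTimeout(resolve, 10));
        return 'connection';
    }

    initializeOnce(): Promise<string> {
        if (!this.inFlight) {
            this.inFlight = this.doInit().finally(() => {
                // Clear after settling so a later failure can be retried.
                this.inFlight = null;
            });
        }
        return this.inFlight;
    }
}
```

Because the guard is cleared only after the promise settles, a failed attempt does not poison future retries, and no caller can tear down a connection another caller just established.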

Sequence Diagram

sequenceDiagram
    participant Client
    participant Service as VmsService
    participant Hypervisor
    participant FS as FS/Watcher

    Client->>Service: getDomains()
    Service->>Service: ensureHypervisorAvailable()
    Service->>Hypervisor: connectOpen()
    Hypervisor-->>Service: connected
    Service->>Service: listDomains(hypervisor)
    Service->>Hypervisor: listAllDomains()
    Hypervisor-->>Service: domains
    Service-->>Client: return domains

    Note over Hypervisor,Service: connection error on list
    Client->>Service: getDomains()
    Service->>Service: listDomains(hypervisor)
    Service->>Hypervisor: listAllDomains()
    Hypervisor--XService: connection error
    Service->>Service: resetHypervisorConnection()
    Service->>Service: ensureHypervisorAvailable() (retry)
    Service->>Hypervisor: connectOpen()
    Hypervisor-->>Service: connected
    Service->>Service: listDomains(hypervisor)
    Service->>Hypervisor: listAllDomains()
    Hypervisor-->>Service: domains
    Service-->>Client: return domains

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I hopped through code to stitch the threads,
Ensured the hypervisor wakes from beds,
Reset the ties when sockets fray,
Retry once more, then clear the way,
Tests keep watch as rabbits play.

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
  • Description Check — ✅ Passed — Check skipped: CodeRabbit’s high-level summary is enabled.
  • Title Check — ✅ Passed — The title accurately describes the main change: recovering VM availability after a reconnect by re-establishing libvirt on demand instead of relying on a cached flag.
  • Docstring Coverage — ✅ Passed — No functions found in the changed files to evaluate docstring coverage; skipping the check.



@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9d9fd8c2d5


Comment on lines +380 to +381
await this.resetHypervisorConnection();
hypervisor = await this.ensureHypervisorAvailable();


P1 Badge Avoid dropping the shared hypervisor during a list retry

If getDomains() hits this reconnect branch while a stopVm() or rebootVm() call is already inside waitForDomainShutdown(), resetHypervisorConnection() clears this.hypervisor before the replacement connection is opened. waitForDomainShutdown() reads this.hypervisor on every poll, so the in-flight mutation can start failing with Hypervisor is not initialized halfway through a shutdown/reboot. This only requires a transient connectListAllDomains failure during normal VM polling, so a harmless read-side reconnect can now break write-side VM actions.
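The hazard described above can be illustrated with a minimal sketch: a poll loop that re-reads shared state breaks when another code path resets it, while a handle captured at entry keeps working. Names here are illustrative, not the PR's actual code.

```typescript
// Hedged illustration: re-reading shared state on each poll vs capturing
// the handle once at entry. A concurrent reset breaks the former only.
interface Handle {
    alive(): boolean;
}

class Poller {
    handle: Handle | null = { alive: () => true };

    // Fragile: re-reads this.handle on every poll, so a concurrent
    // reset-style null-out makes it throw mid-operation.
    pollShared(): boolean {
        if (!this.handle) throw new Error('Hypervisor is not initialized');
        return this.handle.alive();
    }

    // Safer: capture the handle once so a reset elsewhere cannot null it
    // out from under the in-flight poll.
    pollCaptured(): boolean {
        const h = this.handle;
        if (!h) throw new Error('Hypervisor is not initialized');
        this.handle = null; // simulate a reset happening mid-poll
        return h.alive();
    }
}
```

In the PR's terms: if waitForDomainShutdown() captured its hypervisor instance at entry, a read-side reconnect in getDomains() could not fail an in-flight stop or reboot.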



codecov bot commented Mar 22, 2026

Codecov Report

❌ Patch coverage is 84.21053% with 18 lines in your changes missing coverage. Please review.
✅ Project coverage is 51.71%. Comparing base (018a8d5) to head (ba93769).
⚠️ Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
.../src/unraid-api/graph/resolvers/vms/vms.service.ts 84.21% 18 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1947      +/-   ##
==========================================
+ Coverage   51.60%   51.71%   +0.11%     
==========================================
  Files        1025     1025              
  Lines       70597    70667      +70     
  Branches     7831     7883      +52     
==========================================
+ Hits        36432    36547     +115     
+ Misses      34042    33997      -45     
  Partials      123      123              



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
api/src/unraid-api/graph/resolvers/vms/vms.service.ts (2)

371-396: ⚠️ Potential issue | 🟡 Minor

Keep retry failures inside the existing error wrapper.

Lines 381-382 return the retry directly from inside the catch, so a failed second attempt bypasses the logging and GraphQLError('Failed to get domains: ...') path below. That changes what api/src/unraid-api/graph/resolvers/vms/vms.resolver.ts around Lines 24-32 ultimately surfaces.

🔧 Suggested fix
         } catch (error: unknown) {
             if (this.isConnectionError(error)) {
                 this.logger.warn(
                     `VM domain lookup lost its libvirt connection: ${error.message}. Resetting and retrying once.`
                 );
                 await this.resetHypervisorConnection();
-                hypervisor = await this.ensureHypervisorAvailable();
-                return await this.listDomains(hypervisor);
+                try {
+                    hypervisor = await this.ensureHypervisorAvailable();
+                    return await this.listDomains(hypervisor);
+                } catch (retryError) {
+                    this.logger.error(
+                        `Failed to get domains after reconnect: ${
+                            retryError instanceof Error ? retryError.message : 'Unknown error'
+                        }`
+                    );
+                    throw new GraphQLError(
+                        `Failed to get domains: ${
+                            retryError instanceof Error ? retryError.message : 'Unknown error'
+                        }`
+                    );
+                }
             }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/src/unraid-api/graph/resolvers/vms/vms.service.ts` around lines 371 -
396, In getDomains, the catch currently returns the retry result directly which
bypasses the later logging/GraphQLError path; instead, when
isConnectionError(error) is true call resetHypervisorConnection(), re-acquire
hypervisor via ensureHypervisorAvailable(), then call listDomains(hypervisor)
but do NOT return from inside the catch—capture the result or let any error
bubble so the existing downstream logging and the final GraphQLError throw (the
existing error-handling block) still runs; adjust the flow in getDomains
(referencing ensureHypervisorAvailable, isConnectionError,
resetHypervisorConnection, and listDomains) so second-attempt failures are
handled by the same logging/throw logic rather than bypassing it.

69-80: ⚠️ Potential issue | 🔴 Critical

Serialize all hypervisor initialization paths.

Lines 71, 103, and 179 each enter initializeHypervisor() independently. If bootstrap/watch recovery overlaps a request while availability is false, both callers can race into connectOpen(), and one failure path can then tear down the connection another caller just established. That leaves the reconnect/bootstrap race in place under concurrent traffic.

🔧 Suggested direction
+    private hypervisorInitialization: Promise<void> | null = null;
+
+    private async initializeHypervisorOnce(): Promise<void> {
+        if (!this.hypervisorInitialization) {
+            this.hypervisorInitialization = (async () => {
+                try {
+                    await this.initializeHypervisor();
+                    this.isVmsAvailable = true;
+                } finally {
+                    this.hypervisorInitialization = null;
+                }
+            })();
+        }
+
+        await this.hypervisorInitialization;
+    }
+
     private async attemptHypervisorInitializationAndWatch(): Promise<void> {
         try {
-            await this.initializeHypervisor();
-            this.isVmsAvailable = true;
+            await this.initializeHypervisorOnce();
             this.logger.debug(`VMs service initialized successfully with URI: ${this.uri}`);
             await this.setupWatcher();
         } catch (error) {
             await this.resetHypervisorConnection();
@@
             .on('add', async () => {
@@
                 try {
-                    await this.initializeHypervisor();
-                    this.isVmsAvailable = true;
+                    await this.initializeHypervisorOnce();
                     this.logger.log(
                         'Hypervisor connection established successfully after PID file detection.'
                     );
                 } catch (error) {
@@
     private async ensureHypervisorAvailable(): Promise<InstanceType<typeof HypervisorClass>> {
         if (this.isVmsAvailable && this.hypervisor) {
             return this.hypervisor;
         }
 
         try {
-            await this.initializeHypervisor();
-            this.isVmsAvailable = true;
+            await this.initializeHypervisorOnce();
         } catch (error) {
             await this.resetHypervisorConnection();
             throw new GraphQLError('VMs are not available');
         }

Also applies to: 97-114, 173-190

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/src/unraid-api/graph/resolvers/vms/vms.service.ts` around lines 69 - 80,
Multiple callers (attemptHypervisorInitializationAndWatch, request handlers at
lines ~97 and ~173) race into initializeHypervisor/connectOpen causing teardown
of a connection another caller created; serialize these paths by adding a single
shared async lock or in-flight promise guard around
initializeHypervisor/connectOpen (e.g., a private initLock or
inFlightInitPromise on the VmsService class) so concurrent callers await the
same initialization instead of running it in parallel, and inside the guarded
section verify connection state (isVmsAvailable/uri or a connection object)
after acquiring the lock before performing resetHypervisorConnection or teardown
so a failing attempt does not destroy a connection established by another
concurrent initializer; apply the same guard to calls from
attemptHypervisorInitializationAndWatch, bootstrap watcher paths, and any
request entrypoints that call initializeHypervisor.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@api/src/unraid-api/graph/resolvers/vms/vms.service.ts`:
- Around line 162-170: The resetHypervisorConnection method should clear shared
state before awaiting teardown to avoid races: capture the current hypervisor
instance into a local variable (e.g., const old = this.hypervisor), immediately
set this.hypervisor = null and this.isVmsAvailable = false, then call await
old?.connectClose() inside the try/catch so overlapping calls to
ensureHypervisorAvailable() won't receive a handle that's being torn down;
retain the existing error logging for connectClose failures.
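The clear-before-teardown ordering described above can be sketched briefly. This is a hedged illustration with a simulated `connectClose`; the names are illustrative, not the service's actual code.

```typescript
// Hedged sketch: clear the shared fields synchronously, then await the
// (possibly slow) teardown on a local copy, so overlapping callers never
// receive a handle that is mid-teardown.
class ResetSketch {
    hypervisor: { connectClose(): Promise<void> } | null = {
        async connectClose() {
            /* pretend to close the libvirt socket */
        },
    };
    isVmsAvailable = true;

    async reset(): Promise<void> {
        const old = this.hypervisor;
        this.hypervisor = null; // clear shared state first ...
        this.isVmsAvailable = false;
        try {
            await old?.connectClose(); // ... then tear down the local copy
        } catch {
            // A failed close must not leave stale shared state behind.
        }
    }
}
```

Because an async function runs synchronously until its first await, the shared fields are already cleared by the time any other caller can observe the service mid-reset.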


ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: c26f0dd1-bee3-4123-a1e6-f368ae899457

📥 Commits

Reviewing files that changed from the base of the PR and between d3e0b95 and 9d9fd8c.

📒 Files selected for processing (2)
  • api/src/unraid-api/graph/resolvers/vms/vms.service.ts
  • api/src/unraid-api/graph/resolvers/vms/vms.service.unit.spec.ts

@github-actions
Contributor

This plugin has been deployed to Cloudflare R2 and is available for testing.
Download it at this URL:

https://preview.dl.unraid.net/unraid-api/tag/PR1947/dynamix.unraid.net.plg


@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (3)
api/src/unraid-api/graph/resolvers/vms/vms.service.ts (2)

426-450: Potential for partial failures during domain mapping.

Promise.all() will reject immediately if any single domain's getInfo(), getName(), or getUUIDString() fails, losing results from domains that succeeded. If one domain is in a transient state, this could cause the entire listing to fail.

♻️ Consider using Promise.allSettled for resilience
-    return Promise.all(
-        domains.map(async (domain) => {
-            const info = await domain.getInfo();
-            const name = await domain.getName();
-            const uuid = await domain.getUUIDString();
-            const state = this.mapDomainStateToVmState(info.state);
-
-            return {
-                id: uuid,
-                uuid,
-                name,
-                state,
-            };
-        })
-    );
+    const results = await Promise.allSettled(
+        domains.map(async (domain) => {
+            const info = await domain.getInfo();
+            const name = await domain.getName();
+            const uuid = await domain.getUUIDString();
+            const state = this.mapDomainStateToVmState(info.state);
+
+            return {
+                id: uuid,
+                uuid,
+                name,
+                state,
+            };
+        })
+    );
+
+    const fulfilled = results.filter(
+        (r): r is PromiseFulfilledResult<VmDomain> => r.status === 'fulfilled'
+    );
+    const rejected = results.filter((r) => r.status === 'rejected');
+
+    if (rejected.length > 0) {
+        this.logger.warn(`Failed to get info for ${rejected.length} domain(s)`);
+    }
+
+    return fulfilled.map((r) => r.value);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/src/unraid-api/graph/resolvers/vms/vms.service.ts` around lines 426 -
450, The listDomains method currently uses Promise.all which fails fast if any
domain.getInfo(), getName(), or getUUIDString() rejects; change the mapping to
use Promise.allSettled on domains.map(...) so you can collect fulfilled results
and discard or log rejected ones (include domain identifier where possible),
then return only the successfully resolved VmDomain objects; keep
mapDomainStateToVmState unchanged and ensure you log failures from
getInfo/getName/getUUIDString for debugging while still returning the other
domains.

200-210: Consider logging the original error before wrapping.

When initializeHypervisorOnce() fails, the original error is swallowed and replaced with a generic "VMs are not available" message. This could make debugging harder when the root cause isn't obvious.

🔧 Suggested improvement
     private async ensureHypervisorAvailable(): Promise<InstanceType<typeof HypervisorClass>> {
         if (this.isVmsAvailable && this.hypervisor) {
             return this.hypervisor;
         }
 
         try {
             return await this.initializeHypervisorOnce();
         } catch (error) {
+            this.logger.debug(`Hypervisor initialization failed: ${(error as Error).message}`);
             throw new GraphQLError('VMs are not available');
         }
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/src/unraid-api/graph/resolvers/vms/vms.service.ts` around lines 200 -
210, ensureHypervisorAvailable currently swallows the original error from
initializeHypervisorOnce; catch the error, log the original error (use
this.logger.error if a logger exists on the service instance, otherwise
console.error) with context like "initializeHypervisorOnce failed", and then
rethrow a GraphQLError while preserving some original details (e.g., include
error.message in the GraphQLError message or add it to GraphQLError.extensions)
so callers get the generic "VMs are not available" plus the original error info;
update the catch block in ensureHypervisorAvailable to perform the logging and
include the original error details when throwing.
api/src/unraid-api/graph/resolvers/vms/vms.service.unit.spec.ts (1)

220-246: Good test: Validates shutdown polling uses captured hypervisor.

This test ensures that waitForDomainShutdown continues polling with the original hypervisor instance even when shared state is reset mid-operation. Proper use of fake timers.

One minor note: ensure vi.useRealTimers() is called even if the test fails to avoid affecting other tests.

♻️ Consider using afterEach for timer cleanup

You could add an afterEach hook to ensure timers are always restored:

afterEach(() => {
    vi.useRealTimers();
});

This would make the cleanup more robust if other tests start using fake timers.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/src/unraid-api/graph/resolvers/vms/vms.service.unit.spec.ts` around lines
220 - 246, The test "finishes shutdown polling with the original hypervisor even
if shared state is reset" uses vi.useFakeTimers() but only restores timers at
the end of the test, which can leak if the test fails; add a global cleanup to
always restore real timers (e.g., an afterEach that calls vi.useRealTimers()) so
any test that uses vi.useFakeTimers() (including this spec and helpers around
service.onApplicationBootstrap, service.stopVm, waitForDomainShutdown) will not
leave fake timers active across tests.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: f1f73557-700f-4cf6-ba69-58288d89b806

📥 Commits

Reviewing files that changed from the base of the PR and between 9d9fd8c and ba93769.

📒 Files selected for processing (2)
  • api/src/unraid-api/graph/resolvers/vms/vms.service.ts
  • api/src/unraid-api/graph/resolvers/vms/vms.service.unit.spec.ts

@elibosley elibosley merged commit e064de7 into main Mar 22, 2026
13 checks passed
@elibosley elibosley deleted the codex/fix-vm-reconnect-availability branch March 22, 2026 18:22
elibosley pushed a commit that referenced this pull request Mar 23, 2026
🤖 I have created a release *beep* *boop*
---


## [4.31.0](v4.30.1...v4.31.0) (2026-03-23)


### Features

* **api:** support encrypted array start inputs
([#1944](#1944))
([018a8d5](018a8d5))
* **onboarding:** add shared loading states
([#1945](#1945))
([776c8cc](776c8cc))
* Serverside state for onboarding display
([#1936](#1936))
([682d51c](682d51c))


### Bug Fixes

* **api:** reconcile emhttp state without spinning disks
([#1946](#1946))
([d3e0b95](d3e0b95))
* **onboarding:** auto-open incomplete onboarding on 7.3+
([#1940](#1940))
([f0241a8](f0241a8))
* **onboarding:** replace internal boot native selects
([#1942](#1942))
([d6ea032](d6ea032))
* preserve onboarding resume state on reload
([#1941](#1941))
([91f7fe9](91f7fe9))
* recover VM availability after reconnect
([#1947](#1947))
([e064de7](e064de7))
* Unify callback server payloads
([#1938](#1938))
([f58fcc0](f58fcc0))

---
This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>