fix: recover VM availability after reconnect #1947
Conversation
Walkthrough

Refactors VmsService to centralize hypervisor initialization and cleanup (serialize init via `initializeHypervisorOnce()`).

Changes
Sequence Diagram

```mermaid
sequenceDiagram
    participant Client
    participant Service as VmsService
    participant Hypervisor
    participant FS as FS/Watcher
    Client->>Service: getDomains()
    Service->>Service: ensureHypervisorAvailable()
    Service->>Hypervisor: connectOpen()
    Hypervisor-->>Service: connected
    Service->>Service: listDomains(hypervisor)
    Service->>Hypervisor: listAllDomains()
    Hypervisor-->>Service: domains
    Service-->>Client: return domains
    Note over Hypervisor,Service: connection error on list
    Client->>Service: getDomains()
    Service->>Service: listDomains(hypervisor)
    Service->>Hypervisor: listAllDomains()
    Hypervisor--xService: connection error
    Service->>Service: resetHypervisorConnection()
    Service->>Service: ensureHypervisorAvailable() (retry)
    Service->>Hypervisor: connectOpen()
    Hypervisor-->>Service: connected
    Service->>Service: listDomains(hypervisor)
    Service->>Hypervisor: listAllDomains()
    Hypervisor-->>Service: domains
    Service-->>Client: return domains
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ Passed checks (3 passed)
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9d9fd8c2d5
```ts
await this.resetHypervisorConnection();
hypervisor = await this.ensureHypervisorAvailable();
```
Avoid dropping the shared hypervisor during a list retry
If `getDomains()` hits this reconnect branch while a `stopVm()` or `rebootVm()` call is already inside `waitForDomainShutdown()`, `resetHypervisorConnection()` clears `this.hypervisor` before the replacement connection is opened. `waitForDomainShutdown()` reads `this.hypervisor` on every poll, so the in-flight mutation can start failing with `Hypervisor is not initialized` halfway through a shutdown/reboot. This only requires a transient `connectListAllDomains` failure during normal VM polling, so a harmless read-side reconnect can now break write-side VM actions.
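The failure mode can be shown with a small self-contained sketch (hypothetical names, not the service's actual code): a poll loop that re-reads shared connection state on every tick dies when a concurrent reset nulls that state, while a loop that captures the handle at entry completes regardless.

```typescript
// Hypothetical sketch, not the service's real code: a poll loop that
// re-reads shared state each tick vs. one that captures the handle at entry.
type Conn = { isActive(): boolean };

class VmServiceSketch {
    conn: Conn | null = { isActive: () => true };

    // Fragile: reads this.conn on every tick, so a concurrent reset kills it.
    async pollShared(ticks: number): Promise<string> {
        for (let i = 0; i < ticks; i++) {
            if (!this.conn) throw new Error('Hypervisor is not initialized');
            this.conn.isActive();
            await Promise.resolve(); // stand-in for the poll delay
        }
        return 'shutdown complete';
    }

    // Robust: captures the handle once; a reset of shared state cannot break it.
    async pollCaptured(ticks: number): Promise<string> {
        const conn = this.conn;
        if (!conn) throw new Error('Hypervisor is not initialized');
        for (let i = 0; i < ticks; i++) {
            conn.isActive();
            await Promise.resolve();
        }
        return 'shutdown complete';
    }

    // What a read-side reconnect does to the shared field mid-operation.
    reset(): void {
        this.conn = null;
    }
}
```

Starting either poll and then calling `reset()` before it finishes makes `pollShared` reject with `Hypervisor is not initialized`, while `pollCaptured` still resolves.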
Codecov Report

❌ Patch coverage is

Additional details and impacted files

```
@@           Coverage Diff            @@
##             main    #1947    +/-  ##
========================================
+ Coverage   51.60%   51.71%   +0.11%
========================================
  Files        1025     1025
  Lines       70597    70667      +70
  Branches     7831     7883      +52
========================================
+ Hits        36432    36547     +115
+ Misses      34042    33997      -45
  Partials      123      123
```

☔ View full report in Codecov by Sentry.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
api/src/unraid-api/graph/resolvers/vms/vms.service.ts (2)
371-396: ⚠️ Potential issue | 🟡 Minor — Keep retry failures inside the existing error wrapper.

Lines 381-382 return the retry directly from inside the `catch`, so a failed second attempt bypasses the logging and `GraphQLError('Failed to get domains: ...')` path below. That changes what `api/src/unraid-api/graph/resolvers/vms/vms.resolver.ts` (around lines 24-32) ultimately surfaces.

🔧 Suggested fix

```diff
         } catch (error: unknown) {
             if (this.isConnectionError(error)) {
                 this.logger.warn(
                     `VM domain lookup lost its libvirt connection: ${error.message}. Resetting and retrying once.`
                 );
                 await this.resetHypervisorConnection();
-                hypervisor = await this.ensureHypervisorAvailable();
-                return await this.listDomains(hypervisor);
+                try {
+                    hypervisor = await this.ensureHypervisorAvailable();
+                    return await this.listDomains(hypervisor);
+                } catch (retryError) {
+                    this.logger.error(
+                        `Failed to get domains after reconnect: ${
+                            retryError instanceof Error ? retryError.message : 'Unknown error'
+                        }`
+                    );
+                    throw new GraphQLError(
+                        `Failed to get domains: ${
+                            retryError instanceof Error ? retryError.message : 'Unknown error'
+                        }`
+                    );
+                }
             }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@api/src/unraid-api/graph/resolvers/vms/vms.service.ts` around lines 371 - 396, In getDomains, the catch currently returns the retry result directly which bypasses the later logging/GraphQLError path; instead, when isConnectionError(error) is true call resetHypervisorConnection(), re-acquire hypervisor via ensureHypervisorAvailable(), then call listDomains(hypervisor) but do NOT return from inside the catch—capture the result or let any error bubble so the existing downstream logging and the final GraphQLError throw (the existing error-handling block) still runs; adjust the flow in getDomains (referencing ensureHypervisorAvailable, isConnectionError, resetHypervisorConnection, and listDomains) so second-attempt failures are handled by the same logging/throw logic rather than bypassing it.
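The control flow this comment asks for can be sketched generically (hypothetical helper names, not the service's real API): the retry path funnels its own failure through the same terminal wrapper as the first attempt, so both failure modes surface identically.

```typescript
// Hypothetical retry-once wrapper (not the PR's actual code): a failed
// second attempt is wrapped exactly like a failed first attempt.
class WrappedError extends Error {}

async function getItemsWithRetry(
    list: () => Promise<string[]>,
    reconnect: () => Promise<void>,
    isConnectionError: (e: unknown) => boolean
): Promise<string[]> {
    try {
        return await list();
    } catch (error) {
        if (isConnectionError(error)) {
            try {
                await reconnect();
                return await list();
            } catch (retryError) {
                // The second attempt's failure gets the same wrapping as the first.
                throw new WrappedError(`Failed to get items: ${(retryError as Error).message}`);
            }
        }
        throw new WrappedError(`Failed to get items: ${(error as Error).message}`);
    }
}
```

When the retry succeeds the caller sees normal results; when it fails, the caller still gets a `WrappedError` rather than a raw connection error escaping the catch block.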
69-80: ⚠️ Potential issue | 🔴 Critical — Serialize all hypervisor initialization paths.

Lines 71, 103, and 179 each enter `initializeHypervisor()` independently. If bootstrap/watch recovery overlaps a request while availability is false, both callers can race into `connectOpen()`, and one failure path can then tear down the connection another caller just established. That leaves the reconnect/bootstrap race in place under concurrent traffic.

🔧 Suggested direction

```diff
+    private hypervisorInitialization: Promise<void> | null = null;
+
+    private async initializeHypervisorOnce(): Promise<void> {
+        if (!this.hypervisorInitialization) {
+            this.hypervisorInitialization = (async () => {
+                try {
+                    await this.initializeHypervisor();
+                    this.isVmsAvailable = true;
+                } finally {
+                    this.hypervisorInitialization = null;
+                }
+            })();
+        }
+
+        await this.hypervisorInitialization;
+    }
+
     private async attemptHypervisorInitializationAndWatch(): Promise<void> {
         try {
-            await this.initializeHypervisor();
-            this.isVmsAvailable = true;
+            await this.initializeHypervisorOnce();
             this.logger.debug(`VMs service initialized successfully with URI: ${this.uri}`);
             await this.setupWatcher();
         } catch (error) {
             await this.resetHypervisorConnection();
@@
         .on('add', async () => {
@@
             try {
-                await this.initializeHypervisor();
-                this.isVmsAvailable = true;
+                await this.initializeHypervisorOnce();
                 this.logger.log(
                     'Hypervisor connection established successfully after PID file detection.'
                 );
             } catch (error) {
@@
     private async ensureHypervisorAvailable(): Promise<InstanceType<typeof HypervisorClass>> {
         if (this.isVmsAvailable && this.hypervisor) {
             return this.hypervisor;
         }

         try {
-            await this.initializeHypervisor();
-            this.isVmsAvailable = true;
+            await this.initializeHypervisorOnce();
         } catch (error) {
             await this.resetHypervisorConnection();
             throw new GraphQLError('VMs are not available');
         }
```

Also applies to: 97-114, 173-190
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@api/src/unraid-api/graph/resolvers/vms/vms.service.ts` around lines 69 - 80, Multiple callers (attemptHypervisorInitializationAndWatch, request handlers at lines ~97 and ~173) race into initializeHypervisor/connectOpen causing teardown of a connection another caller created; serialize these paths by adding a single shared async lock or in-flight promise guard around initializeHypervisor/connectOpen (e.g., a private initLock or inFlightInitPromise on the VmsService class) so concurrent callers await the same initialization instead of running it in parallel, and inside the guarded section verify connection state (isVmsAvailable/uri or a connection object) after acquiring the lock before performing resetHypervisorConnection or teardown so a failing attempt does not destroy a connection established by another concurrent initializer; apply the same guard to calls from attemptHypervisorInitializationAndWatch, bootstrap watcher paths, and any request entrypoints that call initializeHypervisor.
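The in-flight-promise guard the suggestion describes can be demonstrated in isolation (hypothetical class, not the PR's code): concurrent callers all await one shared initialization promise, so the `connectOpen()`-style work runs exactly once.

```typescript
// Hypothetical single-flight guard (not the PR's code): concurrent callers
// share one in-flight initialization promise instead of racing the real init.
class SingleFlightInit {
    private inFlight: Promise<void> | null = null;
    initCount = 0;
    ready = false;

    private async doInit(): Promise<void> {
        this.initCount++;        // counts how many real initializations ran
        await Promise.resolve(); // stand-in for the async connectOpen() work
        this.ready = true;
    }

    async ensureInitialized(): Promise<void> {
        if (this.ready) return;
        if (!this.inFlight) {
            this.inFlight = this.doInit().finally(() => {
                this.inFlight = null; // allow a later retry if this attempt failed
            });
        }
        await this.inFlight;
    }
}
```

Clearing the slot in `finally` rather than only on success means a failed attempt does not wedge the service: the next caller simply starts a fresh initialization.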
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@api/src/unraid-api/graph/resolvers/vms/vms.service.ts`:
- Around line 162-170: The resetHypervisorConnection method should clear shared
state before awaiting teardown to avoid races: capture the current hypervisor
instance into a local variable (e.g., const old = this.hypervisor), immediately
set this.hypervisor = null and this.isVmsAvailable = false, then call await
old?.connectClose() inside the try/catch so overlapping calls to
ensureHypervisorAvailable() won't receive a handle that's being torn down;
retain the existing error logging for connectClose failures.
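A minimal sketch of that clear-before-close ordering, with hypothetical types standing in for the service's real ones: shared state is nulled synchronously before the (possibly slow) teardown is awaited, so overlapping callers never observe a handle that is being torn down.

```typescript
// Hypothetical sketch of the clear-before-close ordering (not the real types).
type Handle = { close(): Promise<void> };

class ConnHolder {
    handle: Handle | null = null;
    available = false;

    async reset(): Promise<void> {
        const old = this.handle; // capture the current handle first
        this.handle = null;      // clear shared state synchronously so overlapping
        this.available = false;  // callers never observe a handle mid-teardown
        try {
            await old?.close();  // only then await the teardown itself
        } catch {
            // the real service logs and ignores close failures here
        }
    }
}
```

Because the shared fields are cleared before the first `await`, any caller that checks availability during the teardown sees "not available" and re-initializes, rather than reusing a dying connection.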
---
Outside diff comments:
In `@api/src/unraid-api/graph/resolvers/vms/vms.service.ts`:
- Around line 371-396: In getDomains, the catch currently returns the retry
result directly which bypasses the later logging/GraphQLError path; instead,
when isConnectionError(error) is true call resetHypervisorConnection(),
re-acquire hypervisor via ensureHypervisorAvailable(), then call
listDomains(hypervisor) but do NOT return from inside the catch—capture the
result or let any error bubble so the existing downstream logging and the final
GraphQLError throw (the existing error-handling block) still runs; adjust the
flow in getDomains (referencing ensureHypervisorAvailable, isConnectionError,
resetHypervisorConnection, and listDomains) so second-attempt failures are
handled by the same logging/throw logic rather than bypassing it.
- Around line 69-80: Multiple callers (attemptHypervisorInitializationAndWatch,
request handlers at lines ~97 and ~173) race into
initializeHypervisor/connectOpen causing teardown of a connection another caller
created; serialize these paths by adding a single shared async lock or in-flight
promise guard around initializeHypervisor/connectOpen (e.g., a private initLock
or inFlightInitPromise on the VmsService class) so concurrent callers await the
same initialization instead of running it in parallel, and inside the guarded
section verify connection state (isVmsAvailable/uri or a connection object)
after acquiring the lock before performing resetHypervisorConnection or teardown
so a failing attempt does not destroy a connection established by another
concurrent initializer; apply the same guard to calls from
attemptHypervisorInitializationAndWatch, bootstrap watcher paths, and any
request entrypoints that call initializeHypervisor.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: c26f0dd1-bee3-4123-a1e6-f368ae899457
📒 Files selected for processing (2)

- api/src/unraid-api/graph/resolvers/vms/vms.service.ts
- api/src/unraid-api/graph/resolvers/vms/vms.service.unit.spec.ts
This plugin has been deployed to Cloudflare R2 and is available for testing.
🧹 Nitpick comments (3)
api/src/unraid-api/graph/resolvers/vms/vms.service.ts (2)
426-450: Potential for partial failures during domain mapping.

`Promise.all()` will reject immediately if any single domain's `getInfo()`, `getName()`, or `getUUIDString()` fails, losing results from domains that succeeded. If one domain is in a transient state, this could cause the entire listing to fail.

♻️ Consider using Promise.allSettled for resilience

```diff
-        return Promise.all(
-            domains.map(async (domain) => {
-                const info = await domain.getInfo();
-                const name = await domain.getName();
-                const uuid = await domain.getUUIDString();
-                const state = this.mapDomainStateToVmState(info.state);
-
-                return {
-                    id: uuid,
-                    uuid,
-                    name,
-                    state,
-                };
-            })
-        );
+        const results = await Promise.allSettled(
+            domains.map(async (domain) => {
+                const info = await domain.getInfo();
+                const name = await domain.getName();
+                const uuid = await domain.getUUIDString();
+                const state = this.mapDomainStateToVmState(info.state);
+
+                return {
+                    id: uuid,
+                    uuid,
+                    name,
+                    state,
+                };
+            })
+        );
+
+        const fulfilled = results.filter(
+            (r): r is PromiseFulfilledResult<VmDomain> => r.status === 'fulfilled'
+        );
+        const rejected = results.filter((r) => r.status === 'rejected');
+
+        if (rejected.length > 0) {
+            this.logger.warn(`Failed to get info for ${rejected.length} domain(s)`);
+        }
+
+        return fulfilled.map((r) => r.value);
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@api/src/unraid-api/graph/resolvers/vms/vms.service.ts` around lines 426 - 450, The listDomains method currently uses Promise.all which fails fast if any domain.getInfo(), getName(), or getUUIDString() rejects; change the mapping to use Promise.allSettled on domains.map(...) so you can collect fulfilled results and discard or log rejected ones (include domain identifier where possible), then return only the successfully resolved VmDomain objects; keep mapDomainStateToVmState unchanged and ensure you log failures from getInfo/getName/getUUIDString for debugging while still returning the other domains.
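The fail-fast difference the nitpick describes is easy to demonstrate with hypothetical fetchers standing in for the per-domain `getInfo()`/`getName()` calls: one of three "domains" fails while the others are healthy.

```typescript
// Hypothetical fetchers, not the service's real libvirt calls: the middle
// "domain" rejects while the other two resolve.
type Vm = { name: string };

const fetchers: Array<() => Promise<Vm>> = [
    async () => ({ name: 'vm-a' }),
    async () => {
        throw new Error('domain in transient state');
    },
    async () => ({ name: 'vm-c' }),
];

// Fail-fast: one rejection loses every result.
async function listAllOrNothing(): Promise<Vm[]> {
    return Promise.all(fetchers.map((f) => f()));
}

// Best-effort: keep fulfilled results, discard (or log) the failures.
async function listBestEffort(): Promise<Vm[]> {
    const results = await Promise.allSettled(fetchers.map((f) => f()));
    return results
        .filter((r): r is PromiseFulfilledResult<Vm> => r.status === 'fulfilled')
        .map((r) => r.value);
}
```

`listAllOrNothing()` rejects outright, while `listBestEffort()` resolves with the two healthy VMs in their original order, since `Promise.allSettled` preserves input ordering.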
200-210: Consider logging the original error before wrapping.

When `initializeHypervisorOnce()` fails, the original error is swallowed and replaced with a generic "VMs are not available" message. This could make debugging harder when the root cause isn't obvious.

🔧 Suggested improvement

```diff
     private async ensureHypervisorAvailable(): Promise<InstanceType<typeof HypervisorClass>> {
         if (this.isVmsAvailable && this.hypervisor) {
             return this.hypervisor;
         }

         try {
             return await this.initializeHypervisorOnce();
         } catch (error) {
+            this.logger.debug(`Hypervisor initialization failed: ${(error as Error).message}`);
             throw new GraphQLError('VMs are not available');
         }
     }
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@api/src/unraid-api/graph/resolvers/vms/vms.service.ts` around lines 200 - 210, ensureHypervisorAvailable currently swallows the original error from initializeHypervisorOnce; catch the error, log the original error (use this.logger.error if a logger exists on the service instance, otherwise console.error) with context like "initializeHypervisorOnce failed", and then rethrow a GraphQLError while preserving some original details (e.g., include error.message in the GraphQLError message or add it to GraphQLError.extensions) so callers get the generic "VMs are not available" plus the original error info; update the catch block in ensureHypervisorAvailable to perform the logging and include the original error details when throwing.

api/src/unraid-api/graph/resolvers/vms/vms.service.unit.spec.ts (1)
220-246: Good test: Validates shutdown polling uses captured hypervisor.

This test ensures that `waitForDomainShutdown` continues polling with the original hypervisor instance even when shared state is reset mid-operation. Proper use of fake timers.

One minor note: ensure `vi.useRealTimers()` is called even if the test fails, to avoid affecting other tests.

♻️ Consider using afterEach for timer cleanup

You could add an `afterEach` hook to ensure timers are always restored:

```ts
afterEach(() => {
    vi.useRealTimers();
});
```

This would make the cleanup more robust if other tests start using fake timers.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@api/src/unraid-api/graph/resolvers/vms/vms.service.unit.spec.ts` around lines 220 - 246, The test "finishes shutdown polling with the original hypervisor even if shared state is reset" uses vi.useFakeTimers() but only restores timers at the end of the test, which can leak if the test fails; add a global cleanup to always restore real timers (e.g., an afterEach that calls vi.useRealTimers()) so any test that uses vi.useFakeTimers() (including this spec and helpers around service.onApplicationBootstrap, service.stopVm, waitForDomainShutdown) will not leave fake timers active across tests.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@api/src/unraid-api/graph/resolvers/vms/vms.service.ts`:
- Around line 426-450: The listDomains method currently uses Promise.all which
fails fast if any domain.getInfo(), getName(), or getUUIDString() rejects;
change the mapping to use Promise.allSettled on domains.map(...) so you can
collect fulfilled results and discard or log rejected ones (include domain
identifier where possible), then return only the successfully resolved VmDomain
objects; keep mapDomainStateToVmState unchanged and ensure you log failures from
getInfo/getName/getUUIDString for debugging while still returning the other
domains.
- Around line 200-210: ensureHypervisorAvailable currently swallows the original
error from initializeHypervisorOnce; catch the error, log the original error
(use this.logger.error if a logger exists on the service instance, otherwise
console.error) with context like "initializeHypervisorOnce failed", and then
rethrow a GraphQLError while preserving some original details (e.g., include
error.message in the GraphQLError message or add it to GraphQLError.extensions)
so callers get the generic "VMs are not available" plus the original error info;
update the catch block in ensureHypervisorAvailable to perform the logging and
include the original error details when throwing.
In `@api/src/unraid-api/graph/resolvers/vms/vms.service.unit.spec.ts`:
- Around line 220-246: The test "finishes shutdown polling with the original
hypervisor even if shared state is reset" uses vi.useFakeTimers() but only
restores timers at the end of the test, which can leak if the test fails; add a
global cleanup to always restore real timers (e.g., an afterEach that calls
vi.useRealTimers()) so any test that uses vi.useFakeTimers() (including this
spec and helpers around service.onApplicationBootstrap, service.stopVm,
waitForDomainShutdown) will not leave fake timers active across tests.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: f1f73557-700f-4cf6-ba69-58288d89b806
📒 Files selected for processing (2)
- api/src/unraid-api/graph/resolvers/vms/vms.service.ts
- api/src/unraid-api/graph/resolvers/vms/vms.service.unit.spec.ts
🤖 I have created a release *beep* *boop*

---

## [4.31.0](v4.30.1...v4.31.0) (2026-03-23)

### Features

* **api:** support encrypted array start inputs ([#1944](#1944)) ([018a8d5](018a8d5))
* **onboarding:** add shared loading states ([#1945](#1945)) ([776c8cc](776c8cc))
* Serverside state for onboarding display ([#1936](#1936)) ([682d51c](682d51c))

### Bug Fixes

* **api:** reconcile emhttp state without spinning disks ([#1946](#1946)) ([d3e0b95](d3e0b95))
* **onboarding:** auto-open incomplete onboarding on 7.3+ ([#1940](#1940)) ([f0241a8](f0241a8))
* **onboarding:** replace internal boot native selects ([#1942](#1942)) ([d6ea032](d6ea032))
* preserve onboarding resume state on reload ([#1941](#1941)) ([91f7fe9](91f7fe9))
* recover VM availability after reconnect ([#1947](#1947)) ([e064de7](e064de7))
* Unify callback server payloads ([#1938](#1938)) ([f58fcc0](f58fcc0))

---

This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Summary

- Retries `getDomains()` once when the libvirt connection drops mid-request

Root cause

The VM service cached availability in `isVmsAvailable` and only updated that state from bootstrap and PID-file watcher events. During reconnect/bootstrap races, that flag could stay false even after libvirt was healthy again, so the dashboard surfaced `Failed to retrieve VM domains: VMs are not available` instead of self-healing.

Testing

- `pnpm exec vitest run src/unraid-api/graph/resolvers/vms/vms.service.unit.spec.ts src/unraid-api/graph/resolvers/vms/vms.resolver.spec.ts`
- `pnpm exec tsc --noEmit -p tsconfig.json`

Summary by CodeRabbit
Bug Fixes
Refactor
Tests