
(fix): release workflow binary memory after use during init#21525

Open
justinkaseman wants to merge 1 commit into develop from fix/engine-v2-init-memory

Conversation

@justinkaseman
Contributor

@justinkaseman justinkaseman commented Mar 14, 2026

Problem

During startup, all workflows are loaded concurrently (up to maxConcurrency=12 in flight simultaneously). Each goroutine calls tryEngineCreate, which:

  1. Decodes spec.Workflow (hex-encoded compressed WASM → 2× the compressed binary size as a Go string)
  2. Calls hex.DecodeString, producing decodedBinary (the compressed binary, 1× size)
  3. Calls engineFactory where host.NewModule compiles the binary into the wasmtime engine
  4. Blocks on initDone while trigger subscriptions complete

spec.Workflow and decodedBinary both stay pinned in tryEngineCreate's stack frame for the entire initDone wait on trigger subscriptions (seconds per workflow).

12 goroutines × 3× compressed WASM each = 36× compressed WASM pinned continuously on the Go heap

This can cause problems when running close to the maximum number of workflows.
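The pattern above can be sketched as follows. This is a minimal, hypothetical reconstruction of the startup path, not the actual code: the names tryEngineCreate, engineFactory, initDone, and spec.Workflow are taken from the PR description, but the bodies are illustrative stand-ins.

```go
package main

import (
	"encoding/hex"
	"fmt"
)

// spec mirrors the shape described in the PR: Workflow holds the
// hex-encoded compressed WASM, roughly 2x the compressed binary size.
type spec struct {
	Workflow string
}

// tryEngineCreate sketches the problematic lifetime: both the hex string
// (via s) and decodedBinary remain reachable from this stack frame while
// the goroutine blocks on initDone for seconds per workflow.
func tryEngineCreate(s spec, initDone <-chan struct{}) error {
	decodedBinary, err := hex.DecodeString(s.Workflow) // 1x compressed size
	if err != nil {
		return err
	}
	_ = engineFactory(decodedBinary) // compiles the binary into the engine

	// Problem: s.Workflow (2x) and decodedBinary (1x) are still pinned
	// here, for every one of the up-to-12 concurrent goroutines.
	<-initDone
	return nil
}

// engineFactory is a stand-in for the real wasmtime compilation step.
func engineFactory(wasm []byte) int { return len(wasm) }

func main() {
	done := make(chan struct{})
	close(done) // pretend trigger subscriptions already completed
	s := spec{Workflow: hex.EncodeToString([]byte("wasm"))}
	fmt.Println(tryEngineCreate(s, done) == nil)
}
```

With 12 goroutines each holding a 2× string plus a 1× slice, the 36× figure above follows directly.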

Fix

This change releases both references as soon as they are no longer needed, milliseconds into the 15-second window.

After each batch the GC is also prompted to clean up.
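The fix can be sketched as below. Again this is a hedged reconstruction, not the PR's diff: the function names mirror the description, and the key idea is simply to drop the last references to the large buffers before blocking, then nudge the GC between batches.

```go
package main

import (
	"encoding/hex"
	"fmt"
	"runtime"
)

// tryEngineCreate now clears each large reference as soon as it has been
// consumed, so neither buffer stays pinned through the initDone wait.
func tryEngineCreate(workflowHex string, initDone <-chan struct{}) error {
	decodedBinary, err := hex.DecodeString(workflowHex)
	if err != nil {
		return err
	}
	workflowHex = "" // drop the 2x hex string once decoded

	engineFactory(decodedBinary) // the engine compiles its own copy
	decodedBinary = nil          // drop the 1x compressed binary too

	<-initDone // the wait no longer keeps either buffer reachable
	return nil
}

// engineFactory is a stand-in for the wasmtime compilation step.
func engineFactory(wasm []byte) {}

// afterBatch models the second part of the change: prompting the GC to
// reclaim the freed buffers after each batch of workflows.
func afterBatch() {
	runtime.GC()
}

func main() {
	done := make(chan struct{})
	close(done)
	err := tryEngineCreate(hex.EncodeToString([]byte("wasm")), done)
	afterBatch()
	fmt.Println(err == nil)
}
```

Note that in the real code the cleared reference would be the spec.Workflow field itself; clearing only a local copy, as in this sketch's parameter, helps only if it is the last remaining reference.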

@justinkaseman justinkaseman requested a review from a team as a code owner March 14, 2026 04:28
@github-actions
Contributor

👋 justinkaseman, thanks for creating this pull request!

To help reviewers, please consider creating future PRs as drafts first. This allows you to self-review and make any final changes before notifying the team.

Once you're ready, you can mark it as "Ready for review" to request feedback. Thanks!

@github-actions
Contributor

github-actions bot commented Mar 14, 2026

✅ No conflicts with other open PRs targeting develop

@github-actions
Contributor

I see you updated files related to core. Please run make gocs in the root directory to add a changeset, and include at least one of the following tags in its text:

  • #added For any new functionality added.
  • #breaking_change For any functionality that requires manual action for the node to boot.
  • #bugfix For bug fixes.
  • #changed For any change to the existing functionality.
  • #db_update For any feature that introduces updates to database schema.
  • #deprecation_notice For any upcoming deprecation functionality.
  • #internal For changesets that need to be excluded from the final changelog.
  • #nops For any feature that is NOP facing and needs to be in the official Release Notes for the release.
  • #removed For any functionality/config that is removed.
  • #updated For any functionality that is updated.
  • #wip For any change that is not ready yet and external communication about it should be held off till it is feature complete.

Contributor

Copilot AI left a comment


Pull request overview

Risk Rating: MEDIUM (touches workflow engine creation path; small diff but affects startup behavior and memory/liveness characteristics)

Reduces memory retained during concurrent workflow engine initialization by clearing large binary references earlier in tryEngineCreate, so they don’t remain reachable while waiting for initDone.

Changes:

  • Clear spec.Workflow (hex-encoded binary) immediately after decoding to []byte.
  • Clear decodedBinary after engineFactory returns to shorten the lifetime of the local reference.

@justinkaseman justinkaseman requested a review from a team March 14, 2026 05:07
@cl-sonarqube-production

Quality Gate failed

Failed conditions
4.3% Technical Debt Ratio on New Code (required ≤ 4%)

See analysis details on SonarQube

@trunk-io

trunk-io bot commented Mar 14, 2026


@mchain0
Contributor

mchain0 commented Mar 16, 2026

The spikes were observed even before concurrency was introduced. Still, the diagnosis and the solution seem like a proper direction. It would be good to present a load-test chart confirming a smoother memory profile on reboot.

