Exit JVM and dump heap on OutOfMemoryError for server containers #1683
Merged
Conversation
Add -XX:+ExitOnOutOfMemoryError, -XX:+HeapDumpOnOutOfMemoryError, and -XX:HeapDumpPath=/dump to the long-running server JVMs (data, api, submit, sched, db, rest). On OOM the JVM now writes a heap dump to /dump and exits cleanly so K8s recreates the container, instead of limping along in an undefined state.

Motivated by the 2026-05-04 / 2026-05-06 prod data-pod wedges. Both were precipitated by Java heap space errors that corrupted the JVM-wide static InactivityMonitor Timer (a TimerTask hit OOM mid-run and the TimerThread silently terminated). With these flags, the JVM aborts on the first OOM before the InactivityMonitor TimerThread can be touched, eliminating the OOM-driven path to the failover-wedge condition entirely. The companion JmsFailoverWatchdog (in PR #1681) remains as defense-in-depth for non-OOM wedge causes.

Targeted scope:
- 5 swarm-style server Dockerfiles (data, api, submit, sched, db) — add the three flags right after -XX:MaxRAMPercentage=80.
- 2 Quarkus rest Dockerfiles (Dockerfile.jvm, Dockerfile.legacy-jar) — append to JAVA_OPTS.

Skipped intentionally:
- Dockerfile-batch-dev (per-job SLURM lifecycle, no /dump volume).
- Dockerfile-clientgen-dev / docker/build/admin (build/utility tools).
- Dockerfile.native and Dockerfile.native-micro (Quarkus native image; -XX: flags don't apply to GraalVM native builds).

Companion change required in vcell-fluxcd: add a /dump emptyDir volume mount to each affected Deployment so the JVM has somewhere to write the dump. Without that mount the dump silently fails but the JVM still exits — the log signal still works, just no postmortem artifact.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
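For illustration, a minimal sketch of what the change looks like in one of the swarm-style Dockerfiles. The ENTRYPOINT shape, jar path, and image layout are assumptions, not the actual VCell Dockerfile; only the three flags and their position right after -XX:MaxRAMPercentage=80 come from this PR:

```dockerfile
# Hypothetical excerpt -- the real Dockerfile-data-dev ENTRYPOINT differs;
# only the three new -XX flags and their placement are from this PR.
ENTRYPOINT ["java", \
    "-XX:MaxRAMPercentage=80", \
    "-XX:+ExitOnOutOfMemoryError", \
    "-XX:+HeapDumpOnOutOfMemoryError", \
    "-XX:HeapDumpPath=/dump", \
    "-jar", "/app/vcell-data.jar"]
```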
jcschaff added a commit that referenced this pull request on May 7, 2026
Bound the FailoverTransport reconnect budget with maxReconnectAttempts=20 and exponential backoff (1s → 30s) in VCMessagingServiceActiveMQ; keep startupMaxReconnectAttempts=-1 so pod boot tolerates a slow broker. Without this the FailoverTransport reconnects forever, which is the wrong behavior in K8s, where a pod restart is the right response to a sustained broker outage.

Add JmsFailoverWatchdog: a TransportListener attached to each ActiveMQConnection that runs a caller-supplied Runnable when the failover layer reports a terminal IOException. The terminal action is constructor-injected so production wiring stays visible at the composition root and tests can substitute their own handler. Two factory methods: logOnly() (the default — log lifecycle events but take no further action) and jvmExitOnTerminal() (escape hatch for any future service that wants a K8s pod recycle on terminal transport failure).

VCMessagingServiceJms holds a watchdog field with a setter; the default is JmsFailoverWatchdog.logOnly(). No service opts into jvmExitOnTerminal in this change — the setter is the escape hatch for any future need. Wired into MessageProducerSessionJms and ConsumerContextJms — the two long-lived JMS connection sites. Short-lived batch processes (OptimizationBatchServer, JavaSimulationExecutable) intentionally skipped.

Why this is defense-in-depth, not the primary fix: the OOM-driven wedge mechanism that originally motivated this work — a TimerTask hitting OutOfMemoryError on the JVM-wide static InactivityMonitor heartbeat Timer, killing the TimerThread silently and corrupting the failover transport for the rest of the JVM lifetime — is closed off by the -XX:+ExitOnOutOfMemoryError flag added in PR #1683 (the JVM aborts on the first OOM before the InactivityMonitor TimerThread can be touched). The watchdog covers non-OOM terminal-failover paths: sustained network partition, broker maintenance > 8 min, or any future client regression in the static-Timer / static-counter design (still present in 5.18.x).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
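A minimal sketch of how the two pieces described above might fit together. The failover URI parameters (maxReconnectAttempts, startupMaxReconnectAttempts, initialReconnectDelay, maxReconnectDelay, useExponentialBackOff) are standard ActiveMQ client options, and the class and factory names come from the commit message; the broker host, logging, and method bodies are illustrative assumptions, not the actual VCell implementation:

```java
import java.io.IOException;

import org.apache.activemq.ActiveMQConnection;
import org.apache.activemq.transport.TransportListener;

public final class JmsFailoverWatchdog implements TransportListener {

    // Bounded reconnect budget: 20 attempts with 1s -> 30s exponential backoff,
    // but unlimited attempts during startup so pod boot tolerates a slow broker.
    // Broker host is a placeholder.
    static final String BROKER_URL =
            "failover:(tcp://activemq:61616)"
            + "?maxReconnectAttempts=20"
            + "&startupMaxReconnectAttempts=-1"
            + "&initialReconnectDelay=1000"
            + "&maxReconnectDelay=30000"
            + "&useExponentialBackOff=true";

    private final Runnable onTerminalFailure; // constructor-injected terminal action

    private JmsFailoverWatchdog(Runnable onTerminalFailure) {
        this.onTerminalFailure = onTerminalFailure;
    }

    /** Default: log lifecycle events but take no further action. */
    public static JmsFailoverWatchdog logOnly() {
        return new JmsFailoverWatchdog(() -> { /* logging only */ });
    }

    /** Escape hatch: exit the JVM so K8s recycles the pod. */
    public static JmsFailoverWatchdog jvmExitOnTerminal() {
        return new JmsFailoverWatchdog(() -> System.exit(1));
    }

    /** Attach to a long-lived connection (producer session / consumer context). */
    public void attachTo(ActiveMQConnection connection) {
        connection.addTransportListener(this);
    }

    // The failover layer reports a terminal IOException here once the
    // reconnect budget above is exhausted.
    @Override
    public void onException(IOException error) {
        System.err.println("JMS failover terminally failed: " + error);
        onTerminalFailure.run();
    }

    @Override
    public void onCommand(Object command) {
        // not used by the watchdog
    }

    @Override
    public void transportInterupted() { // (sic -- ActiveMQ API spelling)
        System.err.println("JMS transport interrupted; failover layer reconnecting");
    }

    @Override
    public void transportResumed() {
        System.err.println("JMS transport resumed");
    }
}
```

Keeping the terminal action a plain Runnable is what makes both wirings cheap: production passes System.exit behind a factory method, tests pass a latch or counter.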
Summary
Add `-XX:+ExitOnOutOfMemoryError`, `-XX:+HeapDumpOnOutOfMemoryError`, and `-XX:HeapDumpPath=/dump` to the long-running server JVMs (data, api, submit, sched, db, rest). On `OutOfMemoryError` the JVM now writes a heap dump to `/dump` and exits cleanly so K8s recreates the container, instead of limping along in an undefined state.
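For the swarm-style images the flags sit on the `java` command line; for the two Quarkus rest images they ride in via `JAVA_OPTS` (see Scope below). A hedged sketch of the `JAVA_OPTS` variant, assuming a pre-existing `JAVA_OPTS` declaration earlier in `Dockerfile.jvm`:

```dockerfile
# Hypothetical Dockerfile.jvm excerpt -- any pre-existing JAVA_OPTS content is
# assumed to be declared above; only the three appended -XX flags are from this PR.
ENV JAVA_OPTS="${JAVA_OPTS} -XX:+ExitOnOutOfMemoryError -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/dump"
```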
Why

Confirmed via a 213-day Loki retention scan that the prod data-pod wedges on 2026-05-04 and 2026-05-06 were both precipitated by `Java heap space` errors corrupting the JVM-wide static `InactivityMonitor.READ_CHECK_TIMER` mid-`TimerTask`.

With `ExitOnOutOfMemoryError`, the JVM aborts on the first OOM before the InactivityMonitor's TimerThread can be touched, eliminating the OOM-driven path to the wedge entirely. The companion `JmsFailoverWatchdog` in #1681 remains as defense-in-depth for non-OOM wedge causes. `HeapDumpOnOutOfMemoryError` writes a `.hprof` file to `/dump` immediately before the JVM aborts — keeps a postmortem artifact for analysis (Eclipse MAT, VisualVM, JXRay, etc.) so we can find what's actually consuming the heap.
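The wedge mechanism itself is plain `java.util.Timer` behavior and reproduces without ActiveMQ. A self-contained sketch (illustrative, not VCell or ActiveMQ code) of how one uncaught `OutOfMemoryError` in a `TimerTask` permanently kills a shared static Timer:

```java
import java.util.Timer;
import java.util.TimerTask;

public class TimerWedgeDemo {
    // Stand-in for ActiveMQ's JVM-wide static READ_CHECK_TIMER.
    static final Timer SHARED_TIMER = new Timer("shared-heartbeat", true);

    public static void main(String[] args) throws InterruptedException {
        SHARED_TIMER.schedule(new TimerTask() {
            @Override public void run() {
                // Simulate a TimerTask hitting OOM mid-run. The Error escapes
                // TimerThread.mainLoop() and the thread silently terminates.
                throw new OutOfMemoryError("Java heap space (simulated)");
            }
        }, 100);

        Thread.sleep(500);

        // Any later use of the shared Timer now fails: the TimerThread is gone,
        // so schedule() throws IllegalStateException ("Timer already cancelled").
        try {
            SHARED_TIMER.schedule(new TimerTask() {
                @Override public void run() { System.out.println("heartbeat"); }
            }, 100);
        } catch (IllegalStateException wedged) {
            System.out.println("Shared timer is wedged: " + wedged.getMessage());
        }
    }
}
```

Once the TimerThread dies, every other heartbeat scheduled on that static Timer is silently gone for the rest of the JVM's lifetime, which is exactly why aborting on the first OOM is safer than limping on.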
Companion PR (already merged)

virtualcell/vcell-fluxcd adds a `/dump` `emptyDir` volume mount (sizeLimit 4Gi, default node-disk backing) to all six affected Deployments and bumps the data container's `resources.limits.memory` from `3000Mi` to `8000Mi`. That PR is merged; without these JVM flags it's simply a latent mount and extra heap headroom — no behavior change. This PR activates the actual exit-on-OOM behavior on the next image roll.
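A hedged sketch of what that companion mount might look like in one Deployment. The container and volume names and the surrounding structure are assumptions; only the `/dump` `emptyDir` with sizeLimit 4Gi and the 3000Mi → 8000Mi data memory bump are described above:

```yaml
# Hypothetical excerpt of one affected Deployment in vcell-fluxcd.
spec:
  template:
    spec:
      containers:
        - name: data
          resources:
            limits:
              memory: 8000Mi   # was 3000Mi
          volumeMounts:
            - name: dump
              mountPath: /dump # matches -XX:HeapDumpPath=/dump
      volumes:
        - name: dump
          emptyDir:
            sizeLimit: 4Gi     # node-disk backed; capped so a dump can't fill the node
```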
Scope

Five swarm-style server Dockerfiles (flags added right after `-XX:MaxRAMPercentage=80`):

- `docker/build/Dockerfile-data-dev`
- `docker/build/Dockerfile-api-dev`
- `docker/build/Dockerfile-submit-dev`
- `docker/build/Dockerfile-sched-dev`
- `docker/build/Dockerfile-db-dev`

Two Quarkus rest Dockerfiles (flags appended to `JAVA_OPTS`):

- `vcell-rest/src/main/docker/Dockerfile.jvm`
- `vcell-rest/src/main/docker/Dockerfile.legacy-jar`
Skipped intentionally:

- `Dockerfile-batch-dev` (per-job SLURM lifecycle, no `/dump` volume)
- `Dockerfile-clientgen-dev` and `docker/build/admin` (build/utility tools)
- `Dockerfile.native` and `Dockerfile.native-micro` (Quarkus native image — `-XX:` flags don't apply to GraalVM builds)
Test plan

- `git diff --stat` confirms 7 files changed, 20 insertions / 2 deletions
- `grep` verification that all 3 flags landed in each of the 7 Dockerfiles
- `kubectl -n dev exec deployment/data -- ls -la /dump` shows the directory empty and writable
- On a real OOM, expect `Aborting due to java.lang.OutOfMemoryError` in the container log, a heap dump at `/dump/java_pid1.hprof` (use `kubectl cp` to retrieve), and the exit/restart visible in `kubectl describe pod` and in Loki's `kubectl` container stream

🤖 Generated with Claude Code