chore: avoid IT job hangs caused by orphan Spring Boot JVM #24280

Merged: mcollovati merged 1 commit into main from fix-ci-hangs on May 7, 2026

ZheSun88 (Contributor) commented May 6, 2026

Summary

Validation runs occasionally lose ~30 minutes on it-tests (11, ...) because
the step runs until the job timeout instead of finishing, so a full re-run
attempt is needed to get a green result. Example:
https://github.com/vaadin/flow/actions/runs/25435931074/job/74614298073
(attempt 1 cancelled at 30:00, attempt 2 succeeded.)

Current behavior (root cause)

The flow-test-redeployment module starts its app via
spring-boot-maven-plugin:start (pre-IT) and stops it via :stop (post-IT).
Spring DevTools is on the classpath and the test churns class files (touching
Application.class to trigger reloads), which in this run produced 8
DevTools-driven restarts in 20 seconds, the last firing 442 ms before
spring-boot:stop ran:

12:47:56.241 [File Watcher] Restarting due to 1 class path change ...
12:47:56.683 [INFO] --- spring-boot-maven-plugin:4.1.0-RC1:stop ...
[ERROR] Spring application lifecycle JMX bean not found.
Could not stop application gracefully:
org.springframework.boot:type=Admin,name=SpringApplication

stop queries the admin MBean exactly once and there is no retry; during a
restart the bean is briefly unregistered, so the call fails. The forked
Spring Boot JVM (PID 3969 in this run) is therefore never killed. Maven
moves on under -fae, finishes 16 more modules, and exits with BUILD
FAILURE — but the orphan JVM still holds the write end of the pipe inherited
from mvn ... | tee -a mvn-it-tests-11.out (.github/workflows/validation.yml).
tee cannot reach EOF, the bash pipeline cannot complete, and the job
hangs until GitHub force-cancels at the 30-min job timeout. The runner
cleanup confirms the orphan: Terminate orphan process: pid (3969) (java).
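
The pipe behavior described above can be reproduced without Maven or Spring Boot. The following is a minimal sketch, not the workflow's actual command: `leaky_build` is a hypothetical stand-in for `mvn`, and its background `sleep` plays the role of the orphan JVM that inherited stdout.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the Maven build: it returns right away but
# leaves a background child (the "orphan") that inherited its stdout.
leaky_build() {
  sleep 2 &
  echo "BUILD FAILURE"
}

# Piped: tee only sees EOF once every process holding the write end of the
# pipe has exited, so the pipeline lasts as long as the orphan does.
start=$(date +%s)
leaky_build | tee /dev/null
piped_seconds=$(( $(date +%s) - start ))

echo "pipeline took ${piped_seconds}s although the build itself returned at once"
```

With a forked JVM that never exits in place of the `sleep`, the same pipeline would stay open until something outside it, here the job timeout, tears it down.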

Fix

Three small, independent changes:

  1. flow-tests/test-redeployment/pom.xml — insert a maven-antrun-plugin
    execution bound to post-integration-test (declared before
    spring-boot-maven-plugin so it runs first in that phase) that sleeps 5 seconds
    before spring-boot:stop runs. This module's tests deliberately trigger DevTools
    restarts; the original failure was a 442 ms race between the last restart and
    stop-server querying the admin JMX bean. The 5-second quiet period lets
    DevTools finish re-registering the bean so stop-server can shut the app down
    cleanly instead of leaving an orphan JVM behind.

  2. .github/workflows/validation.yml — add timeout-minutes: 22 on the
    Run ITs step (the job-level 30-min cap stays as the outer fence). If a
    future plugin produces a similar hang, the step now fails after
    22 min instead of 30, leaving room within the original job budget for
    report packaging, artifact upload, and the existing run_attempt > 1
    retry path.

  3. .github/workflows/validation.yml — replace mvn ... | tee ...out with file
    redirection (>"$LOG" 2>&1 + tail -f --pid) so an orphan child JVM holding
    an inherited stdout fd can no longer pin the step open. Add a follow-up Kill
    leftover Java processes step (if: always()) that SIGTERM/SIGKILLs any remaining
    java processes before report packaging — defensive cleanup against any
    future plugin that leaks a JVM.
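
The redirection approach in item 3 could look roughly like the sketch below. It is an illustration under stated assumptions rather than the code merged into validation.yml: `run_logged` is a hypothetical helper name, `-n +1` is added here so that no early log lines are skipped, and `tail --pid` requires GNU coreutils.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command with stdout/stderr sent to a log file
# and stream that file to the console while the command is alive.
# Usage: run_logged LOGFILE COMMAND [ARGS...]
run_logged() {
  local log="$1"; shift
  : > "$log"                       # create/truncate the log before tail opens it

  "$@" >"$log" 2>&1 &              # children inherit the file, not a pipe
  local cmd_pid=$!

  # GNU tail stops following once cmd_pid has exited, even if a leaked
  # grandchild still holds the log file open.
  tail -n +1 -f --pid="$cmd_pid" "$log" &
  local tail_pid=$!

  local status=0
  wait "$cmd_pid" || status=$?     # keep the command's own exit status
  wait "$tail_pid" || true         # let tail flush the last lines
  return "$status"
}
```

A call such as `run_logged "$LOG" mvn ...` then returns when `mvn` itself exits, because nothing in the step waits for EOF on a pipe that a leaked child could keep open.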

github-actions Bot commented May 6, 2026

Test Results

 1 404 files  ±0   1 404 suites  ±0   1h 16m 42s ⏱️ -46s
10 127 tests ±0  10 057 ✅ ±0  70 💤 ±0  0 ❌ ±0 
10 602 runs  ±0  10 523 ✅ ±0  79 💤 ±0  0 ❌ ±0 

Results for commit cd2a39e. ± Comparison against base commit 8102527.

♻️ This comment has been updated with latest results.

sonarqubecloud Bot commented May 6, 2026

mcollovati (Collaborator) left a comment

Looks a bit tricky, but hopefully it fixes the issue 😄

mcollovati added this pull request to the merge queue May 7, 2026
Merged via the queue into main with commit 667f5b3 May 7, 2026
31 checks passed
mcollovati deleted the fix-ci-hangs branch May 7, 2026 06:48
ZheSun88 added a commit that referenced this pull request May 8, 2026