HBASE-27781 Fix case of action counter assertion error in handling of batch operation timeout exceeded #7079

Merged
merged 3 commits into from
Jul 9, 2025

Conversation

Contributor

@droudnitsky droudnitsky commented Jun 8, 2025

https://issues.apache.org/jira/browse/HBASE-27781

+Background+

In AsyncRequestFutureImpl we fail fast when the operation timeout is exceeded during location resolution. In that handling, we loop over all actions still being processed in groupAndSendMulti at the time the operation timeout is exceeded and set them as failed. The problem is that some of these actions may already have been failed to completion by the time we reach this point. If we fail to resolve the region location for an action, we fail it to completion in findAllLocationsOrFail (fail to completion == set the error for the action, decrement the actions in progress counter, and do not retry the action again). We should not "double fail" any action that was already failed due to failed location resolution, because doing so decrements the actions in progress counter twice for the same action and throws off the (atomic) action counter accounting that the sync client relies on to tell when the batch operation is complete.

+Problem+

In the for loop of that timeout handling we fail all actions in groupAndSendMulti (and decrement the actions in progress counter for each of them), including the aforementioned actions that were already failed through findAllLocationsOrFail. If there was a location failure, this decrements the actions in progress counter more times than there are actions. The counter goes negative, which should never happen for a count of actions in progress, and this triggers an assertion error. The HBase client then throws an unchecked AssertionError that is not handled within the client and bubbles up to the user application layer invoking the client. This may kill the caller thread/application that invoked the operation. The operation should instead have timed out with a RetriesExhaustedWithDetails exception, since the user application layer should not be catching Error and its subclasses such as AssertionError.
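
To illustrate why an escaping AssertionError is more disruptive than the expected exception, here is a minimal sketch. The class and method names (AssertionErrorEscapesDemo, batchThatTripsAssertion, callWithExceptionHandler) are hypothetical and are not part of the HBase client; the only point is that a `catch (Exception e)` block in application code does not intercept an Error.

```java
final class AssertionErrorEscapesDemo {
    // Hypothetical stand-in for a client call that trips the counter check.
    static void batchThatTripsAssertion() {
        throw new AssertionError("actions in progress went negative: -1");
    }

    // Typical application-side handling: catches Exception, not Error.
    static String callWithExceptionHandler() {
        try {
            batchThatTripsAssertion();
            return "completed";
        } catch (Exception e) {
            // Never reached here: AssertionError extends Error, not Exception.
            return "handled: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        try {
            callWithExceptionHandler();
        } catch (AssertionError e) {
            // Only a handler for Error (or Throwable) sees it.
            System.out.println("escaped catch (Exception): " + e.getMessage());
        }
    }
}
```

An application that only handles Exception around its batch calls would let this propagate up the calling thread.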

+Triggering scenario/reproduction+

The most common scenario where one could hit this bug is meta slowness while running batch operations. Suppose we have a batch with 3 actions, and while trying to resolve the location for the first action we repeatedly time out against the meta table due to meta slowness, consuming the entire operation timeout on the meta scan attempts for that first action. In this case, we fail the first action through findAllLocationsOrFail, which brings the actionsInProgress counter to 2. We then loop over all three actions and fail each of them: the first two decrements bring the counter to 1 and then 0, so on the third failure attempt we try to decrement it to -1 and hit the assertion error. This is what the test case in the PR reproduces.
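
The counter arithmetic in this scenario can be replayed with a small standalone sketch. Everything here is an assumption made for illustration: DoubleFailSimulation, failAction and runScenario are hypothetical names, the counter is modeled as a plain AtomicLong, and the explicit `throw new AssertionError` stands in for whatever check the client actually performs.

```java
import java.util.concurrent.atomic.AtomicLong;

final class DoubleFailSimulation {
    private final AtomicLong actionsInProgress;

    DoubleFailSimulation(int actionCount) {
        this.actionsInProgress = new AtomicLong(actionCount);
    }

    // "Fail to completion": decrement the counter and reject a negative value.
    void failAction(int actionIndex) {
        long remaining = actionsInProgress.decrementAndGet();
        if (remaining < 0) {
            throw new AssertionError("actions in progress went negative: "
                + remaining + " (action index " + actionIndex + ")");
        }
    }

    long remaining() {
        return actionsInProgress.get();
    }

    // Replays the 3-action scenario; throws on the final decrement.
    static long runScenario() {
        DoubleFailSimulation batch = new DoubleFailSimulation(3);
        batch.failAction(0);            // location resolution failure: 3 -> 2
        for (int i = 0; i < 3; i++) {   // timeout handling fails every action
            batch.failAction(i);        // 2 -> 1 -> 0 -> -1
        }
        return batch.remaining();
    }

    public static void main(String[] args) {
        try {
            runScenario();
        } catch (AssertionError e) {
            System.out.println(e.getMessage());
        }
    }
}
```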

+Solution+
We still want to fail all remaining/incomplete actions being processed in groupAndSendMulti at the time of the operation timeout being exceeded, because there is no time remaining to execute them, but we need special handling to avoid failing actions which were already failed due to failed location resolution. 
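
One way to picture this special handling is sketched below. This is not the patch itself: the PR's helper is named failIncompleteActionsWithOpTimeout, but its implementation is not reproduced here, and the set-based bookkeeping, class name and method names in this sketch are assumptions chosen only to show the idea of skipping actions that were already failed. The sketch is single-threaded and ignores the concurrency concerns of the real client.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.atomic.AtomicLong;

final class SkipAlreadyFailedSketch {
    private final AtomicLong actionsInProgress;
    // Hypothetical bookkeeping: indexes of actions already failed to completion.
    private final Set<Integer> alreadyFailed = new HashSet<>();

    SkipAlreadyFailedSketch(int actionCount) {
        this.actionsInProgress = new AtomicLong(actionCount);
    }

    // Fail one action to completion and remember that it is done.
    void failAction(int actionIndex) {
        alreadyFailed.add(actionIndex);
        long remaining = actionsInProgress.decrementAndGet();
        if (remaining < 0) {
            throw new AssertionError("actions in progress went negative: " + remaining);
        }
    }

    // Timeout handling: fail only the actions that are still incomplete.
    void failIncompleteActions(List<Integer> actionIndexes) {
        for (int index : actionIndexes) {
            if (alreadyFailed.contains(index)) {
                continue; // already failed during location resolution
            }
            failAction(index);
        }
    }

    // Same 3-action scenario as in the description, with the skip applied.
    static long runScenario() {
        SkipAlreadyFailedSketch batch = new SkipAlreadyFailedSketch(3);
        batch.failAction(0);                                  // 3 -> 2
        batch.failIncompleteActions(Arrays.asList(0, 1, 2));  // skips 0: 2 -> 1 -> 0
        return batch.actionsInProgress.get();
    }

    public static void main(String[] args) {
        System.out.println("actions in progress at end: " + runScenario());
    }
}
```

With the skip in place, the counter is decremented exactly once per action and ends at zero instead of going negative.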


@droudnitsky
Contributor Author

hbase-server test failures do not look related

@Apache9 Apache9 requested a review from Copilot June 21, 2025 14:03

@Copilot Copilot AI left a comment


Pull Request Overview

This PR fixes a bug where actions that already failed due to location resolution are being double-failed during a batch operation timeout, causing the actions in progress counter to go negative.

  • Added a new unit test to validate that actions with location failures aren’t double-failed.
  • Refactored the timeout handling in AsyncRequestFutureImpl.java by introducing a helper method that excludes already failed actions from being failed again.

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| hbase-server/src/test/java/org/apache/hadoop/hbase/client/TestClientOperationTimeout.java | Adds a new test case to validate correct handling of operation timeout with mixed action failures. |
| hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java | Introduces the failIncompleteActionsWithOpTimeout method and updates logic to avoid double failing actions. |

Contributor

@charlesconnell charlesconnell left a comment


Hi Daniel, thank you for your contribution. Your explanation of the problem and your solution make sense to me. Please fix the type that copilot pointed out, and I can shepherd this fix along.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

@Apache-HBase

🎊 +1 overall

| Vote | Subsystem | Runtime | Logfile | Comment |
| --- | --- | --- | --- | --- |
| +0 🆗 | reexec | 0m 45s | | Docker mode activated. |
| | | | | _ Prechecks _ |
| +1 💚 | dupname | 0m 0s | | No case conflicting files found. |
| +0 🆗 | codespell | 0m 0s | | codespell was not available. |
| +0 🆗 | detsecrets | 0m 0s | | detect-secrets was not available. |
| +1 💚 | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 💚 | hbaseanti | 0m 0s | | Patch does not have any anti-patterns. |
| | | | | _ branch-2 Compile Tests _ |
| +0 🆗 | mvndep | 0m 35s | | Maven dependency ordering for branch |
| +1 💚 | mvninstall | 3m 30s | | branch-2 passed |
| +1 💚 | compile | 3m 50s | | branch-2 passed |
| +1 💚 | checkstyle | 0m 56s | | branch-2 passed |
| +1 💚 | spotbugs | 2m 23s | | branch-2 passed |
| +1 💚 | spotless | 0m 48s | | branch has no errors when running spotless:check. |
| | | | | _ Patch Compile Tests _ |
| +0 🆗 | mvndep | 0m 12s | | Maven dependency ordering for patch |
| +1 💚 | mvninstall | 3m 4s | | the patch passed |
| +1 💚 | compile | 3m 55s | | the patch passed |
| +1 💚 | javac | 3m 55s | | the patch passed |
| +1 💚 | blanks | 0m 0s | | The patch has no blanks issues. |
| +1 💚 | checkstyle | 0m 16s | | hbase-client: The patch generated 0 new + 11 unchanged - 1 fixed = 11 total (was 12) |
| +1 💚 | checkstyle | 0m 38s | | The patch passed checkstyle in hbase-server |
| +1 💚 | spotbugs | 2m 38s | | the patch passed |
| +1 💚 | hadoopcheck | 17m 17s | | Patch does not cause any errors with Hadoop 2.10.2 or 3.3.6 3.4.0. |
| +1 💚 | spotless | 0m 43s | | patch has no errors when running spotless:check. |
| | | | | _ Other Tests _ |
| +1 💚 | asflicense | 0m 18s | | The patch does not generate ASF License warnings. |
| | | 43m 57s | | |

| Subsystem | Report/Notes |
| --- | --- |
| Docker | ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7079/3/artifact/yetus-general-check/output/Dockerfile |
| GITHUB PR | #7079 |
| Optional Tests | dupname asflicense javac spotbugs checkstyle codespell detsecrets compile hadoopcheck hbaseanti spotless |
| uname | Linux 7e100f2c6c6b 5.4.0-1103-aws #111~18.04.1-Ubuntu SMP Tue May 23 20:04:10 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/hbase-personality.sh |
| git revision | branch-2 / 7942248 |
| Default Java | Eclipse Adoptium-11.0.23+9 |
| Max. process+thread count | 79 (vs. ulimit of 30000) |
| modules | C: hbase-client hbase-server U: . |
| Console output | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7079/3/console |
| versions | git=2.34.1 maven=3.9.8 spotbugs=4.7.3 |
| Powered by | Apache Yetus 0.15.0 https://yetus.apache.org |

This message was automatically generated.

@Apache-HBase

💔 -1 overall

| Vote | Subsystem | Runtime | Logfile | Comment |
| --- | --- | --- | --- | --- |
| +0 🆗 | reexec | 0m 50s | | Docker mode activated. |
| -0 ⚠️ | yetus | 0m 6s | | Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --author-ignore-list --blanks-eol-ignore-file --blanks-tabs-ignore-file --quick-hadoopcheck |
| | | | | _ Prechecks _ |
| | | | | _ branch-2 Compile Tests _ |
| +0 🆗 | mvndep | 0m 12s | | Maven dependency ordering for branch |
| +1 💚 | mvninstall | 2m 41s | | branch-2 passed |
| +1 💚 | compile | 1m 0s | | branch-2 passed |
| +1 💚 | javadoc | 0m 41s | | branch-2 passed |
| +1 💚 | shadedjars | 5m 31s | | branch has no errors when building our shaded downstream artifacts. |
| | | | | _ Patch Compile Tests _ |
| +0 🆗 | mvndep | 0m 14s | | Maven dependency ordering for patch |
| +1 💚 | mvninstall | 2m 29s | | the patch passed |
| +1 💚 | compile | 1m 0s | | the patch passed |
| +1 💚 | javac | 1m 0s | | the patch passed |
| +1 💚 | javadoc | 0m 40s | | the patch passed |
| +1 💚 | shadedjars | 5m 28s | | patch has no errors when building our shaded downstream artifacts. |
| | | | | _ Other Tests _ |
| +1 💚 | unit | 8m 2s | | hbase-client in the patch passed. |
| -1 ❌ | unit | 214m 11s | /patch-unit-hbase-server.txt | hbase-server in the patch failed. |
| | | 247m 49s | | |

| Subsystem | Report/Notes |
| --- | --- |
| Docker | ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7079/3/artifact/yetus-jdk8-hadoop2-check/output/Dockerfile |
| GITHUB PR | #7079 |
| Optional Tests | javac javadoc unit compile shadedjars |
| uname | Linux ed56a6a5fe9e 5.4.0-1103-aws #111~18.04.1-Ubuntu SMP Tue May 23 20:04:10 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/hbase-personality.sh |
| git revision | branch-2 / 7942248 |
| Default Java | Temurin-1.8.0_412-b08 |
| Test Results | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7079/3/testReport/ |
| Max. process+thread count | 4280 (vs. ulimit of 30000) |
| modules | C: hbase-client hbase-server U: . |
| Console output | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7079/3/console |
| versions | git=2.34.1 maven=3.9.8 |
| Powered by | Apache Yetus 0.15.0 https://yetus.apache.org |

This message was automatically generated.

@Apache-HBase

🎊 +1 overall

| Vote | Subsystem | Runtime | Logfile | Comment |
| --- | --- | --- | --- | --- |
| +0 🆗 | reexec | 0m 48s | | Docker mode activated. |
| -0 ⚠️ | yetus | 0m 4s | | Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --author-ignore-list --blanks-eol-ignore-file --blanks-tabs-ignore-file --quick-hadoopcheck |
| | | | | _ Prechecks _ |
| | | | | _ branch-2 Compile Tests _ |
| +0 🆗 | mvndep | 0m 46s | | Maven dependency ordering for branch |
| +1 💚 | mvninstall | 3m 40s | | branch-2 passed |
| +1 💚 | compile | 1m 14s | | branch-2 passed |
| +1 💚 | javadoc | 0m 44s | | branch-2 passed |
| +1 💚 | shadedjars | 6m 37s | | branch has no errors when building our shaded downstream artifacts. |
| | | | | _ Patch Compile Tests _ |
| +0 🆗 | mvndep | 0m 14s | | Maven dependency ordering for patch |
| +1 💚 | mvninstall | 3m 6s | | the patch passed |
| +1 💚 | compile | 1m 15s | | the patch passed |
| +1 💚 | javac | 1m 15s | | the patch passed |
| +1 💚 | javadoc | 0m 43s | | the patch passed |
| +1 💚 | shadedjars | 6m 35s | | patch has no errors when building our shaded downstream artifacts. |
| | | | | _ Other Tests _ |
| +1 💚 | unit | 8m 20s | | hbase-client in the patch passed. |
| +1 💚 | unit | 225m 27s | | hbase-server in the patch passed. |
| | | 265m 58s | | |

| Subsystem | Report/Notes |
| --- | --- |
| Docker | ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7079/3/artifact/yetus-jdk11-hadoop3-check/output/Dockerfile |
| GITHUB PR | #7079 |
| Optional Tests | javac javadoc unit compile shadedjars |
| uname | Linux fcbcdc7e6f67 5.4.0-1103-aws #111~18.04.1-Ubuntu SMP Tue May 23 20:04:10 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/hbase-personality.sh |
| git revision | branch-2 / 7942248 |
| Default Java | Eclipse Adoptium-11.0.23+9 |
| Test Results | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7079/3/testReport/ |
| Max. process+thread count | 4462 (vs. ulimit of 30000) |
| modules | C: hbase-client hbase-server U: . |
| Console output | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7079/3/console |
| versions | git=2.34.1 maven=3.9.8 |
| Powered by | Apache Yetus 0.15.0 https://yetus.apache.org |

This message was automatically generated.

@Apache-HBase

💔 -1 overall

| Vote | Subsystem | Runtime | Logfile | Comment |
| --- | --- | --- | --- | --- |
| +0 🆗 | reexec | 0m 53s | | Docker mode activated. |
| -0 ⚠️ | yetus | 0m 6s | | Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --author-ignore-list --blanks-eol-ignore-file --blanks-tabs-ignore-file --quick-hadoopcheck |
| | | | | _ Prechecks _ |
| | | | | _ branch-2 Compile Tests _ |
| +0 🆗 | mvndep | 0m 12s | | Maven dependency ordering for branch |
| +1 💚 | mvninstall | 3m 59s | | branch-2 passed |
| +1 💚 | compile | 2m 9s | | branch-2 passed |
| +1 💚 | javadoc | 1m 15s | | branch-2 passed |
| +1 💚 | shadedjars | 7m 44s | | branch has no errors when building our shaded downstream artifacts. |
| | | | | _ Patch Compile Tests _ |
| +0 🆗 | mvndep | 0m 16s | | Maven dependency ordering for patch |
| +1 💚 | mvninstall | 4m 6s | | the patch passed |
| +1 💚 | compile | 1m 35s | | the patch passed |
| +1 💚 | javac | 1m 35s | | the patch passed |
| +1 💚 | javadoc | 0m 59s | | the patch passed |
| +1 💚 | shadedjars | 7m 18s | | patch has no errors when building our shaded downstream artifacts. |
| | | | | _ Other Tests _ |
| +1 💚 | unit | 8m 40s | | hbase-client in the patch passed. |
| -1 ❌ | unit | 234m 47s | /patch-unit-hbase-server.txt | hbase-server in the patch failed. |
| | | 279m 34s | | |

| Subsystem | Report/Notes |
| --- | --- |
| Docker | ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7079/3/artifact/yetus-jdk17-hadoop3-check/output/Dockerfile |
| GITHUB PR | #7079 |
| Optional Tests | javac javadoc unit compile shadedjars |
| uname | Linux 4b83500a8a10 5.4.0-1103-aws #111~18.04.1-Ubuntu SMP Tue May 23 20:04:10 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/hbase-personality.sh |
| git revision | branch-2 / 7942248 |
| Default Java | Eclipse Adoptium-17.0.11+9 |
| Test Results | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7079/3/testReport/ |
| Max. process+thread count | 4596 (vs. ulimit of 30000) |
| modules | C: hbase-client hbase-server U: . |
| Console output | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7079/3/console |
| versions | git=2.34.1 maven=3.9.8 |
| Powered by | Apache Yetus 0.15.0 https://yetus.apache.org |

This message was automatically generated.

@charlesconnell charlesconnell merged commit 6aca45b into apache:branch-2 Jul 9, 2025
1 check failed
charlesconnell pushed a commit that referenced this pull request Jul 9, 2025
…f batch operation timeout exceeded (#7079)

Authored by: Daniel Roudnitsky <droudnitsky1@bloomberg.net>
Signed off by: Charles Connell <cconnell@apache.org>