New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Second part of #1644 fix: C*4 from 4.0.1 to 4.0.3 upgrade #1647
Conversation
Similar to C*3, new version of Cassandra-4 also triggers one new IT failure:
Relevant stack trace seems to be:
|
@tatu-at-datastax Did the analysis of the above exception.. I even reproduced it locally, but it's very hard to do so as it's a race condition bug in the Cassandra
This is a complete mess:
For start I am thinking to create a test to easily reproduce this.. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Stops from merging, due to the #1647 (comment)
Here's the patch to easily reproduce locally. Confirmed that the same patch is reproducing the same error with C*
|
From parent issue: possibly due to: |
@ivansenic Thank you for digging into this! Do you think it'd make sense to merge the test change via separate PR (I assume it passes against current C3.11.11 / C4.0.1)? Also, do you think C*3.11 failures are due to same root cause or something different? |
@tatu-at-datastax tests are green now, but the stuff was not fixed in the |
Is that client level code or internal? Any chance you can try similar against straight C*, not stargate, to show a Cassandra bug to be fixed? |
Sure let me try. |
This is likely to race no matter how the internals are implemented. diff --git a/persistence-cassandra-4.0/src/main/java/io/stargate/db/cassandra/impl/StargateQueryHandler.java b/persistence-cassandra-4.0/src/main/java/io/stargate/db/cassandra/impl/StargateQueryHandler.java
index fba767be..978afa0b 100644
--- a/persistence-cassandra-4.0/src/main/java/io/stargate/db/cassandra/impl/StargateQueryHandler.java
+++ b/persistence-cassandra-4.0/src/main/java/io/stargate/db/cassandra/impl/StargateQueryHandler.java
@@ -151,8 +151,14 @@ public class StargateQueryHandler implements QueryHandler {
throws RequestValidationException {
ResultMessage.Prepared prepare = QueryProcessor.instance.prepare(s, clientState, map);
Prepared prepared = QueryProcessor.instance.getPrepared(prepare.statementId);
- boolean idempotent = IdempotencyAnalyzer.isIdempotent(prepared.statement);
- boolean useKeyspace = prepared.statement instanceof UseStatement;
+ CQLStatement statement;
+ if (prepared == null) { // Prepared cache evicted the entry, re-parse the statement (slow path)
+ statement = QueryProcessor.getStatement(s, clientState);
+ } else {
+ statement = prepared.statement;
+ }
+ boolean idempotent = IdempotencyAnalyzer.isIdempotent(statement);
+ boolean useKeyspace = statement instanceof UseStatement;
return new PreparedWithInfo(idempotent, useKeyspace, prepare);
} |
But is the prepared == null expected exactly after a call to prepare it? If this was the case until now, why did not we have similar fix until now? I am more for waiting that this is fixed in C*, if this is a bug there.. If not then yea let s have something like this.. |
lol. Agreed, 4.0.3 has no fixes here (afaik) over 4.0.2. |
Tried to reproduce against C* 4.0.2 without Stargate, but I had no luck.. Then I used the same code and fired against Stargate and still it worked fine and no errors where there.. Maybe my test setup was wrong, I used the following code:
This should hit the CQL with high concurrency from the client side.. And I tried to use same statement as in the failing test.. Maybe the CQL driver fixes the statement to be fully qualified.. Not sure really.. UPDATE: OK, I figured out what's going on, this is so hard to reproduce.. Problem is that race condition between two threads needs to happen on a first occasion when the statement is prepared.. The only way to reproduce is that two threads enter the prepare method at the same time when the statement is not prepared.. As soon as one thread executes the method without interference of other, there is no chance to get the error in the future.. I am not sure if I can reproduce this with driver.. I guess we were quite lucky to hit this, and quite lucky to have grpc streaming as I guess without it it would be impossible.. |
@ivansenic I may be wrong but it seems to me that the reproduction you had might not actually reproduce the bug (see my comment on #1655). Fix may well be valid but it'd be important to be able to reproduce the fail first to verify it. In addition to the problem of reproducing the issue with. new tests, I tried to make |
It's super easy to reproduce this on my machine.. Just use my diff and start the INT test locally with the |
With d19a15d builds are failing for Stack trace
|
The 5033209 should fix build with C* |
The a57630c sholuld fix build with C* |
OK preparing problems are solved. However, new issue is unrevealed wrt gRPC now. Seems that gRPC is not implementing retries in case of the
@mpenick I will add a small fix for the
I am sure this has to be valid. |
And just for sake of completeness: my failure to reproduce underlying issue was due to PR (#1655) not containing C*4 upgrade so was testing against 4.0.1 which does not have the issue. |
testing/src/main/java/io/stargate/it/grpc/streaming/ExecuteBatchStreamingTest.java
Show resolved
Hide resolved
persistence-cassandra-4.0/src/main/java/io/stargate/db/cassandra/impl/Conversion.java
Show resolved
Hide resolved
Looks good @ivansenic -- approved. |
CQLStatement statement = | ||
Optional.ofNullable(QueryProcessor.instance.getPrepared(prepare.statementId)) | ||
.map(p -> p.statement) | ||
.orElseGet(() -> QueryProcessor.getStatement(s, clientState)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So if I'm following correctly, QueryProcessor.getStatement
re-prepares the statement but does not put it in the cache?
How does the cache eventually get populated? Is it by the other thread we were racing with, or just the next time StargateQueryHandler.prepare
gets called?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes QueryProcessor.getStatement
re-parse and re-prepares, with out interacting with the cache. In addition we get a statement back. Instead of the MD5Digest.
Cache will get evicted and then re-populated by another thread. Last thread always inserts to cache.
Also adds work-around/fix for https://issues.apache.org/jira/browse/CASSANDRA-17401 as well as a test to verify.
What this PR does:
Second upgrade for #1644: Cassandra 4.x dependency from 4.0.1 to 4.0.3
Which issue(s) this PR fixes:
Part of #1644.
Checklist