
ExecuteBatchStreamingTest fails sporadically for the C* 4.0 #1679

Closed
ivansenic opened this issue Mar 7, 2022 · 10 comments · Fixed by #1688
Labels
bug Something isn't working

Comments

@ivansenic
Contributor

Unfortunately, the gRPC test ExecuteBatchStreamingTest.manyStreamingBatch still fails sporadically, even after merging #1661. It seems that two retries are not enough, and we can hit a situation where both retries also fail.
I don't see any solution other than increasing the number of retries. We might keep the count limited, but maybe 2 is too few.

@ivansenic ivansenic added the bug Something isn't working label Mar 7, 2022
@ivansenic
Contributor Author

@mpenick what do you think?

@tatu-at-datastax
Collaborator

tatu-at-datastax commented Mar 7, 2022

Is there any delay/wait between retries? 2 may be too low, but just as importantly, retries usually should not be back-to-back. Especially since in this case the failure is immediate, I think (no external calls, and access to the cache is very fast).

@ivansenic
Contributor Author

There is no delay, as there was no option to add one in the existing gRPC retry policy and the implementation around it.

@tatu-at-datastax
Collaborator

tatu-at-datastax commented Mar 7, 2022

@ivansenic that makes sense, I was guessing that would be the case. I'm not sure how, but it seems important here that we avoid a busy-loop attempt, because that could require a very high retry count without any guarantees.

I guess we could still start with a higher retry count, but without some sort of "Thread.yield()" equivalent (that is, something similar that would work in this context), I'm not sure how much a simple increase of back-to-back retries would help.
Although maybe gRPC is smart enough to do something like this by default (and not just assume there must be a slower external call or I/O).
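As a thought experiment for the "delay between retries without blocking" idea above, here is a minimal, hypothetical sketch of a non-blocking retry: instead of sleeping on the calling thread (which would not be viable on a gRPC direct executor), each failed attempt reschedules itself on a `ScheduledExecutorService` with a growing delay. All names here are illustrative, not the project's actual retry hook.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical sketch: async retry with exponential backoff that never
// blocks the calling thread. Failed attempts are rescheduled off-thread.
final class AsyncRetry {
    static <T> CompletableFuture<T> retry(Supplier<CompletableFuture<T>> op,
                                          int attemptsLeft, long delayMs,
                                          ScheduledExecutorService scheduler) {
        CompletableFuture<T> result = new CompletableFuture<>();
        op.get().whenComplete((value, err) -> {
            if (err == null) {
                result.complete(value);
            } else if (attemptsLeft <= 1) {
                result.completeExceptionally(err);
            } else {
                // Reschedule with doubled delay: no busy loop, no blocking sleep.
                scheduler.schedule(
                    () -> retry(op, attemptsLeft - 1, delayMs * 2, scheduler)
                              .whenComplete((v, e) -> {
                                  if (e == null) result.complete(v);
                                  else result.completeExceptionally(e);
                              }),
                    delayMs, TimeUnit.MILLISECONDS);
            }
        });
        return result;
    }
}
```

Whether something like this can be wired into the existing gRPC retry path is exactly the open question in this thread; the sketch only shows that backoff does not have to mean blocking.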

@ivansenic
Contributor Author

@tatu-at-datastax I think that since we are using the direct executor in gRPC, there must not be any thread blocking. So no, this is not an option.

@mpenick
Contributor

mpenick commented Mar 8, 2022

I guess I need to understand why it's being invalidated so much that it's returning unprepared so often. Is that something that can be fixed?

@mpenick
Contributor

mpenick commented Mar 8, 2022

I see: https://github.com/apache/cassandra/blob/cassandra-4.0.3/src/java/org/apache/cassandra/cql3/QueryProcessor.java#L580-L583

So if the same statement is prepared at the same time, either one of those checks could fail because of a simultaneous eviction. This is not ideal, but I think the way to fix it is to synchronize prepares, at least in gRPC, to reduce the chance of eviction. However, this is likely to affect all APIs that eagerly prepare, so maybe all prepares could be synchronized for the C* 4.x persistence?
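To illustrate the "synchronize prepares" idea: if two concurrent prepares of the same statement are serialized, the race with a simultaneous eviction closes. The following is a minimal, hypothetical sketch (the names and cache shape are illustrative, not the actual Stargate/C* persistence API), using one global lock as the simplest possible form:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: serialize all prepares behind a single lock so two
// identical statements are never prepared concurrently, shrinking the
// window in which an eviction can invalidate an in-flight prepare.
final class SynchronizedPreparer {
    private final Object prepareLock = new Object();
    private final Map<String, String> preparedCache = new ConcurrentHashMap<>();

    String prepare(String cql) {
        synchronized (prepareLock) {
            // Only one thread at a time reaches the cache-or-prepare step.
            return preparedCache.computeIfAbsent(cql, q -> "prepared:" + q);
        }
    }
}
```

A single global lock is the blunt version of this; the follow-up comments below discuss splitting it for performance.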

@tatu-at-datastax
Collaborator

@ivansenic yes, understood; I was expecting that sleep/yield won't work. But effectively we need something like that (just not those :) ).

@tatu-at-datastax
Collaborator

@mpenick Wow. That cache-handling logic is seriously convoluted, with fixes wrt keyspace/no-keyspace cases and all. It is difficult to trust it to work for all cases...

But if that is to be retained (which I assume it needs to be, for performance), then instead of a global sync lock it is probably possible to split the lock into N different ones, selected by hash code modulo N.
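The lock-splitting idea above is classic lock striping: N locks, with each key mapped to a stripe by its hash code modulo N, so unrelated statements do not contend on one global lock. A minimal, hypothetical sketch (names are illustrative):

```java
// Hypothetical sketch of lock striping: N lock objects, one chosen per key
// by hash code modulo N. Prepares of different statements usually land on
// different stripes and proceed in parallel; identical statements always
// land on the same stripe and are serialized.
final class StripedLocks {
    private final Object[] stripes;

    StripedLocks(int n) {
        stripes = new Object[n];
        for (int i = 0; i < n; i++) stripes[i] = new Object();
    }

    Object stripeFor(String key) {
        // floorMod keeps the index non-negative even for negative hash codes.
        return stripes[Math.floorMod(key.hashCode(), stripes.length)];
    }

    <T> T withLock(String key, java.util.function.Supplier<T> body) {
        synchronized (stripeFor(key)) {
            return body.get();
        }
    }
}
```

Guava ships a production version of this pattern as `com.google.common.util.concurrent.Striped`, which could be an option if the project already depends on Guava.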

@mpenick
Contributor

mpenick commented Mar 9, 2022

But if that is to be retained (which I assume it needs to be, for performance), then instead of a global sync lock it is probably possible to split the lock into N different ones, selected by hash code modulo N.

Sounds good.
