Bug: SurrealDB can't survive TiKV cluster restart #3570

Open
2 tasks done
pashinin opened this issue Feb 23, 2024 · 3 comments
Labels: bug (Something isn't working); topic:stability (This is related to stability issues: abnormal resource usage, unexpected errors, etc)

Comments

pashinin commented Feb 23, 2024

Describe the bug

SurrealDB lost its connection and cannot reconnect on its own. The logs are full of:

2024-02-23T06:52:12.567638Z ERROR surreal::node: Error running node agent tick: There was a problem with a datastore transaction: Leader of region 5 is not found
2024-02-23T06:52:22.569187Z ERROR surreal::node: Error running node agent tick: There was a problem with a datastore transaction: Leader of region 5 is not found
2024-02-23T06:52:32.570532Z ERROR surreal::node: Error running node agent tick: There was a problem with a datastore transaction: Leader of region 5 is not found
2024-02-23T06:52:42.572181Z ERROR surreal::node: Error running node agent tick: There was a problem with a datastore transaction: Leader of region 5 is not found
2024-02-23T06:52:52.574179Z ERROR surreal::node: Error running node agent tick: There was a problem with a datastore transaction: Leader of region 5 is not found
2024-02-23T06:53:02.576163Z ERROR surreal::node: Error running node agent tick: There was a problem with a datastore transaction: Leader of region 5 is not found

The same error appears in a shell. The worst part is that the /health endpoint still returns 200 OK.

If I manually restart SurrealDB, everything works again.

Steps to reproduce

  1. Stop the TiKV cluster.
  2. Start the TiKV cluster.

Expected behaviour

SurrealDB can reconnect.

SurrealDB version

1.2.1 for linux on x86_64

Is there an existing issue for this?

  • I have searched the existing issues

Code of Conduct

  • I agree to follow this project's Code of Conduct

pashinin added the bug and triage labels on Feb 23, 2024

ioannist commented

We ran into the same error: an extended async operation failed because of it (see #3408).

I assume this happens when TiKV decides to change to another leader node for whatever reason.

sgirones (Member) commented

Thanks for reporting. We are aware of it and will work on a fix.

sgirones added the topic:stability label and removed the triage label on Feb 23, 2024
sgirones self-assigned this on Feb 23, 2024

manelmontilla commented

After some time digging into it, the issue looks similar to what this PR in the TiKV Rust client tries to solve: tikv/client-rust#445. The problem is that the PR seems to fix the issue only when the client is created during server startup, not when a new leader of a region is elected. The PR mentions that in those other cases, that is, after the client is created, the application using the client should be the one retrying on errors. So we should probably modify the TiKV datastore to handle those errors gracefully by recreating the client.
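
To make the "retry by recreating the client" idea concrete, here is a minimal sketch in plain Rust. It is not SurrealDB's datastore code and does not use the real TiKV client API: `StoreError`, `Reconnecting`, `connect`, and `max_retries` are all hypothetical names invented for illustration, and the client type is left generic. A real fix would presumably be async and add a backoff between attempts.

```rust
/// Hypothetical error type standing in for whatever the datastore returns.
#[derive(Debug, Clone, PartialEq)]
enum StoreError {
    /// Stale leader information, e.g. "Leader of region 5 is not found".
    LeaderNotFound(u64),
    /// Any other failure; this sketch does not retry it.
    Other(String),
}

/// Holds a client together with the factory that can rebuild it.
struct Reconnecting<C, F> {
    client: C,
    connect: F,
    max_retries: usize,
}

impl<C, F> Reconnecting<C, F>
where
    F: FnMut() -> Result<C, StoreError>,
{
    /// Builds the first client using the supplied factory.
    fn new(mut connect: F, max_retries: usize) -> Result<Self, StoreError> {
        let client = connect()?;
        Ok(Self { client, connect, max_retries })
    }

    /// Runs `op`. On `LeaderNotFound`, replaces the client with a fresh one
    /// and tries again, up to `max_retries` times. Other errors are returned
    /// to the caller unchanged.
    fn run<T>(
        &mut self,
        mut op: impl FnMut(&mut C) -> Result<T, StoreError>,
    ) -> Result<T, StoreError> {
        let mut attempts = 0;
        loop {
            match op(&mut self.client) {
                Err(StoreError::LeaderNotFound(_)) if attempts < self.max_retries => {
                    attempts += 1;
                    // Drop the stale client and create a new one.
                    self.client = (self.connect)()?;
                }
                other => return other,
            }
        }
    }
}
```

The wrapper only reacts to the one error variant that signals stale leader information, so unrelated failures still surface immediately instead of being hidden behind retries.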
