Bug: SurrealDB can't survive TiKV cluster restart #3570

pashinin · 2024-02-23T07:12:11Z

Describe the bug

SurrealDB lost its connection to TiKV and cannot reconnect on its own. The logs are full of:

2024-02-23T06:52:12.567638Z ERROR surreal::node: Error running node agent tick: There was a problem with a datastore transaction: Leader of region 5 is not found
2024-02-23T06:52:22.569187Z ERROR surreal::node: Error running node agent tick: There was a problem with a datastore transaction: Leader of region 5 is not found
2024-02-23T06:52:32.570532Z ERROR surreal::node: Error running node agent tick: There was a problem with a datastore transaction: Leader of region 5 is not found
2024-02-23T06:52:42.572181Z ERROR surreal::node: Error running node agent tick: There was a problem with a datastore transaction: Leader of region 5 is not found
2024-02-23T06:52:52.574179Z ERROR surreal::node: Error running node agent tick: There was a problem with a datastore transaction: Leader of region 5 is not found
2024-02-23T06:53:02.576163Z ERROR surreal::node: Error running node agent tick: There was a problem with a datastore transaction: Leader of region 5 is not found

The same error appears in a shell. The worst part is that the /health endpoint still returns 200 OK.

If I manually restart SurrealDB, everything is OK.

Steps to reproduce

Stop the TiKV cluster.
Start the TiKV cluster.

Expected behaviour

SurrealDB can reconnect.

SurrealDB version

1.2.1 for linux on x86_64

Is there an existing issue for this?

I have searched the existing issues

Code of Conduct

I agree to follow this project's Code of Conduct

ioannist · 2024-02-23T08:41:19Z

We ran into the same error: an extended async operation failed because of it (#3408).

I assume this happens when TiKV decides to change to another leader node for whatever reason.

sgirones · 2024-02-23T13:03:09Z

Thanks for reporting. We are aware of it and will work on a fix.

manelmontilla · 2024-04-16T10:51:07Z

After some time digging into it, it looks like the issue is similar to what this PR in the TiKV Rust client tries to solve: tikv/client-rust#445. The problem is that the PR seems to fix the issue only when the client is created during server startup, not when a new leader of a region is elected. The PR mentions that in those other cases, that is, after the client has been created, the application using the client should be the one retrying on the errors. So we should probably modify the TiKV datastore to handle those errors gracefully by recreating the client.
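To illustrate the "retry by recreating the client" idea from the comment above, here is a minimal, language-agnostic sketch written in Python. It is not SurrealDB or TiKV client code: `RegionLeaderError`, `FakeClient`, `run_with_reconnect`, and all parameters are hypothetical names invented for this example, and a real fix would have to decide which datastore errors count as retryable.

```python
import time


class RegionLeaderError(Exception):
    """Hypothetical stand-in for errors like 'Leader of region 5 is not found'."""


class FakeClient:
    """Hypothetical client used only to exercise the retry logic."""

    def __init__(self, healthy):
        self.healthy = healthy

    def get(self, key):
        if not self.healthy:
            raise RegionLeaderError("Leader of region 5 is not found")
        return f"value-for-{key}"


def run_with_reconnect(make_client, operation, max_attempts=3, backoff_seconds=0.0):
    """Run operation(client); on RegionLeaderError, build a new client and retry.

    make_client: zero-argument callable returning a fresh client.
    operation:   callable taking the client and returning a result.
    The last RegionLeaderError is re-raised once max_attempts is exhausted.
    """
    client = make_client()
    for attempt in range(1, max_attempts + 1):
        try:
            return operation(client)
        except RegionLeaderError:
            if attempt == max_attempts:
                raise
            # Linear backoff (assumption), then discard the stale client.
            time.sleep(backoff_seconds * attempt)
            client = make_client()
```

In this sketch the caller passes a factory rather than a client, so the wrapper can throw away a client holding stale region information and build a new one. Whether recreating the whole client is preferable to refreshing only its region cache is a design question the thread leaves open.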

pashinin added the labels bug (Something isn't working) and triage (This issue is new) Feb 23, 2024

sgirones added the label topic:stability (This is related to stability issues: abnormal resource usage, unexpected errors, etc) and removed the label triage (This issue is new) Feb 23, 2024

sgirones self-assigned this Feb 23, 2024

sgirones mentioned this issue Feb 23, 2024

Bug: Define aggregate view -> peer is not leader for region 4635, leader may None #3408

Closed

2 tasks
