Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

prevent PANIC when leader is None #450

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

cre4ture
Copy link

@cre4ture cre4ture commented May 22, 2024

Hello,

I'm currently experimenting with TiFS (https://github.com/Hexilee/tifs).
When copying large files, the TiKV cluster that I have locally gets under a constant load for a longer period of time (~1-2 hours).
During these tests I came accross a panic caused by the rust-client. I located the code responsible for this, and implemented a fix.
Please have a look, tell me what you think and what to adapt to let it get into the main branch.

Thanks and best regards,
cre4ture

log entry of the panic:

thread 'tokio-runtime-worker' panicked at /home/uli/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tikv-client-0.3.0/src/pd/cluster.rs:270:47: called `Option::unwrap()` on a `None` value

happend in tests with tifs:
thread 'tokio-runtime-worker' panicked at /home/uli/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tikv-client-0.3.0/src/pd/cluster.rs:270:47:
called `Option::unwrap()` on a `None` value

Signed-off-by: Ulrich Hornung <hornunguli@gmx.de>
Copy link
Collaborator

@pingyu pingyu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The panic seems to happen only when PD return GetMembersResponse without leader field. But I think it should not happen as only the PD leader can return a successful response of get members request.

So I think the root cause would be, we actually get a GetMemebersResponse with error (see GrpcServer.GetMembers). We need the check the header.error at the first place.

@@ -91,7 +91,9 @@ pub enum Error {
RegionNotFoundInResponse { region_id: u64 },
/// No leader is found for the given id.
#[error("Leader of region {} is not found", region_id)]
LeaderNotFound { region_id: u64 },
LeaderOfRegionNotFound { region_id: u64 },
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest not to change this error as some existed user codes will be using it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no concept of "Leader of cluster", but "region leader" or "PD leader". so it would be a little confused.

At the same time, it seems that it is not very necessary to add a new error enum, unless you want to handle this specified error. I think use InternalError the same as existed code of PD would be enough.

@pingyu
Copy link
Collaborator

pingyu commented Jun 2, 2024

The issue of handling get_members of PD was fixed by #452. We should not encounter the situation that leader is None anymore.

@cre4ture Could you help with some testing to confirm that this issue has been completely resolved ? Thank you ~

@cre4ture
Copy link
Author

cre4ture commented Jun 4, 2024

The issue of handling get_members of PD was fixed by #452. We should not encounter the situation that leader is None anymore.

@cre4ture Could you help with some testing to confirm that this issue has been completely resolved ? Thank you ~

I will test and report in case I still see it. Thanks for notification.

I assume this PR would be obsolete then? Or should I prepare still some parts of it for merging?

@pingyu
Copy link
Collaborator

pingyu commented Jun 4, 2024

The issue of handling get_members of PD was fixed by #452. We should not encounter the situation that leader is None anymore.
@cre4ture Could you help with some testing to confirm that this issue has been completely resolved ? Thank you ~

I will test and report in case I still see it. Thanks for notification.

I assume this PR would be obsolete then? Or should I prepare still some parts of it for merging?

It is obsolete now. Unless it's not the reason of get_members in your case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants