Takes too long time for a cluster to recover from leader crash #2866

Closed
kikimo opened this issue Sep 15, 2021 · 6 comments

@kikimo
Contributor

kikimo commented Sep 15, 2021

Please check the FAQ documentation before raising an issue

Please check the FAQ documentation and old issues before raising an issue in case someone has asked the same question that you are asking.

Describe the bug (must be provided)

The cluster takes a very long time to recover from a leader crash.

Your Environments (must be provided)

How To Reproduce (must be provided)

Steps to reproduce the behavior:

  1. Create a cluster with 3 storage + 3 graph + 1 meta, and a space with 3 partitions + 3 replicas.
  2. Keep inserting edges.
  3. Send SIGSTOP to one leader (you can check partition leaders by typing show hosts in nebula-console).

Although nebula storage elects a new leader very quickly, it takes nearly 5 min for the cluster to get back to normal.
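The reproduction and the recovery-time observation above can be scripted roughly as follows. This is only a sketch: `freeze`, `thaw`, and `measure_recovery` are hypothetical helper names, and `probe` stands in for whatever check you consider "back to normal" (for example, a single edge insert through a client that returns True on success). None of this is part of nebula itself.

```python
import os
import signal
import time
from typing import Callable, Optional


def freeze(pid: int) -> None:
    """Suspend a process with SIGSTOP (it stays alive but is not scheduled)."""
    os.kill(pid, signal.SIGSTOP)


def thaw(pid: int) -> None:
    """Resume a process previously suspended with SIGSTOP."""
    os.kill(pid, signal.SIGCONT)


def measure_recovery(probe: Callable[[], bool],
                     timeout: float = 600.0,
                     interval: float = 1.0) -> Optional[float]:
    """Poll probe() until it returns True.

    Returns the elapsed seconds, or None if the timeout expires first.
    probe is a hypothetical, caller-supplied health check.
    """
    start = time.monotonic()
    while time.monotonic() - start < timeout:
        try:
            if probe():
                return time.monotonic() - start
        except Exception:
            pass  # treat client errors as "not recovered yet"
        time.sleep(interval)
    return None
```

A run would call `freeze(leader_pid)` on the storage process that hosts the leader, then `measure_recovery(try_insert)` with your own probe, and finally `thaw(leader_pid)` to let the paused process continue.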

Expected behavior

A clear and concise description of what you expected to happen.

Additional context

Provide logs and configs, or any other context to trace the problem.

@kikimo kikimo added the type/bug Type: something is unexpected label Sep 15, 2021
@kikimo kikimo added this to the v2.6.0 milestone Sep 15, 2021
@kikimo kikimo changed the title takes too long time for cluster to recover from leader crash takes too long time for s cluster to recover from leader crash Sep 15, 2021
@kikimo kikimo changed the title takes too long time for s cluster to recover from leader crash Takes too long time for s cluster to recover from leader crash Sep 15, 2021
@kikimo kikimo changed the title Takes too long time for s cluster to recover from leader crash Takes too long time for a cluster to recover from leader crash Sep 16, 2021
@liuyu85cn
Contributor

In one reproduction, raft took 15 min to do a rollback, and the graph client appeared stuck.

@Sophie-Xie
Contributor

In one reproduction, raft took 15 min to do a rollback, and the graph client appeared stuck.

@Aiee Go client, please check it.

@Sophie-Xie
Contributor

raft #2903

@Sophie-Xie Sophie-Xie added the need to discuss Solution: issue or PR without a clear conclusion on whether to handle it label Oct 8, 2021
@Sophie-Xie Sophie-Xie assigned CPWstatic and unassigned critical27 and Aiee Oct 8, 2021
@Sophie-Xie Sophie-Xie removed the need to discuss Solution: issue or PR without a clear conclusion on whether to handle it label Oct 9, 2021
@critical27
Contributor

This could be quite a complicated case. Frankly speaking, I have no idea what happens when SIGSTOP is sent to the thrift server; this may involve the system futex. (Raft taking 15 min to do a rollback clearly cannot be blamed on the rollback itself.)

We can investigate later.

@critical27 critical27 modified the milestones: v2.6.0, v2.7.0 Oct 12, 2021
@kikimo
Contributor Author

kikimo commented Oct 12, 2021

This could be quite a complicated case. Frankly speaking, I have no idea what happens when SIGSTOP is sent to the thrift server; this may involve the system futex. (Raft taking 15 min to do a rollback clearly cannot be blamed on the rollback itself.)

We can investigate later.

The effect of SIGSTOP is just to temporarily stop the process from being scheduled on the CPU, exactly the same as when you attach gdb to the process. If the process's call to futex() blocks, I'm 100% sure that is not a problem caused by SIGSTOP. I'll add more detail later.
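A small sketch of the signal semantics described above, using a throwaway child process rather than a nebula storage daemon. The function name `stop_and_resume_child` is made up for illustration, and the example says nothing about futex behaviour; it only shows that a stopped process is suspended, stays alive, and continues after SIGCONT.

```python
import os
import signal
import subprocess
import sys


def stop_and_resume_child() -> dict:
    """Stop a child with SIGSTOP, resume it with SIGCONT, and report what was observed."""
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    try:
        os.kill(child.pid, signal.SIGSTOP)
        # WUNTRACED makes waitpid report a stopped (not exited) child.
        _, status = os.waitpid(child.pid, os.WUNTRACED)
        stopped = os.WIFSTOPPED(status)
        stop_signal = os.WSTOPSIG(status) if stopped else None

        os.kill(child.pid, signal.SIGCONT)
        # WCONTINUED makes waitpid report that the child was resumed.
        _, status = os.waitpid(child.pid, os.WCONTINUED)
        continued = os.WIFCONTINUED(status)

        # The child was never terminated: it is still running after the resume.
        alive = child.poll() is None
        return {
            "stopped": stopped,
            "stop_signal": stop_signal,
            "continued": continued,
            "alive": alive,
        }
    finally:
        child.kill()
        child.wait()
```

On a POSIX system, the returned dictionary should show the child reported as stopped by SIGSTOP, then continued, and still alive afterwards.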

@Sophie-Xie Sophie-Xie added this to the v3.0.0 milestone Oct 15, 2021
@kikimo
Contributor Author

kikimo commented Dec 23, 2021

closed by #3435


6 participants