Election timers can lead to resource exhaustion : Could not create thread: Resource temporarily unavailable #1216

Closed
amitanandaiyer opened this issue Apr 17, 2019 · 3 comments
Labels
area/docdb YugabyteDB core features

Comments

@amitanandaiyer
Contributor

Seeing this in one of the workloads. Once a node is unable to hear from its peers, its election timeout starts firing, and it turns out that we create a new thread for each such timeout (since we do not want to run the election in the timer thread).

However, the thread pool that the raft token uses is unlimited in size, so if the election work does not keep up with the timeouts, we could run into resource exhaustion.

https://github.com/YugaByte/yugabyte-db/blob/6262fc7a7e23c2533441b10be774a715d28f804f/src/yb/tserver/ts_tablet_manager.cc#L317

I0414 03:50:05.447449 15811 raft_consensus.cc:803] T c6d4a0a0d0e04e4db7a07ea5e0b2cff0 P 07c4d6051edb487c85276c4309e7793b [term 34 FOLLOWER]: ReportFailDetected: Starting NORMAL_ELECTION...
I0414 03:50:05.452247 25149 raft_consensus.cc:803] T da25186fd14545868286ea6b205383b3 P 07c4d6051edb487c85276c4309e7793b [term 20 FOLLOWER]: ReportFailDetected: Starting NORMAL_ELECTION...
I0414 03:50:05.453394 25150 raft_consensus.cc:803] T 0b54d0832741467ba6021d16ed1a33f0 P 07c4d6051edb487c85276c4309e7793b [term 22 FOLLOWER]: ReportFailDetected: Starting NORMAL_ELECTION...
I0414 03:50:05.464104 16404 leader_election.cc:215] T a873e7f151d7436b91a04e7ceb085467 P 07c4d6051edb487c85276c4309e7793b [CANDIDATE]: Term 12 pre-election: Requesting vote from peer 7f1eb600612044bfa88cfda14f67fad3
I0414 03:50:05.464107 16886 raft_consensus.cc:481] T a873e7f151d7436b91a04e7ceb085467 P 07c4d6051edb487c85276c4309e7793b [term 11 FOLLOWER]: Fail of leader 7f1eb600612044bfa88cfda14f67fad3 detected. Triggering leader pre-election, mode=NORMAL_ELECTION
W0414 03:50:05.464138 473 threadpool.cc:482] Thread pool failed to create thread: Runtime error (yb/util/thread.cc:586): Could not create thread: Resource temporarily unavailable (error 11)
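
To illustrate the difference a size limit makes, here is a minimal sketch in Python rather than the project's C++. It is not the YugabyteDB thread pool; `run_election_tasks`, `handle_timeout`, and the limit of 4 are hypothetical names and values chosen for illustration. With a capped pool, excess timeout tasks wait in a queue instead of each getting a new thread.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_election_tasks(num_timeouts, max_threads):
    """Submit one task per simulated election timeout to a capped pool.

    Returns the number of distinct worker threads that ran a task.
    """
    seen = set()
    lock = threading.Lock()

    def handle_timeout(_):
        # Record which worker thread picked up this task.
        with lock:
            seen.add(threading.get_ident())
        time.sleep(0.01)  # stand-in for slow election work

    # max_workers is the cap: tasks beyond it are queued, not given new threads.
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        list(pool.map(handle_timeout, range(num_timeouts)))
    return len(seen)


# 50 simulated timeouts, but never more than 4 worker threads (assumed limit).
workers_used = run_election_tasks(num_timeouts=50, max_threads=4)
```

The trade-off is that queued tasks run later than they were scheduled, so a real limit would have to be chosen with the election timeout in mind.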

@amitanandaiyer
Contributor Author

Seems to happen more often after pre-elections.

With pre-elections, we don't seem to back off exponentially, but instead try a new pre-election about every 3 seconds.

One option would probably be to add a back-off scheme similar to the one used for normal elections.

I0414 03:45:19.108985 24452 raft_consensus.cc:803] T 6388fff4f85d4227b6de5f15dd2154f4 P 07c4d6051edb487c85276c4309e7793b [term 24 FOLLOWER]: ReportFailDetected: Starting NORMAL_ELECTION...
I0414 03:45:22.446076 24983 raft_consensus.cc:803] T 6388fff4f85d4227b6de5f15dd2154f4 P 07c4d6051edb487c85276c4309e7793b [term 24 FOLLOWER]: ReportFailDetected: Starting NORMAL_ELECTION...
I0414 03:45:28.269204 25623 raft_consensus.cc:803] T 6388fff4f85d4227b6de5f15dd2154f4 P 07c4d6051edb487c85276c4309e7793b [term 24 FOLLOWER]: ReportFailDetected: Starting NORMAL_ELECTION...
I0414 03:45:31.275357 25981 raft_consensus.cc:803] T 6388fff4f85d4227b6de5f15dd2154f4 P 07c4d6051edb487c85276c4309e7793b [term 24 FOLLOWER]: ReportFailDetected: Starting NORMAL_ELECTION...
I0414 03:45:34.377466 26361 raft_consensus.cc:803] T 6388fff4f85d4227b6de5f15dd2154f4 P 07c4d6051edb487c85276c4309e7793b [term 24 FOLLOWER]: ReportFailDetected: Starting NORMAL_ELECTION...
I0414 03:45:39.505592 27011 raft_consensus.cc:803] T 6388fff4f85d4227b6de5f15dd2154f4 P 07c4d6051edb487c85276c4309e7793b [term 24 FOLLOWER]: ReportFailDetected: Starting NORMAL_ELECTION...
I0414 03:45:42.740705 27398 raft_consensus.cc:803] T 6388fff4f85d4227b6de5f15dd2154f4 P 07c4d6051edb487c85276c4309e7793b [term 24 FOLLOWER]: ReportFailDetected: Starting NORMAL_ELECTION...
I0414 03:45:46.403831 27868 raft_consensus.cc:803] T 6388fff4f85d4227b6de5f15dd2154f4 P 07c4d6051edb487c85276c4309e7793b [term 24 FOLLOWER]: ReportFailDetected: Starting NORMAL_ELECTION...
I0414 03:45:55.885215 28894 raft_consensus.cc:803] T 6388fff4f85d4227b6de5f15dd2154f4 P 07c4d6051edb487c85276c4309e7793b [term 24 FOLLOWER]: ReportFailDetected: Starting NORMAL_ELECTION...
I0414 03:45:58.908330 29205 raft_consensus.cc:803] T 6388fff4f85d4227b6de5f15dd2154f4 P 07c4d6051edb487c85276c4309e7793b [term 24 FOLLOWER]: ReportFailDetected: Starting NORMAL_ELECTION...
I0414 03:46:01.929430 29594 raft_consensus.cc:803] T 6388fff4f85d4227b6de5f15dd2154f4 P 07c4d6051edb487c85276c4309e7793b [term 24 FOLLOWER]: ReportFailDetected: Starting NORMAL_ELECTION...
I0414 03:46:04.852550 29955 raft_consensus.cc:803] T 6388fff4f85d4227b6de5f15dd2154f4 P 07c4d6051edb487c85276c4309e7793b [term 24 FOLLOWER]: ReportFailDetected: Starting NORMAL_ELECTION...
I0414 03:46:09.069561 30426 raft_consensus.cc:803] T 6388fff4f85d4227b6de5f15dd2154f4 P 07c4d6051edb487c85276c4309e7793b [term 24 FOLLOWER]: ReportFailDetected: Starting NORMAL_ELECTION...
I0414 03:46:13.860560 30924 raft_consensus.cc:803] T 6388fff4f85d4227b6de5f15dd2154f4 P 07c4d6051edb487c85276c4309e7793b [term 24 FOLLOWER]: ReportFailDetected: Starting NORMAL_ELECTION...
I0414 03:46:17.336658 31312 raft_consensus.cc:803] T 6388fff4f85d4227b6de5f15dd2154f4 P 07c4d6051edb487c85276c4309e7793b [term 24 FOLLOWER]: ReportFailDetected: Starting NORMAL_ELECTION...
I0414 03:46:21.066762 31751 raft_consensus.cc:803] T 6388fff4f85d4227b6de5f15dd2154f4 P 07c4d6051edb487c85276c4309e7793b [term 24 FOLLOWER]: ReportFailDetected: Starting NORMAL_ELECTION...
I0414 03:46:24.957352 32207 raft_consensus.cc:803] T 6388fff4f85d4227b6de5f15dd2154f4 P 07c4d6051edb487c85276c4309e7793b [term 24 FOLLOWER]: ReportFailDetected: Starting NORMAL_ELECTION...
I0414 03:46:28.675365 32670 raft_consensus.cc:803] T 6388fff4f85d4227b6de5f15dd2154f4 P 07c4d6051edb487c85276c4309e7793b [term 24 FOLLOWER]: ReportFailDetected: Starting NORMAL_ELECTION...
I0414 03:46:33.535610 944 raft_consensus.cc:803] T 6388fff4f85d4227b6de5f15dd2154f4 P 07c4d6051edb487c85276c4309e7793b [term 24 FOLLOWER]: ReportFailDetected: Starting NORMAL_ELECTION...
I0414 03:46:31.240491 550 raft_consensus.cc:803] T 6388fff4f85d4227b6de5f15dd2154f4 P 07c4d6051edb487c85276c4309e7793b [term 24 FOLLOWER]: ReportFailDetected: Starting NORMAL_ELECTION...
I0414 03:46:38.898849 1712 raft_consensus.cc:803] T 6388fff4f85d4227b6de5f15dd2154f4 P 07c4d6051edb487c85276c4309e7793b [term 24 FOLLOWER]: ReportFailDetected: Starting NORMAL_ELECTION...
I0414 03:46:54.981504 2169 raft_consensus.cc:803] T 6388fff4f85d4227b6de5f15dd2154f4 P 07c4d6051edb487c85276c4309e7793b [term 24 FOLLOWER]: ReportFailDetected: Starting NORMAL_ELECTION...
I0414 03:46:58.250499 2385 raft_consensus.cc:803] T 6388fff4f85d4227b6de5f15dd2154f4 P 07c4d6051edb487c85276c4309e7793b [term 24 FOLLOWER]: ReportFailDetected: Starting NORMAL_ELECTION...
I0414 03:47:01.245615 2902 raft_consensus.cc:803] T 6388fff4f85d4227b6de5f15dd2154f4 P 07c4d6051edb487c85276c4309e7793b [term 24 FOLLOWER]: ReportFailDetected: Starting NORMAL_ELECTION...
I0414 03:47:04.882721 3451 raft_consensus.cc:803] T 6388fff4f85d4227b6de5f15dd2154f4 P 07c4d6051edb487c85276c4309e7793b [term 24 FOLLOWER]: ReportFailDetected: Starting NORMAL_ELECTION...
I0414 03:47:07.174827 3792 raft_consensus.cc:803] T 6388fff4f85d4227b6de5f15dd2154f4 P 07c4d6051edb487c85276c4309e7793b [term 24 FOLLOWER]: ReportFailDetected: Starting NORMAL_ELECTION...
I0414 03:47:11.086655 4335 raft_consensus.cc:803] T 6388fff4f85d4227b6de5f15dd2154f4 P 07c4d6051edb487c85276c4309e7793b [term 24 FOLLOWER]: ReportFailDetected: Starting NORMAL_ELECTION...
I0414 03:47:15.454869 4954 raft_consensus.cc:803] T 6388fff4f85d4227b6de5f15dd2154f4 P 07c4d6051edb487c85276c4309e7793b [term 24 FOLLOWER]: ReportFailDetected: Starting NORMAL_ELECTION...
I0414 03:47:18.050858 5312 raft_consensus.cc:803] T 6388fff4f85d4227b6de5f15dd2154f4 P 07c4d6051edb487c85276c4309e7793b [term 24 FOLLOWER]: ReportFailDetected: Starting NORMAL_ELECTION...
I0414 03:47:20.745961 5557 raft_consensus.cc:803] T 6388fff4f85d4227b6de5f15dd2154f4 P 07c4d6051edb487c85276c4309e7793b [term 24 FOLLOWER]: ReportFailDetected: Starting NORMAL_ELECTION...
I0414 03:47:24.225071 5984 raft_consensus.cc:803] T 6388fff4f85d4227b6de5f15dd2154f4 P 07c4d6051edb487c85276c4309e7793b [term 24 FOLLOWER]: ReportFailDetected: Starting NORMAL_ELECTION...
I0414 03:47:28.647373 6580 raft_consensus.cc:803] T 6388fff4f85d4227b6de5f15dd2154f4 P 07c4d6051edb487c85276c4309e7793b [term 24 FOLLOWER]: ReportFailDetected: Starting NORMAL_ELECTION...
I0414 03:47:37.609266 7764 raft_consensus.cc:803] T 6388fff4f85d4227b6de5f15dd2154f4 P 07c4d6051edb487c85276c4309e7793b [term 24 FOLLOWER]: ReportFailDetected: Starting NORMAL_ELECTION...
I0414 03:47:39.884393 8052 raft_consensus.cc:803] T 6388fff4f85d4227b6de5f15dd2154f4 P 07c4d6051edb487c85276c4309e7793b [term 24 FOLLOWER]: ReportFailDetected: Starting NORMAL_ELECTION...
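
As a rough sketch of what exponential back-off for pre-elections could look like (in Python, not the project's C++): the 3-second base comes from the retry interval observed above, while the doubling factor, the 60-second cap, the 360-second window, and the function names are assumptions made for illustration. This is not the scheme used for normal elections in the code base, and not necessarily what the eventual fix does.

```python
def fixed_delay(failures, base_s=3.0):
    """Current behaviour as observed in the logs: retry about every 3 s."""
    return base_s


def backoff_delay(failures, base_s=3.0, max_s=60.0):
    """Hypothetical back-off: double the wait after each failed attempt.

    failures is the number of failed attempts so far (>= 1).
    The cap max_s is an assumed value, not taken from the source.
    """
    return min(base_s * 2 ** (failures - 1), max_s)


def attempts_in_window(window_s, delay_fn):
    """Count attempts started within window_s seconds if every attempt fails."""
    t, attempts = 0.0, 0
    while t <= window_s:
        attempts += 1
        t += delay_fn(attempts)  # wait before the next attempt
    return attempts


# Compare both schedules over an illustrative 360-second window.
fixed_attempts = attempts_in_window(360, fixed_delay)
backoff_attempts = attempts_in_window(360, backoff_delay)
```

Under these assumed parameters, the fixed schedule starts 121 attempts in the window and the backed-off schedule starts 10, for a single tablet. A production scheme would likely also add random jitter so that tablets do not retry in lockstep.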

@amitanandaiyer
Contributor Author

In this case, over a short period of 6 minutes, from 03:45 to 03:51, the total number of threads started increased from 88k to 120k.

The number of running threads went from ~850 to 11k before the process ran out of resources to create new threads, resulting in a FATAL.
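
For a sense of scale, the rates implied by these figures can be worked out as below. The inputs are the rounded numbers quoted above, so the results are approximate.

```python
# Rounded figures quoted in this comment.
window_s = 6 * 60  # 03:45 to 03:51
started_before, started_after = 88_000, 120_000
running_before, running_after = 850, 11_000

# Threads created during the window, and net growth in live threads.
started_delta = started_after - started_before
running_delta = running_after - running_before

# Average rates over the window, in threads per second.
start_rate = started_delta / window_s
net_growth_rate = running_delta / window_s

print(f"threads started: {started_delta} (~{start_rate:.1f}/s)")
print(f"net running threads: +{running_delta} (~{net_growth_rate:.1f}/s)")
```

That is roughly 32k threads created in the window, of which about 10k were still running at the end.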

@bmatican bmatican added the area/docdb YugabyteDB core features label Apr 20, 2019
@bmatican bmatican added this to To Do in YBase features via automation Apr 20, 2019
@amitanandaiyer
Contributor Author

Addressed in c6f65f8.

YBase features automation moved this from To Do to Done Apr 22, 2019