Assertion `replica->applier_sync_state == APPLIER_CONNECTED' failed. #3510

Closed
rosik opened this issue Jul 6, 2018 · 7 comments
Labels: bug, crash, replication

Comments

rosik (Contributor) commented Jul 6, 2018

I've seen two errors today:

tarantool: /opt/tntsrc/src/box/replication.cc:234: void replica_on_applier_sync(replica*): Assertion `replica->applier_sync_state == APPLIER_CONNECTED' failed.
Aborted
tarantool: /opt/tntsrc/src/box/applier.cc:748: void applier_pause(applier*): Assertion `fiber() == applier->reader' failed.                                                           
Aborted

Tarantool version is 1.10.1-135-g193ef4150

It does not reproduce every time, and I can't give a minimal reproducible example, but
my workflow is joining additional replicas (a 3rd or 4th one) to a vshard storage replica set.

I also attach logs from all instances:
logs.zip
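
For reference, joining an extra replica like this boils down to pointing the new instance's box.cfg.replication at the existing members and re-running box.cfg on the existing members with the newcomer's URI added (vshard drives the same reconfiguration through its own cfg call). A minimal sketch, with made-up URIs rather than the ones from the attached logs:

-- On the new (3rd) instance: bootstrap it from the current members.
-- URIs below are hypothetical placeholders.
box.cfg{
    listen = 'replica3:3301',
    replication = {'replica1:3301', 'replica2:3301'},
}

-- On the existing members: add the newcomer to the replication sources.
box.cfg{
    replication = {'replica1:3301', 'replica2:3301', 'replica3:3301'},
}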

locker (Member) commented Jul 17, 2018

Related crash:

src/box/replication.cc:342: void replica_on_applier_disconnect(replica*): Assertion `0' failed.

locker added the bug and replication labels and removed the iproto label on Jul 17, 2018
locker (Member) commented Jul 17, 2018

The issue can be reproduced with the following script:

-- Both instances listen on a local port (44441 or 44442, depending on
-- the argument) and use each other as replication sources.
box.cfg{
    log_level = 4,
    listen = 44440 + arg[1],
    replication = {44441, 44442}
}
if box.info.id == 1 then
    box.schema.user.grant('guest', 'replication')
    -- Kick instance 2 out of the replica set right after bootstrap.
    box.space._cluster:delete(2)
end
os.exit(0)

Run it from two terminals:

  • Terminal 1: tarantool reproduce.lua 1
  • Terminal 2: tarantool reproduce.lua 2

One of the instances will crash:

2018-07-17 19:37:57.914 [19674] main/101/test.lua C> Tarantool 1.9.1-52-g1f187cacaabb
2018-07-17 19:37:57.914 [19674] main/101/test.lua C> log level 4
2018-07-17 19:37:57.926 [19674] main/105/applier/ xrow.c:792 E> ER_LOADING: Instance bootstrap hasn't finished yet
2018-07-17 19:37:58.928 [19674] main/105/applier/ coio.cc:104 !> SystemError connect, called on fd 12, aka [::1]:33652: Connection refused
tarantool: /home/vlad/src/tarantool-1.9/src/box/replication.cc:334: void replica_on_applier_disconnect(replica*): Assertion `0' failed.
Aborted (core dumped)

sergepetrenko self-assigned this on Jul 18, 2018
sergepetrenko (Collaborator) commented

@rosik Hi

tarantool: /opt/tntsrc/src/box/applier.cc:748: void applier_pause(applier*): Assertion `fiber() == applier->reader' failed.                                                           
Aborted

Have you seen this assertion fail lately? I tried launching the vshard example from the docs multiple times, but couldn't reproduce such an error.
Could you give me some more recent examples of this crash?

sergepetrenko added a commit that referenced this issue Aug 2, 2018
One possible case when two applier errors happen one after another
wasn't handled in replica_on_applier_disconnect(), which led to
occasional test failures and crashes. Handle this case.

Part of #3510
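
The commit message above describes the fix as handling the case where a second applier error arrives while the replica is already accounted as disconnected. A standalone Lua sketch of that general idea (illustrative only, not Tarantool source; all names below are invented):

-- A disconnect handler that tolerates a repeated disconnect event
-- instead of asserting; before such a fix, the second call would hit
-- an assertion-failure branch.
local replicaset = { connected = 2, disconnected = 0 }

local function on_applier_disconnect(replica)
    if replica.sync_state == 'disconnected' then
        -- Second error in a row: already accounted for, nothing to do.
        return
    end
    assert(replica.sync_state == 'connected')
    replicaset.connected = replicaset.connected - 1
    replicaset.disconnected = replicaset.disconnected + 1
    replica.sync_state = 'disconnected'
end

-- Two errors in a row on the same replica no longer trip an assertion.
local r = { sync_state = 'connected' }
on_applier_disconnect(r)
on_applier_disconnect(r)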
sergepetrenko added a commit that referenced this issue Aug 6, 2018
One possible case when two applier errors happen one after another
wasn't handled in replica_on_applier_disconnect(), which led to
occasional test failures and crashes. Handle this case and add a
regression test.

Part of #3510
locker (Member) commented Aug 7, 2018

This crash

tarantool: /opt/tntsrc/src/box/applier.cc:748: void applier_pause(applier*): Assertion `fiber() == applier->reader' failed.

was fixed in the scope of #3606.

locker (Member) commented Aug 8, 2018

The other assertion

tarantool: /opt/tntsrc/src/box/replication.cc:234: void replica_on_applier_sync(replica*): Assertion `replica->applier_sync_state == APPLIER_CONNECTED' failed.

was caught here: #3610 (comment)

So I guess we can close this issue as soon as the pending patch gets committed.

sergepetrenko (Collaborator) commented

Tests on the latest commit hang due to a bug in test-run: tarantool/test-run#109

locker pushed a commit that referenced this issue Aug 17, 2018
One possible case when two applier errors happen one after another
wasn't handled in replica_on_applier_disconnect(), which led to
occasional test failures and crashes. Handle this case and add a
regression test.

Part of #3510
locker (Member) commented Aug 17, 2018

Fixed by c939ca8

locker closed this as completed on Aug 17, 2018
kyukhin added the tmp label on Oct 9, 2018
locker removed the tmp label on Nov 28, 2018
avtikhon added a commit that referenced this issue Sep 6, 2020
To fix flaky issues of replication/misc.test.lua, the test had to be
divided into smaller tests so that the flaky results could be localized:

  gh-2991-misc-assert-on-server-die.test.lua
  gh-3111-misc-rebootstrap-from-ro-master.test.lua
  gh-3160-misc-heartbeats-on-master-changes.test.lua
  gh-3247-misc-value-not-replicated-on-iproto-request.test.lua
  gh-3510-misc-assert-replica-on-applier-disconnect.test.lua
  gh-3606-misc-crash-on-box-concurrent-update.test.lua
  gh-3610-misc-assert-connecting-master-twice.test.lua
  gh-3637-misc-no-panic-on-connected.test.lua
  gh-3642-misc-no-socket-leak-on-replica-disconnect.test.lua
  gh-3704-misc-replica-checks-cluster-id.test.lua
  gh-3711-misc-no-restart-on-same-configuration.test.lua
  gh-3760-misc-return-on-quorum-0.test.lua
  gh-4399-misc-no-failure-on-error-reading-wal.test.lua
  gh-4424-misc-orphan-on-reconfiguration-error.test.lua

Needed for #4940
kyukhin pushed a commit that referenced this issue Sep 8, 2020
To fix flaky issues of replication/misc.test.lua, the test had to be
divided into smaller tests so that the flaky results could be localized:

  gh-2991-misc-asserts-on-update.test.lua
  gh-3111-misc-rebootstrap-from-ro-master.test.lua
  gh-3160-misc-heartbeats-on-master-changes.test.lua
  gh-3247-misc-iproto-sequence-value-not-replicated.test.lua
  gh-3510-misc-assert-replica-on-applier-disconnect.test.lua
  gh-3606-misc-crash-on-box-concurrent-update.test.lua
  gh-3610-misc-assert-connecting-master-twice.test.lua
  gh-3637-misc-error-on-replica-auth-fail.test.lua
  gh-3642-misc-no-socket-leak-on-replica-disconnect.test.lua
  gh-3704-misc-replica-checks-cluster-id.test.lua
  gh-3711-misc-no-restart-on-same-configuration.test.lua
  gh-3760-misc-return-on-quorum-0.test.lua
  gh-4399-misc-no-failure-on-error-reading-wal.test.lua
  gh-4424-misc-orphan-on-reconfiguration-error.test.lua

Needed for #4940

(cherry picked from commit 867e6b3)