Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test: vinyl/errinj.test.lua fails under high load #4169

Closed
avtikhon opened this issue Apr 17, 2019 · 0 comments
Closed

test: vinyl/errinj.test.lua fails under high load #4169

avtikhon opened this issue Apr 17, 2019 · 0 comments
Assignees
Labels
flaky test qa Issues related to tests or testing subsystem

Comments

@avtikhon
Copy link
Contributor

avtikhon commented Apr 17, 2019

Tarantool version:
All

OS version:
All

Bug description:
[009] --- vinyl/errinj.result Tue Apr 16 16:43:36 2019
[009] +++ vinyl/errinj.reject Wed Apr 17 09:42:36 2019
[009] @@ -530,7 +530,7 @@
[009] ...
[009] t2 - t1 < 1
[009] ---
[009] -- true
[009] +- false
[009] ...
[009] test_run:cmd("cleanup server test")
[009] ---
[009]

Checked under highload the delay and found that it is longer than checked 1 second and in resl near to 1.5 sec:
[029] --- vinyl/errinj.result Wed Apr 17 10:18:47 2019
[029] +++ vinyl/errinj.reject Wed Apr 17 10:20:18 2019
[029] @@ -528,9 +528,9 @@
[029] t2 = fiber.time()
[029] ---
[029] ...
[029] -t2 - t1 < 1
[029] ----
[029] -- true
[029] +t2 - t1 < 1 or t2 - t1
[029] +---
[029] +- 1.6016597747803
[029] ...
[029] test_run:cmd("cleanup server test")
[029] ---
[029]

in 100 parallel runs the failed delays were:
[002] +- 1.4104716777802
[022] +- 1.3933029174805
[044] +- 1.4296517372131
[033] +- 1.6380662918091
[001] +- 1.9799520969391
[027] +- 1.7067711353302
[043] +- 1.3778221607208
[034] +- 1.3820221424103
[032] +- 1.3820221424103
[020] +- 1.6275615692139
[050] +- 1.6275615692139
[048] +- 1.1880359649658

Suggested fix is to increase the awaiting delay up to 3 or even 5 secs.

Test "check that all dump/compaction tasks that are in progress at
the time when the server stops are aborted immediately.":

s = box.schema.space.create('test', {engine = 'vinyl'})
_ = s:create_index('i1', {parts = {1, 'unsigned'}})
_ = s:create_index('i2', {parts = {2, 'unsigned'}})
box.error.injection.set('ERRINJ_VY_RUN_WRITE_STMT_TIMEOUT', 0.01)
for i = 1, 1000 do s:replace{i, i} end
_ = fiber.create(function() box.snapshot() end)
fiber.sleep(0.01)
test_run:cmd('switch default')
t1 = fiber.time()
test_run:cmd("stop server test")
t2 = fiber.time()
t2 - t1 < 1

Steps to reproduce:
./test-run.py -j 50 --force vinyl/

Optional (but very desirable):

  • coredump
  • backtrace
  • netstat
@avtikhon avtikhon added qa Issues related to tests or testing subsystem flaky test labels Apr 17, 2019
@avtikhon avtikhon self-assigned this Apr 17, 2019
avtikhon added a commit that referenced this issue Apr 17, 2019
Test "check that all dump/compaction tasks that are in progress at
the time when the server stops are aborted immediately.", but in
real the awaiting time of 1 second is not enough due to runs in
parallel and it fails, like:

[009] --- vinyl/errinj.result Tue Apr 16 16:43:36 2019
[009] +++ vinyl/errinj.reject Wed Apr 17 09:42:36 2019
[009] @@ -530,7 +530,7 @@
[009] ...
[009] t2 - t1 < 1
[009] ---
[009] -- true
[009] +- false
[009] ...
[009] test_run:cmd("cleanup server test")
[009] ---
[009]

in 100 parallel runs the failed delays were found:

[002] +- 1.4104716777802
[022] +- 1.3933029174805
[044] +- 1.4296517372131
[033] +- 1.6380662918091
[001] +- 1.9799520969391
[027] +- 1.7067711353302
[043] +- 1.3778221607208
[034] +- 1.3820221424103
[032] +- 1.3820221424103
[020] +- 1.6275615692139
[050] +- 1.6275615692139
[048] +- 1.1880359649658

Desided to increase the awiting time up to 3 seconds.

Close #4169
avtikhon added a commit that referenced this issue Apr 19, 2019
Test "check that all dump/compaction tasks that are in progress at
the time when the server stops are aborted immediately.", but in
real the awaiting time of 1 second is not enough due to runs in
parallel and it fails, like:

[009] --- vinyl/errinj.result Tue Apr 16 16:43:36 2019
[009] +++ vinyl/errinj.reject Wed Apr 17 09:42:36 2019
[009] @@ -530,7 +530,7 @@
[009] ...
[009] t2 - t1 < 1
[009] ---
[009] -- true
[009] +- false
[009] ...
[009] test_run:cmd("cleanup server test")
[009] ---
[009]

in 100 parallel runs the failed delays were found:

[002] +- 1.4104716777802
[022] +- 1.3933029174805
[044] +- 1.4296517372131
[033] +- 1.6380662918091
[001] +- 1.9799520969391
[027] +- 1.7067711353302
[043] +- 1.3778221607208
[034] +- 1.3820221424103
[032] +- 1.3820221424103
[020] +- 1.6275615692139
[050] +- 1.6275615692139
[048] +- 1.1880359649658

Decided to avoid of use the time check at all and change the
ERRINJ_VY_RUN_WRITE_STMT_TIMEOUT to ERRINJ_VY_DUMP_DELAY injection.
In this way the time checks were completely removed.

Next issue met was the error:

vy_quota.c:298 !> SystemError Failed to allocate 2097240 bytes
    in lsregion for vinyl transaction: Cannot allocate memory

That is why the merged 2 subtests were divided into 2 standalone
subtests to be able to set the memory limit of the 2nd subtest to
2097240 value.

Close #4169
@locker locker closed this as completed in 8c3bd9d Apr 29, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
flaky test qa Issues related to tests or testing subsystem
Projects
None yet
Development

No branches or pull requests

1 participant