Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql/dll.test.lua fails sporadically #4427

Open
Korablev77 opened this issue Aug 14, 2019 · 6 comments

Comments

@Korablev77
Copy link
Collaborator

commented Aug 14, 2019

Tarantool version:
master branch; default debug build (cmake .; make -j)
OS version:
macOS Mojave
Bug description:

[004] sql/savepoints.test.lua                         vinyl           [ pass ]
[007] sql/misc.test.lua                               memtx           [ pass ]
[007] sql/no-pk-space.test.lua                        vinyl           [ pass ]
[006] sql/row-count.test.lua                          vinyl           [ pass ]
[006] sql/delete.test.lua                             memtx           [ pass ]
[002] sql/vinyl-opts.test.lua                         vinyl           [ pass ]
[002] sql/select-null.test.lua                        memtx           [ pass ]
[003] sql/ddl.test.lua                                memtx           
[003] 
[003] [Instance "app" killed by signal: 6 (SIGABRT)]
[003] 
[003] Last 15 lines of Tarantool Log file [Instance "app"][/Users/n.pettik/tarantool/test/var/003_sql/app.log]:
[003] 2019-08-14 17:42:50.542 [55986] main/102/app I> assigned id 1 to replica 6803180e-3619-477a-8ed9-4fd412338d7d
[003] 2019-08-14 17:42:50.542 [55986] main/102/app I> cluster uuid 7d95969f-104c-4843-a3b2-56442dcb167a
[003] 2019-08-14 17:42:50.545 [55986] snapshot/101/main I> saving snapshot `/Users/n.pettik/tarantool/test/var/003_sql/app/00000000000000000000.snap.inprogress'
[003] 2019-08-14 17:42:50.546 [55986] snapshot/101/main I> done
[003] 2019-08-14 17:42:50.547 [55986] main/102/app I> ready to accept requests
[003] 2019-08-14 17:42:50.548 [55986] main/105/checkpoint_daemon I> scheduled next checkpoint for Wed Aug 14 19:07:47 2019
[003] 2019-08-14 17:42:50.548 [55986] main/102/app I> set 'listen' configuration option to "\/Users\/n.pettik\/tarantool\/test\/var\/003_sql\/app.socket-iproto"
[003] 2019-08-14 17:42:50.548 [55986] main/102/app I> set 'vinyl_max_tuple_size' configuration option to 5242880
[003] 2019-08-14 17:42:50.548 [55986] main/102/app I> set 'memtx_max_tuple_size' configuration option to 5242880
[003] 2019-08-14 17:42:50.548 [55986] main/116/console/unix/:/Users/n.pettik/ta I> started
[003] 2019-08-14 17:42:50.549 [55986] main C> entering the event loop
[003] 2019-08-14 17:42:50.852 [55986] main/105/checkpoint_daemon I> scheduled next checkpoint for Wed Aug 14 18:42:50 2019
[003] 2019-08-14 17:42:50.853 [55986] snapshot/101/main I> saving snapshot `/Users/n.pettik/tarantool/test/var/003_sql/app/00000000000000000020.snap.inprogress'
[003] 2019-08-14 17:42:50.853 [55986] snapshot/101/main I> done
[003] Assertion failed: (itype2irt(tv) == ((IRType)(((tr)>>24) & IRT_TYPE))), function rec_check_slots, file lj_record.c, line 137.

Backtrace is:

  frame #4: 0x0000000106bb7276 tarantool`rec_check_slots(J=0x0000000109215698) at lj_record.c:137
    frame #5: 0x0000000106bb317c tarantool`lj_record_ins(J=0x0000000109215698) at lj_record.c:2040
    frame #6: 0x0000000106bf30b8 tarantool`trace_state(L=0x0000000109215378, dummy=0x0000000000000000, ud=0x0000000109215698) at lj_trace.c:664
    frame #7: 0x0000000106b4d7c1 tarantool`lj_vm_cpcall + 77
    frame #8: 0x0000000106bf2c05 tarantool`lj_trace_ins(J=0x0000000109215698, pc=0x0000000109360940) at lj_trace.c:723
    frame #9: 0x0000000106b6885c tarantool`lj_dispatch_call(L=0x0000000109215378, pc=0x0000000109360944) at lj_dispatch.c:493
    frame #10: 0x0000000106b4f1e7 tarantool`lj_vm_hotcall + 38
    frame #11: 0x0000000106b7b807 tarantool`lua_call(L=0x0000000109215378, nargs=1, nresults=0) at lj_api.c:1121
    frame #12: 0x0000000106ace915 tarantool`lbox_fillspace(L=0x0000000109215378, space=0x00007f7ee5422110, i=3) at space.cc:445
    frame #13: 0x0000000106acddba tarantool`box_lua_space_new(L=0x0000000109215378, space=0x00007f7ee5422110) at space.cc:480
    frame #14: 0x0000000106acdbfd tarantool`box_lua_space_new_or_delete(trigger=0x0000000106f20c48, event=0x00007f7ee5422110) at space.cc:510
    frame #15: 0x0000000106b2ed67 tarantool`::trigger_run(list=0x0000000106f20750, event=0x00007f7ee5422110) at trigger.cc:41
    frame #16: 0x0000000106a76bc2 tarantool`space_cache_replace(old_space=0x0000000000000000, new_space=0x00007f7ee5422110) at schema.cc:252
    frame #17: 0x0000000106a6375c tarantool`on_drop_space_rollback(trigger=0x000000012d703a58, event=0x000000010ac67038) at alter.cc:1627
    frame #18: 0x0000000106b2ed67 tarantool`::trigger_run(list=0x000000012d703990, event=0x000000010ac67038) at trigger.cc:41
    frame #19: 0x0000000106a81fed tarantool`txn_run_rollback_triggers(txn=0x000000010ac67038, triggers=0x000000012d703990) at txn.c:402
    frame #20: 0x0000000106a81f7b tarantool`txn_rollback_one_stmt(txn=0x000000010ac67038, stmt=0x000000012d703948) at txn.c:137
    frame #21: 0x0000000106a80351 tarantool`txn_rollback_to_svp(txn=0x000000010ac67038, svp=0x000000010ac67400) at txn.c:148
  * frame #22: 0x0000000106a816c3 tarantool`box_txn_rollback_to_savepoint(svp=0x000000010ac67640) at txn.c:806
    frame #23: 0x0000000106b4f7ed tarantool`lj_vm_ffi_call + 132
    frame #24: 0x0000000106c03e53 tarantool`lj_ccall_func(L=0x000000010a1984c0, cd=0x000000010a2f3ac8) at lj_ccall.c:1150
    frame #25: 0x0000000106c35460 tarantool`lj_cf_ffi_meta___call(L=0x000000010a1984c0) at lib_ffi.c:230
    frame #26: 0x0000000106b4d3bd tarantool`lj_BC_FUNCC + 68
    frame #27: 0x0000000106b7bb90 tarantool`lua_pcall(L=0x000000010a1984c0, nargs=3, nresults=-1, errfunc=0) at lj_api.c:1139
    frame #28: 0x0000000106af1343 tarantool`luaT_call(L=0x000000010a1984c0, nargs=3, nreturns=-1) at utils.c:1008
    frame #29: 0x0000000106ae9bfc tarantool`lua_fiber_run_f(ap=0x0000000109400378) at fiber.c:366
    frame #30: 0x000000010698d15a tarantool`fiber_cxx_invoke(f=(tarantool`lua_fiber_run_f at fiber.c:360), ap=0x0000000109400378)(__va_list_tag*), __va_list_tag*) at fiber.h:672
    frame #31: 0x0000000106b0d01b tarantool`fiber_loop(data=0x0000000000000000) at fiber.c:694
    frame #32: 0x0000000106d71db7 tarantool`coro_init at coro.c:110

This looks quite similar to this issue (same backtrace, same assertion): #1099

@sergos

This comment has been minimized.

Copy link

commented Aug 15, 2019

It is a GC64 at first sight: MacOS has it turned ON by default.

@sergos

This comment has been minimized.

Copy link

commented Aug 15, 2019

After turning it off with -DLUAJIT_ENABLE_GC64=OFF I got the same failure.

It took ~1min of running $ time while (($? == 0)); do ./test-run.py sql/ -j 4; done

So the problem is common 64 and 32 GC machinery.

@sergos sergos self-assigned this Aug 15, 2019
@sergos

This comment has been minimized.

Copy link

commented Aug 15, 2019

single test has the same result
time while (($? == 0)); do ./test-run.py $(yes sql/ddl.test.lua | head -100) -j 4; done

@kyukhin kyukhin added this to the 2.2.2 milestone Aug 23, 2019
@mraleph

This comment has been minimized.

Copy link

commented Aug 26, 2019

I have taken a quick look at this based on @sergos request.

You are doing something really not supported by LuaJIT. You mix Lua API and FFI and you create sandwich stacks like this. Lua -FFI-> C -API-> Lua - this sort of re-entrancy is explicitly not supported by LuaJIT compiler machinery.

In this particular case you call box_txn_rollback_to_savepoint which eventually calls to lbox_fillspace which then enters Lua again.

In general JIT recorder wants to assume that if you exited Lua world through FFI you are not re-entering Lua world again. That's because JIT generated code in general expects to be leaf of the execution. There is special machinery in that tries to detect situations like Lua -FFI-> C -FFI-> Lua and blacklists functions that are causing it from compilation. Note the word "tries": only interpreter and recorder check for such situations. If you write the code which does not always re-enter then you might end up in a situation where your code just crashes once it is JIT compiled.

There is no such check on re-entering through the API - like what you are doing here. So violation of the assumption simply crashes the recorder.

As a workaround I suggest blacklisting box_txn_rollback_to_savepoint with jit.off().

As a real fix I guess LuaJIT should be patched to detect this sort of re-entrancy through the API - at least when asserts are on.

@sergos

This comment has been minimized.

Copy link

commented Aug 26, 2019

@mraleph, could it be also fixed with clearing appropriate code cache for box_txn_rollback_to_savepoint at the API re-entrancy instead?

@mraleph

This comment has been minimized.

Copy link

commented Aug 26, 2019

Yes, I think you can sprinkle the relevant parts from lj_ccallback_enter and lj_ccallback_leave in the lua_* methods. It would be as reliable as FFI re-entrancy detection (i.e. if you are re-entering non-deterministically things would still go south if you accidentally re-enter from the already compiled trace which was not aborted). Relevant parts are:

  if (tvref(g->jit_base)) {
    setstrV(L, L->top++, lj_err_str(L, LJ_ERR_FFI_BADCBACK));
    if (g->panic) g->panic(L);
    exit(EXIT_FAILURE);
  }
  lj_trace_abort(g);  /* Never record across callback. */
  cts->cb.slot = 0;  /* Blacklist C function that called the callback. */
sergos pushed a commit that referenced this issue Sep 2, 2019
fixes: #4427

follow-up: need to remove savepoints from FFI completely
@igormunkin igormunkin self-assigned this Oct 7, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants
You can’t perform that action at this time.