
Conversation


@olegrok olegrok commented Oct 30, 2021

Before this patch it was assumed that no events could occur between destroying
the event queues and closing the producer/consumer. However, that is not the
case, and the tests caught it:

```
Process 46864 stopped
* thread #4, name = 'coio', stop reason = EXC_BAD_ACCESS (code=EXC_I386_GPFLT)
    frame #0: 0x00007fff203fed86 libsystem_pthread.dylib`pthread_mutex_lock + 4
libsystem_pthread.dylib`pthread_mutex_lock:
->  0x7fff203fed86 <+4>:  cmpq   $0x4d55545a, (%rdi)       ; imm = 0x4D55545A
    0x7fff203fed8d <+11>: jne    0x7fff203fee02            ; <+128>
    0x7fff203fed8f <+13>: movl   0xc(%rdi), %eax
    0x7fff203fed92 <+16>: movl   %eax, %ecx
Target 0: (tarantool) stopped.
(lldb) bt
* thread #4, name = 'coio', stop reason = EXC_BAD_ACCESS (code=EXC_I386_GPFLT)
  * frame #0: 0x00007fff203fed86 libsystem_pthread.dylib`pthread_mutex_lock + 4
    frame #1: 0x0000000102fdbba5 tntkafka.dylib`queue_push(queue=0x6d726554203a5d70, value=0x0000000102c1e0c0) at queue.c:92:5 [opt]
    frame #2: 0x0000000102fd7f9a tntkafka.dylib`log_callback(rd_kafka=<unavailable>, level=7, fac="DESTROY", buf=<unavailable>) at callbacks.c:56:17 [opt]
    frame #3: 0x000000011e5456f3 librdkafka.1.dylib`rd_kafka_log0 + 563
    frame #4: 0x000000011e5469a0 librdkafka.1.dylib`rd_kafka_destroy_app + 832
    frame #5: 0x0000000102fdb3be tntkafka.dylib`wait_producer_destroy(args=<unavailable>) at producer.c:414:5 [opt]
    frame #6: 0x000000010015d81e tarantool`coio_on_call + 23
    frame #7: 0x00000001003019a7 tarantool`etp_proc + 395
    frame #8: 0x00007fff204038fc libsystem_pthread.dylib`_pthread_start + 224
    frame #9: 0x00007fff203ff443 libsystem_pthread.dylib`thread_start + 15
```

To fix this issue, let's reorder the code a bit and start closing the
producer/consumer before destroying the event queues. This patch seems to
solve the issue. I'd also like to highlight that not all destructors should
be placed after wait_producer_destroy/wait_consumer_destroy: it seems the
topics they own must be destroyed before that, otherwise it leads to an
assertion failure.
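
Roughly, the destruction order this amounts to is sketched below. Only
wait_producer_destroy, queue_push and log_callback come from the trace above;
producer_t, queue_t and the helper bodies are simplified, hypothetical
stand-ins, not the module's real definitions:

```c
/*
 * Illustrative sketch of the destruction order described above.
 * producer_t, queue_t and the helper bodies are simplified stand-ins.
 */
#include <stdlib.h>
#include <pthread.h>

typedef struct {
    pthread_mutex_t lock;   /* guards the list of pending events */
    void *head;             /* simplified: the real queue keeps event nodes here */
} queue_t;

typedef struct {
    void *rd_kafka;         /* underlying librdkafka handle (rd_kafka_t *) */
    void *topics;           /* topics owned by this producer */
    queue_t *event_queue;   /* queue that callbacks (e.g. log_callback) push into */
} producer_t;

/* Simplified stand-ins for the real helpers; bodies omitted. */
static void destroy_topics(void *topics) { (void)topics; }
static void wait_producer_destroy(void *rd_kafka) { (void)rd_kafka; }

static void
queue_destroy(queue_t *queue)
{
    pthread_mutex_destroy(&queue->lock);
    free(queue);
}

static void
producer_destroy(producer_t *producer)
{
    /*
     * 1. Topics owned by the producer go first, otherwise destroying
     *    the rd_kafka handle trips an assertion.
     */
    destroy_topics(producer->topics);

    /*
     * 2. Close the underlying handle while the event queues are still
     *    alive: rd_kafka destruction may still emit events (e.g. the
     *    "DESTROY" log line in the trace) that callbacks push into
     *    the queues.
     */
    wait_producer_destroy(producer->rd_kafka);

    /*
     * 3. Only now is it safe to destroy the event queues: no callback
     *    can fire anymore, so queue_push() cannot touch freed memory.
     */
    queue_destroy(producer->event_queue);

    free(producer);
}
```

The same reasoning applies to the consumer path with wait_consumer_destroy.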

Also, a few small code changes were made to bring the code in line with the
Tarantool code style and to make consumer_destroy and producer_destroy more
consistent in terms of return values.

Closes #44

@olegrok olegrok requested a review from Totktonada October 30, 2021 00:36
@Totktonada

I'm not really in context here. I understand the logic of the change (and it looks correct), but I'm actually unable to verify it. The project is under @sharonovd's responsibility; I would ask him to assign a reviewer.

If I were to review it, I would look at the following points:

  • How the lifetimes of Lua objects and C structures are linked: reference counting + __gc, or immediate freeing plus a flag to decline further requests?
  • Whether it is possible to somehow start two destruction calls in parallel?
  • Why the problem was not caught before: were there no attempts to use objects that are in the process of destruction (i.e., an event received after coio_call())?
  • Whether it is possible to write a test with a good chance of reproducing this problem and similar ones: say, create → run workload → destroy → stop workload, in a loop with many iterations; a kind of stress test for the destruction process. replication/box_set_replication_stress.test.lua in tarantool, which is written this way, caught three problems AFAIR.

@Totktonada Totktonada removed their request for review October 30, 2021 18:40
@olegrok olegrok requested a review from amitichev November 1, 2021 17:00
@filonenko-mikhail filonenko-mikhail merged commit 49d5811 into master Nov 9, 2021
@filonenko-mikhail filonenko-mikhail deleted the 44-fix-use-after-free branch November 9, 2021 08:28
