
All storage instances crashed after 7 hours of pressure test #3373

Closed
kikimo opened this issue Nov 29, 2021 · 6 comments · Fixed by #3553
Labels: type/bug Type: something is unexpected

kikimo commented Nov 29, 2021

Please check the FAQ documentation before raising an issue

Describe the bug (required)

In a Nebula cluster of 3 storaged + 1 graphd + 1 metad, we kept inserting edges and triggering leader changes. After running for about 7 hours, all storage instances crashed at almost the same time, and all of them have a similar crash stack:

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
--Type <RET> for more, q to quit, c to continue without paging--c
Core was generated by `/data/src/wwl/nebula/build/bin/nebula-storaged --flagfile /data/src/wwl/test/et'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000000002fb8260 in folly::ThreadLocalPtr<folly::SingletonThreadLocal<nebula::meta::MetaClient::ThreadLocalInfo, folly::detail::DefaultTag, folly::detail::DefaultMake<nebula::meta::MetaClient::ThreadLocalInfo>, void>::Wrapper, void, void>::get (this=0x7fbd204c6ee0) at /data/src/wwl/nebula/build/third-party/install/include/folly/ThreadLocal.h:153
153	    return static_cast<T*>(w.ptr);
[Current thread is 1 (Thread 0x7fbc115ff700 (LWP 3293767))]
(gdb) bt
#0  0x0000000002fb8260 in folly::ThreadLocalPtr<folly::SingletonThreadLocal<nebula::meta::MetaClient::ThreadLocalInfo, folly::detail::DefaultTag, folly::detail::DefaultMake<nebula::meta::MetaClient::ThreadLocalInfo>, void>::Wrapper, void, void>::get (this=0x7fbd204c6ee0)
    at /data/src/wwl/nebula/build/third-party/install/include/folly/ThreadLocal.h:153
#1  0x0000000002f2ffe4 in folly::ThreadLocal<folly::SingletonThreadLocal<nebula::meta::MetaClient::ThreadLocalInfo, folly::detail::DefaultTag, folly::detail::DefaultMake<nebula::meta::MetaClient::ThreadLocalInfo>, void>::Wrapper, void, void>::get (this=0x7fbd204c6ee0)
    at /data/src/wwl/nebula/build/third-party/install/include/folly/ThreadLocal.h:69
#2  folly::ThreadLocal<folly::SingletonThreadLocal<nebula::meta::MetaClient::ThreadLocalInfo, folly::detail::DefaultTag, folly::detail::DefaultMake<nebula::meta::MetaClient::ThreadLocalInfo>, void>::Wrapper, void, void>::operator* (this=0x7fbd204c6ee0)
    at /data/src/wwl/nebula/build/third-party/install/include/folly/ThreadLocal.h:78
#3  0x0000000002eed374 in folly::SingletonThreadLocal<nebula::meta::MetaClient::ThreadLocalInfo, folly::detail::DefaultTag, folly::detail::DefaultMake<nebula::meta::MetaClient::ThreadLocalInfo>, void>::getWrapper ()
    at /data/src/wwl/nebula/build/third-party/install/include/folly/SingletonThreadLocal.h:147
#4  0x0000000002eed388 in folly::SingletonThreadLocal<nebula::meta::MetaClient::ThreadLocalInfo, folly::detail::DefaultTag, folly::detail::DefaultMake<nebula::meta::MetaClient::ThreadLocalInfo>, void>::LocalLifetime::~LocalLifetime (this=0x7fbc115fee00, __in_chrg=<optimized out>)
    at /data/src/wwl/nebula/build/third-party/install/include/folly/SingletonThreadLocal.h:119
#5  0x0000000004b4c906 in (anonymous namespace)::run(void*) ()
#6  0x00007fbd20bd8ca2 in __nptl_deallocate_tsd () from /lib64/libpthread.so.0
#7  0x00007fbd20bd8eb3 in start_thread () from /lib64/libpthread.so.0
#8  0x00007fbd209019fd in clone () from /lib64/libc.so.6
(gdb)

Your Environments (required)

  • OS: CentOS Linux release 7.9.2009 (Core), kernel 5.4.151-1.el7.elrepo.x86_64
  • Compiler: g++ --version or clang++ --version
  • CPU: lscpu
  • Commit id: c6d1046

How To Reproduce (required)

Steps to reproduce the behavior:

  1. Step 1
  2. Step 2
  3. Step 3

Expected behavior

A clear and concise description of what you expected to happen.

Additional context

Provide logs and configs, or any other context to trace the problem.

@kikimo kikimo added the type/bug Type: something is unexpected label Nov 29, 2021
@cangfengzhs (Contributor)

Did graph and meta crash?


kikimo commented Nov 29, 2021

Did graph and meta crash?

No, they are running well.

@cangfengzhs (Contributor)

There are thread-unsafe operations (see below), but I don't think they are the cause of the crash:

threadLocalInfo.localCache_[spaceId] = infoDeepCopy; // infoDeepCopy is a shared_ptr


critical27 commented Nov 29, 2021

Perhaps related to #3192; there is a hidden bug in MetaClient. Frequent leader changes cause the meta version to be updated, and as a consequence the meta client pulls data from the meta server.

@cangfengzhs (Contributor)

I found a similar bug in the folly repo: facebook/folly#1252.

@cangfengzhs (Contributor)

The current guess is that a NULL pointer returned under OOM is not checked, so memory near address 0x00 gets modified; the program then crashes when the destructor is called at shutdown.
