Extra parts found after performing balance operation #3657

Closed
kikimo opened this issue Jan 7, 2022 · 1 comment · Fixed by #3668
Labels
type/bug Type: something is unexpected

Comments

kikimo (Contributor) commented Jan 7, 2022

Describe the bug (required)

Extra parts were found after performing a balance operation. We have a space with 512 parts * 3 replicas = 1536 parts:

(root@nebula) [ttos_3p3r]> show hosts;
+----------+------+-----------+--------------+----------------------+------------------------+
| Host     | Port | Status    | Leader count | Leader distribution  | Partition distribution |
+----------+------+-----------+--------------+----------------------+------------------------+
| "store1" | 9779 | "OFFLINE" | 0            | "No valid partition" | "ttos_3p3r:206"        |
| "store2" | 9779 | "ONLINE"  | 24           | "ttos_3p3r:24"       | "ttos_3p3r:340"        |
| "store3" | 9779 | "ONLINE"  | 303          | "ttos_3p3r:303"      | "ttos_3p3r:319"        |
| "store4" | 9779 | "ONLINE"  | 99           | "ttos_3p3r:99"       | "ttos_3p3r:329"        |
| "store5" | 9779 | "ONLINE"  | 86           | "ttos_3p3r:86"       | "ttos_3p3r:342"        |
| "Total"  |      |           | 512          | "ttos_3p3r:512"      | "ttos_3p3r:1536"       |
+----------+------+-----------+--------------+----------------------+------------------------+
Got 6 rows (time spent 7345/7948 us)

Thu, 06 Jan 2022 15:33:25 CST

After some balance operations, we found that the 1536 parts grew to 1537 and then 1538:

(root@nebula) [ttos_3p3r]> show hosts;
+----------+------+-----------+--------------+----------------------+------------------------+
| Host     | Port | Status    | Leader count | Leader distribution  | Partition distribution |
+----------+------+-----------+--------------+----------------------+------------------------+
| "store1" | 9779 | "OFFLINE" | 0            | "No valid partition" | "ttos_3p3r:175"        |
| "store2" | 9779 | "ONLINE"  | 81           | "ttos_3p3r:81"       | "ttos_3p3r:342"        |
| "store3" | 9779 | "ONLINE"  | 31           | "ttos_3p3r:31"       | "ttos_3p3r:330"        |
| "store4" | 9779 | "ONLINE"  | 307          | "ttos_3p3r:307"      | "ttos_3p3r:336"        |
| "store5" | 9779 | "ONLINE"  | 93           | "ttos_3p3r:93"       | "ttos_3p3r:354"        |
| "Total"  |      |           | 512          | "ttos_3p3r:512"      | "ttos_3p3r:1537"       |
+----------+------+-----------+--------------+----------------------+------------------------+
Got 6 rows (time spent 6975/7626 us)

Thu, 06 Jan 2022 16:24:45 CST
(root@nebula) [ttos_3p3r]> show hosts;
+----------+------+-----------+--------------+----------------------+------------------------+
| Host     | Port | Status    | Leader count | Leader distribution  | Partition distribution |
+----------+------+-----------+--------------+----------------------+------------------------+
| "store1" | 9779 | "OFFLINE" | 0            | "No valid partition" | "ttos_3p3r:153"        |
| "store2" | 9779 | "ONLINE"  | 114          | "ttos_3p3r:114"      | "ttos_3p3r:349"        |
| "store3" | 9779 | "ONLINE"  | 3            | "ttos_3p3r:3"        | "ttos_3p3r:335"        |
| "store4" | 9779 | "ONLINE"  | 275          | "ttos_3p3r:275"      | "ttos_3p3r:342"        |
| "store5" | 9779 | "ONLINE"  | 120          | "ttos_3p3r:120"      | "ttos_3p3r:359"        |
| "Total"  |      |           | 512          | "ttos_3p3r:512"      | "ttos_3p3r:1538"       |
+----------+------+-----------+--------------+----------------------+------------------------+
Got 6 rows (time spent 7809/8397 us)

Thu, 06 Jan 2022 16:39:33 CST

Later we also found that two storage hosts were offline:

(root@nebula) [ttos_3p3r]> show hosts;
+----------+------+-----------+--------------+----------------------+------------------------+
| Host     | Port | Status    | Leader count | Leader distribution  | Partition distribution |
+----------+------+-----------+--------------+----------------------+------------------------+
| "store1" | 9779 | "OFFLINE" | 0            | "No valid partition" | "ttos_3p3r:120"        |
| "store2" | 9779 | "ONLINE"  | 24           | "ttos_3p3r:24"       | "ttos_3p3r:361"        |
| "store3" | 9779 | "ONLINE"  | 321          | "ttos_3p3r:321"      | "ttos_3p3r:347"        |
| "store4" | 9779 | "OFFLINE" | 0            | "No valid partition" | "ttos_3p3r:354"        |
| "store5" | 9779 | "ONLINE"  | 128          | "ttos_3p3r:128"      | "ttos_3p3r:356"        |
| "Total"  |      |           | 473          | "ttos_3p3r:473"      | "ttos_3p3r:1538"       |
+----------+------+-----------+--------------+----------------------+------------------------+
Got 6 rows (time spent 11798/12566 us)

Thu, 06 Jan 2022 16:58:36 CST

We also found that the two offline storaged processes were still running with very high CPU usage. store1 CPU usage:

top - 11:31:08 up 8 days, 23:16,  1 user,  load average: 112.18, 108.17, 107.48
Tasks:   8 total,   2 running,   6 sleeping,   0 stopped,   0 zombie
%Cpu(s): 58.9 us, 15.0 sy,  0.0 ni, 23.9 id,  0.0 wa,  0.0 hi,  2.2 si,  0.0 st
MiB Mem : 257583.7 total,  88205.3 free,  76023.6 used,  93354.8 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used. 179741.8 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
     14 root      20   0 9514504 109448  36152 R  3801   0.0  48245:06 nebula-storaged
      1 root      20   0    1104      4      0 S   0.0   0.0   0:01.07 docker-init
      7 root      20   0    3976   2960   2716 S   0.0   0.0   0:00.00 run-storage.sh
     12 root      20   0   12176   4328   3408 S   0.0   0.0   0:00.00 sshd
     15 root      20   0    2508    520    460 S   0.0   0.0   0:00.00 sleep
   5602 root      20   0   13580   8868   7452 S   0.0   0.0   0:00.01 sshd
   5613 root      20   0    4240   3444   2976 S   0.0   0.0   0:00.00 bash
   5618 root      20   0    6180   3296   2752 R   0.0   0.0   0:00.00 top

store2 CPU usage:

top - 11:33:10 up 8 days, 23:18,  1 user,  load average: 108.27, 107.95, 107.48
Tasks:   8 total,   2 running,   6 sleeping,   0 stopped,   0 zombie
%Cpu(s): 61.4 us, 16.6 sy,  0.0 ni, 19.5 id,  0.0 wa,  0.0 hi,  2.5 si,  0.0 st
MiB Mem : 257583.7 total,  88130.5 free,  76069.1 used,  93384.1 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used. 179696.1 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
     14 root      20   0 9795.5m 596444  36496 R  3586   0.2  39474:19 nebula-storaged
      1 root      20   0    1104      4      0 S   0.0   0.0   0:01.08 docker-init
      7 root      20   0    3976   2996   2748 S   0.0   0.0   0:00.00 run-storage.sh
     12 root      20   0   12176   4184   3264 S   0.0   0.0   0:00.00 sshd
     15 root      20   0    2508    588    524 S   0.0   0.0   0:00.00 sleep
   6604 root      20   0   13584   8892   7472 S   0.0   0.0   0:00.02 sshd
   6615 root      20   0    4240   3412   2952 S   0.0   0.0   0:00.00 bash
   6620 root      20   0    6180   3280   2732 R   0.0   0.0   0:00.00 top

pstack shows that many threads within the storaged process are trying to acquire an RWSpinLock:

root@store4:~# cat new.txt | grep -i spinlock
#3  0x000000000353ae1b in folly::RWSpinLock::lock_shared()+74 in /root/src/new-balance-nebula/build/bin/nebula-storaged at RWSpinLock.h:212
#4  0x000000000353b017 in ReadHolder()+44 in /root/src/new-balance-nebula/build/bin/nebula-storaged at RWSpinLock.h:320
#0  inlined            in folly::RWSpinLock::fetch_add at atomic_base.h:541
#0  0x000000000353af62 in folly::RWSpinLock::try_lock_shared()+48 in /root/src/new-balance-nebula/build/bin/nebula-storaged at RWSpinLock.h:278
#1  0x000000000353adf3 in folly::RWSpinLock::lock_shared()+34 in /root/src/new-balance-nebula/build/bin/nebula-storaged at RWSpinLock.h:210
#2  0x000000000353b017 in ReadHolder()+44 in /root/src/new-balance-nebula/build/bin/nebula-storaged at RWSpinLock.h:320
#0  inlined            in folly::RWSpinLock::fetch_add at atomic_base.h:541
#0  0x000000000353af62 in folly::RWSpinLock::try_lock_shared()+48 in /root/src/new-balance-nebula/build/bin/nebula-storaged at RWSpinLock.h:278
#1  0x000000000353adf3 in folly::RWSpinLock::lock_shared()+34 in /root/src/new-balance-nebula/build/bin/nebula-storaged at RWSpinLock.h:210
#2  0x000000000353b017 in ReadHolder()+44 in /root/src/new-balance-nebula/build/bin/nebula-storaged at RWSpinLock.h:320
#0  inlined            in folly::RWSpinLock::fetch_add at atomic_base.h:541
#0  0x000000000353af62 in folly::RWSpinLock::try_lock_shared()+48 in /root/src/new-balance-nebula/build/bin/nebula-storaged at RWSpinLock.h:278
#1  0x000000000353adf3 in folly::RWSpinLock::lock_shared()+34 in /root/src/new-balance-nebula/build/bin/nebula-storaged at RWSpinLock.h:210
#2  0x000000000353b017 in ReadHolder()+44 in /root/src/new-balance-nebula/build/bin/nebula-storaged at RWSpinLock.h:320
#0  0x000000000353af9d in folly::RWSpinLock::try_lock_shared()+107 in /root/src/new-balance-nebula/build/bin/nebula-storaged at RWSpinLock.h:281
#1  0x000000000353adf3 in folly::RWSpinLock::lock_shared()+34 in /root/src/new-balance-nebula/build/bin/nebula-storaged at RWSpinLock.h:210
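
For context, folly::RWSpinLock is a busy-waiting reader/writer lock: lock_shared() loops on try_lock_shared() (yielding only after many spin iterations) rather than blocking in the kernel, so readers stuck behind a writer stay runnable and burn CPU, which matches the very high CPU usage shown above. The following is a minimal standalone C++ sketch of that contention pattern, not NebulaGraph code; the thread count and sleep durations are made up purely for illustration:

#include <folly/synchronization/RWSpinLock.h>  // header path varies across folly versions

#include <atomic>
#include <chrono>
#include <thread>
#include <vector>

int main() {
  folly::RWSpinLock lock;
  std::atomic<bool> stop{false};

  // Writer thread: repeatedly holds the lock in write mode, starving readers.
  std::thread writer([&] {
    while (!stop) {
      folly::RWSpinLock::WriteHolder wh(&lock);
      std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }
  });

  // Reader threads: each ReadHolder constructor calls lock_shared(), which
  // spins in try_lock_shared() -- the same frames seen in the pstack output
  // above -- so every blocked reader keeps a core busy.
  std::vector<std::thread> readers;
  for (int i = 0; i < 64; ++i) {
    readers.emplace_back([&] {
      while (!stop) {
        folly::RWSpinLock::ReadHolder rh(&lock);
      }
    });
  }

  std::this_thread::sleep_for(std::chrono::seconds(5));
  stop = true;
  writer.join();
  for (auto& t : readers) {
    t.join();
  }
  return 0;
}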

We failed to enable the storage log (storage logging was disabled at startup); stderr.txt shows:

Could not create logging file: Too many open files
COULD NOT CREATE A LOGGINGFILE 20220107-045351.14!Could not create logging file: Too many open files
COULD NOT CREATE A LOGGINGFILE 20220107-045351.14!Could not create logging file: Too many open files
COULD NOT CREATE A LOGGINGFILE 20220107-045351.14!Could not create logging file: Too many open files
COULD NOT CREATE A LOGGINGFILE 20220107-045423.14!Could not create logging file: Too many open files
COULD NOT CREATE A LOGGINGFILE 20220107-045423.14!Could not create logging file: Too many open files
COULD NOT CREATE A LOGGINGFILE 20220107-045423.14!Could not create logging file: Too many open files
COULD NOT CREATE A LOGGINGFILE 20220107-045455.14!Could not create logging file: Too many open files
COULD NOT CREATE A LOGGINGFILE 20220107-045455.14!Could not create logging file: Too many open files
COULD NOT CREATE A LOGGINGFILE 20220107-045455.14!Could not create logging file: Too many open files
COULD NOT CREATE A LOGGINGFILE 20220107-045528.14!Could not create logging file: Too many open files
COULD NOT CREATE A LOGGINGFILE 20220107-045528.14!Could not create logging file: Too many open files
COULD NOT CREATE A LOGGINGFILE 20220107-045528.14!Could not create logging file: Too many open files
COULD NOT CREATE A LOGGINGFILE 20220107-045600.14!Could not create logging file: Too many open files
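
The "Too many open files" failures indicate the process hit its RLIMIT_NOFILE file-descriptor limit, which is plausible once each storaged holds hundreds of parts (RocksDB SST/WAL files, sockets, and so on). As a hypothetical diagnostic, not part of NebulaGraph, something like the following C++ snippet run inside the container would show the limit and the current descriptor count:

// Hypothetical diagnostic sketch: print RLIMIT_NOFILE and how many
// descriptors this process currently has open (Linux-specific, via
// /proc/self/fd), to confirm the fd limit is what blocks glog from
// creating its log files.
#include <sys/resource.h>

#include <cstddef>
#include <cstdio>
#include <filesystem>

int main() {
  struct rlimit rl;
  if (getrlimit(RLIMIT_NOFILE, &rl) == 0) {
    std::printf("RLIMIT_NOFILE: soft=%llu hard=%llu\n",
                static_cast<unsigned long long>(rl.rlim_cur),
                static_cast<unsigned long long>(rl.rlim_max));
  }

  std::size_t openFds = 0;
  for (const auto& entry :
       std::filesystem::directory_iterator("/proc/self/fd")) {
    (void)entry;  // only the count matters here
    ++openFds;
  }
  std::printf("open fds: %zu\n", openFds);
  return 0;
}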

The storaged process also failed to exit after receiving SIGTERM. Stack dump before issuing SIGTERM:

before.txt

Stack dump after issuing SIGTERM:

after.txt
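
The SIGTERM hang is consistent with the spinning threads above: a typical graceful-shutdown path sets a stop flag and joins its worker threads, and a worker that never returns from a lock acquisition never re-checks that flag, so join() blocks forever. The sketch below is a generic illustration of that failure mode, not nebula-storaged's actual signal-handling code:

#include <atomic>
#include <csignal>
#include <thread>
#include <vector>

// Generic illustration (hypothetical): SIGTERM flips an atomic flag and
// main() joins the workers. A worker stuck inside a spinlock acquisition
// (as in the pstack output) never observes the flag, so the process
// survives SIGTERM.
std::atomic<bool> g_stop{false};

void handleSigterm(int) {
  g_stop.store(true);  // lock-free atomic store: safe from a signal handler
}

int main() {
  std::signal(SIGTERM, handleSigterm);

  std::vector<std::thread> workers;
  for (int i = 0; i < 4; ++i) {
    workers.emplace_back([] {
      while (!g_stop.load()) {
        // Real workers do work here; one parked in lock_shared() never
        // gets back to this check.
      }
    });
  }

  for (auto& w : workers) {
    w.join();  // hangs if any worker never observes g_stop
  }
  return 0;
}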

Your Environments (required)

  • OS: uname -a
  • Compiler: g++ --version or clang++ --version
  • CPU: lscpu
  • Commit id: 1147e89

How To Reproduce (required)

Steps to reproduce the behavior:

  1. Create a cluster with 3 storage hosts
  2. Create a space with 512 parts * 3 replicas on the 3 storage hosts
  3. Add two new storage hosts, balance data across all 5 hosts, and wait until the job finishes
  4. Balance remove one storage host

We find that the storage cluster becomes unstable during the last step.

Expected behavior

Additional context

liwenhui-soul (Contributor) commented:
fix in #3668
