balance remove error #3654

Closed
HarrisChu opened this issue Jan 6, 2022 · 3 comments · Fixed by #3668

@HarrisChu
Contributor

HarrisChu commented Jan 6, 2022

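Just recording this.

Issue 1:
The src and dst of every balance task are the same host (see the `Command(src->dst)` column of `show job 2` below).

Issue 2:
After the metad leader changes, the job cannot be stopped.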

(root@nebula) [(none)]>       CREATE SPACE test(vid_type=int, replica_factor=3, partition_num=10) on "z1","z2","z3","z4";
Execution succeeded (time spent 10524/20016 us)

Thu, 06 Jan 2022 17:47:06 CST

(root@nebula) [(none)]> use test
Execution succeeded (time spent 2591/13982 us)

Thu, 06 Jan 2022 17:47:10 CST

(root@nebula) [test]>
(root@nebula) [test]> show parts
+--------------+-------------------+-----------------------------------------------------+-------+
| Partition ID | Leader            | Peers                                               | Losts |
+--------------+-------------------+-----------------------------------------------------+-------+
| 1            | "127.0.0.1:12391" | "127.0.0.1:12391, 127.0.0.1:18320, 127.0.0.1:17318" | ""    |
| 2            | "127.0.0.1:18320" | "127.0.0.1:12391, 127.0.0.1:18320, 127.0.0.1:19179" | ""    |
| 3            | "127.0.0.1:17318" | "127.0.0.1:12391, 127.0.0.1:17318, 127.0.0.1:19179" | ""    |
| 4            | "127.0.0.1:19179" | "127.0.0.1:18320, 127.0.0.1:17318, 127.0.0.1:19179" | ""    |
| 5            | "127.0.0.1:17318" | "127.0.0.1:12391, 127.0.0.1:18320, 127.0.0.1:17318" | ""    |
| 6            | "127.0.0.1:18320" | "127.0.0.1:12391, 127.0.0.1:18320, 127.0.0.1:19179" | ""    |
| 7            | "127.0.0.1:12391" | "127.0.0.1:12391, 127.0.0.1:17318, 127.0.0.1:19179" | ""    |
| 8            | "127.0.0.1:18320" | "127.0.0.1:18320, 127.0.0.1:17318, 127.0.0.1:19179" | ""    |
| 9            | "127.0.0.1:17318" | "127.0.0.1:12391, 127.0.0.1:18320, 127.0.0.1:17318" | ""    |
| 10           | "127.0.0.1:18320" | "127.0.0.1:12391, 127.0.0.1:18320, 127.0.0.1:19179" | ""    |
+--------------+-------------------+-----------------------------------------------------+-------+
Got 10 rows (time spent 2588/16467 us)

Thu, 06 Jan 2022 17:47:12 CST

(root@nebula) [test]> SUBMIT JOB BALANCE IN ZONE REMOVE 127.0.0.1:12391
+------------+
| New Job Id |
+------------+
| 2          |
+------------+
Got 1 rows (time spent 6921/24940 us)

Thu, 06 Jan 2022 17:47:39 CST

(root@nebula) [test]> show job 2
+------------------------+------------------------------------+---------------+---------------------------------+----------------------------+
| Job Id(spaceId:partId) | Command(src->dst)                  | Status        | Start Time                      | Stop Time                  |
+------------------------+------------------------------------+---------------+---------------------------------+----------------------------+
| 2                      | "DATA_BALANCE"                     | "RUNNING"     | "2022-01-06T09:47:39.000000000" | "__EMPTY__"                |
| "2, 1:1"               | "127.0.0.1:12391->127.0.0.1:12391" | "IN_PROGRESS" | 2022-01-06T09:47:39.000000      |                            |
| "2, 1:2"               | "127.0.0.1:12391->127.0.0.1:12391" | "FAILED"      | 2022-01-06T09:47:39.000000      | 2022-01-06T09:47:39.000000 |
| "2, 1:3"               | "127.0.0.1:12391->127.0.0.1:12391" | "FAILED"      | 2022-01-06T09:47:39.000000      | 2022-01-06T09:47:39.000000 |
| "2, 1:5"               | "127.0.0.1:12391->127.0.0.1:12391" | "FAILED"      | 2022-01-06T09:47:39.000000      | 2022-01-06T09:47:39.000000 |
| "2, 1:6"               | "127.0.0.1:12391->127.0.0.1:12391" | "FAILED"      | 2022-01-06T09:47:39.000000      | 2022-01-06T09:47:39.000000 |
| "2, 1:7"               | "127.0.0.1:12391->127.0.0.1:12391" | "IN_PROGRESS" | 2022-01-06T09:47:39.000000      |                            |
| "2, 1:9"               | "127.0.0.1:12391->127.0.0.1:12391" | "FAILED"      | 2022-01-06T09:47:39.000000      | 2022-01-06T09:47:39.000000 |
| "2, 1:10"              | "127.0.0.1:12391->127.0.0.1:12391" | "FAILED"      | 2022-01-06T09:47:39.000000      | 2022-01-06T09:47:39.000000 |
| "Total:8"              | "Succeeded:0"                      | "Failed:6"    | "In Progress:2"                 | "Invalid:0"                |
+------------------------+------------------------------------+---------------+---------------------------------+----------------------------+
Got 10 rows (time spent 2760/38490 us)

Thu, 06 Jan 2022 17:47:42 CST

(root@nebula) [test]> show job 2
+------------------------+------------------------------------+------------+---------------------------------+---------------------------------+
| Job Id(spaceId:partId) | Command(src->dst)                  | Status     | Start Time                      | Stop Time                       |
+------------------------+------------------------------------+------------+---------------------------------+---------------------------------+
| 2                      | "DATA_BALANCE"                     | "FAILED"   | "2022-01-06T09:47:39.000000000" | "2022-01-06T09:47:44.000000000" |
| "2, 1:1"               | "127.0.0.1:12391->127.0.0.1:12391" | "FAILED"   | 2022-01-06T09:47:39.000000      | 2022-01-06T09:47:44.000000      |
| "2, 1:2"               | "127.0.0.1:12391->127.0.0.1:12391" | "FAILED"   | 2022-01-06T09:47:39.000000      | 2022-01-06T09:47:39.000000      |
| "2, 1:3"               | "127.0.0.1:12391->127.0.0.1:12391" | "FAILED"   | 2022-01-06T09:47:39.000000      | 2022-01-06T09:47:39.000000      |
| "2, 1:5"               | "127.0.0.1:12391->127.0.0.1:12391" | "FAILED"   | 2022-01-06T09:47:39.000000      | 2022-01-06T09:47:39.000000      |
| "2, 1:6"               | "127.0.0.1:12391->127.0.0.1:12391" | "FAILED"   | 2022-01-06T09:47:39.000000      | 2022-01-06T09:47:39.000000      |
| "2, 1:7"               | "127.0.0.1:12391->127.0.0.1:12391" | "FAILED"   | 2022-01-06T09:47:39.000000      | 2022-01-06T09:47:44.000000      |
| "2, 1:9"               | "127.0.0.1:12391->127.0.0.1:12391" | "FAILED"   | 2022-01-06T09:47:39.000000      | 2022-01-06T09:47:39.000000      |
| "2, 1:10"              | "127.0.0.1:12391->127.0.0.1:12391" | "FAILED"   | 2022-01-06T09:47:39.000000      | 2022-01-06T09:47:39.000000      |
| "Total:8"              | "Succeeded:0"                      | "Failed:8" | "In Progress:0"                 | "Invalid:0"                     |
+------------------------+------------------------------------+------------+---------------------------------+---------------------------------+
Got 10 rows (time spent 2420/57369 us)

Thu, 06 Jan 2022 17:47:45 CST
(root@nebula) [test]> show parts
+--------------+-------------------+-----------------------------------------------------+-------+
| Partition ID | Leader            | Peers                                               | Losts |
+--------------+-------------------+-----------------------------------------------------+-------+
| 1            | "127.0.0.1:18320" | "127.0.0.1:18320, 127.0.0.1:17318, 127.0.0.1:12391" | ""    |
| 2            | "127.0.0.1:18320" | "127.0.0.1:18320, 127.0.0.1:19179, 127.0.0.1:12391" | ""    |
| 3            | "127.0.0.1:17318" | "127.0.0.1:17318, 127.0.0.1:19179, 127.0.0.1:12391" | ""    |
| 4            | "127.0.0.1:19179" | "127.0.0.1:18320, 127.0.0.1:17318, 127.0.0.1:19179" | ""    |
| 5            | "127.0.0.1:17318" | "127.0.0.1:18320, 127.0.0.1:17318, 127.0.0.1:12391" | ""    |
| 6            | "127.0.0.1:18320" | "127.0.0.1:18320, 127.0.0.1:19179, 127.0.0.1:12391" | ""    |
| 7            | "127.0.0.1:17318" | "127.0.0.1:17318, 127.0.0.1:19179, 127.0.0.1:12391" | ""    |
| 8            | "127.0.0.1:18320" | "127.0.0.1:18320, 127.0.0.1:17318, 127.0.0.1:19179" | ""    |
| 9            | "127.0.0.1:17318" | "127.0.0.1:18320, 127.0.0.1:17318, 127.0.0.1:12391" | ""    |
| 10           | "127.0.0.1:18320" | "127.0.0.1:18320, 127.0.0.1:19179, 127.0.0.1:12391" | ""    |
+--------------+-------------------+-----------------------------------------------------+-------+
Got 10 rows (time spent 2836/7545 us)

Thu, 06 Jan 2022 18:24:32 CST

(root@nebula) [test]> SUBMIT JOB BALANCE IN ZONE REMOVE 127.0.0.1:12391
+------------+
| New Job Id |
+------------+
| 3          |
+------------+
Got 1 rows (time spent 6842/10417 us)

Thu, 06 Jan 2022 18:24:59 CST

(root@nebula) [test]> show job 3
[ERROR (-1005)]: LeaderChanged: Leader changed!

Thu, 06 Jan 2022 18:25:05 CST

(root@nebula) [test]> show job 3
+------------------------+-------------------+------------+---------------------------------+-------------+
| Job Id(spaceId:partId) | Command(src->dst) | Status     | Start Time                      | Stop Time   |
+------------------------+-------------------+------------+---------------------------------+-------------+
| 3                      | "DATA_BALANCE"    | "RUNNING"  | "2022-01-06T10:24:59.000000000" | "__EMPTY__" |
| "Total:0"              | "Succeeded:0"     | "Failed:0" | "In Progress:0"                 | "Invalid:0" |
+------------------------+-------------------+------------+---------------------------------+-------------+
Got 2 rows (time spent 3328/7089 us)

Thu, 06 Jan 2022 18:25:07 CST

(root@nebula) [test]> show jobs
+--------+----------------+-----------+----------------------------+----------------------------+
| Job Id | Command        | Status    | Start Time                 | Stop Time                  |
+--------+----------------+-----------+----------------------------+----------------------------+
| 3      | "DATA_BALANCE" | "RUNNING" | 2022-01-06T10:24:59.000000 |                            |
| 2      | "DATA_BALANCE" | "FAILED"  | 2022-01-06T09:47:39.000000 | 2022-01-06T09:47:44.000000 |
+--------+----------------+-----------+----------------------------+----------------------------+
Got 2 rows (time spent 7155/11466 us)

Thu, 06 Jan 2022 18:25:38 CST
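
Issue 1 is visible directly in the `Command(src->dst)` column of `show job 2`: every task has the same host on both sides of the arrow. A minimal Python sketch for spotting such rows in copied job output follows. The helper name `self_targeting_tasks` is hypothetical and not part of NebulaGraph, and it assumes the `src->dst` format shown in the output above.

```python
def self_targeting_tasks(rows):
    """Return the task ids whose source and destination host are identical.

    `rows` is a list of (task_id, command) pairs copied from `show job`,
    where command looks like "127.0.0.1:12391->127.0.0.1:12391".
    Rows without "->" (the job header and summary rows) are skipped.
    """
    flagged = []
    for task_id, command in rows:
        if "->" not in command:
            continue
        src, dst = (part.strip() for part in command.split("->", 1))
        if src == dst:
            flagged.append(task_id)
    return flagged


# A few rows taken from the `show job 2` output above.
rows = [
    ("2", "DATA_BALANCE"),
    ("2, 1:1", "127.0.0.1:12391->127.0.0.1:12391"),
    ("2, 1:2", "127.0.0.1:12391->127.0.0.1:12391"),
    ("Total:8", "Succeeded:0"),
]
print(self_targeting_tasks(rows))
```

Any non-empty result means the plan contains tasks that try to move a partition onto the host it is supposed to leave.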
@HarrisChu HarrisChu added the type/bug Type: something is unexpected label Jan 6, 2022
@HarrisChu
Contributor Author

For issue 2: metad crashed.

(gdb) bt
#0  0x00000000011ef93c in nebula::meta::DataBalanceJobExecutor::buildBalancePlan() ()
#1  0x00000000011f0de9 in nebula::meta::DataBalanceJobExecutor::executeInternal() ()
#2  0x00000000011e5908 in nebula::meta::MetaJobExecutor::execute() ()
#3  0x00000000011c97d6 in nebula::meta::JobManager::runJobInternal(nebula::meta::JobDescription const&, nebula::meta::JobManager::JbOp) ()
#4  0x00000000011cf452 in nebula::meta::JobManager::scheduleThread() ()
#5  0x000000000299a090 in execute_native_thread_routine ()
#6  0x00007f2cd6984ea5 in start_thread () from /lib64/libpthread.so.0
#7  0x00007f2cd66adb0d in clone () from /lib64/libc.so.6

@Sophie-Xie Sophie-Xie added this to the v3.0.0 milestone Jan 6, 2022
@HarrisChu HarrisChu mentioned this issue Jan 10, 2022
11 tasks
@Sophie-Xie Sophie-Xie linked a pull request Jan 10, 2022 that will close this issue
11 tasks
@liwenhui-soul
Contributor

Which zone does 127.0.0.1:12391 belong to, and which hosts does that zone contain?

@HarrisChu
Contributor Author

In short, this `BALANCE IN ZONE REMOVE` removes all the hosts in one zone.
I think it will be fixed in #3668.
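
To illustrate the failure mode: if every host of a zone is named in the `REMOVE` clause, no host is left in that zone to receive the partitions, so a planner would need to reject the request up front instead of emitting tasks whose dst is the removed host. The Python sketch below is an illustration only and is not the change made in #3668. The function `pick_destination`, the zone layout in the usage example, and the least-loaded tie-break are all assumptions.

```python
def pick_destination(zone_hosts, removed, part_peers, load):
    """Choose a host in the same zone to receive a partition replica.

    zone_hosts: hosts belonging to the zone of the host being removed
    removed:    hosts named in the REMOVE clause
    part_peers: hosts that already hold a replica of this partition
    load:       dict mapping host -> number of partitions it currently holds
    Raises ValueError when the zone has no eligible host left.
    """
    candidates = [
        h for h in zone_hosts
        if h not in removed and h not in part_peers
    ]
    if not candidates:
        raise ValueError("no remaining host in zone can take the partition")
    # Assumed tie-break: least-loaded host first, then by name for determinism.
    return min(candidates, key=lambda h: (load.get(h, 0), h))


# Hypothetical layout: the zone contains only the host being removed.
try:
    pick_destination(
        zone_hosts=["127.0.0.1:12391"],
        removed={"127.0.0.1:12391"},
        part_peers={"127.0.0.1:12391", "127.0.0.1:18320", "127.0.0.1:17318"},
        load={},
    )
except ValueError as err:
    print("rejected:", err)
```

Failing early like this keeps the request from producing a plan with `127.0.0.1:12391->127.0.0.1:12391` tasks at all.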
