fix: Fix channel not balance on datanodes #40422

Merged


bigsheeper
Contributor

@bigsheeper bigsheeper commented Mar 6, 2025

  1. Prevent channels from being assigned to only one datanode during datacoord startup.
  2. Optimize the channel assignment policy by considering newly assigned channels.
  3. Make msgdispatcher manager lock-free.

issue: #40421, #37630
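
Items 1 and 2 above can be illustrated with a minimal sketch. It assumes a least-loaded policy in which each node's load is updated as soon as a channel is handed to it, so a batch of channels assigned in one pass (for example at startup) does not pile up on a single node. The function name `assignChannels`, the node IDs, and the channel names are hypothetical; this is not the code in internal/datacoord/policy.go.

```go
package main

import (
	"fmt"
	"sort"
)

// assignChannels spreads channels across nodes, always picking the node with
// the fewest channels so far. The count includes channels handed out earlier
// in this same call, not only those the node held at the start.
// Hypothetical helper for illustration; not the Milvus implementation.
func assignChannels(existing map[int64]int, channels []string) map[int64][]string {
	// Copy the starting load so the caller's map is not mutated.
	load := make(map[int64]int, len(existing))
	nodes := make([]int64, 0, len(existing))
	for id, n := range existing {
		load[id] = n
		nodes = append(nodes, id)
	}
	// Sort node IDs so ties are broken deterministically (lowest ID wins).
	sort.Slice(nodes, func(i, j int) bool { return nodes[i] < nodes[j] })

	result := make(map[int64][]string)
	if len(nodes) == 0 {
		return result
	}
	for _, ch := range channels {
		target := nodes[0]
		for _, id := range nodes[1:] {
			if load[id] < load[target] {
				target = id
			}
		}
		result[target] = append(result[target], ch)
		load[target]++ // count the new assignment before picking the next node
	}
	return result
}

func main() {
	existing := map[int64]int{1: 0, 2: 0, 3: 0}
	channels := []string{"ch-0", "ch-1", "ch-2", "ch-3", "ch-4", "ch-5"}
	assigned := assignChannels(existing, channels)
	for _, id := range []int64{1, 2, 3} {
		fmt.Println(id, assigned[id])
	}
}
```

Without the `load[target]++` step, every channel in the batch would see the same stale counts and land on the first node, which is the kind of imbalance the description refers to.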

@sre-ci-robot sre-ci-robot added the size/S Denotes a PR that changes 10-29 lines. label Mar 6, 2025
@sre-ci-robot sre-ci-robot requested review from congqixia and sunby March 6, 2025 09:49
@mergify mergify bot added dco-passed DCO check passed. kind/bug Issues or changes related a bug labels Mar 6, 2025

codecov bot commented Mar 6, 2025

Codecov Report

Attention: Patch coverage is 91.94631% with 12 lines in your changes missing coverage. Please review.

Project coverage is 80.52%. Comparing base (2bd2cca) to head (2d80cab).
Report is 8 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| internal/datacoord/policy.go | 90.97% | 8 Missing and 4 partials ⚠️ |
Additional details and impacted files

@@             Coverage Diff             @@
##           master   #40422       +/-   ##
===========================================
+ Coverage   70.01%   80.52%   +10.50%     
===========================================
  Files         309     1474     +1165     
  Lines       27710   207517   +179807     
===========================================
+ Hits        19402   167104   +147702     
- Misses       8308    34343    +26035     
- Partials        0     6070     +6070     

| Components | Coverage Δ |
|---|---|
| Client | 79.74% <ø> (∅) |
| Core | 70.00% <ø> (-0.02%) ⬇️ |
| Go | 82.27% <91.94%> (∅) |

| Files with missing lines | Coverage Δ |
|---|---|
| internal/datacoord/channel_manager.go | 86.47% <100.00%> (ø) |
| pkg/mq/msgdispatcher/client.go | 93.65% <ø> (ø) |
| pkg/mq/msgdispatcher/manager.go | 72.54% <100.00%> (ø) |
| internal/datacoord/policy.go | 80.00% <90.97%> (ø) |

... and 1162 files with indirect coverage changes

@bigsheeper bigsheeper force-pushed the 2503-fix-dn-channel-not-balance branch from c906a3a to 8f2502c Compare March 7, 2025 08:40
Contributor

mergify bot commented Mar 7, 2025

@bigsheeper The E2E Jenkins job failed; comment /run-cpu-e2e to trigger the job again.

@czs007
Collaborator

czs007 commented Mar 9, 2025

/run-cpu-e2e

@czs007
Collaborator

czs007 commented Mar 9, 2025

[2025/03/07 09:21:58.603 +00:00] [INFO] [datacoord/channel_manager.go:197] ["register node"] ["registered node"=7]
channel_manager_test.go:73:
Error Trace: /go/src/github.com/milvus-io/milvus/internal/datacoord/channel_manager_test.go:73
/go/src/github.com/milvus-io/milvus/internal/datacoord/channel_manager_test.go:813
Error: Should be true
Test: TestChannelManagerSuite/TestStartupNilSchema
channel_manager_test.go:74:
Error Trace: /go/src/github.com/milvus-io/milvus/internal/datacoord/channel_manager_test.go:74
/go/src/github.com/milvus-io/milvus/internal/datacoord/channel_manager_test.go:813
Error: Expected value not to be nil.
Test: TestChannelManagerSuite/TestStartupNilSchema
panic.go:261: test panicked: runtime error: invalid memory address or nil pointer dereference
goroutine 15334 [running]:
runtime/debug.Stack()
/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.22.7.linux-amd64/src/runtime/debug/stack.go:24 +0x67
github.com/stretchr/testify/suite.failOnPanic(0xc00cba2820, {0x69948a0, 0xa380b10})
/go/pkg/mod/github.com/stretchr/testify@v1.9.0/suite/suite.go:89 +0x5b
github.com/stretchr/testify/suite.Run.func1.1()
/go/pkg/mod/github.com/stretchr/testify@v1.9.0/suite/suite.go:188 +0x365
panic({0x69948a0?, 0xa380b10?})
/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.22.7.linux-amd64/src/runtime/panic.go:770 +0x132
github.com/milvus-io/milvus/internal/datacoord.(*ChannelManagerSuite).checkAssignment(0xc005153ce0, 0xc00746f140, 0x7, {0x6fa968a, 0x3}, {0x6fb2b62, 0x7})
/go/src/github.com/milvus-io/milvus/internal/datacoord/channel_manager_test.go:75 +0x122
github.com/milvus-io/milvus/internal/datacoord.(*ChannelManagerSuite).TestStartupNilSchema(0xc005153ce0)
/go/src/github.com/milvus-io/milvus/internal/datacoord/channel_manager_test.go:813 +0x1e68
reflect.Value.call({0xc001efa100?, 0xc003e75970?, 0xc0061c1c30?}, {0x6faa606, 0x4}, {0xc0061c1eb0, 0x1, 0x4f5c57?})
/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.22.7.linux-amd64/src/reflect/value.go:596 +0xd5d
reflect.Value.Call({0xc001efa100?, 0xc003e75970?, 0xc00ceaf6c0?}, {0xc0061c1eb0, 0x1, 0x1})
/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.22.7.linux-amd64/src/reflect/value.go:380 +0xb6

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
@bigsheeper bigsheeper changed the title fix: Fix channel not balance on datanodes during cluster startup fix: Fix channel not balance on datanodes Mar 10, 2025
@bigsheeper bigsheeper force-pushed the 2503-fix-dn-channel-not-balance branch from 8f2502c to 04d6f23 Compare March 10, 2025 07:58
@sre-ci-robot sre-ci-robot added size/L Denotes a PR that changes 100-499 lines. and removed size/S Denotes a PR that changes 10-29 lines. labels Mar 10, 2025
Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
Contributor

mergify bot commented Mar 10, 2025

@bigsheeper The cpp-unit-test check failed; comment rerun cpp-unit-test to trigger the job again.

@bigsheeper
Contributor Author

rerun cpp-unit-test

@mergify mergify bot added the ci-passed label Mar 11, 2025
@czs007 czs007 added the PR | need cherry-pick need cherry pick to other branches label Mar 11, 2025
@czs007
Collaborator

czs007 commented Mar 11, 2025

/approve
/lgtm

@sre-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bigsheeper, czs007


@sre-ci-robot sre-ci-robot merged commit a33c937 into milvus-io:master Mar 11, 2025
19 of 20 checks passed