Skip to content

[fix][test] fix Flaky-test: ExtensibleLoadManagerImplTest.testLoadBalancerServiceUnitTableViewSyncer #24371

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

berg223
Copy link
Contributor

@berg223 berg223 commented Jun 2, 2025

Fixes #24357

Main Issue: #24357

Motivation

The unit test failed because loadManager4.getBrokerRegistry().unregister(); is not enough to ensure a successful unregister. The root cause is that broker will register it self after unregister. This happens in two place:

  1. There is a scheduler task named monitorTask in ExtensibleLoadManagerImpl, it will periodically register self.
  2. When brokerRegistry start, it will register a listener. Once broker unregistered, it will callback handleMetadataStoreNotification and try to register itself again.

Modifications

The modifications contains three part:

  1. changes not in testLoadBalancerServiceUnitTableViewSyncer is just for formatting. Please ignore it!
  2. changes to resolve this issue is:
               // simulate pulsar4 store error which will unregister the broker
                var brokerRegistry = loadManager4.getBrokerRegistry();
                var brokerLookupDataMetadataCache = (MetadataCache<BrokerLookupData>) spy(
                        FieldUtils.readField(brokerRegistry, "brokerLookupDataMetadataCache", true));
                doReturn(CompletableFuture.failedFuture(new MetadataStoreException("store error"))).when(
                        brokerLookupDataMetadataCache).put(anyString(), any(), any());
                FieldUtils.writeDeclaredField(brokerRegistry, "brokerLookupDataMetadataCache",
                        brokerLookupDataMetadataCache, true);
                brokerRegistry.unregister();

                NamespaceName slaMonitorNamespace =
                        getSLAMonitorNamespace(pulsar4.getBrokerId(), pulsar.getConfiguration());
                String slaMonitorTopic = slaMonitorNamespace.getPersistentTopicName("test");

                Awaitility.await()
                        .atMost(10, TimeUnit.SECONDS)
                        .pollInterval(1, TimeUnit.SECONDS)
                        .untilAsserted(() -> {
                            String currentResult = pulsar.getAdminClient().lookups().lookupTopic(slaMonitorTopic);
                            assertNotNull(currentResult);
                            log.info("{} Namespace is re-owned by {}", slaMonitorTopic, currentResult);
                            assertNotEquals(currentResult, pulsar4.getBrokerServiceUrl());
                        });
  1. I have found other factor to fail the test when I debug in testLoadBalancerServiceUnitTableViewSyncer. The syncer is not started and it maybe cause exception sometime. So I changed it to:
String syncerTyp = serviceUnitStateTableViewClassName.equals(ServiceUnitStateTableViewImpl.class.getName()) ?
                "SystemTopicToMetadataStoreSyncer" : "MetadataStoreToSystemTopicSyncer";
        pulsar.getAdminClient().brokers()
                .updateDynamicConfiguration("loadBalancerServiceUnitTableViewSyncer", syncerTyp);
        Awaitility.waitAtMost(10, TimeUnit.SECONDS)
                .pollInterval(1, TimeUnit.SECONDS)
                .untilAsserted(() -> {
                    assertTrue(pulsar1.getConfiguration().isLoadBalancerServiceUnitTableViewSyncerEnabled());
                    assertTrue(pulsar2.getConfiguration().isLoadBalancerServiceUnitTableViewSyncerEnabled());
                });

        // We invoke monitor method to ensure SystemTopicToMetadataStoreSyncer to start or close because syncer will not be started or close after pulsar.getAdminClient().brokers().updateDynamicConfiguration();
.
        primaryLoadManager.monitor();
        secondaryLoadManager.monitor();

       // We invoke monitor method in makePrimaryAsLeader and makeSecondaryAsLeader because monitor can immediately trigger serviceUnitStateTableViewSyncer to start or close without wait.
        makeSecondaryAsLeader();
        makePrimaryAsLeader();
        assertTrue(primaryLoadManager.getServiceUnitStateTableViewSyncer().isActive());
        assertFalse(secondaryLoadManager.getServiceUnitStateTableViewSyncer().isActive());

Verifying this change

  • Make sure that the change passes the CI checks.

Does this pull request potentially affect one of the following parts:

If the box was checked, please highlight the changes

  • Dependencies (add or upgrade a dependency)
  • The public API
  • The schema
  • The default values of configurations
  • The threading model
  • The binary protocol
  • The REST endpoints
  • The admin CLI options
  • The metrics
  • Anything that affects deployment

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository:

Copy link

github-actions bot commented Jun 2, 2025

@berg223 Please add the following content to your PR description and select a checkbox:

- [ ] `doc` <!-- Your PR contains doc changes -->
- [ ] `doc-required` <!-- Your PR changes impact docs and you will update later -->
- [ ] `doc-not-needed` <!-- Your PR changes do not impact docs -->
- [ ] `doc-complete` <!-- Docs have been already added -->

@github-actions github-actions bot added doc-label-missing doc-not-needed Your PR changes do not impact docs and removed doc-label-missing labels Jun 2, 2025
@lhotari
Copy link
Member

lhotari commented Jun 2, 2025

/pulsarbot rerun-failure-checks

@berg223
Copy link
Contributor Author

berg223 commented Jun 2, 2025

/pulsarbot run-failure-checks

@berg223
Copy link
Contributor Author

berg223 commented Jun 6, 2025

Could you please approve the PR to be tested? Is there any concerns? cc @lhotari @nodece

@lhotari
Copy link
Member

lhotari commented Jun 6, 2025

Could you please approve the PR to be tested? Is there any concerns? cc @lhotari @nodece

@berg223 Not really any concerns, but I just haven't had a chance to review yet. Regarding testing PRs, you should setup "Personal CI" to be able to run CI on your own while you are working on the PR. It's about enabling GitHub Actions in your own fork and opening a second PR in your own fork for the same PR branch to be able to run tests without approvals.
This is explained in https://pulsar.apache.org/contribute/personal-ci/.

Comment on lines +848 to +849
protected static Pair<LookupService, LookupService> spyLookupService(PulsarClient client)
throws IllegalAccessException {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like there are unnecessary formatting changes. In Pulsar, we use the line width of 120. Please ensure that you have the code style setup in the way described in https://pulsar.apache.org/contribute/setup-ide/. Most contributors use IntelliJ.

Copy link
Contributor Author

@berg223 berg223 Jun 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like there are unnecessary formatting changes. In Pulsar, we use the line width of 120. Please ensure that you have the code style setup in the way described in https://pulsar.apache.org/contribute/setup-ide/. Most contributors use IntelliJ.

I have the code style setup before fomatting according to the doc you point out. Actually, the line has 121 characters. So it should be formatted! Let's skip the formmatting changes!

Comment on lines +1078 to +1079
Optional<BrokerLookupData> brokerLookupData =
primaryLoadManager.assign(Optional.empty(), bundle, LookupOptions.builder().build()).get();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these look unnecessary again, perhaps due to different line length in your IDE code style

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these look unnecessary again, perhaps due to different line length in your IDE code style

This line has 137 characters. So the formmatting is reasonable!

@lhotari
Copy link
Member

lhotari commented Jun 6, 2025

@berg223 Regarding some other ExtensibleLoadManagerImpl related tests (in ExtensibleLoadManagerImplTest and in ServiceUnitStateChannelTest), one detail is about inconsistencies in how the test code changes the primary/leader by closing LeaderElectionImpl. LeaderElectionImpl should stop election operations after it's closed, but making that change will break tests. The draft change is #23995.

I was planning to handle to problem by introducing an internal control interface for LeaderElection

public interface LeaderElectionControl {
    void releasePossibleLeaderRole();
    void skipElections();
    void resumeElections();
}

tests could use this instead of using hacks which they currently rely on. Releasing leadership could be useful also in production code later if there would be a need to let go of the leadership role on one broker. However, this would be initially targeted for tests and not exposed.
the WIP changes are in master...lhotari:pulsar:lh-fix-LeaderElectionImpl-close-wip

@berg223
Copy link
Contributor Author

berg223 commented Jun 8, 2025

Could you please approve the PR to be tested? Is there any concerns? cc @lhotari @nodece

@berg223 Not really any concerns, but I just haven't had a chance to review yet. Regarding testing PRs, you should setup "Personal CI" to be able to run CI on your own while you are working on the PR. It's about enabling GitHub Actions in your own fork and opening a second PR in your own fork for the same PR branch to be able to run tests without approvals. This is explained in https://pulsar.apache.org/contribute/personal-ci/.

Thanks for detail instructments. It's very valueful for me!

@berg223
Copy link
Contributor Author

berg223 commented Jun 8, 2025

@berg223 Regarding some other ExtensibleLoadManagerImpl related tests (in ExtensibleLoadManagerImplTest and in ServiceUnitStateChannelTest), one detail is about inconsistencies in how the test code changes the primary/leader by closing LeaderElectionImpl. LeaderElectionImpl should stop election operations after it's closed, but making that change will break tests. The draft change is #23995.

I was planning to handle to problem by introducing an internal control interface for LeaderElection

public interface LeaderElectionControl {
    void releasePossibleLeaderRole();
    void skipElections();
    void resumeElections();
}

tests could use this instead of using hacks which they currently rely on. Releasing leadership could be useful also in production code later if there would be a need to let go of the leadership role on one broker. However, this would be initially targeted for tests and not exposed. the WIP changes are in master...lhotari:pulsar:lh-fix-LeaderElectionImpl-close-wip

We are spying the case that our broker cannot connect to metadata store and the leadership hasn't change. The expected behavior should be that the loadbalanceManager of leader will reassign the bundles to other broker. So we won't use LeaderElectionControl in this case.

However, we can consider the clear and resuseable way to spy this case similiar to LeaderElectionControl. But I don't have a good idea for now.

@berg223
Copy link
Contributor Author

berg223 commented Jun 8, 2025

@lhotari I have setup the Personal CI. And all CI succeed except Upload Coverage (pull_request) . Is it okay to begin the testing after this PR reviewed?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
doc-not-needed Your PR changes do not impact docs ready-to-test
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Flaky-test: ExtensibleLoadManagerImplTest.testLoadBalancerServiceUnitTableViewSyncer
2 participants