Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test: test_flipping_decommission_recommission timing out #7874

Closed
dotnwat opened this issue Dec 20, 2022 · 8 comments · Fixed by #7862, #7949 or #8167
Closed

test: test_flipping_decommission_recommission timing out #7874

dotnwat opened this issue Dec 20, 2022 · 8 comments · Fixed by #7862, #7949 or #8167
Assignees
Labels
area/controller ci-failure kind/bug Something isn't working sev/low Bugs which are non-functional paper cuts, e.g. typos, issues in log messages

Comments

@dotnwat
Copy link
Member

dotnwat commented Dec 20, 2022

https://buildkite.com/redpanda/redpanda/builds/20055#01852e27-6f45-4278-ac4f-b783fe03b4e6

The test checks if reconfiguration reports a decommissioned node. It appears that perhaps the decommission happened so quickly that it doesn't show up in the reconfigurations output.

[INFO  - 2022-12-20 06:44:37,763 - nodes_decommissioning_test - test_flipping_decommission_recommission - lineno:547]: decommissioning node: 2
[DEBUG - 2022-12-20 06:44:37,763 - admin - decommission_broker - lineno:468]: decommissioning brokers/2/decommission
[DEBUG - 2022-12-20 06:44:37,763 - admin - _request - lineno:305]: Dispatching put http://docker-rp-20:9644/v1/brokers/2/decommission
[DEBUG - 2022-12-20 06:44:37,766 - admin - _request - lineno:334]: Response OK
[DEBUG - 2022-12-20 06:44:37,767 - admin - _request - lineno:305]: Dispatching get http://docker-rp-17:9644/v1/partitions/reconfigurations
[DEBUG - 2022-12-20 06:44:37,770 - admin - _request - lineno:328]: Response OK, JSON: []
[DEBUG - 2022-12-20 06:44:38,773 - admin - _request - lineno:305]: Dispatching get http://docker-rp-17:9644/v1/partitions/reconfigurations
[DEBUG - 2022-12-20 06:44:38,778 - admin - _request - lineno:328]: Response OK, JSON: []
[DEBUG - 2022-12-20 06:44:39,779 - admin - _request - lineno:305]: Dispatching get http://docker-rp-17:9644/v1/partitions/reconfigurations
[DEBUG - 2022-12-20 06:44:39,782 - admin - _request - lineno:328]: Response OK, JSON: []
[DEBUG - 2022-12-20 06:44:40,785 - admin - _request - lineno:305]: Dispatching get http://docker-rp-17:9644/v1/partitions/reconfigurations
[DEBUG - 2022-12-20 06:44:40,789 - admin - _request - lineno:328]: Response OK, JSON: []
[DEBUG - 2022-12-20 06:44:41,791 - admin - _request - lineno:305]: Dispatching get http://docker-rp-17:9644/v1/partitions/reconfigurations
[DEBUG - 2022-12-20 06:44:41,795 - admin - _request - lineno:328]: Response OK, JSON: []
[DEBUG - 2022-12-20 06:44:42,797 - admin - _request - lineno:305]: Dispatching get http://docker-rp-17:9644/v1/partitions/reconfigurations
[DEBUG - 2022-12-20 06:44:42,800 - admin - _request - lineno:328]: Response OK, JSON: []
[DEBUG - 2022-12-20 06:44:43,801 - admin - _request - lineno:305]: Dispatching get http://docker-rp-17:9644/v1/partitions/reconfigurations
[DEBUG - 2022-12-20 06:44:43,805 - admin - _request - lineno:328]: Response OK, JSON: []
[DEBUG - 2022-12-20 06:44:44,807 - admin - _request - lineno:305]: Dispatching get http://docker-rp-17:9644/v1/partitions/reconfigurations
[DEBUG - 2022-12-20 06:44:44,814 - admin - _request - lineno:328]: Response OK, JSON: []
[DEBUG - 2022-12-20 06:44:45,817 - admin - _request - lineno:305]: Dispatching get http://docker-rp-17:9644/v1/partitions/reconfigurations
[DEBUG - 2022-12-20 06:44:45,822 - admin - _request - lineno:328]: Response OK, JSON: []
[DEBUG - 2022-12-20 06:44:46,825 - admin - _request - lineno:305]: Dispatching get http://docker-rp-17:9644/v1/partitions/reconfigurations
[DEBUG - 2022-12-20 06:44:46,828 - admin - _request - lineno:328]: Response OK, JSON: []
[DEBUG - 2022-12-20 06:44:47,828 - admin - _request - lineno:305]: Dispatching get http://docker-rp-17:9644/v1/partitions/reconfigurations
[DEBUG - 2022-12-20 06:44:47,834 - admin - _request - lineno:328]: Response OK, JSON: []
[DEBUG - 2022-12-20 06:44:48,835 - admin - _request - lineno:305]: Dispatching get http://docker-rp-17:9644/v1/partitions/reconfigurations
[DEBUG - 2022-12-20 06:44:48,838 - admin - _request - lineno:328]: Response OK, JSON: []
[DEBUG - 2022-12-20 06:44:49,841 - admin - _request - lineno:305]: Dispatching get http://docker-rp-17:9644/v1/partitions/reconfigurations
[DEBUG - 2022-12-20 06:44:49,844 - admin - _request - lineno:328]: Response OK, JSON: []
[DEBUG - 2022-12-20 06:44:50,845 - admin - _request - lineno:305]: Dispatching get http://docker-rp-17:9644/v1/partitions/reconfigurations
[DEBUG - 2022-12-20 06:44:50,848 - admin - _request - lineno:328]: Response OK, JSON: []
[DEBUG - 2022-12-20 06:44:51,850 - admin - _request - lineno:305]: Dispatching get http://docker-rp-17:9644/v1/partitions/reconfigurations
[DEBUG - 2022-12-20 06:44:51,854 - admin - _request - lineno:328]: Response OK, JSON: []
[ERROR - 2022-12-20 06:44:52,864 - cluster - wrapped - lineno:41]: Test failed, doing failure checks...
Traceback (most recent call last):
  File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/nodes_decommissioning_test.py", line 549, in test_flipping_decommission_recommission
    wait_until(lambda: self._partitions_moving(node=survivor_node),
  File "/usr/local/lib/python3.10/dist-packages/ducktape/utils/util.py", line 57, in wait_until
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError
@dotnwat dotnwat added kind/bug Something isn't working area/controller ci-failure sev/low Bugs which are non-functional paper cuts, e.g. typos, issues in log messages labels Dec 20, 2022
@abhijat
Copy link
Contributor

abhijat commented Dec 21, 2022

@abhijat
Copy link
Contributor

abhijat commented Dec 21, 2022

seen in https://buildkite.com/redpanda/redpanda/builds/20139#0185347e-9dca-450c-b2b3-e4416f06419b for PR #7890

test_id:    rptest.tests.nodes_decommissioning_test.NodesDecommissioningTest.test_flipping_decommission_recommission
--
  | status:     FAIL
  | run time:   2 minutes 6.743 seconds
  |  
  |  
  | TimeoutError('')
  | Traceback (most recent call last):
  | File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
  | data = self.run_test()
  | File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
  | return self.test_context.function(self.test)
  | File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
  | r = f(self, *args, **kwargs)
  | File "/root/tests/rptest/tests/nodes_decommissioning_test.py", line 549, in test_flipping_decommission_recommission
  | wait_until(lambda: self._partitions_moving(node=survivor_node),
  | File "/usr/local/lib/python3.10/dist-packages/ducktape/utils/util.py", line 57, in wait_until
  | raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
  | ducktape.errors.TimeoutError

@ztlpn
Copy link
Contributor

ztlpn commented Dec 22, 2022

reopening as this was seen again in https://buildkite.com/redpanda/redpanda/builds/20242#01853b28-063a-446c-9dfb-6d5bbe595e56 and the build there included the code from #7862

@mmaslankaprv
Copy link
Member

fixing

mmaslankaprv added a commit to mmaslankaprv/redpanda that referenced this issue Dec 23, 2022
Added check if the decommissioning finished before requesting
recommission. If the node was already delete it is fine and we can pass
the test.

Fixes: redpanda-data#7874

Signed-off-by: Michal Maslanka <michal@redpanda.com>
mmaslankaprv added a commit to mmaslankaprv/redpanda that referenced this issue Dec 23, 2022
Added check if the decommissioning finished before requesting
recommission. If the node was already delete it is fine and we can pass
the test.

Fixes: redpanda-data#7874

Signed-off-by: Michal Maslanka <michal@redpanda.com>
@graphcareful
Copy link
Contributor

Seen here 3x in a row: #7900

@ballard26
Copy link
Contributor

Seen here in this PR

@BenPope
Copy link
Member

BenPope commented Dec 30, 2022

FAIL test: NodesDecommissioningTest.test_flipping_decommission_recommission (5/153 runs)
failure at 2022-12-23T14:33:49.336Z: TimeoutError('')
on (amd64, container) in job https://buildkite.com/redpanda/redpanda/builds/20304#01853f40-2882-49e4-bc20-3280c14e8740
failure at 2022-12-24T07:10:56.631Z: TimeoutError('')
on (amd64, container) in job https://buildkite.com/redpanda/redpanda/builds/20344#018542d4-d6f3-48d7-8774-c9daf2bd7be9
failure at 2022-12-24T06:13:20.035Z: TimeoutError('')
on (amd64, container) in job https://buildkite.com/redpanda/redpanda/builds/20340#018542a0-1ca3-437d-b907-9f0f0eda1265
failure at 2022-12-24T15:28:11.263Z: TimeoutError('')
on (amd64, VM) in job https://buildkite.com/redpanda/vtools/builds/4894#0185432b-ee15-4712-b049-dd6f66506ba4
failure at 2022-12-24T04:13:23.294Z: TimeoutError('')
on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4891#01854096-20d5-401c-b8b1-94e4d21e8e45

mmaslankaprv added a commit to mmaslankaprv/redpanda that referenced this issue Jan 2, 2023
Added check if the decommissioning finished before requesting
recommission. If the node was already delete it is fine and we can pass
the test.

Fixes: redpanda-data#7874

Signed-off-by: Michal Maslanka <michal@redpanda.com>
@andrwng
Copy link
Contributor

andrwng commented Jan 9, 2023

I think we're still seeing this. Just hit in CI: https://buildkite.com/redpanda/redpanda/builds/20824#0185953c-063a-4a75-9efe-fd840722a9a8

The error looks to be in a slightly different spot, but it's still a timeout:

====================================================================================================
test_id:    rptest.tests.nodes_decommissioning_test.NodesDecommissioningTest.test_flipping_decommission_recommission
status:     FAIL
run time:   5 minutes 16.637 seconds


    TimeoutError('')
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/nodes_decommissioning_test.py", line 567, in test_flipping_decommission_recommission
    wait_until(lambda: self._node_removed(node_id, survivor_node),
  File "/usr/local/lib/python3.10/dist-packages/ducktape/utils/util.py", line 57, in wait_until
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError

@andrwng andrwng reopened this Jan 9, 2023
mmaslankaprv added a commit to mmaslankaprv/redpanda that referenced this issue Jan 9, 2023
Adjusted node decommission test timeouts to allow tests to pass with
slow debug builds.

Fixes: redpanda-data#7874

Signed-off-by: Michal Maslanka <michal@redpanda.com>
mmaslankaprv added a commit to mmaslankaprv/redpanda that referenced this issue Jan 9, 2023
Adjusted node decommission test timeouts to allow tests to pass with
slow debug builds.

Fixes: redpanda-data#7874

Signed-off-by: Michal Maslanka <michal@redpanda.com>
mmaslankaprv added a commit to mmaslankaprv/redpanda that referenced this issue Mar 6, 2023
Added check if the decommissioning finished before requesting
recommission. If the node was already delete it is fine and we can pass
the test.

Fixes: redpanda-data#7874

Signed-off-by: Michal Maslanka <michal@redpanda.com>
(cherry picked from commit 3142bac)
mmaslankaprv added a commit to mmaslankaprv/redpanda that referenced this issue Mar 14, 2023
Added check if the decommissioning finished before requesting
recommission. If the node was already delete it is fine and we can pass
the test.

Fixes: redpanda-data#7874

Signed-off-by: Michal Maslanka <michal@redpanda.com>
(cherry picked from commit 3142bac)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment