
CI Failure (internal redpanda assert!) in NodesDecommissioningTest.test_flipping_decommission_recommission #8218

Closed
rystsov opened this issue Jan 13, 2023 · 3 comments · Fixed by #8245
Assignee: mmaslankaprv
Labels: area/controller, ci-failure, kind/bug (Something isn't working), sev/high (loss of availability, pathological performance degradation, recoverable corruption)

Comments


rystsov commented Jan 13, 2023

https://buildkite.com/redpanda/redpanda/builds/21131#0185aa2a-f6ee-4d97-a593-ba8122fc740d

Module: rptest.tests.nodes_decommissioning_test
Class:  NodesDecommissioningTest
Method: test_flipping_decommission_recommission
test_id:    rptest.tests.nodes_decommissioning_test.NodesDecommissioningTest.test_flipping_decommission_recommission
status:     FAIL
run time:   1 minute 27.572 seconds

    <NodeCrash docker-rp-11: ERROR 2023-01-13 08:46:56,918 [shard 0] assert - Assert failure: (/var/lib/buildkite-agent/builds/buildkite-amd64-builders-i-0973197ceee080464-1/redpanda/redpanda/src/v/cluster/members_backend.cc:663) 'it != _decommission_command_revision.end()' members backend should hold a revision of nodes being decommissioned, node_id: 4
>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/urllib3/connection.py", line 159, in _new_conn
    conn = connection.create_connection(
  File "/usr/local/lib/python3.10/dist-packages/urllib3/util/connection.py", line 84, in create_connection
    raise err
  File "/usr/local/lib/python3.10/dist-packages/urllib3/util/connection.py", line 74, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/urllib3/connectionpool.py", line 670, in urlopen
    httplib_response = self._make_request(
  File "/usr/local/lib/python3.10/dist-packages/urllib3/connectionpool.py", line 392, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/usr/lib/python3.10/http/client.py", line 1282, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/lib/python3.10/http/client.py", line 1328, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.10/http/client.py", line 1277, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.10/http/client.py", line 1037, in _send_output
    self.send(msg)
  File "/usr/lib/python3.10/http/client.py", line 975, in send
    self.connect()
  File "/usr/local/lib/python3.10/dist-packages/urllib3/connection.py", line 187, in connect
    conn = self._new_conn()
  File "/usr/local/lib/python3.10/dist-packages/urllib3/connection.py", line 171, in _new_conn
    raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f41a615ca60>: Failed to establish a new connection: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/requests/adapters.py", line 439, in send
    resp = conn.urlopen(
  File "/usr/local/lib/python3.10/dist-packages/urllib3/connectionpool.py", line 726, in urlopen
    retries = retries.increment(
  File "/usr/local/lib/python3.10/dist-packages/urllib3/util/retry.py", line 446, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='docker-rp-11', port=9644): Max retries exceeded with url: /v1/cluster_config/status (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f41a615ca60>: Failed to establish a new connection: [Errno 111] Connection refused'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/nodes_decommissioning_test.py", line 558, in test_flipping_decommission_recommission
    self.redpanda.set_cluster_config({"raft_learner_recovery_rate": 1},
  File "/root/tests/rptest/services/redpanda.py", line 1408, in set_cluster_config
    config_status = wait_until_result(
  File "/root/tests/rptest/util.py", line 90, in wait_until_result
    wait_until(wrapped_condition, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ducktape/utils/util.py", line 53, in wait_until
    raise e
  File "/usr/local/lib/python3.10/dist-packages/ducktape/utils/util.py", line 44, in wait_until
    if condition():
  File "/root/tests/rptest/util.py", line 77, in wrapped_condition
    cond = condition()
  File "/root/tests/rptest/services/redpanda.py", line 1399, in is_ready
    status = admin_client.get_cluster_config_status(
  File "/root/tests/rptest/services/admin.py", line 385, in get_cluster_config_status
    return self._request("GET", "cluster_config/status", node=node).json()
  File "/root/tests/rptest/services/admin.py", line 307, in _request
    r = self._session.request(verb, url, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/requests/sessions.py", line 530, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/requests/sessions.py", line 643, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/requests/adapters.py", line 516, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='docker-rp-11', port=9644): Max retries exceeded with url: /v1/cluster_config/status (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f41a615ca60>: Failed to establish a new connection: [Errno 111] Connection refused'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/root/tests/rptest/services/cluster.py", line 50, in wrapped
    self.redpanda.raise_on_crash()
  File "/root/tests/rptest/services/redpanda.py", line 1490, in raise_on_crash
    raise NodeCrash(crashes)
rptest.services.utils.NodeCrash: <NodeCrash docker-rp-11: ERROR 2023-01-13 08:46:56,918 [shard 0] assert - Assert failure: (/var/lib/buildkite-agent/builds/buildkite-amd64-builders-i-0973197ceee080464-1/redpanda/redpanda/src/v/cluster/members_backend.cc:663) 'it != _decommission_command_revision.end()' members backend should hold a revision of nodes being decommissioned, node_id: 4
>
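
The assert fires in cluster/members_backend.cc when the backend processes a recommission but cannot find the revision of the matching decommission command in its tracking map. Below is a minimal sketch of that lookup pattern, not the actual redpanda source; the type aliases, member names, and use of plain assert (rather than redpanda's vassert) are simplifying assumptions:

```cpp
#include <cassert>
#include <cstdint>
#include <map>

using node_id = int32_t;     // assumed alias; redpanda has model::node_id
using revision_id = int64_t; // assumed alias; redpanda has model::revision_id

struct members_backend {
    // Revision of the controller command that decommissioned each node;
    // consulted later to cancel the partition movements it triggered.
    std::map<node_id, revision_id> _decommission_command_revision;

    // Invoked while processing a recommission update: the backend expects
    // an entry recorded by the earlier decommission of the same node.
    revision_id decommission_revision_of(node_id id) const {
        auto it = _decommission_command_revision.find(id);
        // In this failure the entry for node_id 4 was missing: a newer
        // decommission update raced with the in-flight recommission and
        // overwrote the map, so the lookup failed and the node aborted.
        assert(it != _decommission_command_revision.end());
        return it->second;
    }
};
```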
rystsov added the kind/bug and ci-failure labels on Jan 13, 2023

rystsov commented Jan 13, 2023

@mmaslankaprv the assert you've added is failing

mmaslankaprv (Member) commented:

will look into this. thanks for reporting.

mmaslankaprv self-assigned this on Jan 13, 2023

dotnwat commented Jan 14, 2023

Seen here #8092

jcsp added the area/controller and sev/high labels on Jan 16, 2023
mmaslankaprv added a commit to mmaslankaprv/redpanda that referenced this issue Jan 17, 2023
The members backend tracked the revision of the last node decommission command so that it could cancel all of the related partition movements. That tracking was broken: the revision map could be updated by the next decommission update while the previous recommission was still being processed. To fix this and simplify tracking of the last decommission revision id, the previous decommission revision is now stored inside the recommission update metadata object. This way each recommission action is always paired with the correct decommission revision.

Fixes: redpanda-data#8218

Signed-off-by: Michal Maslanka <michal@redpanda.com>
(cherry picked from commit 51d721b)
mmaslankaprv added two further commits to mmaslankaprv/redpanda that referenced this issue on Jan 17, 2023, each with the same message (cherry picked from commit 51d721b).
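
The fix described above changes where the pairing lives: instead of looking the revision up in mutable backend state, the recommission update itself carries the revision of the decommission it reverts. A hedged sketch of that shape, with assumed names (the actual change is in #8245):

```cpp
#include <cstdint>

using node_id = int32_t;     // assumed alias; see the sketch above
using revision_id = int64_t;

// Hypothetical stand-in for the backend's reconciliation step that
// cancels the partition movements started by a given decommission.
void cancel_partition_movements(node_id, revision_id) { /* ... */ }

struct recommission_update {
    node_id id;
    // Revision of the decommission command this update cancels, captured
    // when the recommission is created rather than looked up later from
    // shared state that a subsequent flip may have already overwritten.
    revision_id decommissioned_in_revision;
};

// The decommission/recommission pairing travels inside the update, so
// rapid flipping (as in test_flipping_decommission_recommission) cannot
// desynchronize a recommission from its decommission revision.
void handle(const recommission_update& update) {
    cancel_partition_movements(update.id, update.decommissioned_in_revision);
}
```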