CI Failure: PartitionBalancerScaleTest.test_node_operations_at_scale.type=many_partitions kgo verifier service timeout #8290

Closed
abhijat opened this issue Jan 18, 2023 · 1 comment · Fixed by #8387
abhijat commented Jan 18, 2023

Module: rptest.scale_tests.partition_balancer_scale_test
Class:  PartitionBalancerScaleTest
Method: test_node_operations_at_scale
Arguments:
{
  "type": "many_partitions"
}

https://buildkite.com/redpanda/vtools/builds/5274#0185bec3-01b7-464f-8830-acb66190c004

Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/mark/_mark.py", line 476, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/home/ubuntu/redpanda/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/home/ubuntu/redpanda/tests/rptest/scale_tests/partition_balancer_scale_test.py", line 292, in test_node_operations_at_scale
    self.verify(topic.name, message_size, consumers)
  File "/home/ubuntu/redpanda/tests/rptest/scale_tests/partition_balancer_scale_test.py", line 71, in verify
    self.producer.wait()
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/services/service.py", line 261, in wait
    if not self.wait_node(node, end - now):
  File "/home/ubuntu/redpanda/tests/rptest/services/kgo_verifier_services.py", line 595, in wait_node
    self._status_thread.raise_on_error()
  File "/home/ubuntu/redpanda/tests/rptest/services/kgo_verifier_services.py", line 247, in raise_on_error
    raise self._ex
  File "/home/ubuntu/redpanda/tests/rptest/services/kgo_verifier_services.py", line 254, in run
    self.poll_status()
  File "/home/ubuntu/redpanda/tests/rptest/services/kgo_verifier_services.py", line 309, in poll_status
    r = requests.get(self._parent._remote_url(self._node, "status"),
  File "/usr/local/lib/python3.10/dist-packages/requests/api.py", line 76, in get
    return request('get', url, params=params, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/requests/api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/requests/sessions.py", line 530, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/requests/sessions.py", line 643, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/requests/adapters.py", line 516, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='ip-172-31-10-105', port=8080): Max retries exceeded with url: /status (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f4c6f5ec370>: Failed to establish a new connection: [Errno 111] Connection refused'))

The error seems to occur when connecting to the kgo verifier service's host and port:

host='ip-172-31-10-105', port=8080

[INFO  - 2023-01-17 13:38:14,926 - service - stop - lineno:283]: KgoVerifierConsumerGroupConsumer-0-139966264358176 node 1 on ip-172-31-10-105: stopping node
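
`[Errno 111] Connection refused` in the traceback above means nothing was accepting connections on that port when the status poll ran. A minimal sketch for checking this by hand, assuming only the Python standard library; `port_accepts_connections` is a hypothetical helper, not part of rptest or kgo-verifier:

```python
import socket


def port_accepts_connections(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) can be established."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers ConnectionRefusedError (errno 111 on Linux), timeouts,
        # and name-resolution failures.
        return False


if __name__ == "__main__":
    # Host and port taken from the error message in this issue.
    print(port_accepts_connections("ip-172-31-10-105", 8080))
```

A `False` result here would be consistent with the verifier process having already exited before the status thread polled it.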
@abhijat added the kind/bug and ci-failure labels Jan 18, 2023
@mmaslankaprv self-assigned this Jan 23, 2023
@mmaslankaprv

It looks like the consumer stopped because of:
time="2023-01-17T13:29:43Z" level=error msg="Produce failed: UNKNOWN_TOPIC_OR_PARTITION: This server does not host this topic-partition."

mmaslankaprv added a commit to mmaslankaprv/redpanda that referenced this issue Jan 24, 2023
When moving a partition, the metadata may not reflect the actual replica
placement, as the actual data movement may take time. If the partition
metadata is correct but the partition does not yet exist on a node, we must
return a retryable error to the client. It is possible that by the time the
next client request is handled, the partition will already be available on
the node.

Changed the error returned when the partition shard wasn't found from
`UNKNOWN_TOPIC_OR_PARTITION`, which is not retryable, to
`NOT_LEADER_FOR_PARTITION`, which forces the producer to retry.

Fixes: redpanda-data#8290

Signed-off-by: Michal Maslanka <michal@redpanda.com>
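
The effect of the error-code change on a client can be sketched as follows. This is an illustrative model only: `ProduceError`, `RETRYABLE`, and `produce_with_retry` are hypothetical names, and the retryable/non-retryable split simply follows the commit message's description rather than any real client's error table.

```python
import time

# Assumption: classification mirrors the commit message, not a real client library.
RETRYABLE = {"NOT_LEADER_FOR_PARTITION"}


class ProduceError(Exception):
    """Hypothetical error carrying a Kafka-style error code name."""

    def __init__(self, code: str):
        super().__init__(code)
        self.code = code


def produce_with_retry(send, max_attempts: int = 5, backoff_s: float = 0.1):
    """Call send() until it succeeds, retrying only on retryable error codes."""
    for attempt in range(1, max_attempts + 1):
        try:
            return send()
        except ProduceError as e:
            # Non-retryable codes, or exhausted attempts, propagate to the caller.
            if e.code not in RETRYABLE or attempt == max_attempts:
                raise
            time.sleep(backoff_s)
```

Under this model, a broker that answers `NOT_LEADER_FOR_PARTITION` while the partition is still being moved gives the producer a chance to succeed on a later attempt, whereas `UNKNOWN_TOPIC_OR_PARTITION` fails the produce call immediately.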