pp: broker hang with gate_closed_exception on fail to remove consumer #9310
Comments
@NyaliaLui how can this be reproduced?
To reproduce this, add the following Python code to the ducktape tests:

```python
class SampleTest(PandaProxyTestMethods):
    def __init__(self, context):
        super(SampleTest, self).__init__(context)

    @cluster(num_nodes=3)
    def test_restart_http_proxy(self):
        self.topics = [TopicSpec(partition_count=3)]
        self._create_initial_topics()
        topic_name = self.topic

        data = '''
        {
            "records": [
                {"value": "dmVjdG9yaXplZA==", "partition": 0},
                {"value": "cGFuZGFwcm94eQ==", "partition": 1},
                {"value": "bXVsdGlicm9rZXI=", "partition": 2}
            ]
        }'''

        self.logger.info(f"Producing to topic: {topic_name}")
        p_res = self._produce_topic(topic_name, data)
        assert p_res.status_code == requests.codes.ok

        self.logger.info("Check consumer offsets")
        group_id = f"pandaproxy-group-{uuid.uuid4()}"

        self.logger.debug(f"Create a consumer and subscribe to topic: {topic_name}")
        cc_res = self._create_consumer(group_id)
        assert cc_res.status_code == requests.codes.ok

        c0 = Consumer(cc_res.json(), self.logger)
        sc_res = c0.subscribe([topic_name])
        assert sc_res.status_code == requests.codes.no_content

        parts = [0, 1, 2]
        sco_req = dict(partitions=[
            dict(topic=topic_name, partition=p, offset=0)
            for p in parts
        ])
        co_res_raw = c0.set_offsets(data=json.dumps(sco_req))
        assert co_res_raw.status_code == requests.codes.no_content

        self.logger.debug("Check consumer offsets")
        co_req = dict(partitions=[
            dict(topic=topic_name, partition=p) for p in parts
        ])
        offset_result_raw = c0.get_offsets(data=json.dumps(co_req))
        assert offset_result_raw.status_code == requests.codes.ok

        res = offset_result_raw.json()
        # Should be one offset for each partition
        assert len(res["offsets"]) == len(parts)
        for r in res["offsets"]:
            assert r["topic"] == topic_name
            assert r["partition"] in parts
            assert r["offset"] == 0
            assert r["metadata"] == ""

        self.logger.debug("Restart the http proxy")
        admin = Admin(self.redpanda)
        result_raw = admin.redpanda_services_restart(rp_service='http-proxy')
        check_service_restart(self.redpanda, "Restarting the http proxy")
        self.logger.debug(result_raw)
        assert result_raw.status_code == requests.codes.ok

        print('check offset begin')
        self.logger.debug("Check consumer offsets after restart")
        offset_result_raw = c0.get_offsets(data=json.dumps(co_req))
        assert offset_result_raw.status_code == requests.codes.ok

        res = offset_result_raw.json()
        # Should be one offset for each partition
        assert len(res["offsets"]) == len(parts)
        for r in res["offsets"]:
            assert r["topic"] == topic_name
            assert r["partition"] in parts
            assert r["offset"] == 0
            assert r["metadata"] == ""
        print('check offset end')
```

CC: @dotnwat
After spending some hours investigating this, I determined that the broker hang happens after restarting the http proxy, when we issue a REST request to get consumer offsets:

```python
self.logger.debug("Restart the http proxy")
admin = Admin(self.redpanda)
result_raw = admin.redpanda_services_restart(rp_service='http-proxy')
check_service_restart(self.redpanda, "Restarting the http proxy")
self.logger.debug(result_raw)
assert result_raw.status_code == requests.codes.ok

print('check offset begin')
self.logger.debug("Check consumer offsets after restart")
offset_result_raw = c0.get_offsets(data=json.dumps(co_req))  # <-- problem happens here
assert offset_result_raw.status_code == requests.codes.ok

res = offset_result_raw.json()
# Should be one offset for each partition
assert len(res["offsets"]) == len(parts)
for r in res["offsets"]:
    assert r["topic"] == topic_name
    assert r["partition"] in parts
    assert r["offset"] == 0
    assert r["metadata"] == ""
print('check offset end')
```

Furthermore, I tried running this test manually on my local machine. The results were that the brokers did shut down gracefully (I used SIGTERM, just like we do in …). Therefore, I suspect that this issue has more to do with my ducktape test in #9105 and that this may not be a real bug. I'll mark this issue as …
So, it turns out the root cause and solution are simpler than previously thought. When a user creates a consumer through the REST API, the client eventually runs the following continuation:

```cpp
.then([this, group_id, name](shared_broker_t coordinator) mutable {
    // Called when the consumer is stopped
    auto on_stopped = [this, group_id](const member_id& name) {
        _consumers[group_id].erase(name);
    };
    // Creates a consumer as a lw_shared_ptr, alias is shared_consumer_t
    return make_consumer(
      _config,
      _topic_cache,
      _brokers,
      std::move(coordinator),
      std::move(group_id),
      std::move(name),
      std::move(on_stopped));
})
.then([this, group_id](shared_consumer_t c) {
    auto name = c->name();
    // Add the consumer to the _consumers map
    _consumers[group_id].insert(std::move(c));
    return name;
});
```

When a user restarts the PP, each consumer is stopped and `on_stopped` is expected to remove it from the map:

```cpp
// Called when the consumer is stopped
auto on_stopped = [this, group_id](const member_id& name) {
    _consumers[group_id].erase(name); // <-- finds the consumer to erase by name
};
```

Now let's look at the definition of `consumer::stop()`:

```cpp
ss::future<> consumer::stop() {
    // .. timers and stuff are canceled ..
    _on_stopped(_name); // <-- Problem is here
    if (_as.abort_requested()) {
        return ss::now();
    }
    _as.request_abort();
    return _coordinator->stop()
      .then([this]() { return _gate.close(); }) // <-- gate is closed
      .finally([me{shared_from_this()}] {});
}
```

We see that `stop()` passes `_name` directly to `_on_stopped`. Compare that with `consumer::name()`:

```cpp
const kafka::member_id& name() const {
    // kafka::no_member is an alias for empty string
    return _name != kafka::no_member ? _name : _member_id;
}
```

This is interesting: when we insert a consumer into the `_consumers` map, the hash is computed from `consumer::name()`, not from `_name`:

```cpp
struct consumer_hash {
    // Hash based on consumer.name or the member_id
    using is_transparent = void;
    size_t operator()(const member_id& id) const {
        return absl::Hash<member_id>{}(id);
    }
    size_t operator()(const consumer& c) const { return (*this)(c.name()); }
    size_t operator()(const shared_consumer_t& c) const {
        return (*this)(c->name()); // <-- this is called when the consumer is inserted into the map
    }
};
```

So, we inserted the consumer based on a hash of `name()`, but we try to erase it with `_name`, which is empty for an unnamed consumer. The fix is to pass the result of `name()` to `_on_stopped`:

```cpp
ss::future<> consumer::stop() {
    // .. timers and stuff are canceled ..
    _on_stopped(name()); // <-- Solution here
    if (_as.abort_requested()) {
        return ss::now();
    }
    _as.request_abort();
    return _coordinator->stop()
      .then([this]() { return _gate.close(); })
      .finally([me{shared_from_this()}] {});
}
```
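To see the key mismatch in isolation, here is a minimal, self-contained sketch. This is not redpanda code: `fake_consumer`, `consumer_eq`, and the container setup are illustrative assumptions modeled on the snippets above.

```cpp
// Sketch of inserting by name() but erasing by _name in a transparent hash set.
#include <absl/container/flat_hash_set.h>
#include <absl/hash/hash.h>
#include <iostream>
#include <memory>
#include <string>

struct fake_consumer {
    std::string _name;      // empty for an "unnamed" consumer
    std::string _member_id; // assigned by the group coordinator
    const std::string& name() const { return _name.empty() ? _member_id : _name; }
};
using shared_consumer = std::shared_ptr<fake_consumer>;

struct consumer_hash {
    using is_transparent = void;
    size_t operator()(const std::string& id) const { return absl::Hash<std::string>{}(id); }
    size_t operator()(const shared_consumer& c) const { return (*this)(c->name()); }
};

struct consumer_eq {
    using is_transparent = void;
    bool operator()(const shared_consumer& a, const shared_consumer& b) const {
        return a->name() == b->name();
    }
    bool operator()(const shared_consumer& a, const std::string& id) const { return a->name() == id; }
    bool operator()(const std::string& id, const shared_consumer& a) const { return a->name() == id; }
};

int main() {
    absl::flat_hash_set<shared_consumer, consumer_hash, consumer_eq> group;

    auto c = std::make_shared<fake_consumer>(fake_consumer{"", "member-42"});
    group.insert(c); // stored under a hash of name(), i.e. "member-42"

    group.erase(c->_name); // erase("") matches nothing -- this is the bug
    std::cout << "after erase(_name):  size = " << group.size() << '\n'; // prints 1

    group.erase(c->name()); // erase("member-42") -- this is the fix
    std::cout << "after erase(name()): size = " << group.size() << '\n'; // prints 0
    return 0;
}
```

The `erase(c->_name)` call silently removes nothing, which is exactly what `on_stopped(_name)` did for an unnamed consumer.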
The reason why this manifested as a `gate_closed_exception` becomes clear from `client::stop()` and `consumer::leave()`:

```cpp
ss::future<> client::stop() noexcept {
    co_await _gate.close();
    co_await catch_and_log([this]() { return _producer.stop(); });
    for (auto& [id, group] : _consumers) {
        while (!group.empty()) {
            auto c = *group.begin();
            co_await catch_and_log([c]() {
                // The consumer is constructed with an on_stopped which erases
                // itself from the map after leave() completes.
                return c->leave(); // <-- issues leave_group_req and eventually calls consumer::stop()
            });
        }
    }
    co_await catch_and_log([this]() { return _brokers.stop(); });
}

ss::future<leave_group_response> consumer::leave() {
    auto req_builder = [this] {
        return leave_group_request{
          .data{.group_id = _group_id, .member_id = _member_id}};
    };
    // Leave the group before calling stop()
    return req_res(std::move(req_builder)).finally([me{shared_from_this()}]() {
        return me->stop();
    });
}

template<typename RequestFactory>
ss::future<
  typename std::invoke_result_t<RequestFactory>::api_type::response_type>
req_res(RequestFactory req) {
    using api_t = typename std::invoke_result_t<RequestFactory>::api_type;
    using response_t = typename api_t::response_type;
    return ss::try_with_gate(_gate, [this, req{std::move(req)}]() { // <-- attempt to enter gate
        auto r = req();
        kclog.debug("Consumer: {}: {} req: {}", *this, api_t::name, r);
        return _coordinator->dispatch(std::move(r))
          .then([this](response_t res) {
              kclog.debug(
                "Consumer: {}: {} res: {}", *this, api_t::name, res);
              return res;
          });
    });
}

ss::future<> consumer::stop() {
    // .. timers and stuff are canceled ..
    _on_stopped(_name); // <-- Empty string passed in
    if (_as.abort_requested()) {
        return ss::now();
    }
    _as.request_abort();
    return _coordinator->stop()
      .then([this]() { return _gate.close(); }) // <-- gate closed here
      .finally([me{shared_from_this()}] {});
}

auto on_stopped = [this, group_id](const member_id& name) {
    _consumers[group_id].erase(name); // <-- silently fails because name is empty
};
```

Since the consumer is never erased from `_consumers`, the `while (!group.empty())` loop in `client::stop()` never terminates: every iteration calls `leave()` again, `req_res()` tries to enter a gate that the first `stop()` already closed, and each attempt throws a `gate_closed_exception`.
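For intuition, here is a minimal, self-contained sketch of that shutdown loop. This is plain C++, not Seastar or redpanda code: `fake_gate` stands in for `ss::gate`, and the retry loop is bounded only so the demo terminates.

```cpp
// Sketch: a consumer that is never erased keeps the shutdown loop hitting a closed gate.
#include <iostream>
#include <set>
#include <stdexcept>
#include <string>

struct gate_closed_exception : std::runtime_error {
    gate_closed_exception() : std::runtime_error("gate closed") {}
};

// Once closed, any attempt to run work through the gate fails, similar in
// spirit to ss::try_with_gate on a closed gate.
struct fake_gate {
    bool closed = false;
    void close() { closed = true; }
    template <typename Func>
    void try_with_gate(Func f) {
        if (closed) {
            throw gate_closed_exception{};
        }
        f();
    }
};

int main() {
    // One consumer, keyed by name() ("member-42"), still sitting in the group map.
    std::set<std::string> group{"member-42"};
    fake_gate consumer_gate;

    // First leave()/stop(): on_stopped(_name) erases by the empty name (removes
    // nothing), then the consumer's gate is closed.
    group.erase("");       // like _consumers[group_id].erase(_name) with _name == ""
    consumer_gate.close(); // like _gate.close() in consumer::stop()

    // client::stop(): "while (!group.empty())" would spin forever; every retry of
    // leave() re-enters the already-closed gate and throws. Bounded to 3 tries here.
    for (int attempt = 0; attempt < 3 && !group.empty(); ++attempt) {
        try {
            consumer_gate.try_with_gate([] { /* leave_group request would go here */ });
        } catch (const gate_closed_exception& e) {
            std::cout << "attempt " << attempt << ": " << e.what() << '\n';
        }
    }
    std::cout << "group still has " << group.size() << " consumer(s)\n";
    return 0;
}
```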
Because this is a real bug that resulted in a crash, I'll change the severity tag accordingly.
Previously, the value of `_name` was passed into `on_stopped`. This is a problem in the case of an unnamed consumer because `_name` will be an empty string. Instead, use the result of `consumer::name()`. This fixes an issue where a kafka client fails to remove a consumer from its internal map of consumers, which can lead to an infinite loop of gate_closed_exceptions. Fixes redpanda-data#9310
awesome analysis @NyaliaLui
Previously, the value of `_name` was passed into `on_stopped`. This is a problem in the case of an unnamed consumer because `_name` will be an empty string. Instead, use the result of `consumer::name()`.

Here is the definition of `on_stopped`:

```cpp
auto on_stopped = [this, group_id](const member_id& name) {
    _consumers[group_id].erase(name); // silently fails if name is empty
};
```

The result of passing an empty string to `on_stopped` is that the consumer is not erased from the map. This is an issue because the client expects the map to eventually become empty. From `client::stop()`:

```cpp
for (auto& [id, group] : _consumers) {
    while (!group.empty()) { // group is never empty, so infinite loop
        auto c = *group.begin();
        co_await catch_and_log([c]() { return c->leave(); });
    }
}
```

As a result, the while loop becomes an infinite loop since the consumer is never removed.

Fixes redpanda-data#9310
This was awesome work, @NyaliaLui!
RP will hang / not shut down with many "gate closed exceptions" when someone fails to remove a consumer. I'll create a ticket for this and do some testing and more investigation into solutions, but I will also check with @BenPope about this next week because I may be doing something wrong as well.
Originally posted by @NyaliaLui in #9105 (comment)