Errors observed:
WARN "failed to fetch topic configs to return cleanup.policy" error="context deadline exceeded"
WARN "failed to describe log dirs from some shards" failed_shards=1
WARN "shard error for describing log dirs" broker_id=X error="the internal broker struct chosen to issue this request has died"
Description:
GetClusterInfo (backend/pkg/console/cluster_info.go, line 56 at 534690f) uses a hardcoded 6s timeout for DescribeLogDirs + Metadata:
`childCtx, cancel := context.WithTimeout(egCtx, 6*time.Second)`
and GetTopicsOverview (backend/pkg/console/topic_overview.go, line 80 at 534690f) uses a 5s timeout for DescribeConfigs:
`childCtx, cancel := context.WithTimeout(ctx, 5*time.Second)`
Neither timeout is configurable.
On large AWS MSK clusters (thousands of partitions) with IAM authentication, these timeouts can be insufficient, possibly due to the overhead of the STS token exchange for IAM. The same cluster works fine with mTLS auth.
Debug Logs
12:34:30.625 wrote DescribeConfigs v4 broker=2 bytes_written=214330
12:34:35.566 read DescribeConfigs v4 broker=2 bytes_read=0 time_to_read=4.941s err="context deadline exceeded"
12:34:35.566 read from broker errored, killing connection broker=2 successful_reads=0
12:34:35.571 failed to describe log dirs from some shards failed_shards=1
12:34:35.571 shard error for describing log dirs broker_id=2 error="broker struct has died"
12:33:10.460 read Metadata v12 bytes_read=505207 time_to_read=5.790s
12:34:30.561 read Metadata v12 bytes_read=505207 time_to_read=3.770s
Should these timeouts be configurable via the kafka config section?
I see the code comment that says "shorter timeout because otherwise we'll potentially have very long response times in case of a single broker being down", but shouldn't that trade-off be left to the user?
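For illustration, a setting like this in the existing kafka config section would cover both call sites (the option name is hypothetical and does not exist today):

```yaml
kafka:
  # Hypothetical option: would override the hardcoded 6s/5s request
  # timeouts used by GetClusterInfo and GetTopicsOverview.
  requestTimeout: 15s
```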