Hardcoded context timeouts in GetClusterInfo/GetTopicsOverview cause failures on large clusters #2410

@grassiale

Description

Errors observed:

WARN "failed to fetch topic configs to return cleanup.policy" error="context deadline exceeded"
WARN "failed to describe log dirs from some shards" failed_shards=1
WARN "shard error for describing log dirs" broker_id=X error="the internal broker struct chosen to issue this request has died"

Description:

GetClusterInfo uses a hardcoded 6s timeout for DescribeLogDirs + Metadata:

childCtx, cancel := context.WithTimeout(egCtx, 6*time.Second)

GetTopicsOverview uses a hardcoded 5s timeout for DescribeConfigs:

childCtx, cancel := context.WithTimeout(ctx, 5*time.Second)

Neither timeout is configurable.

On large AWS MSK clusters (thousands of partitions) with IAM authentication, these timeouts may be too short, possibly because of the overhead of the STS token exchange with IAM.

The same cluster works fine with mTLS auth.

Debug Logs

12:34:30.625 wrote DescribeConfigs v4  broker=2  bytes_written=214330
12:34:35.566 read  DescribeConfigs v4  broker=2  bytes_read=0  time_to_read=4.941s  err="context deadline exceeded"
12:34:35.566 read from broker errored, killing connection  broker=2  successful_reads=0
12:34:35.571 failed to describe log dirs from some shards  failed_shards=1
12:34:35.571 shard error for describing log dirs  broker_id=2  error="broker struct has died"
12:33:10.460 read Metadata v12  bytes_read=505207  time_to_read=5.790s
12:34:30.561 read Metadata v12  bytes_read=505207  time_to_read=3.770s

Should these timeouts be configurable via the kafka config section?
I see the code comment that says "shorter timeout because otherwise we'll potentially have very long response times in case of a single broker being down".
But shouldn't that be left to the user to decide?
