Hardcoded context timeouts in GetClusterInfo/GetTopicsOverview cause failures on large clusters

**Errors observed:**
```
WARN "failed to fetch topic configs to return cleanup.policy" error="context deadline exceeded"
WARN "failed to describe log dirs from some shards" failed_shards=1
WARN "shard error for describing log dirs" broker_id=X error="the internal broker struct chosen to issue this request has died"
```

**Description:**

`GetClusterInfo` (https://github.com/redpanda-data/console/blob/534690f8b52944feff76c0c596fc38c7e236ed3d/backend/pkg/console/cluster_info.go#L56) uses a hardcoded `6s` timeout for `DescribeLogDirs` + Metadata,
and `GetTopicsOverview` (https://github.com/redpanda-data/console/blob/534690f8b52944feff76c0c596fc38c7e236ed3d/backend/pkg/console/topic_overview.go#L80) uses a hardcoded `5s` timeout for `DescribeConfigs`.

These are not configurable.

On large AWS MSK clusters (thousands of partitions) with IAM authentication, these timeouts can be insufficient, possibly because of the overhead of the STS token exchange with IAM.

The same cluster works fine with mTLS auth.

**Debug logs:**
```
12:34:30.625 wrote DescribeConfigs v4  broker=2  bytes_written=214330
12:34:35.566 read  DescribeConfigs v4  broker=2  bytes_read=0  time_to_read=4.941s  err="context deadline exceeded"
12:34:35.566 read from broker errored, killing connection  broker=2  successful_reads=0
12:34:35.571 failed to describe log dirs from some shards  failed_shards=1
12:34:35.571 shard error for describing log dirs  broker_id=2  error="broker struct has died"
```

```
12:33:10.460 read Metadata v12  bytes_read=505207  time_to_read=5.790s
12:34:30.561 read Metadata v12  bytes_read=505207  time_to_read=3.770s
```

Should these timeouts be configurable via the `kafka` config section?
I see the code comment that says "shorter timeout because otherwise we'll potentially have very long response times in case of a single broker being down".
But shouldn't that be left to the user's decision?
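
For illustration, here is a minimal sketch of how a configurable timeout could keep the current values as defaults. `RequestTimeouts`, its field names, and `resolveTimeout` are hypothetical and do not exist in Console today; only the `6s` and `5s` defaults come from the code linked above.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Defaults mirror the hardcoded values cited in this issue.
const (
	defaultClusterInfoTimeout    = 6 * time.Second
	defaultTopicsOverviewTimeout = 5 * time.Second
)

// RequestTimeouts is a hypothetical config block (an assumption for this
// sketch); a zero value means "not set by the user".
type RequestTimeouts struct {
	ClusterInfo    time.Duration
	TopicsOverview time.Duration
}

// resolveTimeout returns the configured value, or the fallback when the
// configured value is unset or not positive.
func resolveTimeout(configured, fallback time.Duration) time.Duration {
	if configured <= 0 {
		return fallback
	}
	return configured
}

func main() {
	// Only the topics overview timeout is overridden here.
	cfg := RequestTimeouts{TopicsOverview: 30 * time.Second}

	clusterInfo := resolveTimeout(cfg.ClusterInfo, defaultClusterInfoTimeout)
	topicsOverview := resolveTimeout(cfg.TopicsOverview, defaultTopicsOverviewTimeout)

	// The resolved value would replace the literal passed to context.WithTimeout.
	ctx, cancel := context.WithTimeout(context.Background(), topicsOverview)
	defer cancel()

	_, hasDeadline := ctx.Deadline()
	fmt.Println("cluster info timeout:", clusterInfo)
	fmt.Println("topics overview timeout:", topicsOverview)
	fmt.Println("context has deadline:", hasDeadline)
}
```

With this shape, users who set nothing keep today's behaviour, and operators of large clusters can raise the limits without a code change.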

