Skip to content

fix(tmq): remove epset that is useless#35126

Merged
guanshengliang merged 8 commits into3.0from
feat/6923036499
Apr 16, 2026
Merged

fix(tmq): remove epset that is useless#35126
guanshengliang merged 8 commits into3.0from
feat/6923036499

Conversation

@wangmm0220
Copy link
Copy Markdown
Contributor

@wangmm0220 wangmm0220 commented Apr 14, 2026

Description

Issue(s)

Checklist

Please check the items in the checklist if applicable.

  • Is the user manual updated?
  • Are the test cases passed and automated?
  • Is there no significant decrease in test coverage?

Copilot AI review requested due to automatic review settings April 14, 2026 02:43
@gemini-code-assist
Copy link
Copy Markdown
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR primarily updates TMQ subscription/consumer state handling by removing persisted endpoint sets (epset) where they’re no longer needed, and adjusts related audit behavior/tests and documentation.

Changes:

  • TMQ: remove persisted SEpSet/SMqVgEp from subscription/consumer SDB state, bumping related SDB soft versions and updating rebalance logic to track vgId only.
  • Audit: adjust audit user/token handling and ensure audit STB is created for audit DBs regardless of auditSaveInSelf, adding CI test coverage for “no token” and “self audit switch off”.
  • Misc: update client callback to fully clear “table not exist as empty” error state; update TDgpt anomaly-detection docs for multi-column input.

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
source/dnode/mnode/impl/src/mndSubscribe.c Switch unassigned/assigned vgroup tracking to int32_t vgId only; bump subscribe SDB version; remove epset update logic.
source/dnode/mnode/impl/src/mndConsumer.c Switch consumer ep tracking to vgId only and build response epsets dynamically; bump consumer SDB version.
source/dnode/mnode/impl/src/mndDef.c Update encode/decode for subscribe/consumer objects to drop epset fields while keeping backward decode compatibility.
source/dnode/mnode/impl/inc/mndDef.h Remove SMqVgEp and consumer ep field; update structures to store only vgId.
source/dnode/mnode/impl/src/mndDb.c Always create audit STB when creating an audit DB (no longer gated by tsAuditSaveInSelf).
source/dnode/mnode/impl/src/mndDnode.c Fetch active token for the configured audit user (not hardcoded "audit").
source/libs/new-stream/src/streamTriggerTask.c Adjust trigger history logic (count/session/meta-related behavior).
source/client/src/clientImpl.c Clear request-level error code/message when treating “table not exist” as empty result.
test/ci/cases.task Add the two new component test cases to CI task list.
test/cases/80-Components/01-Taosd/test_com_taosd_audit.py Update audit test to use cus_audit user and token creation steps.
test/cases/80-Components/01-Taosd/test_com_taosd_audit_no_token.py New test for audit telemetry with auditUseToken=0.
test/cases/80-Components/01-Taosd/test_com_taosd_self_audit_switch_off.py New test for switching auditSaveInSelf behavior.
docs/zh/.../03-ad.md, docs/en/.../03-ad.md, docs/en/.../index.md Document multi-column anomaly detection input behavior and new self.input_data_lists.
Comments suppressed due to low confidence (2)

source/libs/new-stream/src/streamTriggerTask.c:11755

  • readAllData is now hardcoded to true and never modified, so the metadata-accelerated branch (if (!readAllData)) becomes dead code and count-window history checks will always scan full data blocks. Either restore the previous conditional so the optimized path can be used when safe, or delete the unreachable branch to avoid misleading future changes (and consider the performance impact of always scanning).
  SStreamTriggerTask      *pTask = pContext->pTask;
  bool                     readAllData = true;
  bool                     allTableProcessed = false;
  bool                     needFetchData = false;

#define ALIGN_UP(x, b) (((x) + (b) - 1) / (b) * (b))
  while (!allTableProcessed && !needFetchData) {
    if (!readAllData) {
      // use table metadatas to accelerate the count window check
      if (IS_TRIGGER_TIMESTAMP_SORTER_EMPTY(pContext->pSorter)) {
        stTimestampSorterReset(pContext->pSorter);
        pContext->pCurTableMeta = tSimpleHashIterate(pGroup->pTableMetas, pContext->pCurTableMeta, &pContext->tbIter);
        if (pContext->pCurTableMeta == NULL) {
          // actually, it has only one table

source/libs/new-stream/src/streamTriggerTask.c:2600

  • This change drops STREAM_TRIGGER_COUNT from the histTrigTsIndex initialization special-case. If count-window history blocks still place the trigger timestamp in slot 0 (as before), switching to trigTsIndex can make history logic read the wrong column as the timestamp. Please confirm the slot layout for count-trigger history blocks and keep STREAM_TRIGGER_COUNT in the special-case if its history scan always uses ts at index 0.
  if (pTask->triggerType == STREAM_TRIGGER_SESSION || pTask->triggerType == STREAM_TRIGGER_SLIDING) {
    pTask->histTrigTsIndex = 0;
  } else {
    pTask->histTrigTsIndex = pTask->trigTsIndex;
  }

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread source/libs/new-stream/src/streamTriggerTask.c
Comment thread test/cases/80-Components/01-Taosd/test_com_taosd_audit_no_token.py
Comment thread test/cases/80-Components/01-Taosd/test_com_taosd_audit_no_token.py
@wangmm0220 wangmm0220 closed this Apr 14, 2026
wangmm0220 and others added 2 commits April 14, 2026 15:42
Verify that after changing taosd's epset (fqdn + port) and force killing,
WAL replay of old rebalance transactions fails until ep.json provides
the epset mapping to resolve pending transactions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@wangmm0220 wangmm0220 reopened this Apr 14, 2026
@JinqingKuang
Copy link
Copy Markdown
Contributor

Code Review 结果

本次审查覆盖范围:Change Summarizer、Broad Scanner、Performance Reviewer、Maintainability Reviewer、Rule Reviewer 全部完成。


变更概要

本 PR 将 TMQ 订阅/消费者数据模型中的 SArray<SMqVgEp> 替换为 SArray<int32_t>(纯 vgId),彻底移除了持久化存储 epset 的设计。epset 改为在需要时通过 mndAcquireVgroup + mndGetVgroupEpset 动态获取,同时删除了 mndSubActionDecode/mndConsumerActionDecode 中的 tmsgUpdateDnodeEpSet 刷新逻辑。版本号 MND_CONSUMER_VER_NUMBERMND_SUBSCRIBE_VER_NUMBER 均从 3 升至 4,向下兼容逻辑(sver <= 3)正确处理了旧格式的 epSet 字节跳过。

发现 3 个已验证问题


问题 1 — [Medium/高置信] Correctness:vgroup 获取失败时 consumer 会丢失对应 vg 的订阅

位置source/dnode/mnode/impl/src/mndConsumer.cprocessEachTopicEp 函数

问题:旧代码在 epoch != -1 时,无论 vgroup 是否能被 acquire,都会将该 vg 加入 consumer 的 topic endpoint 列表(使用缓存的 epset)。新代码改为每次都 acquire vgroup,若 mndAcquireVgroup 返回 NULL,该 vg 被直接 continue 跳过,仅打一条 mWarn,不触发任何 rebalance 或重试。在 vnode split / vgroup 迁移的过渡窗口内,mndAcquireVgroup 可能短暂失败,导致 consumer 收到不完整的 endpoint 列表,造成消息消费缺口。

修复方向:若 acquire 失败,可在 mWarn 后将该 vg 写入一个"降级列表"并触发一次 rebalance 检查;或在日志中明确说明该 vg 会在下次 heartbeat 中补充,而不是静默跳过。


问题 2 — [Low-Medium/中置信] Performance:每次 consumer 订阅应答都触发全量 vgroup acquire

位置source/dnode/mnode/impl/src/mndConsumer.cprocessEachTopicEp 函数

问题:旧代码仅在 epoch == -1(consumer 初次连接)时才 acquire vgroup;新代码每次调用 processEachTopicEp 都对所有 assigned vg 做 mndAcquireVgroup(SDB read-lock)。在 consumer 数量多、vgroup 数量多的集群中,每次 heartbeat 响应都会有 N 次 SDB 读锁争用。单次 acquire 虽然是内存操作,但高频累积效果值得关注。

修复方向:考虑恢复类似 epoch 判断的轻量路径——对已建立稳定连接的 consumer 跳过全量 acquire,仅在 epoch 变化时刷新 epset。


问题 3 — [Low/高置信] Testing:新增测试在 CI 中被注释 + taos.cfg 修改后无清理

位置 Atest/ci/cases.task,新增行(#,,y,.,./ci/pytest.sh pytest cases/17-DataSubscription/02-Consume/test_tmq_change_epset.py

问题 A:覆盖本 PR 核心修复场景(epset 变更后 WAL 重放卡住)的测试被注释掉,不会在 CI 中自动执行,本 PR 的回归保护实际缺失。

位置 Btest/cases/17-DataSubscription/02-Consume/test_tmq_change_epset.pymodifyTaosCfg 方法

问题 B:测试直接读写 taos.cfg 文件,但测试结束(无论成功或失败)时只清理了 /etc/hosts 条目,taos.cfg 的 fqdn/port 修改未还原。若测试框架复用同一 dnode 目录,后续测试会继承被污染的配置。

修复方向

  • 说明 CI 注释的原因(需要 sudo 权限操作 /etc/hosts),并提供后续启用计划,或重构测试以去掉 sudo 依赖;
  • 在 teardown 或 finally 块中还原 taos.cfg,或在 buildSubscription 前备份原始 cfg 供测试后恢复。

Comment thread source/dnode/mnode/impl/src/mndConsumer.c Outdated
Comment thread test/ci/cases.task
Comment thread test/cases/17-DataSubscription/02-Consume/test_tmq_change_epset.py
Copilot AI review requested due to automatic review settings April 15, 2026 06:05
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread source/dnode/mnode/impl/src/mndConsumer.c
Comment thread test/ci/cases.task
Comment thread test/cases/17-DataSubscription/02-Consume/test_tmq_change_epset.py
Comment thread test/cases/17-DataSubscription/02-Consume/test_tmq_change_epset.py
Comment thread test/cases/17-DataSubscription/02-Consume/test_tmq_change_epset.py
Comment thread test/cases/17-DataSubscription/02-Consume/test_tmq_change_epset.py
@guanshengliang guanshengliang merged commit 891eac1 into 3.0 Apr 16, 2026
13 of 14 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants