enh(stmt2): refactoring stmt2 retry strategy #35139

Merged
guanshengliang merged 6 commits into main from enh/main/6944442785
Apr 21, 2026

@Pengrongkun
Contributor

Track tRowBuild-allocated rows explicitly in stmtPatch (aHeapRows) and free them before tDestroySubmitReq instead of inferring heap vs slab by address range, fixing invalid free of decoded-in-place SRow pointers. Use asyncQueryCb for internal async retries so the user asyncExecFn runs once with the final result; invoke the user callback when retry setup fails; remove asyncQueryCbRetry.
In stmt2Case.exec_retry, accept NULL for backfilled INT columns after ALTER ADD COLUMN (taos_fetch_row uses null pointers for SQL NULL).
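
The ownership change described above can be illustrated with a small standalone sketch. Every name in it (Row, Batch, batch_rebuild_row, batch_destroy) is hypothetical and not taken from clientStmt2.c; it only shows the general idea of recording heap-allocated replacement rows in an explicit list, so that cleanup never has to guess from a pointer's address whether it may be freed.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical row type: a row lives either inside a shared slab or on the heap. */
typedef struct {
  int sver;
  int value;
} Row;

typedef struct {
  Row   slab[4];   /* rows decoded in place; must never be passed to free() */
  Row  *rows[4];   /* current row pointers (slab or heap) */
  Row **heapRows;  /* explicit list of heap-allocated replacement rows */
  int   nHeap;
  int   capHeap;
} Batch;

void batch_init(Batch *b) {
  memset(b, 0, sizeof(*b));
  for (int i = 0; i < 4; ++i) b->rows[i] = &b->slab[i];
}

/* Replace row i with a heap copy carrying a new schema version.
 * Returns 0 on success, -1 on allocation failure (row i is left untouched). */
int batch_rebuild_row(Batch *b, int i, int newSver) {
  if (b->nHeap == b->capHeap) {
    int   cap = b->capHeap ? b->capHeap * 2 : 4;
    Row **p = realloc(b->heapRows, (size_t)cap * sizeof(Row *));
    if (p == NULL) return -1;
    b->heapRows = p;
    b->capHeap = cap;
  }
  Row *n = malloc(sizeof(Row));
  if (n == NULL) return -1;
  *n = *b->rows[i];
  n->sver = newSver;
  b->heapRows[b->nHeap++] = n; /* record ownership before publishing the pointer */
  b->rows[i] = n;
  return 0;
}

/* Free exactly the rows this batch allocated; slab rows are never freed.
 * Returns the number of heap rows released. */
int batch_destroy(Batch *b) {
  int freed = b->nHeap;
  for (int i = 0; i < b->nHeap; ++i) free(b->heapRows[i]);
  free(b->heapRows);
  b->heapRows = NULL;
  b->nHeap = b->capHeap = 0;
  return freed;
}
```

Because the list is the single source of truth for ownership, a row pointer that still refers to the slab is simply never a candidate for free().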

Description

Issue(s)

  • Close/close/Fix/fix/Resolve/resolve: Issue Link

Checklist

Please check the items in the checklist if applicable.

  • Is the user manual updated?
  • Are the test cases passed and automated?
  • Is there no significant decrease in test coverage?

Copilot AI review requested due to automatic review settings April 14, 2026 09:53
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request implements an internal retry mechanism for the STscStmt2 component to handle errors such as metadata refreshes and schema version mismatches. It introduces functionality to save and restore serialized data blocks, rebuild rows to match updated schemas, and update metadata from the catalog in both synchronous and asynchronous execution paths. Review feedback identifies a memory leak in the asynchronous callback where the original request object is not freed when a retry is successfully initiated. Additionally, the use of arbitrary sleeps in test cases is flagged as a potential indicator of underlying synchronization issues that should be addressed more deterministically.

Comment thread source/client/src/clientStmt2.c Outdated
Comment thread source/client/test/stmt2Test.cpp Outdated
Contributor

Copilot AI left a comment

Pull request overview

This PR refactors the stmt2 retry strategy to make retries safer and more deterministic: it snapshots serialized submit blocks for NEED_CLIENT_HANDLE_ERROR retries, patches schema/table meta for retry execution, and adjusts async retry handling so the user async callback is intended to run only once with the final outcome. It also extends stmt2 tests to cover schema/meta changes between bind and exec.

Changes:

  • Add serialization snapshot/restore support for vnode modify data blocks and patching logic (schema ver + table uid/suid/sver) for internal retries.
  • Update async retry flow to retry internally and aim to invoke the user callback only once with the final result.
  • Expand stmt2Case.exec_retry test coverage for ALTER/DROP/RECREATE scenarios and adapter query timing.
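
The "user callback runs once" behaviour summarized above can be sketched in isolation. This is a simplified, synchronous model under stated assumptions: ExecCtx, internal_cb, the CODE_* values and the demo helpers are all hypothetical names, and the real asyncQueryCb works with request objects and asynchronous dispatch that are not modelled here.

```c
#include <stddef.h>

typedef void (*UserCb)(void *param, int code);

enum { CODE_OK = 0, CODE_NEED_RETRY = 1, CODE_FATAL = 2 };

typedef struct {
  UserCb userCb;             /* user completion callback */
  void  *userParam;
  int    retriesLeft;        /* internal retry budget */
  int  (*attempt)(void *ctx);/* hypothetical: executes the request, returns a code */
  void  *attemptCtx;
} ExecCtx;

/* Internal completion handler: retries internally while the code is
 * CODE_NEED_RETRY and budget remains, then reports to the user exactly
 * once with the final code (success, fatal error, or exhausted retries). */
void internal_cb(ExecCtx *ctx, int code) {
  while (code == CODE_NEED_RETRY && ctx->retriesLeft > 0) {
    ctx->retriesLeft--;
    code = ctx->attempt(ctx->attemptCtx);
  }
  ctx->userCb(ctx->userParam, code);
}

/* ---- demo helpers (illustration only) ---- */
typedef struct {
  int calls;
  int lastCode;
} UserState;

void record_user_cb(void *param, int code) {
  UserState *s = param;
  s->calls++;
  s->lastCode = code;
}

/* Fails with CODE_NEED_RETRY a fixed number of times, then succeeds. */
int fail_then_succeed(void *ctx) {
  int *remaining = ctx;
  if (*remaining > 0) {
    (*remaining)--;
    return CODE_NEED_RETRY;
  }
  return CODE_OK;
}

/* Runs one execution; returns the code the user saw and stores the
 * number of user callback invocations in *userCalls. */
int run_demo(int failures, int maxRetries, int *userCalls) {
  int       remaining = failures;
  UserState state = {0, -1};
  ExecCtx   ctx = {record_user_cb, &state, maxRetries, fail_then_succeed, &remaining};
  internal_cb(&ctx, ctx.attempt(ctx.attemptCtx)); /* first attempt, then internal retries */
  *userCalls = state.calls;
  return state.lastCode;
}
```

Funnelling every outcome through one exit point is what keeps the callback count at one, including the case where the retry budget runs out.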

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.

File | Description
source/client/src/clientStmt2.c | Implements saved submit-block cloning/restoration and schema/meta patching for retry; modifies sync + async retry flows; adds stmtErrstr2 alias.
source/client/inc/clientStmt2.h | Adds pVgDataBlocksForRetry to persist retry payload snapshots across planner ownership changes.
source/client/test/stmt2Test.cpp | Extends stmt2 retry tests and adds timing delay in adapter async query test.

Comment thread source/client/test/stmt2Test.cpp Outdated
Comment thread source/client/test/stmt2Test.cpp Outdated
Comment thread source/client/test/stmt2Test.cpp
Comment thread source/client/test/stmt2Test.cpp
Comment thread source/client/src/clientStmt2.c
Comment thread source/client/src/clientStmt2.c Outdated
@JinqingKuang
Contributor

Code Review Report: PR #35139

Found 7 verified issues in total (1 high, 2 medium, 4 low severity).


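🔴 High severity

1. stmtPatchOneSubmitTbDataSchemaVer: the schema rebuild is silently discarded when pushing onto aHeapRows fails

Location: source/client/src/clientStmt2.c:937-938

When aHeapRows != NULL and taosArrayPush(aHeapRows, &pNew) fails, pNew is freed by tRowDestroy, but the else branch (which calls taosArraySet to write the rebuilt row back into pTb->aRowP[i]) does not run, and pRow->sver is not updated either. As a result, the row is re-serialized unchanged with the old schema version, the server rejects the request again with the same schema version error, and the retry logic is ineffective.

Suggested fix: on push failure, fall back to an in-place update: pRow->sver = (uint16_t)pMeta->sversion; (consistent with how the HAS_BLOB branch handles it).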
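
A minimal sketch of the fallback suggested for issue 1, using hypothetical stand-in types (Row, HeapRowList, patch_row) instead of SRow, SArray and the real patch function. The fixed-capacity list exists only to make a push failure easy to trigger; it is not how taosArrayPush behaves.

```c
#include <stdlib.h>

typedef struct {
  int sver;
  int ncols;
} Row;

/* Tracking list with a tiny fixed capacity so that a failed push can be demonstrated. */
typedef struct {
  Row *items[2];
  int  n;
} HeapRowList;

/* Returns 0 on success, -1 if the list is full. */
int heap_rows_push(HeapRowList *l, Row *r) {
  if (l->n >= 2) return -1;
  l->items[l->n++] = r;
  return 0;
}

/* Allocates a row; returns NULL on allocation failure. */
Row *row_new(int sver, int ncols) {
  Row *r = malloc(sizeof(*r));
  if (r != NULL) {
    r->sver = sver;
    r->ncols = ncols;
  }
  return r;
}

/* Installs the rebuilt row into *slot if it can be tracked for later release.
 * If tracking fails, the rebuilt row is freed and the existing row's schema
 * version is updated in place, so the patch is never silently dropped.
 * Returns 1 if the rebuilt row was installed, 0 if the fallback was used. */
int patch_row(Row **slot, Row *rebuilt, HeapRowList *tracked, int newSver) {
  if (rebuilt != NULL && heap_rows_push(tracked, rebuilt) == 0) {
    *slot = rebuilt;
    return 1;
  }
  free(rebuilt);          /* free(NULL) is a no-op */
  (*slot)->sver = newSver; /* fallback: in-place version update */
  return 0;
}
```

Either way the slot ends up carrying the new schema version, which is the property the review comment says was lost on the failure path.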


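🟡 Medium severity

2. stmtBuildUidToTableMetaHash: taosHashPut failure is silently ignored

Location: source/client/src/clientStmt2.c:987, 1003

When taosHashPut returns an error, pDup is freed, yet the function continues and eventually returns TSDB_CODE_SUCCESS, leaving the hash without an entry for that uid. Downstream, stmtPatchOneSubmitTbDataSchemaVer gets NULL when it looks up that uid in the hash and returns immediately, so the schema version patch is skipped, the retried request still carries the old version, and the server reports the error again.

Suggested fix: when taosHashPut fails, return the error code and propagate it upward; or at least log a warning so the problem is easier to locate.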

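3. asyncQueryCb: state risk on the failure path after retry setup partially succeeds

Location: source/client/src/clientStmt2.c:3130-3158

When stmtCreateRequest succeeds but allocating the SSqlCallbackWrapper fails, pStmt->exec.pRequest already points to the newly created (not yet sent) request. The code then enters the "retry failed" notification path and falls through to stmtCleanExecInfo, which runs against this new, not fully initialized request. Its cleanup behavior depends on the exact state of the request, so there is a risk of a dangling reference (in particular, pStmt->exec.pRequest->inCallback = false executes after stmtCleanExecInfo).

Suggested fix: when the wrapper allocation fails, explicitly close the newly created pRequest (for example with taos_free_result) before continuing on the error callback path; or skip the retry phase when stmtCreateRequest fails.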


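🔵 Low severity

4. asyncQueryCb: leftover commented-out code not cleaned up

Location: source/client/src/clientStmt2.c:3111-3114

Three lines of commented-out code (the abandoned "notify the user first, then retry" design) should not appear in production code; please remove them before merging.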

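5. stmtFindDecodeSchemaForRow: O(n²) schema prefix probing

Location: source/client/src/clientStmt2.c:792-816

When a row's sver does not match the catalog, the function decrements from nMax down to 1 one step at a time, calling tBuildTSchema (which allocates memory) in each round and probing all n columns, for a time complexity of O(nMax²/2). For batch retries on wide tables with hundreds of columns, recovery time may grow noticeably.

Suggested fix: use binary search over the prefix column count n, reducing the complexity to O(nMax·log(nMax)).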
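
The binary-search idea from issue 5 could look roughly like the sketch below. It rests on an assumption the review does not confirm: that a probe can tell whether a candidate prefix is too short or too long for the row, rather than only reporting "decodes" or "does not decode". PrefixProbe, find_prefix_len and demo_probe are hypothetical names.

```c
#include <stddef.h>

/* Hypothetical three-way probe for a schema prefix of n columns:
 *   < 0  the prefix is too short for the row,
 *   > 0  the prefix is too long for the row,
 *     0  the row decodes with exactly this prefix. */
typedef int (*PrefixProbe)(int n, void *ctx);

/* Binary search over prefix lengths 1..nMax.
 * Returns the matching prefix length, or -1 if none matches.
 * If probes is not NULL, it is incremented once per probe call. */
int find_prefix_len(int nMax, PrefixProbe probe, void *ctx, int *probes) {
  int lo = 1, hi = nMax;
  while (lo <= hi) {
    int mid = lo + (hi - lo) / 2;
    int r = probe(mid, ctx);
    if (probes != NULL) (*probes)++;
    if (r == 0) return mid;
    if (r < 0) {
      lo = mid + 1; /* need more columns */
    } else {
      hi = mid - 1; /* need fewer columns */
    }
  }
  return -1;
}

/* Demo probe: ctx points to the number of columns the row was built with. */
int demo_probe(int n, void *ctx) {
  int actual = *(int *)ctx;
  return (n > actual) - (n < actual);
}
```

If the real probe can only answer yes or no and exactly one prefix length decodes, the search space is not ordered and this approach does not apply as written; the linear scan would then need a different optimization, such as reusing one schema buffer across probes.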

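6. stmtFetchOneRetryTbMetaPatch path 2: O(N×M×L) catalog lookups

stmtUpdateVgDataBlocksTbMetaFromCatalog calls stmtFetchOneRetryTbMetaPatch for every (vg block, table) pair, and path 2 re-traverses the entire tableList and issues catalog lookups on every call, for a total of O(nBlk × nTb × nList) lookups. By contrast, stmtBuildUidToTableMetaHash on the schema version path builds the uid→meta hash once up front (O(nList)), so the two designs are inconsistent.

Suggested fix: build the uid→meta hash before the vg block loop, reusing the strategy of stmtBuildUidToTableMetaHash.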
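
The "build once, reuse for every table" strategy from issue 6 can be sketched as follows. Meta, Patch, catalog_get and build_patch_index are hypothetical stand-ins; in particular catalog_get only counts calls and reads from an array, whereas the real catalogGetTableMeta may involve a cache or a remote lookup.

```c
#include <stdlib.h>

typedef struct {
  unsigned long uid;
  int           sver;
  int           isSuper; /* non-zero for a super table */
} Meta;

typedef struct {
  unsigned long uid;
  int           sver;
} Patch;

static int g_lookups = 0; /* number of simulated catalog lookups */

/* Hypothetical stand-in for one catalog lookup. */
static Meta catalog_get(const Meta *catalog, int idx) {
  g_lookups++;
  return catalog[idx];
}

int lookup_count(void) { return g_lookups; }

/* Builds one Patch per non-super table with a single pass over the list,
 * so later per-table queries become plain array indexing by ordinal.
 * Returns the number of patches, or -1 on allocation failure.
 * The caller owns *out and releases it with free(). */
int build_patch_index(const Meta *catalog, int nList, Patch **out) {
  Patch *p = malloc((size_t)(nList > 0 ? nList : 1) * sizeof(Patch));
  if (p == NULL) return -1;
  int n = 0;
  for (int i = 0; i < nList; ++i) {
    Meta m = catalog_get(catalog, i);
    if (m.isSuper) continue; /* super tables carry no rows to patch */
    p[n].uid = m.uid;
    p[n].sver = m.sver;
    n++;
  }
  *out = p;
  return n;
}
```

With the index built once per retry, the total number of lookups is nList regardless of how many submit blocks and tables are patched afterwards.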

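7. Async exec_retry test does not wait for the result

Location: source/client/test/stmt2Test.cpp:4219-4222

The async test block that uses stmtAsyncQueryCb2 runs taos_query to verify the result right after taos_stmt2_exec, but stmtAsyncQueryCb2 does not post any semaphore. As the drop-table async test pattern at lines 4312-4339 of the same file shows, taos_stmt2_exec is a non-blocking call, so the query here may run before the write completes, which makes the verification unreliable (a potentially flaky test).

Suggested fix: switch to the semaphore-based stmtAsyncQueryCb + tsem_wait (see lines 4312 and 4339), or add a synchronization mechanism to stmtAsyncQueryCb2.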


Automatically generated by the Claude Code code-review skill

Comment thread source/client/src/clientStmt2.c
Comment thread source/client/src/clientStmt2.c Outdated
Comment thread source/client/src/clientStmt2.c Outdated
Comment thread source/client/src/clientStmt2.c
Comment thread source/client/test/stmt2Test.cpp
Comment thread source/client/src/clientStmt2.c Outdated
Copilot AI review requested due to automatic review settings April 16, 2026 03:18
@Pengrongkun force-pushed the enh/main/6944442785 branch from 342b201 to 0fedab0 on April 16, 2026 03:19
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.


Comment thread source/client/test/stmt2Test.cpp Outdated
Comment thread source/client/src/clientStmt2.c
Comment thread source/client/src/clientStmt2.c
Comment thread source/client/test/stmt2Test.cpp
Comment thread source/client/test/stmt2Test.cpp
@Pengrongkun force-pushed the enh/main/6944442785 branch from 0fedab0 to 61f94cf on April 16, 2026 05:47
Copilot AI review requested due to automatic review settings April 20, 2026 11:25
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.


Comment on lines +977 to +996
  if (c != TSDB_CODE_SUCCESS) {
    if (pMeta != NULL) {
      taosMemoryFree(pMeta);
    }
    taosHashCleanup(pHash);
    return c;
  }
  if (pMeta != NULL) {
    STableMeta* pDup = stmtCloneTableMetaForRetry(pMeta);
    taosMemoryFree(pMeta);
    pMeta = NULL;
    if (pDup != NULL) {
      int32_t putCode = taosHashPut(pHash, &pDup->uid, sizeof(uint64_t), &pDup, POINTER_BYTES);
      if (putCode != TSDB_CODE_SUCCESS) {
        STMT2_ELOG("stmtBuildUidToTableMetaHash taosHashPut failed uid:%" PRIu64 ", code:%s", (uint64_t)pDup->uid,
                   tstrerror(putCode));
        taosMemoryFree(pDup);
        taosHashCleanup(pHash);
        return putCode;
      }

Copilot AI Apr 20, 2026

On several error paths this function calls taosHashCleanup(pHash) without first freeing the heap-allocated STableMeta* values already inserted into the hash. Since the hash stores pointer bytes (not owning/freeing the pointed-to metas), this leaks memory when catalogGetTableMeta or taosHashPut fails. Consider using stmtFreeUidTableMetaHash(pHash) (or equivalent iteration) on these early returns so inserted metas are freed before cleanup.
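
The cleanup pattern this comment asks for can be shown with a small standalone sketch. A plain pointer array stands in for the hash, and all names (Meta, MetaTable, meta_table_build, meta_table_destroy) are hypothetical; the point is only that a container holding non-owned pointers needs its values freed explicitly on every early return.

```c
#include <stdlib.h>

static int g_live = 0; /* live Meta allocations, tracked for demonstration only */

typedef struct {
  unsigned long uid;
} Meta;

static Meta *meta_new(unsigned long uid) {
  Meta *m = malloc(sizeof(*m));
  if (m != NULL) {
    m->uid = uid;
    g_live++;
  }
  return m;
}

static void meta_free(Meta *m) {
  if (m != NULL) {
    free(m);
    g_live--;
  }
}

int live_metas(void) { return g_live; }

/* Container of Meta pointers; it does not free its values on its own. */
typedef struct {
  Meta **items;
  int    n;
} MetaTable;

/* Frees every stored Meta first, then the container storage. */
void meta_table_destroy(MetaTable *t) {
  for (int i = 0; i < t->n; ++i) meta_free(t->items[i]);
  free(t->items);
  t->items = NULL;
  t->n = 0;
}

/* Builds the table for the given uids. failAt simulates a failed lookup
 * at that index (-1 means no simulated failure).
 * Returns 0 on success, -1 on failure; on failure nothing stays allocated. */
int meta_table_build(MetaTable *t, const unsigned long *uids, int n, int failAt) {
  t->n = 0;
  t->items = calloc((size_t)(n > 0 ? n : 1), sizeof(Meta *));
  if (t->items == NULL) return -1;
  for (int i = 0; i < n; ++i) {
    Meta *m = (i == failAt) ? NULL : meta_new(uids[i]);
    if (m == NULL) {
      meta_table_destroy(t); /* releases the entries inserted so far */
      return -1;
    }
    t->items[t->n++] = m;
  }
  return 0;
}
```

Routing every early return through the same destroy helper keeps the error paths from drifting apart as the function grows.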

Comment on lines +1206 to +1231
  int32_t nList = (int32_t)taosArrayGetSize(pRequest->tableList);
  int32_t nonStbOrd = 0;
  for (int32_t li = 0; li < nList; ++li) {
    SName* pName = taosArrayGet(pRequest->tableList, li);
    STableMeta* pMeta = NULL;
    int32_t c = catalogGetTableMeta(pStmt->pCatalog, &conn, pName, &pMeta);
    if (c != TSDB_CODE_SUCCESS) {
      taosMemoryFreeClear(pMeta);
      return c;
    }
    if (pMeta == NULL) {
      return TSDB_CODE_INTERNAL_ERROR;
    }
    if (stmtRetryTbMetaIsSuperTable(pMeta)) {
      taosMemoryFree(pMeta);
      continue;
    }
    if (nonStbOrd == tbIdx) {
      pPatch->uid = pMeta->uid;
      pPatch->suid = pMeta->suid;
      pPatch->sver = pMeta->sversion;
      taosMemoryFree(pMeta);
      return TSDB_CODE_SUCCESS;
    }
    taosMemoryFree(pMeta);
    nonStbOrd++;

Copilot AI Apr 20, 2026

stmtFetchOneRetryTbMetaPatch does a full scan of pRequest->tableList and calls catalogGetTableMeta for each entry. Since stmtUpdateVgDataBlocksTbMetaFromCatalog calls this once per SSubmitTbData, retries with many tables become O(nSubmitTb * nTableList) catalog RPCs. Consider pre-building an index (e.g., array of non-super-table metas/patches aligned to tbIdx, or a uid/name->meta hash) once per retry and reusing it for all tables in the submit block.

Suggested change (replaces the commented lines +1206 to +1231 quoted above):
  static _Thread_local SRequestObj* sRetryPatchCacheReq = NULL;
  static _Thread_local void* sRetryPatchCacheTableList = NULL;
  static _Thread_local int32_t sRetryPatchCacheListSize = -1;
  static _Thread_local SArray* sRetryPatchCache = NULL;
  int32_t nList = (int32_t)taosArrayGetSize(pRequest->tableList);
  if (sRetryPatchCacheReq != pRequest || sRetryPatchCacheTableList != pRequest->tableList ||
      sRetryPatchCacheListSize != nList) {
    if (sRetryPatchCache != NULL) {
      taosArrayDestroy(sRetryPatchCache);
      sRetryPatchCache = NULL;
    }
    sRetryPatchCache = taosArrayInit(nList > 0 ? nList : 1, sizeof(SRetryTbMetaPatch));
    if (sRetryPatchCache == NULL) {
      return TAOS_GET_TERRNO(TSDB_CODE_OUT_OF_MEMORY);
    }
    for (int32_t li = 0; li < nList; ++li) {
      SName* pName = taosArrayGet(pRequest->tableList, li);
      STableMeta* pMeta = NULL;
      int32_t c = catalogGetTableMeta(pStmt->pCatalog, &conn, pName, &pMeta);
      if (c != TSDB_CODE_SUCCESS) {
        taosMemoryFreeClear(pMeta);
        taosArrayDestroy(sRetryPatchCache);
        sRetryPatchCache = NULL;
        sRetryPatchCacheReq = NULL;
        sRetryPatchCacheTableList = NULL;
        sRetryPatchCacheListSize = -1;
        return c;
      }
      if (pMeta == NULL) {
        taosArrayDestroy(sRetryPatchCache);
        sRetryPatchCache = NULL;
        sRetryPatchCacheReq = NULL;
        sRetryPatchCacheTableList = NULL;
        sRetryPatchCacheListSize = -1;
        return TSDB_CODE_INTERNAL_ERROR;
      }
      if (stmtRetryTbMetaIsSuperTable(pMeta)) {
        taosMemoryFree(pMeta);
        continue;
      }
      SRetryTbMetaPatch patch = {0};
      patch.uid = pMeta->uid;
      patch.suid = pMeta->suid;
      patch.sver = pMeta->sversion;
      taosMemoryFree(pMeta);
      if (taosArrayPush(sRetryPatchCache, &patch) == NULL) {
        taosArrayDestroy(sRetryPatchCache);
        sRetryPatchCache = NULL;
        sRetryPatchCacheReq = NULL;
        sRetryPatchCacheTableList = NULL;
        sRetryPatchCacheListSize = -1;
        return TAOS_GET_TERRNO(TSDB_CODE_OUT_OF_MEMORY);
      }
    }
    sRetryPatchCacheReq = pRequest;
    sRetryPatchCacheTableList = pRequest->tableList;
    sRetryPatchCacheListSize = nList;
  }
  if (tbIdx >= 0 && tbIdx < (int32_t)taosArrayGetSize(sRetryPatchCache)) {
    SRetryTbMetaPatch* pCachedPatch = taosArrayGet(sRetryPatchCache, tbIdx);
    if (pCachedPatch != NULL) {
      *pPatch = *pCachedPatch;
      return TSDB_CODE_SUCCESS;
    }

@JinqingKuang changed the title from "enhstmt2): refactoring stmt2 retry strategy" to "enh(stmt2): refactoring stmt2 retry strategy" on Apr 21, 2026
@guanshengliang merged commit c69544e into main on Apr 21, 2026
15 of 16 checks passed
4 participants