Optimization for strided load in store daemons by zhou-yuhan · Pull Request #161 · tensorcast-ai/tensorcast

zhou-yuhan · 2026-03-11T14:39:04Z

Fine-grained instrumentation and timing breakdown for materialize. The module changed is byte_range_mapped_source.cc. The core change adds VLOG(2) per-segment timing along the full ByteRangeMappedSource::read_at / fill_strided_run / read_base path, covering:

read_base source-read time (success / short read / failure)
cache_lookup, block_prepare, block_load, pack, and row_copy on the strided path
begin/end of each run, plus a summary
read_at.summary fields (strided_pack_us_total, strided_block_prepare_us_total, strided_block_load_us_total, etc.)
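
As an illustration of the per-segment timing described above, here is a minimal Python sketch that accumulates elapsed microseconds per named segment and emits summary fields in the `<segment>_us_total` style. It is not the C++ VLOG(2) code from this PR; the `SegmentTimers` class and the stand-in workloads are hypothetical.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class SegmentTimers:
    """Accumulates per-segment elapsed time in microseconds (hypothetical helper)."""

    def __init__(self):
        self.totals_us = defaultdict(int)
        self.counts = defaultdict(int)

    @contextmanager
    def segment(self, name):
        start = time.perf_counter_ns()
        try:
            yield
        finally:
            # Record even if the segment raises, so failed reads are still timed.
            self.totals_us[name] += (time.perf_counter_ns() - start) // 1000
            self.counts[name] += 1

    def summary(self):
        # Field names mimic the "<segment>_us_total" style of read_at.summary.
        return {f"{name}_us_total": total for name, total in self.totals_us.items()}


timers = SegmentTimers()
for _ in range(3):
    with timers.segment("strided_block_load"):
        sum(range(1000))  # stand-in for loading a block
    with timers.segment("strided_pack"):
        bytes(4096)       # stand-in for packing rows
print(timers.summary())
```

Keeping totals per segment name is what makes a before/after comparison of fields such as strided_pack_us_total possible from the logs alone.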

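Block cache reuse to reduce block_prepare jitter, also in byte_range_mapped_source.cc. The miss path of fill_strided_run -> load_block now does two things:

When the cached block's use_count()==1, reuse the same StridedBlock instead of calling make_shared on every miss
block->data reuses its capacity with a grow-only reserve + resize strategy, reducing repeated allocation/deallocation
This directly reduces block_prepare time and memory-allocation overhead.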
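
The grow-only capacity strategy can be sketched in Python as follows. This is only an analogue of the C++ reserve + resize behaviour: `GrowOnlyBlock` is a hypothetical name, and the use_count()==1 ownership check is not modelled here.

```python
class GrowOnlyBlock:
    """A reusable block whose backing storage grows but never shrinks (hypothetical)."""

    def __init__(self):
        self._storage = bytearray()  # plays the role of the vector's capacity
        self.size = 0                # logical size of the current contents
        self.allocations = 0         # how many times the storage had to grow

    def prepare(self, nbytes):
        if nbytes > len(self._storage):
            # Allocate only when the request exceeds the existing capacity.
            self._storage = bytearray(nbytes)
            self.allocations += 1
        self.size = nbytes
        return memoryview(self._storage)[:nbytes]


block = GrowOnlyBlock()
for request in (1024, 4096, 2048, 4096, 512):
    view = block.prepare(request)
    assert len(view) == request
print(block.allocations)  # 2
```

Five requests cause only two allocations, because every request after the largest one fits in the capacity already held.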

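Direct path (direct gather) that bypasses block load + pack. The core logic is also in the strided branch of byte_range_mapped_source.cc. When the conditions are met, a direct gather path copies rows straight from the source's CPU base address into the output buffer, instead of "read the whole block -> then pack". The trigger conditions include:

stride > row_len
row_len >= 4KB
this copy >= 4MB
rows_touched within the threshold
the source provides cpu_base_ptr
So that the local replica used by runtime ingestion also has this capability, LocalReplicaSource at materialization_facade.cc:319 now passes cpu_base_ptr() through.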
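
The trigger conditions and the row-wise gather can be sketched in Python as below. This is a sketch under stated assumptions, not the C++ implementation: the function names are hypothetical, `MAX_ROWS_TOUCHED` is a placeholder because the description does not state the actual threshold, and "this copy" is assumed to mean rows * row_len.

```python
ROW_LEN_MIN = 4 * 1024       # row_len >= 4KB (from the description)
COPY_MIN = 4 * 1024 * 1024   # copy size >= 4MB (from the description)
MAX_ROWS_TOUCHED = 1 << 16   # hypothetical: the real threshold is not stated


def can_direct_gather(stride, row_len, rows, cpu_base):
    """Return True when every listed trigger condition holds."""
    return (
        cpu_base is not None                # source exposes a CPU base pointer
        and stride > row_len
        and row_len >= ROW_LEN_MIN
        and rows * row_len >= COPY_MIN      # assumed meaning of "this copy"
        and rows <= MAX_ROWS_TOUCHED
    )


def direct_gather(cpu_base, offset, stride, row_len, rows, out):
    """Copy each row straight from the strided source into a packed output."""
    src = memoryview(cpu_base)
    dst = memoryview(out)
    for r in range(rows):
        start = offset + r * stride
        dst[r * row_len:(r + 1) * row_len] = src[start:start + row_len]


stride, row_len, rows = 8192, 4096, 1024
source = bytearray(stride * rows)
for r in range(rows):
    source[r * stride] = r % 251  # tag the first byte of each row
out = bytearray(row_len * rows)
if can_direct_gather(stride, row_len, rows, source):
    direct_gather(source, 0, stride, row_len, rows, out)
```

Each row is copied exactly once into its final position, which is why no intermediate block buffer or pack step is needed on this path.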

SGLang tensorcast loader performance results:

With the same setup (Qwen3-14B, tp=2, tensorcast), Load weight drops from about 11s to about 5s

Old log: 2026-03-10 13:29:49 -> 13:30:00 (about 11s)
New log: 2026-03-11 17:37:19 -> 17:37:24 (about 5s)
About 2.2x faster (~54.5% reduction)
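
The speedup and reduction figures follow directly from the two log intervals. A short Python check, using only the timestamps quoted above (the helper name is illustrative):

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"


def seconds_between(start, end):
    """Elapsed seconds between two log timestamps."""
    return (datetime.strptime(end, FMT) - datetime.strptime(start, FMT)).total_seconds()


old_s = seconds_between("2026-03-10 13:29:49", "2026-03-10 13:30:00")
new_s = seconds_between("2026-03-11 17:37:19", "2026-03-11 17:37:24")
speedup = old_s / new_s
reduction_pct = (1 - new_s / old_s) * 100
print(f"{old_s:.0f}s -> {new_s:.0f}s, {speedup:.1f}x, {reduction_pct:.1f}% lower")
```

The timestamps have one-second resolution, so the figures are approximate.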

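The key stage of the daemon data plane (consumer start -> production complete) is noticeably shorter

Old: about 6.8s
New: about 3.7s
About 45% reduction

The instrumentation fields show that once direct gather takes effect, strided block_load/pack is essentially eliminated

Old path, typical: strided_pack_us_total, strided_block_prepare_us_total, and strided_block_load_us_total are all significantly non-zero
New path, typical: these fields are close to 0, and strided_row_copy_us_total dominates instead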

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5ffe9e5ee5

chatgpt-codex-connector · 2026-03-11T14:47:58Z

+    base_versions = {k: v.split("+")[0] for k, v in versions.items()}
+    unique_base_versions = set(base_versions.values())
+    is_consistent = len(unique_base_versions) <= 1


Compare full torch pins when validating version consistency

validate_torch_versions now strips local version tags before comparison, so mismatches like 2.8.0+cu128 vs 2.8.0+cpu are treated as consistent even though they target different runtime builds. tools/release.sh uses this boolean check to gate releases, so this change can let incompatible torch/CUDA combinations pass validation and ship a broken build path; the validator should keep checking full pinned versions (or explicitly validate local tags when present).
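
To make the concern concrete, the sketch below contrasts the stripped comparison from the diff with a full-pin comparison. The rest of validate_torch_versions is not shown in this review, so both helper functions and the example pin sources are hypothetical.

```python
def is_consistent_stripped(versions):
    # Mirrors the diff above: drop the local version tag after "+".
    base_versions = {k: v.split("+")[0] for k, v in versions.items()}
    return len(set(base_versions.values())) <= 1


def is_consistent_full(versions):
    # Compare the full pinned strings, including local tags such as "+cu128".
    return len(set(versions.values())) <= 1


# Hypothetical pin sources, chosen only for illustration.
pins = {"requirements.txt": "2.8.0+cu128", "pyproject.toml": "2.8.0+cpu"}
print(is_consistent_stripped(pins))  # True
print(is_consistent_full(pins))      # False
```

The stripped check accepts the CUDA and CPU builds as the same version, while the full comparison flags the mismatch.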

* fix(dev): 1) fix torch version check; 2) sync bazel module sources with uv.lock
* fix(dev): explicitly add pytz as dependency to avoid import error in duckDB (e.g. with python>=3.11)
* fix(store): 1) add more logging points for strided load in byte_range_mapped_source; 2) reuse StridedBlock cache; 3) add direct gather path for CPU mem in strided load
* fix(store): add CPU memfd region in local replica handle

zhou-yuhan added 5 commits March 3, 2026 11:05

3c0e391 fix(dev): 1) fix torch version check; 2) sync bazel module sources with uv.lock
818d552 fix(dev): explicitly add pytz as dependency to avoid import error in duckDB (e.g. with python>=3.11)
50e7eff Merge branch 'main' into yuhan/dev-fix
7bc9075 fix(store): 1) add more logging points for strided load in byte_range_mapped_source; 2) reuse StridedBlock cache; 3) add direct gather path for CPU mem in strided load
5ffe9e5 fix(store): add CPU memfd region in local replica handle

zhou-yuhan requested a review from wolegechu March 11, 2026 14:39

zhou-yuhan marked this pull request as ready for review March 11, 2026 14:39

chatgpt-codex-connector Bot reviewed Mar 11, 2026

wolegechu merged commit 83d2114 into main Mar 12, 2026
2 of 4 checks passed

wolegechu deleted the yuhan/daemon-strided-load branch March 12, 2026 08:55
