Optimization for strided load in store daemons #161

Merged: wolegechu merged 5 commits into main from yuhan/daemon-strided-load on Mar 12, 2026

@zhou-yuhan
Collaborator

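  1. Fine-grained instrumentation and timing breakdown for materialize. The change is in byte_range_mapped_source.cc. VLOG(2) per-phase timing is added along the whole ByteRangeMappedSource::read_at / fill_strided_run / read_base path, covering:
  • time spent reading the source in read_base (success / short read / failure)
  • cache_lookup, block_prepare, block_load, pack, and row_copy on the strided path
  • begin/end for each run, plus a summary
  • summary fields in read_at.summary (strided_pack_us_total, strided_block_prepare_us_total, strided_block_load_us_total, etc.)
  2. Block cache reuse to reduce block_prepare jitter, also in byte_range_mapped_source.cc. Two changes on the miss path of fill_strided_run -> load_block:
  • when the cached block's use_count() == 1, reuse the same StridedBlock instead of calling make_shared on every miss
  • block->data reuses its capacity with a grow-only reserve + resize policy, reducing repeated allocation and deallocation
    This directly reduces block_prepare time and memory allocation overhead.
  3. Direct path (direct gather) that bypasses block load + pack. The core logic is also in the strided branch of byte_range_mapped_source.cc. When the conditions are met, rows are gathered directly from the source's CPU base address into the output buffer, instead of "read the whole block -> then pack". The trigger conditions include:
  • stride > row_len
  • row_len >= 4KB
  • this copy >= 4MB
  • rows_touched is within the threshold
  • the source provides cpu_base_ptr
    So that local replicas used by runtime ingestion also get this capability, LocalReplicaSource at materialization_facade.cc:319 now passes through cpu_base_ptr().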
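
For readers without the diff at hand, the block reuse and the direct gather path can be sketched roughly as below. This is a standalone illustration, not the code in byte_range_mapped_source.cc: `StridedRequest`, `acquire_block`, `can_direct_gather`, and `direct_gather` are hypothetical names, the `StridedBlock` struct is a stand-in for the real type, and `kMaxRows` is an assumed value because the description does not state the rows_touched threshold. The `use_count() == 1` check is only meaningful here under the assumption that no other thread can take a reference concurrently.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <memory>
#include <vector>

// Stand-in for the cached block type named in the description.
struct StridedBlock {
    std::vector<std::uint8_t> data;
};

// Miss path: reuse the cached block when the cache is its only owner,
// otherwise allocate a fresh one. The buffer capacity only ever grows.
std::shared_ptr<StridedBlock> acquire_block(std::shared_ptr<StridedBlock>& cached,
                                            std::size_t bytes) {
    if (!cached || cached.use_count() != 1) {
        cached = std::make_shared<StridedBlock>();
    }
    if (cached->data.capacity() < bytes) {
        cached->data.reserve(bytes);
    }
    cached->data.resize(bytes);  // shrinking the size keeps the capacity
    return cached;
}

// Hypothetical description of one strided copy.
struct StridedRequest {
    const std::uint8_t* cpu_base_ptr;  // nullptr if the source is not CPU-addressable
    std::size_t first_row_offset;      // byte offset of the first row in the source
    std::size_t stride;                // distance between consecutive row starts
    std::size_t row_len;               // bytes copied per row
    std::size_t rows;                  // number of rows touched
};

constexpr std::size_t kMinRowLen = 4 * 1024;            // 4KB, from the description
constexpr std::size_t kMinCopyBytes = 4 * 1024 * 1024;  // 4MB, from the description
constexpr std::size_t kMaxRows = 1 << 16;               // ASSUMED threshold

// Mirrors the listed trigger conditions.
bool can_direct_gather(const StridedRequest& r) {
    return r.cpu_base_ptr != nullptr &&
           r.stride > r.row_len &&
           r.row_len >= kMinRowLen &&
           r.rows * r.row_len >= kMinCopyBytes &&
           r.rows <= kMaxRows;
}

// Copies each row straight from the source into a densely packed output,
// skipping the intermediate "load whole block, then pack" step.
// The caller guarantees that out has room for rows * row_len bytes.
std::size_t direct_gather(const StridedRequest& r, std::uint8_t* out) {
    const std::uint8_t* src = r.cpu_base_ptr + r.first_row_offset;
    for (std::size_t i = 0; i < r.rows; ++i) {
        std::memcpy(out + i * r.row_len, src + i * r.stride, r.row_len);
    }
    return r.rows * r.row_len;
}
```

In this sketch a caller would check `can_direct_gather` first and fall back to the block path (using `acquire_block` on a cache miss) when it returns false.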

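SGLang tensorcast loader performance results:

  1. Same setup (Qwen3-14B, tp=2, tensorcast): load weight time drops from about 11s to about 5s
  • old log: 2026-03-10 13:29:49 -> 13:30:00 (about 11s)
  • new log: 2026-03-11 17:37:19 -> 17:37:24 (about 5s)
  • about 2.2x faster (~54.5% reduction)
  2. The key stage of the daemon data plane (consumer start -> production complete) is noticeably shorter
  • old: about 6.8s
  • new: about 3.7s
  • about 45% reduction
  3. The instrumentation fields show that once direct gather takes effect, the strided block_load/pack cost is essentially eliminated
  • old path, typically: strided_pack_us_total, strided_block_prepare_us_total, and strided_block_load_us_total are all significantly non-zero
  • new path, typically: these fields are close to 0, and strided_row_copy_us_total dominates instead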
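
As a rough illustration of how per-phase totals such as strided_row_copy_us_total might be accumulated, here is a minimal sketch using an RAII timer. The field names come from the description above; the `ReadAtSummary` struct and the `ScopedUsTimer` class are hypothetical and are not taken from the PR, which logs through VLOG(2).

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical container for the summary fields named in the description.
struct ReadAtSummary {
    std::int64_t strided_block_prepare_us_total = 0;
    std::int64_t strided_block_load_us_total = 0;
    std::int64_t strided_pack_us_total = 0;
    std::int64_t strided_row_copy_us_total = 0;
};

// Adds the elapsed microseconds of the enclosing scope to one counter.
class ScopedUsTimer {
public:
    explicit ScopedUsTimer(std::int64_t& total)
        : total_(total), start_(std::chrono::steady_clock::now()) {}

    ~ScopedUsTimer() {
        const auto elapsed = std::chrono::steady_clock::now() - start_;
        total_ += std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count();
    }

    ScopedUsTimer(const ScopedUsTimer&) = delete;
    ScopedUsTimer& operator=(const ScopedUsTimer&) = delete;

private:
    std::int64_t& total_;
    std::chrono::steady_clock::time_point start_;
};
```

Wrapping each phase in its own scope with a timer bound to the matching field would give totals of the kind compared above: on a direct gather run only the row copy counter would grow, while the block counters stay at 0.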

@zhou-yuhan zhou-yuhan requested a review from wolegechu March 11, 2026 14:39
@zhou-yuhan zhou-yuhan marked this pull request as ready for review March 11, 2026 14:39
@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5ffe9e5ee5

Comment on lines +143 to +145
base_versions = {k: v.split("+")[0] for k, v in versions.items()}
unique_base_versions = set(base_versions.values())
is_consistent = len(unique_base_versions) <= 1

P2: Compare full torch pins when validating version consistency

validate_torch_versions now strips local version tags before comparison, so mismatches like 2.8.0+cu128 vs 2.8.0+cpu are treated as consistent even though they target different runtime builds. tools/release.sh uses this boolean check to gate releases, so this change can let incompatible torch/CUDA combinations pass validation and ship a broken build path; the validator should keep checking full pinned versions (or explicitly validate local tags when present).
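
The difference between the two checks can be seen in a small standalone sketch. It is written in C++ purely for illustration and is not the project's Python validator; `base_version`, `consistent_by_base`, and `consistent_by_full_pin` are hypothetical names.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Drops the local version tag, e.g. "2.8.0+cu128" -> "2.8.0".
std::string base_version(const std::string& v) {
    return v.substr(0, v.find('+'));
}

// Comparison as in the reviewed lines: local tags are ignored.
bool consistent_by_base(const std::map<std::string, std::string>& versions) {
    std::set<std::string> unique;
    for (const auto& kv : versions) {
        unique.insert(base_version(kv.second));
    }
    return unique.size() <= 1;
}

// Comparison suggested by the review: the full pinned string must match.
bool consistent_by_full_pin(const std::map<std::string, std::string>& versions) {
    std::set<std::string> unique;
    for (const auto& kv : versions) {
        unique.insert(kv.second);
    }
    return unique.size() <= 1;
}
```

With pins of 2.8.0+cu128 and 2.8.0+cpu, the base-version check reports the set as consistent while the full-pin check does not, which is the gap the review points out.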

@wolegechu wolegechu merged commit 83d2114 into main Mar 12, 2026
2 of 4 checks passed
@wolegechu wolegechu deleted the yuhan/daemon-strided-load branch March 12, 2026 08:55
wolegechu pushed a commit that referenced this pull request Mar 12, 2026
* fix(dev): 1) fix torch version check; 2) sync bazel module sources with uv.lock

* fix(dev): explicitly add pytz as dependency to avoid import error in duckDB (e.g. with python>=3.11)

* fix(store): 1) add more logging points for strided load in byte_range_mapped_source; 2) reuse StridedBlock cache; 3) add direct gather path for CPU mem in strided load

* fix(store): add CPU memfd region in local replica handle