Optimization for strided load in store daemons #161
…duckDB (e.g. with python>=3.11)
…_mapped_source; 2) reuse StridedBlock cache; 3) add direct gather path for CPU mem in strided load
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5ffe9e5ee5
base_versions = {k: v.split("+")[0] for k, v in versions.items()}
unique_base_versions = set(base_versions.values())
is_consistent = len(unique_base_versions) <= 1
Compare full torch pins when validating version consistency
`validate_torch_versions` now strips local version tags before comparison, so mismatches like `2.8.0+cu128` vs `2.8.0+cpu` are treated as consistent even though they target different runtime builds. `tools/release.sh` uses this boolean check to gate releases, so this change can let incompatible torch/CUDA combinations pass validation and ship a broken build path. The validator should keep checking full pinned versions (or explicitly validate local tags when present).
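A minimal sketch of the stricter check the review asks for, assuming `versions` maps a source name to its pinned torch version string. The helper names `split_pin` and `check_torch_pins` are hypothetical and are not taken from the repository; the real `validate_torch_versions` may have a different signature and return type.

```python
def split_pin(pin: str) -> tuple[str, str]:
    """Split '2.8.0+cu128' into ('2.8.0', 'cu128'); the local tag is '' if absent."""
    base, _, local = pin.partition("+")
    return base, local


def check_torch_pins(versions: dict[str, str]) -> tuple[bool, list[str]]:
    """Return (is_consistent, problems), comparing base versions AND local tags.

    Assumption: a pin without a local tag counts as different from one with a
    tag, so '2.8.0' vs '2.8.0+cu128' is reported as a mismatch.
    """
    problems = []
    bases = {split_pin(v)[0] for v in versions.values()}
    local_tags = {split_pin(v)[1] for v in versions.values()}
    if len(bases) > 1:
        problems.append(f"base versions differ: {sorted(bases)}")
    if len(local_tags) > 1:
        problems.append(f"local tags differ: {sorted(local_tags)}")
    return (not problems, problems)
```

Reporting the base-version and local-tag mismatches separately keeps the release gate strict while still telling the maintainer which part of the pin diverged.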
* fix(dev): 1) fix torch version check; 2) sync bazel module sources with uv.lock
* fix(dev): explicitly add pytz as dependency to avoid import error in duckDB (e.g. with python>=3.11)
* fix(store): 1) add more logging points for strided load in byte_range_mapped_source; 2) reuse StridedBlock cache; 3) add direct gather path for CPU mem in strided load
* fix(store): add CPU memfd region in local replica handle
make_shared
allocation/deallocation
This directly reduces block_prepare time and memory allocation overhead.
So that the local replica used by runtime ingestion also has this capability, a cpu_base_ptr() passthrough was added to LocalReplicaSource at materialization_facade.cc:319.
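For readers unfamiliar with the term, a strided gather copies fixed-size blocks that sit a constant distance apart in a source buffer into one contiguous destination. The sketch below is an illustration of that idea only, written in Python for brevity. It is not the C++ implementation in this PR, and `gather_strided` and its parameters are hypothetical names. Here the `src` buffer stands in for the CPU memory that a base pointer such as `cpu_base_ptr()` would expose.

```python
def gather_strided(src, offset: int, block_size: int, stride: int, count: int) -> bytes:
    """Copy `count` blocks of `block_size` bytes, spaced `stride` bytes apart
    starting at `offset`, into one contiguous bytes object."""
    if offset < 0 or block_size < 0 or stride < 0 or count < 0:
        raise ValueError("offset, block_size, stride and count must be non-negative")
    view = memoryview(src).cast("B")  # byte-wise view, no copy of the source
    out = bytearray(block_size * count)
    for i in range(count):
        start = offset + i * stride
        end = start + block_size
        if end > len(view):
            raise ValueError(f"block {i} ends at {end}, past source length {len(view)}")
        # Direct copy from the source view into the destination slot.
        out[i * block_size:(i + 1) * block_size] = view[start:end]
    return bytes(out)


# Usage: take 2 bytes out of every 4, starting at byte 1.
data = bytes(range(16))
print(list(gather_strided(data, offset=1, block_size=2, stride=4, count=3)))
# prints [1, 2, 5, 6, 9, 10]
```

Reading straight from a memory view avoids an intermediate staging copy per block, which is the general motivation for a direct path when the source already lives in CPU memory.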
SGLang tensorcast loader performance results:
Significantly shortened
Eliminated
strided_block_load_us_total is substantially non-zero in all cases