Support multi-source weights and HTTPS weight sources#3008
Conversation
|
Based on my comprehensive review, I found no actionable issues with this PR. The implementation is solid, well-tested, and follows established patterns. Summary of changes:
Verification:
LGTM |
6929793 to
dbd9722
Compare
|
The test failure in Based on my comprehensive review of the PR: Summary of changes:
Verification:
LGTM |
Weights can now declare multiple sources that merge into a single target directory (last-in-wins for path conflicts). This covers the common case of pulling .pth files from multiple GitHub release URLs into one weights dir. Config: source field accepts an object or array via custom unmarshalers. Lockfile: Source -> Sources[] with ImportedAt moved to entry level. WeightSpec: URI/Include/Exclude -> Sources[]SourceSpec. New HTTPSource for https:// and http:// URIs with ETag fingerprinting. Multi-source merge in resolveInventory with per-file owner tracking.
- Reject HTTP URIs with embedded credentials in NewHTTPSource and normalizeHTTPURI; the URI is recorded in weights.lock and would leak via git. Extract parseHTTPURI helper so the rule lives in one place. - Treat weak ETags (W/"...") as no ETag — they don't promise content identity per RFC 7232 §2.3. Document ETag as a cache hint, not a fingerprint. - Log HEAD failures and weak-ETag fallbacks at debug level. - Add MarshalYAML on WeightSourceList for symmetry with MarshalJSON; without it a single-source weight round-trips through YAML as a one-element sequence instead of a mapping. - Add a "source is required" check in validateWeights so users see an actionable error instead of the schema's "must be a mapping". - Surface multi-source path collisions at info level — silent shadowing is easy to miss in cog.yaml diffs. - Show "(+N more)" indicator in cog weights status for multi-source weights so the truncation is visible. - Render multi-source plan output on multiple lines instead of joining URIs with commas (URLs can contain commas). - Drop the unused CombineFingerprints helper and merged.Fingerprint field; per-source fingerprints are tracked separately and used wherever drift matters.
Every existing weight test instantiated a single-element source list, leaving the multi-source merge, override, and per-source fingerprint paths exercised only in production. Add tests for: - merging disjoint files from two sources into one inventory and resolving overlapping paths last-in-wins (table-driven, since the two cases share setup shape) - per-source fingerprint drift (mutating one source leaves the other's fingerprint stable) - PlanImport surfacing drift on a specific source index - YAML and JSON parsing of the array form of source: - MarshalYAML/MarshalJSON shape contract (mapping for single, sequence/array for multi) with round-trip through unmarshal - validateWeights accepting the multi-source form and attributing pattern errors to the right source index The new tests live alongside the existing single-source tests they parallel — multi-source isn't a separate concept, just a different length of the source list.
dbd9722 to
8a67b5a
Compare
|
I'm Bonk, and I've done a quick review of your PR. Summary of changes: Adds multi-source weight support with last-in-wins merging and HTTPS/HTTP weight sources with ETag-based fingerprinting. Issues found:
|
| // that NewHTTPSource and normalizeHTTPURI cannot drift apart. The | ||
| // "no userinfo" rule exists because the URI lands in weights.lock — | ||
| // see NewHTTPSource for the full rationale. | ||
| func parseHTTPURI(uri string) (*url.URL, error) { |
There was a problem hiding this comment.
Query-string credentials like presigned S3/GCS/GitHub tokens are accepted and can be written to weights.lock or shown in output. Fix by separating fetch URL from persisted/display URL and redacting or rejecting credential-bearing query params.
There was a problem hiding this comment.
good point! I don't think we should block on that, but worth coming back to. I'll make an issue!
| if err != nil { | ||
| return nil, fmt.Errorf("invalid HTTP URI %q: %w", uri, err) | ||
| } | ||
| if parsed.Scheme != "https" && parsed.Scheme != "http" { |
There was a problem hiding this comment.
should we instead default to only allow HTTPS with maybe insecure opt-in for localhost/dev cases?
There was a problem hiding this comment.
not sure what we'd gain by restricting it atm. we aren't running untrusted code, so there's not the same risk as eg docker where spoofing or mitm could serve different content. this is just fetching and unpacking blobs
Add Go fuzz tests for config parsing (FuzzParseBytes, FuzzFromYAML, FuzzConfigComplete) and fix two categories of crashes found: 1. version.Matches/Equal/Greater panicked on invalid version strings like "11.A". Replace MustVersion with graceful NewVersion handling. 2. yaml v4 LoadErrors created infinite recursion when errors.Is traversed wrapped errors. Use %v instead of %w when wrapping yaml parse errors to break the self-referential cycle. Also validate CUDA minor version in validateCudaVersion().
|
I'm Bonk, and I've done a quick review of your PR. Summary of changes: Adds multi-source weight support with last-in-wins merging and HTTPS/HTTP weight sources with ETag-based fingerprinting. Issues found:
|
There was a problem hiding this comment.
I'm Bonk, and I've done a quick review of your PR.
Summary of changes: Adds multi-source weight support with last-in-wins merging and HTTPS/HTTP weight sources with ETag-based fingerprinting.
Issues found:
- Test coverage gap: The HTTP source handles weak ETags (RFC 7232 §2.3) by ignoring them and falling back to SHA256 hashing, but this behavior isn't tested. The implementation in
http.go:199-201correctly handlesW/"..."prefixes, buthttp_test.golacks a test case for this scenario.
|
Posted a review comment on |
|
LGTM |
Weights can now declare multiple sources that merge into a single target directory, and a new
https:///http://source covers GitHub Releases, ONNX Model Zoo, and similar plain-HTTP downloads.The driving case is models that pull
.pthfiles from a few GitHub release URLs into one weights dir alongside a HuggingFace base. Today that needs a wrapper script; with this PR it's:Path conflicts resolve last-in-wins (declaration order = layering order, like CSS) and surface at info level so silent shadowing doesn't slip into a cog.yaml diff unnoticed.
Notes for reviewers
source(object) →sources(array), withimportedAtmoved up to the entry level. Version stays at1— managed weights are still experimental, real users are minimal, and a forced re-import on upgrade is preferable to compat shims. The lockfile-version story will get revisited as part of the broader cog ecosystem compatibility-date plan.weights.lock(which is checked into git), so embeddeduser:pass@hostwould leak. Auth-header support is a separate follow-up.W/"...") are ignored. The store still verifies the sha256 onPutFile, so a mutable URL serving different bytes between inventory and ingress fails loudly.source: { ... }andsource: [ ... ]both parse, with customMarshal/Unmarshalfor both YAML and JSON keeping single-source weights round-tripping as a mapping.No integration test for HTTPS — would need either a fixture server in the testscript harness or live network calls. The unit tests on
httptestcover the HTTP source thoroughly; happy to add aweights_multisource.txtarcovering twofile://sources as a follow-up if you want.