gomodfs: implement synthetic .zip file serving via synthzip #26
Conversation
Implement the $base/$module/@v/$version.zip endpoint from the goproxy protocol, using github.com/tomhjp/synthzip to synthesize zip files on-the-fly from individually-cached files without materializing the entire archive in memory upfront.

synthzip requires Name, Size, and CRC32 for each file at construction time (no I/O), then lazily reads file contents only when a byte range is actually accessed.

Key changes:

- store: Add CRC32 to the PutFile interface; add a ZipFileEntry type and a GetZipFileEntries method to the Store interface
- gitstore: Store CRC32 data as a "zipcrc" blob in the git tree during PutModule (format: "<8-hex-crc32>\t<path>\n" per file). Load it back in newModHandle. Old cache entries without zipcrc get CRC32=0.
- gomodfs: Add getZipArchive (builds a synthzip.Archive from metadata), getZipFileData (materializes the full zip), getZipFileSize (metadata only)
- path: Recognize the ".zip" extension in parsePath
- All frontends (NFS, FUSE, WebDAV, WinFsp): Serve .zip files. Stat calls use getZipFileSize to avoid materializing the full zip.
tomhjp
left a comment
Thanks! There are a few design decisions that are interesting to discuss/document, but none of them are blockers.
```go
}

// Try to load CRC32 data from the zipcrc blob.
if crcObj, err := s.getObject(context.TODO(), modTree.String()+":zipcrc", "blob"); err == nil {
```
```go
Blob  objRef
Size  int64
Mode  os.FileMode
CRC32 uint32 // CRC-32 checksum; 0 if not known (old cache entries)
```
If you care to, I think we could avoid the requirement to track CRC-32s with some library updates. It turns out most zip readers (including go's archive/zip) never cross-check the local file header CRC-32 against the CRC-32 stored in the central directory, so we should be able to zero out the CRC-32 in the central directory and only need to lazily calculate it in a trailer per-file as readers iterate through real file contents.
Unfortunately readers need to guess at the range they'll find the central directory marker within, so e.g. archive/zip reads the last 1KiB upfront (if that fails its second guess is the last 65KiB), which will inevitably include some real file content regions, but that's a small enough amount of duplicate file reading that it seems like a good trade-off for a simpler interface.
That said, given the work we already do on the way into the git store, it seems very reasonable to avoid all that complexity and just track the CRC-32s. I'm probably leaning slightly towards tracking them in gomodfs as you've got it already, WDYT?
```go
}

// ZipFileEntry describes a file within a module's zip for synthetic zip construction.
type ZipFileEntry struct {
```
We could use synthzip.File directly, but I guess you didn't want to require the library as part of the interface definition? Which seems reasonable, just thinking aloud.
```go
data := make([]byte, archive.Size())
if _, err := archive.ReadAt(data, 0); err != nil && err != io.EOF {
```
```go
})
}

// getZipFileData returns the full zip file bytes for the given module version.
```
Have you seen any pathologically large module zips? I had kind of assumed we'd want to make some interface changes to support streaming rather than reading the whole thing into memory, but I don't have a good feel for the worst-case we see in practice.
```go
// Read the synthetic .zip file.
zipPath := "cache/download/go4.org/mem/@v/v0.0.0-20240501181205-ae6ca9944745.zip"
ctx := context.Background()
```
```go
}
}

func TestZipFile(t *testing.T) {
```
```go
}
}
}
// If zipcrc doesn't exist (old cache entries), CRC32 fields stay 0.
```
Maybe expand this comment to explain the implications downstream? IIUC we'll then error when the zip gets constructed and end up reporting the zip doesn't exist, so essentially the existing behaviour we have today?
```go
// Format: "<8-hex-crc32>\t<path>\n"
var hasCRC bool
for _, f := range data.Files {
	if f.CRC32() != 0 {
```
As this is accessing data from the original zip file, the only way this could be zero is if the CRC-32 appears in the data descriptor (trailing fields after the file content) instead of the file header. If we get zip files of that form, we'll fail to extract the CRC-32. Maybe we don't care because it's always the same tool creating the zips so we probably won't see zips of that form? But if we want to catch that, we'd have to write our own data descriptor parsing logic because unfortunately the standard library doesn't expose it AFAICT.
However, we are also about to read the entire file contents in tb.addFile down below, so could calculate it on the fly very cheaply and skip any complex zip parsing.
Anyway, all that to say, there are a few alternatives to consider here, and quite a lot of subtlety behind this check. At the least, probably worth documenting the choices in more detail with a comment.
```go
fmt.Fprintf(&crcBuf, "%08x\t%s\n", f.CRC32(), f.Path())
}
crcData := crcBuf.Bytes()
if err := tb.addFile("zipcrc", static(crcData), 0644); err != nil {
```
Would we get better git store deduplication across similar module versions if we had a zipcrc blob per path? Was the decision here intentional to reduce the number of blobs? I can see it would be a lot of extra blobs to do one CRC-32 per blob, but not sure what kind of impact that would have on performance.