gomodfs: implement synthetic .zip file serving via synthzip #26

Open
bradfitz wants to merge 1 commit into main from bradfitz/synthzip

Conversation


@bradfitz bradfitz commented Apr 8, 2026

Implement the $base/$module/@v/$version.zip endpoint from the goproxy
protocol, using github.com/tomhjp/synthzip to synthesize zip files
on-the-fly from individually-cached files without materializing the
entire archive in memory upfront.

synthzip requires Name, Size, and CRC32 for each file at construction
time (no I/O), then lazily reads file contents only when a byte range
is actually accessed.

Key changes:

  • store: Add CRC32 to PutFile interface, add ZipFileEntry type and
    GetZipFileEntries method to Store interface
  • gitstore: Store CRC32 data as a "zipcrc" blob in the git tree during
    PutModule (format: "<8-hex-crc32>\t<path>\n" per file). Load it back
    in newModHandle. Old cache entries without zipcrc get CRC32=0.
  • gomodfs: Add getZipArchive (builds synthzip.Archive from metadata),
    getZipFileData (materializes full zip), getZipFileSize (metadata only)
  • path: Recognize ".zip" extension in parsePath
  • All frontends (NFS, FUSE, WebDAV, WinFsp): Serve .zip files. Stat
    calls use getZipFileSize to avoid materializing the full zip.
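The "<8-hex-crc32>\t<path>\n" blob format can be sketched with a round-trip encoder and parser. The `entry` type and both function names here are illustrative, not the store's real types:

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// entry carries the minimal per-file metadata this sketch needs
// (the real store uses ZipFileEntry).
type entry struct {
	path string
	crc  uint32
}

// encodeZipcrc renders one "<8-hex-crc32>\t<path>\n" line per file,
// as described in the PR.
func encodeZipcrc(entries []entry) string {
	var b strings.Builder
	for _, e := range entries {
		fmt.Fprintf(&b, "%08x\t%s\n", e.crc, e.path)
	}
	return b.String()
}

// parseZipcrc inverts encodeZipcrc, returning a path-to-CRC map.
func parseZipcrc(blob string) (map[string]uint32, error) {
	m := make(map[string]uint32)
	sc := bufio.NewScanner(strings.NewReader(blob))
	for sc.Scan() {
		hexCRC, path, ok := strings.Cut(sc.Text(), "\t")
		if !ok {
			return nil, fmt.Errorf("malformed zipcrc line %q", sc.Text())
		}
		n, err := strconv.ParseUint(hexCRC, 16, 32)
		if err != nil {
			return nil, err
		}
		m[path] = uint32(n)
	}
	return m, sc.Err()
}

func main() {
	blob := encodeZipcrc([]entry{{"go.mod", 0xdeadbeef}, {"mem.go", 42}})
	fmt.Print(blob) // deadbeef<TAB>go.mod / 0000002a<TAB>mem.go
	m, err := parseZipcrc(blob)
	if err != nil || m["go.mod"] != 0xdeadbeef || m["mem.go"] != 42 {
		panic("round trip failed")
	}
}
```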
@bradfitz bradfitz requested a review from tomhjp April 8, 2026 04:27

@tomhjp tomhjp left a comment


Thanks! There are a few design decisions that are interesting to discuss/document, but none of them are blockers.

}

// Try to load CRC32 data from the zipcrc blob.
if crcObj, err := s.getObject(context.TODO(), modTree.String()+":zipcrc", "blob"); err == nil {

Why not context.Background()?

Blob objRef
Size int64
Mode os.FileMode
CRC32 uint32 // CRC-32 checksum; 0 if not known (old cache entries)

If you care to, I think we could avoid the requirement to track CRC-32s with some library updates. It turns out most zip readers (including Go's archive/zip) never cross-check the local file header CRC-32 against the CRC-32 stored in the central directory, so we should be able to zero out the CRC-32 in the central directory and only need to lazily calculate it in a trailer per-file as readers iterate through real file contents.

Unfortunately readers need to guess at the range they'll find the central directory marker within, so e.g. archive/zip reads the last 1KiB upfront (if that fails its second guess is the last 65KiB), which will inevitably include some real file content regions, but that's a small enough amount of duplicate file reading that it seems like a good trade-off for a simpler interface.

That said, given the work we already do on the way into the git store, it seems very reasonable to avoid all that complexity and just track the CRC-32s. I'm probably leaning slightly towards tracking them in gomodfs as you've got it already, WDYT?

Comment thread store/store.go
}

// ZipFileEntry describes a file within a module's zip for synthetic zip construction.
type ZipFileEntry struct {

We could use synthzip.File directly, but I guess you didn't want to require the library as part of the interface definition? Which seems reasonable, just thinking aloud.

Comment thread gomodfs.go
Comment on lines +493 to +494
data := make([]byte, archive.Size())
if _, err := archive.ReadAt(data, 0); err != nil && err != io.EOF {

or io.ReadAll if you prefer

Comment thread gomodfs.go
})
}

// getZipFileData returns the full zip file bytes for the given module version.

Have you seen any pathologically large module zips? I had kind of assumed we'd want to make some interface changes to support streaming rather than reading the whole thing into memory, but I don't have a good feel for the worst-case we see in practice.

Comment thread nfs_test.go

// Read the synthetic .zip file.
zipPath := "cache/download/go4.org/mem/@v/v0.0.0-20240501181205-ae6ca9944745.zip"
ctx := context.Background()

nit: t.Context()?

Comment thread nfs_test.go
}
}

func TestZipFile(t *testing.T) {

Nice test!

}
}
}
// If zipcrc doesn't exist (old cache entries), CRC32 fields stay 0.

Maybe expand this comment to explain the implications downstream? IIUC we'll then error when the zip gets constructed and end up reporting the zip doesn't exist, so essentially the existing behaviour we have today?

// Format: "<8-hex-crc32>\t<path>\n"
var hasCRC bool
for _, f := range data.Files {
if f.CRC32() != 0 {

As this is accessing data from the original zip file, the only way this could be zero is if the CRC-32 appears in the data descriptor (trailing fields after the file content) instead of the file header. If we get zip files of that form, we'll fail to extract the CRC-32. Maybe we don't care because it's always the same tool creating the zips so we probably won't see zips of that form? But if we want to catch that, we'd have to write our own data descriptor parsing logic because unfortunately the standard library doesn't expose it AFAICT.

However, we are also about to read the entire file contents in tb.addFile down below, so could calculate it on the fly very cheaply and skip any complex zip parsing.

Anyway, all that to say, there are a few alternatives to consider here, and quite a lot of subtlety behind this check. At the least, probably worth documenting the choices in more detail with a comment.

fmt.Fprintf(&crcBuf, "%08x\t%s\n", f.CRC32(), f.Path())
}
crcData := crcBuf.Bytes()
if err := tb.addFile("zipcrc", static(crcData), 0644); err != nil {

Would we get better git store deduplication across similar module versions if we had a zipcrc blob per path? Was the decision here intentional to reduce the number of blobs? I can see it would be a lot of extra blobs to do one CRC-32 per blob, but not sure what kind of impact that would have on performance.
