Detect binary files in diff with size delta and IsBinary flag#44
Conversation
Binary files in git diffs previously showed as blank (no content) because ParseUnifiedDiff never found hunk headers for them. Now detects git's "Binary files ... differ" marker and returns a placeholder enriched with size info from git diff --stat (e.g. "(binary file: 1.0 KB → 2.0 KB)").
…diffs Replace fragile string comparison (Content == BinaryPlaceholder) with an explicit IsBinary bool on DiffLine. ParseUnifiedDiff now tracks "new file mode" / "deleted file mode" from diff headers so the placeholder content is correct even when git diff --stat fails (e.g. non-English locale). Also clarifies locale assumptions in binaryStatRe and binaryFilesRe comments and uses binaryChangeKind for accurate create/delete detection when stat parsing succeeds.
There was a problem hiding this comment.
Pull request overview
Adds first-class binary diff detection/representation so binary file changes render as a single placeholder line (optionally enriched with a human-readable size delta), rather than being skipped or producing unreadable output.
Changes:
- Introduces
DiffLine.IsBinaryand a sharedBinaryPlaceholderconstant to reliably mark binary placeholders independent of display text. - Enhances
ParseUnifiedDiffto detectBinary files ... differand to label new/deleted binary files based on diff headers. - Enriches binary placeholders in
Git.FileDiffwith size delta info parsed fromgit diff --stat, plus new/modified/deleted classification viagit diff --summary, with added fixtures/tests.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| diff/diff.go | Adds IsBinary, binary detection in unified diff parsing, and size-delta enrichment for binary diffs. |
| diff/fallback.go | Marks disk-detected binary placeholders with IsBinary and uses the shared placeholder constant. |
| diff/diff_test.go | Adds coverage for binary parsing, size formatting, and end-to-end binary FileDiff behavior. |
| diff/fallback_test.go | Extends fallback binary test to assert IsBinary is set. |
| diff/testdata/binary.diff | Adds a fixture representing a binary unified diff line emitted by git. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| return ParseUnifiedDiff(out) | ||
| lines, err := ParseUnifiedDiff(out) | ||
| if err != nil { | ||
| return nil, err |
There was a problem hiding this comment.
In FileDiff, ParseUnifiedDiff errors are returned without any additional context (e.g., which file/ref was being parsed). This makes failures harder to debug compared to other call sites that wrap errors with context; consider wrapping the error with the file/ref info ("parse diff for %s" etc.).
| return nil, err | |
| return nil, fmt.Errorf("parse diff for file %s (ref %q): %w", file, ref, err) |
diff/diff.go
Outdated
| func (g *Git) binaryChangeKind(ref, file string, staged bool) binaryChangeKind { | ||
| args := g.diffArgs(ref, staged) | ||
| args = append(args, "--summary", "--", file) | ||
|
|
||
| out, err := g.runGit(args...) | ||
| if err != nil { | ||
| return binaryChangeModified | ||
| } | ||
|
|
||
| return parseBinaryChangeKind(out) | ||
| } |
There was a problem hiding this comment.
binaryChangeKind returns binaryChangeModified when git diff --summary fails. When git diff --stat succeeds but --summary fails, this can overwrite an already-correct "(new binary file)" / "(deleted binary file)" header-derived label with a misleading "(binary file: 0 B → N B)" description. Consider preserving the header-derived kind, or using a safer fallback (e.g., if summary fails, don't override the placeholder content, or infer kind from the existing placeholder metadata).
| // BinaryPlaceholder is the content used for binary file placeholders. | ||
| // ParseUnifiedDiff returns this when git reports "Binary files ... differ". |
There was a problem hiding this comment.
The BinaryPlaceholder comment says ParseUnifiedDiff returns this when git reports "Binary files ... differ", but ParseUnifiedDiff can also return "(new binary file)" / "(deleted binary file)" for the same git message when it sees "new file mode" / "deleted file mode". Updating this comment to reflect those cases would prevent future confusion.
| // BinaryPlaceholder is the content used for binary file placeholders. | |
| // ParseUnifiedDiff returns this when git reports "Binary files ... differ". | |
| // BinaryPlaceholder is the generic content used for binary file placeholders. | |
| // ParseUnifiedDiff may return this for "Binary files ... differ", or the more | |
| // specific "(new binary file)" / "(deleted binary file)" when git also reports | |
| // "new file mode" / "deleted file mode". |
umputun
left a comment
There was a problem hiding this comment.
clean implementation, tests pass, linter clean, no scope creep. good graceful degradation for non-English locales.
one thing to consider: the binary detection itself is ~10 lines. the size enrichment (1.0 KB → 2.0 KB) adds ~150 lines of code + 2 extra git commands per binary file. the size info is useful but I wonder if it's worth the complexity. smth to think about, not a blocker.
issues:
parseBinaryStat,parseBinaryChangeKind,formatBinaryDesc,formatSizeare standalone functions but only called fromGitmethods. should be methods onGitto match the rest of the codebase--statand--summaryrun as separate git commands. could combine intogit diff --stat --summaryto save one process spawn per binary file- no direct unit test for
parseBinaryChangeKind, only tested through integration tests. the other helpers have table-driven tests, this one should too
The use case I'm targeting with this feature - as a part of a PR agent optimizes assets size, I want to be able to quickly check it visually, as a part of PR review process. Comes from past experience, I think in this case the trade-off is fair. |
umputun
left a comment
There was a problem hiding this comment.
thx for the use case explanation and addressing all the points. LGTM.
one minor thing from Copilot that's worth considering (not a blocker): if git diff --summary fails but --stat succeeds, parseBinaryChangeKind defaults to modified which could overwrite a correct header-derived "new/deleted" label. low probability but worth a defensive check.
Summary
(binary file: 1.0 KB → 2.0 KB))git diff --summaryand diff header parsingRationale
Previously, binary files in diffs would be silently skipped. This is a common pain point when reviewing commits that touch images, compiled assets, or other non-text files.
The implementation adds an explicit
IsBinaryfield toDiffLinerather than relying on string comparison against a placeholder constant. This decouples the detection signal from the display text, making the check robust against content enrichment (size info, new/deleted labels).Header-based detection (
new file mode/deleted file mode) ensures correct labeling even whengit diff --statoutput can't be parsed — for example, on systems with non-English git locale where tokens likeBinandbytesare localized.Screenshot