Skip to content

Detect binary files in diff with size delta and IsBinary flag#44

Merged
umputun merged 3 commits intoumputun:masterfrom
melonamin:feature/binary-file-detection
Apr 6, 2026
Merged

Detect binary files in diff with size delta and IsBinary flag#44
umputun merged 3 commits intoumputun:masterfrom
melonamin:feature/binary-file-detection

Conversation

@melonamin
Copy link
Copy Markdown
Contributor

@melonamin melonamin commented Apr 6, 2026

Summary

  • Detect binary files in unified diffs and display a placeholder instead of garbled content
  • Show human-readable size deltas for binary changes (e.g. (binary file: 1.0 KB → 2.0 KB))
  • Distinguish new/deleted/modified binary files via git diff --summary and diff header parsing

Rationale

Previously, binary files in diffs would be silently skipped. This is a common pain point when reviewing commits that touch images, compiled assets, or other non-text files.

The implementation adds an explicit IsBinary field to DiffLine rather than relying on string comparison against a placeholder constant. This decouples the detection signal from the display text, making the check robust against content enrichment (size info, new/deleted labels).

Header-based detection (new file mode / deleted file mode) ensures correct labeling even when git diff --stat output can't be parsed — for example, on systems with non-English git locale where tokens like Bin and bytes are localized.

Screenshot

screenshot-2026-04-06_15-27-03

Binary files in git diffs previously showed as blank (no content) because
ParseUnifiedDiff never found hunk headers for them. Now detects git's
"Binary files ... differ" marker and returns a placeholder enriched with
size info from git diff --stat (e.g. "(binary file: 1.0 KB → 2.0 KB)").
…diffs

Replace fragile string comparison (Content == BinaryPlaceholder) with an
explicit IsBinary bool on DiffLine. ParseUnifiedDiff now tracks "new file
mode" / "deleted file mode" from diff headers so the placeholder content
is correct even when git diff --stat fails (e.g. non-English locale).
Also clarifies locale assumptions in binaryStatRe and binaryFilesRe
comments and uses binaryChangeKind for accurate create/delete detection
when stat parsing succeeds.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds first-class binary diff detection/representation so binary file changes render as a single placeholder line (optionally enriched with a human-readable size delta), rather than being skipped or producing unreadable output.

Changes:

  • Introduces DiffLine.IsBinary and a shared BinaryPlaceholder constant to reliably mark binary placeholders independent of display text.
  • Enhances ParseUnifiedDiff to detect Binary files ... differ and to label new/deleted binary files based on diff headers.
  • Enriches binary placeholders in Git.FileDiff with size delta info parsed from git diff --stat, plus new/modified/deleted classification via git diff --summary, with added fixtures/tests.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
diff/diff.go Adds IsBinary, binary detection in unified diff parsing, and size-delta enrichment for binary diffs.
diff/fallback.go Marks disk-detected binary placeholders with IsBinary and uses the shared placeholder constant.
diff/diff_test.go Adds coverage for binary parsing, size formatting, and end-to-end binary FileDiff behavior.
diff/fallback_test.go Extends fallback binary test to assert IsBinary is set.
diff/testdata/binary.diff Adds a fixture representing a binary unified diff line emitted by git.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

return ParseUnifiedDiff(out)
lines, err := ParseUnifiedDiff(out)
if err != nil {
return nil, err
Copy link

Copilot AI Apr 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In FileDiff, ParseUnifiedDiff errors are returned without any additional context (e.g., which file/ref was being parsed). This makes failures harder to debug compared to other call sites that wrap errors with context; consider wrapping the error with the file/ref info ("parse diff for %s" etc.).

Suggested change
return nil, err
return nil, fmt.Errorf("parse diff for file %s (ref %q): %w", file, ref, err)

Copilot uses AI. Check for mistakes.
diff/diff.go Outdated
Comment on lines +189 to +199
func (g *Git) binaryChangeKind(ref, file string, staged bool) binaryChangeKind {
args := g.diffArgs(ref, staged)
args = append(args, "--summary", "--", file)

out, err := g.runGit(args...)
if err != nil {
return binaryChangeModified
}

return parseBinaryChangeKind(out)
}
Copy link

Copilot AI Apr 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

binaryChangeKind returns binaryChangeModified when git diff --summary fails. When git diff --stat succeeds but --summary fails, this can overwrite an already-correct "(new binary file)" / "(deleted binary file)" header-derived label with a misleading "(binary file: 0 B → N B)" description. Consider preserving the header-derived kind, or using a safer fallback (e.g., if summary fails, don't override the placeholder content, or infer kind from the existing placeholder metadata).

Copilot uses AI. Check for mistakes.
Comment on lines +25 to +26
// BinaryPlaceholder is the content used for binary file placeholders.
// ParseUnifiedDiff returns this when git reports "Binary files ... differ".
Copy link

Copilot AI Apr 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The BinaryPlaceholder comment says ParseUnifiedDiff returns this when git reports "Binary files ... differ", but ParseUnifiedDiff can also return "(new binary file)" / "(deleted binary file)" for the same git message when it sees "new file mode" / "deleted file mode". Updating this comment to reflect those cases would prevent future confusion.

Suggested change
// BinaryPlaceholder is the content used for binary file placeholders.
// ParseUnifiedDiff returns this when git reports "Binary files ... differ".
// BinaryPlaceholder is the generic content used for binary file placeholders.
// ParseUnifiedDiff may return this for "Binary files ... differ", or the more
// specific "(new binary file)" / "(deleted binary file)" when git also reports
// "new file mode" / "deleted file mode".

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Owner

@umputun umputun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clean implementation, tests pass, linter clean, no scope creep. good graceful degradation for non-English locales.

one thing to consider: the binary detection itself is ~10 lines. the size enrichment (1.0 KB → 2.0 KB) adds ~150 lines of code + 2 extra git commands per binary file. the size info is useful but I wonder if it's worth the complexity. smth to think about, not a blocker.

issues:

  1. parseBinaryStat, parseBinaryChangeKind, formatBinaryDesc, formatSize are standalone functions but only called from Git methods. should be methods on Git to match the rest of the codebase
  2. --stat and --summary run as separate git commands. could combine into git diff --stat --summary to save one process spawn per binary file
  3. no direct unit test for parseBinaryChangeKind, only tested through integration tests. the other helpers have table-driven tests, this one should too

@melonamin
Copy link
Copy Markdown
Contributor Author

the binary detection itself is ~10 lines. the size enrichment (1.0 KB → 2.0 KB) adds ~150 lines of code + 2 extra git commands per binary file. the size info is useful but I wonder if it's worth the complexity.

The use case I'm targeting with this feature - as a part of a PR agent optimizes assets size, I want to be able to quickly check it visually, as a part of PR review process. Comes from past experience, I think in this case the trade-off is fair.

Copy link
Copy Markdown
Owner

@umputun umputun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thx for the use case explanation and addressing all the points. LGTM.

one minor thing from Copilot that's worth considering (not a blocker): if git diff --summary fails but --stat succeeds, parseBinaryChangeKind defaults to modified which could overwrite a correct header-derived "new/deleted" label. low probability but worth a defensive check.

@umputun umputun merged commit 713438b into umputun:master Apr 6, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants